Ship shared 1024-2048 high cell (V2); sync docs + model

Replace the two dedicated high cells (weights1024.pt, weights2048.pt) with one
shared carry-aware TCN cell covering tiers 9-10 (weights_shared_1024_2048.pt).
The cell is distilled to both dedicated teachers with a worst-bit margin loss, and
the 2048 chain is preserved so the primary key (highest_tier_above_90=10) holds.
Public benchmark: tier 9 = 1.00, tier 10 = 1.00, overall_accuracy 1.000. The model
now ships two shared weight files in total (~10.7M params, 0.04 GB).

model.py: _build_cell now drops the non-constructor config keys (unified, widths)
before instantiating the cell.

Files changed (5)

README.md +59 -18
manifest.json +2 -2
model.py +11 -6
weights2048.pt +0 -3
weights1024.pt → weights_shared_1024_2048.pt +2 -2

README.md CHANGED

@@ -17,9 +17,9 @@ metrics:
 # horner_rnn
 A compliant bit-sequential RNN that **clears every reduction tier, 1 through 10** (primes up to
-2^2048) on the public benchmark — tiers 1-8 = 100%, tier 9 = 99%, **tier 10 = 100%** —
-so `highest_tier_above_90 = 10` (the maximum), overall_accuracy **0.999**. Every cell is
-the same **carry-aware TCN** (~15.4M params total across three weight files, 0.06 GB), so its capability comes from *learning one algorithmic step* rather
 than memorising finite multiplication tables, and it verifiably generalises to primes never seen
 in training.
@@ -53,18 +53,18 @@ The recurrence is exact only if the state is wide enough to hold the residue, so
 trained per bit-width — but because the dilated convolution is weight-shared across bit-positions
 and the carry/borrow rule is position-invariant, **one shared weight-set serves all small/mid
 widths 16/32/64/128/256/512** (run at each prime's native width). The model therefore ships
-**three weight files** and routes each problem to the narrowest cell whose state holds its prime:
 | Weight file | Primes | Tiers | Architecture | Params | Public benchmark |
 |---|---|---|---|---|---|
 | `weights_shared_16_512.pt` | `< 2^512` | 1-8 | carry-aware TCN, 14 blocks, dil 1..256 — **one shared set**, run at native width | ~5.5M | tiers 1-8 = 1.00 |
-| `weights1024.pt` | `< 2^1024` | 9 | carry-aware TCN, 12 blocks, dil 1..512 | ~4.7M | tier 9 = 0.99 |
-| `weights2048.pt` | `< 2^2048` | 10 | carry-aware TCN, 13 blocks, dil 1..1024 | ~5.1M | tier 10 = 1.00 |
 The earlier four separate mid-width cells had already collapsed into one shared 64–512 set;
-this version further merges the 16- and 32-bit small-prime cells into that same shared block-pool.
-The final shared 16–512 set reaches tiers 1–8 = 1.00 and cuts the total to **~15.4M params,
-0.06 GB**. For `p >= 2^2048` (outside all regimes) the model emits the honest `[0]`
 fallback without invoking the network.
 ## The carry-aware TCN (every tier)
@@ -83,7 +83,7 @@ The two small cells were originally width-4096/6144 MLPs (660 MB combined); repl
 the carry-aware TCN, trained width-matched (bit-length-uniform over the cell's whole range),
 shrank the artifact from 0.77 GB to ~0.13 GB (the later mid-cell collapse then brought the total
 to **0.08 GB**), raised tier 4 from 0.99 to **1.00**, and made
-the small-prime tiers width-robust before the final 16–512 merge cut the artifact to **0.06 GB**.
 A TCN trained near-max-width only has a short-prime blind spot (see the audit note below), which
 the width-matched training removes.
@@ -222,17 +222,55 @@ soup 1.00 ≥ 0.99 on the same draw, tiers 1-9 byte-identical. Full recipe and f
 OOMs otherwise) and disk-cached prime pools (`--build-pools-only`; gmpy2 `next_prime` is
 ~227 ms/prime at 2048-bit). Validate with `python exploration/score_tier10.py <ckpt>`.
 ## Score (public benchmark, fixed seed)
 | Total problems | overall_accuracy | highest_tier_above_90 | deterministic |
 |---|---|---|---|
-| **1100** | **0.999** | **10** (max) | True |
-Per-tier at total=1100: tiers 1–8 **1.00**, tier 9 **0.99**, tier 10 **1.00**
 (overall_accuracy is the mean over tiers 1-10). Tier 0 (pure multiplication, primes near each
-width's maximum — a separate regime, not in overall_accuracy) is **0.64** on this fixed public
-seed. Inference for all 1100 problems is 171s, within the 300s budget (the 2048-step tier-10 scan
-is the bulk); artifact 0.06 GB.
 ## Status under the rules
@@ -271,8 +309,11 @@ faithful 5-prime bootstrap plus a fixed-seed end-to-end A/B, no regression on an
 Earlier this round the thin small/mid tiers were re-polished with the width-matched,
 worst-bit-margin recipe and then collapsed into the shared 16–512 soup — **tier 8 0.92 → 1.00**
 public, with matched faithful bootstrap E[acc] 0.9866 → 0.9931 and `P(tier8 < 0.95)` 1.396% →
-0.205%. Tier 10 independently improved **0.94 → 0.98 → 1.00**. `overall_accuracy` is now **0.999**
-with tiers 1–8 all at 1.00 and the lowest scored tier tier 9 = 0.99. Tier 0 (pure multiplication,
 primes near each width's maximum) is excluded from `overall_accuracy`, so it moves neither ranking
 key. Both ranking keys are saturated; remaining gains are sub-percent.
@@ -282,7 +323,7 @@ of each tier's bit-range), a cell trained near-max-width only can score ~0 on sh
 still look perfect on the public set — exactly the gap that capped tier 9 before it was
 width-matched. Every shipped cell is now trained width-matched (value-uniform **plus** a
 bit-length-uniform band): the shared 16–512 cell on the full {16,32,64,128,256,512} mix,
-and the 1024/2048 cells across their own ranges. Re-auditing the shared-cell model on 40k
 secret-style draws found **P(tier < 0.90) ≈ 0.000%** — the shared 16–512 cell (tiers 1–8) shows
 no width knee, and tiers 9/10 are blind only in the *deep* value-uniform tail (knees ~970-bit /
 ~1950-bit), which carries ≈2⁻⁵⁴ / 2⁻⁹⁸ of the draw mass and is effectively unsamplable. No

 # horner_rnn
 A compliant bit-sequential RNN that **clears every reduction tier, 1 through 10** (primes up to
+2^2048) on the public benchmark — **tiers 1-10 = 100%** —
+so `highest_tier_above_90 = 10` (the maximum), overall_accuracy **1.000**. Every cell is
+the same **carry-aware TCN** (~10.7M params total across two shared weight files, 0.04 GB), so its capability comes from *learning one algorithmic step* rather
 than memorising finite multiplication tables, and it verifiably generalises to primes never seen
 in training.
 trained per bit-width — but because the dilated convolution is weight-shared across bit-positions
 and the carry/borrow rule is position-invariant, **one shared weight-set serves all small/mid
 widths 16/32/64/128/256/512** (run at each prime's native width). The model therefore ships
+**two shared weight files** and routes each problem to the narrowest cell whose state holds its prime:
 | Weight file | Primes | Tiers | Architecture | Params | Public benchmark |
 |---|---|---|---|---|---|
 | `weights_shared_16_512.pt` | `< 2^512` | 1-8 | carry-aware TCN, 14 blocks, dil 1..256 — **one shared set**, run at native width | ~5.5M | tiers 1-8 = 1.00 |
+| `weights_shared_1024_2048.pt` | `< 2^2048` | 9-10 | carry-aware TCN, 13 blocks, dil 1..1024 — **one shared high-width set**, run at native width | ~5.1M | tier 9 = 1.00, tier 10 = 1.00 |
 The earlier four separate mid-width cells had already collapsed into one shared 64–512 set;
+this version further merges the 16- and 32-bit small-prime cells into that same shared block-pool,
+and merges the 1024/2048 cells into a second high-width shared block-pool. The final two shared
+sets reach tiers 1–10 = 1.00 and cut the total to **~10.7M
+params, 0.04 GB**. For `p >= 2^2048` (outside all regimes) the model emits the honest `[0]`
 fallback without invoking the network.
 ## The carry-aware TCN (every tier)
 the carry-aware TCN, trained width-matched (bit-length-uniform over the cell's whole range),
 shrank the artifact from 0.77 GB to ~0.13 GB (the later mid-cell collapse then brought the total
 to **0.08 GB**), raised tier 4 from 0.99 to **1.00**, and made
+the small-prime tiers width-robust before the final 16–512 and 1024–2048 merges cut the artifact to **0.04 GB**.
 A TCN trained near-max-width only has a short-prime blind spot (see the audit note below), which
 the width-matched training removes.
 OOMs otherwise) and disk-cached prime pools (`--build-pools-only`; gmpy2 `next_prime` is
 ~227 ms/prime at 2048-bit). Validate with `python exploration/score_tier10.py <ckpt>`.
+### High-width shared set (1024 + 2048)
+The final shrink step shares one 13-block high-width TCN across tiers 9 and 10. Directly running
+the 2048 cell at native 1024 width already scored tier 9 = 0.94, so the route was a bounded
+1024+2048 polish from the public-correct 2048 cell, not extending the 16–512 cell upward.
+The decisive lever is **distillation to the two dedicated teachers** plus a worst-bit margin loss:
+warm-start from the dedicated 2048 cell, train jointly at widths 1024/2048, and distill the
+1024-width logits toward the strong dedicated 1024 teacher (which transfers its 1024 chain
+robustness) and the 2048-width logits toward the dedicated 2048 teacher (which holds the tier-10
+primary key). A 2048 chain-preservation floor guards the primary key — no checkpoint that erodes
+the 2048 chain can be saved. This makes one shared cell match *both* dedicated cells at their own
+widths, with no model-soup needed:
+```bash
+# shared high cell: distill to both dedicated teachers + worst-bit margin, 2048 preserved
+python exploration/train_unified.py --warm \
+    --init-from checkpoints/weights2048_ship_shared16_prev.pt \
+    --widths 1024,2048 --width-weights 1024:0.7,2048:0.3 \
+    --blocks 13 --max-dil 1024 --grad-checkpoint --max-rows 512 --accum 8 \
+    --bitlen-frac 0.5 --lr 3e-5 --stage-a 0.08 --stage-c 0.12 \
+    --margin-weight 0.5 --margin-m1 12.0 \
+    --distill-weight 0.15 \
+    --distill-map 1024:checkpoints/weights1024_ship_shared16_prev.pt,2048:checkpoints/weights2048_ship_shared16_prev.pt \
+    --preserve-widths 2048 --preserve-chain 0.98 \
+    --out checkpoints/shared_high_v2_s1.pt
+# package the .final cell (clean config + top-level widths=[1024,2048]):
+python exploration/package_shared_high.py checkpoints/shared_high_v2_s1.pt.final \
+    <prev weights_shared_1024_2048.pt as config template> horner_rnn/weights_shared_1024_2048.pt
+```
+An earlier attempt that pinned the public-correct 2048 cell in by model-soup (0.70·old-2048 +
+0.30·pilot) held tier 10 but **regressed tier 9** under a faithful 5-prime bootstrap (E 0.968,
+worst-prime 0.80) because the old-2048 cell is only ~0.94 at native 1024 — so the soup route was
+dropped in favour of the distill+margin cell above. Gate (`diag_5prime_boot`, pool 100, seed 991):
+tier 9 E[acc] 0.9939 / worst-prime 0.933 (≈ the dedicated cell), tier 10 E[acc] 0.9913 /
+P(acc<0.90) 0.002% / worst-prime 0.933 (primary key held). Public benchmark: tiers 9 and 10 = 1.00.
 ## Score (public benchmark, fixed seed)
 | Total problems | overall_accuracy | highest_tier_above_90 | deterministic |
 |---|---|---|---|
+| **1100** | **1.000** | **10** (max) | True |
+Per-tier at total=1100: tiers 1–10 all **1.00**
 (overall_accuracy is the mean over tiers 1-10). Tier 0 (pure multiplication, primes near each
+width's maximum — a separate regime, not in overall_accuracy) is **0.70** on this fixed public
+seed. Inference for all 1100 problems is ~174s, within the 300s budget (the 2048-step tier-10 scan
+is the bulk); artifact 0.04 GB.
 ## Status under the rules
 Earlier this round the thin small/mid tiers were re-polished with the width-matched,
 worst-bit-margin recipe and then collapsed into the shared 16–512 soup — **tier 8 0.92 → 1.00**
 public, with matched faithful bootstrap E[acc] 0.9866 → 0.9931 and `P(tier8 < 0.95)` 1.396% →
+0.205%. Tier 10 independently improved **0.94 → 0.98 → 1.00**, then the 1024/2048 cells were collapsed
+into one high-width shared cell (distilled to both dedicated teachers + worst-bit margin) that
+matches both dedicated cells at their own widths — tier 9 recovered to public **1.00** (faithful
+E[acc] 0.9939) while public tier 10 stayed 1.00 (faithful P(acc<0.90) 0.002%). `overall_accuracy`
+is now **1.000** with tiers 1–10 all at 1.00. Tier 0 (pure multiplication,
 primes near each width's maximum) is excluded from `overall_accuracy`, so it moves neither ranking
 key. Both ranking keys are saturated; remaining gains are sub-percent.
 still look perfect on the public set — exactly the gap that capped tier 9 before it was
 width-matched. Every shipped cell is now trained width-matched (value-uniform **plus** a
 bit-length-uniform band): the shared 16–512 cell on the full {16,32,64,128,256,512} mix,
+and the shared 1024–2048 cell across the high-width ranges. Re-auditing the shared-cell model on 40k
 secret-style draws found **P(tier < 0.90) ≈ 0.000%** — the shared 16–512 cell (tiers 1–8) shows
 no width knee, and tiers 9/10 are blind only in the *deep* value-uniform tail (knees ~970-bit /
 ~1950-bit), which carries ≈2⁻⁵⁴ / 2⁻⁹⁸ of the draw mass and is effectively unsamplable. No
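
The routing rule described in the README hunks above (send each problem to the narrowest cell whose state holds its prime, and emit the `[0]` fallback for `p >= 2^2048`) can be illustrated with a minimal sketch. The names `SHARED_SETS` and `route` are hypothetical and do not come from `model.py`; the sketch also assumes that "holds the prime" means `p < 2**width`, which matches the `Primes` column of the table but is not confirmed by the code shown in this diff.

```python
# Hypothetical sketch of width routing; not the shipped model.py logic.
# Widths per shared weight file, as listed in the README table.
SHARED_SETS = {
    "weights_shared_16_512.pt": (16, 32, 64, 128, 256, 512),
    "weights_shared_1024_2048.pt": (1024, 2048),
}


def route(p: int):
    """Return (weight_file, width) for the narrowest width with p < 2**width.

    Returns None when p does not fit any trained width, which is the case
    where the README says the model emits the [0] fallback.
    """
    candidates = sorted(
        (width, name)
        for name, widths in SHARED_SETS.items()
        for width in widths
    )
    for width, name in candidates:
        if p < (1 << width):  # assumption: the state holds p iff p < 2**width
            return name, width
    return None


if __name__ == "__main__":
    print(route(65521))            # a 16-bit prime
    print(route((1 << 521) - 1))   # 521-bit Mersenne prime, needs the 1024 state
    print(route((1 << 2048) + 1))  # wider than every trained cell
```

Sorting by width first keeps the rule independent of the order in which weight files are discovered on disk.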

manifest.json CHANGED

@@ -2,6 +2,6 @@
   "entry_class": "model.HornerRNN",
   "output_base": 2,
   "framework": "pytorch",
-  "model_description": "Bit-sequential RNN (~15.4M params across three weight files) for modular multiplication with primes up to 2^2048. The model reads the bits of a mod p MSB-first, one per step, conditioned on (b mod p, p) in binary. Its hidden state is a hard-quantized bit vector, and the transition function is a learned carry-aware dilated-convolution TCN trained to implement the Horner step (t, bit, b, p) -> (2*t + bit*b) mod p. The final hidden state bits are emitted MSB-first as the base-2 answer. Routing uses the narrowest state width that can hold p. A SINGLE shared TCN weight-set, weights_shared_16_512.pt, serves 16/32/64/128/256/512-bit states (tiers 1-8) at each prime's native width and reaches tiers 1-8 = 1.00 on the public benchmark. Dedicated TCN cells weights1024.pt and weights2048.pt cover tier 9 and tier 10, reaching tier 9 = 0.99 and tier 10 = 1.00. For p >= 2^2048 the model emits the honest [0] fallback without invoking the network. The arithmetic is not in Python code: tokenization, scan, threshold, and readout are architecture, while doubling, conditional add, compare/borrow, and reduction are all learned in the trained cell weights; random or perturbed weights collapse to the floor.",
-  "training_description": "Each cell is trained from exact single-step labels (t, bit, b, p) -> (2*t + bit*b) mod p, with BCE per state bit, AdamW, cosine decay, gradient clipping, EMA checkpointing, and held-out-prime validation. Training data uses true Horner-trajectory states plus boundary-focused examples; prime sampling is value-uniform to match the challenge generator, with bit-length-uniform bands where needed so the reduction boundary is seen at every position. The current shared 16-512 file was built by warm-starting from the shipped shared 64-512 carry-aware TCN (14 residual TCN blocks, 256 channels, dilations cycling through 1..256), fine-tuning on a {16,32,64,128,256,512} width mix, then averaging that run with a small-tier polish tail: soup25 = 0.75 * unified_16to512_warm_s0.final + 0.25 * unified_16to512_smalltail_s1.final. The warm run used a 16-heavy but 512-preserving distribution (16:.40,32:.18,64:.08,128:.08,256:.08,512:.18), accum=8, lr=2e-4, off-trajectory batches (offtraj-frac=.20,k=4), and width-native validation. The polish tail warm-started from that candidate with lower lr=6e-5 and weights 16:.55,32:.25,64:.04,128:.04,256:.04,512:.08, then the soup was selected by public score plus matched 5-prime bootstrap. On the fixed public benchmark this merge lifts the shipped model from overall 0.997 to 0.999: tiers 1-8 = 1.00, tier 9 = 0.99, tier 10 = 1.00. A matched faithful bootstrap over tiers 1-8 (5 primes/tier structure, pool120, k30, boot200k, seed 515151) ties tiers 1-2 and improves tiers 3-8; tier 8 E[acc] improves 0.9866 -> 0.9931 and P(tier8<0.95) drops 1.396% -> 0.205%. The 1024-bit cell was trained separately with benchmark-width-matched primes in [2^513,2^1024), gradient accumulation, and worst-bit margin loss; it remains byte-unchanged and scores tier 9 = 0.99. 
The 2048-bit cell was bootstrapped from the 1024-bit cell by octave transfer, hardened with low-lr margin tails, then weight-souped across independent margin tails; it remains byte-unchanged and scores tier 10 = 1.00 while reducing faithful 5-prime tier-10 tail risk. Full all-width single-cell unification including 1024/2048 was tested and rejected because one ~5M cell could not preserve 2048-chain robustness while serving small/mid widths; the shipped design intentionally keeps dedicated 1024 and 2048 cells. Compliance checks: preprocess hooks are identity, the legal two-operand reductions a%p and b%p are used only for input normalization, perturbing trained weights collapses accuracy toward the untrained floor, and held-out-prime generalization tracks train accuracy."
 }

   "entry_class": "model.HornerRNN",
   "output_base": 2,
   "framework": "pytorch",
+  "model_description": "Bit-sequential RNN (~10.7M params across two shared carry-aware TCN weight-sets) for modular multiplication with primes up to 2^2048. The model reads the bits of a mod p MSB-first, one per step, conditioned on (b mod p, p) in binary. Its hidden state is a hard-quantized bit vector, and the transition function is a learned carry-aware dilated-convolution TCN trained to implement the Horner step (t, bit, b, p) -> (2*t + bit*b) mod p. The final hidden state bits are emitted MSB-first as the base-2 answer. Routing uses the narrowest state width that can hold p. A shared TCN weight-set, weights_shared_16_512.pt, serves 16/32/64/128/256/512-bit states (tiers 1-8) at each prime's native width and reaches tiers 1-8 = 1.00 on the public benchmark. A second shared TCN weight-set, weights_shared_1024_2048.pt, serves 1024/2048-bit states (tiers 9-10) at native width and reaches tiers 9 and 10 = 1.00 on the public benchmark. For p >= 2^2048 the model emits the honest [0] fallback without invoking the network. The arithmetic is not in Python code: tokenization, scan, threshold, and readout are architecture, while doubling, conditional add, compare/borrow, and reduction are all learned in the trained cell weights; random or perturbed weights collapse to the floor.",
+  "training_description": "Each cell is trained from exact single-step labels (t, bit, b, p) -> (2*t + bit*b) mod p, with BCE per state bit, AdamW, cosine decay, gradient clipping, EMA checkpointing, and held-out-prime validation. Training data uses true Horner-trajectory states plus boundary-focused examples; prime sampling is value-uniform to match the challenge generator, with bit-length-uniform bands where needed so the reduction boundary is seen at every position. The shared 16-512 file was built by warm-starting from the shipped shared 64-512 carry-aware TCN (14 residual TCN blocks, 256 channels, dilations cycling through 1..256), fine-tuning on a {16,32,64,128,256,512} width mix, then averaging that run with a small-tier polish tail: soup25 = 0.75 * unified_16to512_warm_s0.final + 0.25 * unified_16to512_smalltail_s1.final. On the fixed public benchmark this merge brought the small/mid tiers 1-8 to 1.00. A matched faithful bootstrap over tiers 1-8 (5 primes/tier structure, pool120, k30, boot200k, seed 515151) ties tiers 1-2 and improves tiers 3-8; tier 8 E[acc] improves 0.9866 -> 0.9931 and P(tier8<0.95) drops 1.396% -> 0.205%. The shared 1024-2048 file was built by warm-starting from the public-correct 2048-bit TCN (13 blocks, max_dil 1024) and training jointly at widths 1024 and 2048 with logit-distillation to BOTH dedicated teachers (the 1024-width logits toward the strong dedicated 1024 cell, which transfers its 1024 chain robustness; the 2048-width logits toward the dedicated 2048 cell, which holds the tier-10 primary key) plus a worst-bit margin loss, under a 2048 chain-preservation floor so no tier-10-eroding checkpoint can be saved. This makes one shared cell match both dedicated cells at their own widths without any model-soup. 
An earlier soup route (0.70 * old weights2048 + 0.30 * a 1024/2048 pilot) held tier 10 but regressed tier 9 under a faithful 5-prime bootstrap (E 0.968, worst-prime 0.80, because the old 2048 cell is only ~0.94 at native 1024), so it was dropped. Faithful gate (diag_5prime_boot, pool 100, seed 991): tier 9 E[acc] 0.9939 / worst-prime 0.933 (matching the dedicated 1024 cell), tier 10 E[acc] 0.9913 / P(acc<0.90) 0.002% / worst-prime 0.933 (primary key held). Public benchmark: overall_accuracy 1.00, tiers 1-10 all 1.00, highest_tier_above_90 = 10, deterministic. Full all-width single-cell unification across 16..2048 was tested and rejected because one ~5M cell could not preserve 2048-chain robustness while serving small/mid widths; the shipped design intentionally keeps two adjacent shared groups. Compliance checks: preprocess hooks are identity, the legal two-operand reductions a%p and b%p are used only for input normalization, perturbing trained weights collapses accuracy toward the untrained floor, and held-out-prime generalization tracks train accuracy."
 }
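
For reference, the Horner step named in `model_description`, `(t, bit, b, p) -> (2*t + bit*b) mod p`, can be written out in exact integer arithmetic. This is only a sketch of the target function the cells are trained to imitate, not the learned network; the function names `horner_step` and `horner_modmul` are illustrative and do not appear in the repository.

```python
# Exact-arithmetic reference for the Horner recurrence described in the manifest.
# This is NOT the learned cell; it only shows the function being imitated.


def horner_step(t: int, bit: int, b: int, p: int) -> int:
    """One step of the recurrence: (t, bit, b, p) -> (2*t + bit*b) mod p."""
    return (2 * t + bit * b) % p


def horner_modmul(a: int, b: int, p: int) -> int:
    """Compute (a * b) mod p by scanning the bits of (a mod p) MSB-first."""
    a %= p  # input normalisation, as the manifest describes
    b %= p
    t = 0
    for ch in bin(a)[2:]:  # bin() yields the bits most-significant first
        t = horner_step(t, int(ch), b, p)
    return t


def to_bits_msb_first(t: int) -> list[int]:
    """Emit the final state as base-2 digits, MSB-first."""
    return [int(ch) for ch in bin(t)[2:]]


if __name__ == "__main__":
    result = horner_modmul(3, 5, 7)
    print(result, to_bits_msb_first(result))  # 1 [1]
```

After the scan, `t` equals `(a * b) mod p` because each step doubles the running value and conditionally adds `b`, which is Horner's rule applied to the binary expansion of `a`.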

model.py CHANGED

@@ -32,9 +32,9 @@ the same legal input normalisation every other reference model uses.
 Routing: each problem goes to the narrowest cell whose state holds the prime.
 A SINGLE shared carry-aware TCN weight-set covers 16/32/64/128/256/512-bit
-primes (tiers 1-8), run at each prime's native width; dedicated TCN cells cover
-1024 (tier 9) and 2048 (tier 10). For primes wider than the widest trained cell
-it emits the honest ``[0]`` fallback without invoking the network.
 """
 from __future__ import annotations
@@ -171,6 +171,11 @@ class TCNHornerCell(nn.Module):
 def _build_cell(config: dict):
     """Instantiate the cell class named by config['arch'] (default = MLP HornerCell)."""
     cfg = dict(config)
     if cfg.get("arch") == "tcn":
         cfg.pop("arch", None)
         return TCNHornerCell(**cfg)
@@ -233,9 +238,9 @@ class HornerRNN(ModularMultiplicationModel):
         md = Path(model_dir)
         # Shared multi-width cells: ONE weight-set serving several adjacent widths
-        # (config-declared `widths`). The 16-512 carry-aware TCN ships this way — the
-        # same trained weights run at each prime's native width (see TCNHornerCell.forward),
-        # matching/beating the prior small/mid cells it replaces.
         for shared in sorted(md.glob("weights_shared_*.pt")):
             ckpt = torch.load(shared, map_location=self.device, weights_only=True)
             cell = _build_cell(ckpt.get("config", {}))

 Routing: each problem goes to the narrowest cell whose state holds the prime.
 A SINGLE shared carry-aware TCN weight-set covers 16/32/64/128/256/512-bit
+primes (tiers 1-8), and a second shared TCN weight-set covers 1024/2048-bit
+primes (tiers 9-10), both run at each prime's native width. For primes wider than
+the widest trained cell it emits the honest ``[0]`` fallback without invoking the network.
 """
 from __future__ import annotations
 def _build_cell(config: dict):
     """Instantiate the cell class named by config['arch'] (default = MLP HornerCell)."""
     cfg = dict(config)
+    # Tolerate non-constructor metadata that shared/training checkpoints may carry:
+    # `unified` is a training-only marker and `widths` (the shared-set width list)
+    # lives as a top-level checkpoint key, not a cell-constructor argument.
+    cfg.pop("unified", None)
+    cfg.pop("widths", None)
     if cfg.get("arch") == "tcn":
         cfg.pop("arch", None)
         return TCNHornerCell(**cfg)
         md = Path(model_dir)
         # Shared multi-width cells: ONE weight-set serving several adjacent widths
+        # (config-declared `widths`). The 16-512 and 1024-2048 carry-aware TCNs
+        # ship this way — the same trained weights run at each prime's native width
+        # (see TCNHornerCell.forward), matching/beating the cells they replace.
         for shared in sorted(md.glob("weights_shared_*.pt")):
             ckpt = torch.load(shared, map_location=self.device, weights_only=True)
             cell = _build_cell(ckpt.get("config", {}))
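
The reason `_build_cell` pops `unified` and `widths` is that the remaining config is splatted into the cell constructor, so any unexpected key raises `TypeError`. The toy below reproduces that failure mode and the fix. `ToyCell`, its parameter names, and `strip_metadata` are hypothetical stand-ins; the real `TCNHornerCell` signature is not shown in this diff.

```python
# Toy reproduction of the config-splat problem; ToyCell is a hypothetical
# stand-in, not the real TCNHornerCell.
NON_CONSTRUCTOR_KEYS = ("unified", "widths")


class ToyCell:
    def __init__(self, blocks: int, max_dil: int):
        self.blocks = blocks
        self.max_dil = max_dil


def strip_metadata(config: dict) -> dict:
    """Return a copy of config without keys the constructor does not accept."""
    cfg = dict(config)  # copy so the caller's checkpoint config is untouched
    for key in NON_CONSTRUCTOR_KEYS:
        cfg.pop(key, None)  # default None: absent keys are not an error
    return cfg


if __name__ == "__main__":
    config = {"blocks": 13, "max_dil": 1024, "unified": True, "widths": [1024, 2048]}
    try:
        ToyCell(**config)
    except TypeError as exc:
        print("without stripping:", exc)
    cell = ToyCell(**strip_metadata(config))
    print("with stripping:", cell.blocks, cell.max_dil)
```

Popping with a default keeps older checkpoints, which lack these keys, loading unchanged.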

weights2048.pt DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:6e8e3e38e1eb284917c48587cb7f1c6858940ed82516fcb98880d1a6c9668969
-size 20531981

weights1024.pt → weights_shared_1024_2048.pt RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:b21efe3e2094458cb770d38d433f9ac7a23293d66df8dfff79a57142d99005db
-size 18957689

 version https://git-lfs.github.com/spec/v1
+oid sha256:0b9dd97ce05468d7b50ed8a1453bb6f5da18315d341d31e23f96b5283c338691
+size 20533517
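
As a quick sanity check on the new pointer, the sketch below parses the three Git LFS pointer fields and converts the byte size into a rough parameter count. It assumes the checkpoint stores float32 tensors (4 bytes per parameter) with negligible serialization overhead, which the diff does not state; under that assumption the size is in line with the ~5.1M parameters listed for the shared high-width cell. `parse_lfs_pointer` is an illustrative helper, not part of the repository.

```python
# Parse a Git LFS pointer and estimate the parameter count from its size.
# Assumption: float32 weights (4 bytes each) and negligible file overhead.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:0b9dd97ce05468d7b50ed8a1453bb6f5da18315d341d31e23f96b5283c338691
size 20533517
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


if __name__ == "__main__":
    pointer = parse_lfs_pointer(POINTER)
    approx_params = pointer["size"] // 4  # float32 assumption
    print(pointer["algo"], pointer["size"], f"~{approx_params / 1e6:.2f}M params")
```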