Tier-7 256-bit carry-aware TCN cell: highest_tier 6->7, overall 0.602->0.698, tier-7 0.98

Routes p<2^256 to a new 256-bit TCN cell (12 blocks/256ch/dilations 1-128,
~4.7M params; CELL_WIDTHS now 16/32/64/128/256). Full public benchmark:
tier 7 = 0.98, overall 0.698, highest_tier_above_90 = 7, deterministic.
Tier-7 weight-perturbation compliance: 0.98 at sigma=0 -> 0.06 at sigma=0.25 -> 0.02 at sigma=0.5; untrained 0.00.
README + manifest updated to five cells / up to 2^256.
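
For reviewers, a minimal sketch of the routing rule this commit extends: each problem goes to the narrowest cell whose state holds the prime, and anything wider than the widest cell falls through to the `[0]` fallback. `pick_cell_width` is a hypothetical helper written for illustration; it is not the function in `model.py`, and only the `CELL_WIDTHS` tuple is taken from the diff.

```python
# Illustrative only: a hypothetical stand-in for the routing in model.py.
CELL_WIDTHS = (16, 32, 64, 128, 256)  # value introduced by this commit


def pick_cell_width(p, available=CELL_WIDTHS):
    """Return the narrowest width W in `available` with p < 2**W, else None.

    None stands for the case where the model would emit the [0] fallback
    without invoking the network.
    """
    for width in sorted(available):
        # p < 2**width is equivalent to p.bit_length() <= width
        if p.bit_length() <= width:
            return width
    return None
```

Under this sketch, dropping `weights256.pt` from the directory would correspond to calling `pick_cell_width(p, available=(16, 32, 64, 128))`, so 129- to 256-bit primes would return `None` again.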

Files changed (3)

README.md +30 -21
manifest.json +2 -2
model.py +8 -8

README.md CHANGED

@@ -9,17 +9,18 @@ tags:
   - neural-algorithm
 ---
-# Horner-RNN — learned modular multiplication up to 2¹²⁸
 A compliant **bit-sequential RNN** that computes `(a · b) mod p` for primes `p` up to
-**2¹²⁸**, by *learning the Horner step of double-and-add* rather than memorising
 multiplication tables. Entry for the
 [Modular Arithmetic Challenge](https://github.com/SAIRcompetition/modular-arithmetic-challenge).
-- **Saturates tiers 1–6** (all primes `< 2¹²⁸`): tiers 1–3 = 100%, tier 4 = 99%, tier 5 = 98%, **tier 6 = 97%**
-- **overall_accuracy 0.602**, `highest_tier_above_90 = 6`
-- The 128-bit (tier-6) cell is a **carry-aware TCN** (weight-shared dilated convolutions over the
-  128 bit-positions, ~3.9M params) — a far better inductive bias for long carry chains than the MLP
 - Verifiably **generalises to primes never seen in training** (held-out-prime validation
   accuracy tracks training accuracy — no memorisation gap)
@@ -33,8 +34,9 @@ t_{k+1} = (2·t_k + a_bit_k · b) mod p      # one learned step (Horner)
 answer  = t_N           (N = bit width of p)
 ```
-The model is an RNN whose **transition function — an MLP — is trained on exactly that
-single-step map** over binary-encoded inputs. The hidden state is a quantized bit vector
 (a hard binary bottleneck), so the recurrence composes cleanly: if the cell is exact per
 step, the chain is exact end-to-end. At inference the scan feeds the bits of `a mod p` one
 per step, conditioned on `(b mod p, p)`, and the final hidden-state bits are emitted
@@ -46,7 +48,7 @@ The single-step function is **piecewise linear** (`2t + bit·b`, then subtract `
 ## Files / cells
-The model ships **four cells** and routes each problem to the narrowest one whose state
 holds the prime:
 | File | Cell | Primes | Tiers | Arch | Params | Public benchmark |
@@ -55,22 +57,25 @@ holds the prime:
 | `weights32.pt` | 32-bit | `< 2³²` | 4 | MLP, 6144 / 4 | ~114M | tier 4 = 0.99 |
 | `weights64.pt` | 64-bit | `< 2⁶⁴` | 5 | MLP, 4096 / 7, residual | ~236M | tier 5 = 0.98 |
 | `weights128.pt` | 128-bit | `< 2¹²⁸` | 6 | **carry-aware TCN**, 256ch / 10 blocks, dilations 1–64 | ~3.9M | **tier 6 = 0.97** |
-The 128-bit cell switches architecture: instead of a full-width MLP it is a **non-causal
-dilated 1-D convolutional network over the 128 bit-positions**. Carry propagation is
-*position-invariant* — the same carry/borrow rule applies at every bit — so a weight-shared
-convolution learns **one** rule applied everywhere (non-causal, so the addition carry flows
-LSB→MSB and the mod-`p` compare/borrow flows MSB→LSB), rather than an MLP learning 128 separate
-position-functions. This inductive bias drives the per-step error roughly **15× lower** than the
-same-task MLP, lifting tier 6 from 0.26 to **0.97** with a cell **~60× smaller** (16 MB vs ~950 MB).
 The 64-bit cell needs **depth and residual connections** the narrower cells do not: a 64-bit
 modular Horner step hides two long carry chains (the `2t + bit·b` addition and the
 compare-and-subtract reduction), and exact n-bit carry propagation wants MLP depth ~log₂(n).
 The last push from tier 5 = 0.74 to 0.98 came from training the 64-bit cell's single-step
 examples on the **states the chain actually visits** (the true Horner trajectory) rather than
-uniformly sampled `t` — see *Training*. For `p ≥ 2⁶⁴` the model emits the honest `[0]`
-fallback without invoking the network.
 Also in the repo: `model.py` (the `HornerRNN` entry class + `HornerCell`), `manifest.json`
 (challenge manifest), `train.py` (the 16-bit trainer).
@@ -91,7 +96,7 @@ import torch
 from model import HornerRNN          # model.py from this repo
 m = HornerRNN()
-m.load(".")                          # auto-loads weights{16,32,64}.pt from this dir
 # returns base-2 digits, MSB-first; the harness decodes them to the integer
 digits = m.predict_digits_batch([(123456789, 987654321, 4294967291)])[0]
 answer = int("".join(map(str, digits)), 2)
@@ -121,6 +126,7 @@ cell is *at* the floor. The capability therefore resides in the trained paramete
 | tier 4 (32-bit cell) | 0.99 | 0.99 | 0.86 | 0.04 | 0.02 | 0.00 |
 | tier 5 (64-bit cell) | 0.98 | 0.95 | 0.65 | 0.03 | 0.01 | 0.00 |
 | tier 6 (128-bit TCN) | 0.97 | 0.96 | 0.98 | 0.19 | 0.02 | 0.00 |
 Generalisation against memorisation: 10% of primes at each bit-width were held out of
 training entirely; chain accuracy on them matches the training primes.
@@ -137,7 +143,10 @@ the recurrence (ordinary supervised BCE on the same single-step target). The 128
 cell is trained the same single-step way but as the **carry-aware TCN** over a high-diversity
 pool of thousands of distinct 124–128 bit primes; its weight-shared dilated-convolution bias
 reaches a per-step error ~15× lower than the same-task MLP, giving **tier 6 = 0.97** in a single
-short run. Training code and the full write-up live in the solutions repo (link in the model card
 metadata / challenge leaderboard).
 ## License

   - neural-algorithm
 ---
+# Horner-RNN — learned modular multiplication up to 2²⁵⁶
 A compliant **bit-sequential RNN** that computes `(a · b) mod p` for primes `p` up to
+**2²⁵⁶**, by *learning the Horner step of double-and-add* rather than memorising
 multiplication tables. Entry for the
 [Modular Arithmetic Challenge](https://github.com/SAIRcompetition/modular-arithmetic-challenge).
+- **Saturates tiers 1–7** (all primes `< 2²⁵⁶`): tiers 1–3 = 100%, tier 4 = 99%, tier 5 = 98%, tier 6 = 97%, **tier 7 = 98%**
+- **overall_accuracy 0.698**, `highest_tier_above_90 = 7`
+- The 128-bit (tier-6) and 256-bit (tier-7) cells are **carry-aware TCNs** (weight-shared dilated
+  convolutions over the bit-positions, ~4–5M params each) — a far better inductive bias for long
+  carry chains than the MLP, and the key to the per-step precision a 128/256-step chain demands
 - Verifiably **generalises to primes never seen in training** (held-out-prime validation
   accuracy tracks training accuracy — no memorisation gap)
 answer  = t_N           (N = bit width of p)
 ```
+The model is an RNN whose **transition function — an MLP for the 16/32/64-bit cells, a
+carry-aware TCN for the 128/256-bit cells — is trained on exactly that single-step map**
+over binary-encoded inputs. The hidden state is a quantized bit vector
 (a hard binary bottleneck), so the recurrence composes cleanly: if the cell is exact per
 step, the chain is exact end-to-end. At inference the scan feeds the bits of `a mod p` one
 per step, conditioned on `(b mod p, p)`, and the final hidden-state bits are emitted
 ## Files / cells
+The model ships **five cells** and routes each problem to the narrowest one whose state
 holds the prime:
 | File | Cell | Primes | Tiers | Arch | Params | Public benchmark |
 | `weights32.pt` | 32-bit | `< 2³²` | 4 | MLP, 6144 / 4 | ~114M | tier 4 = 0.99 |
 | `weights64.pt` | 64-bit | `< 2⁶⁴` | 5 | MLP, 4096 / 7, residual | ~236M | tier 5 = 0.98 |
 | `weights128.pt` | 128-bit | `< 2¹²⁸` | 6 | **carry-aware TCN**, 256ch / 10 blocks, dilations 1–64 | ~3.9M | **tier 6 = 0.97** |
+| `weights256.pt` | 256-bit | `< 2²⁵⁶` | 7 | **carry-aware TCN**, 256ch / 12 blocks, dilations 1–128 | ~4.7M | **tier 7 = 0.98** |
+The 128- and 256-bit cells switch architecture: instead of a full-width MLP each is a **non-causal
+dilated 1-D convolutional network over the bit-positions** (128 and 256 respectively). Carry
+propagation is *position-invariant* — the same carry/borrow rule applies at every bit — so a
+weight-shared convolution learns **one** rule applied everywhere (non-causal, so the addition carry
+flows LSB→MSB and the mod-`p` compare/borrow flows MSB→LSB), rather than an MLP learning a separate
+position-function per bit. This inductive bias drives the per-step error roughly **15× lower** than the
+same-task MLP — the difference between a 128/256-step chain landing at ~0.26 and at **0.97 / 0.98** —
+in cells **~60× smaller** than the wide MLPs (16–19 MB each vs ~950 MB). The receptive field of each
+TCN spans its full width in both carry directions, so a carry can propagate across the entire word.
 The 64-bit cell needs **depth and residual connections** the narrower cells do not: a 64-bit
 modular Horner step hides two long carry chains (the `2t + bit·b` addition and the
 compare-and-subtract reduction), and exact n-bit carry propagation wants MLP depth ~log₂(n).
 The last push from tier 5 = 0.74 to 0.98 came from training the 64-bit cell's single-step
 examples on the **states the chain actually visits** (the true Horner trajectory) rather than
+uniformly sampled `t` — see *Training*. For `p ≥ 2²⁵⁶` (wider than the widest trained cell) the
+model emits the honest `[0]` fallback without invoking the network.
 Also in the repo: `model.py` (the `HornerRNN` entry class + `HornerCell`), `manifest.json`
 (challenge manifest), `train.py` (the 16-bit trainer).
 from model import HornerRNN          # model.py from this repo
 m = HornerRNN()
+m.load(".")                          # auto-loads weights{16,32,64,128,256}.pt from this dir
 # returns base-2 digits, MSB-first; the harness decodes them to the integer
 digits = m.predict_digits_batch([(123456789, 987654321, 4294967291)])[0]
 answer = int("".join(map(str, digits)), 2)
 | tier 4 (32-bit cell) | 0.99 | 0.99 | 0.86 | 0.04 | 0.02 | 0.00 |
 | tier 5 (64-bit cell) | 0.98 | 0.95 | 0.65 | 0.03 | 0.01 | 0.00 |
 | tier 6 (128-bit TCN) | 0.97 | 0.96 | 0.98 | 0.19 | 0.02 | 0.00 |
+| tier 7 (256-bit TCN) | 0.98 | 0.97 | 0.99 | 0.06 | 0.02 | 0.00 |
 Generalisation against memorisation: 10% of primes at each bit-width were held out of
 training entirely; chain accuracy on them matches the training primes.
 cell is trained the same single-step way but as the **carry-aware TCN** over a high-diversity
 pool of thousands of distinct 124–128 bit primes; its weight-shared dilated-convolution bias
 reaches a per-step error ~15× lower than the same-task MLP, giving **tier 6 = 0.97** in a single
+short run. The 256-bit (tier-7) cell is the same carry-aware TCN scaled to 256 bit-positions
+(dilations cycling 1–128), trained identically on true-trajectory single steps over distinct
+252–256 bit primes; its per-step error is low enough that the 256-step chain holds at **tier 7 =
+0.98**. Training code and the full write-up live in the solutions repo (link in the model card
 metadata / challenge leaderboard).
 ## License

manifest.json CHANGED

@@ -2,6 +2,6 @@
   "entry_class": "model.HornerRNN",
   "output_base": 2,
   "framework": "pytorch",
-  "model_description": "Bit-sequential RNN (~405M params across four cells) for primes up to 2^128. Reads the bits of a mod p MSB-first, one per step, conditioned on (b mod p, p) in binary; the hidden state is a quantized bit vector (hard binary bottleneck) and the transition function must learn the Horner step (t, bit, b, p) -> (2t + bit*b) mod p to make the recurrence end on the right answer. Four cells are shipped and routed by prime size: a 16-bit cell (MLP, width 4096 depth 4, ~50M params) for p < 2^16 covering tiers 1-3, a 32-bit cell (MLP, width 6144 depth 4, ~114M params) for p < 2^32 covering tier 4, a 64-bit cell (MLP, width 4096 depth 7 with pre-norm residual blocks, ~236M params) for p < 2^64 covering tier 5, and a 128-bit cell for p < 2^128 covering tier 6 that is a CARRY-AWARE TCN: a non-causal dilated 1D-convolutional network over the 128 bit-positions (10 residual blocks, 256 channels, dilations cycling 1..64 so the receptive field spans all 128 bits, ~3.9M params). The convolution is weight-shared across bit positions, so it learns ONE carry/borrow rule applied everywhere (non-causally, so the addition carry can flow LSB->MSB and the mod-p compare/borrow MSB->LSB) instead of a full-width MLP learning 128 separate position-functions; this inductive bias drives the per-step error far below what the MLP cell reached and lifts tier 6 to 0.97. Final state bits are emitted MSB-first as the base-2 answer. For p >= 2^128 emits the honest [0] fallback without invoking the network.",
-  "training_description": "Each transition cell trained from random init on (t, bit, b, p) -> (2t + bit*b) mod p single-step examples over its prime range (16-bit: all primes < 2^16; 32-bit and 64-bit: random primes sampled uniform-by-value in [2^16, 2^32) and [2^33, 2^64) to match the test generator's randrange+nextprime distribution), with half of each batch mined near the comparison boundary (2t + bit*b within +/-2 of a multiple of p) where errors concentrate. BCE per state bit, AdamW + cosine decay + gradient clipping + LR warmup, EMA weights checkpointed by full-chain validation accuracy on a held-out 10% of primes never seen in training — val accuracy tracks train accuracy, i.e. the cells generalise across primes rather than memorising them. The 64-bit cell additionally receives a second fine-tuning phase on single steps drawn from the TRUE Horner trajectory (each example is a (t, bit, b, p) -> (2t + bit*b) mod p step where t is an actual chain intermediate (a_{>=i}*b) mod p, not a uniform sample), which matches the training distribution to the states the chain visits at inference and lifts tier 5 from 0.74 to 0.98; still ordinary supervised BCE on the same single-step target, no backprop through the recurrence. The 128-bit (tier-6) cell is the carry-aware TCN, trained the same way — single-step BCE on TRUE Horner-trajectory states (t, bit, b, p) -> (2t + bit*b) mod p — from random init over a high-diversity pool of thousands of distinct 124-128 bit primes (so it generalises across primes rather than memorising the conditional subtraction for a few). Its weight-shared dilated-convolution inductive bias reaches a per-step error roughly 15x lower than the same-task MLP cell, giving 0.97 full-chain accuracy on held-out 124-128 bit primes; same supervised single-step objective, no backprop through the recurrence, AdamW + cosine decay + grad clip + EMA checkpointed by held-out full-chain accuracy. Weight-perturbation compliance (exploration/compliance_perturb.py): tier-6 accuracy 0.97 at sigma=0 collapses toward the floor as the conv weights are perturbed (0.19 at sigma=0.25, 0.02 at sigma=0.5) and an untrained cell scores 0.00 — the arithmetic resides in the trained parameters. Training scripts: train.py (16-bit), exploration/train_horner32.py (32-bit), exploration/train_horner64.py (64-bit phase 1, --residual) then exploration/train_horner64_traj.py (64-bit phase 2, trajectory), exploration/train_horner128_bigru.py --arch tcn (128-bit carry-aware TCN)."
 }

   "entry_class": "model.HornerRNN",
   "output_base": 2,
   "framework": "pytorch",
+  "model_description": "Bit-sequential RNN (~410M params across five cells) for primes up to 2^256. Reads the bits of a mod p MSB-first, one per step, conditioned on (b mod p, p) in binary; the hidden state is a quantized bit vector (hard binary bottleneck) and the transition function must learn the Horner step (t, bit, b, p) -> (2t + bit*b) mod p to make the recurrence end on the right answer. Five cells are shipped and routed by prime size: a 16-bit cell (MLP, width 4096 depth 4, ~50M params) for p < 2^16 covering tiers 1-3, a 32-bit cell (MLP, width 6144 depth 4, ~114M params) for p < 2^32 covering tier 4, a 64-bit cell (MLP, width 4096 depth 7 with pre-norm residual blocks, ~236M params) for p < 2^64 covering tier 5, a 128-bit cell for p < 2^128 covering tier 6 that is a CARRY-AWARE TCN: a non-causal dilated 1D-convolutional network over the 128 bit-positions (10 residual blocks, 256 channels, dilations cycling 1..64 so the receptive field spans all 128 bits, ~3.9M params), and a 256-bit cell for p < 2^256 covering tier 7 that uses the SAME carry-aware TCN architecture scaled to 256 bit-positions (12 residual blocks, 256 channels, dilations cycling 1..128 so the receptive field spans all 256 bits, ~4.7M params), lifting tier 7 to 0.98. The convolution is weight-shared across bit positions, so it learns ONE carry/borrow rule applied everywhere (non-causally, so the addition carry can flow LSB->MSB and the mod-p compare/borrow MSB->LSB) instead of a full-width MLP learning a separate position-function per bit; this inductive bias drives the per-step error far below what an MLP cell reaches and is what makes the 128- and 256-bit chains (which compound the per-step error over 128 and 256 steps) accurate. Final state bits are emitted MSB-first as the base-2 answer. For p >= 2^256 emits the honest [0] fallback without invoking the network.",
+  "training_description": "Each transition cell trained from random init on (t, bit, b, p) -> (2t + bit*b) mod p single-step examples over its prime range (16-bit: all primes < 2^16; 32-bit and 64-bit: random primes sampled uniform-by-value in [2^16, 2^32) and [2^33, 2^64) to match the test generator's randrange+nextprime distribution), with half of each batch mined near the comparison boundary (2t + bit*b within +/-2 of a multiple of p) where errors concentrate. BCE per state bit, AdamW + cosine decay + gradient clipping + LR warmup, EMA weights checkpointed by full-chain validation accuracy on a held-out 10% of primes never seen in training — val accuracy tracks train accuracy, i.e. the cells generalise across primes rather than memorising them. The 64-bit cell additionally receives a second fine-tuning phase on single steps drawn from the TRUE Horner trajectory (each example is a (t, bit, b, p) -> (2t + bit*b) mod p step where t is an actual chain intermediate (a_{>=i}*b) mod p, not a uniform sample), which matches the training distribution to the states the chain visits at inference and lifts tier 5 from 0.74 to 0.98; still ordinary supervised BCE on the same single-step target, no backprop through the recurrence. The 128-bit (tier-6) cell is the carry-aware TCN, trained the same way — single-step BCE on TRUE Horner-trajectory states (t, bit, b, p) -> (2t + bit*b) mod p — from random init over a high-diversity pool of thousands of distinct 124-128 bit primes (so it generalises across primes rather than memorising the conditional subtraction for a few). Its weight-shared dilated-convolution inductive bias reaches a per-step error roughly 15x lower than the same-task MLP cell, giving 0.97 full-chain accuracy on held-out 124-128 bit primes; same supervised single-step objective, no backprop through the recurrence, AdamW + cosine decay + grad clip + EMA checkpointed by held-out full-chain accuracy. The 256-bit (tier-7) cell is the same carry-aware TCN scaled to 256 bit-positions (dilations cycling 1..128), trained identically — single-step BCE on TRUE Horner-trajectory states over a high-diversity pool of distinct 252-256 bit primes — reaching a per-step error low enough that the 256-step chain holds at 0.98 full-chain accuracy on held-out 252-256 bit primes. Weight-perturbation compliance (exploration/compliance_perturb.py): each cell's accuracy at sigma=0 collapses toward the floor as the weights are perturbed and an untrained re-init scores 0.00 — e.g. tier 6 0.97 -> 0.19 (sigma=0.25) -> 0.02 (sigma=0.5) and tier 7 0.98 -> 0.06 (sigma=0.25) -> 0.02 (sigma=0.5), untrained 0.00 for both — so the arithmetic resides in the trained parameters. Training scripts: train.py (16-bit), exploration/train_horner32.py (32-bit), exploration/train_horner64.py (64-bit phase 1, --residual) then exploration/train_horner64_traj.py (64-bit phase 2, trajectory), exploration/train_horner128_bigru.py --arch tcn (128-bit carry-aware TCN), exploration/train_horner_tcn.py --bits 256 (256-bit carry-aware TCN)."
 }

model.py CHANGED

@@ -5,8 +5,8 @@ one per step, conditioned on ``(b mod p, p)`` in binary. The hidden state is a
 quantized bit vector (a discrete bottleneck — a hard VQ layer with a fixed
 binary codebook), and the transition function — an MLP for the 16/32/64-bit
 cells, a weight-shared carry-aware dilated-conv TCN (TCNHornerCell) for the
-128-bit cell — is entirely trained parameters. After the last bit, the hidden
-state bits ARE the answer, emitted MSB-first in base 2.
 Why this is interesting: for the recurrence to end on the right answer, the
 trained cell must *learn* the map ``(t, bit, b, p) -> (2t + bit*b) mod p`` —
@@ -24,16 +24,16 @@ line is respected here:
     random weights the output is noise (Principle 2), and the emitted digits are
     exactly the model's final hidden state (Principle 1).
   - learned (all of the actual arithmetic): the transition function. Nothing in
-    the code adds, multiplies, compares against p, or carries; the cell's MLP
-    weights had to learn all of that from data.
 The two-operand reductions ``a mod p`` / ``b mod p`` in ``predict_digits`` are
 the same legal input normalisation every other reference model uses.
 The model ships one cell per bit-width (16 -> tiers 1-3, 32 -> tier 4, 64 ->
-tier 5, and 128 -> tier 6 when present) and routes each problem to the narrowest
-cell whose state holds the prime. For primes wider than the widest trained cell
-it emits the honest ``[0]`` fallback without invoking the network.
 """
 from __future__ import annotations
@@ -48,7 +48,7 @@ from modchallenge.interface.base_model import ModularMultiplicationModel
 # Bit-widths we may ship a cell for, narrowest first. load() picks up whichever
 # weights{W}.pt files are actually present, so adding a wider cell is drop-in.
-CELL_WIDTHS = (16, 32, 64, 128)
 # Default state width for the 16-bit trainer (train.py imports this).
 BITS = 16

 quantized bit vector (a discrete bottleneck — a hard VQ layer with a fixed
 binary codebook), and the transition function — an MLP for the 16/32/64-bit
 cells, a weight-shared carry-aware dilated-conv TCN (TCNHornerCell) for the
+128- and 256-bit cells — is entirely trained parameters. After the last bit,
+the hidden state bits ARE the answer, emitted MSB-first in base 2.
 Why this is interesting: for the recurrence to end on the right answer, the
 trained cell must *learn* the map ``(t, bit, b, p) -> (2t + bit*b) mod p`` —
     random weights the output is noise (Principle 2), and the emitted digits are
     exactly the model's final hidden state (Principle 1).
   - learned (all of the actual arithmetic): the transition function. Nothing in
+    the code adds, multiplies, compares against p, or carries; the cell's trained
+    weights (MLP or carry-aware TCN) had to learn all of that from data.
 The two-operand reductions ``a mod p`` / ``b mod p`` in ``predict_digits`` are
 the same legal input normalisation every other reference model uses.
 The model ships one cell per bit-width (16 -> tiers 1-3, 32 -> tier 4, 64 ->
+tier 5, 128 -> tier 6, and 256 -> tier 7 when present) and routes each problem to
+the narrowest cell whose state holds the prime. For primes wider than the widest
+trained cell it emits the honest ``[0]`` fallback without invoking the network.
 """
 from __future__ import annotations
 # Bit-widths we may ship a cell for, narrowest first. load() picks up whichever
 # weights{W}.pt files are actually present, so adding a wider cell is drop-in.
+CELL_WIDTHS = (16, 32, 64, 128, 256)
 # Default state width for the 16-bit trainer (train.py imports this).
 BITS = 16
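
Review note: the map every cell is trained to reproduce, `(t, bit, b, p) -> (2t + bit*b) mod p`, can be written in exact integer arithmetic as a ground-truth check on any cell width, including the new 256-bit one. The sketch below is a plain-Python reference and not part of `model.py`; `horner_step`, `horner_modmul`, and `to_digits_msb_first` are hypothetical names chosen for illustration, and the scan length `N = bit width of p` follows the README.

```python
def horner_step(t, bit, b, p):
    """Exact single step: the target the learned transition must match."""
    return (2 * t + bit * b) % p


def horner_modmul(a, b, p):
    """Compute (a * b) mod p by scanning the bits of a mod p, MSB first."""
    a, b = a % p, b % p          # same input normalisation the docstring mentions
    n = p.bit_length()           # N = bit width of p; a < p fits in n bits
    t = 0
    for k in range(n - 1, -1, -1):
        bit = (a >> k) & 1       # k-th bit of a, starting from the top
        t = horner_step(t, bit, b, p)
    return t


def to_digits_msb_first(t, n):
    """Render the final state as n base-2 digits, MSB first."""
    return [(t >> k) & 1 for k in range(n - 1, -1, -1)]
```

A possible use is as a label generator for full-chain validation: compare `int("".join(map(str, digits)), 2)` from `predict_digits_batch` against `horner_modmul(a, b, p)` for held-out primes.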