etwk committed on
Commit
b69f4ce
·
1 Parent(s): 47e319e

Tier-7 256-bit carry-aware TCN cell: highest_tier 6->7, overall 0.602->0.698, tier-7 0.98

Routes p<2^256 to a new 256-bit TCN cell (12 blocks/256ch/dilations 1-128,
~4.7M params; CELL_WIDTHS now 16/32/64/128/256). Full public benchmark:
tier 7 = 0.98, overall 0.698, highest_tier_above_90 = 7, deterministic.
Weight-perturbation compliance (tier 7): 0.98 at sigma=0 -> 0.06 at sigma=0.25 -> 0.02 at sigma=0.5; untrained 0.00.
README + manifest updated to five cells / up to 2^256.

Files changed (3)
  1. README.md +30 -21
  2. manifest.json +2 -2
  3. model.py +8 -8
README.md CHANGED
@@ -9,17 +9,18 @@ tags:
9
  - neural-algorithm
10
  ---
11
 
12
- # Horner-RNN — learned modular multiplication up to 2¹²⁸
13
 
14
  A compliant **bit-sequential RNN** that computes `(a · b) mod p` for primes `p` up to
15
- **2¹²⁸**, by *learning the Horner step of double-and-add* rather than memorising
16
  multiplication tables. Entry for the
17
  [Modular Arithmetic Challenge](https://github.com/SAIRcompetition/modular-arithmetic-challenge).
18
 
19
- - **Saturates tiers 1–6** (all primes `< 2¹²⁸`): tiers 1–3 = 100%, tier 4 = 99%, tier 5 = 98%, **tier 6 = 97%**
20
- - **overall_accuracy 0.602**, `highest_tier_above_90 = 6`
21
- - The 128-bit (tier-6) cell is a **carry-aware TCN** (weight-shared dilated convolutions over the
22
- 128 bit-positions, ~3.9M params) — a far better inductive bias for long carry chains than the MLP
 
23
  - Verifiably **generalises to primes never seen in training** (held-out-prime validation
24
  accuracy tracks training accuracy — no memorisation gap)
25
 
@@ -33,8 +34,9 @@ t_{k+1} = (2·t_k + a_bit_k · b) mod p # one learned step (Horner)
33
  answer = t_N (N = bit width of p)
34
  ```
35
 
36
- The model is an RNN whose **transition function (an MLP) is trained on exactly that
37
- single-step map** over binary-encoded inputs. The hidden state is a quantized bit vector
 
38
  (a hard binary bottleneck), so the recurrence composes cleanly: if the cell is exact per
39
  step, the chain is exact end-to-end. At inference the scan feeds the bits of `a mod p` one
40
  per step, conditioned on `(b mod p, p)`, and the final hidden-state bits are emitted
@@ -46,7 +48,7 @@ The single-step function is **piecewise linear** (`2t + bit·b`, then subtract `
46
 
47
  ## Files / cells
48
 
49
- The model ships **four cells** and routes each problem to the narrowest one whose state
50
  holds the prime:
51
 
52
  | File | Cell | Primes | Tiers | Arch | Params | Public benchmark |
@@ -55,22 +57,25 @@ holds the prime:
55
  | `weights32.pt` | 32-bit | `< 2³²` | 4 | MLP, 6144 / 4 | ~114M | tier 4 = 0.99 |
56
  | `weights64.pt` | 64-bit | `< 2⁶⁴` | 5 | MLP, 4096 / 7, residual | ~236M | tier 5 = 0.98 |
57
  | `weights128.pt` | 128-bit | `< 2¹²⁸` | 6 | **carry-aware TCN**, 256ch / 10 blocks, dilations 1–64 | ~3.9M | **tier 6 = 0.97** |
58
-
59
- The 128-bit cell switches architecture: instead of a full-width MLP it is a **non-causal
60
- dilated 1-D convolutional network over the 128 bit-positions**. Carry propagation is
61
- *position-invariant* (the same carry/borrow rule applies at every bit), so a weight-shared
62
- convolution learns **one** rule applied everywhere (non-causal, so the addition carry flows
63
- LSB→MSB and the mod-`p` compare/borrow flows MSB→LSB), rather than an MLP learning 128 separate
64
- position-functions. This inductive bias drives the per-step error roughly **15× lower** than the
65
- same-task MLP, lifting tier 6 from 0.26 to **0.97** with a cell **~60× smaller** (16 MB vs ~950 MB).
 
 
 
66
 
67
  The 64-bit cell needs **depth and residual connections** the narrower cells do not: a 64-bit
68
  modular Horner step hides two long carry chains (the `2t + bit·b` addition and the
69
  compare-and-subtract reduction), and exact n-bit carry propagation wants MLP depth ~log₂(n).
70
  The last push from tier 5 = 0.74 to 0.98 came from training the 64-bit cell's single-step
71
  examples on the **states the chain actually visits** (the true Horner trajectory) rather than
72
- uniformly sampled `t` — see *Training*. For `p ≥ 2¹²⁸` the model emits the honest `[0]`
73
- fallback without invoking the network.
74
 
75
  Also in the repo: `model.py` (the `HornerRNN` entry class + `HornerCell`), `manifest.json`
76
  (challenge manifest), `train.py` (the 16-bit trainer).
@@ -91,7 +96,7 @@ import torch
91
  from model import HornerRNN # model.py from this repo
92
 
93
  m = HornerRNN()
94
- m.load(".") # auto-loads weights{16,32,64}.pt from this dir
95
  # returns base-2 digits, MSB-first; the harness decodes them to the integer
96
  digits = m.predict_digits_batch([(123456789, 987654321, 4294967291)])[0]
97
  answer = int("".join(map(str, digits)), 2)
@@ -121,6 +126,7 @@ cell is *at* the floor. The capability therefore resides in the trained paramete
121
  | tier 4 (32-bit cell) | 0.99 | 0.99 | 0.86 | 0.04 | 0.02 | 0.00 |
122
  | tier 5 (64-bit cell) | 0.98 | 0.95 | 0.65 | 0.03 | 0.01 | 0.00 |
123
  | tier 6 (128-bit TCN) | 0.97 | 0.96 | 0.98 | 0.19 | 0.02 | 0.00 |
 
124
 
125
  Generalisation against memorisation: 10% of primes at each bit-width were held out of
126
  training entirely; chain accuracy on them matches the training primes.
@@ -137,7 +143,10 @@ the recurrence (ordinary supervised BCE on the same single-step target). The 128
137
  cell is trained the same single-step way but as the **carry-aware TCN** over a high-diversity
138
  pool of thousands of distinct 124–128 bit primes; its weight-shared dilated-convolution bias
139
  reaches a per-step error ~15× lower than the same-task MLP, giving **tier 6 = 0.97** in a single
140
- short run. Training code and the full write-up live in the solutions repo (link in the model card
 
 
 
141
  metadata / challenge leaderboard).
142
 
143
  ## License
 
9
  - neural-algorithm
10
  ---
11
 
12
+ # Horner-RNN — learned modular multiplication up to 2²⁵⁶
13
 
14
  A compliant **bit-sequential RNN** that computes `(a · b) mod p` for primes `p` up to
15
+ **2²⁵⁶**, by *learning the Horner step of double-and-add* rather than memorising
16
  multiplication tables. Entry for the
17
  [Modular Arithmetic Challenge](https://github.com/SAIRcompetition/modular-arithmetic-challenge).
18
 
19
+ - **Saturates tiers 1–7** (all primes `< 2²⁵⁶`): tiers 1–3 = 100%, tier 4 = 99%, tier 5 = 98%, tier 6 = 97%, **tier 7 = 98%**
20
+ - **overall_accuracy 0.698**, `highest_tier_above_90 = 7`
21
+ - The 128-bit (tier-6) and 256-bit (tier-7) cells are **carry-aware TCNs** (weight-shared dilated
22
+ convolutions over the bit-positions, ~4–5M params each) — a far better inductive bias for long
23
+ carry chains than the MLP, and the key to the per-step precision a 128/256-step chain demands
24
  - Verifiably **generalises to primes never seen in training** (held-out-prime validation
25
  accuracy tracks training accuracy — no memorisation gap)
26
 
 
34
  answer = t_N (N = bit width of p)
35
  ```
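As a point of reference, the recurrence above can be written as ordinary integer arithmetic. This is a minimal sketch of the *target* the cell is trained to reproduce, not the learned model itself; the function names are hypothetical, and starting the scan from `t = 0` is an assumption (the initial state is not shown in this hunk).

```python
def horner_step(t: int, bit: int, b: int, p: int) -> int:
    """One double-and-add step: the single-step map (t, bit, b, p) -> (2t + bit*b) mod p."""
    return (2 * t + bit * b) % p


def horner_modmul(a: int, b: int, p: int) -> int:
    """Compute (a * b) mod p by scanning the bits of a mod p, MSB first."""
    a, b = a % p, b % p          # input normalisation, as in the README
    n = p.bit_length()           # N = bit width of p
    t = 0                        # assumed initial state
    for k in reversed(range(n)):
        bit = (a >> k) & 1       # k-th bit of a mod p, MSB first
        t = horner_step(t, bit, b, p)
    return t                     # answer = t_N


a, b, p = 123456789, 987654321, 4294967291
assert horner_modmul(a, b, p) == (a * b) % p
```

Because `a mod p < 2^N`, scanning `N` bits covers every set bit of the reduced operand, so the final state equals `(a · b) mod p`.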
36
 
37
+ The model is an RNN whose **transition function (an MLP for the 16/32/64-bit cells, a
38
+ carry-aware TCN for the 128/256-bit cells) is trained on exactly that single-step map**
39
+ over binary-encoded inputs. The hidden state is a quantized bit vector
40
  (a hard binary bottleneck), so the recurrence composes cleanly: if the cell is exact per
41
  step, the chain is exact end-to-end. At inference the scan feeds the bits of `a mod p` one
42
  per step, conditioned on `(b mod p, p)`, and the final hidden-state bits are emitted
 
48
 
49
  ## Files / cells
50
 
51
+ The model ships **five cells** and routes each problem to the narrowest one whose state
52
  holds the prime:
53
 
54
  | File | Cell | Primes | Tiers | Arch | Params | Public benchmark |
 
57
  | `weights32.pt` | 32-bit | `< 2³²` | 4 | MLP, 6144 / 4 | ~114M | tier 4 = 0.99 |
58
  | `weights64.pt` | 64-bit | `< 2⁶⁴` | 5 | MLP, 4096 / 7, residual | ~236M | tier 5 = 0.98 |
59
  | `weights128.pt` | 128-bit | `< 2¹²⁸` | 6 | **carry-aware TCN**, 256ch / 10 blocks, dilations 1–64 | ~3.9M | **tier 6 = 0.97** |
60
+ | `weights256.pt` | 256-bit | `< 2²⁵⁶` | 7 | **carry-aware TCN**, 256ch / 12 blocks, dilations 1–128 | ~4.7M | **tier 7 = 0.98** |
61
+
62
+ The 128- and 256-bit cells switch architecture: instead of a full-width MLP each is a **non-causal
63
+ dilated 1-D convolutional network over the bit-positions** (128 and 256 respectively). Carry
64
+ propagation is *position-invariant* (the same carry/borrow rule applies at every bit), so a
65
+ weight-shared convolution learns **one** rule applied everywhere (non-causal, so the addition carry
66
+ flows LSB→MSB and the mod-`p` compare/borrow flows MSB→LSB), rather than an MLP learning a separate
67
+ position-function per bit. This inductive bias drives the per-step error roughly **15× lower** than the
68
+ same-task MLP — the difference between a 128/256-step chain landing at ~0.26 and at **0.97 / 0.98** —
69
+ in cells **~60× smaller** than the wide MLPs (16–19 MB each vs ~950 MB). The receptive field of each
70
+ TCN spans its full width in both carry directions, so a carry can propagate across the entire word.
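
The receptive-field claim can be sanity-checked with a short calculation. The sketch below rests on assumptions not stated in the diff: kernel size 3, one dilated convolution per residual block, and dilations that double from 1 up to the maximum and then restart. Under those assumptions it sums how far a non-causal stack can see on each side of a bit position.

```python
from __future__ import annotations


def cycling_dilations(num_blocks: int, max_dilation: int) -> list[int]:
    """Dilations 1, 2, 4, ..., max_dilation, repeated until num_blocks are filled (assumed schedule)."""
    cycle = []
    d = 1
    while d <= max_dilation:
        cycle.append(d)
        d *= 2
    return [cycle[i % len(cycle)] for i in range(num_blocks)]


def reach_per_side(dilations: list[int], kernel_size: int = 3) -> int:
    """Bit positions visible to the left (or right) of the centre for a symmetric conv stack."""
    return sum(d * (kernel_size - 1) // 2 for d in dilations)


for width, blocks, max_d in ((128, 10, 64), (256, 12, 128)):
    reach = reach_per_side(cycling_dilations(blocks, max_d))
    # A carry may have to travel from one end of the word to the other: width - 1 positions.
    print(width, reach, reach >= width - 1)
```

With these assumed settings the per-side reach is 134 for the 128-bit cell and 270 for the 256-bit cell, both above the `width - 1` positions a carry may need to cross. A different kernel size or block layout would change the numbers.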
71
 
72
  The 64-bit cell needs **depth and residual connections** the narrower cells do not: a 64-bit
73
  modular Horner step hides two long carry chains (the `2t + bit·b` addition and the
74
  compare-and-subtract reduction), and exact n-bit carry propagation wants MLP depth ~log₂(n).
75
  The last push from tier 5 = 0.74 to 0.98 came from training the 64-bit cell's single-step
76
  examples on the **states the chain actually visits** (the true Horner trajectory) rather than
77
+ uniformly sampled `t` — see *Training*. For `p ≥ 2²⁵⁶` (wider than the widest trained cell) the
78
+ model emits the honest `[0]` fallback without invoking the network.
79
 
80
  Also in the repo: `model.py` (the `HornerRNN` entry class + `HornerCell`), `manifest.json`
81
  (challenge manifest), `train.py` (the 16-bit trainer).
 
96
  from model import HornerRNN # model.py from this repo
97
 
98
  m = HornerRNN()
99
+ m.load(".") # auto-loads weights{16,32,64,128,256}.pt from this dir
100
  # returns base-2 digits, MSB-first; the harness decodes them to the integer
101
  digits = m.predict_digits_batch([(123456789, 987654321, 4294967291)])[0]
102
  answer = int("".join(map(str, digits)), 2)
 
126
  | tier 4 (32-bit cell) | 0.99 | 0.99 | 0.86 | 0.04 | 0.02 | 0.00 |
127
  | tier 5 (64-bit cell) | 0.98 | 0.95 | 0.65 | 0.03 | 0.01 | 0.00 |
128
  | tier 6 (128-bit TCN) | 0.97 | 0.96 | 0.98 | 0.19 | 0.02 | 0.00 |
129
+ | tier 7 (256-bit TCN) | 0.98 | 0.97 | 0.99 | 0.06 | 0.02 | 0.00 |
130
 
131
  Generalisation against memorisation: 10% of primes at each bit-width were held out of
132
  training entirely; chain accuracy on them matches the training primes.
 
143
  cell is trained the same single-step way but as the **carry-aware TCN** over a high-diversity
144
  pool of thousands of distinct 124–128 bit primes; its weight-shared dilated-convolution bias
145
  reaches a per-step error ~15× lower than the same-task MLP, giving **tier 6 = 0.97** in a single
146
+ short run. The 256-bit (tier-7) cell is the same carry-aware TCN scaled to 256 bit-positions
147
+ (dilations cycling 1–128), trained identically on true-trajectory single steps over distinct
148
+ 252–256 bit primes; its per-step error is low enough that the 256-step chain holds at **tier 7 =
149
+ 0.98**. Training code and the full write-up live in the solutions repo (link in the model card
150
  metadata / challenge leaderboard).
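
To see why per-step precision matters for long chains, here is a rough back-of-envelope sketch. It assumes each step fails independently with the same probability, which is a simplification the source does not claim (real errors may be correlated or concentrated near the comparison boundary), so the numbers are only indicative.

```python
def implied_per_step_error(chain_accuracy: float, steps: int) -> float:
    """Per-step error eps such that (1 - eps) ** steps equals the chain accuracy."""
    return 1.0 - chain_accuracy ** (1.0 / steps)


def chain_accuracy(per_step_error: float, steps: int) -> float:
    """Chain accuracy if every one of `steps` steps must be exact (independence assumed)."""
    return (1.0 - per_step_error) ** steps


# Reported chain accuracies and chain lengths from the README.
for tier, acc, steps in ((6, 0.97, 128), (7, 0.98, 256)):
    eps = implied_per_step_error(acc, steps)
    print(f"tier {tier}: implied per-step error ~ {eps:.2e}")
```

Under this simplified model, both TCN cells need a per-step error on the order of 1e-4 to reach the reported chain accuracies.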
151
 
152
  ## License
manifest.json CHANGED
@@ -2,6 +2,6 @@
2
  "entry_class": "model.HornerRNN",
3
  "output_base": 2,
4
  "framework": "pytorch",
5
- "model_description": "Bit-sequential RNN (~405M params across four cells) for primes up to 2^128. Reads the bits of a mod p MSB-first, one per step, conditioned on (b mod p, p) in binary; the hidden state is a quantized bit vector (hard binary bottleneck) and the transition function must learn the Horner step (t, bit, b, p) -> (2t + bit*b) mod p to make the recurrence end on the right answer. Four cells are shipped and routed by prime size: a 16-bit cell (MLP, width 4096 depth 4, ~50M params) for p < 2^16 covering tiers 1-3, a 32-bit cell (MLP, width 6144 depth 4, ~114M params) for p < 2^32 covering tier 4, a 64-bit cell (MLP, width 4096 depth 7 with pre-norm residual blocks, ~236M params) for p < 2^64 covering tier 5, and a 128-bit cell for p < 2^128 covering tier 6 that is a CARRY-AWARE TCN: a non-causal dilated 1D-convolutional network over the 128 bit-positions (10 residual blocks, 256 channels, dilations cycling 1..64 so the receptive field spans all 128 bits, ~3.9M params). The convolution is weight-shared across bit positions, so it learns ONE carry/borrow rule applied everywhere (non-causally, so the addition carry can flow LSB->MSB and the mod-p compare/borrow MSB->LSB) instead of a full-width MLP learning 128 separate position-functions; this inductive bias drives the per-step error far below what the MLP cell reached and lifts tier 6 to 0.97. Final state bits are emitted MSB-first as the base-2 answer. For p >= 2^128 emits the honest [0] fallback without invoking the network.",
6
- "training_description": "Each transition cell trained from random init on (t, bit, b, p) -> (2t + bit*b) mod p single-step examples over its prime range (16-bit: all primes < 2^16; 32-bit and 64-bit: random primes sampled uniform-by-value in [2^16, 2^32) and [2^33, 2^64) to match the test generator's randrange+nextprime distribution), with half of each batch mined near the comparison boundary (2t + bit*b within +/-2 of a multiple of p) where errors concentrate. BCE per state bit, AdamW + cosine decay + gradient clipping + LR warmup, EMA weights checkpointed by full-chain validation accuracy on a held-out 10% of primes never seen in training — val accuracy tracks train accuracy, i.e. the cells generalise across primes rather than memorising them. The 64-bit cell additionally receives a second fine-tuning phase on single steps drawn from the TRUE Horner trajectory (each example is a (t, bit, b, p) -> (2t + bit*b) mod p step where t is an actual chain intermediate (a_{>=i}*b) mod p, not a uniform sample), which matches the training distribution to the states the chain visits at inference and lifts tier 5 from 0.74 to 0.98; still ordinary supervised BCE on the same single-step target, no backprop through the recurrence. The 128-bit (tier-6) cell is the carry-aware TCN, trained the same way — single-step BCE on TRUE Horner-trajectory states (t, bit, b, p) -> (2t + bit*b) mod p — from random init over a high-diversity pool of thousands of distinct 124-128 bit primes (so it generalises across primes rather than memorising the conditional subtraction for a few). Its weight-shared dilated-convolution inductive bias reaches a per-step error roughly 15x lower than the same-task MLP cell, giving 0.97 full-chain accuracy on held-out 124-128 bit primes; same supervised single-step objective, no backprop through the recurrence, AdamW + cosine decay + grad clip + EMA checkpointed by held-out full-chain accuracy. 
Weight-perturbation compliance (exploration/compliance_perturb.py): tier-6 accuracy 0.97 at sigma=0 collapses toward the floor as the conv weights are perturbed (0.19 at sigma=0.25, 0.02 at sigma=0.5) and an untrained cell scores 0.00 — the arithmetic resides in the trained parameters. Training scripts: train.py (16-bit), exploration/train_horner32.py (32-bit), exploration/train_horner64.py (64-bit phase 1, --residual) then exploration/train_horner64_traj.py (64-bit phase 2, trajectory), exploration/train_horner128_bigru.py --arch tcn (128-bit carry-aware TCN)."
7
  }
 
2
  "entry_class": "model.HornerRNN",
3
  "output_base": 2,
4
  "framework": "pytorch",
5
+ "model_description": "Bit-sequential RNN (~410M params across five cells) for primes up to 2^256. Reads the bits of a mod p MSB-first, one per step, conditioned on (b mod p, p) in binary; the hidden state is a quantized bit vector (hard binary bottleneck) and the transition function must learn the Horner step (t, bit, b, p) -> (2t + bit*b) mod p to make the recurrence end on the right answer. Five cells are shipped and routed by prime size: a 16-bit cell (MLP, width 4096 depth 4, ~50M params) for p < 2^16 covering tiers 1-3, a 32-bit cell (MLP, width 6144 depth 4, ~114M params) for p < 2^32 covering tier 4, a 64-bit cell (MLP, width 4096 depth 7 with pre-norm residual blocks, ~236M params) for p < 2^64 covering tier 5, a 128-bit cell for p < 2^128 covering tier 6 that is a CARRY-AWARE TCN: a non-causal dilated 1D-convolutional network over the 128 bit-positions (10 residual blocks, 256 channels, dilations cycling 1..64 so the receptive field spans all 128 bits, ~3.9M params), and a 256-bit cell for p < 2^256 covering tier 7 that uses the SAME carry-aware TCN architecture scaled to 256 bit-positions (12 residual blocks, 256 channels, dilations cycling 1..128 so the receptive field spans all 256 bits, ~4.7M params), lifting tier 7 to 0.98. The convolution is weight-shared across bit positions, so it learns ONE carry/borrow rule applied everywhere (non-causally, so the addition carry can flow LSB->MSB and the mod-p compare/borrow MSB->LSB) instead of a full-width MLP learning a separate position-function per bit; this inductive bias drives the per-step error far below what an MLP cell reaches and is what makes the 128- and 256-bit chains (which compound the per-step error over 128 and 256 steps) accurate. Final state bits are emitted MSB-first as the base-2 answer. For p >= 2^256 emits the honest [0] fallback without invoking the network.",
6
+ "training_description": "Each transition cell trained from random init on (t, bit, b, p) -> (2t + bit*b) mod p single-step examples over its prime range (16-bit: all primes < 2^16; 32-bit and 64-bit: random primes sampled uniform-by-value in [2^16, 2^32) and [2^33, 2^64) to match the test generator's randrange+nextprime distribution), with half of each batch mined near the comparison boundary (2t + bit*b within +/-2 of a multiple of p) where errors concentrate. BCE per state bit, AdamW + cosine decay + gradient clipping + LR warmup, EMA weights checkpointed by full-chain validation accuracy on a held-out 10% of primes never seen in training — val accuracy tracks train accuracy, i.e. the cells generalise across primes rather than memorising them. The 64-bit cell additionally receives a second fine-tuning phase on single steps drawn from the TRUE Horner trajectory (each example is a (t, bit, b, p) -> (2t + bit*b) mod p step where t is an actual chain intermediate (a_{>=i}*b) mod p, not a uniform sample), which matches the training distribution to the states the chain visits at inference and lifts tier 5 from 0.74 to 0.98; still ordinary supervised BCE on the same single-step target, no backprop through the recurrence. The 128-bit (tier-6) cell is the carry-aware TCN, trained the same way — single-step BCE on TRUE Horner-trajectory states (t, bit, b, p) -> (2t + bit*b) mod p — from random init over a high-diversity pool of thousands of distinct 124-128 bit primes (so it generalises across primes rather than memorising the conditional subtraction for a few). Its weight-shared dilated-convolution inductive bias reaches a per-step error roughly 15x lower than the same-task MLP cell, giving 0.97 full-chain accuracy on held-out 124-128 bit primes; same supervised single-step objective, no backprop through the recurrence, AdamW + cosine decay + grad clip + EMA checkpointed by held-out full-chain accuracy. 
The 256-bit (tier-7) cell is the same carry-aware TCN scaled to 256 bit-positions (dilations cycling 1..128), trained identically — single-step BCE on TRUE Horner-trajectory states over a high-diversity pool of distinct 252-256 bit primes — reaching a per-step error low enough that the 256-step chain holds at 0.98 full-chain accuracy on held-out 252-256 bit primes. Weight-perturbation compliance (exploration/compliance_perturb.py): each cell's accuracy at sigma=0 collapses toward the floor as the weights are perturbed and an untrained re-init scores 0.00 (e.g. tier 6: 0.97 -> 0.19 at sigma=0.25 -> 0.02 at sigma=0.5; tier 7: 0.98 -> 0.06 at sigma=0.25 -> 0.02 at sigma=0.5; untrained 0.00 for both), so the arithmetic resides in the trained parameters. Training scripts: train.py (16-bit), exploration/train_horner32.py (32-bit), exploration/train_horner64.py (64-bit phase 1, --residual) then exploration/train_horner64_traj.py (64-bit phase 2, trajectory), exploration/train_horner128_bigru.py --arch tcn (128-bit carry-aware TCN), exploration/train_horner_tcn.py --bits 256 (256-bit carry-aware TCN)."
7
  }
model.py CHANGED
@@ -5,8 +5,8 @@ one per step, conditioned on ``(b mod p, p)`` in binary. The hidden state is a
5
  quantized bit vector (a discrete bottleneck — a hard VQ layer with a fixed
6
  binary codebook), and the transition function — an MLP for the 16/32/64-bit
7
  cells, a weight-shared carry-aware dilated-conv TCN (TCNHornerCell) for the
8
- 128-bit cell — is entirely trained parameters. After the last bit, the hidden
9
- state bits ARE the answer, emitted MSB-first in base 2.
10
 
11
  Why this is interesting: for the recurrence to end on the right answer, the
12
  trained cell must *learn* the map ``(t, bit, b, p) -> (2t + bit*b) mod p`` —
@@ -24,16 +24,16 @@ line is respected here:
24
  random weights the output is noise (Principle 2), and the emitted digits are
25
  exactly the model's final hidden state (Principle 1).
26
  - learned (all of the actual arithmetic): the transition function. Nothing in
27
- the code adds, multiplies, compares against p, or carries; the cell's MLP
28
- weights had to learn all of that from data.
29
 
30
  The two-operand reductions ``a mod p`` / ``b mod p`` in ``predict_digits`` are
31
  the same legal input normalisation every other reference model uses.
32
 
33
  The model ships one cell per bit-width (16 -> tiers 1-3, 32 -> tier 4, 64 ->
34
- tier 5, and 128 -> tier 6 when present) and routes each problem to the narrowest
35
- cell whose state holds the prime. For primes wider than the widest trained cell
36
- it emits the honest ``[0]`` fallback without invoking the network.
37
  """
38
 
39
  from __future__ import annotations
@@ -48,7 +48,7 @@ from modchallenge.interface.base_model import ModularMultiplicationModel
48
 
49
  # Bit-widths we may ship a cell for, narrowest first. load() picks up whichever
50
  # weights{W}.pt files are actually present, so adding a wider cell is drop-in.
51
- CELL_WIDTHS = (16, 32, 64, 128)
52
 
53
  # Default state width for the 16-bit trainer (train.py imports this).
54
  BITS = 16
 
5
  quantized bit vector (a discrete bottleneck — a hard VQ layer with a fixed
6
  binary codebook), and the transition function — an MLP for the 16/32/64-bit
7
  cells, a weight-shared carry-aware dilated-conv TCN (TCNHornerCell) for the
8
+ 128- and 256-bit cells — is entirely trained parameters. After the last bit,
9
+ the hidden state bits ARE the answer, emitted MSB-first in base 2.
10
 
11
  Why this is interesting: for the recurrence to end on the right answer, the
12
  trained cell must *learn* the map ``(t, bit, b, p) -> (2t + bit*b) mod p`` —
 
24
  random weights the output is noise (Principle 2), and the emitted digits are
25
  exactly the model's final hidden state (Principle 1).
26
  - learned (all of the actual arithmetic): the transition function. Nothing in
27
+ the code adds, multiplies, compares against p, or carries; the cell's trained
28
+ weights (MLP or carry-aware TCN) had to learn all of that from data.
29
 
30
  The two-operand reductions ``a mod p`` / ``b mod p`` in ``predict_digits`` are
31
  the same legal input normalisation every other reference model uses.
32
 
33
  The model ships one cell per bit-width (16 -> tiers 1-3, 32 -> tier 4, 64 ->
34
+ tier 5, 128 -> tier 6, and 256 -> tier 7 when present) and routes each problem to
35
+ the narrowest cell whose state holds the prime. For primes wider than the widest
36
+ trained cell it emits the honest ``[0]`` fallback without invoking the network.
37
  """
38
 
39
  from __future__ import annotations
 
48
 
49
  # Bit-widths we may ship a cell for, narrowest first. load() picks up whichever
50
  # weights{W}.pt files are actually present, so adding a wider cell is drop-in.
51
+ CELL_WIDTHS = (16, 32, 64, 128, 256)
52
 
53
  # Default state width for the 16-bit trainer (train.py imports this).
54
  BITS = 16
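
The routing rule described in the docstring (narrowest cell whose state holds the prime, `[0]` fallback otherwise) can be sketched as follows. The helper name `route_width` is hypothetical and the body is an illustration of the rule, not the code in `model.py`, which this diff does not show.

```python
CELL_WIDTHS = (16, 32, 64, 128, 256)


def route_width(p, available=CELL_WIDTHS):
    """Return the narrowest available width W with p < 2**W, or None if no cell holds p."""
    for w in sorted(available):
        if p < (1 << w):
            return w
    return None  # caller would emit the [0] fallback without invoking a network


print(route_width(65521))            # fits the 16-bit cell
print(route_width(4294967291))       # fits the 32-bit cell
print(route_width(2**255 - 19))      # needs the 256-bit cell
print(route_width(2**256 + 297))     # wider than every cell: None
```

Passing only the widths whose `weights{W}.pt` files were found as `available` would mirror the drop-in behaviour the comment above `CELL_WIDTHS` describes.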