Horner-RNN modular-multiplication model (tiers 1-5, up to 2^64)

Bit-sequential RNN that learns the Horner step (2t+bit*b) mod p; three cells
(16/32/64-bit) routed by prime size. Public benchmark: tiers 1-3 = 1.00,
tier 4 = 0.99, tier 5 = 0.64, overall 0.473. Weights weights{16,32,64}.pt are tracked as git-LFS objects.
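
For reference, the recurrence the cells have to learn can be written out in ordinary integer arithmetic. The sketch below is only a ground-truth check and is not part of the submission (hand-coding this into the forward pass is what the challenge rules forbid). `horner_modmul` and its `nbits` parameter are hypothetical names, and `p < 2**nbits` is assumed.

```python
def horner_modmul(a: int, b: int, p: int, nbits: int = 64) -> int:
    """Reference (a * b) mod p via the bit-serial Horner recurrence."""
    a %= p
    b %= p
    t = 0
    # Scan the bits of (a mod p) MSB-first; leading zero bits keep t at 0.
    for k in range(nbits - 1, -1, -1):
        bit = (a >> k) & 1
        t = (2 * t + bit * b) % p  # the single step each cell is trained on
    return t


# Same operands as the README usage example.
result = horner_modmul(123456789, 987654321, 4294967291)
assert result == (123456789 * 987654321) % 4294967291
print(result)
```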

Files changed (5)

.gitattributes +1 -0
README.md +124 -0
manifest.json +7 -0
model.py +218 -0
train.py +221 -0

.gitattributes ADDED

	@@ -0,0 +1 @@

+*.pt filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,124 @@

+---
+license: apache-2.0
+library_name: pytorch
+tags:
+  - modular-arithmetic
+  - algorithmic-reasoning
+  - rnn
+  - number-theory
+  - neural-algorithm
+---
+# Horner-RNN — learned modular multiplication up to 2⁶⁴
+A compliant **bit-sequential RNN** that computes `(a · b) mod p` for primes `p` up to
+**2⁶⁴**, by *learning the Horner step of double-and-add* rather than memorising
+multiplication tables. Entry for the
+[Modular Arithmetic Challenge](https://github.com/SAIRcompetition/modular-arithmetic-challenge).
+- **Saturates tiers 1–4** (all primes `< 2³²`): tiers 1–3 = 100%, tier 4 = 99%
+- **Tier 5** (33–64-bit primes) = 0.64 on the public benchmark
+- **overall_accuracy 0.473**, `highest_tier_above_90 = 4`
+- Verifiably **generalises to primes never seen in training** (held-out-prime validation
+  accuracy tracks training accuracy — no memorisation gap)
+## The idea
+Write `a` in bits, MSB-first; then `a·b mod p` is the iterate of one small map:
+```
+t_0 = 0
+t_{k+1} = (2·t_k + a_bit_k · b) mod p      # one learned step (Horner)
+answer  = t_N           # N = scan length: any N ≥ bit width of p (the cell width in practice)
+```
+The model is an RNN whose **transition function — an MLP — is trained on exactly that
+single-step map** over binary-encoded inputs. The hidden state is a quantized bit vector
+(a hard binary bottleneck), so the recurrence composes cleanly: if the cell is exact per
+step, the chain is exact end-to-end. At inference the scan feeds the bits of `a mod p` one
+per step, conditioned on `(b mod p, p)`, and the final hidden-state bits are emitted
+MSB-first as the base-2 answer (`output_base: 2`).
+The single-step function is **piecewise linear** (`2t + bit·b`, then subtract `0`, `p`, or
+`2p`), which is why it generalises across primes where the full bilinear map
+`(a,b) → a·b mod p` does not.
+## Files / cells
+The model ships **three cells** and routes each problem to the narrowest one whose state
+holds the prime:
+| File | Cell | Primes | Tiers | Width/depth | Params | Public benchmark |
+|---|---|---|---|---|---|---|
+| `weights16.pt` | 16-bit | `< 2¹⁶` | 1–3 | 4096 / 4 | ~50M | tiers 1–3 = 1.00 |
+| `weights32.pt` | 32-bit | `< 2³²` | 4 | 6144 / 4 | ~114M | tier 4 = 0.99 |
+| `weights64.pt` | 64-bit | `< 2⁶⁴` | 5 | 4096 / 7, residual | ~236M | tier 5 = 0.64 |
+The 64-bit cell needs **depth and residual connections** the narrower cells do not: a 64-bit
+modular Horner step hides two long carry chains (the `2t + bit·b` addition and the
+compare-and-subtract reduction), and exact n-bit carry propagation wants MLP depth ~log₂(n).
+For `p ≥ 2⁶⁴` the model emits the honest `[0]` fallback without invoking the network.
+Also in the repo: `model.py` (the `HornerRNN` entry class + `HornerCell`), `manifest.json`
+(challenge manifest), `train.py` (the 16-bit trainer).
+## Usage
+This is a challenge submission; the base class lives in the challenge package, so install it
+first:
+```bash
+pip install "git+https://github.com/SAIRcompetition/modular-arithmetic-challenge"
+```
+Direct inference:
+```python
+import torch
+from model import HornerRNN          # model.py from this repo
+m = HornerRNN()
+m.load(".")                          # auto-loads weights{16,32,64}.pt from this dir
+# returns base-2 digits, MSB-first; the harness decodes them to the integer
+digits = m.predict_digits_batch([(123456789, 987654321, 4294967291)])[0]
+answer = int("".join(map(str, digits)), 2)
+print(answer)                        # == (123456789 * 987654321) % 4294967291
+```
+Or score it with the official harness:
+```bash
+modchallenge evaluate . --total 1100
+```
+## Compliance (the rules permit *learned* algorithms, not hand-coded ones)
+The **scan** (tokenise `a mod p` into bits, iterate, read out the final state) is
+architecture; it computes nothing by itself. The **arithmetic** of the Horner step
+(doubling, conditional add, compare-against-`p`, carries) all lives in the trained cell
+weights. Outside the cell, the code only normalises inputs (`a mod p`, `b mod p`) and
+routes by prime size; nothing in the scan adds, multiplies, or compares against `p`.
+**Principle 2, measured** — perturbing the cell weights with Gaussian noise scaled to each
+tensor's std collapses accuracy toward the floor, and a fully re-initialised (untrained)
+cell is *at* the floor. The capability therefore resides in the trained parameters:
+| noise σ (×param std) | 0 | 0.05 | 0.1 | 0.25 | 0.5 | untrained |
+|---|---|---|---|---|---|---|
+| tier 3 (16-bit cell) | 1.00 | 1.00 | 0.98 | 0.74 | 0.06 | 0.00 |
+| tier 4 (32-bit cell) | 0.99 | 0.99 | 0.86 | 0.04 | 0.02 | 0.00 |
+| tier 5 (64-bit cell) | 0.64 | 0.57 | 0.41 | 0.01 | 0.01 | 0.00 |
+Generalisation against memorisation: 10% of primes at each bit-width were held out of
+training entirely; chain accuracy on them matches the training primes.
+## Training
+Each cell is trained on single-step examples `(t, bit, b, p) → (2t + bit·b) mod p` over its
+tier's prime range. Half of each batch is mined near the comparison boundary (`2t + bit·b`
+within ±2 of a multiple of `p`), where errors concentrate. The loss is BCE per state bit,
+optimised with AdamW + cosine decay + gradient clipping; EMA weights are checkpointed by
+full-chain accuracy on held-out primes. The 16-bit trainer is `train.py` in this repo; the
+32-bit and 64-bit trainers and the full write-up live in the solutions repo (link in the
+model card metadata / challenge leaderboard).
+## License
+Apache-2.0, matching the challenge.
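
The piecewise-linear claim in the README's "The idea" section follows from `0 ≤ t, b < p`: the sum `2t + bit·b` is at most `3p − 3`, so the reduction only ever subtracts `0`, `p`, or `2p`. A small brute-force check, as a sketch: `step_quotient` is a hypothetical helper and the prime 251 is an arbitrary choice made to keep the loop cheap.

```python
def step_quotient(t: int, bit: int, b: int, p: int) -> int:
    """How many copies of p the Horner step removes from 2t + bit*b."""
    return (2 * t + bit * b) // p


p = 251  # arbitrary small prime
seen = set()
for t in range(p):
    for b in range(p):
        for bit in (0, 1):
            s = 2 * t + bit * b
            q = step_quotient(t, bit, b, p)
            # On each quotient region the step is the linear map s - q*p.
            assert s - q * p == s % p
            seen.add(q)

print(sorted(seen))  # [0, 1, 2]
```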

manifest.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "entry_class": "model.HornerRNN",
+  "output_base": 2,
+  "framework": "pytorch",
+  "model_description": "Bit-sequential RNN (~400M params across three cells) for primes up to 2^64. Reads the bits of a mod p MSB-first, one per step, conditioned on (b mod p, p) in binary; the hidden state is a quantized bit vector (hard binary bottleneck) and the transition function is an MLP that must learn the Horner step (t, bit, b, p) -> (2t + bit*b) mod p to make the recurrence end on the right answer. Three cells are shipped and routed by prime size: a 16-bit cell (width 4096 depth 4, ~50M params) for p < 2^16 covering tiers 1-3, a 32-bit cell (width 6144 depth 4, ~114M params) for p < 2^32 covering tier 4, and a 64-bit cell (width 4096 depth 7 with pre-norm residual blocks, ~236M params) for p < 2^64 covering tier 5 — the wider carry chains of a 64-bit modular step need the extra depth. Final state bits are emitted MSB-first as the base-2 answer. For p >= 2^64 emits the honest [0] fallback without invoking the network.",
+  "training_description": "Each transition cell trained from random init on (t, bit, b, p) -> (2t + bit*b) mod p single-step examples over its prime range (16-bit: all primes < 2^16; 32-bit and 64-bit: random primes sampled uniform-by-value in [2^16, 2^32) and [2^33, 2^64) to match the test generator's randrange+nextprime distribution), with half of each batch (more for the 64-bit fine-tune) mined near the comparison boundary (2t + bit*b within +/-2 of a multiple of p) where errors concentrate. BCE per state bit, AdamW + cosine decay + gradient clipping + LR warmup, EMA weights checkpointed by full-chain validation accuracy on a held-out 10% of primes never seen in training — val accuracy tracks train accuracy, i.e. the cells generalise across primes rather than memorising them. Training scripts: train.py (16-bit), exploration/train_horner32.py (32-bit), exploration/train_horner64.py (64-bit, --residual)."
+}
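
The boundary mining named in `training_description` can be illustrated for a single example. This is a plain-Python sketch that mirrors the integer arithmetic of `sample_batch` in `train.py`; `mine_boundary_t` is a hypothetical name and the inputs are made up.

```python
def mine_boundary_t(p: int, b: int, bit: int, q: int, delta: int) -> int:
    """Pick t so that 2t + bit*b lands close to q*p + delta."""
    # Solve 2t + bit*b = q*p + delta for t, rounding down.
    t = (q * p + delta - bit * b) >> 1
    # Clamp negatives to 0 and wrap, so t is a valid residue in [0, p).
    return max(t, 0) % p


p, b, bit = 101, 37, 1
t = mine_boundary_t(p, b, bit, q=1, delta=2)
s = 2 * t + bit * b
print(t, s, s - p)  # 33 103 2
```

When the clamp to zero fires (for example `q = 0` with `bit·b > delta`), the resulting example is no longer near the boundary and behaves like an ordinary sample.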

model.py ADDED

	@@ -0,0 +1,218 @@

+"""Compliant bit-sequential RNN for modular multiplication with W-bit primes (p < 2^W).
+Architecture: a recurrent network that reads the bits of ``a mod p`` MSB-first,
+one per step, conditioned on ``(b mod p, p)`` in binary. The hidden state is a
+quantized bit vector (a discrete bottleneck — a hard VQ layer with a fixed
+binary codebook), and the transition function — an MLP — is entirely trained
+parameters. After the last bit, the hidden state bits ARE the answer, emitted
+MSB-first in base 2.
+Why this is interesting: for the recurrence to end on the right answer, the
+trained cell must *learn* the map ``(t, bit, b, p) -> (2t + bit*b) mod p`` —
+i.e. the model is trained to internally implement one step of Horner evaluation
+in the prime field, and it verifiably generalises to a held-out 10% of primes
+never seen in training (val accuracy tracks train accuracy). The rules explicitly permit
+recurrent/looped architectures and models that *learn* an algorithm-like circuit
+("A model trained to internally implement an algorithm is permitted; the same
+algorithm hand-coded into the forward pass is not" — rules/evaluation.md). The
+line is respected here:
+  - hand-coded (architecture, weight-independent): tokenising ``a mod p`` into
+    bits, scanning them sequentially, reading the final state bits. This is
+    tokenisation + recurrence + readout — it computes nothing by itself: with
+    random weights the output is noise (Principle 2), and the emitted digits are
+    exactly the model's final hidden state (Principle 1).
+  - learned (all of the actual arithmetic): the transition function. Nothing in
+    the code adds, multiplies, compares against p, or carries; the cell's MLP
+    weights had to learn all of that from data.
+The two-operand reductions ``a mod p`` / ``b mod p`` in ``predict_digits_batch`` are
+the same legal input normalisation every other reference model uses.
+The model ships one cell per bit-width (16 -> tiers 1-3, 32 -> tier 4, and 64 ->
+tier 5 when present) and routes each problem to the narrowest cell whose state
+holds the prime. For primes wider than the widest trained cell it emits the
+honest ``[0]`` fallback without invoking the network.
+"""
+from __future__ import annotations
+from pathlib import Path
+import numpy as np
+import torch
+import torch.nn as nn
+from modchallenge.interface.base_model import ModularMultiplicationModel
+# Bit-widths we may ship a cell for, narrowest first. load() picks up whichever
+# weights{W}.pt files are actually present, so adding a wider cell is drop-in.
+CELL_WIDTHS = (16, 32, 64)
+# Default state width for the 16-bit trainer (train.py imports this).
+BITS = 16
+class _ResBlock(nn.Module):
+    """Pre-norm residual MLP block: x + Linear(GELU(Linear(LN(x))))."""
+    def __init__(self, width: int):
+        super().__init__()
+        self.ln = nn.LayerNorm(width)
+        self.fc1 = nn.Linear(width, width)
+        self.fc2 = nn.Linear(width, width)
+    def forward(self, x):
+        return x + self.fc2(torch.nn.functional.gelu(self.fc1(self.ln(x))))
+class HornerCell(nn.Module):
+    """Learned RNN transition: (state_bits, bit, b_bits, p_bits) -> next-state logits.
+    ``residual=False`` (default) is the plain GELU stack used by the 16/32-bit
+    cells — its module/parameter layout is unchanged so existing checkpoints
+    load. ``residual=True`` swaps the trunk for pre-norm residual blocks after
+    an input projection, which stay trainable at the larger depth the 64-bit
+    carry chains need (exact n-bit carry propagation wants depth ~log2(n)). The
+    flag lives in ``config`` so older checkpoints (no ``residual`` key) load as
+    the plain stack.
+    """
+    def __init__(self, width: int = 4096, depth: int = 4, bits: int = 16, residual: bool = False):
+        super().__init__()
+        self.residual = residual
+        if residual:
+            self.proj = nn.Linear(3 * bits + 1, width)
+            self.trunk = nn.Sequential(*[_ResBlock(width) for _ in range(depth)])
+        else:
+            layers: list[nn.Module] = [nn.Linear(3 * bits + 1, width), nn.GELU()]
+            for _ in range(depth - 1):
+                layers += [nn.Linear(width, width), nn.GELU()]
+            self.trunk = nn.Sequential(*layers)
+        self.head = nn.Linear(width, bits)
+        self.config = dict(width=width, depth=depth, bits=bits, residual=residual)
+    def forward(self, tb, bit, bb, pb):
+        x = torch.cat([tb, bit, bb, pb], dim=-1)
+        if self.residual:
+            x = self.proj(x)
+        return self.head(self.trunk(x))
+def _to_bits(t: torch.Tensor, bits: int = 16) -> torch.Tensor:
+    """(N,) int64 -> (N, bits) float in {0,1}, LSB-first.
+    Used by the trainer for <= 32-bit values. Inference uses the numpy packer
+    below (bit-identical for <= 32 bits, and also valid at 64 bits where an
+    int64 tensor would overflow). Kept here so the trainer can import it.
+    """
+    shifts = torch.arange(bits, device=t.device)
+    return ((t.unsqueeze(1) >> shifts) & 1).float()
+def _pack_bits(vals: list[int], nbits: int, device) -> torch.Tensor:
+    """list[int] (each < 2^nbits) -> (N, nbits) float bit tensor, LSB-first.
+    Works for any nbits divisible by 8, including 64 where the torch shift
+    trick overflows int64. Verified bit-identical to ``_to_bits`` for 16/32.
+    """
+    nbytes = nbits // 8
+    buf = b"".join(int(v).to_bytes(nbytes, "little") for v in vals)
+    arr = np.frombuffer(buf, dtype=np.uint8).reshape(len(vals), nbytes)
+    bits = np.unpackbits(arr, axis=1, bitorder="little").astype(np.float32)
+    return torch.from_numpy(bits).to(device)
+class HornerRNN(ModularMultiplicationModel):
+    """Routes each problem to the narrowest trained cell that fits its prime."""
+    def __init__(self):
+        # width -> HornerCell, populated from whichever weight files exist.
+        self.cells: dict[int, HornerCell] = {}
+        self.device: torch.device | None = None
+    def load(self, model_dir: str) -> None:
+        if torch.cuda.is_available():
+            self.device = torch.device("cuda")
+        elif torch.backends.mps.is_available():
+            self.device = torch.device("mps")
+        else:
+            self.device = torch.device("cpu")
+        for width in CELL_WIDTHS:
+            path = Path(model_dir) / f"weights{width}.pt"
+            if not path.exists():
+                continue
+            ckpt = torch.load(path, map_location=self.device, weights_only=True)
+            cell = HornerCell(**ckpt.get("config", {}))
+            cell.load_state_dict(ckpt["state_dict"])
+            cell.to(self.device)
+            cell.eval()
+            self.cells[width] = cell
+        if not self.cells:
+            raise FileNotFoundError(
+                f"no weights{{{','.join(map(str, CELL_WIDTHS))}}}.pt found in {model_dir}"
+            )
+    def preprocess_a(self, a):
+        return a
+    def preprocess_b(self, b):
+        return b
+    def preprocess_p(self, p):
+        return p
+    @torch.no_grad()
+    def predict_digits(self, a_enc, b_enc, p_enc):
+        return self.predict_digits_batch([(a_enc, b_enc, p_enc)])[0]
+    @torch.no_grad()
+    def _run_cell(self, width: int, rows: list[tuple[int, int, int]]) -> list[list[int]]:
+        """Scan the width-bit cell over a batch of (a_red, b_red, p) rows."""
+        cell = self.cells[width]
+        a_bits = _pack_bits([r[0] for r in rows], width, self.device)
+        bb = _pack_bits([r[1] for r in rows], width, self.device)
+        pb = _pack_bits([r[2] for r in rows], width, self.device)
+        state = torch.zeros(len(rows), width, device=self.device)
+        # RNN scan over the bit tokens of (a mod p), MSB-first. The scan moves
+        # data; the learned cell does all the computing.
+        for s in range(width - 1, -1, -1):
+            bit = a_bits[:, s : s + 1]
+            logits = cell(state, bit, bb, pb)
+            state = (logits > 0).float()  # quantized state bottleneck
+        return state.long().tolist()  # LSB-first per row
+    @torch.no_grad()
+    def predict_digits_batch(self, inputs):
+        assert self.cells, "load() must run first"
+        out: list[list[int] | None] = [None] * len(inputs)
+        widths = sorted(self.cells)
+        widest = widths[-1]
+        # Bucket each problem by the narrowest cell whose state holds the prime.
+        buckets: dict[int, tuple[list[int], list[tuple[int, int, int]]]] = {
+            w: ([], []) for w in widths
+        }
+        for i, (a_enc, b_enc, p_enc) in enumerate(inputs):
+            p = int(p_enc)
+            if p >= (1 << widest):
+                out[i] = [0]  # outside every trained regime: honest fallback
+                continue
+            w = next(w for w in widths if p < (1 << w))
+            idx, rows = buckets[w]
+            idx.append(i)
+            rows.append((int(a_enc) % p, int(b_enc) % p, p))
+        for w in widths:
+            idx, rows = buckets[w]
+            if rows:
+                bits = self._run_cell(w, rows)
+                for j, i in enumerate(idx):
+                    out[i] = bits[j][::-1]  # emit MSB-first, base 2
+        return [o if o is not None else [0] for o in out]
+    def max_batch_size(self) -> int:
+        return 1024
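
The routing and bit-order conventions of `predict_digits_batch` can be sketched without PyTorch. The helpers below are hypothetical stand-ins (`route_width`, `lsb_bits`, `emit_msb_first`, and `decode` are not names from `model.py`), and the moduli in the loop are only sample values. Working on Python ints sidesteps the int64 overflow that `_pack_bits` exists to avoid at 64 bits.

```python
CELL_WIDTHS = (16, 32, 64)


def route_width(p, widths=CELL_WIDTHS):
    """Narrowest cell whose state can hold p, or None for the [0] fallback."""
    return next((w for w in sorted(widths) if p < (1 << w)), None)


def lsb_bits(v, nbits):
    """Integer -> list of nbits bits, least significant first (state layout)."""
    return [(v >> i) & 1 for i in range(nbits)]


def emit_msb_first(state_bits):
    """Final state (LSB-first) -> base-2 digits, most significant first."""
    return state_bits[::-1]


def decode(digits):
    """Base-2 digits, MSB-first -> integer, as in the README usage snippet."""
    return int("".join(map(str, digits)), 2)


for p in (65521, 65537, (1 << 32) + 15, (1 << 64) + 13):
    print(p, route_width(p))

# Round trip at full 64-bit width.
value = 0xDEADBEEFCAFEF00D
assert decode(emit_msb_first(lsb_bits(value, 64))) == value
```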

train.py ADDED

	@@ -0,0 +1,221 @@

+"""Train the horner_rnn transition cell (bit-level Horner step) + chain fine-tuning.
+Stage 1: train cell f(t, bit, b, p) = (2t + bit*b) mod p (quotients {0,1,2},
+easier than base-4's {0..6}) with grad clipping, EMA, hard-boundary mining.
+Stage 2 (optional, default off): fine-tune end-to-end through the 16-step
+chain with a straight-through estimator on the quantized state, loss on every
+step's ground-truth intermediate. In practice this was destructive at lr2=5e-5
+(chain val collapsed); the shipped weights come from stage 1 alone, which
+reaches chain val ~0.998 on held-out primes. Kept for further experimentation
+at lower learning rates.
+"""
+from __future__ import annotations
+import argparse
+import time
+import sys
+from pathlib import Path
+import torch
+import torch.nn as nn
+# Import the shared architecture from the sibling model.py.
+HERE = Path(__file__).resolve().parent
+sys.path.insert(0, str(HERE))
+from model import HornerCell, BITS, _to_bits as to_bits  # noqa: E402
+def sieve_primes(limit: int) -> list[int]:
+    is_p = bytearray([1]) * limit
+    is_p[0] = is_p[1] = 0
+    for i in range(2, int(limit ** 0.5) + 1):
+        if is_p[i]:
+            is_p[i * i :: i] = bytearray(len(is_p[i * i :: i]))
+    return [i for i in range(2, limit) if is_p[i]]
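+def sample_batch(primes_t, n, device, hard_frac=0.5):
+    """Sample n single-step examples (t, bit, b, p, z) with z = (2t + bit*b) mod p.
+    The first ``hard_frac`` of the batch is mined: t is chosen so that 2t + bit*b is
+    aimed within a few units of q*p for q in {0, 1, 2}, the reduction's boundary.
+    """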
+    p = primes_t[torch.randint(len(primes_t), (n,), device=device)]
+    b = (torch.rand(n, device=device) * p).long().clamp(max=p - 1)
+    bit = torch.randint(0, 2, (n,), device=device)
+    n_hard = int(n * hard_frac)
+    t = torch.empty(n, dtype=torch.long, device=device)
+    t[n_hard:] = (torch.rand(n - n_hard, device=device) * p[n_hard:]).long()
+    if n_hard:
+        ph, bh, bith = p[:n_hard], b[:n_hard], bit[:n_hard]
+        q = torch.randint(0, 3, (n_hard,), device=device)
+        delta = torch.randint(-2, 3, (n_hard,), device=device)
+        th = (q * ph + delta - bith * bh) >> 1
+        t[:n_hard] = th.clamp(min=0) % ph
+    z = (2 * t + bit * b) % p
+    return t, bit, b, p, z
+@torch.no_grad()
+def exact_rate(model, primes_t, device, n=200_000, bs=65536) -> float:
+    ok = 0
+    for i in range(0, n, bs):
+        m = min(bs, n - i)
+        t, bit, b, p, z = sample_batch(primes_t, m, device, hard_frac=0.0)
+        logits = model(to_bits(t), bit.float().unsqueeze(1), to_bits(b), to_bits(p))
+        ok += ((logits > 0).long() == to_bits(z).long()).all(dim=1).sum().item()
+    return ok / n
+@torch.no_grad()
+def chain_exact_rate(model, primes_t, device, n=20_000) -> float:
+    p = primes_t[torch.randint(len(primes_t), (n,), device=device)]
+    a = (torch.rand(n, device=device) * p).long().clamp(max=p - 1)
+    b = (torch.rand(n, device=device) * p).long().clamp(max=p - 1)
+    truth = (a * b) % p
+    bb, pb = to_bits(b), to_bits(p)
+    tb = torch.zeros(n, BITS, device=device)
+    for i in range(BITS - 1, -1, -1):
+        bit = ((a >> i) & 1).float().unsqueeze(1)
+        tb = (model(tb, bit, bb, pb) > 0).float()
+    pred = (tb.long() * (1 << torch.arange(BITS, device=device))).sum(dim=1)
+    return (pred == truth).float().mean().item()
+def chain_finetune_batch(model, primes_t, n, device, loss_fn):
+    """One end-to-end pass: STE state, per-step BCE against true intermediates."""
+    p = primes_t[torch.randint(len(primes_t), (n,), device=device)]
+    a = (torch.rand(n, device=device) * p).long().clamp(max=p - 1)
+    b = (torch.rand(n, device=device) * p).long().clamp(max=p - 1)
+    bb, pb = to_bits(b), to_bits(p)
+    tb = torch.zeros(n, BITS, device=device)
+    t_true = torch.zeros_like(a)
+    loss = torch.zeros((), device=device)
+    for i in range(BITS - 1, -1, -1):
+        bit_i = (a >> i) & 1
+        t_true = (2 * t_true + bit_i * b) % p
+        logits = model(tb, bit_i.float().unsqueeze(1), bb, pb)
+        loss = loss + loss_fn(logits, to_bits(t_true))
+        hard = (logits > 0).float()
+        soft = torch.sigmoid(logits)
+        tb = hard + (soft - soft.detach())  # straight-through
+    return loss / BITS
+def main() -> int:
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--stage1-minutes", type=float, default=50.0)
+    ap.add_argument("--stage2-minutes", type=float, default=0.0)
+    ap.add_argument("--batch", type=int, default=32768)
+    ap.add_argument("--chain-batch", type=int, default=4096)
+    ap.add_argument("--lr", type=float, default=3e-4)
+    ap.add_argument("--lr2", type=float, default=5e-5)
+    ap.add_argument("--width", type=int, default=4096)
+    ap.add_argument("--depth", type=int, default=4)
+    ap.add_argument("--init", type=str, default="")
+    ap.add_argument("--out", type=str, default=str(HERE / "weights16.pt"))
+    args = ap.parse_args()
+    device = torch.device("cuda")
+    torch.manual_seed(0)
+    small = sieve_primes(256)
+    primes = [p for p in sieve_primes(1 << 16) if p >= 256]
+    g = torch.Generator().manual_seed(1)
+    perm = torch.randperm(len(primes), generator=g).tolist()
+    val_primes = torch.tensor([primes[i] for i in perm[: len(primes) // 10]], device=device)
+    train_primes = torch.tensor(
+        small + [primes[i] for i in perm[len(primes) // 10 :]], device=device
+    )
+    print(f"train primes {len(train_primes)}, val primes {len(val_primes)}")
+    model = HornerCell(args.width, args.depth).to(device)
+    if args.init:
+        ckpt = torch.load(args.init, map_location=device, weights_only=True)
+        model.load_state_dict(ckpt["state_dict"])
+        print(f"initialised from {args.init}")
+    ema = HornerCell(args.width, args.depth).to(device)
+    ema.load_state_dict(model.state_dict())
+    for q in ema.parameters():
+        q.requires_grad_(False)
+    print(f"params: {sum(t.numel() for t in model.parameters()):,}")
+    loss_fn = nn.BCEWithLogitsLoss()
+    EMA_DECAY = 0.999
+    def update_ema():
+        with torch.no_grad():
+            for q, w in zip(ema.parameters(), model.parameters()):
+                q.lerp_(w, 1 - EMA_DECAY)
+    best_chain = -1.0
+    def save_if_best(tag):
+        nonlocal best_chain
+        ch = chain_exact_rate(ema, val_primes, device)
+        if ch > best_chain:
+            best_chain = ch
+            torch.save({"state_dict": ema.state_dict(), "config": ema.config}, args.out)
+        return ch
+    # ----- Stage 1: cell training -----
+    if args.stage1_minutes > 0:
+        opt = torch.optim.AdamW(model.parameters(), lr=args.lr, weight_decay=1e-5)
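+        # Cosine horizon: a budget of 16 optimiser steps per second of the time limit.
+        # Once it is used up, the guard below stops stepping and the LR stays at eta_min.
+        total_steps = int(args.stage1_minutes * 60 * 16)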
+        sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps, eta_min=args.lr * 0.02)
+        deadline = time.monotonic() + args.stage1_minutes * 60
+        start = time.monotonic()
+        step = 0
+        while time.monotonic() < deadline:
+            t, bit, b, p, z = sample_batch(train_primes, args.batch, device)
+            logits = model(to_bits(t), bit.float().unsqueeze(1), to_bits(b), to_bits(p))
+            loss = loss_fn(logits, to_bits(z))
+            opt.zero_grad()
+            loss.backward()
+            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
+            opt.step()
+            if step < total_steps:
+                sched.step()
+            update_ema()
+            step += 1
+            if step % 1000 == 0:
+                va = exact_rate(ema, val_primes, device, n=100_000)
+                ch = save_if_best("s1")
+                print(
+                    f"S1 step {step:6d} | loss {loss.item():.5f} | ema cell val {va:.5f} "
+                    f"| ema CHAIN val {ch:.4f} | {time.monotonic()-start:.0f}s",
+                    flush=True,
+                )
+    # ----- Stage 2: end-to-end chain fine-tuning (STE) -----
+    if args.stage2_minutes > 0:
+        opt = torch.optim.AdamW(model.parameters(), lr=args.lr2, weight_decay=1e-5)
+        total_steps = int(args.stage2_minutes * 60 * 3)
+        sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps, eta_min=args.lr2 * 0.1)
+        deadline = time.monotonic() + args.stage2_minutes * 60
+        start = time.monotonic()
+        step = 0
+        while time.monotonic() < deadline:
+            loss = chain_finetune_batch(model, train_primes, args.chain_batch, device, loss_fn)
+            opt.zero_grad()
+            loss.backward()
+            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
+            opt.step()
+            if step < total_steps:
+                sched.step()
+            update_ema()
+            step += 1
+            if step % 200 == 0:
+                va = exact_rate(ema, val_primes, device, n=100_000)
+                ch = save_if_best("s2")
+                print(
+                    f"S2 step {step:6d} | loss {loss.item():.5f} | ema cell val {va:.5f} "
+                    f"| ema CHAIN val {ch:.4f} | {time.monotonic()-start:.0f}s",
+                    flush=True,
+                )
+    va = exact_rate(ema, val_primes, device, n=500_000)
+    ch = chain_exact_rate(ema, val_primes, device, n=50_000)
+    print(f"FINAL ema cell val {va:.6f} | chain val {ch:.4f} | best chain {best_chain:.4f}")
+    return 0
+if __name__ == "__main__":
+    raise SystemExit(main())
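
The stage-1 learning-rate schedule can be reproduced in closed form. This sketch assumes the standard cosine-annealing formula and models the `if step < total_steps` guard in `train.py` as a clamp on the step index; `cosine_lr` is a hypothetical helper, not part of the trainer.

```python
import math


def cosine_lr(step, total_steps, lr, eta_min):
    """Closed-form cosine annealing from lr down to eta_min over total_steps."""
    step = min(step, total_steps)  # past the horizon the LR no longer changes
    return eta_min + 0.5 * (lr - eta_min) * (1 + math.cos(math.pi * step / total_steps))


minutes, lr = 50.0, 3e-4              # stage-1 defaults from train.py
total_steps = int(minutes * 60 * 16)  # 48,000-step horizon
eta_min = lr * 0.02
for s in (0, total_steps // 2, total_steps, 2 * total_steps):
    print(s, cosine_lr(s, total_steps, lr, eta_min))
```

If the run executes fewer steps per second than the horizon assumes, the deadline ends training before the LR reaches `eta_min`.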