| --- |
| license: apache-2.0 |
| language: |
| - en |
| library_name: pytorch |
| pipeline_tag: text-generation |
| tags: |
| - byte-level |
| - small-language-model |
| - routing |
| - mixture-of-experts |
| - uncertainty |
| - abstention |
| - negative-results |
| - reproducibility |
| --- |
| |
| <!-- |
| This YAML block is the Hugging Face model-card header. At push time it is prepended to the |
| repository's existing README.md (the GitHub README body is reused verbatim below it). It is kept |
| OUT of the GitHub repo so the GitHub README stays plain Markdown; only the HF mirror carries it. |
| --> |
| # Tilelli |
|
|
| > **Working with this repo through an AI agent (Cursor / Claude Code / Codex / Aider / ChatGPT)?** |
| > Read [`AGENTS.md`](AGENTS.md) first. It has the install path, the verified |
| > claims, the verified *negative* claims (so the agent doesn't repeat them as |
| > facts), and the common mistakes other agents have already made on this kit. |
|
|
| A small (~10 M-parameter) byte-level language model with a 3-pathway routed |
| block. **Trains and chats out of the box, in either FP32 or ternary mode, on |
| CPU.** Part of a family of ternary-first language models (Mosaic, atome-lm, |
| spectrum) that shares the same intent: small, local, ternary-capable, |
| auditable end-to-end. |
|
|
| This kit ships: |
|
|
| - The architecture in 8 source files (3-pathway + parent multi-pathway) |
| - **Two trained checkpoints**: FP32 chat (deployed) and plain ternary pretrain
| - A working **trainer** that takes a text corpus and a `--model` flag |
| - A ~700 KB **demo training dataset** (TinyStories slice) so `train.py` runs |
| end-to-end on CPU in a few minutes |
| - Four verification scripts that exit non-zero if our documented numbers |
| don't reproduce against the bundled v4 ckpt |
|
|
| --- |
|
|
| ## What's in `checkpoints/` |
|
|
| | File | Size | Precision | Architecture | Training | Use it for | |
| |---|---|---|---|---|---| |
| | `tilelli_chat_v4.pt` | 39 MB | **FP32** | 3-pathway Lite (d=256, L=8) | 12K-step FineWeb-Edu pretrain → chat SFT → abstain-aware SFT | Chat. Deployed at chat.tilelli.tech. SHA `9f1dcc9465003a…` |
| | `tilelli_pretrain_v1_ternary.pt` | 39 MB | **Ternary {-1, 0, +1}**, STE throughout | Parent multi-pathway (d=512, L=7) | 50K-step TinyStories pretrain | Story continuation. Base for your own ternary SFT. SHA `e1b0a263b5c2…` |
|
|
| Both are ~10 M-parameter byte-level models. They use different architectural variants
| of the same family; see [§A note on the two checkpoints](#a-note-on-the-two-checkpoints) below.
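
The table lists only truncated SHA prefixes, so a downloaded checkpoint can be checked against them with a prefix match. The sketch below is a minimal illustration and assumes the prefixes are SHA-256 digests (the table does not name the algorithm); `sha256_hex` and `matches_prefix` are hypothetical helper names, not part of the kit.

```python
import hashlib
from pathlib import Path


def sha256_hex(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_prefix(path: Path, expected_prefix: str) -> bool:
    """True if the file's digest starts with the (truncated) documented prefix."""
    return sha256_hex(path).startswith(expected_prefix.lower())


if __name__ == "__main__":
    # Prefixes copied from the table above; the full digests are elided there.
    # ASSUMPTION: they are SHA-256 prefixes.
    documented = {
        "checkpoints/tilelli_chat_v4.pt": "9f1dcc9465003a",
        "checkpoints/tilelli_pretrain_v1_ternary.pt": "e1b0a263b5c2",
    }
    for name, prefix in documented.items():
        path = Path(name)
        if not path.exists():
            print(f"{name}: not found (run from the repo root)")
            continue
        print(f"{name}: {'OK' if matches_prefix(path, prefix) else 'MISMATCH'}")
```

A mismatch would most likely mean a partial download or a different hash algorithm than the one assumed here.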
|
|
| --- |
|
|
| ## Install (CPU, ~120 MB total) |
|
|
| ```bash |
| git clone https://github.com/TilelliLab/Tilelli-llm |
| cd Tilelli-llm
| # CPU-only torch (avoids 2 GB CUDA wheel on Linux): |
| pip install --index-url https://download.pytorch.org/whl/cpu torch |
| pip install -e . |
| ``` |
|
|
| See `INSTALL.md` for macOS / Windows / GPU notes. |
|
|
| ## Chat |
|
|
| ```bash |
| # Talk to the deployed FP32 chat model: |
| python chat.py "What is the moon?" |
| # → "i can't answer that. facts like that are beyond a 10m model"
| |
| # Or use the generic inference script with either ckpt: |
| python infer.py --prompt "Hello, who are you?" |
| # → uses checkpoints/tilelli_chat_v4.pt by default
| |
| # Story continuation with the ternary pretrain: |
| python infer.py --ckpt checkpoints/tilelli_pretrain_v1_ternary.pt \ |
| --prompt "Once upon a time, there was a little" |
| # → "girl named Lily. She loved to play outside in the snow. One day…"
| ``` |
|
|
| ## Train your own: FP32 or ternary
|
|
| The kit ships a small TinyStories slice at `data/tinystories_demo/` so you |
| can do a smoke training run immediately: |
|
|
| ```bash |
| # FP32, 50 steps on CPU, takes a couple of minutes: |
| python scripts/train.py --model tilelli-lite-fp32 \ |
| --data-dir data/tinystories_demo --steps 50 \ |
| --batch-size 4 --seq-len 64 --device cpu |
| |
| # Same architecture, ternary forward pass (straight-through estimator): |
| python scripts/train.py --model tilelli-lite-ternary \ |
| --data-dir data/tinystories_demo --steps 50 \ |
| --batch-size 4 --seq-len 64 --device cpu |
| |
| # Vanilla GPT baseline for A/B comparison: |
| python scripts/train.py --model vanilla-fp32 \ |
| --data-dir data/tinystories_demo --steps 50 \ |
| --batch-size 4 --seq-len 64 --device cpu |
| ``` |
|
|
| For a real training run, point `--data-dir` at the full TinyStories dataset |
| (or anything else packed as `train.bin`/`valid.bin`; see |
| [`data/tinystories_demo/README.md`](data/tinystories_demo/README.md) for |
| the format). |
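
Because the model is byte-level, "tokenization" is just UTF-8 encoding, and a training batch is a set of fixed-length byte windows with targets shifted by one position. The sketch below illustrates that idea only; it does not read or write the kit's `.bin` shards (their layout is defined in the demo README), and `encode_bytes` and `sample_batch` are hypothetical names.

```python
import random


def encode_bytes(text: str) -> bytes:
    """Byte-level 'tokenization': each UTF-8 byte is one token id in 0..255."""
    return text.encode("utf-8")


def sample_batch(data: bytes, batch_size: int, seq_len: int, rng: random.Random):
    """Draw (inputs, targets) windows; targets are inputs shifted by one byte."""
    if len(data) < seq_len + 1:
        raise ValueError("corpus is shorter than seq_len + 1 bytes")
    inputs, targets = [], []
    for _ in range(batch_size):
        start = rng.randrange(len(data) - seq_len)
        window = data[start : start + seq_len + 1]
        inputs.append(list(window[:-1]))   # bytes 0 .. seq_len-1
        targets.append(list(window[1:]))   # bytes 1 .. seq_len
    return inputs, targets


corpus = encode_bytes("Once upon a time, there was a little girl named Lily. " * 4)
x, y = sample_batch(corpus, batch_size=4, seq_len=64, rng=random.Random(0))
print(len(x), len(x[0]), x[0][1:] == y[0][:-1])  # 4 64 True
```

The `batch_size=4, seq_len=64` values mirror the smoke-run flags above; the vocabulary never exceeds 256 entries regardless of the corpus.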
|
|
| ### Available `--model` configs |
|
|
| | Name | Builder | Quantize | Shape | Param-count | |
| |---|---|---|---|---| |
| | `tilelli-lite-fp32` | Lite 3-pathway | FP32 | d=256, L=8 | ~10 M | |
| | `tilelli-lite-ternary` | Lite 3-pathway | Ternary STE | d=256, L=8 | ~10 M | |
| | `tilelli-fp32` | Parent multi-pathway | FP32 | d=512, L=7 | ~10 M | |
| | `tilelli-ternary` | Parent multi-pathway | Ternary STE | d=512, L=7 | ~10 M | |
| | `vanilla-fp32` | Pre-norm transformer baseline | FP32 | d=320, L=8 | ~10 M | |
|
|
| Add your own variants by editing `MODEL_CFGS` in `scripts/train.py`. |
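
When adding a variant for a param-fair comparison, it helps to estimate the parameter count before training. As a rough check, the `vanilla-fp32` row (d=320, L=8) can be reproduced with the usual 12·d²·L transformer estimate. This sketch assumes a 4x FFN, no biases, and a byte embedding tied to the output head; the real baseline in `src/tilelli/baselines/vanilla.py` may differ, and `vanilla_param_estimate` is a hypothetical helper.

```python
def vanilla_param_estimate(d_model: int, n_layers: int, vocab: int = 256, ffn_mult: int = 4) -> int:
    """Rough weight count for a pre-norm transformer (biases, norms, positions ignored)."""
    attention = 4 * d_model * d_model        # Q, K, V and output projections
    ffn = 2 * ffn_mult * d_model * d_model   # up- and down-projection (ASSUMED 4x width)
    embedding = vocab * d_model              # byte embedding, ASSUMED tied with the output head
    return n_layers * (attention + ffn) + embedding


print(vanilla_param_estimate(320, 8))  # 9912320
```

That lands at about 9.9 M, consistent with the "~10 M" in the table. The routed variants have extra pathway and router weights, so this formula does not apply to them.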
|
|
| --- |
|
|
| ## A note on the two checkpoints |
|
|
| Tilelli ships two trained models because those are the two we currently have,
| and they are *not* the same architecture. To be plain about it:
|
|
| - **`tilelli_chat_v4.pt`** is the deployed chat model that lives at |
| chat.tilelli.tech. It runs the *Lite* 3-pathway block (local conv + sparse |
| top-k attention + dense FFN, d=256, L=8). It's FP32 because we haven't yet |
| had GPU budget to do a ternary-aware re-training of the chat SFT. |
|
|
| - **`tilelli_pretrain_v1_ternary.pt`** is a 50K-step plain ternary pretrain |
| on TinyStories using the *parent* multi-pathway block (5-pathway, d=512, |
| L=7). It's not chat-SFT'd, so it produces TinyStories-style continuations |
| rather than answering questions. It demonstrates that the ternary recipe |
| in this kit actually converges to coherent text (val loss 0.6843 on |
| TinyStories byte-LM). |
| |
| A future ternary-aware re-training of the Lite architecture would give you
| *the same model in both precisions* (FP32 and ternary), which is the artifact we
| actually want. It's queued.
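
For readers new to the "Ternary STE" mode, the sketch below shows the general idea in plain Python: the forward pass uses weights snapped to {-1, 0, +1} times a scale, while the backward pass treats the quantizer as the identity and updates the full-precision latent weights. It uses absmean scaling, one common recipe; the kit's actual quantizer lives in `src/tilelli/core/` and may differ. `ternarize` and `ste_step` are hypothetical names.

```python
def ternarize(weights):
    """Absmean ternarization: returns (codes in {-1, 0, +1}, scale)."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0  # 1.0 if all weights are zero
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def ste_step(latent, x, target, lr=0.1):
    """One SGD step on a linear unit y = sum(code_i * scale * x_i).

    Forward uses the ternary weights. Backward pretends the quantizer is the
    identity, so the gradient w.r.t. the effective weight is applied directly
    to the latent full-precision weight (the straight-through estimator).
    """
    codes, scale = ternarize(latent)
    y = sum(c * scale * xi for c, xi in zip(codes, x))
    grad_y = 2.0 * (y - target)  # derivative of (y - target)^2 w.r.t. y
    new_latent = [w - lr * grad_y * xi for w, xi in zip(latent, x)]
    return new_latent, (y - target) ** 2


print(ternarize([0.9, -0.05, -0.7])[0])  # [1, 0, -1]
```

Only the latent weights are ever updated; the ternary codes are recomputed from them on every forward pass.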
| |
| --- |
| |
| ## What works (verified) |
| |
| | # | Claim | Script | Result file | |
| |---|---|---|---| |
| | 1 | Held-out IDK gate: **9 / 10** prompts trigger the abstain template (script PASS gate: ≥ 9; verified on bundled v4) | [`reproduce/03_abstain_held_out.py`](reproduce/03_abstain_held_out.py) | [`results/claim_03_abstain.md`](results/claim_03_abstain.md) |
| | 2 | False-inability probe on the bundled set: 7 / 20 trigger refusal | [`reproduce/04_neo_false_inability.py`](reproduce/04_neo_false_inability.py) | [`results/claim_04_neo.md`](results/claim_04_neo.md) | |
| | 3 | Cross-regime ID-vs-OOD AUROC ≈ chance for all 4 signals (`max_softmax_mean` ≈ 0.54); this is the table the script computes and gates on. Broken down *per regime*, `max_softmax_mean` reaches AUROC ≈ 0.93 on gibberish-vs-in-domain (the one working slice; documented in the result file, not recomputed by this script). | [`reproduce/02_metacog_probe.py`](reproduce/02_metacog_probe.py) | [`results/claim_02_metacog.md`](results/claim_02_metacog.md) |
| | 4 | Architecture + checkpoints + trainer work end-to-end on CPU | [`reproduce/01_benchmark.py`](reproduce/01_benchmark.py) + `pytest tests/` | n/a |
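
To make the AUROC figures in claim 3 concrete: an ID-vs-OOD AUROC is the probability that a randomly chosen in-domain prompt scores higher than a randomly chosen out-of-domain prompt, so 0.5 is chance and 1.0 is perfect separation. The sketch below is a generic pairwise (Mann-Whitney) computation, not the kit's scorer; it assumes higher scores mean "more in-domain", and `auroc` is a hypothetical helper.

```python
def auroc(id_scores, ood_scores):
    """Probability that a random in-domain score exceeds a random OOD score (ties count 0.5)."""
    if not id_scores or not ood_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


print(auroc([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))  # 1.0
print(auroc([0.9, 0.4], [0.6, 0.2]))            # 0.75
print(auroc([0.5, 0.5], [0.5, 0.5]))            # 0.5
```

The quadratic loop is fine at the n=30-per-regime scale used here; a rank-based version would be preferable for large sets.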
| |
| ## What doesn't work (verified negative) |
| |
| | # | Claim that's wrong | What the evidence actually shows | |
| |---|---|---| |
| | N1 | "Router-entropy is an architecture-native metacognition signal" | Across 7 OOD regimes × n=30, router-entropy family wins 0 / 7 vs `max_softmax_mean`. |
| | N2 | "Lite beats vanilla 3 / 3 seeds at param-fair" | 3 Lite seeds vs 1 vanilla seed (we ran out of RunPod budget). Welch test pending a 3-seed vanilla replication. The 6.7σ figure was retracted. |
| | N3 | "Train an abstain head once, splice it onto any base model" | v7's joint-trained abstain head got AUROC 0.76 cross-regime; lifted onto v4's base it dropped to 0.54 with 27 % false-positive rate. Not modular. | |
| | N4 | "Just turn off the metacog loss and the router will be left alone" | CE on in-domain still backprops through unfrozen router-Linears. 16K updates shift routing distribution → OOD generation collapses. |
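
Row N1 compares two families of uncertainty signals. As a rough illustration of what such signals look like, the sketch below computes a mean max-softmax confidence over output logits and a mean Shannon entropy over router logits. Both definitions are assumptions inferred from the signal names; the definitions actually used are in `src/tilelli/eval/`, and `router_entropy_mean` is a hypothetical name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_softmax_mean(logits_per_position):
    """Mean over positions of the highest next-byte probability (ASSUMED definition)."""
    return sum(max(softmax(row)) for row in logits_per_position) / len(logits_per_position)


def router_entropy_mean(router_logits_per_position):
    """Mean over positions of the routing distribution's entropy in nats (ASSUMED definition)."""
    total = 0.0
    for row in router_logits_per_position:
        total += -sum(p * math.log(p) for p in softmax(row) if p > 0.0)
    return total / len(router_logits_per_position)


print(max_softmax_mean([[0.0, 0.0]]))  # 0.5
```

A uniform 3-way router gives the maximum entropy, ln 3 ≈ 1.10 nats; a router that always picks one pathway gives an entropy near 0.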
| |
| --- |
| |
| ## Reproducing claims |
| |
| ```bash |
| python reproduce/01_benchmark.py # arch loads, ~10M params (CPU, ~2 s) |
| python reproduce/03_abstain_held_out.py # 9 / 10 held-out IDK gate (CPU, ~1 min) |
| python reproduce/04_neo_false_inability.py # 7 / 20 false-inability rate (CPU, ~2 min) |
| python reproduce/02_metacog_probe.py # cross-regime AUROC sweep (CPU, ~15 min, slow)
| ``` |
| |
| Each script exits non-zero if the bundled v4 checkpoint fails to produce the |
| documented number within 5 %. If a script doesn't reproduce its claim on |
| your machine, please open an issue. |
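
The pass/fail rule can be pictured as a relative-tolerance check mapped to a process exit status. The sketch below is one reading of "within 5 %" (relative to the documented value) and is not taken from the scripts themselves; the abstain script is described above as gating on ≥ 9 instead. `within_tolerance` and `exit_code` are hypothetical names.

```python
def within_tolerance(measured: float, documented: float, rel_tol: float = 0.05) -> bool:
    """True if `measured` is within rel_tol (relative) of the documented value."""
    return abs(measured - documented) <= rel_tol * abs(documented)


def exit_code(measured: float, documented: float) -> int:
    """0 on PASS, 1 on FAIL, following the usual process exit-status convention."""
    return 0 if within_tolerance(measured, documented) else 1


print(exit_code(0.53, 0.54))  # 0
print(exit_code(0.60, 0.54))  # 1
```

In a shell, a non-zero status from any script is what lets a CI job or `&&` chain stop at the first claim that fails to reproduce.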
| |
| ## What's in this repo |
| |
| | Path | What it is | |
| |---|---| |
| | `src/tilelli/core/` | The architecture: 8 .py files, Lite + parent variants, ternary primitives, Hadamard, sparse attention, SSM |
| | `src/tilelli/baselines/vanilla.py` | The pre-norm transformer used for the A/B comparison | |
| | `src/tilelli/optimisers/` | AdamW wrapper + Muon optimizer support | |
| | `src/tilelli/eval/` | Metacog probe + scorer (used by `reproduce/02_metacog_probe.py` to verify the metacog claim) |
| | `scripts/train.py` | Master trainer: `--model {tilelli-lite-fp32, tilelli-lite-ternary, vanilla-fp32, tilelli-fp32, tilelli-ternary}` |
| | `scripts/train_demo.py` | 5-step CPU smoke test; verifies that gradients flow |
| | `scripts/prepare_tinystories.py` | Packs raw TinyStories txt → `train.bin`/`valid.bin` |
| | `chat.py`, `infer.py` | Inference entry points (chat uses v4 + KV cache; infer auto-routes) | |
| | `checkpoints/` | The two ckpts above | |
| | `data/tinystories_demo/` | ~700 KB train + ~70 KB valid demo slice (TinyStories CC-BY-4.0) | |
| | `reproduce/` | Four claim-verification scripts | |
| | `results/` | Verified claim docs + audit trail | |
| | `prompts/probe_210.jsonl` | 210-prompt evaluation set across 7 regimes | |
| | `tests/test_kit_smoke.py` | Three smoke tests (`pytest -q tests/`) | |
| |
| ## What's NOT in this repo |
| |
| - **Spectrum** (power-of-3 7-level quantization): a separate research line in
| the source repo's `mosaic/spinoffs/spectrum/`. Closes ~49 % of the
| ternary-to-FP32 gap but is still ~12 % behind vanilla FP32. Out of scope here.
| - The **FineWeb-Edu training pipeline** + the SFT data that produced v4:
| private. The minimal training loop bundled here trains on any
| `.bin` shards you provide.
| - The **failed metacog ckpts** (v5 / v6 / v7 / v8a / v8b / splice): available
| on request via `hello@tilelli.tech` for negative-result replication.
|
|
| --- |
|
|
| ## The actual interesting finding |
|
|
| In a small (10 M-param) routed LM, the metacognition / uncertainty signal
| **does not live in a separable module.** We trained 5 variants (v5 through v8b)
| sweeping the metacog-loss weight from 20 → 0, plus a splice (head-only
| graft). The best signal (cross-regime ID-vs-OOD AUROC 0.85 on `abstain_p`)
| is reached **without** any explicit metacog loss (v8b, BCE-only), but at
| the cost of generation quality. The head-only splice preserves generation
| but the signal collapses (AUROC 0.76 → 0.54).
|
|
| The signal IS reachable. The module is **not** liftable. See |
| [`PAPER_OUTLINE.md`](PAPER_OUTLINE.md) for the workshop write-up. |
|
|
| ## License |
|
|
| Apache 2.0. See [`LICENSE`](LICENSE). The bundled weights ship under the
| same license; the TinyStories demo slice keeps its CC-BY-4.0 license (both
| licenses permit redistribution). The "Tilelli" name is not licensed by this
| file: fork freely, but rename if you ship a derivative product.
|
|