add model card
README.md
ADDED
@@ -0,0 +1,236 @@
---
license: apache-2.0
language:
- en
library_name: pytorch
pipeline_tag: text-generation
tags:
- byte-level
- small-language-model
- routing
- mixture-of-experts
- uncertainty
- abstention
- negative-results
- reproducibility
---

<!--
This YAML block is the Hugging Face model-card header. At push time it is prepended to the
repository's existing README.md (the GitHub README body is reused verbatim below it). It is kept
OUT of the GitHub repo so the GitHub README stays plain Markdown; only the HF mirror carries it.
-->

# Tilelli

> **Working with this repo through an AI agent (Cursor / Claude Code / Codex / Aider / ChatGPT)?**
> Read [`AGENTS.md`](AGENTS.md) first. It has the install path, the verified
> claims, the verified *negative* claims (so the agent doesn't repeat them as
> facts), and the common mistakes other agents have already made on this kit.

A small (~10 M-parameter) byte-level language model with a 3-pathway routed
block. **Trains and chats out of the box, in either FP32 or ternary mode, on
CPU.** Part of a family of ternary-first language models (Mosaic, atome-lm,
spectrum) that shares the same intent: small, local, ternary-capable,
auditable end-to-end.

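
"Byte-level" means the model reads raw bytes rather than subword tokens. A minimal sketch of that mapping, assuming the vocabulary is exactly the 256 byte values with no special tokens (the kit may reserve extra ids; `encode` and `decode` are illustrative names, not the kit's API):

```python
def encode(text: str) -> list[int]:
    """Map text to byte ids in 0..255 via UTF-8."""
    return list(text.encode("utf-8"))


def decode(ids: list[int]) -> str:
    """Inverse of encode; malformed byte sequences become U+FFFD."""
    return bytes(ids).decode("utf-8", errors="replace")


ids = encode("Tilelli ≈ 10M")
# Non-ASCII characters take several bytes, so len(ids) can exceed len(text).
print(len(ids), all(0 <= i < 256 for i in ids))
print(decode(ids))
```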
This kit ships:

- The architecture in 8 source files (3-pathway + parent multi-pathway)
- **Two trained checkpoints**: FP32 chat (deployed) and plain ternary pretrain
- A working **trainer** that takes a text corpus and a `--model` flag
- A ~700 KB **demo training dataset** (TinyStories slice) so `train.py` runs
  end-to-end on CPU in a few minutes
- Four verification scripts that exit non-zero if our documented numbers
  don't reproduce against the bundled v4 ckpt

---

## What's in `checkpoints/`

| File | Size | Precision | Architecture | Training | Use it for |
|---|---|---|---|---|---|
| `tilelli_chat_v4.pt` | 39 MB | **FP32** | 3-pathway Lite (d=256, L=8) | 12K-step FineWeb-Edu pretrain → chat SFT → abstain-aware SFT | Chat. Deployed at chat.tilelli.tech. SHA `9f1dcc9465003a…` |
| `tilelli_pretrain_v1_ternary.pt` | 39 MB | **Ternary {−1, 0, +1}**, STE throughout | Parent multi-pathway (d=512, L=7) | 50K-step TinyStories pretrain | Story continuation. Base for your own ternary SFT. SHA `e1b0a263b5c2…` |

Both are ~10 M-parameter byte-level models. They use different architectural variants
of the same family; see [§A note on the two checkpoints](#a-note-on-the-two-checkpoints) below.

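
The SHA values in the table are truncated, so a download can only be checked against the visible prefix. A hedged sketch, assuming the digests are SHA-256 in hex (the card does not name the hash function); `sha256_of` and `matches_prefix` are hypothetical helpers, not part of the kit:

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_prefix(path, expected_prefix):
    """True if the file's digest starts with the (truncated) expected prefix."""
    return sha256_of(path).startswith(expected_prefix.lower())


ckpt = Path("checkpoints/tilelli_chat_v4.pt")
if ckpt.exists():
    # Prefix copied from the table above; the remainder is elided there.
    print(matches_prefix(ckpt, "9f1dcc9465003a"))
```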

---

## Install (CPU, ~120 MB total)

```bash
git clone https://github.com/TilelliLab/Tilelli-llm
cd Tilelli-llm
# CPU-only torch (avoids 2 GB CUDA wheel on Linux):
pip install --index-url https://download.pytorch.org/whl/cpu torch
pip install -e .
```

See `INSTALL.md` for macOS / Windows / GPU notes.

## Chat

```bash
# Talk to the deployed FP32 chat model:
python chat.py "What is the moon?"
# → "i can't answer that. facts like that are beyond a 10m model"

# Or use the generic inference script with either ckpt:
python infer.py --prompt "Hello, who are you?"
# → uses checkpoints/tilelli_chat_v4.pt by default

# Story continuation with the ternary pretrain:
python infer.py --ckpt checkpoints/tilelli_pretrain_v1_ternary.pt \
    --prompt "Once upon a time, there was a little"
# → "girl named Lily. She loved to play outside in the snow. One day…"
```

## Train your own: FP32 or ternary

The kit ships a small TinyStories slice at `data/tinystories_demo/` so you
can do a smoke training run immediately:

```bash
# FP32, 50 steps on CPU, takes a couple of minutes:
python scripts/train.py --model tilelli-lite-fp32 \
    --data-dir data/tinystories_demo --steps 50 \
    --batch-size 4 --seq-len 64 --device cpu

# Same architecture, ternary forward pass (straight-through estimator):
python scripts/train.py --model tilelli-lite-ternary \
    --data-dir data/tinystories_demo --steps 50 \
    --batch-size 4 --seq-len 64 --device cpu

# Vanilla GPT baseline for A/B comparison:
python scripts/train.py --model vanilla-fp32 \
    --data-dir data/tinystories_demo --steps 50 \
    --batch-size 4 --seq-len 64 --device cpu
```

For a real training run, point `--data-dir` at the full TinyStories dataset
(or anything else packed as `train.bin`/`valid.bin`; see
[`data/tinystories_demo/README.md`](data/tinystories_demo/README.md) for
the format).

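
The ternary configs train with a ternary forward pass and a straight-through estimator (STE). The sketch below shows one common formulation, which may differ from the kit's: latent full-precision weights are divided by their mean absolute value and rounded to {−1, 0, +1} for the forward pass, while the backward pass treats the quantizer as the identity so the gradient updates the latent weights. It uses a single linear neuron in plain Python; `ternarize` and `ste_step` are illustrative names.

```python
def ternarize(weights):
    """Absmean quantizer (an assumption): returns codes in {-1, 0, +1} and a scale."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def ste_step(latent, x, target, lr=0.05):
    """One SGD step on loss = (y - target)^2, where y uses the ternary weights."""
    codes, scale = ternarize(latent)
    q = [c * scale for c in codes]                    # weights seen by the forward pass
    y = sum(qi * xi for qi, xi in zip(q, x))
    grad_q = [2.0 * (y - target) * xi for xi in x]    # dL/dq
    # Straight-through: treat dq/dw as 1, so dL/dw is taken to be dL/dq.
    new_latent = [w - lr * g for w, g in zip(latent, grad_q)]
    return new_latent, (y - target) ** 2


latent = [0.9, -0.2, 0.05, -1.1]
x = [1.0, 2.0, -1.0, 0.5]
for _ in range(20):
    latent, loss = ste_step(latent, x, target=1.0)
print(ternarize(latent)[0], loss)
```

The point of the design is that the rounding step has zero gradient almost everywhere; without the straight-through shortcut the latent weights would never move.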
### Available `--model` configs

| Name | Builder | Quantize | Shape | Param-count |
|---|---|---|---|---|
| `tilelli-lite-fp32` | Lite 3-pathway | FP32 | d=256, L=8 | ~10 M |
| `tilelli-lite-ternary` | Lite 3-pathway | Ternary STE | d=256, L=8 | ~10 M |
| `tilelli-fp32` | Parent multi-pathway | FP32 | d=512, L=7 | ~10 M |
| `tilelli-ternary` | Parent multi-pathway | Ternary STE | d=512, L=7 | ~10 M |
| `vanilla-fp32` | Pre-norm transformer baseline | FP32 | d=320, L=8 | ~10 M |

Add your own variants by editing `MODEL_CFGS` in `scripts/train.py`.

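
The `vanilla-fp32` row can be sanity-checked with the usual back-of-envelope count for a transformer. The sketch assumes four attention projections per layer, a 4× FFN expansion, a 256-entry byte embedding, and it ignores biases, norms, and positional parameters; the card specifies none of these, so treat the result (about 9.9 M) as a plausibility check only. The formula is not meant for the routed Tilelli blocks, whose layout is defined in `src/tilelli/core/`.

```python
def vanilla_param_estimate(d_model, n_layers, vocab=256, ffn_mult=4):
    """Rough parameter count for a standard transformer (assumed layout)."""
    attn = 4 * d_model * d_model              # Q, K, V and output projections
    ffn = 2 * ffn_mult * d_model * d_model    # up- and down-projection
    embed = vocab * d_model                   # byte embedding table
    return n_layers * (attn + ffn) + embed


print(vanilla_param_estimate(320, 8))  # 9912320
```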

---

## A note on the two checkpoints

Tilelli ships two trained models because we currently have two trained models
to ship, and they are *not* the same architecture. To be plain about it:

- **`tilelli_chat_v4.pt`** is the deployed chat model that lives at
  chat.tilelli.tech. It runs the *Lite* 3-pathway block (local conv + sparse
  top-k attention + dense FFN, d=256, L=8). It's FP32 because we haven't yet
  had GPU budget to do a ternary-aware re-training of the chat SFT.

- **`tilelli_pretrain_v1_ternary.pt`** is a 50K-step plain ternary pretrain
  on TinyStories using the *parent* multi-pathway block (5-pathway, d=512,
  L=7). It's not chat-SFT'd, so it produces TinyStories-style continuations
  rather than answering questions. It demonstrates that the ternary recipe
  in this kit actually converges to coherent text (val loss 0.6843 on
  TinyStories byte-LM).

A future ternary-aware re-training of the Lite architecture would give you
*the same checkpoint twice* (FP32 and ternary), which is the artifact we
actually want. It's queued.

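
The Lite block combines three pathways (local conv, sparse top-k attention, dense FFN) through a router. As a rough illustration only, assuming a softmax router that mixes pathway outputs per token (the kit's actual routing rule may differ, and `route` is a hypothetical name), this is how mixing weights and a router-entropy value could be computed:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of router logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, pathway_outputs):
    """Mix pathway outputs by router weights; also return the router entropy."""
    weights = softmax(router_logits)
    dim = len(pathway_outputs[0])
    mixed = [
        sum(w * out[i] for w, out in zip(weights, pathway_outputs))
        for i in range(dim)
    ]
    entropy = -sum(w * math.log(w) for w in weights if w > 0.0)
    return mixed, weights, entropy


# Three toy pathway outputs for one token (placeholder values).
outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
mixed, weights, entropy = route([0.0, 0.0, 0.0], outputs)
print(mixed, weights, entropy)
```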

---

## What works (verified)

| # | Claim | Script | Result file |
|---|---|---|---|
| 1 | Held-out IDK gate: **9 / 10** prompts trigger the abstain template (script PASS gate: ≥ 9; verified on bundled v4) | [`reproduce/03_abstain_held_out.py`](reproduce/03_abstain_held_out.py) | [`results/claim_03_abstain.md`](results/claim_03_abstain.md) |
| 2 | False-inability probe on the bundled set: 7 / 20 trigger refusal | [`reproduce/04_neo_false_inability.py`](reproduce/04_neo_false_inability.py) | [`results/claim_04_neo.md`](results/claim_04_neo.md) |
| 3 | Cross-regime ID-vs-OOD AUROC ≈ chance for all 4 signals (`max_softmax_mean` ≈ 0.54). This is the table the script computes and gates on. Broken down *per regime*, `max_softmax_mean` reaches AUROC ≈ 0.93 on gibberish-vs-in-domain (the one working slice; documented in the result file, not recomputed by this script). | [`reproduce/02_metacog_probe.py`](reproduce/02_metacog_probe.py) | [`results/claim_02_metacog.md`](results/claim_02_metacog.md) |
| 4 | Architecture + checkpoints + trainer work end-to-end on CPU | [`reproduce/01_benchmark.py`](reproduce/01_benchmark.py) + `pytest tests/` | n/a |

## What doesn't work (verified negative)

| # | Claim that's wrong | What the evidence actually shows |
|---|---|---|
| N1 | "Router-entropy is an architecture-native metacognition signal" | Across 7 OOD regimes × n=30, router-entropy family wins 0 / 7 vs `max_softmax_mean`. |
| N2 | "Lite beats vanilla 3 / 3 seeds at param-fair" | 3 Lite seeds vs 1 vanilla seed (we ran out of RunPod budget). Welch test pending a 3-seed vanilla replication. The 6.7σ figure was retracted. |
| N3 | "Train an abstain head once, splice it onto any base model" | v7's joint-trained abstain head got AUROC 0.76 cross-regime; lifted onto v4's base it dropped to 0.54 with 27 % false-positive rate. Not modular. |
| N4 | "Just turn off the metacog loss and the router will be left alone" | CE on in-domain still backprops through unfrozen router-Linears. 16K updates shift routing distribution → OOD generation collapses. |

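
Row N2 rests on a statistical point: Welch's t-test needs a sample variance from each group, and a single vanilla seed provides none. A minimal sketch using made-up loss values (placeholders for illustration, not results from this project):

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    if len(a) < 2 or len(b) < 2:
        raise ValueError("each group needs at least 2 seeds to estimate variance")
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


lite_seeds = [1.00, 1.02, 0.98]   # hypothetical placeholder losses
vanilla_seeds = [1.05]            # one seed only, as in row N2
try:
    welch_t(lite_seeds, vanilla_seeds)
except ValueError as err:
    print("cannot run the test:", err)
```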

---

## Reproducing claims

```bash
python reproduce/01_benchmark.py            # arch loads, ~10M params (CPU, ~2 s)
python reproduce/03_abstain_held_out.py     # 9 / 10 held-out IDK gate (CPU, ~1 min)
python reproduce/04_neo_false_inability.py  # 7 / 20 false-inability rate (CPU, ~2 min)
python reproduce/02_metacog_probe.py        # cross-regime AUROC sweep (CPU, ~15 min, slow)
```

Each script exits non-zero if the bundled v4 checkpoint fails to produce the
documented number within 5 %. If a script doesn't reproduce its claim on
your machine, please open an issue.

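
The pass/fail behaviour described above can be pictured as a relative-tolerance gate. This is a hedged sketch, not the code in `reproduce/`: `within_tolerance` and `gate` are hypothetical helpers, and the real scripts may gate differently (claim 1, for instance, is documented as a `≥ 9` threshold rather than a ±5 % band).

```python
def within_tolerance(measured, documented, rel_tol=0.05):
    """True if measured lies within rel_tol (relative) of the documented value."""
    if documented == 0:
        return measured == 0
    return abs(measured - documented) <= rel_tol * abs(documented)


def gate(name, measured, documented, rel_tol=0.05):
    """Print a verdict and return a process exit code (0 = pass, 1 = fail)."""
    ok = within_tolerance(measured, documented, rel_tol)
    verdict = "PASS" if ok else "FAIL"
    print(f"{name}: measured={measured} documented={documented} -> {verdict}")
    return 0 if ok else 1


# 0.54 is the documented cross-regime AUROC; 0.55 stands in for a fresh measurement.
exit_code = gate("max_softmax_mean AUROC", measured=0.55, documented=0.54)
# A real script would finish with sys.exit(exit_code).
```

Returning the exit code instead of exiting inside `gate` keeps the check easy to call from tests.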
## What's in this repo

| Path | What it is |
|---|---|
| `src/tilelli/core/` | The architecture: 8 .py files, Lite + parent variants, ternary primitives, hadamard, sparse attention, SSM |
| `src/tilelli/baselines/vanilla.py` | The pre-norm transformer used for the A/B comparison |
| `src/tilelli/optimisers/` | AdamW wrapper + Muon optimizer support |
| `src/tilelli/eval/` | Metacog probe + scorer (verifies claim 02) |
| `scripts/train.py` | Master trainer: `--model {tilelli-lite-fp32, tilelli-lite-ternary, vanilla-fp32, tilelli-fp32, tilelli-ternary}` |
| `scripts/train_demo.py` | 5-step CPU smoke; verifies the gradient flows |
| `scripts/prepare_tinystories.py` | Packs raw TinyStories txt → `train.bin`/`valid.bin` |
| `chat.py`, `infer.py` | Inference entry points (chat uses v4 + KV cache; infer auto-routes) |
| `checkpoints/` | The two ckpts above |
| `data/tinystories_demo/` | ~700 KB train + ~70 KB valid demo slice (TinyStories CC-BY-4.0) |
| `reproduce/` | Four claim-verification scripts |
| `results/` | Verified claim docs + audit trail |
| `prompts/probe_210.jsonl` | 210-prompt evaluation set across 7 regimes |
| `tests/test_kit_smoke.py` | Three smoke tests (`pytest -q tests/`) |

## What's NOT in this repo

- **Spectrum** (power-of-3 7-level quantization): separate research line in
  the source repo's `mosaic/spinoffs/spectrum/`. Closes ~49 % of the
  ternary-to-FP32 gap but is still ~12 % behind vanilla FP32. Out of scope here.
- The **FineWeb-Edu training pipeline** + the SFT data that produced v4:
  private. The minimal training loop bundled here trains on any
  `.bin` shards you provide.
- The **failed metacog ckpts** (v5 / v6 / v7 / v8a / v8b / splice): available
  on request via `hello@tilelli.tech` for negative-result replication.

---

## The actual interesting finding

In a small (10 M-param) routed LM, the metacognition / uncertainty signal
**does not live in a separable module.** We trained 5 variants (v5 through v8b)
sweeping the metacog-loss weight from 20 → 0, plus a splice (head-only
graft). The best signal (cross-regime ID-vs-OOD AUROC 0.85 on `abstain_p`)
is reached **without** any explicit metacog loss (v8b, BCE-only), but at
the cost of generation quality. The head-only splice preserves generation
but the signal collapses (AUROC 0.76 → 0.54).

The signal IS reachable. The module is **not** liftable. See
[`PAPER_OUTLINE.md`](PAPER_OUTLINE.md) for the workshop write-up.

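
The AUROC figures quoted in this card treat an uncertainty score as a binary classifier between in-domain (ID) and out-of-domain (OOD) prompts. The sketch below shows both pieces under stated assumptions: `max_softmax_mean` is taken to be the mean, over positions, of the largest softmax probability (inferred from the name, not confirmed by the card), and AUROC is computed as the Mann-Whitney pairwise statistic. All scores are toy values.

```python
import math


def max_softmax_mean(logits_per_step):
    """Mean over positions of the top softmax probability (assumed definition)."""
    total = 0.0
    for logits in logits_per_step:
        peak = max(logits)
        exps = [math.exp(z - peak) for z in logits]
        total += max(exps) / sum(exps)
    return total / len(logits_per_step)


def auroc(id_scores, ood_scores):
    """Probability that a random ID score exceeds a random OOD score (ties count 1/2)."""
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


id_scores = [0.9, 0.8, 0.7]      # toy confidences on in-domain prompts
ood_scores = [0.6, 0.75, 0.4]    # toy confidences on OOD prompts
print(auroc(id_scores, ood_scores))
```

An AUROC of 0.5 is chance, which is why the cross-regime value of about 0.54 is reported as a negative result while the 0.93 gibberish slice is not.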
## License

Apache 2.0. See [`LICENSE`](LICENSE). The bundled weights and the TinyStories
demo slice ship under the same license (TinyStories is CC-BY-4.0; both
licenses permit redistribution). The "Tilelli" name is not licensed by this
file. Fork freely; rename if you ship a derivative product.