TilelliLab committed on
Commit
8d72258
·
verified ·
1 Parent(s): 74dba1c

add model card

Files changed (1)
  1. README.md +236 -0
README.md ADDED
@@ -0,0 +1,236 @@
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ library_name: pytorch
6
+ pipeline_tag: text-generation
7
+ tags:
8
+ - byte-level
9
+ - small-language-model
10
+ - routing
11
+ - mixture-of-experts
12
+ - uncertainty
13
+ - abstention
14
+ - negative-results
15
+ - reproducibility
16
+ ---
17
+
18
+ <!--
19
+ This YAML block is the Hugging Face model-card header. At push time it is prepended to the
20
+ repository's existing README.md (the GitHub README body is reused verbatim below it). It is kept
21
+ OUT of the GitHub repo so the GitHub README stays plain Markdown; only the HF mirror carries it.
22
+ -->
23
+ # Tilelli
24
+
25
+ > **Working with this repo through an AI agent (Cursor / Claude Code / Codex / Aider / ChatGPT)?**
26
+ > Read [`AGENTS.md`](AGENTS.md) first. It has the install path, the verified
27
+ > claims, the verified *negative* claims (so the agent doesn't repeat them as
28
+ > facts), and the common mistakes other agents have already made on this kit.
29
+
30
+ A small (~10 M-parameter) byte-level language model with a 3-pathway routed
31
+ block. **Trains and chats out of the box, in either FP32 or ternary mode, on
32
+ CPU.** Part of a family of ternary-first language models (Mosaic, atome-lm,
33
+ spectrum) that shares the same intent: small, local, ternary-capable,
34
+ auditable end-to-end.
35
+
36
+ This kit ships:
37
+
38
+ - The architecture in 8 source files (3-pathway + parent multi-pathway)
39
+ - **Two trained checkpoints**: FP32 chat (deployed) and plain ternary pretrain
40
+ - A working **trainer** that takes a text corpus and a `--model` flag
41
+ - A ~700 KB **demo training dataset** (TinyStories slice) so `train.py` runs
42
+ end-to-end on CPU in a few minutes
43
+ - Four verification scripts that exit non-zero if our documented numbers
44
+ don't reproduce against the bundled v4 ckpt
45
+
46
+ ---
47
+
48
+ ## What's in `checkpoints/`
49
+
50
+ | File | Size | Precision | Architecture | Training | Use it for |
51
+ |---|---|---|---|---|---|
52
+ | `tilelli_chat_v4.pt` | 39 MB | **FP32** | 3-pathway Lite (d=256, L=8) | 12K-step FineWeb-Edu pretrain → chat SFT → abstain-aware SFT | Chat. Deployed at chat.tilelli.tech. SHA `9f1dcc9465003a…` |
53
+ | `tilelli_pretrain_v1_ternary.pt` | 39 MB | **Ternary {−1, 0, +1}**, STE throughout | Parent multi-pathway (d=512, L=7) | 50K-step TinyStories pretrain | Story continuation. Base for your own ternary SFT. SHA `e1b0a263b5c2…` |
54
+
55
+ Both are ~10 M-parameter byte-level models. They use different architectural variants
56
+ of the same family; see [§A note on the two checkpoints](#a-note-on-the-two-checkpoints) below.
57
+
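The truncated SHA prefixes in the table above can be checked locally after download. The sketch below is a minimal example that assumes the digests are SHA-256; the table does not name the algorithm, so `algorithm` is left as a parameter. The helper names `file_digest_hex` and `matches_prefix` are illustrative and are not part of the kit.

```python
import hashlib


def file_digest_hex(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so a 39 MB checkpoint never sits fully in memory."""
    hasher = hashlib.new(algorithm)  # assumption: SHA-256 unless told otherwise
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_prefix(path, expected_prefix, algorithm="sha256"):
    """True if the file's hex digest starts with the (possibly truncated) prefix."""
    return file_digest_hex(path, algorithm).startswith(expected_prefix.lower())
```

For example, `matches_prefix("checkpoints/tilelli_chat_v4.pt", "9f1dcc9465003a")` compares against the prefix quoted in the table. A `False` result may only mean the digest uses a different algorithm than assumed here.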
58
+ ---
59
+
60
+ ## Install (CPU, ~120 MB total)
61
+
62
+ ```bash
63
+ git clone https://github.com/TilelliLab/Tilelli-llm
64
+ cd Tilelli-llm
65
+ # CPU-only torch (avoids 2 GB CUDA wheel on Linux):
66
+ pip install --index-url https://download.pytorch.org/whl/cpu torch
67
+ pip install -e .
68
+ ```
69
+
70
+ See `INSTALL.md` for macOS / Windows / GPU notes.
71
+
72
+ ## Chat
73
+
74
+ ```bash
75
+ # Talk to the deployed FP32 chat model:
76
+ python chat.py "What is the moon?"
77
+ # → "i can't answer that. facts like that are beyond a 10m model"
78
+
79
+ # Or use the generic inference script with either ckpt:
80
+ python infer.py --prompt "Hello, who are you?"
81
+ # → uses checkpoints/tilelli_chat_v4.pt by default
82
+
83
+ # Story continuation with the ternary pretrain:
84
+ python infer.py --ckpt checkpoints/tilelli_pretrain_v1_ternary.pt \
85
+ --prompt "Once upon a time, there was a little"
86
+ # → "girl named Lily. She loved to play outside in the snow. One day…"
87
+ ```
88
+
89
+ ## Train your own: FP32 or ternary
90
+
91
+ The kit ships a small TinyStories slice at `data/tinystories_demo/` so you
92
+ can do a smoke training run immediately:
93
+
94
+ ```bash
95
+ # FP32, 50 steps on CPU, takes a couple of minutes:
96
+ python scripts/train.py --model tilelli-lite-fp32 \
97
+ --data-dir data/tinystories_demo --steps 50 \
98
+ --batch-size 4 --seq-len 64 --device cpu
99
+
100
+ # Same architecture, ternary forward pass (straight-through estimator):
101
+ python scripts/train.py --model tilelli-lite-ternary \
102
+ --data-dir data/tinystories_demo --steps 50 \
103
+ --batch-size 4 --seq-len 64 --device cpu
104
+
105
+ # Vanilla GPT baseline for A/B comparison:
106
+ python scripts/train.py --model vanilla-fp32 \
107
+ --data-dir data/tinystories_demo --steps 50 \
108
+ --batch-size 4 --seq-len 64 --device cpu
109
+ ```
110
+
111
+ For a real training run, point `--data-dir` at the full TinyStories dataset
112
+ (or anything else packed as `train.bin`/`valid.bin`; see
113
+ [`data/tinystories_demo/README.md`](data/tinystories_demo/README.md) for
114
+ the format).
115
+
116
+ ### Available `--model` configs
117
+
118
+ | Name | Builder | Quantize | Shape | Param-count |
119
+ |---|---|---|---|---|
120
+ | `tilelli-lite-fp32` | Lite 3-pathway | FP32 | d=256, L=8 | ~10 M |
121
+ | `tilelli-lite-ternary` | Lite 3-pathway | Ternary STE | d=256, L=8 | ~10 M |
122
+ | `tilelli-fp32` | Parent multi-pathway | FP32 | d=512, L=7 | ~10 M |
123
+ | `tilelli-ternary` | Parent multi-pathway | Ternary STE | d=512, L=7 | ~10 M |
124
+ | `vanilla-fp32` | Pre-norm transformer baseline | FP32 | d=320, L=8 | ~10 M |
125
+
126
+ Add your own variants by editing `MODEL_CFGS` in `scripts/train.py`.
127
+
128
+ ---
129
+
130
+ ## A note on the two checkpoints
131
+
132
+ Tilelli ships two trained models because we currently have two trained models
133
+ to ship; they are *not* the same architecture. To be plain about it:
134
+
135
+ - **`tilelli_chat_v4.pt`** is the deployed chat model that lives at
136
+ chat.tilelli.tech. It runs the *Lite* 3-pathway block (local conv + sparse
137
+ top-k attention + dense FFN, d=256, L=8). It's FP32 because we haven't yet
138
+ had GPU budget to do a ternary-aware re-training of the chat SFT.
139
+
140
+ - **`tilelli_pretrain_v1_ternary.pt`** is a 50K-step plain ternary pretrain
141
+ on TinyStories using the *parent* multi-pathway block (5-pathway, d=512,
142
+ L=7). It's not chat-SFT'd, so it produces TinyStories-style continuations
143
+ rather than answering questions. It demonstrates that the ternary recipe
144
+ in this kit actually converges to coherent text (val loss 0.6843 on
145
+ TinyStories byte-LM).
146
+
147
+ A future ternary-aware re-training of the Lite architecture would give you
148
+ *the same checkpoint twice* (FP32 and ternary), which is the artifact we
149
+ actually want. It's queued.
150
+
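To make the phrase "ternary with a straight-through estimator (STE)" concrete, here is a minimal pure-Python sketch on a single linear unit. It assumes an absmean scaling rule (one shared scale equal to the mean absolute weight); the kit's actual quantizer may differ, and `ternarize` and `ste_step` are hypothetical names rather than functions from `src/tilelli/core/`.

```python
def ternarize(weights):
    """Map latent float weights to codes in {-1, 0, +1} plus one shared scale.

    Assumption: absmean scaling; this is not confirmed to be the kit's rule.
    """
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def ste_step(weights, x, target, lr=0.1):
    """One SGD step on squared error for y = sum(scale * code_i * x_i).

    Forward pass uses the ternary weights. Backward pass applies the gradient
    with respect to the quantized weights directly to the latent float weights,
    i.e. the rounding is treated as the identity (the straight-through estimator).
    The scale is treated as a constant in the backward pass.
    """
    codes, scale = ternarize(weights)
    y = sum(scale * c * xi for c, xi in zip(codes, x))
    grad_y = 2.0 * (y - target)  # d(loss)/dy for loss = (y - target)^2
    new_weights = [w - lr * grad_y * xi for w, xi in zip(weights, x)]
    return new_weights, y
```

The latent weights stay in floating point during training; only the forward pass sees the ternary codes. That is why a ternary run can reuse an ordinary optimizer.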
151
+ ---
152
+
153
+ ## What works (verified)
154
+
155
+ | # | Claim | Script | Result file |
156
+ |---|---|---|---|
157
+ | 1 | Held-out IDK gate: **9 / 10** prompts trigger the abstain template (script PASS gate: ≥ 9; verified on bundled v4) | [`reproduce/03_abstain_held_out.py`](reproduce/03_abstain_held_out.py) | [`results/claim_03_abstain.md`](results/claim_03_abstain.md) |
158
+ | 2 | False-inability probe on the bundled set: 7 / 20 trigger refusal | [`reproduce/04_neo_false_inability.py`](reproduce/04_neo_false_inability.py) | [`results/claim_04_neo.md`](results/claim_04_neo.md) |
159
+ | 3 | Cross-regime ID-vs-OOD AUROC ≈ chance for all 4 signals (`max_softmax_mean` ≈ 0.54); this is the table the script computes and gates on. Broken down *per regime*, `max_softmax_mean` reaches AUROC ≈ 0.93 on gibberish-vs-in-domain (the one working slice; documented in the result file, not recomputed by this script). | [`reproduce/02_metacog_probe.py`](reproduce/02_metacog_probe.py) | [`results/claim_02_metacog.md`](results/claim_02_metacog.md) |
160
+ | 4 | Architecture + checkpoints + trainer work end-to-end on CPU | [`reproduce/01_benchmark.py`](reproduce/01_benchmark.py) + `pytest tests/` | n/a |
161
+
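For readers unfamiliar with the metric in claim 3, the sketch below shows one way an ID-vs-OOD AUROC over a confidence signal can be computed. It assumes `max_softmax_mean` means "the largest softmax probability at each generated step, averaged over steps", which is inferred from the name and not taken from `src/tilelli/eval/`. Both function bodies are illustrative.

```python
import math


def max_softmax_mean(step_logits):
    """Average, over generation steps, of the top softmax probability.

    Assumption: this matches the signal of the same name only in spirit.
    """
    total = 0.0
    for logits in step_logits:
        peak = max(logits)
        denom = sum(math.exp(v - peak) for v in logits)  # numerically stable
        total += 1.0 / denom  # exp(peak - peak) / denom
    return total / len(step_logits)


def auroc(id_scores, ood_scores):
    """Probability that a random in-domain score beats a random OOD score.

    Ties count as half a win, so 0.5 means the signal is at chance.
    """
    wins = 0.0
    for a in id_scores:
        for b in ood_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))
```

The pairwise loop is quadratic, which is fine at the probe's scale (n=30 prompts per regime). A value near 0.5, like the documented ≈ 0.54, means the signal barely separates in-domain from out-of-domain prompts.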
162
+ ## What doesn't work (verified negative)
163
+
164
+ | # | Claim that's wrong | What the evidence actually shows |
165
+ |---|---|---|
166
+ | N1 | "Router-entropy is an architecture-native metacognition signal" | Across 7 OOD regimes × n=30, router-entropy family wins 0 / 7 vs `max_softmax_mean`. |
167
+ | N2 | "Lite beats vanilla 3 / 3 seeds at param-fair" | 3 Lite seeds vs 1 vanilla seed (we ran out of RunPod budget). Welch test pending a 3-seed vanilla replication. The 6.7σ figure was retracted. |
168
+ | N3 | "Train an abstain head once, splice it onto any base model" | v7's joint-trained abstain head got AUROC 0.76 cross-regime; lifted onto v4's base it dropped to 0.54 with 27 % false-positive rate. Not modular. |
169
+ | N4 | "Just turn off the metacog loss and the router will be left alone" | CE on in-domain still backprops through unfrozen router-Linears. 16K updates shift routing distribution → OOD generation collapses. |
170
+
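Row N2 hinges on a statistical detail: Welch's t-test needs a variance estimate for each group, and a single vanilla seed provides none. The sketch below is a plain-stdlib illustration of the statistic, not the analysis script used for the results; `welch_t` is a hypothetical helper.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Raises ValueError when either group has fewer than 2 runs, because the
    sample variance is undefined for a single observation.
    """
    if len(group_a) < 2 or len(group_b) < 2:
        raise ValueError("need at least 2 runs per group to estimate variance")
    var_a = statistics.variance(group_a) / len(group_a)
    var_b = statistics.variance(group_b) / len(group_b)
    se_squared = var_a + var_b
    if se_squared == 0:
        raise ValueError("both groups have zero variance; t is undefined")
    t_stat = (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(se_squared)
    dof = se_squared ** 2 / (
        var_a ** 2 / (len(group_a) - 1) + var_b ** 2 / (len(group_b) - 1)
    )
    return t_stat, dof
```

Calling it with three values for one group and one value for the other raises `ValueError`. This is the situation described in N2, and the reason the comparison waits on a 3-seed vanilla replication.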
171
+ ---
172
+
173
+ ## Reproducing claims
174
+
175
+ ```bash
176
+ python reproduce/01_benchmark.py # arch loads, ~10M params (CPU, ~2 s)
177
+ python reproduce/03_abstain_held_out.py # 9 / 10 held-out IDK gate (CPU, ~1 min)
178
+ python reproduce/04_neo_false_inability.py # 7 / 20 false-inability rate (CPU, ~2 min)
179
+ python reproduce/02_metacog_probe.py # cross-regime AUROC sweep (CPU, ~15 min, slow)
180
+ ```
181
+
182
+ Each script exits non-zero if the bundled v4 checkpoint fails to produce the
183
+ documented number within 5 %. If a script doesn't reproduce its claim on
184
+ your machine, please open an issue.
185
+
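As an illustration of what "exits non-zero unless the number reproduces within 5 %" can look like, here is a minimal sketch of such a gate. It assumes the 5 % is a relative tolerance on the documented value; the real scripts in `reproduce/` may gate differently (claim 1, for instance, uses a `≥ 9` threshold and not a tolerance). `within_tolerance` and `gate` are hypothetical names.

```python
def within_tolerance(measured, documented, rel_tol=0.05):
    """True if measured is within rel_tol (relative) of the documented value."""
    return abs(measured - documented) <= rel_tol * abs(documented)


def gate(name, measured, documented, rel_tol=0.05):
    """Print a PASS/FAIL line and return a process exit code (0 = reproduced)."""
    ok = within_tolerance(measured, documented, rel_tol)
    status = "PASS" if ok else "FAIL"
    print(f"{status} {name}: measured={measured} documented={documented}")
    return 0 if ok else 1
```

A script built on this pattern would end with `sys.exit(gate(...))`, so CI or a shell `&&` chain stops at the first claim that fails to reproduce.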
186
+ ## What's in this repo
187
+
188
+ | Path | What it is |
189
+ |---|---|
190
+ | `src/tilelli/core/` | The architecture: 8 .py files, Lite + parent variants, ternary primitives, hadamard, sparse attention, SSM |
191
+ | `src/tilelli/baselines/vanilla.py` | The pre-norm transformer used for the A/B comparison |
192
+ | `src/tilelli/optimisers/` | AdamW wrapper + Muon optimizer support |
193
+ | `src/tilelli/eval/` | Metacog probe + scorer (verifies claim 02) |
194
+ | `scripts/train.py` | Master trainer: `--model {tilelli-lite-fp32, tilelli-lite-ternary, vanilla-fp32, tilelli-fp32, tilelli-ternary}` |
195
+ | `scripts/train_demo.py` | 5-step CPU smoke; verifies the gradient flows |
196
+ | `scripts/prepare_tinystories.py` | Packs raw TinyStories txt → `train.bin`/`valid.bin` |
197
+ | `chat.py`, `infer.py` | Inference entry points (chat uses v4 + KV cache; infer auto-routes) |
198
+ | `checkpoints/` | The two ckpts above |
199
+ | `data/tinystories_demo/` | ~700 KB train + ~70 KB valid demo slice (TinyStories CC-BY-4.0) |
200
+ | `reproduce/` | Four claim-verification scripts |
201
+ | `results/` | Verified claim docs + audit trail |
202
+ | `prompts/probe_210.jsonl` | 210-prompt evaluation set across 7 regimes |
203
+ | `tests/test_kit_smoke.py` | Three smoke tests (`pytest -q tests/`) |
204
+
205
+ ## What's NOT in this repo
206
+
207
+ - **Spectrum** (power-of-3 7-level quantization): separate research line in
208
+ the source repo's `mosaic/spinoffs/spectrum/`. Closes ~49 % of the
209
+ ternary→FP32 gap but is still ~12 % behind vanilla FP32. Out of scope here.
210
+ - The **FineWeb-Edu training pipeline** + the SFT data that produced v4:
211
+ private. The minimal training loop bundled here trains on any
212
+ `.bin` shards you provide.
213
+ - The **failed metacog ckpts** (v5 / v6 / v7 / v8a / v8b / splice): available
214
+ on request via `hello@tilelli.tech` for negative-result replication.
215
+
216
+ ---
217
+
218
+ ## The actual interesting finding
219
+
220
+ In a small (10 M-param) routed LM, the metacognition / uncertainty signal
221
+ **does not live in a separable module.** We trained 5 variants (v5–v8b)
222
+ sweeping the metacog-loss weight from 20 β†’ 0, plus a splice (head-only
223
+ graft). The best signal (cross-regime ID-vs-OOD AUROC 0.85 on `abstain_p`)
224
+ is reached **without** any explicit metacog loss (v8b, BCE-only), but at
225
+ the cost of generation quality. The head-only splice preserves generation
226
+ but the signal collapses (AUROC 0.76 β†’ 0.54).
227
+
228
+ The signal IS reachable. The module is **not** liftable. See
229
+ [`PAPER_OUTLINE.md`](PAPER_OUTLINE.md) for the workshop write-up.
230
+
231
+ ## License
232
+
233
+ Apache 2.0. See [`LICENSE`](LICENSE). The bundled weights and the TinyStories
234
+ demo slice ship under the same license (TinyStories is CC-BY-4.0; both
235
+ licenses permit redistribution). The "Tilelli" name is not licensed by this
236
+ file; fork freely, and rename if you ship a derivative product.