+---
+license: apache-2.0
+language:
+- en
+library_name: pytorch
+pipeline_tag: text-generation
+tags:
+- byte-level
+- small-language-model
+- routing
+- mixture-of-experts
+- uncertainty
+- abstention
+- negative-results
+- reproducibility
+---
+<!--
+This YAML block is the Hugging Face model-card header. At push time it is prepended to the
+repository's existing README.md (the GitHub README body is reused verbatim below it). It is kept
+OUT of the GitHub repo so the GitHub README stays plain Markdown; only the HF mirror carries it.
+-->
+# Tilelli
+> **Working with this repo through an AI agent (Cursor / Claude Code / Codex / Aider / ChatGPT)?**
+> Read [`AGENTS.md`](AGENTS.md) first. It has the install path, the verified
+> claims, the verified *negative* claims (so the agent doesn't repeat them as
+> facts), and the common mistakes other agents have already made on this kit.
+A small (~10 M-parameter) byte-level language model with a 3-pathway routed
+block. **Trains and chats out of the box, in either FP32 or ternary mode, on
+CPU.** Part of a family of ternary-first language models (Mosaic, atome-lm,
+spectrum) that shares the same intent: small, local, ternary-capable,
+auditable end-to-end.
+This kit ships:
+- The architecture in 8 source files (3-pathway + parent multi-pathway)
+- **Two trained checkpoints** — FP32 chat (deployed) and plain ternary pretrain
+- A working **trainer** that takes a text corpus and a `--model` flag
+- A ~700 KB **demo training dataset** (TinyStories slice) so `train.py` runs
+  end-to-end on CPU in a few minutes
+- Four verification scripts that exit non-zero if our documented numbers
+  don't reproduce against the bundled v4 ckpt
+---
+## What's in `checkpoints/`
+| File | Size | Precision | Architecture | Training | Use it for |
+|---|---|---|---|---|---|
+| `tilelli_chat_v4.pt` | 39 MB | **FP32** | 3-pathway Lite (d=256, L=8) | 12K-step FineWeb-Edu pretrain → chat SFT → abstain-aware SFT | Chat. Deployed at chat.tilelli.tech. SHA `9f1dcc9465003a…` |
+| `tilelli_pretrain_v1_ternary.pt` | 39 MB | **Ternary {−1, 0, +1}**, STE throughout | Parent multi-pathway (d=512, L=7) | 50K-step TinyStories pretrain | Story continuation. Base for your own ternary SFT. SHA `e1b0a263b5c2…` |
+Both are ~10 M-parameter byte-level models. They use different architectural
+variants of the same family; see [§A note on the two checkpoints](#a-note-on-the-two-checkpoints) below.
+---
+## Install (CPU, ~120 MB total)
+```bash
+git clone https://github.com/TilelliLab/Tilelli-llm
+cd Tilelli-llm
+# CPU-only torch (avoids 2 GB CUDA wheel on Linux):
+pip install --index-url https://download.pytorch.org/whl/cpu torch
+pip install -e .
+```
+See `INSTALL.md` for macOS / Windows / GPU notes.
+## Chat
+```bash
+# Talk to the deployed FP32 chat model:
+python chat.py "What is the moon?"
+# → "i can't answer that. facts like that are beyond a 10m model"
+# Or use the generic inference script with either ckpt:
+python infer.py --prompt "Hello, who are you?"
+# → uses checkpoints/tilelli_chat_v4.pt by default
+# Story continuation with the ternary pretrain:
+python infer.py --ckpt checkpoints/tilelli_pretrain_v1_ternary.pt \
+                --prompt "Once upon a time, there was a little"
+# → "girl named Lily. She loved to play outside in the snow. One day…"
+```
+## Train your own — FP32 or ternary
+The kit ships a small TinyStories slice at `data/tinystories_demo/` so you
+can do a smoke training run immediately:
+```bash
+# FP32, 50 steps on CPU, takes a couple of minutes:
+python scripts/train.py --model tilelli-lite-fp32 \
+    --data-dir data/tinystories_demo --steps 50 \
+    --batch-size 4 --seq-len 64 --device cpu
+# Same architecture, ternary forward pass (straight-through estimator):
+python scripts/train.py --model tilelli-lite-ternary \
+    --data-dir data/tinystories_demo --steps 50 \
+    --batch-size 4 --seq-len 64 --device cpu
+# Vanilla GPT baseline for A/B comparison:
+python scripts/train.py --model vanilla-fp32 \
+    --data-dir data/tinystories_demo --steps 50 \
+    --batch-size 4 --seq-len 64 --device cpu
+```
+For a real training run, point `--data-dir` at the full TinyStories dataset
+(or anything else packed as `train.bin`/`valid.bin`; see
+[`data/tinystories_demo/README.md`](data/tinystories_demo/README.md) for
+the format).
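The authoritative shard layout is the one described in the linked data README. As a rough illustration only of what byte-level packing can look like, the sketch below assumes a flat stream of raw UTF-8 bytes with one token id per byte; the helper names `pack_bytes` and `load_tokens` are hypothetical and are not part of this kit.

```python
from pathlib import Path
import tempfile


def pack_bytes(text: str, out_path: Path) -> int:
    """Write `text` as raw UTF-8 bytes (assumed layout: one token id per byte)."""
    data = text.encode("utf-8")
    out_path.write_bytes(data)
    return len(data)


def load_tokens(path: Path) -> list:
    """Read a shard back as a list of integer token ids in the range 0..255."""
    return list(path.read_bytes())


with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "train.bin"  # file name taken from the README; layout is assumed
    n = pack_bytes("Once upon a time", shard)
    tokens = load_tokens(shard)
    decoded = bytes(tokens).decode("utf-8")

print(n, tokens[:4], decoded)
```

For real data, use `scripts/prepare_tinystories.py`, which produces the shards the trainer actually expects.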
+### Available `--model` configs
+| Name | Builder | Quantize | Shape | Param-count |
+|---|---|---|---|---|
+| `tilelli-lite-fp32` | Lite 3-pathway | FP32 | d=256, L=8 | ~10 M |
+| `tilelli-lite-ternary` | Lite 3-pathway | Ternary STE | d=256, L=8 | ~10 M |
+| `tilelli-fp32` | Parent multi-pathway | FP32 | d=512, L=7 | ~10 M |
+| `tilelli-ternary` | Parent multi-pathway | Ternary STE | d=512, L=7 | ~10 M |
+| `vanilla-fp32` | Pre-norm transformer baseline | FP32 | d=320, L=8 | ~10 M |
+Add your own variants by editing `MODEL_CFGS` in `scripts/train.py`.
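To see why the vanilla baseline at d=320, L=8 lands near 10 M parameters, a back-of-the-envelope count can help. The sketch below assumes a textbook pre-norm transformer: four d×d attention projections, a feed-forward block with a 4× hidden width, a tied 256-entry byte embedding, and no biases or norm parameters. The real `vanilla.py` may differ in any of these, so treat the result as an estimate rather than the exact figure.

```python
def vanilla_param_estimate(d_model: int, n_layers: int, vocab: int = 256) -> int:
    """Rough parameter count for a standard transformer (assumptions in the lead-in)."""
    attn = 4 * d_model * d_model        # Q, K, V and output projections
    ffn = 2 * d_model * (4 * d_model)   # up and down projections, 4x hidden width (assumed)
    embed = vocab * d_model             # tied input/output embedding (assumed)
    return n_layers * (attn + ffn) + embed


estimate = vanilla_param_estimate(d_model=320, n_layers=8)
print(f"{estimate:,}")  # prints 9,912,320
```

The same formula does not apply to the Lite or parent blocks, whose pathways have a different parameter breakdown.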
+---
+## A note on the two checkpoints
+Tilelli ships two trained models because we currently have two trained models
+to ship — they are *not* the same architecture. To be plain about it:
+- **`tilelli_chat_v4.pt`** is the deployed chat model that lives at
+  chat.tilelli.tech. It runs the *Lite* 3-pathway block (local conv + sparse
+  top-k attention + dense FFN, d=256, L=8). It's FP32 because we haven't yet
+  had GPU budget to do a ternary-aware re-training of the chat SFT.
+- **`tilelli_pretrain_v1_ternary.pt`** is a 50K-step plain ternary pretrain
+  on TinyStories using the *parent* multi-pathway block (5-pathway, d=512,
+  L=7). It's not chat-SFT'd, so it produces TinyStories-style continuations
+  rather than answering questions. It demonstrates that the ternary recipe
+  in this kit actually converges to coherent text (val loss 0.6843 on
+  TinyStories byte-LM).
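For readers new to ternary training, the idea behind a straight-through estimator (STE) can be sketched in plain Python. This is a toy, not the kit's implementation: the threshold rule in `ternarize` (half the mean magnitude) is an assumed, commonly used choice, and `ste_step` is a hypothetical helper. The forward pass uses weights in {−1, 0, +1}, while the gradient is applied to the latent float weights as if the quantizer were the identity.

```python
def ternarize(weights):
    """Map latent float weights to {-1, 0, +1}; threshold is half the mean magnitude (assumed rule)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    threshold = 0.5 * scale
    return [1 if w > threshold else -1 if w < -threshold else 0 for w in weights]


def ste_step(latent, x, target, lr=0.1):
    """One SGD step on squared error with a straight-through backward pass."""
    q = ternarize(latent)                        # forward pass sees ternary weights only
    y = sum(qi * xi for qi, xi in zip(q, x))
    grad_y = 2.0 * (y - target)
    # STE: treat d(q_i)/d(latent_i) as 1, so the latent weight receives grad_y * x_i.
    new_latent = [w - lr * grad_y * xi for w, xi in zip(latent, x)]
    return new_latent, (y - target) ** 2


latent = [0.3, -0.2, 0.05]   # toy latent weights, illustrative values
x = [1.0, 1.0, 1.0]
target = 2.0
losses = []
for _ in range(20):
    latent, loss = ste_step(latent, x, target)
    losses.append(loss)
final_q = ternarize(latent)
print(losses[0], losses[-1], final_q)
```

The latent weights stay in floating point throughout; only the forward pass is ternary, which is why training can still make small gradient updates.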
+A future ternary-aware re-training of the Lite architecture would give you
+*the same model in both precisions* (FP32 and ternary), which is the artifact
+we actually want. It's queued.
+---
+## What works (verified)
+| # | Claim | Script | Result file |
+|---|---|---|---|
+| 1 | Held-out IDK gate: **9 / 10** prompts trigger the abstain template (script PASS gate: ≥ 9 — verified on bundled v4) | [`reproduce/03_abstain_held_out.py`](reproduce/03_abstain_held_out.py) | [`results/claim_03_abstain.md`](results/claim_03_abstain.md) |
+| 2 | False-inability probe on the bundled set: 7 / 20 trigger refusal | [`reproduce/04_neo_false_inability.py`](reproduce/04_neo_false_inability.py) | [`results/claim_04_neo.md`](results/claim_04_neo.md) |
+| 3 | Cross-regime ID-vs-OOD AUROC ≈ chance for all 4 signals (`max_softmax_mean` ≈ 0.54) — this is the table the script computes and gates on. Broken down *per regime*, `max_softmax_mean` reaches AUROC ≈ 0.93 on gibberish-vs-in-domain (the one working slice; documented in the result file, not recomputed by this script). | [`reproduce/02_metacog_probe.py`](reproduce/02_metacog_probe.py) | [`results/claim_02_metacog.md`](results/claim_02_metacog.md) |
+| 4 | Architecture + checkpoints + trainer work end-to-end on CPU | [`reproduce/01_benchmark.py`](reproduce/01_benchmark.py) + `pytest tests/` | — |
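As a reminder of what the AUROC figures in claim 3 measure, here is a minimal pairwise definition: the probability that a randomly chosen in-domain prompt gets a higher confidence score than a randomly chosen OOD prompt. The scores below are made up for illustration and are not outputs of the bundled checkpoint; the repo's own scorer may compute the metric differently.

```python
def auroc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical per-prompt confidence scores, invented for this example.
in_domain = [0.91, 0.85, 0.78, 0.66]
out_of_domain = [0.80, 0.52, 0.47, 0.30]

separable = auroc(in_domain, out_of_domain)
chance = auroc(in_domain, in_domain)   # identical distributions give 0.5
print(separable, chance)               # prints 0.875 0.5
```

A value near 0.5, like the ≈ 0.54 reported cross-regime, means the signal barely separates the two groups.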
+## What doesn't work (verified negative)
+| # | Claim that's wrong | What the evidence actually shows |
+|---|---|---|
+| N1 | "Router-entropy is an architecture-native metacognition signal" | Across 7 OOD regimes × n=30, router-entropy family wins 0 / 7 vs `max_softmax_mean`. |
+| N2 | "Lite beats vanilla 3 / 3 seeds at param-fair" | 3 Lite seeds vs 1 vanilla seed (we ran out of RunPod budget). Welch test pending a 3-seed vanilla replication. The 6.7σ figure was retracted. |
+| N3 | "Train an abstain head once, splice it onto any base model" | v7's joint-trained abstain head got AUROC 0.76 cross-regime; lifted onto v4's base it dropped to 0.54 with 27 % false-positive rate. Not modular. |
+| N4 | "Just turn off the metacog loss and the router will be left alone" | Cross-entropy on in-domain data still backpropagates through the unfrozen router linear layers. 16K updates shift the routing distribution, and OOD generation collapses. |
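The reason N2 stays open can be shown with a small sketch of Welch's t statistic: it needs a sample variance for each group, and a single vanilla seed has none. The loss values below are placeholders invented for the example, not measured results from this project.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic for two samples with unequal sizes and variances."""
    var_a = statistics.variance(a)   # requires at least two data points
    var_b = statistics.variance(b)   # requires at least two data points
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


# Placeholder validation losses, NOT measured results.
lite_seeds = [2.0, 2.1, 1.9]
vanilla_one_seed = [2.2]
vanilla_three_seeds = [2.2, 2.4, 2.0]

try:
    t_one = welch_t(lite_seeds, vanilla_one_seed)
except statistics.StatisticsError:
    t_one = None   # undefined: one seed has no sample variance

t_three = welch_t(lite_seeds, vanilla_three_seeds)
print(t_one, round(t_three, 3))
```

With only one vanilla seed the statistic cannot be computed at all, so any significance figure has to wait for the replication.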
+---
+## Reproducing claims
+```bash
+python reproduce/01_benchmark.py            # arch loads, ~10M params (CPU, ~2 s)
+python reproduce/03_abstain_held_out.py     # 9 / 10 held-out IDK gate (CPU, ~1 min)
+python reproduce/04_neo_false_inability.py  # 7 / 20 false-inability rate (CPU, ~2 min)
+python reproduce/02_metacog_probe.py        # cross-regime AUROC sweep (CPU, ~15 min — slow)
+```
+Each script exits non-zero if the bundled v4 checkpoint fails to produce the
+documented number within 5 %. If a script doesn't reproduce its claim on
+your machine, please open an issue.
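The pass/fail rule can be pictured as a small tolerance gate. The sketch below assumes "within 5 %" means a relative tolerance around the documented value; the function names are hypothetical and the actual scripts may implement the check differently (for instance, the abstain script gates on a count of ≥ 9).

```python
def reproduces(observed: float, documented: float, rel_tol: float = 0.05) -> bool:
    """True when `observed` is within `rel_tol` (relative, assumed) of `documented`."""
    return abs(observed - documented) <= rel_tol * abs(documented)


def gate(observed: float, documented: float) -> int:
    """Return a process exit code: 0 when the number reproduces, 1 otherwise."""
    ok = reproduces(observed, documented)
    print(f"observed={observed} documented={documented} -> {'PASS' if ok else 'FAIL'}")
    return 0 if ok else 1


# 0.54 is the documented cross-regime AUROC; 0.55 stands in for a hypothetical rerun.
exit_code = gate(observed=0.55, documented=0.54)
# A real script would finish with sys.exit(exit_code).
```

A non-zero exit code lets CI or a shell loop detect a failed reproduction without parsing output.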
+## What's in this repo
+| Path | What it is |
+|---|---|
+| `src/tilelli/core/` | The architecture: 8 `.py` files covering the Lite + parent variants, ternary primitives, Hadamard transform, sparse attention, SSM |
+| `src/tilelli/baselines/vanilla.py` | The pre-norm transformer used for the A/B comparison |
+| `src/tilelli/optimisers/` | AdamW wrapper + Muon optimizer support |
+| `src/tilelli/eval/` | Metacog probe + scorer (backs the metacog claim checked by `reproduce/02_metacog_probe.py`) |
+| `scripts/train.py` | Master trainer — `--model {tilelli-lite-fp32, tilelli-lite-ternary, vanilla-fp32, tilelli-fp32, tilelli-ternary}` |
+| `scripts/train_demo.py` | 5-step CPU smoke; verifies the gradient flows |
+| `scripts/prepare_tinystories.py` | Packs raw TinyStories txt → `train.bin`/`valid.bin` |
+| `chat.py`, `infer.py` | Inference entry points (chat uses v4 + KV cache; infer auto-routes) |
+| `checkpoints/` | The two ckpts above |
+| `data/tinystories_demo/` | ~700 KB train + ~70 KB valid demo slice (TinyStories CC-BY-4.0) |
+| `reproduce/` | Four claim-verification scripts |
+| `results/` | Verified claim docs + audit trail |
+| `prompts/probe_210.jsonl` | 210-prompt evaluation set across 7 regimes |
+| `tests/test_kit_smoke.py` | Three smoke tests (`pytest -q tests/`) |
+## What's NOT in this repo
+- **Spectrum** (power-of-3 7-level quantization) — separate research line in
+  the source repo's `mosaic/spinoffs/spectrum/`. Closes ~49 % of the
+  ternary→FP32 gap but is still ~12 % behind vanilla FP32. Out of scope here.
+- The **FineWeb-Edu training pipeline** + the SFT data that produced v4 —
+  private. The minimal training loop bundled here trains on any
+  `.bin` shards you provide.
+- The **failed metacog ckpts** (v5 / v6 / v7 / v8a / v8b / splice) — available
+  on request via `hello@tilelli.tech` for negative-result replication.
+---
+## The actual interesting finding
+In a small (10 M-param) routed LM, the metacognition / uncertainty signal
+**does not live in a separable module.** We trained 5 variants (v5–v8b)
+sweeping the metacog-loss weight from 20 → 0, plus a splice (head-only
+graft). The best signal (cross-regime ID-vs-OOD AUROC 0.85 on `abstain_p`)
+is reached **without** any explicit metacog loss (v8b, BCE-only) — but at
+the cost of generation quality. The head-only splice preserves generation
+but the signal collapses (AUROC 0.76 → 0.54).
+The signal IS reachable. The module is **not** liftable. See
+[`PAPER_OUTLINE.md`](PAPER_OUTLINE.md) for the workshop write-up.
+## License
+Apache 2.0. See [`LICENSE`](LICENSE). The bundled weights and the TinyStories
+demo slice ship under the same license (TinyStories is CC-BY-4.0; both
+licenses permit redistribution). The "Tilelli" name is not licensed by this
+file — fork freely; rename if you ship a derivative product.