	---
	license: apache-2.0
	language:
	- en
	library_name: pytorch
	pipeline_tag: text-generation
	tags:
	- byte-level
	- small-language-model
	- routing
	- mixture-of-experts
	- uncertainty
	- abstention
	- negative-results
	- reproducibility
	---

	<!--
	This YAML block is the Hugging Face model-card header. At push time it is prepended to the
	repository's existing README.md (the GitHub README body is reused verbatim below it). It is kept
	OUT of the GitHub repo so the GitHub README stays plain Markdown; only the HF mirror carries it.
	-->
	# Tilelli

	> Working with this repo through an AI agent (Cursor / Claude Code / Codex / Aider / ChatGPT)?
	> Read [`AGENTS.md`](AGENTS.md) first. It has the install path, the verified
	> claims, the verified negative claims (so the agent doesn't repeat them as
	> facts), and the common mistakes other agents have already made on this kit.

	A small (~10 M-parameter) byte-level language model with a 3-pathway routed
	block. **Trains and chats out of the box, in either FP32 or ternary mode, on
	CPU.** Part of a family of ternary-first language models (Mosaic, atome-lm,
	spectrum) that shares the same intent: small, local, ternary-capable,
	auditable end-to-end.

	This kit ships:

	- The architecture in 8 source files (3-pathway + parent multi-pathway)
	- Two trained checkpoints — FP32 chat (deployed) and plain ternary pretrain
	- A working trainer that takes a text corpus and a `--model` flag
	- A ~700 KB demo training dataset (TinyStories slice) so `train.py` runs
	end-to-end on CPU in a few minutes
	- Four verification scripts that exit non-zero if our documented numbers
	don't reproduce against the bundled v4 ckpt

	---

	## What's in `checkpoints/`

	| File | Size | Precision | Architecture | Training | Use it for |
	|---|---|---|---|---|---|
	| `tilelli_chat_v4.pt` | 39 MB | FP32 | 3-pathway Lite (d=256, L=8) | 12K-step FineWeb-Edu pretrain → chat SFT → abstain-aware SFT | Chat. Deployed at chat.tilelli.tech. SHA `9f1dcc9465003a…` |
	| `tilelli_pretrain_v1_ternary.pt` | 39 MB | Ternary {−1, 0, +1}, STE throughout | Parent multi-pathway (d=512, L=7) | 50K-step TinyStories pretrain | Story continuation. Base for your own ternary SFT. SHA `e1b0a263b5c2…` |

	Both are ~10 M-parameter byte-level models. They use different architectural
	variants of the same family; see [§A note on the two checkpoints](#a-note-on-the-two-checkpoints) below.
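
The SHA prefixes in the table can be checked locally before loading a checkpoint. The sketch below assumes the digests are SHA-256 (the table only says "SHA") and compares against the truncated prefix exactly as printed. `sha256_hex` and `matches_prefix` are hypothetical helper names, not part of the kit.

```python
import hashlib
from pathlib import Path


def sha256_hex(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_prefix(path: Path, expected_prefix: str) -> bool:
    """True if the file's digest starts with the (possibly truncated) expected prefix."""
    return sha256_hex(path).startswith(expected_prefix.lower())
```

For example, `matches_prefix(Path("checkpoints/tilelli_chat_v4.pt"), "9f1dcc9465003a")` should return `True` if the assumption about the hash algorithm holds and the download is intact.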

	---

	## Install (CPU, ~120 MB total)

	```bash
	git clone https://github.com/TilelliLab/Tilelli-llm
	cd Tilelli-llm
	# CPU-only torch (avoids 2 GB CUDA wheel on Linux):
	pip install --index-url https://download.pytorch.org/whl/cpu torch
	pip install -e .
	```

	See `INSTALL.md` for macOS / Windows / GPU notes.

	## Chat

	```bash
	# Talk to the deployed FP32 chat model:
	python chat.py "What is the moon?"
	# → "i can't answer that. facts like that are beyond a 10m model"

	# Or use the generic inference script with either ckpt:
	python infer.py --prompt "Hello, who are you?"
	# → uses checkpoints/tilelli_chat_v4.pt by default

	# Story continuation with the ternary pretrain:
	python infer.py --ckpt checkpoints/tilelli_pretrain_v1_ternary.pt \
	--prompt "Once upon a time, there was a little"
	# → "girl named Lily. She loved to play outside in the snow. One day…"
	```

	## Train your own — FP32 or ternary

	The kit ships a small TinyStories slice at `data/tinystories_demo/` so you
	can do a smoke training run immediately:

	```bash
	# FP32, 50 steps on CPU, takes a couple of minutes:
	python scripts/train.py --model tilelli-lite-fp32 \
	--data-dir data/tinystories_demo --steps 50 \
	--batch-size 4 --seq-len 64 --device cpu

	# Same architecture, ternary forward pass (straight-through estimator):
	python scripts/train.py --model tilelli-lite-ternary \
	--data-dir data/tinystories_demo --steps 50 \
	--batch-size 4 --seq-len 64 --device cpu

	# Vanilla GPT baseline for A/B comparison:
	python scripts/train.py --model vanilla-fp32 \
	--data-dir data/tinystories_demo --steps 50 \
	--batch-size 4 --seq-len 64 --device cpu
	```

	For a real training run, point `--data-dir` at the full TinyStories dataset
	(or anything else packed as `train.bin`/`valid.bin`; see
	[`data/tinystories_demo/README.md`](data/tinystories_demo/README.md) for
	the format).
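
As a rough illustration of what "packed" can mean for a byte-level model, the sketch below writes the UTF-8 bytes of a text into `train.bin` and `valid.bin`. The raw-byte layout and the 90/10 split are assumptions made for this example; the authoritative format is the one described in the demo README and produced by `scripts/prepare_tinystories.py`. `pack_bytes` is a hypothetical helper, not a function from the kit.

```python
import tempfile
from pathlib import Path


def pack_bytes(text: str, out_dir: Path, valid_fraction: float = 0.1) -> tuple[int, int]:
    """Write UTF-8 bytes of `text` to train.bin / valid.bin (assumed raw byte stream)."""
    data = text.encode("utf-8")
    # Split on a byte boundary; a byte-level model has no tokenizer to confuse.
    split = int(len(data) * (1.0 - valid_fraction))
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "train.bin").write_bytes(data[:split])
    (out_dir / "valid.bin").write_bytes(data[split:])
    return split, len(data) - split


# Demo in a temporary directory so nothing in the repo is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    n_train, n_valid = pack_bytes("Once upon a time. " * 100, Path(tmp))
    print(n_train, n_valid)
```

If the kit's shards carry a header or a different dtype, this sketch will not be compatible; check the README first.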

	### Available `--model` configs

	| Name | Builder | Quantize | Shape | Param-count |
	|---|---|---|---|---|
	| `tilelli-lite-fp32` | Lite 3-pathway | FP32 | d=256, L=8 | ~10 M |
	| `tilelli-lite-ternary` | Lite 3-pathway | Ternary STE | d=256, L=8 | ~10 M |
	| `tilelli-fp32` | Parent multi-pathway | FP32 | d=512, L=7 | ~10 M |
	| `tilelli-ternary` | Parent multi-pathway | Ternary STE | d=512, L=7 | ~10 M |
	| `vanilla-fp32` | Pre-norm transformer baseline | FP32 | d=320, L=8 | ~10 M |

	Add your own variants by editing `MODEL_CFGS` in `scripts/train.py`.
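
When sizing a new variant, a back-of-the-envelope parameter count helps keep comparisons param-fair. The sketch below uses the standard estimate for a plain transformer and reproduces the ~10 M figure for the `vanilla-fp32` shape. It assumes a 4x FFN expansion, a 256-entry byte vocabulary, and tied embeddings, and it ignores biases and norm parameters. None of these values is confirmed by the kit, and the estimate does not apply to the routed Tilelli blocks.

```python
def transformer_param_estimate(
    d_model: int, n_layers: int, vocab_size: int = 256, ffn_mult: int = 4
) -> int:
    """Approximate parameter count of a plain transformer (weights only)."""
    attn = 4 * d_model * d_model            # Q, K, V, and output projections
    ffn = 2 * ffn_mult * d_model * d_model  # up- and down-projection
    embed = vocab_size * d_model            # byte embedding table (assumed tied with the head)
    return n_layers * (attn + ffn) + embed


print(transformer_param_estimate(320, 8))  # 9912320
```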

	---

	## A note on the two checkpoints

	Tilelli ships two trained models because those are the two we currently
	have, and they are not the same architecture. To be plain about it:

	- `tilelli_chat_v4.pt` is the deployed chat model that lives at
	chat.tilelli.tech. It runs the Lite 3-pathway block (local conv + sparse
	top-k attention + dense FFN, d=256, L=8). It's FP32 because we haven't yet
	had GPU budget to do a ternary-aware re-training of the chat SFT.

	- `tilelli_pretrain_v1_ternary.pt` is a 50K-step plain ternary pretrain
	on TinyStories using the parent multi-pathway block (5-pathway, d=512,
	L=7). It's not chat-SFT'd, so it produces TinyStories-style continuations
	rather than answering questions. It demonstrates that the ternary recipe
	in this kit actually converges to coherent text (val loss 0.6843 on
	TinyStories byte-LM).
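
For readers new to ternary training, here is a minimal pure-Python sketch of the straight-through estimator (STE) idea on a single linear unit. The forward pass uses weights quantized to {−1, 0, +1} times a scale, and the backward pass updates the latent full-precision weights as if quantization were the identity. The absmean scaling rule is an assumption borrowed from common ternary recipes; the kit's own quantizer may differ, and `ternarize` and `ste_sgd_step` are hypothetical names.

```python
def ternarize(weights: list[float]) -> tuple[list[int], float]:
    """Quantize to codes in {-1, 0, +1} with a per-tensor absmean scale (assumed rule)."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0  # avoid a zero scale
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def ste_sgd_step(weights, inputs, target, lr=0.1):
    """One SGD step on squared error, with the gradient passed straight through."""
    codes, scale = ternarize(weights)
    # Forward: only the quantized weights (code * scale) touch the data.
    pred = sum(c * scale * x for c, x in zip(codes, inputs))
    err = pred - target
    # Backward: d(loss)/d(w_q) = 2 * err * x, and STE treats d(w_q)/d(w) as 1,
    # so even a weight currently quantized to 0 still receives a gradient.
    grads = [2.0 * err * x for x in inputs]
    new_weights = [w - lr * g for w, g in zip(weights, grads)]
    return new_weights, err ** 2


latent = [0.9, -0.05, -0.7]
print(ternarize(latent)[0])  # [1, 0, -1]
latent, loss = ste_sgd_step(latent, inputs=[1.0, 2.0, -1.0], target=0.5)
```

The latent weights stay in full precision during training; only the ternary codes and the scale are needed at inference time.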

	A future ternary-aware re-training of the Lite architecture would give you
	the same architecture in both precisions (FP32 and ternary), which is the
	artifact we actually want. It's queued.

	---

	## What works (verified)

	| # | Claim | Script | Result file |
	|---|---|---|---|
	| 1 | Held-out IDK gate: 9 / 10 prompts trigger the abstain template (script PASS gate: ≥ 9; verified on bundled v4) | [`reproduce/03_abstain_held_out.py`](reproduce/03_abstain_held_out.py) | [`results/claim_03_abstain.md`](results/claim_03_abstain.md) |
	| 2 | False-inability probe on the bundled set: 7 / 20 trigger refusal | [`reproduce/04_neo_false_inability.py`](reproduce/04_neo_false_inability.py) | [`results/claim_04_neo.md`](results/claim_04_neo.md) |
	| 3 | Cross-regime ID-vs-OOD AUROC ≈ chance for all 4 signals (`max_softmax_mean` ≈ 0.54); this is the table the script computes and gates on. Broken down per regime, `max_softmax_mean` reaches AUROC ≈ 0.93 on gibberish-vs-in-domain (the one working slice; documented in the result file, not recomputed by this script). | [`reproduce/02_metacog_probe.py`](reproduce/02_metacog_probe.py) | [`results/claim_02_metacog.md`](results/claim_02_metacog.md) |
	| 4 | Architecture + checkpoints + trainer work end-to-end on CPU | [`reproduce/01_benchmark.py`](reproduce/01_benchmark.py) + `pytest tests/` | n/a |
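
To make the AUROC numbers in claim 3 concrete, here is a small sketch of how an ID-vs-OOD AUROC can be computed from per-prompt scores. It assumes `max_softmax_mean` means "the largest softmax probability at each step, averaged over steps" and that higher scores should indicate in-domain prompts. Both are readings of the name, not confirmed definitions from `reproduce/02_metacog_probe.py`.

```python
import math


def max_softmax_mean(step_logits: list[list[float]]) -> float:
    """Mean over steps of the largest softmax probability (assumed definition)."""
    total = 0.0
    for logits in step_logits:
        peak = max(logits)
        # The top probability is exp(0) / sum(exp(l - peak)); subtracting peak keeps it stable.
        total += 1.0 / sum(math.exp(l - peak) for l in logits)
    return total / len(step_logits)


def auroc(id_scores: list[float], ood_scores: list[float]) -> float:
    """Probability that a random ID score exceeds a random OOD score (ties count 0.5)."""
    wins = 0.0
    for a in id_scores:
        for b in ood_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


print(auroc([0.9, 0.8], [0.1, 0.2]))  # 1.0
print(auroc([0.6, 0.4], [0.5, 0.5]))  # 0.5
```

An AUROC of 0.5 means the signal ranks ID and OOD prompts no better than a coin flip, which is how the cross-regime 0.54 should be read.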

	## What doesn't work (verified negative)

	| # | Claim that's wrong | What the evidence actually shows |
	|---|---|---|
	| N1 | "Router-entropy is an architecture-native metacognition signal" | Across 7 OOD regimes × n=30, router-entropy family wins 0 / 7 vs `max_softmax_mean`. |
	| N2 | "Lite beats vanilla 3 / 3 seeds at param-fair" | 3 Lite seeds vs 1 vanilla seed (we ran out of RunPod budget). Welch test pending a 3-seed vanilla replication. The 6.7σ figure was retracted. |
	| N3 | "Train an abstain head once, splice it onto any base model" | v7's joint-trained abstain head got AUROC 0.76 cross-regime; lifted onto v4's base it dropped to 0.54 with 27 % false-positive rate. Not modular. |
	| N4 | "Just turn off the metacog loss and the router will be left alone" | CE on in-domain still backprops through unfrozen router-Linears. 16K updates shift routing distribution → OOD generation collapses. |
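
For context on N1, a router-entropy signal is typically the Shannon entropy of the routing distribution over pathways. The sketch below computes it for a 3-pathway router from raw logits. The exact statistics in the kit's "router-entropy family" are not specified here, so treat this as one plausible member of that family, with hypothetical function names.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def router_entropy(router_logits: list[float]) -> float:
    """Shannon entropy (in nats) of the routing distribution over pathways."""
    probs = softmax(router_logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# A uniform router over 3 pathways hits the maximum, ln(3); a peaked router is near 0.
uniform = router_entropy([0.0, 0.0, 0.0])
peaked = router_entropy([10.0, 0.0, 0.0])
print(uniform, peaked)
```

The negative result says that scores of this kind did not separate OOD from in-domain prompts better than `max_softmax_mean` in any of the 7 regimes.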

	---

	## Reproducing claims

	```bash
	python reproduce/01_benchmark.py # arch loads, ~10M params (CPU, ~2 s)
	python reproduce/03_abstain_held_out.py # 9 / 10 held-out IDK gate (CPU, ~1 min)
	python reproduce/04_neo_false_inability.py # 7 / 20 false-inability rate (CPU, ~2 min)
	python reproduce/02_metacog_probe.py # cross-regime AUROC sweep (CPU, ~15 min — slow)
	```

	Each script exits non-zero if the bundled v4 checkpoint fails to produce the
	documented number within 5 %. If a script doesn't reproduce its claim on
	your machine, please open an issue.
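
The pass/fail rule described above can be sketched as a relative-tolerance gate. This is an illustration of the idea only: the function names are hypothetical, and the actual scripts may gate on counts (for example "≥ 9 of 10") rather than on a relative error.

```python
def within_tolerance(measured: float, documented: float, rel_tol: float = 0.05) -> bool:
    """True if `measured` is within rel_tol (default 5 %) of the documented value."""
    if documented == 0:
        return measured == 0
    return abs(measured - documented) <= rel_tol * abs(documented)


def gate(measured: float, documented: float, rel_tol: float = 0.05) -> int:
    """Return a process exit code: 0 on reproduction, 1 otherwise."""
    ok = within_tolerance(measured, documented, rel_tol)
    status = "PASS" if ok else "FAIL"
    print(f"{status}: measured={measured} documented={documented} tol={rel_tol:.0%}")
    return 0 if ok else 1
```

A script would end with something like `sys.exit(gate(measured_auroc, 0.54))`, so that a failed reproduction is visible to CI and shell pipelines through the exit status.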

	## What's in this repo

	| Path | What it is |
	|---|---|
	| `src/tilelli/core/` | The architecture: 8 .py files, Lite + parent variants, ternary primitives, hadamard, sparse attention, SSM |
	| `src/tilelli/baselines/vanilla.py` | The pre-norm transformer used for the A/B comparison |
	| `src/tilelli/optimisers/` | AdamW wrapper + Muon optimizer support |
	| `src/tilelli/eval/` | Metacog probe + scorer (verifies claim 02) |
	| `scripts/train.py` | Master trainer: `--model {tilelli-lite-fp32, tilelli-lite-ternary, vanilla-fp32, tilelli-fp32, tilelli-ternary}` |
	| `scripts/train_demo.py` | 5-step CPU smoke; verifies the gradient flows |
	| `scripts/prepare_tinystories.py` | Packs raw TinyStories txt → `train.bin`/`valid.bin` |
	| `chat.py`, `infer.py` | Inference entry points (chat uses v4 + KV cache; infer auto-routes) |
	| `checkpoints/` | The two ckpts above |
	| `data/tinystories_demo/` | ~700 KB train + ~70 KB valid demo slice (TinyStories CC-BY-4.0) |
	| `reproduce/` | Four claim-verification scripts |
	| `results/` | Verified claim docs + audit trail |
	| `prompts/probe_210.jsonl` | 210-prompt evaluation set across 7 regimes |
	| `tests/test_kit_smoke.py` | Three smoke tests (`pytest -q tests/`) |

	## What's NOT in this repo

	- Spectrum (power-of-3 7-level quantization) — separate research line in
	the source repo's `mosaic/spinoffs/spectrum/`. Closes ~49 % of the
	ternary→FP32 gap but is still ~12 % behind vanilla FP32. Out of scope here.
	- The FineWeb-Edu training pipeline + the SFT data that produced v4 —
	private. The minimal training loop bundled here trains on any
	`.bin` shards you provide.
	- The failed metacog ckpts (v5 / v6 / v7 / v8a / v8b / splice) — available
	on request via `hello@tilelli.tech` for negative-result replication.

	---

	## The actual interesting finding

	In a small (10 M-param) routed LM, the metacognition / uncertainty signal
	does not live in a separable module. We trained 5 variants (v5–v8b)
	sweeping the metacog-loss weight from 20 → 0, plus a splice (head-only
	graft). The best signal (cross-regime ID-vs-OOD AUROC 0.85 on `abstain_p`)
	is reached without any explicit metacog loss (v8b, BCE-only) — but at
	the cost of generation quality. The head-only splice preserves generation
	but the signal collapses (AUROC 0.76 → 0.54).

	The signal IS reachable. The module is not liftable. See
	[`PAPER_OUTLINE.md`](PAPER_OUTLINE.md) for the workshop write-up.

	## License

	Apache 2.0. See [`LICENSE`](LICENSE). The bundled weights and the TinyStories
	demo slice ship under the same license (TinyStories is CC-BY-4.0; both
	licenses permit redistribution). The "Tilelli" name is not licensed by this
	file — fork freely; rename if you ship a derivative product.