	# Metacognition in a Small Routed Language Model Is Not a Separable Module

	Tilelli LLM Team · hello@tilelli.tech
	Code, checkpoints, and the evaluation set: https://github.com/TilelliLab/Tilelli-llm (Apache-2.0)

	*Draft — workshop format (4 pages + appendix). Every number in this paper is produced by a
	script in `reproduce/` that exits non-zero if the bundled checkpoint fails to reproduce it
	within tolerance.*

	---

	## Abstract

	We study whether the gate distribution of a routed language model can be exploited as a
	metacognition / uncertainty signal at the smallest scale where routing is non-trivial
	(10.2 M parameters). We pre-registered a per-regime AUROC decision rule across 7 evaluation
	regimes and ran five training variants sweeping the metacognition-loss weight from 20 to 0,
	plus a head-only weight-graft ("splice") condition. The pre-registered claim is disproven:
	router entropy alone does not beat an output-side baseline in any of the 7 regimes. A weaker
	but informative result survives: joint router + abstain-head training reaches cross-regime
	in-domain-vs-OOD AUROC up to 0.85 on the abstain head's sigmoid output, but (i) the gain does
	not survive a head-only splice onto a fresh base (AUROC drops to 0.54, at chance), and (ii)
	every configuration that produces the gain also degrades generation. We argue these two
	negative results together bound a substantive claim about modularity: in small routed LMs the
	uncertainty signal lives in the joint {router, head} representation rather than in the head as a
	transferable module. We further isolate the mechanism — at this scale the router is fragile
	enough that cross-entropy backprop on an in-domain subset alone, with the metacognition loss set
	identically to zero, shifts the routing distribution enough to break out-of-domain generation.

	---

	## 1. Introduction

	Uncertainty and abstention heads are increasingly proposed as pluggable modules: train a small
	head to predict "I don't know," and bolt it onto a base model. This paper tests that modularity
	assumption at the small/edge scale where it would matter most, using a 10.2 M-parameter routed
	byte-level LM, and finds it fails in a specific, mechanism-explainable way.

	We make three contributions, all negative or qualifying, and all reproducible:

	1. A pre-registered, disproven claim that router entropy provides metacognition at 10 M
	parameters (Section 4).
	2. A non-transferability result for abstain heads across base models — a head that reaches
	AUROC 0.76 in situ drops to 0.54 when lifted onto a fresh base (Section 5).
	3. A mechanism for why joint training succeeds at producing the signal but breaks
	generation, including a falsifiable corollary (Section 6).

	We deliberately do not headline an architecture win. A preliminary single-seed benchmark of the
	3-pathway block against a vanilla decoder is reported in Section 3 and
	`results/claim_01_benchmark.md`; it is not a defensible result, and we say so plainly rather
	than promote it.

	## 2. Setup

	### 2.1 Model

	A 10.2 M-parameter byte-level language model: 8 layers, `d_model = 256`. Each block contains
	three parallel pathways — a local pathway (1×1 convolution), a sparse-attention pathway (top-k),
	and a dense feed-forward pathway — mixed by a learned linear gate over the hidden state,
	softmax-routed. The model was trained on FineWeb-Edu (~10 B bytes) for 12 K base steps, then
	chat-SFT, then abstain-aware SFT. The deployed checkpoint (`tilelli_chat_v4.pt`, FP32,
	unquantized) anchors every positive claim in this paper.
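
The gating described above can be illustrated with a minimal, dependency-free sketch. It is not the repository's implementation: the function names, the toy pathway stand-ins, and the omission of a gate bias are all assumptions made for illustration. It only shows the shape of the computation, a linear gate over the hidden state, a softmax over three pathways, and a weighted sum of pathway outputs.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routed_mix(
    hidden: Vector,
    pathways: Sequence[Callable[[Vector], Vector]],
    gate_matrix: Sequence[Vector],
) -> Tuple[Vector, Vector]:
    """Mix pathway outputs with a softmax gate computed from the hidden state."""
    # Linear gate: one logit per pathway (bias omitted; an assumption of this sketch).
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in gate_matrix]
    weights = softmax(logits)
    outputs = [pathway(hidden) for pathway in pathways]
    mixed = [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(hidden))
    ]
    return mixed, weights


# Hypothetical stand-ins for the local, sparse-attention, and dense pathways.
toy_pathways = [
    lambda h: list(h),
    lambda h: [2.0 * x for x in h],
    lambda h: [-x for x in h],
]
hidden = [1.0, -2.0, 0.5, 3.0]
zero_gate = [[0.0] * 4 for _ in range(3)]  # all-zero gate gives uniform routing
mixed, weights = routed_mix(hidden, toy_pathways, zero_gate)
```

The returned `weights` vector is the per-token gate distribution; it is the quantity from which the routing-side signals in Section 2.2 would be derived.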

	### 2.2 Evaluation regimes

	We hand-curated a 210-prompt probe set, 7 regimes × 30 prompts each
	(`prompts/probe_210.jsonl`): `in_domain`, `ood_topic`, `ood_style`, `long_input`, `gibberish`,
	`factual_misleading`, and `neo_false_inability` (well-formed prompts that invite a spurious
	refusal). For each prompt we record output-side and routing-side signals: `max_softmax_mean` and
	`max_softmax_last` (output-side baselines), `router_conf`, `router_entropy_mean`,
	`router_entropy_var`, the 8-vector `router_entropy_per_layer`, and `abstain_p` (the sigmoid of a
	dedicated abstain head on the final hidden state).
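
As a rough illustration of how these per-prompt signals might be computed, the sketch below uses plain Python lists. The exact definitions are assumptions, since the text names the signals without defining them: here `router_conf` is taken to be the mean top gate weight, `router_entropy_var` the variance of per-layer mean entropies, and entropies are in nats. The function and argument names are hypothetical.

```python
import math
from statistics import mean, pvariance
from typing import Dict, List, Sequence


def entropy(dist: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def probe_signals(
    gates: List[List[Sequence[float]]],  # gates[layer][token] = gate distribution
    token_max_probs: Sequence[float],    # max softmax prob per generated token
    abstain_logit: float,                # abstain-head output before the sigmoid
) -> Dict[str, object]:
    per_layer = [mean(entropy(g) for g in layer) for layer in gates]
    all_gates = [g for layer in gates for g in layer]
    return {
        "max_softmax_mean": mean(token_max_probs),
        "max_softmax_last": token_max_probs[-1],
        "router_conf": mean(max(g) for g in all_gates),  # assumed definition
        "router_entropy_mean": mean(per_layer),
        "router_entropy_var": pvariance(per_layer),      # assumed definition
        "router_entropy_per_layer": per_layer,
        "abstain_p": sigmoid(abstain_logit),
    }


# Toy input: one layer routes uniformly, one layer routes one-hot.
uniform = [1 / 3, 1 / 3, 1 / 3]
one_hot = [1.0, 0.0, 0.0]
signals = probe_signals([[uniform, uniform], [one_hot, one_hot]], [0.9, 0.7], 0.0)
```

For the real model, `gates` would hold 8 layers, so `router_entropy_per_layer` would be the 8-vector mentioned above.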

	### 2.3 Pre-registered decision rule

	Registered before the runs (`MASTER_PLAN_2026-05-23.md` in the source repo). A win in a regime
	requires an AUROC at least 0.02 above the best baseline, with a bootstrap 95% CI on that margin
	excluding zero. ≥ 4 wins including at least one of {`gibberish`, `factual_misleading`,
	`neo_false_inability`} → PROVEN; 1–3 wins → PARTIAL; 0 wins → DISPROVEN.
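
The rule can be stated as a few lines of code. This is a sketch rather than the registered script: the names are hypothetical, and one case is an assumption, because the rule as written does not say what happens with four or more wins that include none of the three named regimes. The sketch maps that case to PARTIAL.

```python
from typing import Dict, Tuple

HARD_REGIMES = frozenset({"gibberish", "factual_misleading", "neo_false_inability"})
MIN_MARGIN = 0.02  # required AUROC margin over the best baseline


def is_win(delta: float, ci_low: float) -> bool:
    """A regime is a win if the margin is large enough and the CI excludes zero."""
    return delta >= MIN_MARGIN and ci_low > 0.0


def verdict(per_regime: Dict[str, Tuple[float, float]]) -> str:
    """per_regime maps regime name -> (AUROC margin, lower end of its 95% CI)."""
    wins = {r for r, (delta, ci_low) in per_regime.items() if is_win(delta, ci_low)}
    if not wins:
        return "DISPROVEN"
    if len(wins) >= 4 and wins & HARD_REGIMES:
        return "PROVEN"
    # Covers 1-3 wins, and (by assumption) >= 4 wins with no hard regime.
    return "PARTIAL"


no_wins = {"in_domain": (0.01, -0.05), "gibberish": (0.03, -0.01)}
outcome = verdict(no_wins)
```

In the toy input, `gibberish` clears the 0.02 margin but its interval includes zero, so it does not count as a win.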

	## 3. Baselines and a benchmark we will not oversell

	The output-side baseline is `max_softmax_mean`, computed over generated tokens only, with each
	regime length-balanced (n = 30) so a length-correlated signal cannot masquerade as
	metacognition.

	Separately, on the architecture itself: on the only Tilelli-Lite seed whose training log is
	preserved (seed 1234), the comparison against a single-seed vanilla decoder **depends on the
	metric**. On the within-training periodic-eval best, Lite *loses* by 0.6% (0.5742 vs 0.5707); on
	a single post-training extra-eval pass, Lite *wins* by 0.4% (0.5685), but on a different
	validation-batch composition and with a different `eval_every` than the baseline. Two further Lite
	seeds were reported (0.5679, 0.5693) but their logs are not preserved. We therefore treat this as
	a preliminary single-seed directional finding, not a result (`results/claim_01_benchmark.md`),
	and it is not load-bearing for anything below.

	## 4. The pre-registered router-entropy claim: disproven

	Across the v3 baseline and all joint-finetune variants (v6, v7, v8a, v8b), the router-entropy
	family wins 0 / 7 regimes by the registered criterion. The pre-registered claim is
	DISPROVEN.

	Cross-regime in-domain-vs-OOD AUROC on the deployed v4 checkpoint:

	| Signal | Cross-regime AUROC |
	|---|---:|
	| `max_softmax_mean` (baseline) | 0.54 |
	| `router_conf` | 0.55 |
	| `router_entropy_mean` | 0.55 |
	| `abstain_p` | 0.51 |

	All four are essentially at chance for cross-regime discrimination. The one place an
	uncertainty signal works at this scale is `max_softmax_mean` broken down per regime: on
	gibberish-vs-in-domain it reaches AUROC ≈ 0.93. That signal is output-side and
	architecture-agnostic — it would work on any vanilla LM — so it offers no support for the
	routing-as-metacognition narrative.
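
For readers who want to check an AUROC figure by hand, the sketch below computes AUROC as a pairwise win rate and a paired bootstrap interval on the margin between a signal and a baseline. It is an illustration under assumptions, not the code in `reproduce/02_metacog_probe.py`: the function names, the resampling scheme (prompts resampled within each class, shared across both signals), and the percentile interval are choices made here. Scores are assumed to be oriented so that higher means "more likely OOD", so a confidence score such as `max_softmax_mean` would need its sign flipped first.

```python
import random
from typing import List, Sequence, Tuple


def auroc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """P(random positive scores above random negative); ties count one half."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_margin_ci(
    sig_pos: Sequence[float], sig_neg: Sequence[float],
    base_pos: Sequence[float], base_neg: Sequence[float],
    n_boot: int = 1000, seed: int = 0,
) -> Tuple[float, float]:
    """95% percentile CI for auroc(signal) - auroc(baseline) on shared resamples."""
    rng = random.Random(seed)
    margins: List[float] = []
    for _ in range(n_boot):
        # Resample prompt indices once, then score both signals on the same prompts.
        pi = [rng.randrange(len(sig_pos)) for _ in sig_pos]
        ni = [rng.randrange(len(sig_neg)) for _ in sig_neg]
        a_sig = auroc([sig_pos[i] for i in pi], [sig_neg[i] for i in ni])
        a_base = auroc([base_pos[i] for i in pi], [base_neg[i] for i in ni])
        margins.append(a_sig - a_base)
    margins.sort()
    return margins[n_boot * 25 // 1000], margins[n_boot * 975 // 1000 - 1]


ood_scores = [0.9, 0.8, 0.7]
in_domain_scores = [0.1, 0.2, 0.3]
separable = auroc(ood_scores, in_domain_scores)
```

An AUROC of 0.5 corresponds to chance, which is the sense in which the four cross-regime values in the table are "at chance".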

	## 5. Cross-regime AUROC and the splice test

	A looser question — does any signal separate in-domain from OOD after joint training? — has a
	more interesting answer. We swept the metacognition-loss weight from 20 → 5 → 0 while keeping an
	abstain BCE term:

	| Variant | metacog wt | abstain wt | `abstain_p` AUROC | gibberish mean `abstain_p` | in-domain FP @ 0.775 | generation coherent? |
	|---|---:|---:|---:|---:|---:|:--:|
	| v4 (base SFT only) | – | – | 0.51 | 0.60 | 0% | yes |
	| v7 | 20 | 1 | 0.76 | 0.94 | 20% | no |
	| v8a | 5 | 1 | 0.80 | 0.97 | 23% | no |
	| v8b | 0 | 5 | 0.85 | 1.00 | 10% | no |
	| splice (v4 base + v7 head) | – | – | 0.54 | 0.46 | 27% | yes (v4-like) |

	Two findings stand out.

	(1) The losses compete; they do not synergize. The cross-regime signal *strengthens
	monotonically as the metacognition weight goes to zero*. v8b, with zero metacognition pressure,
	produces the strongest abstain signal in the entire project (AUROC 0.85, gibberish mean 1.00).
	Adding the metacognition loss makes the discrimination worse, not better — the two losses
	contend for the router's limited representation budget.

	(2) The signal does not survive a head-only splice. Lifting v7's trained abstain head onto
	v4's frozen base gives AUROC 0.54 — at chance, despite v7 itself reaching 0.76 — and makes
	behavior worse, not neutral, raising the in-domain false-positive rate to 27%:

	| Deploy gate | v4 | splice | v7 |
	|---|---:|---:|---:|
	| gibberish mean `abstain_p` (target > 0.775) | 0.60 ✗ | 0.46 ✗ | 0.94 ✓ |
	| in-domain false-positive rate (target ≤ 0%) | 0% | 27% | 20% |
	| chat coherence | ✓ | ✓ (v4-like) | ✗ broken |

	### 5.1 Why the splice fails

	A trained abstain head learns to read residual-stream patterns specific to its co-trained router.
	Joint training shifts the router, which reshapes the residual stream; the head reads those
	reshaped patterns. Lift the head onto a fresh base and the patterns are gone — consistent with
	the literature on feature non-transferability in linear probes. The uncertainty signal is a
	property of the joint {router-perturbation, head} representation, not of the head alone.
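
Mechanically, a head-only splice amounts to copying the head's parameters from one checkpoint into another while leaving everything else untouched. The sketch below shows this on plain dictionaries; the `abstain_head.` key prefix and the parameter names are hypothetical, and the real checkpoints may be organized differently. The same pattern would be expected to apply to the mapping returned by a PyTorch `state_dict()`.

```python
from typing import Any, Dict


def splice_head(
    base_state: Dict[str, Any],
    donor_state: Dict[str, Any],
    head_prefix: str = "abstain_head.",  # hypothetical key prefix
) -> Dict[str, Any]:
    """Return a copy of base_state with the donor's head parameters swapped in."""
    head_keys = [k for k in donor_state if k.startswith(head_prefix)]
    if not head_keys:
        raise KeyError(f"no parameters with prefix {head_prefix!r} in donor")
    spliced = dict(base_state)  # shallow copy; the base mapping is not mutated
    for key in head_keys:
        spliced[key] = donor_state[key]
    return spliced


# Toy checkpoints with placeholder values instead of tensors.
v4_like = {"blocks.0.router.weight": "v4-router", "abstain_head.weight": "v4-head"}
v7_like = {"blocks.0.router.weight": "v7-router", "abstain_head.weight": "v7-head"}
spliced = splice_head(v4_like, v7_like)
```

The spliced model keeps the base router, so the head is evaluated on residual-stream patterns it was never trained against, which is the situation the paragraph above describes.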

	## 6. The router-fragility mechanism

	v8b sets the metacognition weight to exactly zero: only cross-entropy on the in-domain subset and
	BCE on the abstain head contribute gradient, and the only unfrozen parameters are the router
	linears plus the abstain linear. v8b still breaks generation — sometimes more severely than
	v7, which had a metacognition weight of 20.
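
One way to write the objective implied by the sweep in Section 5 is shown below. The additive form, the unit weight on the cross-entropy term, and the symbols $\lambda_{\mathrm{meta}}$, $\lambda_{\mathrm{abs}}$, $\theta_r$ are assumptions introduced here for illustration; they are not taken from the training code.

$$
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\mathrm{CE}}^{\mathrm{in}}(\theta) \;+\; \lambda_{\mathrm{meta}}\,\mathcal{L}_{\mathrm{meta}}(\theta) \;+\; \lambda_{\mathrm{abs}}\,\mathcal{L}_{\mathrm{BCE}}(\theta),
\qquad
(\lambda_{\mathrm{meta}}, \lambda_{\mathrm{abs}}) =
\begin{cases}
(20,\,1) & \text{v7} \\
(5,\,1) & \text{v8a} \\
(0,\,5) & \text{v8b}
\end{cases}
$$

Under this reading, the gradient reaching the router parameters $\theta_r$ in v8b is

$$
\nabla_{\theta_r}\mathcal{L} \;=\; \nabla_{\theta_r}\mathcal{L}_{\mathrm{CE}}^{\mathrm{in}} \;+\; 5\,\nabla_{\theta_r}\mathcal{L}_{\mathrm{BCE}},
$$

which is generally nonzero even though $\lambda_{\mathrm{meta}} = 0$. Setting the metacognition weight to zero removes one term but does not stop the router from being updated.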

	Diagnosis: even with the metacognition loss identically zero, the in-domain cross-entropy term
	backprops through the output head into the residual stream and from there into the unfrozen router
	linears. Roughly 16,000 in-domain updates (500 steps × 32) shift the routing distribution enough
	to break the routing the rest of the (frozen) model was tuned against; OOD generation then
	collapses. At this scale the router cannot be retrained on any subset distribution without
	disrupting generation elsewhere.

	Falsifiable corollary (queued, not yet run): additionally freeze the router linears and train
	only the abstain linear under BCE. We predict (a) the abstain head still reaches strong
	cross-regime AUROC, because its signal comes from the residual-stream pattern rather than from
	re-routing, and (b) generation is preserved. Confirmation would localize the damage precisely to
	router re-tuning.
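
The difference between the v8b configuration and the proposed corollary comes down to which parameters are left trainable. A minimal sketch of that selection follows, assuming hypothetical parameter names (`.router.` inside block names and an `abstain_head.` prefix) that may not match the repository.

```python
from typing import Iterable, List


def trainable_names(param_names: Iterable[str], freeze_router: bool) -> List[str]:
    """Select parameters to leave unfrozen.

    freeze_router=False mirrors the v8b setup (router linears + abstain linear);
    freeze_router=True mirrors the corollary (abstain linear only).
    """
    keep = []
    for name in param_names:
        is_router = ".router." in name            # hypothetical naming convention
        is_abstain = name.startswith("abstain_head.")
        if is_abstain or (is_router and not freeze_router):
            keep.append(name)
    return keep


names = [
    "blocks.0.router.weight",
    "blocks.0.ffn.weight",
    "blocks.1.router.weight",
    "abstain_head.weight",
    "abstain_head.bias",
]
v8b_trainable = trainable_names(names, freeze_router=False)
corollary_trainable = trainable_names(names, freeze_router=True)
```

In a PyTorch training script, the returned names would typically be used to set `requires_grad` on the matching parameters; that step is omitted here to keep the sketch dependency-free.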

	## 7. The deployed operating point (what actually works)

	The practical recommendation at this scale is not joint finetuning: it is `max_softmax_mean`
	plus abstain-aware SFT. The deployed v4 checkpoint, using exactly that recipe, reaches 9 / 10
	on the bundled held-out "I don't know" gate (PASS gate ≥ 9; the deploy probe was 10 / 10 on
	slightly different phrasing) with a 0% in-domain false-positive rate at threshold 0.775
	(calibrated on held-out data). On a separate false-inability probe it fires the refusal template
	on 7 / 20 answerable prompts — precision-bounded by SFT coverage. These are precision claims
	about a head working on its trained pattern, not generalization claims; on semantic OOD outside
	the SFT distribution the same head is at chance (Section 4).
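
The operating point above reduces to a threshold on `abstain_p`. The sketch below shows the gate and the false-positive-rate calculation; the strict `>` comparison and the function names are assumptions, and the scores in the example are made up to show the arithmetic rather than taken from the probe set.

```python
from typing import Sequence

THRESHOLD = 0.775  # operating threshold reported in the text


def should_abstain(abstain_p: float, threshold: float = THRESHOLD) -> bool:
    """Fire the refusal path when the abstain score exceeds the threshold."""
    return abstain_p > threshold  # strict comparison is an assumption


def false_positive_rate(
    in_domain_scores: Sequence[float], threshold: float = THRESHOLD
) -> float:
    """Fraction of answerable (in-domain) prompts on which the gate fires."""
    fired = sum(1 for p in in_domain_scores if should_abstain(p, threshold))
    return fired / len(in_domain_scores)


# Made-up scores: 6 of 30 in-domain prompts exceed the threshold.
example_scores = [0.9] * 6 + [0.1] * 24
example_fpr = false_positive_rate(example_scores)
```

With 30 prompts per regime, each firing moves the rate by about 3.3 percentage points, so the 20% and 27% figures in Section 5 are consistent with 6 and 8 firings respectively.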

	## 8. Discussion

	What we did not show: that any of this holds at 100 M or 1 B parameters. The router-fragility
	argument is explicitly scale-dependent — a larger router with more capacity may absorb in-domain
	updates without disrupting OOD routing. We leave that open. What we did show, at the scale we
	tested: (1) the router-entropy-as-metacognition narrative is dead at 10 M; (2) abstain heads in
	small routed LMs are not modular; (3) the strongest joint signal is reached by removing the
	metacognition loss, not adding it.

	## 9. Related work

	Ternary base models at scale (e.g. BitNet b1.58) motivate small-model interest but do not address
	modular uncertainty. Work treating sparse features as liftable modules is closer to our setting;
	we supply a counterexample, showing that the lifting fails for abstain heads in routed LMs. Most
	calibration work (ECE, temperature scaling, learned uncertainty heads) operates at 100 M+ scale;
	our finding is small-scale specific.

	## 10. Limitations and reproducibility

	10.2 M parameters only; architecture-specific (3-pathway routed block). The v8 sweep uses one
	base checkpoint and v4 another (history dependence). The probe set is hand-curated and
	inter-rater reliability is not measured. Cost: ~$0.35 of GPU for the v8 sweep, the rest CPU.
	Every headline number is bound to a script:

	```bash
	python reproduce/01_benchmark.py # arch loads, ~10 M params (CPU, ~2 s)
	python reproduce/03_abstain_held_out.py # 9 / 10 held-out IDK gate (CPU, ~1 min)
	python reproduce/04_neo_false_inability.py # 7 / 20 false-inability (CPU, ~2 min)
	python reproduce/02_metacog_probe.py # cross-regime AUROC sweep (CPU, ~15 min)
	```

	Each exits non-zero if the bundled v4 checkpoint fails to produce the documented number within
	tolerance.
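
The exit-code contract can be sketched as follows. This is an illustration of the pattern only: the function names and the tolerance value are hypothetical, and the actual scripts in `reproduce/` may structure the check differently.

```python
import sys


def within_tolerance(measured: float, documented: float, tol: float) -> bool:
    """True if the measured value reproduces the documented one within tol."""
    return abs(measured - documented) <= tol


def exit_code(measured: float, documented: float, tol: float) -> int:
    """Return 0 on a successful reproduction and 1 otherwise."""
    if within_tolerance(measured, documented, tol):
        print(f"PASS: measured {measured:.3f}, documented {documented:.3f}")
        return 0
    print(
        f"FAIL: measured {measured:.3f}, documented {documented:.3f}, tol {tol}",
        file=sys.stderr,
    )
    return 1


# Hypothetical tolerance of 0.02 around a documented AUROC of 0.51.
status = exit_code(measured=0.52, documented=0.51, tol=0.02)
```

A script following this pattern would end with `sys.exit(status)`, so a failed reproduction is visible to CI or to a shell `&&` chain.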

	## Appendix (sketch)

	- A1 Full 7-regime × variant AUROC matrix.
	- A2 Sample generations for all 5 variants on 5 representative prompts.
	- A3 Training curves (abstain gap, entropy gap, CE) for v7 / v8a / v8b.
	- A4 The 210-prompt probe set (`prompts/probe_210.jsonl`).
	- A5 Checkpoints and SHAs for all variants (negative-result checkpoints available on request
	via hello@tilelli.tech).