	# Paper outline — Metacognition in a small routed LM is not a separable module

	Status: outline only (not yet a draft). 4-page workshop format target.

	Candidate venues:
	- NeurIPS UnReg / "I Can't Believe It's Not Better" workshop
	- BlackboxNLP (EMNLP workshop)
	- ICLR Re-Align / Tiny Papers
	- arXiv as a short technical report regardless

	Target length: 4 pages + appendix. ~3,000 words main.

	---

	## 0. Title + abstract (1 paragraph)

	> We study whether the gate distribution of a routed language model can
	> be exploited as a metacognition / uncertainty signal at the smallest
	> scale where routing is non-trivial (10 M parameters). We pre-registered
	> a per-regime AUROC decision rule across 7 evaluation regimes and ran
	> five training variants sweeping the metacog-loss weight from 20 to 0
	> plus a head-only weight-graft (splice) condition. The pre-registered
	> claim is disproven: router entropy alone does not beat output-side
	> baselines in any regime. A weaker but informative result survives:
	> joint router + abstain-head training reaches cross-regime ID-vs-OOD
	> AUROC up to 0.85 on the abstain head's sigmoid output, but the gain
	> does not survive a head-only splice onto a fresh base (AUROC drops to
	> 0.54), and every training configuration that produces the gain also
	> degrades generation. We argue these two negative results together
	> bound a substantive claim about modularity: in small routed LMs, the
	> uncertainty signal lives in the joint {router, head} representation
	> rather than in the head as a transferable module.

	## 1. Introduction (~ 0.5 page)

	- One sentence on why uncertainty heads matter in small/edge models.
	- Hook: many proposals treat the abstain or uncertainty head as a
	pluggable module. We test this at small scale and it fails in a
	specific, mechanism-explainable way.
	- Three contributions:
	1. A pre-registered claim that router entropy provides metacognition
	at 10 M params, tested and DISPROVEN (Section 4).
	2. A non-transferability result for abstain heads across base models
	(Section 5).
	3. A mechanism for why joint-training succeeds at signal but breaks
	generation (Section 6).
	- All code + ckpts + probe set released under Apache 2.0.

	## 2. Setup (~ 0.5 page)

	### 2.1 Model
	- 10.2 M-parameter byte-level LM, 8 layers, d_model 256.
	- Each block has 3 pathways: local (1×1 conv), sparse attention (top-k),
	dense FFN. Gate is a learned Linear over hidden state, softmax-routed.
	- Trained on FineWeb-Edu (~10 B bytes), 12 K base steps, then chat-SFT.
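
Illustrative sketch of the softmax-routed mixing described above, in plain Python. This is not the repo's code: the pathway functions are toy placeholders rather than the real 1×1 conv, top-k attention, or FFN, the names `routed_block`, `gate_weight`, and `toy_pathways` are hypothetical, and combining pathway outputs as a gate-weighted sum is an assumption.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weight, bias, h):
    """weight has one row per output unit; h is the hidden-state vector."""
    return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weight, bias)]

def routed_block(h, gate_weight, gate_bias, pathways):
    """Mix pathway outputs with a softmax gate computed from h.

    Assumption: the block output is the gate-weighted sum of pathway outputs.
    """
    gate = softmax(linear(gate_weight, gate_bias, h))
    outputs = [pathway(h) for pathway in pathways]
    mixed = [sum(g * out[i] for g, out in zip(gate, outputs)) for i in range(len(h))]
    return mixed, gate

# Toy placeholders standing in for the local / sparse-attention / dense-FFN pathways.
toy_pathways = [
    lambda h: list(h),
    lambda h: [2.0 * x for x in h],
    lambda h: [-x for x in h],
]

h = [1.0, 0.5, -0.5, 2.0]
zero_gate_weight = [[0.0] * len(h) for _ in toy_pathways]
zero_gate_bias = [0.0] * len(toy_pathways)
mixed, gate = routed_block(h, zero_gate_weight, zero_gate_bias, toy_pathways)
```

With an all-zero gate the three pathways receive equal weight, so `gate` is uniform. The returned `gate` vector is the per-token distribution that the router-entropy signals in Section 2.2 would be computed from.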

	### 2.2 Evaluation regimes
	- 7 regimes × 30 prompts = 210-prompt probe set.
	- in_domain, ood_topic, ood_style, long_input, gibberish,
	factual_misleading, neo_false_inability.
	- Per-prompt signals recorded: max_softmax_mean, max_softmax_last,
	router_conf, router_entropy_mean, router_entropy_var,
	router_entropy_per_layer (8-vec), abstain_p.
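
Illustrative sketch of how the router-side signals could be derived from recorded gate distributions. The exact definitions are assumptions, not taken from the repo: entropy is Shannon entropy in nats, `router_entropy_var` is taken as the population variance over all (layer, token) entropies, and `router_conf` is taken as the mean maximum gate probability. The input layout `gates[layer][token]` is also hypothetical.

```python
import math
from statistics import mean, pvariance

def entropy(p):
    """Shannon entropy (nats) of one gate distribution; zero entries are skipped."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def router_signals(gates):
    """gates[layer][token] is one gate distribution over the 3 pathways."""
    per_layer = [mean(entropy(g) for g in layer) for layer in gates]
    all_entropies = [entropy(g) for layer in gates for g in layer]
    return {
        "router_entropy_per_layer": per_layer,            # one value per layer
        "router_entropy_mean": mean(all_entropies),
        "router_entropy_var": pvariance(all_entropies),   # assumed: over layers and tokens
        "router_conf": mean(max(g) for layer in gates for g in layer),  # assumed definition
    }

# Toy input: 8 layers, 2 tokens each (one uniform gate, one fully peaked gate).
example_gates = [[[1 / 3, 1 / 3, 1 / 3], [1.0, 0.0, 0.0]] for _ in range(8)]
signals = router_signals(example_gates)
```

The uniform gate has entropy ln 3 and the peaked gate has entropy 0, so the toy mean entropy is ln 3 / 2.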

	### 2.3 Pre-registered decision rule (pre-registered in the source repo's MASTER_PLAN_2026-05-23.md)
	- "Win" = a router-entropy signal's AUROC exceeds the best baseline's by
	≥ 0.02, with a bootstrap 95 % CI on the difference that excludes zero,
	for a given regime.
	- Wins ≥ 4 incl. one of {gibberish, factual_misleading, neo_false_inability} → PROVEN.
	- 1–3 wins → PARTIAL.
	- 0 wins → DISPROVEN.
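
Illustrative sketch of the decision rule in plain Python. Several details are assumptions not stated in the outline: AUROC is computed by pairwise comparison with ties counted as 0.5, the bootstrap resamples positive and negative prompts with replacement using the same indices for candidate and baseline, and "NEO" is read as the `neo_false_inability` regime. Function names are hypothetical.

```python
import random

KEY_REGIMES = {"gibberish", "factual_misleading", "neo_false_inability"}

def auroc(pos, neg):
    """Probability that a positive outscores a negative (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_delta_ci(cand_pos, cand_neg, base_pos, base_neg, n_boot=1000, seed=0):
    """95 % percentile CI for AUROC(candidate) - AUROC(baseline), paired by prompt."""
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_boot):
        pi = [rng.randrange(len(cand_pos)) for _ in cand_pos]
        ni = [rng.randrange(len(cand_neg)) for _ in cand_neg]
        cand = auroc([cand_pos[i] for i in pi], [cand_neg[j] for j in ni])
        base = auroc([base_pos[i] for i in pi], [base_neg[j] for j in ni])
        deltas.append(cand - base)
    deltas.sort()
    return deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot) - 1]

def is_win(delta, ci_low, margin=0.02):
    """delta = candidate AUROC minus the best baseline AUROC for one regime."""
    return delta >= margin and ci_low > 0.0

def verdict(winning_regimes):
    wins = set(winning_regimes)
    if len(wins) >= 4 and wins & KEY_REGIMES:
        return "PROVEN"
    if 1 <= len(wins) <= 3:
        return "PARTIAL"
    if not wins:
        return "DISPROVEN"
    return "UNSPECIFIED"  # >= 4 wins, none in KEY_REGIMES: not covered by the rule as written

# Toy check: a perfectly separating candidate against an uninformative baseline.
ci_low, ci_high = bootstrap_delta_ci(
    [0.9, 0.8, 0.7], [0.1, 0.2, 0.3], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5], n_boot=200
)
```

As written, the rule does not name a verdict for ≥ 4 wins with none in the key set; the sketch returns "UNSPECIFIED" for that case rather than guessing.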

	## 3. Baselines (~ 0.3 page)

	- max_softmax_mean as the output-side baseline; computed over the
	generated tokens only.
	- Length-balanced per regime (n = 30 each) so that any signal that
	correlates with prompt length is controlled.
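
Illustrative sketch of the output-side baseline. Assumptions: `step_logits[t]` is the logit vector used to score the token at position `t`, positions from `prompt_len` onward are the generated tokens, and `max_softmax_last` refers to the last generated token. The function names and the 2-way toy vocabulary are hypothetical (the real model is byte-level).

```python
import math

def max_softmax(logits):
    """Largest softmax probability for one logit vector (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return max(exps) / sum(exps)

def max_softmax_signals(step_logits, prompt_len):
    """Aggregate max-softmax over generated positions only (prompt positions skipped)."""
    generated = step_logits[prompt_len:]
    probs = [max_softmax(logits) for logits in generated]
    return {
        "max_softmax_mean": sum(probs) / len(probs),
        "max_softmax_last": probs[-1],
    }

# Toy input: one prompt position, then two generated positions.
toy_logits = [[0.0, 0.0], [math.log(3.0), 0.0], [0.0, 0.0]]
baseline = max_softmax_signals(toy_logits, prompt_len=1)
```

In the toy input the two generated positions have max-softmax 0.75 and 0.5, so the mean is 0.625.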

	## 4. The pre-registered router-entropy claim (~ 0.5 page)

	Result: DISPROVEN under the strict criterion. Across v3 (baseline) and all
	joint-finetune variants (v6, v7, v8a, v8b), the router-entropy family
	wins 0 / 7 regimes by the decision rule (Table 1).

	Auxiliary cross-regime AUROC (the looser test of "does this signal
	separate in-domain from OOD") tells a different story: it improves
	substantially under joint training. Save for Section 5.

	## 5. Cross-regime AUROC + the splice test (~ 1 page)

	### 5.1 Sweep over metacog-loss weight (v7 → v8b)

	| Variant | metacog wt | abstain wt | abstain_p AUROC | gibberish mean ab_p | in-domain FP @ 0.775 | gen coherent? |
	|---|---:|---:|---:|---:|---:|---|
	| v4 (base SFT only) | – | – | 0.51 | 0.60 | 0 % | yes |
	| v7 | 20 | 1 | 0.76 | 0.94 | 20 % | NO |
	| v8a | 5 | 1 | 0.80 | 0.97 | 23 % | NO |
	| v8b | 0 | 5 | 0.85 | 1.00 | 10 % | NO |
	| splice (v4 base + v7 abstain head) | – | – | 0.54 | 0.46 | 27 % | yes (v4-like) |
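
Illustrative sketch of the "in-domain FP @ 0.775" column. Assumptions: the rate is the fraction of the 30 in_domain prompts whose `abstain_p` meets or exceeds the threshold, and the comparison is inclusive (`>=`). The function name is hypothetical.

```python
def false_positive_rate(in_domain_abstain_p, threshold=0.775):
    """Fraction of in-domain prompts that would be (wrongly) flagged for abstention."""
    flagged = sum(p >= threshold for p in in_domain_abstain_p)
    return flagged / len(in_domain_abstain_p)

# Toy scores: 6 of 30 in-domain prompts exceed the threshold.
toy_scores = [0.9] * 6 + [0.1] * 24
toy_fp = false_positive_rate(toy_scores)
```

Under the n = 30 assumption, the reported rates of 20 %, 23 %, 10 %, and 27 % are consistent with 6, 7, 3, and 8 flagged prompts respectively.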

	Two findings stand out:

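	1. The cross-regime signal strengthens monotonically as the metacog
	weight goes to zero (AUROC 0.76 → 0.80 → 0.85 at weights 20 → 5 → 0).
	This suggests the two losses compete for the router's representation
	budget rather than reinforce each other. Caveat: v8b also raises the
	abstain weight from 1 to 5, so the last step changes two factors at once.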
	2. The signal does not survive a head-only splice. Lifting v7's
	trained abstain head onto v4's base gives AUROC 0.54, near chance and
	close to v4's own 0.51, even though v7 itself reached 0.76. The signal
	lives in the joint {router perturbation, head} representation.

	### 5.2 Why the splice fails (mechanism, ~ 0.3 page)

	A trained abstain head learns to read patterns in the residual stream
	that are specific to its training-time co-trained router. The router's
	shift under joint training reshapes the residual stream; the head reads
	those reshaped patterns. Lift the head onto a fresh base and the
	patterns are gone. This is consistent with the literature on feature
	non-transferability in linear probes (cite).
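
Illustrative sketch of a head-only splice on name-keyed parameter dictionaries. The parameter names and the `abstain_head.` prefix are hypothetical, and plain lists stand in for tensors; the actual checkpoint layout is not specified in this outline.

```python
def splice_head(base_state, donor_state, head_prefix="abstain_head."):
    """Return a copy of base_state with only the head parameters taken from donor_state."""
    spliced = dict(base_state)  # shallow copy: base_state itself is left untouched
    for name, value in donor_state.items():
        if name.startswith(head_prefix):
            spliced[name] = value
    return spliced

# Toy states: the donor plays the role of v7, the base the role of v4.
base_state = {
    "blocks.0.router.weight": [0.1],
    "abstain_head.weight": [0.0],
    "abstain_head.bias": [0.0],
}
donor_state = {
    "blocks.0.router.weight": [0.9],
    "abstain_head.weight": [0.7],
    "abstain_head.bias": [-0.2],
}
spliced_state = splice_head(base_state, donor_state)
```

The spliced model keeps the base router, so the head is evaluated on a residual stream it was never trained against, which is the condition the splice test probes.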

	## 6. The router-fragility mechanism (~ 0.7 page)

	Setup: v8b sets metacog_weight = 0 and abstain_weight = 5. The
	metacog loss is identically zero — only CE on the in-domain subset and
	BCE on the abstain head contribute gradient. The only unfrozen
	parameters are router-Linears + abstain Linear.
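
Written out, assuming the joint objective is a plain weighted sum (the outline gives the weights but not the exact functional form, so the symbols below are illustrative):

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{CE}}^{\text{in-domain}}
  + \lambda_{\text{mc}}\,\mathcal{L}_{\text{metacog}}
  + \lambda_{\text{ab}}\,\mathcal{L}_{\text{BCE}}^{\text{abstain}},
\qquad
\text{v8b: } \lambda_{\text{mc}} = 0,\ \lambda_{\text{ab}} = 5
\;\Rightarrow\;
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{CE}}^{\text{in-domain}}
  + 5\,\mathcal{L}_{\text{BCE}}^{\text{abstain}} .
```

Under the same assumption, v7 corresponds to (λ_mc, λ_ab) = (20, 1) and v8a to (5, 1), matching the table in Section 5.1.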

	Observation: v8b still breaks generation, sometimes more severely
	than v7 (which had MC = 20).

	Diagnosis: even with MC = 0, the CE-on-in-domain term backprops
	through the model's output head into the residual stream and from there
	into the unfrozen router-Linears. 500 × 32 = 16 000 in-domain updates
	shift the routing distribution away from the one the rest of the
	(frozen) model was tuned against. OOD generation then collapses.

	Falsifiable corollary: if we additionally freeze the router-Linears
	during BCE-only training (leave only the abstain Linear trainable), we
	predict (a) the abstain head still reaches strong cross-regime AUROC
	because its signal comes from the residual-stream pattern, not from
	re-routing, and (b) generation is preserved. **This experiment is not
	in the current paper; queued.**
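
Illustrative sketch of the difference between the v8b setup and the queued experiment, expressed as a choice of trainable parameters. Parameter names (`.router.`, `abstain_head.`) and the helper `trainable_names` are hypothetical.

```python
def trainable_names(param_names, freeze_router):
    """Select which parameters receive gradient updates.

    freeze_router=False: router-Linears + abstain Linear (the v8b-style setup).
    freeze_router=True:  abstain Linear only (the queued corollary experiment).
    """
    selected = []
    for name in param_names:
        is_router = ".router." in name
        is_head = name.startswith("abstain_head.")
        if is_head or (is_router and not freeze_router):
            selected.append(name)
    return selected

# Toy parameter list with hypothetical names.
toy_params = [
    "blocks.0.router.weight",
    "blocks.0.ffn.weight",
    "lm_head.weight",
    "abstain_head.weight",
]
v8b_style = trainable_names(toy_params, freeze_router=False)
queued_style = trainable_names(toy_params, freeze_router=True)
```

In the queued configuration no gradient reaches the router, so any change in routing behaviour would falsify the diagnosis above.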

	## 7. Discussion (~ 0.5 page)

	- What we did NOT show: that this result holds at 100 M or 1 B params.
	The router-fragility argument is scale-dependent — a larger router
	with more capacity may absorb 16 K in-domain updates without
	disrupting OOD routing. We leave this open.
	- What we DID show, at the scale we tested:
	1. The router-entropy-as-metacognition narrative is dead at 10 M.
	2. Abstain heads in small routed LMs are not modular.
	3. The strongest joint signal is reached by removing the metacog
	loss, not adding it.
	- Practical recommendation: at this scale, use `max_softmax_mean` +
	abstain-aware SFT (not joint finetune). The deployed model uses
	exactly this configuration and scores 9 / 10 on the bundled held-out
	IDK probe (gate ≥ 9; the deploy probe scored 10 / 10 on slightly
	different phrasing) with a 0 % in-domain false-positive rate.

	## 8. Related work (~ 0.3 page)

	- BitNet b1.58 (Microsoft 2025): ternary base model at scale.
	- Anthropic features-as-modules: closer to the positive case for
	modularity (features ARE liftable in their analysis). We show that this
	liftability fails for abstain heads in the routed-LM setting.
	- Calibration literature: ECE, temperature scaling, learned uncertainty
	heads. Most work is at 100 M+ scale; our finding is specific to small
	scale.

	## 9. Limitations + reproducibility

	- 10 M params only. Architecture-specific (3-pathway routed block).
	- One base ckpt for v8 sweep; another for v4 (history dependence).
	- Probe set is hand-curated; some prompts may be ambiguous between
	regimes. Inter-rater reliability not measured.
	- Cost reproducibility: $0.35 GPU for the v8 sweep; rest CPU. Full kit
	+ scripts at https://github.com/TilelliLab/Tilelli-llm.

	---

	## What's not in scope

	- A defense of the 3-pathway block as an architecture. We document the
	preliminary benchmark in `results/claim_01_benchmark.md` but it is
	not the headline.
	- A treatment of the deployed routing-pathway-attribution UI
	(chat.tilelli.tech). That's a system + UX contribution best suited
	for a separate venue (HCI/demo).

	## Appendix sketch

	- A1: full AUROC matrix (7 regimes × 7 signals) for each variant
	- A2: sample generations for all 5 variants × 5 representative prompts
	- A3: training-curve plots (ab_gap, ent_gap, ce) for v7 / v8a / v8b
	- A4: the 210-prompt probe set as a CSV
	- A5: ckpts + SHAs of all variants

	## Timing

	- Write 1st draft: 2 days
	- Send to 2 reviewers: 1 week
	- Revise + submit: 1 week
	- Target ready-for-submission: 2026-06-10