| # Paper outline: *Metacognition in a small routed LM is not a separable module* |
|
|
| **Status:** outline only (not yet a draft). 4-page workshop format target. |
|
|
| **Candidate venues:** |
| - NeurIPS UnReg / "I Can't Believe It's Not Better" workshop |
| - BlackboxNLP (EMNLP workshop) |
| - ICLR Re-Align / Tiny Papers |
| - arXiv as a short technical report regardless |
|
|
| **Target length:** 4 pages + appendix. ~3,000 words main. |
|
|
| --- |
|
|
| ## 0. Title + abstract (1 paragraph) |
|
|
| > We study whether the gate distribution of a routed language model can |
| > be exploited as a metacognition / uncertainty signal at the smallest |
| > scale where routing is non-trivial (10 M parameters). We pre-registered |
| > a per-regime AUROC decision rule across 7 evaluation regimes and ran |
| > five training variants sweeping the metacog-loss weight from 20 to 0 |
| > plus a head-only weight-graft (splice) condition. The pre-registered |
| > claim is **disproven**: router entropy alone does not beat output-side |
| > baselines in any regime. **A weaker but informative result survives:** |
| > joint router + abstain-head training reaches cross-regime ID-vs-OOD |
| > AUROC up to 0.85 on the abstain head's sigmoid output, but the gain |
| > does not survive a head-only splice onto a fresh base (AUROC drops to |
| > 0.54), and every training configuration that produces the gain also |
| > degrades generation. We argue these two negative results together |
| > bound a substantive claim about modularity: in small routed LMs, the |
| > uncertainty signal lives in the joint {router, head} representation |
| > rather than in the head as a transferable module. |
|
|
| ## 1. Introduction (~ 0.5 page) |
|
|
| - One sentence on why uncertainty heads matter in small/edge models. |
| - Hook: many proposals treat the abstain or uncertainty head as a |
| pluggable module. We test this at small scale and it fails in a |
| specific, mechanism-explainable way. |
| - Three contributions: |
| 1. A pre-registered claim, now DISPROVEN, that router entropy provides |
| metacognition at 10 M params (Section 4). |
| 2. A non-transferability result for abstain heads across base models |
| (Section 5). |
| 3. A mechanism for why joint training yields the signal but breaks |
| generation (Section 6). |
| - All code + ckpts + probe set released under Apache 2.0. |
| |
| ## 2. Setup (~ 0.5 page) |
|
|
| ### 2.1 Model |
| - 10.2 M-parameter byte-level LM, 8 layers, d_model 256. |
| - Each block has 3 pathways: local (1×1 conv), sparse attention (top-k), |
| dense FFN. Gate is a learned Linear over hidden state, softmax-routed. |
| - Trained on FineWeb-Edu (~10 B bytes), 12 K base steps, then chat-SFT. |
| |
| ### 2.2 Evaluation regimes |
| - 7 regimes × 30 prompts = 210-prompt probe set. |
| - in_domain, ood_topic, ood_style, long_input, gibberish, |
| factual_misleading, neo_false_inability. |
| - Per-prompt signals recorded: max_softmax_mean, max_softmax_last, |
| router_conf, router_entropy_mean, router_entropy_var, |
| router_entropy_per_layer (8-vec), abstain_p. |
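
The router-entropy signals listed above could be computed roughly as follows. This is a minimal sketch, not the repository's implementation: the nested-list layout `gates[layer][token]`, the use of nats, and taking the variance over all (layer, token) entropies are assumptions made for illustration.

```python
import math
from statistics import mean, pvariance


def gate_entropy(gate_probs):
    """Shannon entropy (in nats) of one softmax gate distribution over pathways."""
    return -sum(p * math.log(p) for p in gate_probs if p > 0.0)


def router_entropy_signals(gates):
    """Summarise gate entropy for one prompt.

    gates[layer][token] is a list of pathway probabilities that sums to 1
    (hypothetical layout; the real logging format may differ).
    """
    # One mean entropy per layer (an 8-vector for an 8-layer model).
    per_layer = [mean(gate_entropy(g) for g in layer) for layer in gates]
    # Pool every (layer, token) entropy for the scalar summaries.
    flat = [gate_entropy(g) for layer in gates for g in layer]
    return {
        "router_entropy_per_layer": per_layer,
        "router_entropy_mean": mean(flat),
        "router_entropy_var": pvariance(flat),
    }
```

A uniform gate over 3 pathways gives the maximum entropy `log(3)`, and a one-hot gate gives 0, so higher values mean less decisive routing.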
| |
| ### 2.3 Pre-registered decision rule (source repo: MASTER_PLAN_2026-05-23.md) |
| - "Win" = AUROC ≥ 0.02 above the best baseline, with the bootstrap 95 % CI |
| of the difference excluding zero, for a given regime. |
| - Wins ≥ 4 incl. one of {gibberish, factual-misleading, NEO} → PROVEN. |
| - 1–3 wins → PARTIAL. |
| - 0 wins → DISPROVEN. |
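
The decision rule above can be written out as a short sketch. Several details here are assumptions rather than facts from the plan: what counts as a positive or negative example within a regime, the paired bootstrap over prompts, the one-sided reading of "non-crossing zero" (lower bound above zero), and the treatment of 4+ wins without a key regime, which the rule does not specify and which this sketch maps to PARTIAL.

```python
import random

KEY_REGIMES = {"gibberish", "factual_misleading", "neo_false_inability"}


def auroc(pos, neg):
    """Probability that a positive outscores a negative (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def is_win(cand_pos, cand_neg, base_pos, base_neg,
           margin=0.02, n_boot=1000, seed=0):
    """Candidate beats baseline by >= margin and the bootstrap 95% CI of the
    AUROC difference lies above zero. Scores are assumed paired by prompt."""
    rng = random.Random(seed)
    diff = auroc(cand_pos, cand_neg) - auroc(base_pos, base_neg)
    boots = []
    for _ in range(n_boot):
        # Resample prompt indices once, then apply them to both signals.
        pi = [rng.randrange(len(cand_pos)) for _ in cand_pos]
        ni = [rng.randrange(len(cand_neg)) for _ in cand_neg]
        boots.append(
            auroc([cand_pos[i] for i in pi], [cand_neg[j] for j in ni])
            - auroc([base_pos[i] for i in pi], [base_neg[j] for j in ni])
        )
    boots.sort()
    lower = boots[int(0.025 * n_boot)]  # 2.5th percentile
    return diff >= margin and lower > 0.0


def verdict(winning_regimes):
    """Map the set of winning regimes to PROVEN / PARTIAL / DISPROVEN."""
    wins = set(winning_regimes)
    if len(wins) >= 4 and wins & KEY_REGIMES:
        return "PROVEN"
    if wins:
        return "PARTIAL"  # includes the unspecified ">= 4 wins, no key regime" case
    return "DISPROVEN"
```

With zero winning regimes, as reported for the router-entropy family, `verdict([])` returns `"DISPROVEN"`.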
| |
| ## 3. Baselines (~ 0.3 page) |
| |
| - max_softmax_mean as the output-side baseline; computed over the |
| generated tokens only. |
| - Length-balanced per regime (n = 30 each) so that any signal that |
| correlates with prompt length is controlled. |
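
For concreteness, the output-side baseline could look like the sketch below. It assumes that `gen_logits` already holds only the next-token logits for generated positions (prompt positions excluded); that input format is a hypothetical choice, not the repository's.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_softmax_signals(gen_logits):
    """gen_logits[t] is the logit vector for generated token t only."""
    if not gen_logits:
        raise ValueError("need at least one generated token")
    # Top-1 probability at each generated step.
    maxes = [max(softmax(step)) for step in gen_logits]
    return {
        "max_softmax_mean": sum(maxes) / len(maxes),
        "max_softmax_last": maxes[-1],
    }
```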
| |
| ## 4. The pre-registered router-entropy claim (~ 0.5 page) |
| |
| **Result:** DISPROVEN at strict criterion. Across v3 (baseline) and all |
| joint-finetune variants (v6, v7, v8a, v8b), the router-entropy family |
| wins 0 / 7 regimes by the decision rule. Table 1. |
| |
| **Auxiliary cross-regime AUROC** (the looser test of "does this signal |
| separate in-domain from OOD") tells a different story: it improves |
| substantially under joint training. Save for Section 5. |
| |
| ## 5. Cross-regime AUROC + the splice test (~ 1 page) |
| |
| ### 5.1 Sweep over metacog-loss weight (v7 → v8b) |
| |
| | Variant | metacog wt | abstain wt | abstain_p AUROC | gibberish mean ab_p | in-domain FP @ 0.775 | gen coherent? | |
| |---|---:|---:|---:|---:|---:|---| |
| | v4 (base SFT only) | n/a | n/a | 0.51 | 0.60 | 0 % | yes | |
| | v7 | 20 | 1 | 0.76 | 0.94 | 20 % | NO | |
| | v8a | 5 | 1 | 0.80 | 0.97 | 23 % | NO | |
| | **v8b** | **0** | **5** | **0.85** | **1.00** | 10 % | NO | |
| | splice (v4 base + v7 abstain head) | n/a | n/a | 0.54 | 0.46 | 27 % | yes (v4-like) | |
| |
| Two findings stand out: |
| |
| 1. The cross-regime signal *strengthens* monotonically as the metacog |
| weight goes to zero (0.76 → 0.80 → 0.85), which suggests the two losses |
| **compete** for the router's representation budget rather than |
| reinforce each other. Caveat: v8b also raises the abstain weight from |
| 1 to 5, so the last step changes two weights at once. |
| 2. The signal does **not survive** a head-only splice. Lifting v7's |
| trained abstain head onto v4's base gives AUROC 0.54, near chance, |
| even though v7 itself reached 0.76. The signal lives in the joint |
| {router perturbation, head} representation. |
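
The table columns can be reproduced from per-prompt `abstain_p` values with a sketch like the one below. Two choices are assumptions: pooling all non-`in_domain` regimes as the positive class for the ID-vs-OOD AUROC, and counting `abstain_p >= 0.775` (rather than a strict inequality) as an abstention.

```python
def auroc(pos, neg):
    """Probability that a positive outscores a negative (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cross_regime_report(abstain_p_by_regime, threshold=0.775):
    """abstain_p_by_regime maps regime name -> list of abstain_p values."""
    ind = abstain_p_by_regime["in_domain"]
    # Pool every other regime as OOD (assumption, see lead-in).
    ood = [p for regime, ps in abstain_p_by_regime.items()
           if regime != "in_domain" for p in ps]
    gib = abstain_p_by_regime["gibberish"]
    return {
        "abstain_p_auroc": auroc(ood, ind),
        "gibberish_mean_ab_p": sum(gib) / len(gib),
        # Fraction of in-domain prompts that would be wrongly refused.
        "in_domain_fp": sum(p >= threshold for p in ind) / len(ind),
    }
```

Reading AUROC and the in-domain false-positive rate together matters here: a head can separate regimes well on average and still refuse a meaningful share of in-domain prompts at the chosen threshold.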
| |
| ### 5.2 Why the splice fails (mechanism, ~ 0.3 page) |
| |
| A trained abstain head learns to read patterns in the residual stream |
| that are specific to its training-time co-trained router. The router's |
| shift under joint training reshapes the residual stream; the head reads |
| those reshaped patterns. Lift the head onto a fresh base and the |
| patterns are gone. This is consistent with the literature on feature |
| non-transferability in linear probes (cite). |
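
As an illustration of what a head-only splice means operationally, the sketch below copies only the abstain-head parameters from a donor checkpoint onto a base checkpoint. Checkpoints are modelled as plain dicts, and the `abstain_head.` key prefix is a hypothetical naming convention, not the repository's actual parameter names.

```python
def splice_head(base_state, donor_state, head_prefix="abstain_head."):
    """Return a copy of base_state whose abstain-head entries come from donor_state.

    Everything else (router, pathways, output head) stays as in the base,
    which is exactly why the head loses the residual-stream patterns it was
    co-trained with.
    """
    head_keys = [k for k in donor_state if k.startswith(head_prefix)]
    if not head_keys:
        raise KeyError(f"no parameters with prefix {head_prefix!r} in donor")
    spliced = dict(base_state)  # shallow copy; base_state is left untouched
    for key in head_keys:
        if key not in base_state:
            raise KeyError(f"base checkpoint has no parameter {key!r}")
        spliced[key] = donor_state[key]
    return spliced
```

For example, `splice_head(v4_state, v7_state)` would correspond to the "v4 base + v7 abstain head" row, given checkpoints loaded into dicts with matching keys.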
| |
| ## 6. The router-fragility mechanism (~ 0.7 page) |
| |
| **Setup:** v8b sets metacog_weight = 0 and abstain_weight = 5. The |
| metacog term therefore contributes no gradient; only CE on the in-domain |
| subset and BCE on the abstain head do. The only unfrozen |
| parameters are router-Linears + abstain Linear. |
| |
| **Observation:** v8b still breaks generation, sometimes more severely |
| than v7 (which had MC = 20). |
| |
| **Diagnosis:** even with MC = 0, the CE-on-in-domain term backprops |
| through the model's output head into the residual stream and from there |
| into the unfrozen router-Linears. 500 × 32 = 16 000 in-domain updates |
| move the gates far enough from the routing distribution that the rest |
| of the (frozen) model was tuned against. OOD generation then collapses. |
| |
| **Falsifiable corollary:** if we additionally freeze the router-Linears |
| during BCE-only training (leave only the abstain Linear trainable), we |
| predict (a) the abstain head still reaches strong cross-regime AUROC |
| because its signal comes from the residual-stream pattern, not from |
| re-routing, and (b) generation is preserved. **This experiment is not |
| in the current paper; queued.** |
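
The difference between the v8b setup and the proposed follow-up comes down to which parameters stay trainable. A minimal sketch of that selection is below; the parameter names, the `.router.` substring, and the `abstain_head.` prefix are hypothetical and would need to match the real model.

```python
def trainable_names(param_names, train_router,
                    head_prefix="abstain_head.", router_tag=".router."):
    """Select parameter names that should receive gradient updates.

    train_router=True  -> router-Linears + abstain Linear (v8b-style setup).
    train_router=False -> abstain Linear only (the queued corollary test).
    """
    keep = []
    for name in param_names:
        is_head = name.startswith(head_prefix)
        is_router = router_tag in name
        if is_head or (train_router and is_router):
            keep.append(name)
    return keep
```

In a PyTorch training script, the returned names would typically be used to set `requires_grad` on each parameter before building the optimizer. With the router frozen, no loss term can move the routing distribution, which is the property the corollary relies on.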
| |
| ## 7. Discussion (~ 0.5 page) |
| |
| - What we did NOT show: that this result holds at 100 M or 1 B params. |
| The router-fragility argument is scale-dependent: a larger router |
| with more capacity may absorb 16 K in-domain updates without |
| disrupting OOD routing. We leave this open. |
| - What we DID show, at the scale we tested: |
| 1. The router-entropy-as-metacognition narrative is dead at 10 M. |
| 2. Abstain heads in small routed LMs are not modular. |
| 3. The strongest joint signal is reached by removing the metacog |
| loss, not adding it. |
| - Practical recommendation: at this scale, use `max_softmax_mean` + |
| abstain-aware SFT (not joint finetune). The deployed model uses |
| exactly this configuration and reaches 9 / 10 on the bundled held-out |
| IDK probe (gate ≥ 9; the deploy probe was 10 / 10 on slightly different |
| phrasing) with a 0 % in-domain false-positive rate. |
| |
| ## 8. Related work (~ 0.3 page) |
| |
| - BitNet b1.58 (Microsoft 2025): ternary base model at scale. |
| - Anthropic features-as-modules: closer to our positive case (features |
| ARE liftable in their analysis). We show this fails for abstain heads |
| in the routed-LM setting. |
| - Calibration literature: ECE, temperature scaling, learned uncertainty |
| heads. Most work is at 100M+ scale; our finding is specific to small |
| scale. |
| |
| ## 9. Limitations + reproducibility |
| |
| - 10 M params only. Architecture-specific (3-pathway routed block). |
| - One base ckpt for v8 sweep; another for v4 (history dependence). |
| - Probe set is hand-curated; some prompts may be ambiguous between |
| regimes. Inter-rater reliability not measured. |
| - Cost reproducibility: $0.35 GPU for the v8 sweep; rest CPU. Full kit |
| + scripts at https://github.com/TilelliLab/Tilelli-llm. |
| |
| --- |
| |
| ## What's not in scope |
| |
| - A defense of the 3-pathway block as an architecture. We document the |
| preliminary benchmark in `results/claim_01_benchmark.md` but it is |
| not the headline. |
| - A treatment of the deployed routing-pathway-attribution UI |
| (chat.tilelli.tech). That's a system + UX contribution best suited |
| for a separate venue (HCI/demo). |
| |
| ## Appendix sketch |
| |
| - A1: full 7×7 AUROC × variant matrix |
| - A2: sample generations for all 5 variants × 5 representative prompts |
| - A3: training-curve plots (ab_gap, ent_gap, ce) for v7 / v8a / v8b |
| - A4: the 210-prompt probe set as a CSV |
| - A5: ckpts + SHAs of all variants |
| |
| ## Timing |
| |
| - Write 1st draft: 2 days |
| - Send to 2 reviewers: 1 week |
| - Revise + submit: 1 week |
| - **Target ready-for-submission:** 2026-06-10 |
| |