| # Metacognition in a Small Routed Language Model Is Not a Separable Module |
|
|
| **Tilelli LLM Team** Β· hello@tilelli.tech |
| Code, checkpoints, and the evaluation set: https://github.com/TilelliLab/Tilelli-llm (Apache-2.0) |
|
|
| *Draft β workshop format (4 pages + appendix). Every number in this paper is produced by a |
| script in `reproduce/` that exits non-zero if the bundled checkpoint fails to reproduce it |
| within tolerance.* |
|
|
| --- |
|
|
| ## Abstract |
|
|
| We study whether the gate distribution of a routed language model can be exploited as a |
| metacognition / uncertainty signal at the smallest scale where routing is non-trivial |
| (10.2 M parameters). We pre-registered a per-regime AUROC decision rule across 7 evaluation |
| regimes and ran five training variants sweeping the metacognition-loss weight from 20 to 0, |
| plus a head-only weight-graft ("splice") condition. **The pre-registered claim is disproven:** |
| router entropy alone does not beat an output-side baseline in any of the 7 regimes. A weaker |
| but informative result survives: joint router + abstain-head training reaches cross-regime |
| in-domain-vs-OOD AUROC up to 0.85 on the abstain head's sigmoid output, but (i) the gain does |
| not survive a head-only splice onto a fresh base (AUROC drops to 0.54, at chance), and (ii) |
| every configuration that produces the gain also degrades generation. We argue these two |
| negative results together bound a substantive claim about modularity: in small routed LMs the |
| uncertainty signal lives in the joint {router, head} representation rather than in the head as a |
| transferable module. We further isolate the mechanism β at this scale the router is fragile |
| enough that cross-entropy backprop on an in-domain subset alone, with the metacognition loss set |
| identically to zero, shifts the routing distribution enough to break out-of-domain generation. |
|
|
| --- |
|
|
| ## 1. Introduction |
|
|
| Uncertainty and abstention heads are increasingly proposed as pluggable modules: train a small |
| head to predict "I don't know," and bolt it onto a base model. This paper tests that modularity |
| assumption at the small/edge scale where it would matter most, using a 10.2 M-parameter routed |
| byte-level LM, and finds it fails in a specific, mechanism-explainable way. |
|
|
| We make three contributions, all negative or qualifying, and all reproducible: |
|
|
| 1. A **pre-registered, disproven** claim that router entropy provides metacognition at 10 M |
| parameters (Section 4). |
| 2. A **non-transferability** result for abstain heads across base models β a head that reaches |
| AUROC 0.76 in situ drops to 0.54 when lifted onto a fresh base (Section 5). |
| 3. A **mechanism** for why joint training succeeds at producing the signal but breaks |
| generation, including a falsifiable corollary (Section 6). |
|
|
| We deliberately do not headline an architecture win. A preliminary single-seed benchmark of the |
| 3-pathway block against a vanilla decoder is reported honestly in Section 3 and |
| `results/claim_01_benchmark.md`, and it is **not** a defensible result; we say so plainly rather |
| than promote it. |
|
|
| ## 2. Setup |
|
|
| ### 2.1 Model |
|
|
| A 10.2 M-parameter byte-level language model: 8 layers, `d_model = 256`. Each block contains |
| three parallel pathways β a local pathway (1Γ1 convolution), a sparse-attention pathway (top-k), |
| and a dense feed-forward pathway β mixed by a learned linear gate over the hidden state, |
| softmax-routed. The model was trained on FineWeb-Edu (~10 B bytes) for 12 K base steps, then |
| chat-SFT, then abstain-aware SFT. The deployed checkpoint (`tilelli_chat_v4.pt`, FP32, |
| unquantized) anchors every positive claim in this paper. |
|
|
| ### 2.2 Evaluation regimes |
|
|
| We hand-curated 7 regimes Γ 30 prompts = a 210-prompt probe set |
| (`prompts/probe_210.jsonl`): `in_domain`, `ood_topic`, `ood_style`, `long_input`, `gibberish`, |
| `factual_misleading`, and `neo_false_inability` (well-formed prompts that invite a spurious |
| refusal). For each prompt we record output-side and routing-side signals: `max_softmax_mean` and |
| `max_softmax_last` (output-side baselines), `router_conf`, `router_entropy_mean`, |
| `router_entropy_var`, the 8-vector `router_entropy_per_layer`, and `abstain_p` (the sigmoid of a |
| dedicated abstain head on the final hidden state). |
|
|
| ### 2.3 Pre-registered decision rule |
|
|
| Registered before the runs (`MASTER_PLAN_2026-05-23.md` in the source repo). A *win* in a regime |
| requires AUROC β₯ 0.02 above the best baseline with a bootstrap 95% CI not crossing zero. |
| **β₯ 4 wins** including at least one of {gibberish, factual-misleading, NEO} β PROVEN; |
| **1β3 wins** β PARTIAL; **0 wins** β DISPROVEN. |
|
|
| ## 3. Baselines and a benchmark we will not oversell |
|
|
| The output-side baseline is `max_softmax_mean`, computed over generated tokens only, with each |
| regime length-balanced (n = 30) so a length-correlated signal cannot masquerade as |
| metacognition. |
|
|
| Separately, on the architecture itself: on the only Tilelli-Lite seed whose training log is |
| preserved (seed 1234), the comparison against a single-seed vanilla decoder **depends on the |
| metric**. On the within-training periodic-eval best, Lite *loses* by 0.6% (0.5742 vs 0.5707); on |
| a single post-training extra-eval pass, Lite *wins* by 0.4% (0.5685) β on a different |
| validation-batch composition, with a different `eval_every` than the baseline. Two further Lite |
| seeds were reported (0.5679, 0.5693) but their logs are not preserved. We therefore treat this as |
| a **preliminary single-seed directional finding, not a result** (`results/claim_01_benchmark.md`), |
| and it is not load-bearing for anything below. |
|
|
| ## 4. The pre-registered router-entropy claim: disproven |
|
|
| Across the v3 baseline and all joint-finetune variants (v6, v7, v8a, v8b), the router-entropy |
| family wins **0 / 7 regimes** by the registered criterion. The pre-registered claim is |
| **DISPROVEN**. |
|
|
| Cross-regime in-domain-vs-OOD AUROC on the deployed v4 checkpoint: |
|
|
| | Signal | Cross-regime AUROC | |
| |---|---:| |
| | `max_softmax_mean` (baseline) | 0.54 | |
| | `router_conf` | 0.55 | |
| | `router_entropy_mean` | 0.55 | |
| | `abstain_p` | 0.51 | |
|
|
| All four are essentially at chance for *cross-regime* discrimination. The one place an |
| uncertainty signal works at this scale is `max_softmax_mean` *broken down per regime*: on |
| gibberish-vs-in-domain it reaches AUROC β 0.93. That signal is output-side and |
| architecture-agnostic β it would work on any vanilla LM β so it offers no support for the |
| routing-as-metacognition narrative. |
|
|
| ## 5. Cross-regime AUROC and the splice test |
|
|
| A looser question β does any signal separate in-domain from OOD after *joint* training? β has a |
| more interesting answer. We swept the metacognition-loss weight from 20 β 5 β 0 while keeping an |
| abstain BCE term: |
|
|
| | Variant | metacog wt | abstain wt | `abstain_p` AUROC | gibberish mean `abstain_p` | in-domain FP @ 0.775 | generation coherent? | |
| |---|---:|---:|---:|---:|---:|:--:| |
| | v4 (base SFT only) | β | β | 0.51 | 0.60 | 0% | yes | |
| | v7 | 20 | 1 | 0.76 | 0.94 | 20% | no | |
| | v8a | 5 | 1 | 0.80 | 0.97 | 23% | no | |
| | **v8b** | **0** | **5** | **0.85** | **1.00** | 10% | no | |
| | splice (v4 base + v7 head) | β | β | 0.54 | 0.46 | 27% | yes (v4-like) | |
|
|
| Two findings stand out. |
|
|
| **(1) The losses compete; they do not synergize.** The cross-regime signal *strengthens |
| monotonically as the metacognition weight goes to zero*. v8b, with zero metacognition pressure, |
| produces the strongest abstain signal in the entire project (AUROC 0.85, gibberish mean 1.00). |
| Adding the metacognition loss makes the discrimination *worse*, not better β the two losses |
| contend for the router's limited representation budget. |
|
|
| **(2) The signal does not survive a head-only splice.** Lifting v7's trained abstain head onto |
| v4's frozen base gives AUROC 0.54 β at chance, despite v7 itself reaching 0.76 β and makes |
| behavior *worse*, not neutral, raising the in-domain false-positive rate to 27%: |
|
|
| | Deploy gate | v4 | splice | v7 | |
| |---|---:|---:|---:| |
| | gibberish mean `abstain_p` (target > 0.775) | 0.60 β | 0.46 β | 0.94 β | |
| | in-domain false-positive rate (target β€ 0%) | 0% | 27% | 20% | |
| | chat coherence | β | β (v4-like) | β broken | |
|
|
| ### 5.1 Why the splice fails |
|
|
| A trained abstain head learns to read residual-stream patterns specific to its co-trained router. |
| Joint training shifts the router, which reshapes the residual stream; the head reads those |
| reshaped patterns. Lift the head onto a fresh base and the patterns are gone β consistent with |
| the literature on feature non-transferability in linear probes. The uncertainty signal is a |
| property of the joint {router-perturbation, head} representation, not of the head alone. |
|
|
| ## 6. The router-fragility mechanism |
|
|
| v8b sets the metacognition weight to exactly zero: only cross-entropy on the in-domain subset and |
| BCE on the abstain head contribute gradient, and the only unfrozen parameters are the router |
| linears plus the abstain linear. **v8b still breaks generation** β sometimes more severely than |
| v7, which had a metacognition weight of 20. |
|
|
| Diagnosis: even with the metacognition loss identically zero, the in-domain cross-entropy term |
| backprops through the output head into the residual stream and from there into the unfrozen router |
| linears. Roughly 16,000 in-domain updates (500 steps Γ 32) shift the routing distribution enough |
| to break the routing the rest of the (frozen) model was tuned against; OOD generation then |
| collapses. At this scale the router cannot be retrained on *any* subset distribution without |
| disrupting generation elsewhere. |
|
|
| **Falsifiable corollary (queued, not yet run):** additionally freeze the router linears and train |
| only the abstain linear under BCE. We predict (a) the abstain head still reaches strong |
| cross-regime AUROC, because its signal comes from the residual-stream pattern rather than from |
| re-routing, and (b) generation is preserved. Confirmation would localize the damage precisely to |
| router re-tuning. |
|
|
| ## 7. The deployed operating point (what actually works) |
|
|
| The practical recommendation at this scale is **not** joint finetuning: it is `max_softmax_mean` |
| plus abstain-aware SFT. The deployed v4 checkpoint, using exactly that recipe, reaches **9 / 10** |
| on the bundled held-out "I don't know" gate (PASS gate β₯ 9; the deploy probe was 10 / 10 on |
| slightly different phrasing) with a **0%** in-domain false-positive rate at threshold 0.775 |
| (calibrated on held-out data). On a separate false-inability probe it fires the refusal template |
| on **7 / 20** answerable prompts β precision-bounded by SFT coverage. These are precision claims |
| about a head working on its trained pattern, not generalization claims; on semantic OOD outside |
| the SFT distribution the same head is at chance (Section 4). |
|
|
| ## 8. Discussion |
|
|
| What we did **not** show: that any of this holds at 100 M or 1 B parameters. The router-fragility |
| argument is explicitly scale-dependent β a larger router with more capacity may absorb in-domain |
| updates without disrupting OOD routing. We leave that open. What we **did** show, at the scale we |
| tested: (1) the router-entropy-as-metacognition narrative is dead at 10 M; (2) abstain heads in |
| small routed LMs are not modular; (3) the strongest joint signal is reached by *removing* the |
| metacognition loss, not adding it. |
|
|
| ## 9. Related work |
|
|
| Ternary base models at scale (e.g. BitNet b1.58) motivate small-model interest but do not address |
| modular uncertainty. Work treating sparse features as liftable modules is closer to our positive |
| counterexample β we show the lifting fails for abstain heads in the routed-LM setting. Most |
| calibration work (ECE, temperature scaling, learned uncertainty heads) operates at 100 M+ scale; |
| our finding is small-scale specific. |
|
|
| ## 10. Limitations and reproducibility |
|
|
| 10.2 M parameters only; architecture-specific (3-pathway routed block). The v8 sweep uses one |
| base checkpoint and v4 another (history dependence). The probe set is hand-curated and |
| inter-rater reliability is not measured. Cost: ~$0.35 of GPU for the v8 sweep, the rest CPU. |
| Every headline number is bound to a script: |
|
|
| ```bash |
| python reproduce/01_benchmark.py # arch loads, ~10 M params (CPU, ~2 s) |
| python reproduce/03_abstain_held_out.py # 9 / 10 held-out IDK gate (CPU, ~1 min) |
| python reproduce/04_neo_false_inability.py # 7 / 20 false-inability (CPU, ~2 min) |
| python reproduce/02_metacog_probe.py # cross-regime AUROC sweep (CPU, ~15 min) |
| ``` |
|
|
| Each exits non-zero if the bundled v4 checkpoint fails to produce the documented number within |
| tolerance. |
|
|
| ## Appendix (sketch) |
|
|
| - **A1** Full 7-regime Γ variant AUROC matrix. |
| - **A2** Sample generations for all 5 variants on 5 representative prompts. |
| - **A3** Training curves (abstain gap, entropy gap, CE) for v7 / v8a / v8b. |
| - **A4** The 210-prompt probe set (`prompts/probe_210.jsonl`). |
| - **A5** Checkpoints and SHAs for all variants (negative-result checkpoints available on request |
| via hello@tilelli.tech). |
|
|