# Self-Distillation Landscape Audit (feeds ADR-007)

**Status:** research note, pre-experimental

**Author:** subagent audit

**Date:** 2026-05-25

**Scope:** identify 2–3 distillation-channel losses worth adding to `composer_replication` alongside the existing GRPO + SDPO/OPSD `generalized_jsd_loss` + multi-teacher trace-replay DPO stack.

**Bias:** additivity over novelty. We are looking for losses that COMPOSE with what is already implemented, not duplicates of it.

---

## TL;DR: recommended additions

| Rank | Method | Loss role | License | LOC est. | Why it composes |
|------|--------|-----------|---------|----------|-----------------|
| 1 | **SimPO** (NeurIPS 2024) | Preference, reference-free | MIT | ~80 | Drop-in for trace-replay DPO; removes ref-model VRAM cost; orthogonal to JSD distillation channel |
| 2 | **TAID** (ICLR 2025) | Interpolated-target wrapper around any KL/JSD | Apache-2.0 | ~150 | Wraps the existing `generalized_jsd_loss` rather than replacing it. Closes capacity gap on small students |
| 3 | **Entropy-Aware OPD** (ICLR 2026 Spotlight) | Token-gated forward/reverse KL mixture | CC BY 4.0 (paper); code expected | ~120 | Fixes a documented failure mode of the reverse-KL-style SDPO loss when teacher entropy is high; directly addresses a known weakness of channel 2 |

**Honourable mention:** KTO, useful only if the framework wants to ingest binary thumbs-up/thumbs-down trace signals without preference pairs.

**Not recommended:** GKD, DistiLLM, MiniLLM, Self-Rewarding LM (rationale at end).

---

## Audit method

For each candidate paper (the seven the user named, plus 2026 follow-ups discovered via Exa search restricted to `category=research paper, startPublishedDate=2026-01-01`) we verified:

1. **Primary source exists.** arXiv abstract page reachable; HTML body parsed to extract the actual loss formula (not summarised from secondary sources).
2. **Code is real.** Official repo's README was fetched, `last push` date and star count recorded.
Forks of MiniLLM/DistiLLM that are no longer maintained were marked as such.
3. **License is permissive enough.** MIT, Apache-2.0, BSD, CC BY 4.0 are acceptable for inclusion. GPL or research-only would be flagged.
4. **Composability check.** Read the framework's existing `composer_replication/__init__.py` and `research/05-trace-replay-distillation.md`, then asked: *does this loss replace something we have, or stack on top?*

---

## Candidate 1: SimPO (Simple Preference Optimization) ⭐ RECOMMENDED

### Sources

- **arXiv:** https://arxiv.org/abs/2405.14734 (Meng, Xia, Chen; UVA + Princeton, NeurIPS 2024)
- **GitHub:** https://github.com/princeton-nlp/SimPO
  - License: **MIT**
  - 949 stars, 74 forks, last commit 2024-10-12 (mature, post-NeurIPS)
  - Built on top of `huggingface/alignment-handbook`
- Maturity: **production-ready**. Released checkpoints for Mistral, Llama-3, Gemma-2 base/instruct. Reproducible training configs ship with the repo.

### Loss core (reference-free preference)

SimPO replaces the DPO log-ratio (which requires keeping `π_ref` in memory) with the **average log-probability** of the sequence under the policy, plus a **target reward margin** γ:

```
r(x, y) = (β / |y|) · log π_θ(y | x)        ← length-normalised implicit reward (no reference model)

L_SimPO(π_θ) = −E_{(x, y_w, y_l) ~ D} [ log σ( r(x, y_w) − r(x, y_l) − γ ) ]
```

where `β` is a temperature (typically 2.0–10) and `γ` is the desired margin between chosen and rejected (the repo recommends `γ/β ≈ 0.5` as a starting point).

Two consequences:

- (i) no `π_ref` forward pass per step → roughly half the memory, and
- (ii) the implicit reward is exactly the quantity the model generates from at decode time, removing a known DPO pathology where decoding-time and training-time rewards diverge.

### Why it composes with the existing stack

- The framework's **channel 3** is multi-teacher trace-replay DPO.
SimPO is a drop-in replacement for the DPO step inside that channel: same `(x, y_w, y_l)` data contract, different loss head. So the trace-replay harvester does not change at all.
- It does **not** touch channel 2 (SDPO/OPSD `generalized_jsd_loss`). The two are complementary: JSD-distillation transfers token-level teacher knowledge, SimPO sharpens preference structure between trace alternatives.
- It does **not** duplicate GRPO either. GRPO is on-policy RLVR; SimPO is offline preference optimisation. Different data sources.
- The published Mistral-7B and Llama-3-8B SimPO results beat DPO by 4–6 points on AlpacaEval-2 LC, which suggests that if we already have channel-3 pairs, SimPO is a near-free upgrade.

### Implementation cost

- **~80 LOC** for the trainer hook; the loss itself is ~15 lines (log-probs, length-normalise, margin, BCE).
- Dependencies: nothing new. `torch` and `transformers` are already in the repo.
- The reference implementation is a single file in `princeton-nlp/SimPO` (`scripts/run_simpo.py` + `alignment/` trainer subclass) under MIT, so we can vendor it exactly as we did with OPSD.

---

## Candidate 2: TAID (Temporally Adaptive Interpolated Distillation) ⭐ RECOMMENDED

### Sources

- **arXiv:** https://arxiv.org/abs/2501.16937 (Shing, Misaki, Bao, Yokoi, Akiba; Sakana AI, ICLR 2025)
- **GitHub:** https://github.com/SakanaAI/TAID
  - License: **Apache-2.0**
  - 121 stars, last push 2025-10-06 (actively maintained)
  - Reference implementations of GKD, DistiLLM, Adaptive-KL, CTKD, DKD are also in `src/distil_losses/` for free
- Released artefacts: `TAID-LLM-1.5B`, `TAID-VLM-2B` on HuggingFace (so the loss is verified at non-trivial scale).
- Maturity: **published, single-author commits**, but the authors reproducibly trained two SoTA compact models with it.

### Loss core (interpolated teacher target)

Standard distillation losses (forward KL, reverse KL, JSD, including the `generalized_jsd_loss` we already have) target a **fixed** teacher distribution `p_T`.
TAID replaces this fixed target with a **time-dependent interpolated target** `p_t` that starts close to the student and moves toward the teacher as training progresses:

```
p_t(y | x) = (1 − t) · q_θ_stop(y | x) + t · p_T(y | x)        (1)

J_TAID(θ; t) = D_KL( p_t ‖ q_θ )                                (2)
```

`q_θ_stop` is the student's own current distribution with stop-gradient. The interpolation coefficient `t ∈ [t_start, 1]` is updated each step by an **adaptive momentum schedule** that grows `t` faster when training loss is falling and slower when it stalls; this is the "temporally adaptive" part. The Sakana paper proves (Theorem 4.1) that, for the regression analogue, this schedule prevents the mode-collapse failure mode of pure self-distillation.

Critically, `D_KL(p_t ‖ q_θ)` is just one choice of divergence against the shifted target: you can equally well plug in JSD, reverse KL, or **the generalized_jsd_loss the framework already exports**. TAID is therefore a *wrapper around an existing divergence*, not a competing divergence.

### Why it composes with the existing stack

- It **wraps** `composer_replication.opsd.generalized_jsd_loss` rather than replacing it. The change is "compute the JSD against `p_t` instead of `p_T`", a few lines around the existing call site.
- Addresses a documented weakness of OPSD-style self-distillation: when the teacher's privileged-context distribution is far from the student's capacity, the JSD signal can be noisy or push the student into mode averaging. TAID's annealed target gives the student a curriculum.
- Empirical evidence from the Sakana paper's direct comparison: TAID + JSD beats GKD + JSD, which beats DistiLLM + skew-KL, on Phi-3 → TinyLlama distillation, with **0.7 h / epoch** vs **9.8 h / epoch** for GKD on identical hardware. The speed comes from not needing student-generated outputs (SGOs) at every step the way GKD does.
- Composes additively with channel 1 (GRPO) and channel 3 (trace-replay DPO) because TAID lives strictly inside channel 2.
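
Equations (1) and (2) can be sketched for a single token position as follows. This is a minimal pure-Python illustration, not the Sakana reference code and not the framework's tensor implementation: `softmax`, `kl_divergence`, and `taid_objective` are hypothetical helper names, and the adaptive update of `t` is deliberately left out. Plain floats carry no gradient, so the stop-gradient on the student term is only noted in a comment.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def taid_target(student_logits, teacher_logits, t):
    """Eq. (1): p_t = (1 - t) * q_student + t * p_teacher.

    In a tensor framework the student term would be detached (stop-gradient).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    q = softmax(student_logits)
    p = softmax(teacher_logits)
    return [(1.0 - t) * qi + t * pi for qi, pi in zip(q, p)]


def kl_divergence(p, q):
    """D_KL(p || q); assumes every entry of q is strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def taid_objective(student_logits, teacher_logits, t):
    """Eq. (2): D_KL(p_t || q_theta) for one token position."""
    p_t = taid_target(student_logits, teacher_logits, t)
    q = softmax(student_logits)
    return kl_divergence(p_t, q)


# Illustrative logits only; the values are made up for the example.
student = [2.0, 0.5, -1.0]
teacher = [0.1, 2.5, 0.3]
for t in (0.0, 0.5, 1.0):
    print(t, taid_objective(student, teacher, t))
```

At `t = 0` the target equals the student, so the objective vanishes; at `t = 1` it reduces to the ordinary forward KL against the fixed teacher. Swapping `kl_divergence` for another divergence is the "wrapper" property described above.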

### Implementation cost

- **~150 LOC**. The change is:
  1. A `TAIDState` object that holds `t`, the EMA of training loss, and the momentum coefficient β (default 0.99).
  2. A function `taid_target(student_logits, teacher_logits, t)` that returns `(1−t)·softmax(student_logits.detach()) + t·softmax(teacher_logits)`.
  3. A scheduler hook that updates `t` after each backward pass per Algorithm 1 of the paper.
- Dependencies: nothing new.
- Reference implementation in `SakanaAI/TAID/src/distil_losses/taid.py` is Apache-2.0: vendor-friendly, same pattern as our OPSD lift.

---

## Candidate 3: Entropy-Aware On-Policy Distillation (Entropy-Aware OPD) ⭐ RECOMMENDED

### Sources

- **OpenReview (ICLR 2026 Spotlight):** https://openreview.net/forum?id=WSRQ37tzk1
- **IBM Research page:** https://research.ibm.com/publications/entropy-aware-on-policy-distillation-of-language-models
- Authors: Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, Kimin Lee (KAIST + IBM Research)
- Status: **ICLR 2026 Spotlight**, submission #113. License on the OpenReview record is **CC BY 4.0**.
- Code: not yet released on GitHub at the time of audit (paper accepted 2026-03-03). IBM authors typically release within the conference window. **Maturity flag: paper-ready, code-pending.** This is the only candidate where we'd need to re-implement from the paper.

### Loss core (entropy-gated forward/reverse KL mixture)

The paper diagnoses a failure mode in the reverse-KL-on-policy distillation recipe used by MiniLLM, OPSD, and (implicitly) by our SDPO channel: when the **teacher distribution has high entropy at a given token**, reverse KL's mode-seeking gradient becomes noisy and collapses the student's diversity.

Their fix: at each token `t`, gate between forward and reverse KL based on the teacher's entropy:

```
H_t = − Σ_v p_T(v | x, y_
```

[…] Skew KL across student sizes, so TAID dominates this candidate.

### MiniLLM (Gu et al., ICLR 2024)

- **arXiv:** https://arxiv.org/abs/2306.08543
- **GitHub:** https://github.com/microsoft/LMOps/tree/main/minillm (MIT, repo active, last push 2026-04)
- **Loss core:** reverse KL minimised by policy-gradient on student rollouts, with three optimisation tricks: single-step decomposition (variance reduction), teacher-mixed sampling (anti-reward-hacking), length normalisation.
- **Why excluded:** reverse-KL on-policy distillation **is the same recipe family as SDPO/OPSD** the framework already implements. Adding MiniLLM would be a parallel implementation of the same idea, not an addition. Entropy-Aware OPD (Candidate 3) is a *strict improvement* over MiniLLM's pure reverse-KL on exactly the failure mode MiniLLM identifies (mode collapse in high-entropy regions).

### Self-Rewarding Language Models (Yuan et al., 2024)

- **arXiv:** https://arxiv.org/abs/2401.10020 (Meta + NYU)
- **Why excluded:** SRLM is a *training procedure* (iterative DPO with the model judging its own outputs), not a loss. The actual loss is plain DPO, which the framework already supports. The procedural contribution belongs in a future ADR on data generation, not in the distillation channel.

### TAID's relationship to "TAID arXiv 2501.16937 if it exists"

The user asked us to verify existence. **It exists.** Submitted 2025-01-28, ICLR 2025, code at https://github.com/SakanaAI/TAID with two released checkpoints (`TAID-LLM-1.5B`, `TAID-VLM-2B`). Confirmed primary source.

---

## 2026 papers found

The targeted Exa search (`category=research paper`, `startPublishedDate=2026-01-01`) surfaced five 2026 distillation papers worth listing for completeness:

1. **Entropy-Aware On-Policy Distillation** (ICLR 2026 Spotlight). ⭐ Promoted to top-3 above.
2. **KL for a KL: On-Policy Distillation with Control Variate Baseline** (arXiv 2605.07865, Oh et al., 2026-05). Variance-reduction trick for on-policy KL distillation.
Useful future read but not a new loss: it's a baseline subtraction added to MiniLLM-style policy gradient.
3. **Rethinking On-Policy Distillation: Phenomenology, Mechanism, and Recipe** (https://github.com/thunlp/OPD, Tsinghua NLP, last push 2026-04). Empirical study, not a new loss formulation.
4. **Hybrid Policy Distillation for LLMs** (ICML 2026 poster, Zhu et al.). Combines off-policy and on-policy distillation; positioned as a recipe rather than a new loss; abstract suggests strong overlap with TAID's annealing argument.
5. **Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation** (ICML 2026 poster, Dasgupta et al.). Targets the long-tail of teacher distributions. Interesting but currently only an abstract; deferred until the camera-ready PDF is available.

None of these except Entropy-Aware OPD are mature enough (released code + license + reproducible scale) to recommend adding right now.

---

## Recommended follow-up wiring

For ADR-007 the proposed addition is a `composer_replication.distillation` sub-package with three pluggable hooks:

> **Realised in v0.1 (Wave 17 update):** ADR-007 shipped a flatter
> layout than the proposal below. Actual exports:
>
> ```
> composer_replication/
>   distillation/
>     __init__.py
>     simpo.py               # simpo_loss(chosen_avg_logprobs, rejected_avg_logprobs, *, beta, gamma)
>                            # avg_sequence_logprob(logprobs, mask) -- helper
>     taid.py                # taid_loss(student_logits, teacher_logits, t, *, ...)
>                            # TAIDScheduler -- adaptive momentum schedule per the paper
>     entropy_aware_opd.py   # entropy_aware_opd_loss(student_logits, teacher_logits, *, h_max, ...)
> ```
>
> No `targets.py`/`losses.py` split, no top-level `preference/` package,
> and SimPO lives under `distillation/` rather than `preference/` because
> the three losses share a common dispatch surface (`compose_loss`'s
> `dpo_variant` and `sdpo_wrapper` switches).
>
> The composition rule realised in `compose_loss` is per-loss flag-driven,
> not a single composed-function call:
>
> ```python
> compose_loss(model, inputs,
>     dpo_variant="simpo",              # OR "dpo" (default)
>     sdpo_wrapper="taid",              # OR "entropy_opd" OR "none" (default)
>     taid_t=0.5,                       # required when sdpo_wrapper="taid"
>     simpo_beta=2.0, simpo_gamma=1.0,  # used only when dpo_variant="simpo"
>     entropy_opd_h_max=...,            # used only when sdpo_wrapper="entropy_opd"
> )
> ```
>
> The pre-ADR proposal sketch below is preserved as historical context.
> The shipped function names are `simpo_loss`, `taid_loss` +
> `TAIDScheduler`, and `entropy_aware_opd_loss` (not `taid_target` /
> `entropy_aware_kl_loss`).

```
composer_replication/
  distillation/
    __init__.py
    targets.py        # taid_target(...), fixed_target(...)   ← Candidate 2
    losses.py         # reuses opsd.generalized_jsd_loss
                      # adds entropy_aware_kl_loss(...)        ← Candidate 3
  preference/
    simpo.py          # simpo_loss(...)                        ← Candidate 1
    dpo.py            # existing trace-replay path
```

The composition rule for the total loss becomes:

```
L_total = λ_grpo    · L_GRPO        (channel 1, unchanged)
        + λ_distill · L_distill     (channel 2, see below)
        + λ_pref    · L_pref        (channel 3, choose DPO or SimPO)

L_distill = entropy_aware_kl_loss(
    target  = taid_target(student, teacher, t),
    student = student,
    teacher_entropy_gate = α_t
)
```

This keeps the existing `generalized_jsd_loss` reachable as a fallback (set `α_t ≡ 0` and `t ≡ 1` and you recover SDPO/OPSD exactly).
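
For the channel-3 term `L_pref` with the SimPO option, the formula from Candidate 1 can be sketched against the `simpo_loss` and `avg_sequence_logprob` signatures listed above. This is a pure-Python stand-in operating on lists of floats, not the shipped tensor implementation; the defaults `beta=2.0` and `gamma=1.0` are borrowed from the `compose_loss` example and are an assumption here, and `_softplus` is a hypothetical helper.

```python
import math


def avg_sequence_logprob(logprobs, mask):
    """Length-normalised log-probability: mean of per-token log-probs where mask is truthy."""
    kept = [lp for lp, keep in zip(logprobs, mask) if keep]
    if not kept:
        raise ValueError("mask selects no tokens")
    return sum(kept) / len(kept)


def _softplus(x):
    """Stable log(1 + exp(x)); note that -log(sigmoid(z)) == softplus(-z)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def simpo_loss(chosen_avg_logprobs, rejected_avg_logprobs, *, beta=2.0, gamma=1.0):
    """Mean over pairs of -log sigma(beta * lp_w - beta * lp_l - gamma)."""
    if not chosen_avg_logprobs or len(chosen_avg_logprobs) != len(rejected_avg_logprobs):
        raise ValueError("need equally sized, non-empty batches")
    total = 0.0
    for lp_w, lp_l in zip(chosen_avg_logprobs, rejected_avg_logprobs):
        margin = beta * lp_w - beta * lp_l - gamma
        total += _softplus(-margin)
    return total / len(chosen_avg_logprobs)


# One illustrative pair; the per-token log-probs are made up for the example.
chosen = avg_sequence_logprob([-0.2, -0.4, -0.9], [1, 1, 1])
rejected = avg_sequence_logprob([-1.5, -2.5, -9.0], [1, 1, 0])  # last token is padding
print(simpo_loss([chosen], [rejected]))
```

No reference-model log-probabilities appear anywhere in the computation, which is the source of the memory saving noted for Candidate 1.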

---

## Sources index

| Paper | arXiv | GitHub | License | Last push | Maturity |
|-------|-------|--------|---------|-----------|----------|
| SimPO | https://arxiv.org/abs/2405.14734 | https://github.com/princeton-nlp/SimPO | MIT | 2024-10-12 | Production |
| TAID | https://arxiv.org/abs/2501.16937 | https://github.com/SakanaAI/TAID | Apache-2.0 | 2025-10-06 | Production |
| Entropy-Aware OPD | n/a (OpenReview WSRQ37tzk1) | code-pending | CC BY 4.0 (paper) | n/a | Paper-only |
| KTO | https://arxiv.org/abs/2402.01306 | huggingface/trl (built-in) | Apache-2.0 | continuous | Production |
| GKD | https://arxiv.org/abs/2306.13649 | (no official repo from authors; reproduced inside SakanaAI/TAID and jongwooko/distillm) | n/a | n/a | Reference only |
| DistiLLM | https://arxiv.org/abs/2402.03898 | https://github.com/jongwooko/distillm | (no LICENSE file at audit time) | 2025-03-13 | Research |
| MiniLLM | https://arxiv.org/abs/2306.08543 | https://github.com/microsoft/LMOps/tree/main/minillm | MIT | 2026-04-08 | Production |
| Self-Rewarding LM | https://arxiv.org/abs/2401.10020 | (no canonical repo; integrated into many forks) | n/a | n/a | Procedure, not a loss |

---

## Notes for ADR-007 author

1. **SimPO and TAID can land independently and without coordination.** They touch different files and do not compete.
2. **Entropy-Aware OPD should land last.** Wait for the IBM/KAIST authors' code release; if it's not out by the time we want to ship the change, the formula is simple enough to re-derive but we should pin a unit test that reproduces the paper's Figure 3 entropy-vs-divergence behaviour.
3. **Do not also pull in GKD/DistiLLM/MiniLLM.** Their loss contributions are strict subsets of what (TAID + Entropy-Aware OPD + existing `generalized_jsd_loss`) covers.
4. **KTO should be added as a backlog item** with a "trigger" condition: when the trace-replay reward design moves from preference pairs to per-step binary signals, switch on the `trl.KTOTrainer` path.

---

*Absolute path of this report:* `/mnt/e/CS/HF/composer-replication-framework/docs/research/SELF_DISTILLATION_LANDSCAPE.md`