
Self-Distillation Landscape Audit (feeds ADR-007)

Status: research note, pre-experimental
Author: subagent audit
Date: 2026-05-25
Scope: identify 2–3 distillation-channel losses worth adding to composer_replication alongside the existing GRPO + SDPO/OPSD generalized_jsd_loss + multi-teacher trace-replay DPO stack.
Bias: additivity over novelty. We are looking for losses that COMPOSE with what is already implemented, not duplicates of it.


TL;DR — recommended additions

| Rank | Method | Loss role | License | LOC est. | Why it composes |
|---|---|---|---|---|---|
| 1 | SimPO (NeurIPS 2024) | Preference, reference-free | MIT | ~80 | Drop-in for trace-replay DPO; removes ref-model VRAM cost; orthogonal to JSD distillation channel |
| 2 | TAID (ICLR 2025) | Interpolated-target wrapper around any KL/JSD | Apache-2.0 | ~150 | Wraps the existing generalized_jsd_loss rather than replacing it. Closes capacity gap on small students |
| 3 | Entropy-Aware OPD (ICLR 2026 Spotlight) | Token-gated forward/reverse KL mixture | CC BY 4.0 (paper); code expected | ~120 | Fixes a documented failure mode of the reverse-KL-style SDPO loss when teacher entropy is high; directly addresses a known weakness of channel 2 |

Honourable mention: KTO — useful only if the framework wants to ingest binary thumbs-up/thumbs-down trace signals without preference pairs. Not recommended: GKD, DistiLLM, MiniLLM, Self-Rewarding LM (rationale at end).


Audit method

For each candidate paper (the seven the user named, plus 2026 follow-ups discovered via Exa search restricted to category=research paper, startPublishedDate=2026-01-01) we verified:

  1. Primary source exists. arXiv abstract page reachable; HTML body parsed to extract the actual loss formula (not summarised from secondary sources).
  2. Code is real. Official repo's README was fetched, last push date and star count recorded. Forks of MiniLLM/DistiLLM that are no longer maintained were marked as such.
  3. License is permissive enough. MIT, Apache-2.0, BSD, CC BY 4.0 are acceptable for inclusion. GPL or research-only would be flagged.
  4. Composability check. Read the framework's existing composer_replication/__init__.py and research/05-trace-replay-distillation.md, then asked: does this loss replace something we have, or stack on top?

Candidate 1 — SimPO (Simple Preference Optimization) ⭐ RECOMMENDED

Sources

  • arXiv: https://arxiv.org/abs/2405.14734 (Meng, Xia, Chen — UVA + Princeton, NeurIPS 2024)
  • GitHub: https://github.com/princeton-nlp/SimPO
    • License: MIT
    • 949 stars, 74 forks, last commit 2024-10-12 (mature, post-NeurIPS)
    • Built on top of huggingface/alignment-handbook
  • Maturity: production-ready. Released checkpoints for Mistral, Llama-3, Gemma-2 base/instruct. Reproducible training configs ship with the repo.

Loss core (reference-free preference)

SimPO replaces the DPO log-ratio (which requires keeping π_ref in memory) with the average log-probability of the sequence under the policy, plus a target reward margin γ:

r(x, y) = (β / |y|) · log π_θ(y | x)        ← length-normalised implicit reward
                                               (no reference model)

L_SimPO(π_θ) = −E_{(x, y_w, y_l) ~ D} [
    log σ( r(x, y_w) − r(x, y_l) − γ )
]

where β is a temperature (typically 2.0–10) and γ is the desired margin between chosen and rejected (the repo recommends γ/β ≈ 0.5 as a starting point). Two consequences: (i) there is no π_ref forward pass per step, so no reference model has to be held in memory (roughly half the model memory of DPO), and (ii) the implicit reward is the same length-normalised log-likelihood that guides generation at decode time, removing a known DPO pathology where decoding-time and training-time rewards diverge.
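To make the formula concrete, here is a minimal, dependency-free sketch of the per-pair SimPO loss. The function name simpo_loss_sketch and the default values β = 2.0, γ = 1.0 are illustrative assumptions (chosen so that γ/β = 0.5); this is not the reference implementation, and a real trainer would operate on batched tensors.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # Branch on the sign so exp() never receives a large positive argument.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def simpo_loss_sketch(chosen_avg_logprob: float,
                      rejected_avg_logprob: float,
                      beta: float = 2.0,
                      gamma: float = 1.0) -> float:
    """Per-pair SimPO loss from length-normalised sequence log-probs.

    Inputs are (1/|y|) * log pi_theta(y | x) for the chosen and rejected
    responses. No reference-model quantity appears anywhere.
    """
    reward_w = beta * chosen_avg_logprob    # r(x, y_w)
    reward_l = beta * rejected_avg_logprob  # r(x, y_l)
    return -log_sigmoid(reward_w - reward_l - gamma)


# Policy already prefers the chosen response: small loss.
easy = simpo_loss_sketch(-0.5, -2.0)
# Policy prefers the rejected response: large loss.
hard = simpo_loss_sketch(-2.0, -0.5)
print(easy, hard)
```

The margin γ means a pair is only "solved" once the chosen reward exceeds the rejected reward by more than γ, not merely when it is larger.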

Why it composes with the existing stack

  • The framework's channel 3 is multi-teacher trace-replay DPO. SimPO is a drop-in replacement for the DPO step inside that channel — same (x, y_w, y_l) data contract, different loss head. So the trace-replay harvester does not change at all.
  • It does not touch channel 2 (SDPO/OPSD generalized_jsd_loss). The two are complementary: JSD-distillation transfers token-level teacher knowledge, SimPO sharpens preference structure between trace alternatives.
  • It does not duplicate GRPO either. GRPO is online-policy RLVR; SimPO is offline preference. Different data sources.
  • The published Mistral-7B and Llama-3-8B SimPO results beat DPO by 4–6 points on AlpacaEval-2 LC, which suggests that if we already have channel-3 pairs, SimPO is a free upgrade.

Implementation cost

  • ~80 LOC for the trainer hook; the loss itself is ~15 lines (log-probs, length-normalise, margin, BCE).
  • Dependencies: nothing new — torch, transformers already in repo.
  • The reference implementation in princeton-nlp/SimPO is compact (scripts/run_simpo.py plus a trainer subclass under alignment/) and MIT-licensed, so we can vendor it exactly as we did with OPSD.
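The "length-normalise" step listed above is where most of the subtlety sits, because only response tokens should count towards |y|. A minimal sketch, assuming per-token log-probs and a 0/1 response mask are already available; the name avg_sequence_logprob_sketch and the list-based inputs are hypothetical and stand in for whatever tensor layout the trainer uses.

```python
from typing import Sequence


def avg_sequence_logprob_sketch(token_logprobs: Sequence[float],
                                mask: Sequence[int]) -> float:
    """Mean log-prob over response tokens only (positions where mask == 1)."""
    if len(token_logprobs) != len(mask):
        raise ValueError("token_logprobs and mask must have the same length")
    n_response = sum(mask)
    if n_response == 0:
        raise ValueError("mask selects no response tokens")
    total = sum(lp * m for lp, m in zip(token_logprobs, mask))
    return total / n_response


# Two prompt tokens (masked out) followed by three response tokens.
logps = [-9.0, -9.0, -0.2, -0.4, -0.6]
mask = [0, 0, 1, 1, 1]
avg = avg_sequence_logprob_sketch(logps, mask)
print(avg)
```

Dividing by the number of unmasked tokens, rather than the padded sequence length, keeps the implicit reward comparable between short and long responses.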

Candidate 2 — TAID (Temporally Adaptive Interpolated Distillation) ⭐ RECOMMENDED

Sources

  • arXiv: https://arxiv.org/abs/2501.16937 (Shing, Misaki, Bao, Yokoi, Akiba — Sakana AI, ICLR 2025)
  • GitHub: https://github.com/SakanaAI/TAID
    • License: Apache-2.0
    • 121 stars, last push 2025-10-06 (actively maintained)
    • Reference implementations of GKD, DistiLLM, Adaptive-KL, CTKD, DKD are also in src/distil_losses/ for free
  • Released artefacts: TAID-LLM-1.5B, TAID-VLM-2B on HuggingFace (so the loss is verified at non-trivial scale).
  • Maturity: published. Commits come from a single author, but the method was used to reproducibly train two SoTA compact models.

Loss core (interpolated teacher target)

Standard distillation losses (forward KL, reverse KL, JSD, including the generalized_jsd_loss we already have) target a fixed teacher distribution p_T. TAID replaces this fixed target with a time-dependent interpolated target p_t that starts close to the student and moves toward the teacher as training progresses:

p_t(y | x) = (1 − t) · q_θ_stop(y | x)  +  t · p_T(y | x)         (1)

J_TAID(θ; t) = D_KL( p_t ‖ q_θ )                                  (2)

q_θ_stop is the student's own current distribution with stop-gradient. The interpolation coefficient t ∈ [t_start, 1] is updated each step by an adaptive momentum schedule that grows t faster when training loss is falling and slower when it stalls; this is the "temporally adaptive" part. The Sakana paper shows (Theorem 4.1) that, for the regression analogue, this schedule prevents the mode-collapse failure mode of pure self-distillation.

Critically, D_KL(p_t ‖ q_θ) is just one choice of divergence applied to a shifted target: you can equally well plug in JSD, reverse KL, or the generalized_jsd_loss the framework already exports. TAID is therefore a wrapper around an existing divergence, not a competing divergence.
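Equations (1) and (2) can be illustrated on a single token position with plain Python. This is a sketch under assumptions: the helper names are hypothetical, the toy logits are made up, and plain floats have no autograd, so the stop-gradient on the student is only noted in a comment.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p: List[float], q: List[float]) -> float:
    """D_KL(p || q); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def taid_target_sketch(student_probs: List[float],
                       teacher_probs: List[float],
                       t: float) -> List[float]:
    """Eq. (1): p_t = (1 - t) * q_stop + t * p_T.

    In an autograd framework student_probs would be detached here.
    """
    return [(1.0 - t) * q + t * p for q, p in zip(student_probs, teacher_probs)]


student = softmax([0.0, 0.0, 0.0])   # uniform student
teacher = softmax([4.0, 0.0, -2.0])  # sharply peaked teacher

# Eq. (2) evaluated at the start, middle and end of the interpolation.
losses = [kl(taid_target_sketch(student, teacher, t), student)
          for t in (0.0, 0.5, 1.0)]
print(losses)
```

At t = 0 the target equals the student and the loss is zero; at t = 1 it is the ordinary KL against the teacher. The intermediate values are what give the student a gradual curriculum.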

Why it composes with the existing stack

  • It wraps composer_replication.opsd.generalized_jsd_loss rather than replacing it. The change is "compute the JSD against p_t instead of p_T" — a few lines around the existing call site.
  • Addresses a documented weakness of OPSD-style self-distillation: when the teacher's privileged-context distribution is far from the student's capacity, the JSD signal can be noisy or push the student into mode averaging. TAID's annealed target gives the student a curriculum.
  • Empirical evidence from the Sakana paper's direct comparison: TAID + JSD beats GKD + JSD, which in turn beats DistiLLM + skew-KL, on Phi-3 → TinyLlama distillation, at 0.7 h / epoch versus 9.8 h / epoch for GKD on identical hardware. The speed comes from not needing student-generated outputs (SGOs) at every step the way GKD does.
  • Composes additively with channel 1 (GRPO) and channel 3 (trace-replay DPO) because TAID lives strictly inside channel 2.

Implementation cost

  • ~150 LOC. The change is:
    1. A TAIDState object that holds t, the EMA of training loss, and the momentum coefficient β (default 0.99).
    2. A function taid_target(student_logits, teacher_logits, t) that returns (1−t)·softmax(student_logits.detach()) + t·softmax(teacher_logits).
    3. A scheduler hook that updates t after each backward pass per Algorithm 1 of the paper.
  • Dependencies: nothing new.
  • Reference implementation in SakanaAI/TAID/src/distil_losses/taid.py is Apache-2.0 — vendor-friendly, same pattern as our OPSD lift.
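The state object and scheduler hook from the list above might look roughly like this. It is a sketch, not a transcription of Algorithm 1: the exact update rule (a momentum over the relative loss improvement, squashed by a sigmoid), the class name TAIDStateSketch, and the defaults t_start = 0.4 and alpha = 5e-4 are assumptions that should be checked against the paper and the reference code before use. Only β = 0.99 comes from the text above.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class TAIDStateSketch:
    t: float = 0.4          # assumed t_start
    alpha: float = 5e-4     # assumed step-size for t
    beta: float = 0.99      # momentum coefficient
    momentum: float = 0.0   # EMA of the relative loss improvement
    prev_loss: Optional[float] = None

    def update(self, loss: float) -> float:
        """Advance t after one optimisation step and return the new value."""
        if self.prev_loss is not None:
            # Positive when the loss fell since the previous step.
            rel_gain = (self.prev_loss - loss) / (abs(self.prev_loss) + 1e-12)
            self.momentum = self.beta * self.momentum + (1.0 - self.beta) * rel_gain
            gate = 1.0 / (1.0 + math.exp(-self.momentum))
            # Step shrinks as t approaches 1, so t never overshoots.
            self.t = min(1.0, self.t + self.alpha * gate * (1.0 - self.t))
        self.prev_loss = loss
        return self.t


# Compare a run whose loss keeps falling with one whose loss is flat.
falling, stalled = TAIDStateSketch(), TAIDStateSketch()
loss = 10.0
for _ in range(200):
    falling.update(loss)
    stalled.update(10.0)
    loss *= 0.9
print(falling.t, stalled.t)
```

In this toy run t is non-decreasing in both cases and ends higher for the falling-loss run, which matches the qualitative behaviour described for the adaptive schedule.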

Candidate 3 — Entropy-Aware On-Policy Distillation (Entropy-Aware OPD) ⭐ RECOMMENDED

Sources

  • OpenReview (ICLR 2026 Spotlight): https://openreview.net/forum?id=WSRQ37tzk1
  • IBM Research page: https://research.ibm.com/publications/entropy-aware-on-policy-distillation-of-language-models
  • Authors: Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, Kimin Lee (KAIST + IBM Research)
  • Status: ICLR 2026 Spotlight, submission #113. License on the OpenReview record is CC BY 4.0.
  • Code: not yet released on GitHub at the time of audit (paper accepted 2026-03-03). IBM authors typically release within the conference window. Maturity flag: paper-ready, code-pending. This is the only candidate where we'd need to re-implement from the paper.

Loss core (entropy-gated forward/reverse KL mixture)

The paper diagnoses a failure mode in the reverse-KL-on-policy distillation recipe used by MiniLLM, OPSD, and (implicitly) by our SDPO channel: when the teacher distribution has high entropy at a given token, reverse KL's mode-seeking gradient becomes noisy and collapses the student's diversity. Their fix: at each token t, gate between forward and reverse KL based on the teacher's entropy:

H_t = − Σ_v p_T(v | x, y_<t) · log p_T(v | x, y_<t)        (teacher entropy)

α_t = sigmoid( (H_t − τ) / s )                              ∈ (0, 1)

L_EA(θ) = E_{y ~ q_θ} Σ_t [
    (1 − α_t) · D_KL( q_θ(· | x, y_<t) ‖ p_T(· | x, y_<t) )    ← reverse KL
  +     α_t   · D_KL( p_T(· | x, y_<t) ‖ q_θ(· | x, y_<t) )    ← forward KL
]

τ is an entropy threshold (default ≈ 1.0 nat in their experiments) and s is a temperature controlling how sharp the gate is. When the teacher is confident (H_t small → α_t ≈ 0) the loss is pure reverse KL, identical to MiniLLM/OPSD behaviour. When the teacher is uncertain (H_t large → α_t ≈ 1) the loss switches to forward KL, which is mode-covering and preserves student diversity.
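The gate can be checked on toy distributions. Since no reference code was available at audit time, the following is a sketch derived only from the formulas above: the function names are hypothetical, s = 0.1 is an assumed gate temperature (the text only gives τ ≈ 1.0 nat), and the distributions are made up for illustration.

```python
import math
from typing import List, Tuple


def entropy(p: List[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl(p: List[float], q: List[float]) -> float:
    """D_KL(p || q); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def entropy_gate(h: float, tau: float = 1.0, s: float = 0.1) -> float:
    """alpha_t = sigmoid((H_t - tau) / s)."""
    return 1.0 / (1.0 + math.exp(-(h - tau) / s))


def ea_token_loss_sketch(student: List[float], teacher: List[float],
                         tau: float = 1.0, s: float = 0.1) -> Tuple[float, float]:
    """Per-token entropy-gated mixture; returns (loss, alpha_t)."""
    alpha = entropy_gate(entropy(teacher), tau, s)
    reverse = kl(student, teacher)  # mode-seeking term
    forward = kl(teacher, student)  # mode-covering term
    return (1.0 - alpha) * reverse + alpha * forward, alpha


student = [0.7, 0.1, 0.1, 0.1]
confident_teacher = [0.97, 0.01, 0.01, 0.01]  # entropy well below tau
uncertain_teacher = [0.25, 0.25, 0.25, 0.25]  # entropy ln(4), above tau

loss_conf, alpha_conf = ea_token_loss_sketch(student, confident_teacher)
loss_unc, alpha_unc = ea_token_loss_sketch(student, uncertain_teacher)
print(alpha_conf, alpha_unc)
```

With these toy inputs the confident teacher yields α_t close to 0 (almost pure reverse KL) and the uniform teacher yields α_t close to 1 (almost pure forward KL).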

Reported gains over baseline reverse-KL OPD on Qwen3-0.6B/1.7B/4B: Pass@8 on six math benchmarks improves by +1.37 / +2.39 / +5.05 respectively. The larger gains at larger student size suggest the failure mode reverse KL exhibits gets worse with capacity, not better.

Why it composes with the existing stack

  • It is strictly token-wise: same trajectory, same teacher logits, same rollout pipeline as the existing channel 2. The only change is the loss reduction — instead of computing generalized_jsd_loss with a single fixed β, you compute a per-token mixture of forward and reverse KL with weight given by teacher entropy.
  • This is genuinely orthogonal to OPSD/SDPO. OPSD's contribution is privileged-context teacher distribution under student rollouts. EA-OPD's contribution is which divergence to use at each token of that distribution. Both can be true simultaneously.
  • Directly addresses a failure mode the framework's roadmap will hit: multi-teacher trace replay (channel 3) produces high-entropy aggregated teacher distributions at exactly the steps where teachers disagree. Those are the steps where reverse KL behaves worst. EA-OPD's entropy gate would automatically soften the loss on those exact tokens.
  • Composes with TAID (Candidate 2) too — they operate on different axes: TAID anneals the target distribution, EA-OPD chooses the divergence direction. Stacking is straightforward and proposed as ADR-007 follow-up.

Implementation cost

  • ~120 LOC estimate (no reference code to vendor yet).
  • Dependencies: nothing new. Token-level entropy is −(p * log p).sum(-1), forward KL is the existing teacher-on-student term, reverse KL is the student-on-teacher term we already compute for the JSD in OPSD. The work is re-shaping the existing per-token loss to expose both directions.
  • Risk note: code not yet public. We should hold this candidate behind a feature flag until the IBM/KAIST team releases reference code (expected by ICLR 2026 in May). If the implementation ships sooner we should vendor and match line-for-line; if not, we re-derive from the paper formula and add a unit test that reproduces their toy entropy-vs-divergence plot.

Honourable mention — KTO (Kahneman-Tversky Optimization)

  • arXiv: https://arxiv.org/abs/2402.01306
  • Code: integrated into HuggingFace trl library since v0.8 (Apache-2.0).
  • License/maturity: production. KTO is a standard trl trainer alongside DPO.

Loss core

KTO replaces preference pairs with per-output binary desirability signals. For a desirable output y_+ and undesirable output y_−:

r_θ(x, y) = β · log( π_θ(y|x) / π_ref(y|x) )

z_0 = β · E_{x'}[ KL( π_θ(·|x') ‖ π_ref(·|x') ) ]             (reference point, on the same β scale as r_θ)

L_KTO = E_{x, y_+} [λ_D · (1 − σ(r_θ(x, y_+) − z_0))]        (desirable)
      + E_{x, y_−} [λ_U · (1 − σ(z_0 − r_θ(x, y_−)))]        (undesirable)

with default λ_D = λ_U = 1. The derivation is via prospect theory: this is a Kahneman-Tversky utility function applied to the implicit reward. KTO matches DPO at 1B–30B even though it sees only 2n binary signals where DPO sees n pairs.
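A small numeric sketch of the two-sided KTO objective, assuming the implicit rewards r_θ and the reference point z_0 have already been computed and are expressed on the same scale. The function name and the toy reward values are hypothetical; for actual training the note points to the trl trainer rather than hand-rolled code.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def kto_loss_sketch(rewards_desirable: List[float],
                    rewards_undesirable: List[float],
                    z0: float,
                    lambda_d: float = 1.0,
                    lambda_u: float = 1.0) -> float:
    """Sum of the desirable and undesirable expectations from the formula above."""
    desirable = [lambda_d * (1.0 - sigmoid(r - z0)) for r in rewards_desirable]
    undesirable = [lambda_u * (1.0 - sigmoid(z0 - r)) for r in rewards_undesirable]
    return mean(desirable) + mean(undesirable)


z0 = 0.1
# Model that rewards desirable outputs and penalises undesirable ones.
aligned = kto_loss_sketch([2.0, 1.5], [-2.0, -1.0], z0)
# Model with the signs flipped.
misaligned = kto_loss_sketch([-2.0, -1.0], [2.0, 1.5], z0)
# Model whose rewards all sit exactly at the reference point.
neutral = kto_loss_sketch([z0], [z0], z0)
print(aligned, misaligned, neutral)
```

Each output contributes on its own, so the desirable and undesirable lists can have different lengths and need not be paired.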

Why we down-rank it relative to the top-3

KTO is the right answer only if the framework wants to ingest single-side trace signals (e.g., "this trace step succeeded" / "this step crashed the agent") without constructing pairs. The current research/05-trace-replay-distillation.md design does construct pairs from multi-teacher replay (that is the whole point of the multi-teacher variance signal), so the marginal value of KTO is small for channel 3 as specified. If the trace-replay design pivots toward absolute scores per step rather than relative pairs, KTO becomes the right loss and is already free from trl. Add to the backlog as conditional.


Audited but NOT recommended

GKD — Generalized Knowledge Distillation (Agarwal et al., 2023)

  • arXiv: https://arxiv.org/abs/2306.13649 (Google DeepMind)
  • Loss core: student samples its own outputs, teacher provides token probabilities, divergence is generalized JSD with parameter β:
    D_JSD(β)(P‖Q) = β·KL(P ‖ βP+(1−β)Q) + (1−β)·KL(Q ‖ βP+(1−β)Q)
    
  • Why excluded: this is exactly the formula we already have as composer_replication.opsd.generalized_jsd_loss (lifted from siyan-zhao/OPSD). GKD's contribution beyond the loss formula is the on-policy student sampling protocol — which OPSD also does. No incremental value to add.
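For reference, the generalized JSD formula quoted above can be evaluated directly on small distributions. This sketch only illustrates the formula; it is not the framework's generalized_jsd_loss, and the function name and toy inputs are hypothetical.

```python
import math
from typing import List


def kl(p: List[float], q: List[float]) -> float:
    """D_KL(p || q); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def generalized_jsd_sketch(p: List[float], q: List[float], beta: float = 0.5) -> float:
    """beta * KL(P || M) + (1 - beta) * KL(Q || M), with M = beta*P + (1-beta)*Q."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    mixture = [beta * pi + (1.0 - beta) * qi for pi, qi in zip(p, q)]
    return beta * kl(p, mixture) + (1.0 - beta) * kl(q, mixture)


teacher = [0.8, 0.15, 0.05]
student = [0.4, 0.4, 0.2]
print(generalized_jsd_sketch(teacher, student, beta=0.5))

# Disjoint supports: plain KL would be infinite, the JSD stays at ln 2.
print(generalized_jsd_sketch([1.0, 0.0], [0.0, 1.0], beta=0.5))
```

Because both KL terms are taken against the mixture M, the divergence stays finite even when the two distributions do not share support.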

DistiLLM (Ko et al., ICML 2024)

  • arXiv: https://arxiv.org/abs/2402.03898
  • GitHub: https://github.com/jongwooko/distillm — MIT, last push 2025-03
  • Loss core: Skew KL divergence KL(p ‖ λp + (1−λ)q) plus an adaptive off-policy student-generated-output (SGO) scheduler.
  • Why excluded: the skew KL has the same form as one term of generalized JSD (a KL taken against the mixture λp + (1−λ)q), so it belongs to the family the framework already has. The interesting contribution, the SGO scheduler, is a process optimisation, not a loss. The TAID paper's own ablation (Table 6) shows TAID > Skew KL across student sizes, so TAID dominates this candidate.
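The skew KL from the loss-core bullet is easy to compare with plain KL on a toy case. A sketch, assuming λ = 0.1 as an illustrative value; the function names and distributions are hypothetical.

```python
import math
from typing import List


def kl(p: List[float], q: List[float]) -> float:
    """D_KL(p || q); returns infinity if q is zero where p is not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total


def skew_kl_sketch(p: List[float], q: List[float], lam: float = 0.1) -> float:
    """KL(p || lam * p + (1 - lam) * q)."""
    mixture = [lam * pi + (1.0 - lam) * qi for pi, qi in zip(p, q)]
    return kl(p, mixture)


teacher = [0.5, 0.5]
student = [1.0, 0.0]  # student assigns zero mass to the second token

plain = kl(teacher, student)
skewed = skew_kl_sketch(teacher, student, lam=0.1)
print(plain, skewed)
```

Mixing a little of p into the second argument bounds the divergence by log(1/λ), which is why the skewed form stays finite where plain KL diverges.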

MiniLLM (Gu et al., ICLR 2024)

  • arXiv: https://arxiv.org/abs/2306.08543
  • GitHub: https://github.com/microsoft/LMOps/tree/main/minillm — MIT, repo active (last push 2026-04)
  • Loss core: reverse KL minimised by policy-gradient on student rollouts, with three optimisation tricks: single-step decomposition (variance reduction), teacher-mixed sampling (anti-reward-hacking), length normalisation.
  • Why excluded: reverse-KL on-policy distillation is the same recipe family as SDPO/OPSD the framework already implements. Adding MiniLLM would be a parallel implementation of the same idea, not an addition. Entropy-Aware OPD (Candidate 3) is a strict improvement over MiniLLM's pure reverse-KL on exactly the failure mode MiniLLM identifies (mode collapse in high-entropy regions).

Self-Rewarding Language Models (Yuan et al., 2024)

  • arXiv: https://arxiv.org/abs/2401.10020 (Meta + NYU)
  • Why excluded: SRLM is a training procedure (iterative DPO with the model judging its own outputs), not a loss. The actual loss is plain DPO, which the framework already supports. The procedural contribution belongs in a future ADR on data generation, not in the distillation channel.

Existence check for TAID (arXiv 2501.16937)

The user asked us to verify that the paper exists. It does: submitted 2025-01-28, published at ICLR 2025, with code at https://github.com/SakanaAI/TAID and two released checkpoints (TAID-LLM-1.5B, TAID-VLM-2B). Primary source confirmed.


2026 papers found

The targeted Exa search (category=research paper, startPublishedDate=2026-01-01) surfaced five 2026 distillation papers worth listing for completeness:

  1. Entropy-Aware On-Policy Distillation — ICLR 2026 Spotlight. ⭐ Promoted to top-3 above.
  2. KL for a KL: On-Policy Distillation with Control Variate Baseline (arXiv 2605.07865, Oh et al., 2026-05). Variance-reduction trick for on-policy KL distillation. Useful future read but not a new loss — it's a baseline subtraction added to MiniLLM-style policy gradient.
  3. Rethinking On-Policy Distillation: Phenomenology, Mechanism, and Recipe (https://github.com/thunlp/OPD, Tsinghua NLP, last push 2026-04). Empirical study, not a new loss formulation.
  4. Hybrid Policy Distillation for LLMs (ICML 2026 poster, Zhu et al.). Combines off-policy and on-policy distillation; positioned as a recipe rather than a new loss; abstract suggests strong overlap with TAID's annealing argument.
  5. Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation (ICML 2026 poster, Dasgupta et al.). Targets the long-tail of teacher distributions. Interesting but currently only an abstract; deferred until the camera-ready PDF is available.

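Apart from Entropy-Aware OPD, none of these is mature enough (released code + license + reproducible scale) to recommend adding right now; Entropy-Aware OPD itself is still code-pending, which is why it sits behind a feature flag in the recommendation above.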


Recommended follow-up wiring

For ADR-007 the proposed addition is a composer_replication.distillation sub-package with three pluggable hooks:

Realised in v0.1 (Wave 17 update): ADR-007 shipped a flatter layout than the proposal below. Actual exports:

composer_replication/
  distillation/
    __init__.py
    simpo.py              # simpo_loss(chosen_avg_logprobs, rejected_avg_logprobs, *, beta, gamma)
                          # avg_sequence_logprob(logprobs, mask)  -- helper
    taid.py               # taid_loss(student_logits, teacher_logits, t, *, ...)
                          # TAIDScheduler  -- adaptive momentum schedule per the paper
    entropy_aware_opd.py  # entropy_aware_opd_loss(student_logits, teacher_logits, *, h_max, ...)

No targets.py/losses.py split, no top-level preference/ package, and SimPO lives under distillation/ rather than preference/ because the three losses share a common dispatch surface (compose_loss's dpo_variant and sdpo_wrapper switches).

The composition rule realised in compose_loss is per-loss flag-driven, not a single composed-function call:

compose_loss(model, inputs,
    dpo_variant="simpo",          # OR "dpo" (default)
    sdpo_wrapper="taid",          # OR "entropy_opd" OR "none" (default)
    taid_t=0.5,                    # required when sdpo_wrapper="taid"
    simpo_beta=2.0, simpo_gamma=1.0,  # used only when dpo_variant="simpo"
    entropy_opd_h_max=...,         # used only when sdpo_wrapper="entropy_opd"
)
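The flag contract in the call above (two string switches plus a taid_t that is required only for the TAID wrapper) can be expressed as a small validation helper. This is a hypothetical sketch: check_compose_flags_sketch is not a function the framework is known to export, and it covers only the constraints stated in the comments of the call.

```python
from typing import Optional, Tuple

DPO_VARIANTS = {"dpo", "simpo"}
SDPO_WRAPPERS = {"none", "taid", "entropy_opd"}


def check_compose_flags_sketch(dpo_variant: str = "dpo",
                               sdpo_wrapper: str = "none",
                               taid_t: Optional[float] = None
                               ) -> Tuple[str, str, Optional[float]]:
    """Validate the switches and return them unchanged."""
    if dpo_variant not in DPO_VARIANTS:
        raise ValueError(f"unknown dpo_variant: {dpo_variant!r}")
    if sdpo_wrapper not in SDPO_WRAPPERS:
        raise ValueError(f"unknown sdpo_wrapper: {sdpo_wrapper!r}")
    if sdpo_wrapper == "taid":
        if taid_t is None:
            raise ValueError("taid_t is required when sdpo_wrapper='taid'")
        if not 0.0 <= taid_t <= 1.0:
            raise ValueError("taid_t must lie in [0, 1]")
    return dpo_variant, sdpo_wrapper, taid_t


defaults = check_compose_flags_sketch()
stacked = check_compose_flags_sketch("simpo", "taid", taid_t=0.5)
print(defaults, stacked)
```

Failing fast on a missing taid_t avoids silently falling back to the fixed teacher target when the TAID wrapper was requested.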

The pre-ADR proposal sketch below is preserved as historical context. The shipped function names are simpo_loss, taid_loss + TAIDScheduler, and entropy_aware_opd_loss (not taid_target / entropy_aware_kl_loss).

composer_replication/
  distillation/
    __init__.py
    targets.py        # taid_target(...), fixed_target(...)         ← Candidate 2
    losses.py         # reuses opsd.generalized_jsd_loss
                       # adds entropy_aware_kl_loss(...)             ← Candidate 3
  preference/
    simpo.py          # simpo_loss(...)                              ← Candidate 1
    dpo.py            # existing trace-replay path

The composition rule for the total loss becomes:

L_total =   λ_grpo · L_GRPO            (channel 1, unchanged)
        + λ_distill · L_distill        (channel 2, see below)
        +    λ_pref · L_pref           (channel 3, choose DPO or SimPO)

L_distill = entropy_aware_kl_loss(
                target = taid_target(student, teacher, t),
                student = student,
                teacher_entropy_gate = α_t
            )

This keeps the existing generalized_jsd_loss reachable as a fallback (set α_t ≡ 0 and t ≡ 1 and you recover SDPO/OPSD exactly).
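The fallback claim can be sanity-checked on one token position. The sketch below stacks the interpolated target with the entropy-gated mixture as in the historical proposal; all names are hypothetical and the distributions are toy values. In this toy version, setting α_t ≡ 0 and t ≡ 1 reduces the loss to plain reverse KL against the fixed teacher; whether that matches generalized_jsd_loss exactly depends on the β the framework uses, which is not shown here.

```python
import math
from typing import List


def kl(p: List[float], q: List[float]) -> float:
    """D_KL(p || q); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def stacked_distill_loss_sketch(student: List[float], teacher: List[float],
                                t: float, alpha: float) -> float:
    """Gated forward/reverse KL against the interpolated target p_t."""
    target = [(1.0 - t) * q + t * p for q, p in zip(student, teacher)]
    reverse = kl(student, target)
    forward = kl(target, student)
    return (1.0 - alpha) * reverse + alpha * forward


def total_loss_sketch(l_grpo: float, l_distill: float, l_pref: float,
                      lam_grpo: float = 1.0, lam_distill: float = 1.0,
                      lam_pref: float = 1.0) -> float:
    """Weighted sum over the three channels."""
    return lam_grpo * l_grpo + lam_distill * l_distill + lam_pref * l_pref


student = [0.6, 0.3, 0.1]
teacher = [0.2, 0.5, 0.3]

fallback = stacked_distill_loss_sketch(student, teacher, t=1.0, alpha=0.0)
warmup = stacked_distill_loss_sketch(student, teacher, t=0.0, alpha=0.7)
print(fallback, kl(student, teacher), warmup)
print(total_loss_sketch(0.5, fallback, 0.2, lam_distill=0.1))
```

At t = 0 the target coincides with the student, so the distillation term vanishes regardless of the gate value.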


Sources index

| Paper | arXiv | GitHub | License | Last push | Maturity |
|---|---|---|---|---|---|
| SimPO | https://arxiv.org/abs/2405.14734 | https://github.com/princeton-nlp/SimPO | MIT | 2024-10-12 | Production |
| TAID | https://arxiv.org/abs/2501.16937 | https://github.com/SakanaAI/TAID | Apache-2.0 | 2025-10-06 | Production |
| Entropy-Aware OPD | n/a (OpenReview WSRQ37tzk1) | code-pending | CC BY 4.0 (paper) | n/a | Paper-only |
| KTO | https://arxiv.org/abs/2402.01306 | huggingface/trl (built-in) | Apache-2.0 | continuous | Production |
| GKD | https://arxiv.org/abs/2306.13649 | (no official repo from authors; reproduced inside SakanaAI/TAID and jongwooko/distillm) | n/a | n/a | Reference only |
| DistiLLM | https://arxiv.org/abs/2402.03898 | https://github.com/jongwooko/distillm | (no LICENSE file at audit time) | 2025-03-13 | Research |
| MiniLLM | https://arxiv.org/abs/2306.08543 | https://github.com/microsoft/LMOps/tree/main/minillm | MIT | 2026-04-08 | Production |
| Self-Rewarding LM | https://arxiv.org/abs/2401.10020 | (no canonical repo; integrated into many forks) | n/a | n/a | Procedure, not a loss |

Notes for ADR-007 author

  1. SimPO and TAID can land independently and without coordination. They touch different files and do not compete.
  2. Entropy-Aware OPD should land last. Wait for the IBM/KAIST authors' code release; if it's not out by the time we want to ship the change, the formula is simple enough to re-derive but we should pin a unit test that reproduces the paper's Figure 3 entropy-vs-divergence behaviour.
  3. Do not also pull in GKD/DistiLLM/MiniLLM. Their loss contributions are strict subsets of what (TAID + Entropy-Aware OPD + existing generalized_jsd_loss) covers.
  4. KTO should be added as a backlog item with a "trigger" condition: when the trace-replay reward design moves from preference pairs to per-step binary signals, switch on the trl.KTOTrainer path.

Absolute path of this report: /mnt/e/CS/HF/composer-replication-framework/docs/research/SELF_DISTILLATION_LANDSCAPE.md