# Self-Distillation Landscape Audit (feeds ADR-007)

**Status:** research note, pre-experimental
**Author:** subagent audit
**Date:** 2026-05-25
**Scope:** identify 2–3 distillation-channel losses worth adding to
`composer_replication` alongside the existing GRPO + SDPO/OPSD `generalized_jsd_loss` +
multi-teacher trace-replay DPO stack.
**Bias:** additivity over novelty. We are looking for losses that COMPOSE with
what is already implemented, not duplicates of it.

---

## TL;DR — recommended additions

| Rank | Method | Loss role | License | LOC est. | Why it composes |
|------|--------|-----------|---------|----------|-----------------|
| 1 | **SimPO** (NeurIPS 2024) | Preference, reference-free | MIT | ~80 | Drop-in for trace-replay DPO; removes ref-model VRAM cost; orthogonal to JSD distillation channel |
| 2 | **TAID** (ICLR 2025) | Interpolated-target wrapper around any KL/JSD | Apache-2.0 | ~150 | Wraps the existing `generalized_jsd_loss` — does not replace it. Closes capacity gap on small students |
| 3 | **Entropy-Aware OPD** (ICLR 2026 Spotlight) | Token-gated forward/reverse KL mixture | CC BY 4.0 (paper); code expected | ~120 | Fixes a documented failure mode of the reverse-KL-style SDPO loss when teacher entropy is high — directly addresses a known weakness of channel 2 |

**Honourable mention:** KTO — useful only if the framework wants to ingest
binary thumbs-up/thumbs-down trace signals without preference pairs.
**Not recommended:** GKD, DistiLLM, MiniLLM, Self-Rewarding LM (rationale at end).

---

## Audit method

For each candidate paper (the seven the user named, plus 2026 follow-ups
discovered via Exa search restricted to `category=research paper, startPublishedDate=2026-01-01`)
we verified:

1. **Primary source exists.** arXiv abstract page reachable; HTML body parsed
   to extract the actual loss formula (not summarised from secondary sources).
2. **Code is real.** Official repo's README was fetched, `last push` date and
   star count recorded. Forks of MiniLLM/DistiLLM that are no longer maintained
   were marked as such.
3. **License is permissive enough.** MIT, Apache-2.0, BSD, CC BY 4.0 are
   acceptable for inclusion. GPL or research-only would be flagged.
4. **Composability check.** Read the framework's existing
   `composer_replication/__init__.py` and `research/05-trace-replay-distillation.md`,
   then asked: *does this loss replace something we have, or stack on top?*

---

## Candidate 1 — SimPO (Simple Preference Optimization) ⭐ RECOMMENDED

### Sources
- **arXiv:** https://arxiv.org/abs/2405.14734 (Meng, Xia, Chen — UVA + Princeton, NeurIPS 2024)
- **GitHub:** https://github.com/princeton-nlp/SimPO
  - License: **MIT**
  - 949 stars, 74 forks, last commit 2024-10-12 (mature, post-NeurIPS)
  - Built on top of `huggingface/alignment-handbook`
- Maturity: **production-ready**. Released checkpoints for Mistral, Llama-3, Gemma-2 base/instruct. Reproducible training configs ship with the repo.

### Loss core (reference-free preference)
SimPO replaces the DPO log-ratio (which requires keeping `π_ref` in memory)
with the **average log-probability** of the sequence under the policy, plus
a **target reward margin** γ:

```
r(x, y) = (β / |y|) · log π_θ(y | x)        ← length-normalised implicit reward
                                               (no reference model)

L_SimPO(π_θ) = −E_{(x, y_w, y_l) ~ D} [
    log σ( r(x, y_w) − r(x, y_l) − γ )
]
```

where `β` is a temperature (typically 2.0–10) and `γ` is the desired margin
between chosen and rejected (the repo recommends `γ/β ≈ 0.5` as a starting
point). Two consequences: (i) there is no `π_ref` forward pass per step, which
removes the reference model's memory footprint; and (ii) the implicit reward is
the same length-normalised log-likelihood that guides generation at decode
time, which removes a known DPO pathology where decoding-time and
training-time rewards diverge.
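
The loss above can be checked by hand on a single preference pair. The following is a minimal framework-free sketch, not the `princeton-nlp/SimPO` trainer code: the function names (`avg_logprob`, `log_sigmoid`, `simpo_pair_loss`) and the toy log-probabilities are assumptions made for illustration, and the defaults `beta=2.0`, `gamma=1.0` are chosen only to match the `γ/β ≈ 0.5` starting point.

```python
import math


def avg_logprob(token_logprobs):
    """(1/|y|) * sum of per-token log-probs: the length-normalised sequence score."""
    if not token_logprobs:
        raise ValueError("sequence must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def simpo_pair_loss(chosen_logprobs, rejected_logprobs, beta=2.0, gamma=1.0):
    """-log sigmoid(r(x, y_w) - r(x, y_l) - gamma) for one preference pair."""
    r_w = beta * avg_logprob(chosen_logprobs)    # implicit reward of the chosen output
    r_l = beta * avg_logprob(rejected_logprobs)  # implicit reward of the rejected output
    return -log_sigmoid(r_w - r_l - gamma)


# Made-up per-token log-probs under the policy (illustrative only).
chosen = [-0.2, -0.4, -0.3]
rejected = [-1.0, -1.4, -1.2, -1.2]
loss = simpo_pair_loss(chosen, rejected)
swapped = simpo_pair_loss(rejected, chosen)  # mislabelled pair: loss should be larger
```

Because both rewards are per-token averages, the rejected sequence being one token longer does not change its score by itself; only the policy's average confidence and the margin `γ` enter the loss.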

### Why it composes with the existing stack
- The framework's **channel 3** is multi-teacher trace-replay DPO. SimPO is a
  drop-in replacement for the DPO step inside that channel — same `(x, y_w, y_l)`
  data contract, different loss head. So the trace-replay harvester does not
  change at all.
- It does **not** touch channel 2 (SDPO/OPSD `generalized_jsd_loss`). The two
  are complementary: JSD-distillation transfers token-level teacher knowledge,
  SimPO sharpens preference structure between trace alternatives.
- It does **not** duplicate GRPO either. GRPO is on-policy RLVR;
  SimPO is offline preference. Different data sources.
- The published Mistral-7B and Llama-3-8B SimPO results beat DPO by 4–6 points
  on AlpacaEval-2 LC, which directly translates to "if we already have channel-3
  pairs, SimPO is a free upgrade".

### Implementation cost
- **~80 LOC** for the trainer hook; the loss itself is ~15 lines (log-probs,
  length-normalise, margin, BCE).
- Dependencies: nothing new — `torch`, `transformers` already in repo.
- The reference implementation in `princeton-nlp/SimPO` is compact
  (`scripts/run_simpo.py` + `alignment/` trainer subclass) and MIT-licensed, so
  we can vendor it exactly as we did with OPSD.

---

## Candidate 2 — TAID (Temporally Adaptive Interpolated Distillation) ⭐ RECOMMENDED

### Sources
- **arXiv:** https://arxiv.org/abs/2501.16937 (Shing, Misaki, Bao, Yokoi, Akiba — Sakana AI, ICLR 2025)
- **GitHub:** https://github.com/SakanaAI/TAID
  - License: **Apache-2.0**
  - 121 stars, last push 2025-10-06 (actively maintained)
  - Reference implementations of GKD, DistiLLM, Adaptive-KL, CTKD, DKD are also in `src/distil_losses/` for free
- Released artefacts: `TAID-LLM-1.5B`, `TAID-VLM-2B` on HuggingFace (so the loss is verified at non-trivial scale).
- Maturity: **published, single-author commits**, but the authors reproducibly trained two SoTA compact models with it.

### Loss core (interpolated teacher target)
Standard distillation losses (forward KL, reverse KL, JSD, including the
`generalized_jsd_loss` we already have) target a **fixed** teacher distribution
`p_T`. TAID replaces this fixed target with a **time-dependent interpolated
target** `p_t` that starts close to the student and moves toward the teacher
as training progresses:

```
p_t(y | x) = (1 − t) · q_θ_stop(y | x)  +  t · p_T(y | x)         (1)

J_TAID(θ; t) = D_KL( p_t ‖ q_θ )                                  (2)
```

`q_θ_stop` is the student's own current distribution with stop-gradient. The
interpolation coefficient `t ∈ [t_start, 1]` is updated each step by an
**adaptive momentum schedule** that grows `t` faster when training loss is
falling and slower when it stalls — this is the "temporally adaptive" part.
The Sakana paper proves (Theorem 4.1) that, for the regression analogue, this
schedule prevents the mode-collapse failure mode of pure self-distillation.

Critically, the KL in (2) is not essential: it is just one choice of divergence
against the shifted target. You can equally well plug in JSD, reverse KL, or
**the generalized_jsd_loss the framework already exports**. TAID is therefore a
*wrapper around an existing divergence*, not a competing divergence.
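
Equations (1) and (2) can be illustrated on a single vocabulary distribution. This is a minimal pure-Python sketch that follows the probability-space interpolation exactly as written in this note; the helper names and toy logits are assumptions, and in an autograd framework the student distribution used inside the target would additionally be detached (the `q_θ_stop` term).

```python
import math


def softmax(logits):
    """Stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """D_KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def taid_target(student_probs, teacher_probs, t):
    """Eq. (1): p_t = (1 - t) * q_stop + t * p_T, with the student treated as a constant."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [(1.0 - t) * q + t * p for q, p in zip(student_probs, teacher_probs)]


# Toy logits (illustrative only): student and teacher disagree on the top token.
student = softmax([2.0, 0.0, -1.0])
teacher = softmax([-1.0, 0.5, 2.0])

# Eq. (2) evaluated at three points of the schedule.
losses = [kl(taid_target(student, teacher, t), student) for t in (0.0, 0.5, 1.0)]
```

At `t = 0` the target equals the student and the loss vanishes; at `t = 1` it reduces to plain forward KL against the teacher. Swapping `kl` for another divergence changes only the last line, which is the sense in which TAID wraps a divergence rather than replacing it.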

### Why it composes with the existing stack
- It **wraps** `composer_replication.opsd.generalized_jsd_loss` rather than
  replacing it. The change is "compute the JSD against `p_t` instead of
  `p_T`" — a few lines around the existing call site.
- Addresses a documented weakness of OPSD-style self-distillation: when the
  teacher's privileged-context distribution is far from the student's
  capacity, the JSD signal can be noisy or push the student into mode
  averaging. TAID's annealed target gives the student a curriculum.
- Empirical evidence from the Sakana paper's direct comparison: TAID + JSD
  beats GKD + JSD, which in turn beats DistiLLM + skew-KL, on Phi-3 → TinyLlama
  distillation, with **0.7 h / epoch** for TAID vs **9.8 h / epoch** for GKD on
  identical hardware. The speed comes from not needing student-generated
  outputs (SGOs) at every step the way GKD does.
- Composes additively with channel 1 (GRPO) and channel 3 (trace-replay DPO)
  because TAID lives strictly inside channel 2.

### Implementation cost
- **~150 LOC**. The change is:
  1. A `TAIDState` object that holds `t`, the EMA of training loss, and the
     momentum coefficient β (default 0.99).
  2. A function `taid_target(student_logits, teacher_logits, t)` that returns
     `(1−t)·softmax(student_logits.detach()) + t·softmax(teacher_logits)`.
  3. A scheduler hook that updates `t` after each backward pass per
     Algorithm 1 of the paper.
- Dependencies: nothing new.
- Reference implementation in `SakanaAI/TAID/src/distil_losses/taid.py` is
  Apache-2.0 — vendor-friendly, same pattern as our OPSD lift.
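
The state object in step 1 might look roughly like the sketch below. Note the assumptions: the update rule here is a hypothetical stand-in written only to show the shape of the state (it grows `t` faster when the smoothed loss is falling), it is not a reproduction of Algorithm 1 of the paper, and `t`'s starting value, `step_size`, and the method name `update` are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TAIDState:
    """Holds the interpolation coefficient t and an EMA of the training loss."""

    t: float = 0.4                    # hypothetical t_start
    momentum: float = 0.99            # EMA coefficient (beta in the list above)
    step_size: float = 5e-4           # hypothetical base increment per step
    loss_ema: Optional[float] = None  # smoothed training loss, unset before step 1

    def update(self, loss: float) -> float:
        """Illustrative rule (NOT the paper's Algorithm 1): advance t after a step."""
        if self.loss_ema is None:
            self.loss_ema = loss      # first step only initialises the EMA
            return self.t
        prev = self.loss_ema
        self.loss_ema = self.momentum * prev + (1.0 - self.momentum) * loss
        # Relative improvement of the smoothed loss, clipped at zero.
        improvement = max(0.0, (prev - self.loss_ema) / (abs(prev) + 1e-8))
        # Base growth plus a bonus while the loss is still falling; never exceed 1.
        self.t = min(1.0, self.t + self.step_size * (1.0 + improvement))
        return self.t


falling, stalled = TAIDState(), TAIDState()
for step_loss in [4.0, 3.0, 2.0, 1.0]:
    falling.update(step_loss)
for step_loss in [4.0, 4.0, 4.0, 4.0]:
    stalled.update(step_loss)
```

When vendoring, the body of `update` would be replaced by the rule from the reference implementation; the point of the sketch is only that the scheduler is a small, stateful object that lives outside the loss function.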

---

## Candidate 3 — Entropy-Aware On-Policy Distillation (Entropy-Aware OPD) ⭐ RECOMMENDED

### Sources
- **OpenReview (ICLR 2026 Spotlight):** https://openreview.net/forum?id=WSRQ37tzk1
- **IBM Research page:** https://research.ibm.com/publications/entropy-aware-on-policy-distillation-of-language-models
- Authors: Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, Kimin Lee (KAIST + IBM Research)
- Status: **ICLR 2026 Spotlight**, submission #113. License on the OpenReview record is **CC BY 4.0**.
- Code: not yet released on GitHub at the time of audit (paper accepted 2026-03-03). IBM authors typically release within the conference window. **Maturity flag: paper-ready, code-pending.** This is the only candidate where we'd need to re-implement from the paper.

### Loss core (entropy-gated forward/reverse KL mixture)
The paper diagnoses a failure mode in the reverse-KL-on-policy distillation
recipe used by MiniLLM, OPSD, and (implicitly) by our SDPO channel: when the
**teacher distribution has high entropy at a given token**, reverse KL's
mode-seeking gradient becomes noisy and collapses the student's diversity.
Their fix: at each token `t`, gate between forward and reverse KL based on
the teacher's entropy:

```
H_t = − Σ_v p_T(v | x, y_<t) · log p_T(v | x, y_<t)        (teacher entropy)

α_t = sigmoid( (H_t − τ) / s )                              ∈ (0, 1)

L_EA(θ) = E_{y ~ q_θ} Σ_t [
    (1 − α_t) · D_KL( q_θ(· | x, y_<t) ‖ p_T(· | x, y_<t) )    ← reverse KL
  +     α_t   · D_KL( p_T(· | x, y_<t) ‖ q_θ(· | x, y_<t) )    ← forward KL
]
```

`τ` is an entropy threshold (default ≈ 1.0 nat in their experiments) and `s`
is a temperature controlling how sharp the gate is. When the teacher is
confident (`H_t` small → `α_t ≈ 0`) the loss is pure reverse KL, identical to
MiniLLM/OPSD behaviour. When the teacher is uncertain (`H_t` large → `α_t ≈ 1`)
the loss switches to forward KL, which is mode-covering and preserves
student diversity.
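
Since no reference code is public, the gate can only be sketched from the formula above. The following pure-Python example evaluates one token position; the function names, the toy distributions, and the gate temperature `s=0.1` are assumptions for illustration (only `τ ≈ 1.0` nat comes from the text above).

```python
import math


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def kl(p, q):
    """D_KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def entropy_gate(teacher_probs, tau=1.0, s=0.1):
    """alpha_t = sigmoid((H_t - tau) / s); s=0.1 is an assumed value."""
    h = entropy(teacher_probs)
    return 1.0 / (1.0 + math.exp(-(h - tau) / s))


def entropy_aware_token_loss(student_probs, teacher_probs, tau=1.0, s=0.1):
    """(1 - alpha) * reverse KL + alpha * forward KL at a single token position."""
    alpha = entropy_gate(teacher_probs, tau, s)
    reverse_kl = kl(student_probs, teacher_probs)  # D_KL(q || p_T), mode-seeking
    forward_kl = kl(teacher_probs, student_probs)  # D_KL(p_T || q), mode-covering
    return (1.0 - alpha) * reverse_kl + alpha * forward_kl, alpha


student = [0.7, 0.1, 0.1, 0.1]
confident_teacher = [0.97, 0.01, 0.01, 0.01]  # low entropy: gate stays near 0
uncertain_teacher = [0.25, 0.25, 0.25, 0.25]  # entropy ln(4) > tau: gate near 1

loss_c, alpha_c = entropy_aware_token_loss(student, confident_teacher)
loss_u, alpha_u = entropy_aware_token_loss(student, uncertain_teacher)
```

The sequence-level loss is the sum of this quantity over the tokens of a student rollout. Both KL directions use the same pair of distributions, so the extra cost over a reverse-KL-only loss is one entropy and one additional KL per token.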

Reported gains over baseline reverse-KL OPD on Qwen3-0.6B/1.7B/4B: Pass@8 on
six math benchmarks improves by +1.37 / +2.39 / +5.05 respectively. The gains
grow with student size, which suggests that the reverse-KL failure mode gets
*worse* with capacity, not better.

### Why it composes with the existing stack
- It is **strictly token-wise**: same trajectory, same teacher logits, same
  rollout pipeline as the existing channel 2. The only change is the loss
  reduction — instead of computing `generalized_jsd_loss` with a single fixed
  β, you compute a per-token mixture of forward and reverse KL with weight
  given by teacher entropy.
- This is genuinely orthogonal to OPSD/SDPO. OPSD's contribution is
  *privileged-context teacher distribution under student rollouts*. EA-OPD's
  contribution is *which divergence to use at each token of that distribution*.
  Both can be true simultaneously.
- Directly addresses a failure mode the framework's roadmap will hit:
  multi-teacher trace replay (channel 3) produces high-entropy aggregated
  teacher distributions at exactly the steps where teachers disagree. Those
  are the steps where reverse KL behaves worst. EA-OPD's entropy gate would
  automatically soften the loss on those exact tokens.
- Composes with TAID (Candidate 2) too — they operate on different axes:
  TAID anneals the *target distribution*, EA-OPD chooses the *divergence
  direction*. Stacking is straightforward and proposed as ADR-007 follow-up.

### Implementation cost
- **~120 LOC** estimate (no reference code to vendor yet).
- Dependencies: nothing new. Token-level entropy is `−(p * log p).sum(-1)`,
  forward KL is the existing teacher-on-student term, reverse KL is the
  student-on-teacher term we already compute for the JSD in OPSD. The work is
  re-shaping the existing per-token loss to expose both directions.
- **Risk note:** code not yet public. We should hold this candidate behind a
  feature flag until the IBM/KAIST team releases reference code (expected by
  ICLR 2026 in May). If the implementation ships sooner we should vendor and
  match line-for-line; if not, we re-derive from the paper formula and add a
  unit test that reproduces their toy entropy-vs-divergence plot.

---

## Honourable mention — KTO (Kahneman-Tversky Optimization)

- **arXiv:** https://arxiv.org/abs/2402.01306
- **Code:** integrated into HuggingFace `trl` library since v0.8 (Apache-2.0).
- License/maturity: **production**. KTO is a standard `trl` trainer alongside DPO.

### Loss core
KTO replaces preference pairs with **per-output binary desirability** signals.
For a desirable output `y_+` and undesirable output `y_−`:

```
r_θ(x, y) = log( π_θ(y|x) / π_ref(y|x) )

z_0 = E_{x', y' ~ π_θ}[ KL( π_θ(·|x') ‖ π_ref(·|x') ) ]      (reference point)

L_KTO = E_{x, y_+} [λ_D · (1 − σ(β · (r_θ(x, y_+) − z_0)))]  (desirable)
      + E_{x, y_−} [λ_U · (1 − σ(β · (z_0 − r_θ(x, y_−))))]  (undesirable)
```

with default `λ_D = λ_U = 1`. The derivation is via prospect theory: this is
a Kahneman-Tversky utility function applied to the implicit reward. KTO
matches DPO at 1B–30B even though it learns only from unpaired binary
desirability labels rather than from preference pairs.

### Why we down-rank it relative to the top-3
KTO is the right answer **only if** the framework wants to ingest single-side
trace signals (e.g., "this trace step succeeded" / "this step crashed the
agent") without constructing pairs. The current
`research/05-trace-replay-distillation.md` design **does** construct pairs
from multi-teacher replay (that is the whole point of the multi-teacher
variance signal), so the marginal value of KTO is small *for channel 3 as
specified*. If the trace-replay design pivots toward absolute scores per
step rather than relative pairs, KTO becomes the right loss and is already
free from `trl`. Add to the backlog as conditional.

---

## Audited but NOT recommended

### GKD — Generalized Knowledge Distillation (Agarwal et al., 2023)
- **arXiv:** https://arxiv.org/abs/2306.13649 (Google DeepMind)
- **Loss core:** student samples its own outputs, teacher provides token
  probabilities, divergence is generalized JSD with parameter β:
  ```
  D_JSD(β)(P‖Q) = β·KL(P ‖ βP+(1−β)Q) + (1−β)·KL(Q ‖ βP+(1−β)Q)
  ```
- **Why excluded:** **this is exactly the formula we already have** as
  `composer_replication.opsd.generalized_jsd_loss` (lifted from
  `siyan-zhao/OPSD`). GKD's contribution beyond the loss formula is the
  on-policy student sampling protocol — which OPSD also does. No incremental
  value to add.
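
For reference, the generalized JSD formula above is small enough to evaluate directly. The sketch below is a pure-Python illustration on toy distributions; it is not the framework's `generalized_jsd_loss` (whose signature is not shown in this note), and the function names are assumptions.

```python
import math


def kl(p, q):
    """D_KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def generalized_jsd(p, q, beta=0.5):
    """beta * KL(P || M) + (1 - beta) * KL(Q || M), with M = beta*P + (1 - beta)*Q."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    m = [beta * pi + (1.0 - beta) * qi for pi, qi in zip(p, q)]
    return beta * kl(p, m) + (1.0 - beta) * kl(q, m)


teacher = [0.7, 0.2, 0.1]  # P (toy values)
student = [0.1, 0.3, 0.6]  # Q (toy values)

jsd_half = generalized_jsd(teacher, student, beta=0.5)  # the ordinary JSD
jsd_skew = generalized_jsd(teacher, student, beta=0.9)
```

At `beta = 0.5` this is the ordinary symmetric Jensen-Shannon divergence, bounded above by `ln 2`; other values of `beta` weight the two KL terms and the mixture unequally.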

### DistiLLM (Ko et al., ICML 2024)
- **arXiv:** https://arxiv.org/abs/2402.03898
- **GitHub:** https://github.com/jongwooko/distillm — MIT, last push 2025-03
- **Loss core:** *Skew KL divergence* `KL(p ‖ λp + (1−λ)q)` plus an *adaptive
  off-policy* student-generated-output (SGO) scheduler.
- **Why excluded:** the skew-KL is one term of the generalized JSD above
  (`KL(P ‖ βP+(1−β)Q)` with `λ = β`), so it belongs to the same family the
  framework already has. The interesting contribution, the SGO scheduler, is a
  process optimisation, not a loss. The TAID paper's own ablation (Table 6) shows
  TAID > Skew KL across student sizes, so TAID dominates this candidate.
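
The "same family" claim can be made concrete: with the skew KL defined as above, the generalized JSD with parameter `β` decomposes as `β · SKL_β(P, Q) + (1−β) · SKL_{1−β}(Q, P)`. The sketch below checks this identity numerically on toy distributions; the function names are assumptions and the code is independent of the DistiLLM repository.

```python
import math


def kl(p, q):
    """D_KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def skew_kl(p, q, lam):
    """Skew KL: KL(p || lam*p + (1 - lam)*q)."""
    mix = [lam * pi + (1.0 - lam) * qi for pi, qi in zip(p, q)]
    return kl(p, mix)


def generalized_jsd(p, q, beta):
    """beta * KL(P || M) + (1 - beta) * KL(Q || M), with M = beta*P + (1 - beta)*Q."""
    m = [beta * pi + (1.0 - beta) * qi for pi, qi in zip(p, q)]
    return beta * kl(p, m) + (1.0 - beta) * kl(q, m)


teacher = [0.7, 0.2, 0.1]  # P (toy values)
student = [0.1, 0.3, 0.6]  # Q (toy values)
beta = 0.3

lhs = generalized_jsd(teacher, student, beta)
rhs = beta * skew_kl(teacher, student, beta) + (1.0 - beta) * skew_kl(student, teacher, 1.0 - beta)
```

Both sides use the same mixture `βP + (1−β)Q`, so they agree up to floating-point rounding.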

### MiniLLM (Gu et al., ICLR 2024)
- **arXiv:** https://arxiv.org/abs/2306.08543
- **GitHub:** https://github.com/microsoft/LMOps/tree/main/minillm — MIT, repo
  active (last push 2026-04)
- **Loss core:** reverse KL minimised by policy-gradient on student rollouts,
  with three optimisation tricks: single-step decomposition (variance
  reduction), teacher-mixed sampling (anti-reward-hacking), length
  normalisation.
- **Why excluded:** reverse-KL on-policy distillation **is the same recipe
  family as SDPO/OPSD** the framework already implements. Adding MiniLLM
  would be a parallel implementation of the same idea, not an addition.
  Entropy-Aware OPD (Candidate 3) is a *strict improvement* over MiniLLM's
  pure reverse-KL on exactly the failure mode MiniLLM identifies (mode
  collapse in high-entropy regions).

### Self-Rewarding Language Models (Yuan et al., 2024)
- **arXiv:** https://arxiv.org/abs/2401.10020 (Meta + NYU)
- **Why excluded:** SRLM is a *training procedure* (iterative DPO with the
  model judging its own outputs), not a loss. The actual loss is plain DPO,
  which the framework already supports. The procedural contribution belongs
  in a future ADR on data generation, not in the distillation channel.

### Existence check for TAID (arXiv 2501.16937)
The user asked us to verify existence. **It exists.** Submitted 2025-01-28,
ICLR 2025, code at https://github.com/SakanaAI/TAID with two released
checkpoints (`TAID-LLM-1.5B`, `TAID-VLM-2B`). Confirmed primary source.

---

## 2026 papers found

The targeted Exa search (`category=research paper`, `startPublishedDate=2026-01-01`)
surfaced five 2026 distillation papers worth listing for completeness:

1. **Entropy-Aware On-Policy Distillation** — ICLR 2026 Spotlight. ⭐ Promoted to top-3 above.
2. **KL for a KL: On-Policy Distillation with Control Variate Baseline** (arXiv 2605.07865, Oh et al., 2026-05). Variance-reduction trick for on-policy KL distillation. Useful future read but not a new loss — it's a baseline subtraction added to MiniLLM-style policy gradient.
3. **Rethinking On-Policy Distillation: Phenomenology, Mechanism, and Recipe** (https://github.com/thunlp/OPD, Tsinghua NLP, last push 2026-04). Empirical study, not a new loss formulation.
4. **Hybrid Policy Distillation for LLMs** (ICML 2026 poster, Zhu et al.). Combines off-policy and on-policy distillation; positioned as a recipe rather than a new loss; abstract suggests strong overlap with TAID's annealing argument.
5. **Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation** (ICML 2026 poster, Dasgupta et al.). Targets the long-tail of teacher distributions. Interesting but currently only an abstract; deferred until the camera-ready PDF is available.

Apart from Entropy-Aware OPD, which is recommended above despite its pending
code release, none of these is mature enough (released code + license +
reproducible scale) to recommend adding right now.

---

## Recommended follow-up wiring

For ADR-007 the proposed addition is a `composer_replication.distillation`
sub-package with three pluggable hooks:

> **Realised in v0.1 (Wave 17 update):** ADR-007 shipped a flatter
> layout than the proposal below. Actual exports:
>
> ```
> composer_replication/
>   distillation/
>     __init__.py
>     simpo.py              # simpo_loss(chosen_avg_logprobs, rejected_avg_logprobs, *, beta, gamma)
>                           # avg_sequence_logprob(logprobs, mask)  -- helper
>     taid.py               # taid_loss(student_logits, teacher_logits, t, *, ...)
>                           # TAIDScheduler  -- adaptive momentum schedule per the paper
>     entropy_aware_opd.py  # entropy_aware_opd_loss(student_logits, teacher_logits, *, h_max, ...)
> ```
>
> No `targets.py`/`losses.py` split, no top-level `preference/` package,
> and SimPO lives under `distillation/` rather than `preference/` because
> the three losses share a common dispatch surface (`compose_loss`'s
> `dpo_variant` and `sdpo_wrapper` switches).
>
> The composition rule realised in `compose_loss` is per-loss flag-driven,
> not a single composed-function call:
>
> ```python
> compose_loss(model, inputs,
>     dpo_variant="simpo",          # OR "dpo" (default)
>     sdpo_wrapper="taid",          # OR "entropy_opd" OR "none" (default)
>     taid_t=0.5,                    # required when sdpo_wrapper="taid"
>     simpo_beta=2.0, simpo_gamma=1.0,  # used only when dpo_variant="simpo"
>     entropy_opd_h_max=...,         # used only when sdpo_wrapper="entropy_opd"
> )
> ```
>
> The pre-ADR proposal sketch below is preserved as historical context.
> The shipped function names are `simpo_loss`, `taid_loss` +
> `TAIDScheduler`, and `entropy_aware_opd_loss` (not `taid_target` /
> `entropy_aware_kl_loss`).

```
composer_replication/
  distillation/
    __init__.py
    targets.py        # taid_target(...), fixed_target(...)         ← Candidate 2
    losses.py         # reuses opsd.generalized_jsd_loss
                       # adds entropy_aware_kl_loss(...)             ← Candidate 3
  preference/
    simpo.py          # simpo_loss(...)                              ← Candidate 1
    dpo.py            # existing trace-replay path
```

The composition rule for the total loss becomes:

```
L_total =   λ_grpo · L_GRPO            (channel 1, unchanged)
        + λ_distill · L_distill        (channel 2, see below)
        +    λ_pref · L_pref           (channel 3, choose DPO or SimPO)

L_distill = entropy_aware_kl_loss(
                target = taid_target(student, teacher, t),
                student = student,
                teacher_entropy_gate = α_t
            )
```

This keeps the existing SDPO/OPSD behaviour reachable as a fallback: setting
`α_t ≡ 0` and `t ≡ 1` reduces `L_distill` to pure reverse KL against the fixed
teacher, which is the reverse-KL-style objective of channel 2.
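
The fallback can be checked numerically on one token position. The sketch below is one possible reading of the proposal pseudocode, not the shipped `compose_loss`: it assumes the entropy-aware loss mixes reverse and forward KL against the TAID target, and every name in it (`distill_token_loss`, `total_loss`, the weights) is hypothetical.

```python
import math


def kl(p, q):
    """D_KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def taid_target(student, teacher, t):
    """p_t = (1 - t) * q_stop + t * p_T (student treated as a constant)."""
    return [(1.0 - t) * q + t * p for q, p in zip(student, teacher)]


def distill_token_loss(student, teacher, t, alpha):
    """Assumed composition: entropy-gated KL mixture evaluated against the TAID target."""
    target = taid_target(student, teacher, t)
    reverse_kl = kl(student, target)
    forward_kl = kl(target, student)
    return (1.0 - alpha) * reverse_kl + alpha * forward_kl


def total_loss(l_grpo, l_distill, l_pref, w_grpo=1.0, w_distill=1.0, w_pref=1.0):
    """L_total as a weighted sum of the three channels (weights are placeholders)."""
    return w_grpo * l_grpo + w_distill * l_distill + w_pref * l_pref


student = [0.6, 0.3, 0.1]  # toy values
teacher = [0.2, 0.3, 0.5]  # toy values

fallback = distill_token_loss(student, teacher, t=1.0, alpha=0.0)
reverse_kl_only = kl(student, teacher)
at_start = distill_token_loss(student, teacher, t=0.0, alpha=0.7)
```

With `t = 1` and `alpha = 0` the composed loss equals reverse KL against the fixed teacher, so the new hooks can be switched off without changing channel 2. With `t = 0` the target coincides with the student and the loss is zero for any gate value, which is why the schedule starts at `t_start > 0`.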

---

## Sources index

| Paper | arXiv | GitHub | License | Last push | Maturity |
|-------|-------|--------|---------|-----------|----------|
| SimPO | https://arxiv.org/abs/2405.14734 | https://github.com/princeton-nlp/SimPO | MIT | 2024-10-12 | Production |
| TAID | https://arxiv.org/abs/2501.16937 | https://github.com/SakanaAI/TAID | Apache-2.0 | 2025-10-06 | Production |
| Entropy-Aware OPD | n/a (OpenReview WSRQ37tzk1) | code-pending | CC BY 4.0 (paper) | n/a | Paper-only |
| KTO | https://arxiv.org/abs/2402.01306 | huggingface/trl (built-in) | Apache-2.0 | continuous | Production |
| GKD | https://arxiv.org/abs/2306.13649 | (no official repo from authors; reproduced inside SakanaAI/TAID and jongwooko/distillm) | n/a | n/a | Reference only |
| DistiLLM | https://arxiv.org/abs/2402.03898 | https://github.com/jongwooko/distillm | (no LICENSE file at audit time) | 2025-03-13 | Research |
| MiniLLM | https://arxiv.org/abs/2306.08543 | https://github.com/microsoft/LMOps/tree/main/minillm | MIT | 2026-04-08 | Production |
| Self-Rewarding LM | https://arxiv.org/abs/2401.10020 | (no canonical repo; integrated into many forks) | n/a | n/a | Procedure, not a loss |

---

## Notes for ADR-007 author

1. **SimPO and TAID can land independently and without coordination.** They
   touch different files and do not compete.
2. **Entropy-Aware OPD should land last.** Wait for the IBM/KAIST authors'
   code release; if it's not out by the time we want to ship the change, the
   formula is simple enough to re-derive but we should pin a unit test that
   reproduces the paper's Figure 3 entropy-vs-divergence behaviour.
3. **Do not also pull in GKD/DistiLLM/MiniLLM.** Their loss contributions are
   strict subsets of what (TAID + Entropy-Aware OPD + existing
   `generalized_jsd_loss`) covers.
4. **KTO should be added as a backlog item** with a "trigger" condition:
   when the trace-replay reward design moves from preference pairs to per-step
   binary signals, switch on the `trl.KTOTrainer` path.

---

*Absolute path of this report:* `/mnt/e/CS/HF/composer-replication-framework/docs/research/SELF_DISTILLATION_LANDSCAPE.md`