# Self-Distillation Landscape Audit (feeds ADR-007)

**Status:** research note, pre-experimental
**Author:** subagent audit
**Date:** 2026-05-25
**Scope:** identify 2–3 distillation-channel losses worth adding to
`composer_replication` alongside the existing GRPO + SDPO/OPSD `generalized_jsd_loss` +
multi-teacher trace-replay DPO stack.
**Bias:** additivity over novelty. We are looking for losses that COMPOSE with
what is already implemented, not duplicates of it.

---

## TL;DR — recommended additions

| Rank | Method | Loss role | License | LOC est. | Why it composes |
|------|--------|-----------|---------|----------|-----------------|
| 1 | **SimPO** (NeurIPS 2024) | Preference, reference-free | MIT | ~80 | Drop-in for trace-replay DPO; removes ref-model VRAM cost; orthogonal to JSD distillation channel |
| 2 | **TAID** (ICLR 2025) | Interpolated-target wrapper around any KL/JSD | Apache-2.0 | ~150 | Wraps the existing `generalized_jsd_loss` — does not replace it. Closes capacity gap on small students |
| 3 | **Entropy-Aware OPD** (ICLR 2026 Spotlight) | Token-gated forward/reverse KL mixture | CC BY 4.0 (paper); code expected | ~120 | Fixes a documented failure mode of the reverse-KL-style SDPO loss when teacher entropy is high — directly addresses a known weakness of channel 2 |

**Honourable mention:** KTO — useful only if the framework wants to ingest
binary thumbs-up/thumbs-down trace signals without preference pairs.
**Not recommended:** GKD, DistiLLM, MiniLLM, Self-Rewarding LM (rationale at end).

---

## Audit method

For each candidate paper (the seven the user named, plus 2026 follow-ups
discovered via Exa search restricted to `category=research paper, startPublishedDate=2026-01-01`)
we verified:

1. **Primary source exists.** arXiv abstract page reachable; HTML body parsed
   to extract the actual loss formula (not summarised from secondary sources).
2. **Code is real.** Official repo's README was fetched, `last push` date and
   star count recorded. Forks of MiniLLM/DistiLLM that are no longer maintained
   were marked as such.
3. **License is permissive enough.** MIT, Apache-2.0, BSD, CC BY 4.0 are
   acceptable for inclusion. GPL or research-only would be flagged.
4. **Composability check.** Read the framework's existing
   `composer_replication/__init__.py` and `research/05-trace-replay-distillation.md`,
   then asked: *does this loss replace something we have, or stack on top?*

---

## Candidate 1 — SimPO (Simple Preference Optimization) ⭐ RECOMMENDED

### Sources
- **arXiv:** https://arxiv.org/abs/2405.14734 (Meng, Xia, Chen — UVA + Princeton, NeurIPS 2024)
- **GitHub:** https://github.com/princeton-nlp/SimPO
  - License: **MIT**
  - 949 stars, 74 forks, last commit 2024-10-12 (mature, post-NeurIPS)
  - Built on top of `huggingface/alignment-handbook`
- Maturity: **production-ready**. Released checkpoints for Mistral, Llama-3, Gemma-2 base/instruct. Reproducible training configs ship with the repo.

### Loss core (reference-free preference)
SimPO replaces the DPO log-ratio (which requires keeping `π_ref` in memory)
with the **average log-probability** of the sequence under the policy, plus
a **target reward margin** γ:

```
r(x, y) = (β / |y|) · log π_θ(y | x)        ← length-normalised implicit reward
                                               (no reference model)

L_SimPO(π_θ) = −E_{(x, y_w, y_l) ~ D} [
    log σ( r(x, y_w) − r(x, y_l) − γ )
]
```

where `β` is a temperature (typically 2.0–10) and `γ` is the desired margin
between chosen and rejected (the repo recommends `γ/β ≈ 0.5` as a starting
point). Two consequences follow: (i) there is no `π_ref` forward pass per step,
which saves roughly half the model memory, and (ii) the implicit reward is the
length-normalised log-likelihood that also guides generation at decode time,
which removes a known DPO pathology where decoding-time and training-time
rewards diverge.
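
The formula above can be sketched for a single preference pair in plain Python, which keeps the arithmetic visible. This is a minimal illustration, not the shipped or vendored implementation: `avg_logprob` and `simpo_pair_loss` are hypothetical names, the per-token log-probs are made-up numbers, and a real trainer would operate on batched tensors with a padding mask.

```python
import math


def avg_logprob(token_logprobs):
    """Length-normalised sequence log-probability: (1/|y|) * log pi(y | x)."""
    return sum(token_logprobs) / len(token_logprobs)


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def simpo_pair_loss(chosen_token_logprobs, rejected_token_logprobs,
                    beta=2.0, gamma=1.0):
    """SimPO loss for one (y_w, y_l) pair; no reference model is involved."""
    r_w = beta * avg_logprob(chosen_token_logprobs)
    r_l = beta * avg_logprob(rejected_token_logprobs)
    return -log_sigmoid(r_w - r_l - gamma)


# Toy per-token log-probs under the policy (made-up numbers).
chosen = [-0.2, -0.1, -0.3]
rejected = [-1.0, -1.4, -0.9, -1.1]

loss_correct = simpo_pair_loss(chosen, rejected)
loss_swapped = simpo_pair_loss(rejected, chosen)
print(f"policy prefers chosen:   {loss_correct:.4f}")
print(f"policy prefers rejected: {loss_swapped:.4f}")
```

Because the reward is an average rather than a sum of log-probs, the longer rejected sequence is not penalised merely for its length; only its per-token likelihood matters.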

### Why it composes with the existing stack
- The framework's **channel 3** is multi-teacher trace-replay DPO. SimPO is a
  drop-in replacement for the DPO step inside that channel — same `(x, y_w, y_l)`
  data contract, different loss head. So the trace-replay harvester does not
  change at all.
- It does **not** touch channel 2 (SDPO/OPSD `generalized_jsd_loss`). The two
  are complementary: JSD-distillation transfers token-level teacher knowledge,
  SimPO sharpens preference structure between trace alternatives.
- It does **not** duplicate GRPO either. GRPO is on-policy RLVR;
  SimPO is offline preference optimisation. Different data sources.
- The published Mistral-7B and Llama-3-8B SimPO results beat DPO by 4–6 points
  on AlpacaEval-2 LC, which directly translates to "if we already have channel-3
  pairs, SimPO is a free upgrade".

### Implementation cost
- **~80 LOC** for the trainer hook; the loss itself is ~15 lines (log-probs,
  length-normalise, margin, BCE).
- Dependencies: nothing new — `torch`, `transformers` already in repo.
- The reference implementation in `princeton-nlp/SimPO` is compact
  (`scripts/run_simpo.py` + `alignment/` trainer subclass) and MIT-licensed, so
  we can vendor it exactly as we did with OPSD.

---

## Candidate 2 — TAID (Temporally Adaptive Interpolated Distillation) ⭐ RECOMMENDED

### Sources
- **arXiv:** https://arxiv.org/abs/2501.16937 (Shing, Misaki, Bao, Yokoi, Akiba — Sakana AI, ICLR 2025)
- **GitHub:** https://github.com/SakanaAI/TAID
  - License: **Apache-2.0**
  - 121 stars, last push 2025-10-06 (actively maintained)
  - Reference implementations of GKD, DistiLLM, Adaptive-KL, CTKD, DKD are also in `src/distil_losses/` for free
- Released artefacts: `TAID-LLM-1.5B`, `TAID-VLM-2B` on HuggingFace (so the loss is verified at non-trivial scale).
- Maturity: **published, single-author commits**, but the authors reproducibly trained two SoTA compact models with it.

### Loss core (interpolated teacher target)
Standard distillation losses (forward KL, reverse KL, JSD, including the
`generalized_jsd_loss` we already have) target a **fixed** teacher distribution
`p_T`. TAID replaces this fixed target with a **time-dependent interpolated
target** `p_t` that starts close to the student and moves toward the teacher
as training progresses:

```
p_t(y | x) = (1 − t) · q_θ_stop(y | x)  +  t · p_T(y | x)         (1)

J_TAID(θ; t) = D_KL( p_t ‖ q_θ )                                  (2)
```

`q_θ_stop` is the student's own current distribution with stop-gradient. The
interpolation coefficient `t ∈ [t_start, 1]` is updated each step by an
**adaptive momentum schedule** that grows `t` faster when training loss is
falling and slower when it stalls — this is the "temporally adaptive" part.
The Sakana paper proves (Theorem 4.1) that, for the regression analogue, this
schedule prevents the mode-collapse failure mode of pure
self-distillation.

Critically, `D_KL(p_t ‖ q_θ)` is just one choice of divergence against the
shifted target: you can equally well plug in JSD, reverse KL, or **the
generalized_jsd_loss the framework already exports**. TAID is therefore a
*wrapper around an existing divergence*, not a competing divergence.
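
To make equations (1) and (2) concrete, here is a small stdlib-only sketch of the interpolated target as this note writes it (a probability-space mixture). The logits are made-up numbers, `kl` is a hypothetical helper, and stop-gradient has no meaning in plain Python, so the student distribution is simply treated as a constant when building the target.

```python
import math


def softmax(logits):
    """Softmax with max-subtraction for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def taid_target(student_logits, teacher_logits, t):
    """p_t = (1 - t) * q_student + t * p_teacher, per equation (1) above."""
    q = softmax(student_logits)  # would be detached in an autograd framework
    p = softmax(teacher_logits)
    return [(1.0 - t) * qi + t * pi for qi, pi in zip(q, p)]


def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Toy 3-token vocabulary where student and teacher disagree strongly.
student = [2.0, 0.5, -1.0]
teacher = [0.0, 3.0, 0.5]
q = softmax(student)

for t in (0.0, 0.5, 1.0):
    p_t = taid_target(student, teacher, t)
    print(f"t={t:.1f}  KL(p_t || q) = {kl(p_t, q):.4f}")
```

At `t = 0` the target equals the student and the loss vanishes; at `t = 1` it is ordinary forward KL against the teacher. Intermediate values give the student a nearer target, which is the curriculum effect described above.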

### Why it composes with the existing stack
- It **wraps** `composer_replication.opsd.generalized_jsd_loss` rather than
  replacing it. The change is "compute the JSD against `p_t` instead of
  `p_T`" — a few lines around the existing call site.
- Addresses a documented weakness of OPSD-style self-distillation: when the
  teacher's privileged-context distribution is far from the student's
  capacity, the JSD signal can be noisy or push the student into mode
  averaging. TAID's annealed target gives the student a curriculum.
- Empirical evidence from the Sakana paper's head-to-head comparison: TAID + JSD
  beats GKD + JSD, which in turn beats DistiLLM + skew-KL, on Phi-3 → TinyLlama
  distillation, with TAID at **0.7 h / epoch** vs **9.8 h / epoch** for GKD on
  identical hardware. The speed comes from not needing student-generated
  outputs (SGOs) at every step the way GKD does.
- Composes additively with channel 1 (GRPO) and channel 3 (trace-replay DPO)
  because TAID lives strictly inside channel 2.

### Implementation cost
- **~150 LOC**. The change is:
  1. A `TAIDState` object that holds `t`, the EMA of training loss, and the
     momentum coefficient β (default 0.99).
  2. A function `taid_target(student_logits, teacher_logits, t)` that returns
     `(1−t)·softmax(student_logits.detach()) + t·softmax(teacher_logits)`.
  3. A scheduler hook that updates `t` after each backward pass per
     Algorithm 1 of the paper.
- Dependencies: nothing new.
- Reference implementation in `SakanaAI/TAID/src/distil_losses/taid.py` is
  Apache-2.0 — vendor-friendly, same pattern as our OPSD lift.
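
The state object and scheduler hook from the list above could look roughly like the following. This is a sketch under stated assumptions: the update rule is written only to match the prose description (grow `t` faster while the loss is falling, slower when it stalls) and is not a transcription of Algorithm 1; `taid_step`, `alpha`, and the starting value of `t` are hypothetical, and only the momentum default of 0.99 comes from this note.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class TAIDState:
    t: float = 0.4                     # interpolation coefficient (illustrative start)
    momentum: float = 0.0              # smoothed relative loss improvement
    prev_loss: Optional[float] = None  # loss seen at the previous step
    beta: float = 0.99                 # momentum coefficient (default quoted above)
    alpha: float = 0.01                # step size (hypothetical value)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def taid_step(state, loss, eps=1e-12):
    """Advance t after one optimisation step; t never decreases or exceeds 1."""
    if state.prev_loss is not None:
        # Positive when the loss fell since the last step.
        rel_improvement = (state.prev_loss - loss) / (state.prev_loss + eps)
        state.momentum = (state.beta * state.momentum
                          + (1.0 - state.beta) * rel_improvement)
        # Bigger momentum means a bigger share of the remaining gap to 1.
        step = state.alpha * sigmoid(state.momentum) * (1.0 - state.t)
        state.t = min(1.0, state.t + step)
    state.prev_loss = loss
    return state.t


def run(losses):
    state = TAIDState()
    for loss in losses:
        taid_step(state, loss)
    return state.t


t_falling = run([0.9 ** k for k in range(50)])  # loss shrinks 10% per step
t_flat = run([1.0] * 50)                        # loss has stalled
print(f"t after falling loss: {t_falling:.4f}")
print(f"t after flat loss:    {t_flat:.4f}")
```

Before vendoring, the constants and the exact update should be checked against the paper and the reference `taid.py`.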

---

## Candidate 3 — Entropy-Aware On-Policy Distillation (Entropy-Aware OPD) ⭐ RECOMMENDED

### Sources
- **OpenReview (ICLR 2026 Spotlight):** https://openreview.net/forum?id=WSRQ37tzk1
- **IBM Research page:** https://research.ibm.com/publications/entropy-aware-on-policy-distillation-of-language-models
- Authors: Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, Kimin Lee (KAIST + IBM Research)
- Status: **ICLR 2026 Spotlight**, submission #113. License on the OpenReview record is **CC BY 4.0**.
- Code: not yet released on GitHub at the time of audit (paper accepted 2026-03-03). IBM authors typically release within the conference window. **Maturity flag: paper-ready, code-pending.** This is the only candidate where we'd need to re-implement from the paper.

### Loss core (entropy-gated forward/reverse KL mixture)
The paper diagnoses a failure mode in the reverse-KL-on-policy distillation
recipe used by MiniLLM, OPSD, and (implicitly) by our SDPO channel: when the
**teacher distribution has high entropy at a given token**, reverse KL's
mode-seeking gradient becomes noisy and collapses the student's diversity.
Their fix: at each token `t`, gate between forward and reverse KL based on
the teacher's entropy:

```
H_t = − Σ_v p_T(v | x, y_<t) · log p_T(v | x, y_<t)        (teacher entropy)

α_t = sigmoid( (H_t − τ) / s )                              ∈ (0, 1)

L_EA(θ) = E_{y ~ q_θ} Σ_t [
    (1 − α_t) · D_KL( q_θ(· | x, y_<t) ‖ p_T(· | x, y_<t) )    ← reverse KL
  +     α_t   · D_KL( p_T(· | x, y_<t) ‖ q_θ(· | x, y_<t) )    ← forward KL
]
```

`τ` is an entropy threshold (default ≈ 1.0 nat in their experiments) and `s`
is a temperature controlling how sharp the gate is. When the teacher is
confident (`H_t` small → `α_t ≈ 0`) the loss is pure reverse KL, identical to
MiniLLM/OPSD behaviour. When the teacher is uncertain (`H_t` large → `α_t ≈ 1`)
the loss switches to forward KL, which is mode-covering and preserves
student diversity.
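
A per-token version of the gate can be sketched in plain Python as follows. The function names are hypothetical, the distributions are toy values, and the gate temperature `s = 0.1` is an illustrative choice (this note only records the default threshold `τ ≈ 1.0` nat).

```python
import math


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def kl(p, q):
    """KL(p || q); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def entropy_gate(teacher_probs, tau=1.0, s=0.1):
    """alpha_t = sigmoid((H_t - tau) / s): near 0 when confident, near 1 when uncertain."""
    h = entropy(teacher_probs)
    return 1.0 / (1.0 + math.exp(-(h - tau) / s))


def ea_token_loss(student_probs, teacher_probs, tau=1.0, s=0.1):
    """(1 - alpha) * reverse KL + alpha * forward KL for one token position."""
    alpha = entropy_gate(teacher_probs, tau, s)
    reverse_kl = kl(student_probs, teacher_probs)  # mode-seeking
    forward_kl = kl(teacher_probs, student_probs)  # mode-covering
    return (1.0 - alpha) * reverse_kl + alpha * forward_kl, alpha


student = [0.7, 0.1, 0.1, 0.1]
confident_teacher = [0.97, 0.01, 0.01, 0.01]  # low entropy
uncertain_teacher = [0.25, 0.25, 0.25, 0.25]  # entropy = ln 4

loss_conf, alpha_conf = ea_token_loss(student, confident_teacher)
loss_unc, alpha_unc = ea_token_loss(student, uncertain_teacher)
print(f"confident teacher: alpha={alpha_conf:.4f} loss={loss_conf:.4f}")
print(f"uncertain teacher: alpha={alpha_unc:.4f} loss={loss_unc:.4f}")
```

In a full loss these per-token values would be summed over the positions of a student rollout, as in the expectation above.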

Reported gains over baseline reverse-KL OPD on Qwen3-0.6B/1.7B/4B: Pass@8 on
six math benchmarks improves by +1.37 / +2.39 / +5.05 respectively. The
larger gains at larger student sizes suggest that reverse KL's failure mode
gets *worse* with capacity, not better.

### Why it composes with the existing stack
- It is **strictly token-wise**: same trajectory, same teacher logits, same
  rollout pipeline as the existing channel 2. The only change is the loss
  reduction — instead of computing `generalized_jsd_loss` with a single fixed
  β, you compute a per-token mixture of forward and reverse KL with weight
  given by teacher entropy.
- This is genuinely orthogonal to OPSD/SDPO. OPSD's contribution is
  *privileged-context teacher distribution under student rollouts*. EA-OPD's
  contribution is *which divergence to use at each token of that distribution*.
  Both can be true simultaneously.
- Directly addresses a failure mode the framework's roadmap will hit:
  multi-teacher trace replay (channel 3) produces high-entropy aggregated
  teacher distributions at exactly the steps where teachers disagree. Those
  are the steps where reverse KL behaves worst. EA-OPD's entropy gate would
  automatically soften the loss on those exact tokens.
- Composes with TAID (Candidate 2) too — they operate on different axes:
  TAID anneals the *target distribution*, EA-OPD chooses the *divergence
  direction*. Stacking is straightforward and proposed as ADR-007 follow-up.

### Implementation cost
- **~120 LOC** estimate (no reference code to vendor yet).
- Dependencies: nothing new. Token-level entropy is `−(p * log p).sum(-1)`,
  forward KL is the existing teacher-on-student term, reverse KL is the
  student-on-teacher term we already compute for the JSD in OPSD. The work is
  re-shaping the existing per-token loss to expose both directions.
- **Risk note:** code not yet public. We should hold this candidate behind a
  feature flag until the IBM/KAIST team releases reference code (expected by
  ICLR 2026 in May). If the implementation ships sooner we should vendor and
  match line-for-line; if not, we re-derive from the paper formula and add a
  unit test that reproduces their toy entropy-vs-divergence plot.

---

## Honourable mention — KTO (Kahneman-Tversky Optimization)

- **arXiv:** https://arxiv.org/abs/2402.01306
- **Code:** integrated into HuggingFace `trl` library since v0.8 (Apache-2.0).
- License/maturity: **production**. KTO is a standard `trl` trainer alongside DPO.

### Loss core
KTO replaces preference pairs with **per-output binary desirability** signals.
For a desirable output `y_+` and undesirable output `y_−`:

```
r_θ(x, y) = β · log( π_θ(y|x) / π_ref(y|x) )

z_0 = E_{x', y' ~ π_θ}[ KL( π_θ(·|x') ‖ π_ref(·|x') ) ]      (reference point)

L_KTO = E_{x, y_+} [λ_D · (1 − σ(r_θ(x, y_+) − z_0))]        (desirable)
      + E_{x, y_−} [λ_U · (1 − σ(z_0 − r_θ(x, y_−)))]        (undesirable)
```

with default `λ_D = λ_U = 1`. The derivation is via prospect theory: this is
a Kahneman-Tversky utility function applied to the implicit reward. KTO
matches DPO at 1B–30B scale even though it learns only from per-output binary
labels (`n` preference pairs unroll into `2n` such labels) rather than from
paired comparisons.
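
The KTO formula above can be sketched over precomputed implicit rewards. This is an illustration only: `kto_loss` is a hypothetical helper rather than the `trl` API, the rewards `r_θ(x, y)` and the reference point `z_0` are passed in as plain numbers, and the values used are made up.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mean(xs):
    return sum(xs) / len(xs) if xs else 0.0


def kto_loss(desirable_rewards, undesirable_rewards, z0,
             lambda_d=1.0, lambda_u=1.0):
    """Mean desirable term plus mean undesirable term, as in the formula above."""
    desirable = [lambda_d * (1.0 - sigmoid(r - z0)) for r in desirable_rewards]
    undesirable = [lambda_u * (1.0 - sigmoid(z0 - r)) for r in undesirable_rewards]
    return mean(desirable) + mean(undesirable)


# Rewards are beta * log(pi_theta / pi_ref), computed elsewhere.
well_separated = kto_loss([2.0, 1.5], [-1.0, -2.0], z0=0.1)
inverted = kto_loss([-1.0, -2.0], [2.0, 1.5], z0=0.1)
print(f"desirable outputs rewarded:  {well_separated:.4f}")
print(f"undesirable outputs rewarded: {inverted:.4f}")
```

Note that the two lists need not be paired or even the same length, which is what makes this loss usable with single-sided trace signals.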

### Why we down-rank it relative to the top-3
KTO is the right answer **only if** the framework wants to ingest single-side
trace signals (e.g., "this trace step succeeded" / "this step crashed the
agent") without constructing pairs. The current
`research/05-trace-replay-distillation.md` design **does** construct pairs
from multi-teacher replay (that is the whole point of the multi-teacher
variance signal), so the marginal value of KTO is small *for channel 3 as
specified*. If the trace-replay design pivots toward absolute scores per
step rather than relative pairs, KTO becomes the right loss and is already
free from `trl`. Add to the backlog as conditional.

---

## Audited but NOT recommended

### GKD — Generalized Knowledge Distillation (Agarwal et al., 2023)
- **arXiv:** https://arxiv.org/abs/2306.13649 (Google DeepMind)
- **Loss core:** student samples its own outputs, teacher provides token
  probabilities, divergence is generalized JSD with parameter β:
  ```
  D_JSD(β)(P‖Q) = β·KL(P ‖ βP+(1−β)Q) + (1−β)·KL(Q ‖ βP+(1−β)Q)
  ```
- **Why excluded:** **this is exactly the formula we already have** as
  `composer_replication.opsd.generalized_jsd_loss` (lifted from
  `siyan-zhao/OPSD`). GKD's contribution beyond the loss formula is the
  on-policy student sampling protocol — which OPSD also does. No incremental
  value to add.
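
For reference, the generalized JSD formula quoted in the GKD entry can be checked numerically with a few lines of plain Python. `generalized_jsd` here is a hypothetical stand-in written from that formula; it is not the framework's `generalized_jsd_loss`, whose signature is not shown in this note.

```python
import math


def kl(p, q):
    """KL(p || q); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def generalized_jsd(p, q, beta=0.5):
    """beta * KL(P || M) + (1 - beta) * KL(Q || M), with M = beta*P + (1-beta)*Q."""
    m = [beta * pi + (1.0 - beta) * qi for pi, qi in zip(p, q)]
    return beta * kl(p, m) + (1.0 - beta) * kl(q, m)


teacher = [0.7, 0.2, 0.1]
student = [0.1, 0.3, 0.6]

for beta in (0.1, 0.5, 0.9):
    print(f"beta={beta:.1f}  D_JSD = {generalized_jsd(teacher, student, beta):.4f}")
```

At `β = 0.5` this is the ordinary symmetric Jensen-Shannon divergence, bounded above by `ln 2`; other values of `β` skew the mixture toward one of the two distributions.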

### DistiLLM (Ko et al., ICML 2024)
- **arXiv:** https://arxiv.org/abs/2402.03898
- **GitHub:** https://github.com/jongwooko/distillm — MIT, last push 2025-03
- **Loss core:** *Skew KL divergence* `KL(p ‖ λp + (1−λ)q)` plus an *adaptive
  off-policy* student-generated-output (SGO) scheduler.
- **Why excluded:** the skew-KL is, up to a weighting factor, the first term
  of the generalized JSD (take `λ = β` in `KL(P ‖ βP+(1−β)Q)`), so it belongs
  to the same family the framework already has. The interesting contribution,
  the SGO scheduler, is a process optimisation, not a loss. The TAID paper's
  own ablation (Table 6) shows TAID > Skew KL across student sizes, so TAID
  dominates this candidate.

### MiniLLM (Gu et al., ICLR 2024)
- **arXiv:** https://arxiv.org/abs/2306.08543
- **GitHub:** https://github.com/microsoft/LMOps/tree/main/minillm — MIT, repo
  active (last push 2026-04)
- **Loss core:** reverse KL minimised by policy-gradient on student rollouts,
  with three optimisation tricks: single-step decomposition (variance
  reduction), teacher-mixed sampling (anti-reward-hacking), length
  normalisation.
- **Why excluded:** reverse-KL on-policy distillation **is the same recipe
  family as SDPO/OPSD** the framework already implements. Adding MiniLLM
  would be a parallel implementation of the same idea, not an addition.
  Entropy-Aware OPD (Candidate 3) is a *strict improvement* over MiniLLM's
  pure reverse-KL on exactly the failure mode MiniLLM identifies (mode
  collapse in high-entropy regions).

### Self-Rewarding Language Models (Yuan et al., 2024)
- **arXiv:** https://arxiv.org/abs/2401.10020 (Meta + NYU)
- **Why excluded:** SRLM is a *training procedure* (iterative DPO with the
  model judging its own outputs), not a loss. The actual loss is plain DPO,
  which the framework already supports. The procedural contribution belongs
  in a future ADR on data generation, not in the distillation channel.

### Existence check: TAID (arXiv 2501.16937)
The user asked us to verify existence. **It exists.** Submitted 2025-01-28,
ICLR 2025, code at https://github.com/SakanaAI/TAID with two released
checkpoints (`TAID-LLM-1.5B`, `TAID-VLM-2B`). Confirmed primary source.

---

## 2026 papers found

The targeted Exa search (`category=research paper`, `startPublishedDate=2026-01-01`)
surfaced five 2026 distillation papers worth listing for completeness:

1. **Entropy-Aware On-Policy Distillation** — ICLR 2026 Spotlight. ⭐ Promoted to top-3 above.
2. **KL for a KL: On-Policy Distillation with Control Variate Baseline** (arXiv 2605.07865, Oh et al., 2026-05). Variance-reduction trick for on-policy KL distillation. Useful future read but not a new loss — it's a baseline subtraction added to MiniLLM-style policy gradient.
3. **Rethinking On-Policy Distillation: Phenomenology, Mechanism, and Recipe** (https://github.com/thunlp/OPD, Tsinghua NLP, last push 2026-04). Empirical study, not a new loss formulation.
4. **Hybrid Policy Distillation for LLMs** (ICML 2026 poster, Zhu et al.). Combines off-policy and on-policy distillation; positioned as a recipe rather than a new loss; abstract suggests strong overlap with TAID's annealing argument.
5. **Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation** (ICML 2026 poster, Dasgupta et al.). Targets the long-tail of teacher distributions. Interesting but currently only an abstract; deferred until the camera-ready PDF is available.

Apart from Entropy-Aware OPD (promoted above despite its pending code release),
none of these is mature enough (released code + license + reproducible scale)
to recommend adding right now.

---

## Recommended follow-up wiring

For ADR-007 the proposed addition is a `composer_replication.distillation`
sub-package with three pluggable hooks:

> **Realised in v0.1 (Wave 17 update):** ADR-007 shipped a flatter
> layout than the proposal below. Actual exports:
>
> ```
> composer_replication/
>   distillation/
>     __init__.py
>     simpo.py              # simpo_loss(chosen_avg_logprobs, rejected_avg_logprobs, *, beta, gamma)
>                           # avg_sequence_logprob(logprobs, mask)  -- helper
>     taid.py               # taid_loss(student_logits, teacher_logits, t, *, ...)
>                           # TAIDScheduler  -- adaptive momentum schedule per the paper
>     entropy_aware_opd.py  # entropy_aware_opd_loss(student_logits, teacher_logits, *, h_max, ...)
> ```
>
> No `targets.py`/`losses.py` split, no top-level `preference/` package,
> and SimPO lives under `distillation/` rather than `preference/` because
> the three losses share a common dispatch surface (`compose_loss`'s
> `dpo_variant` and `sdpo_wrapper` switches).
>
> The composition rule realised in `compose_loss` is per-loss flag-driven,
> not a single composed-function call:
>
> ```python
> compose_loss(model, inputs,
>     dpo_variant="simpo",          # OR "dpo" (default)
>     sdpo_wrapper="taid",          # OR "entropy_opd" OR "none" (default)
>     taid_t=0.5,                    # required when sdpo_wrapper="taid"
>     simpo_beta=2.0, simpo_gamma=1.0,  # used only when dpo_variant="simpo"
>     entropy_opd_h_max=...,         # used only when sdpo_wrapper="entropy_opd"
> )
> ```
>
> The pre-ADR proposal sketch below is preserved as historical context.
> The shipped function names are `simpo_loss`, `taid_loss` +
> `TAIDScheduler`, and `entropy_aware_opd_loss` (not `taid_target` /
> `entropy_aware_kl_loss`).

```
composer_replication/
  distillation/
    __init__.py
    targets.py        # taid_target(...), fixed_target(...)         ← Candidate 2
    losses.py         # reuses opsd.generalized_jsd_loss
                       # adds entropy_aware_kl_loss(...)             ← Candidate 3
  preference/
    simpo.py          # simpo_loss(...)                              ← Candidate 1
    dpo.py            # existing trace-replay path
```

The composition rule for the total loss becomes:

```
L_total =   λ_grpo · L_GRPO            (channel 1, unchanged)
        + λ_distill · L_distill        (channel 2, see below)
        +    λ_pref · L_pref           (channel 3, choose DPO or SimPO)

L_distill = entropy_aware_kl_loss(
                target = taid_target(student, teacher, t),
                student = student,
                teacher_entropy_gate = α_t
            )
```

This keeps the existing `generalized_jsd_loss` reachable as a fallback
(set `α_t ≡ 0` and `t ≡ 1` and you recover SDPO/OPSD exactly).
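
The fallback limit can be checked on a toy distribution. The sketch below uses hypothetical helper names and made-up logits; it shows that with `t = 1` the interpolated target collapses to the teacher and with `α = 0` the gated loss collapses to reverse KL against that target. How closely this matches the framework's `generalized_jsd_loss` depends on that function's β setting, which is assumed rather than verified here.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def taid_target(student_probs, teacher_probs, t):
    """Interpolated target: (1 - t) * student + t * teacher."""
    return [(1.0 - t) * q + t * p for q, p in zip(student_probs, teacher_probs)]


def gated_distill_loss(student_probs, target_probs, alpha):
    """(1 - alpha) * reverse KL + alpha * forward KL against the given target."""
    reverse_kl = kl(student_probs, target_probs)
    forward_kl = kl(target_probs, student_probs)
    return (1.0 - alpha) * reverse_kl + alpha * forward_kl


student = softmax([2.0, 0.5, -1.0])
teacher = softmax([0.0, 3.0, 0.5])

# Fallback setting: fixed teacher target (t = 1), gate closed (alpha = 0).
fallback = gated_distill_loss(student, taid_target(student, teacher, t=1.0), alpha=0.0)
# A mid-training setting for comparison (values are illustrative).
stacked = gated_distill_loss(student, taid_target(student, teacher, t=0.5), alpha=0.3)

print(f"fallback (t=1, alpha=0): {fallback:.4f}")
print(f"plain reverse KL:        {kl(student, teacher):.4f}")
print(f"stacked  (t=0.5, a=0.3): {stacked:.4f}")
```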

---

## Sources index

| Paper | arXiv | GitHub | License | Last push | Maturity |
|-------|-------|--------|---------|-----------|----------|
| SimPO | https://arxiv.org/abs/2405.14734 | https://github.com/princeton-nlp/SimPO | MIT | 2024-10-12 | Production |
| TAID | https://arxiv.org/abs/2501.16937 | https://github.com/SakanaAI/TAID | Apache-2.0 | 2025-10-06 | Production |
| Entropy-Aware OPD | n/a (OpenReview WSRQ37tzk1) | code-pending | CC BY 4.0 (paper) | n/a | Paper-only |
| KTO | https://arxiv.org/abs/2402.01306 | huggingface/trl (built-in) | Apache-2.0 | continuous | Production |
| GKD | https://arxiv.org/abs/2306.13649 | (no official repo from authors; reproduced inside SakanaAI/TAID and jongwooko/distillm) | n/a | n/a | Reference only |
| DistiLLM | https://arxiv.org/abs/2402.03898 | https://github.com/jongwooko/distillm | (no LICENSE file at audit time) | 2025-03-13 | Research |
| MiniLLM | https://arxiv.org/abs/2306.08543 | https://github.com/microsoft/LMOps/tree/main/minillm | MIT | 2026-04-08 | Production |
| Self-Rewarding LM | https://arxiv.org/abs/2401.10020 | (no canonical repo; integrated into many forks) | n/a | n/a | Procedure, not a loss |

---

## Notes for ADR-007 author

1. **SimPO and TAID can land independently and without coordination.** They
   touch different files and do not compete.
2. **Entropy-Aware OPD should land last.** Wait for the IBM/KAIST authors'
   code release; if it's not out by the time we want to ship the change, the
   formula is simple enough to re-derive but we should pin a unit test that
   reproduces the paper's Figure 3 entropy-vs-divergence behaviour.
3. **Do not also pull in GKD/DistiLLM/MiniLLM.** Their loss contributions are
   strict subsets of what (TAID + Entropy-Aware OPD + existing
   `generalized_jsd_loss`) covers.
4. **KTO should be added as a backlog item** with a "trigger" condition:
   when the trace-replay reward design moves from preference pairs to per-step
   binary signals, switch on the `trl.KTOTrainer` path.

---

*Absolute path of this report:* `/mnt/e/CS/HF/composer-replication-framework/docs/research/SELF_DISTILLATION_LANDSCAPE.md`