# Self-Distillation Landscape Audit (feeds ADR-007)
**Status:** research note, pre-experimental
**Author:** subagent audit
**Date:** 2026-05-25
**Scope:** identify 2–3 distillation-channel losses worth adding to
`composer_replication` alongside the existing GRPO + SDPO/OPSD `generalized_jsd_loss` +
multi-teacher trace-replay DPO stack.
**Bias:** additivity over novelty. We are looking for losses that COMPOSE with
what is already implemented, not duplicates of it.
---
## TL;DR — recommended additions
| Rank | Method | Loss role | License | LOC est. | Why it composes |
|------|--------|-----------|---------|----------|-----------------|
| 1 | **SimPO** (NeurIPS 2024) | Preference, reference-free | MIT | ~80 | Drop-in for trace-replay DPO; removes ref-model VRAM cost; orthogonal to JSD distillation channel |
| 2 | **TAID** (ICLR 2025) | Interpolated-target wrapper around any KL/JSD | Apache-2.0 | ~150 | Wraps the existing `generalized_jsd_loss` — does not replace it. Closes capacity gap on small students |
| 3 | **Entropy-Aware OPD** (ICLR 2026 Spotlight) | Token-gated forward/reverse KL mixture | CC BY 4.0 (paper); code expected | ~120 | Fixes a documented failure mode of the reverse-KL-style SDPO loss when teacher entropy is high — directly addresses a known weakness of channel 2 |
**Honourable mention:** KTO — useful only if the framework wants to ingest
binary thumbs-up/thumbs-down trace signals without preference pairs.
**Not recommended:** GKD, DistiLLM, MiniLLM, Self-Rewarding LM (rationale at end).
---
## Audit method
For each candidate paper (the seven the user named, plus 2026 follow-ups
discovered via Exa search restricted to `category=research paper, startPublishedDate=2026-01-01`)
we verified:
1. **Primary source exists.** arXiv abstract page reachable; HTML body parsed
to extract the actual loss formula (not summarised from secondary sources).
2. **Code is real.** Official repo's README was fetched, `last push` date and
star count recorded. Forks of MiniLLM/DistiLLM that are no longer maintained
were marked as such.
3. **License is permissive enough.** MIT, Apache-2.0, BSD, CC BY 4.0 are
acceptable for inclusion. GPL or research-only would be flagged.
4. **Composability check.** Read the framework's existing
`composer_replication/__init__.py` and `research/05-trace-replay-distillation.md`,
then asked: *does this loss replace something we have, or stack on top?*
---
## Candidate 1 — SimPO (Simple Preference Optimization) ⭐ RECOMMENDED
### Sources
- **arXiv:** https://arxiv.org/abs/2405.14734 (Meng, Xia, Chen — UVA + Princeton, NeurIPS 2024)
- **GitHub:** https://github.com/princeton-nlp/SimPO
- License: **MIT**
- 949 stars, 74 forks, last commit 2024-10-12 (mature, post-NeurIPS)
- Built on top of `huggingface/alignment-handbook`
- Maturity: **production-ready**. Released checkpoints for Mistral, Llama-3, Gemma-2 base/instruct. Reproducible training configs ship with the repo.
### Loss core (reference-free preference)
SimPO replaces the DPO log-ratio (which requires keeping `π_ref` in memory)
with the **average log-probability** of the sequence under the policy, plus
a **target reward margin** γ:
```
r(x, y) = (β / |y|) · log π_θ(y | x) ← length-normalised implicit reward
(no reference model)
L_SimPO(π_θ) = −E_{(x, y_w, y_l) ~ D} [
log σ( r(x, y_w) − r(x, y_l) − γ )
]
```
where `β` is a temperature (typically 2.0–10) and `γ` is the desired margin
between chosen and rejected (the repo recommends `γ/β ≈ 0.5` as a starting
point). Two consequences: (i) there is no `π_ref` forward pass per step, so the
reference model's weights and activations drop out of the memory budget, and
(ii) the implicit reward is the length-normalised log-likelihood, the same
quantity that guides generation at decode time, which removes a known DPO
pathology where decoding-time and training-time rewards diverge.
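A minimal, framework-free sketch of the loss above for a single preference pair. It uses plain Python floats rather than `torch` tensors so the arithmetic can be checked by hand. The function names echo helpers mentioned elsewhere in this note, but the bodies are illustrative assumptions rather than the vendored implementation, and the default `beta` and `gamma` values are placeholders.

```python
import math


def avg_sequence_logprob(token_logprobs, mask):
    """Mean log-probability over the tokens selected by mask (the response)."""
    kept = [lp for lp, m in zip(token_logprobs, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    return sum(kept) / len(kept)


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def simpo_loss(chosen_avg_logprob, rejected_avg_logprob, beta=2.0, gamma=1.0):
    """-log sigmoid(beta * (avg_lp_w - avg_lp_l) - gamma) for one pair.

    No reference-model term appears anywhere in the computation.
    """
    margin = beta * (chosen_avg_logprob - rejected_avg_logprob) - gamma
    return -log_sigmoid(margin)


# Toy pair: the rejected response carries one padding token (mask = 0).
chosen = avg_sequence_logprob([-0.2, -0.4, -0.3], [1, 1, 1])
rejected = avg_sequence_logprob([-1.0, -1.4, -1.2, 0.0], [1, 1, 1, 0])
loss = simpo_loss(chosen, rejected, beta=2.0, gamma=1.0)
swapped = simpo_loss(rejected, chosen, beta=2.0, gamma=1.0)
print(f"loss={loss:.4f} swapped={swapped:.4f}")
```

Swapping the chosen and rejected responses raises the loss, which is the behaviour the margin term is meant to enforce.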
### Why it composes with the existing stack
- The framework's **channel 3** is multi-teacher trace-replay DPO. SimPO is a
drop-in replacement for the DPO step inside that channel — same `(x, y_w, y_l)`
data contract, different loss head. So the trace-replay harvester does not
change at all.
- It does **not** touch channel 2 (SDPO/OPSD `generalized_jsd_loss`). The two
are complementary: JSD-distillation transfers token-level teacher knowledge,
SimPO sharpens preference structure between trace alternatives.
- It does **not** duplicate GRPO either. GRPO is on-policy RLVR;
SimPO is offline preference optimisation. Different data sources.
- The published Mistral-7B and Llama-3-8B SimPO results beat DPO by 4–6 points
on AlpacaEval-2 LC, which directly translates to "if we already have channel-3
pairs, SimPO is a free upgrade".
### Implementation cost
- **~80 LOC** for the trainer hook; the loss itself is ~15 lines (log-probs,
length-normalise, margin, BCE).
- Dependencies: nothing new — `torch`, `transformers` already in repo.
- The reference implementation is a single file in `princeton-nlp/SimPO`
(`scripts/run_simpo.py` + `alignment/` trainer subclass) under MIT, so we can
vendor it exactly as we did with OPSD.
---
## Candidate 2 — TAID (Temporally Adaptive Interpolated Distillation) ⭐ RECOMMENDED
### Sources
- **arXiv:** https://arxiv.org/abs/2501.16937 (Shing, Misaki, Bao, Yokoi, Akiba — Sakana AI, ICLR 2025)
- **GitHub:** https://github.com/SakanaAI/TAID
- License: **Apache-2.0**
- 121 stars, last push 2025-10-06 (actively maintained)
- Reference implementations of GKD, DistiLLM, Adaptive-KL, CTKD, DKD are also in `src/distil_losses/` for free
- Released artefacts: `TAID-LLM-1.5B`, `TAID-VLM-2B` on HuggingFace (so the loss is verified at non-trivial scale).
- Maturity: **published**. The repo's commits come from a single author, but the method was used to train the two released SoTA compact models.
### Loss core (interpolated teacher target)
Standard distillation losses (forward KL, reverse KL, JSD, including the
`generalized_jsd_loss` we already have) target a **fixed** teacher distribution
`p_T`. TAID replaces this fixed target with a **time-dependent interpolated
target** `p_t` that starts close to the student and moves toward the teacher
as training progresses:
```
p_t(y | x) = (1 − t) · q_θ_stop(y | x) + t · p_T(y | x) (1)
J_TAID(θ; t) = D_KL( p_t ‖ q_θ ) (2)
```
`q_θ_stop` is the student's own current distribution with stop-gradient. The
interpolation coefficient `t ∈ [t_start, 1]` is updated each step by an
**adaptive momentum schedule** that grows `t` faster when training loss is
falling and slower when it stalls — this is the "temporally adaptive" part.
The Sakana paper proves (Theorem 4.1) that, for the regression analogue, this
schedule prevents the mode-collapse failure mode of pure
self-distillation.
Critically, `D_KL(p_t ‖ q_θ)` is just one choice of divergence applied to the shifted target: you
can equally well plug in JSD, reverse KL, or **the generalized_jsd_loss the
framework already exports**. TAID is therefore a *wrapper around an existing
divergence*, not a competing divergence.
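A small sketch of equations (1) and (2) exactly as written in this note, on a three-token toy vocabulary. It is plain Python for checkability; in a real autograd setup the student term inside the target would be detached, which the comments flag. The logits are made up for illustration.

```python
import math


def softmax(logits):
    """Stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def taid_target(student_logits, teacher_logits, t):
    """p_t = (1 - t) * q_student + t * p_teacher, per equation (1).

    The student probabilities are treated as constants here; with autograd
    this is where the stop-gradient (detach) would go.
    """
    q = softmax(student_logits)
    p = softmax(teacher_logits)
    return [(1.0 - t) * qi + t * pi for qi, pi in zip(q, p)]


def kl(p, q):
    """KL(p || q); terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


student_logits = [0.0, 0.0, 0.0]     # untrained student: uniform
teacher_logits = [4.0, 0.0, -2.0]    # confident teacher
q = softmax(student_logits)

# Equation (2) at three points of the schedule.
gaps = {t: kl(taid_target(student_logits, teacher_logits, t), q)
        for t in (0.0, 0.5, 1.0)}
for t, gap in gaps.items():
    print(f"t={t:.1f}  KL(p_t || q)={gap:.4f}")
```

At `t = 0` the target coincides with the student, so the divergence is zero; it grows as `t` moves toward 1, which is the curriculum effect described above.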
### Why it composes with the existing stack
- It **wraps** `composer_replication.opsd.generalized_jsd_loss` rather than
replacing it. The change is "compute the JSD against `p_t` instead of
`p_T`" — a few lines around the existing call site.
- Addresses a documented weakness of OPSD-style self-distillation: when the
teacher's privileged-context distribution is far from the student's
capacity, the JSD signal can be noisy or push the student into mode
averaging. TAID's annealed target gives the student a curriculum.
- Empirical evidence from the Sakana paper's direct comparison: TAID + JSD
beats GKD + JSD, which in turn beats DistiLLM + skew-KL, on Phi-3 → TinyLlama distillation,
with **0.7 h / epoch** vs **9.8 h / epoch** for GKD on identical hardware.
The speed comes from not needing student-generated outputs (SGOs) at every
step the way GKD does.
- Composes additively with channel 1 (GRPO) and channel 3 (trace-replay DPO)
because TAID lives strictly inside channel 2.
### Implementation cost
- **~150 LOC**. The change is:
1. A `TAIDState` object that holds `t`, the EMA of training loss, and the
momentum coefficient β (default 0.99).
2. A function `taid_target(student_logits, teacher_logits, t)` that returns
`(1−t)·softmax(student_logits.detach()) + t·softmax(teacher_logits)`.
3. A scheduler hook that updates `t` after each backward pass per
Algorithm 1 of the paper.
- Dependencies: nothing new.
- Reference implementation in `SakanaAI/TAID/src/distil_losses/taid.py` is
Apache-2.0 — vendor-friendly, same pattern as our OPSD lift.
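The state object and scheduler hook listed above might look roughly like the sketch below. The exact update rule lives in Algorithm 1 of the paper and is not reproduced in this note, so the rule used here (relative loss improvement, smoothed by momentum, squashed through a sigmoid) is an assumption for illustration. `alpha` and the starting value of `t` are hypothetical, and the class tracks momentum of the relative improvement rather than a raw loss EMA.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class TAIDState:
    t: float = 0.4                      # t_start (illustrative value)
    beta: float = 0.99                  # momentum coefficient (default quoted above)
    alpha: float = 5e-4                 # hypothetical step size for t
    momentum: float = 0.0               # smoothed relative loss improvement
    prev_loss: Optional[float] = None   # loss seen at the previous step

    def update(self, loss: float) -> float:
        """Advance t after one optimisation step and return the new value."""
        if self.prev_loss is not None:
            # Positive when the loss fell, negative when it rose.
            rel = (self.prev_loss - loss) / (abs(self.prev_loss) + 1e-12)
            self.momentum = self.beta * self.momentum + (1.0 - self.beta) * rel
            gate = 1.0 / (1.0 + math.exp(-self.momentum))   # in (0, 1)
            # Move t toward 1; the step shrinks as t approaches 1.
            self.t = min(1.0, self.t + self.alpha * gate * (1.0 - self.t))
        self.prev_loss = loss
        return self.t


def run(losses):
    """Feed a loss trace through a fresh state and record t after each step."""
    state = TAIDState()
    return [state.update(loss) for loss in losses]


falling = run([2.0 * 0.9 ** i for i in range(200)])   # steadily improving
stalled = run([2.0] * 200)                            # no progress
print(f"falling: t={falling[-1]:.5f}  stalled: t={stalled[-1]:.5f}")
```

Under this assumed rule `t` never decreases, and it advances faster on the improving trace than on the stalled one, matching the qualitative behaviour described for the adaptive schedule.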
---
## Candidate 3 — Entropy-Aware On-Policy Distillation (Entropy-Aware OPD) ⭐ RECOMMENDED
### Sources
- **OpenReview (ICLR 2026 Spotlight):** https://openreview.net/forum?id=WSRQ37tzk1
- **IBM Research page:** https://research.ibm.com/publications/entropy-aware-on-policy-distillation-of-language-models
- Authors: Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, Kimin Lee (KAIST + IBM Research)
- Status: **ICLR 2026 Spotlight**, submission #113. License on the OpenReview record is **CC BY 4.0**.
- Code: not yet released on GitHub at the time of audit (paper accepted 2026-03-03). IBM authors typically release within the conference window. **Maturity flag: paper-ready, code-pending.** This is the only candidate where we'd need to re-implement from the paper.
### Loss core (entropy-gated forward/reverse KL mixture)
The paper diagnoses a failure mode in the reverse-KL-on-policy distillation
recipe used by MiniLLM, OPSD, and (implicitly) by our SDPO channel: when the
**teacher distribution has high entropy at a given token**, reverse KL's
mode-seeking gradient becomes noisy and collapses the student's diversity.
Their fix: at each token `t`, gate between forward and reverse KL based on
the teacher's entropy:
```
H_t = − Σ_v p_T(v | x, y_<t) · log p_T(v | x, y_<t) (teacher entropy)
α_t = sigmoid( (H_t − τ) / s ) ∈ (0, 1)
L_EA(θ) = E_{y ~ q_θ} Σ_t [
(1 − α_t) · D_KL( q_θ(· | x, y_<t) ‖ p_T(· | x, y_<t) ) ← reverse KL
+ α_t · D_KL( p_T(· | x, y_<t) ‖ q_θ(· | x, y_<t) ) ← forward KL
]
```
`τ` is an entropy threshold (default ≈ 1.0 nat in their experiments) and `s`
is a temperature controlling how sharp the gate is. When the teacher is
confident (`H_t` small → `α_t ≈ 0`) the loss is pure reverse KL, identical to
MiniLLM/OPSD behaviour. When the teacher is uncertain (`H_t` large → `α_t ≈ 1`)
the loss switches to forward KL, which is mode-covering and preserves
student diversity.
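A per-token sketch of the gate and mixture above, written from the formula in this note since no reference code is public. It is plain Python on a four-token toy vocabulary; `tau = 1.0` follows the default quoted above, while `s = 0.1` and the example distributions are assumptions chosen only to make the two regimes visible.

```python
import math


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl(p, q):
    """KL(p || q); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def entropy_gate(h, tau=1.0, s=0.1):
    """alpha = sigmoid((H - tau) / s): near 0 for a confident teacher."""
    return 1.0 / (1.0 + math.exp(-(h - tau) / s))


def entropy_aware_token_loss(student_probs, teacher_probs, tau=1.0, s=0.1):
    """(1 - alpha) * reverse KL + alpha * forward KL for one token position."""
    alpha = entropy_gate(entropy(teacher_probs), tau, s)
    reverse = kl(student_probs, teacher_probs)   # mode-seeking
    forward = kl(teacher_probs, student_probs)   # mode-covering
    return (1.0 - alpha) * reverse + alpha * forward, alpha


student = [0.4, 0.3, 0.2, 0.1]
confident_teacher = [0.97, 0.01, 0.01, 0.01]   # low entropy
uncertain_teacher = [0.25, 0.25, 0.25, 0.25]   # maximum entropy for 4 tokens

loss_conf, alpha_conf = entropy_aware_token_loss(student, confident_teacher)
loss_unc, alpha_unc = entropy_aware_token_loss(student, uncertain_teacher)
print(f"confident: alpha={alpha_conf:.4f} loss={loss_conf:.4f}")
print(f"uncertain: alpha={alpha_unc:.4f} loss={loss_unc:.4f}")
```

With a confident teacher the gate stays near 0 and the loss is essentially reverse KL; with a uniform teacher the gate moves close to 1 and the forward term dominates.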
Reported gains over baseline reverse-KL OPD on Qwen3-0.6B/1.7B/4B: Pass@8 on
six math benchmarks improves by +1.37 / +2.39 / +5.05 respectively. The
larger gains at larger student size suggest the failure mode reverse KL
exhibits gets *worse* with capacity, not better.
### Why it composes with the existing stack
- It is **strictly token-wise**: same trajectory, same teacher logits, same
rollout pipeline as the existing channel 2. The only change is the loss
reduction — instead of computing `generalized_jsd_loss` with a single fixed
β, you compute a per-token mixture of forward and reverse KL with weight
given by teacher entropy.
- This is genuinely orthogonal to OPSD/SDPO. OPSD's contribution is
*privileged-context teacher distribution under student rollouts*. EA-OPD's
contribution is *which divergence to use at each token of that distribution*.
Both can be true simultaneously.
- Directly addresses a failure mode the framework's roadmap will hit:
multi-teacher trace replay (channel 3) produces high-entropy aggregated
teacher distributions at exactly the steps where teachers disagree. Those
are the steps where reverse KL behaves worst. EA-OPD's entropy gate would
automatically soften the loss on those exact tokens.
- Composes with TAID (Candidate 2) too — they operate on different axes:
TAID anneals the *target distribution*, EA-OPD chooses the *divergence
direction*. Stacking is straightforward and proposed as ADR-007 follow-up.
### Implementation cost
- **~120 LOC** estimate (no reference code to vendor yet).
- Dependencies: nothing new. Token-level entropy is `−(p * log p).sum(-1)`,
forward KL is the existing teacher-on-student term, reverse KL is the
student-on-teacher term we already compute for the JSD in OPSD. The work is
re-shaping the existing per-token loss to expose both directions.
- **Risk note:** code not yet public. We should hold this candidate behind a
feature flag until the IBM/KAIST team releases reference code (expected by
ICLR 2026 in May). If the implementation ships sooner we should vendor and
match line-for-line; if not, we re-derive from the paper formula and add a
unit test that reproduces their toy entropy-vs-divergence plot.
---
## Honourable mention — KTO (Kahneman-Tversky Optimization)
- **arXiv:** https://arxiv.org/abs/2402.01306
- **Code:** integrated into HuggingFace `trl` library since v0.8 (Apache-2.0).
- License/maturity: **production**. KTO is a standard `trl` trainer alongside DPO.
### Loss core
KTO replaces preference pairs with **per-output binary desirability** signals.
For a desirable output `y_+` and undesirable output `y_−`:
```
r_θ(x, y) = log( π_θ(y|x) / π_ref(y|x) )                        (implicit reward)
z_0 = E_{x'}[ KL( π_θ(·|x') ‖ π_ref(·|x') ) ]                    (reference point)
L_KTO = E_{x, y_+} [λ_D · (1 − σ( β · (r_θ(x, y_+) − z_0) ))]   (desirable)
      + E_{x, y_−} [λ_U · (1 − σ( β · (z_0 − r_θ(x, y_−)) ))]   (undesirable)
```
with default `λ_D = λ_U = 1`, and `β` a temperature that scales the gap between
the implicit reward and the reference point. The derivation is via prospect theory: this is
a Kahneman-Tversky utility function applied to the implicit reward. KTO
matches DPO at 1B–30B even though it learns from unpaired binary labels
(`n` pairs become `2n` independent signals) rather than from preference pairs.
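A per-example sketch of the KTO loss in plain Python, taking `r` as the plain log-ratio and letting `β` scale the gap `r − z_0`. In practice `z_0` is a batch-level estimate; here it is passed in as a constant, which is a simplifying assumption, and the default `beta = 0.1` is a placeholder. For real training the `trl` trainer mentioned above is the path to use.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def kto_example_loss(policy_logprob, ref_logprob, desirable, z0,
                     beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Loss for one (x, y) with a binary desirability label.

    r is the implicit reward log(pi_theta / pi_ref); z0 is the reference
    point, treated as a constant in this sketch.
    """
    r = policy_logprob - ref_logprob
    if desirable:
        return lambda_d * (1.0 - sigmoid(beta * (r - z0)))
    return lambda_u * (1.0 - sigmoid(beta * (z0 - r)))


z0 = 0.5
# Same trace step (r = 4.0), scored once as a success and once as a failure.
good_step = kto_example_loss(-10.0, -14.0, desirable=True, z0=z0)
bad_step = kto_example_loss(-10.0, -14.0, desirable=False, z0=z0)
# A step whose implicit reward sits exactly at the reference point.
neutral = kto_example_loss(-10.0, -10.5, desirable=True, z0=z0)
print(f"good={good_step:.4f} bad={bad_step:.4f} neutral={neutral:.4f}")
```

No pairing is needed: each labelled step contributes its own term, which is what makes the loss a fit for single-sided trace signals.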
### Why we down-rank it relative to the top-3
KTO is the right answer **only if** the framework wants to ingest single-side
trace signals (e.g., "this trace step succeeded" / "this step crashed the
agent") without constructing pairs. The current
`research/05-trace-replay-distillation.md` design **does** construct pairs
from multi-teacher replay (that is the whole point of the multi-teacher
variance signal), so the marginal value of KTO is small *for channel 3 as
specified*. If the trace-replay design pivots toward absolute scores per
step rather than relative pairs, KTO becomes the right loss and is already
free from `trl`. Add to the backlog as conditional.
---
## Audited but NOT recommended
### GKD — Generalized Knowledge Distillation (Agarwal et al., 2023)
- **arXiv:** https://arxiv.org/abs/2306.13649 (Google DeepMind)
- **Loss core:** student samples its own outputs, teacher provides token
probabilities, divergence is generalized JSD with parameter β:
```
D_JSD(β)(P‖Q) = β·KL(P ‖ βP+(1−β)Q) + (1−β)·KL(Q ‖ βP+(1−β)Q)
```
- **Why excluded:** **this is exactly the formula we already have** as
`composer_replication.opsd.generalized_jsd_loss` (lifted from
`siyan-zhao/OPSD`). GKD's contribution beyond the loss formula is the
on-policy student sampling protocol — which OPSD also does. No incremental
value to add.
### DistiLLM (Ko et al., ICML 2024)
- **arXiv:** https://arxiv.org/abs/2402.03898
- **GitHub:** https://github.com/jongwooko/distillm — MIT, last push 2025-03
- **Loss core:** *Skew KL divergence* `KL(p ‖ λp + (1−λ)q)` plus an *adaptive
off-policy* student-generated-output (SGO) scheduler.
- **Why excluded:** the skew-KL is, up to a constant factor, one of the two terms of
the generalized JSD (with `λ` playing the role of the mixture coefficient `β`), so it
belongs to the same family the framework already
has. The interesting contribution, the SGO scheduler, is a process
optimisation, not a loss. The TAID paper's own ablation (Table 6) shows
TAID > Skew KL across student sizes, so TAID dominates this candidate.
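To make the family relationship concrete, the sketch below implements the generalized JSD formula from the GKD entry and the skew KL from the DistiLLM entry on toy distributions, then checks that the former is a weighted sum of two skew-KL terms. It is plain Python written from the formulas in this note, not the framework's `generalized_jsd_loss`; the distributions and `beta` are arbitrary.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mix(p, q, w):
    """Pointwise mixture w * p + (1 - w) * q."""
    return [w * pi + (1.0 - w) * qi for pi, qi in zip(p, q)]


def skew_kl(p, q, lam):
    """Skew KL: KL(p || lam * p + (1 - lam) * q)."""
    return kl(p, mix(p, q, lam))


def generalized_jsd(p, q, beta=0.5):
    """beta * KL(P || M) + (1 - beta) * KL(Q || M), M = beta * P + (1 - beta) * Q."""
    m = mix(p, q, beta)
    return beta * kl(p, m) + (1.0 - beta) * kl(q, m)


teacher = [0.7, 0.2, 0.1]
student = [0.2, 0.3, 0.5]
beta = 0.3

jsd = generalized_jsd(teacher, student, beta)
# Both JSD terms are skew KLs measured against the same mixture M.
from_skew = (beta * skew_kl(teacher, student, beta)
             + (1.0 - beta) * skew_kl(student, teacher, 1.0 - beta))
print(f"generalized JSD={jsd:.6f}  rebuilt from skew KL={from_skew:.6f}")
```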
### MiniLLM (Gu et al., ICLR 2024)
- **arXiv:** https://arxiv.org/abs/2306.08543
- **GitHub:** https://github.com/microsoft/LMOps/tree/main/minillm — MIT, repo
active (last push 2026-04)
- **Loss core:** reverse KL minimised by policy-gradient on student rollouts,
with three optimisation tricks: single-step decomposition (variance
reduction), teacher-mixed sampling (anti-reward-hacking), length
normalisation.
- **Why excluded:** reverse-KL on-policy distillation **is the same recipe
family as SDPO/OPSD** the framework already implements. Adding MiniLLM
would be a parallel implementation of the same idea, not an addition.
Entropy-Aware OPD (Candidate 3) is a *targeted improvement* over MiniLLM's
pure reverse-KL on exactly the failure mode the Entropy-Aware OPD paper
diagnoses in that recipe (diversity collapse where the teacher is high-entropy).
### Self-Rewarding Language Models (Yuan et al., 2024)
- **arXiv:** https://arxiv.org/abs/2401.10020 (Meta + NYU)
- **Why excluded:** SRLM is a *training procedure* (iterative DPO with the
model judging its own outputs), not a loss. The actual loss is plain DPO,
which the framework already supports. The procedural contribution belongs
in a future ADR on data generation, not in the distillation channel.
### Existence check for TAID (arXiv 2501.16937)
The user asked us to verify existence. **It exists.** Submitted 2025-01-28,
ICLR 2025, code at https://github.com/SakanaAI/TAID with two released
checkpoints (`TAID-LLM-1.5B`, `TAID-VLM-2B`). Confirmed primary source.
---
## 2026 papers found
The targeted Exa search (`category=research paper`, `startPublishedDate=2026-01-01`)
surfaced five 2026 distillation papers worth listing for completeness:
1. **Entropy-Aware On-Policy Distillation** — ICLR 2026 Spotlight. ⭐ Promoted to top-3 above.
2. **KL for a KL: On-Policy Distillation with Control Variate Baseline** (arXiv 2605.07865, Oh et al., 2026-05). Variance-reduction trick for on-policy KL distillation. Useful future read but not a new loss — it's a baseline subtraction added to MiniLLM-style policy gradient.
3. **Rethinking On-Policy Distillation: Phenomenology, Mechanism, and Recipe** (https://github.com/thunlp/OPD, Tsinghua NLP, last push 2026-04). Empirical study, not a new loss formulation.
4. **Hybrid Policy Distillation for LLMs** (ICML 2026 poster, Zhu et al.). Combines off-policy and on-policy distillation; positioned as a recipe rather than a new loss; abstract suggests strong overlap with TAID's annealing argument.
5. **Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation** (ICML 2026 poster, Dasgupta et al.). Targets the long-tail of teacher distributions. Interesting but currently only an abstract; deferred until the camera-ready PDF is available.
Of these, only Entropy-Aware OPD is strong enough to recommend adding right now, and it
is still code-pending (see Candidate 3). The rest lack released code, a clear
license, or results at reproducible scale.
---
## Recommended follow-up wiring
For ADR-007 the proposed addition is a `composer_replication.distillation`
sub-package with three pluggable hooks:
> **Realised in v0.1 (Wave 17 update):** ADR-007 shipped a flatter
> layout than the proposal below. Actual exports:
>
> ```
> composer_replication/
> distillation/
> __init__.py
> simpo.py # simpo_loss(chosen_avg_logprobs, rejected_avg_logprobs, *, beta, gamma)
> # avg_sequence_logprob(logprobs, mask) -- helper
> taid.py # taid_loss(student_logits, teacher_logits, t, *, ...)
> # TAIDScheduler -- adaptive momentum schedule per the paper
> entropy_aware_opd.py # entropy_aware_opd_loss(student_logits, teacher_logits, *, h_max, ...)
> ```
>
> No `targets.py`/`losses.py` split, no top-level `preference/` package,
> and SimPO lives under `distillation/` rather than `preference/` because
> the three losses share a common dispatch surface (`compose_loss`'s
> `dpo_variant` and `sdpo_wrapper` switches).
>
> The composition rule realised in `compose_loss` is per-loss flag-driven,
> not a single composed-function call:
>
> ```python
> compose_loss(model, inputs,
> dpo_variant="simpo", # OR "dpo" (default)
> sdpo_wrapper="taid", # OR "entropy_opd" OR "none" (default)
> taid_t=0.5, # required when sdpo_wrapper="taid"
> simpo_beta=2.0, simpo_gamma=1.0, # used only when dpo_variant="simpo"
> entropy_opd_h_max=..., # used only when sdpo_wrapper="entropy_opd"
> )
> ```
>
> The pre-ADR proposal sketch below is preserved as historical context.
> The shipped function names are `simpo_loss`, `taid_loss` +
> `TAIDScheduler`, and `entropy_aware_opd_loss` (not `taid_target` /
> `entropy_aware_kl_loss`).
```
composer_replication/
distillation/
__init__.py
targets.py # taid_target(...), fixed_target(...) ← Candidate 2
losses.py # reuses opsd.generalized_jsd_loss
# adds entropy_aware_kl_loss(...) ← Candidate 3
preference/
simpo.py # simpo_loss(...) ← Candidate 1
dpo.py # existing trace-replay path
```
The composition rule for the total loss becomes:
```
L_total = λ_grpo · L_GRPO (channel 1, unchanged)
+ λ_distill · L_distill (channel 2, see below)
+ λ_pref · L_pref (channel 3, choose DPO or SimPO)
L_distill = entropy_aware_kl_loss(
target = taid_target(student, teacher, t),
student = student,
teacher_entropy_gate = α_t
)
```
This keeps the existing reverse-KL-style SDPO/OPSD behaviour reachable as a fallback:
set `α_t ≡ 0` and `t ≡ 1` and the loss reduces to pure reverse KL against the fixed teacher.
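The composition rule can be sanity-checked on toy distributions. The sketch below is one possible reading of the pseudocode (the TAID target feeding an entropy-gated KL mixture), in plain Python; `composed_distill_loss`, `total_loss`, and the unit default weights are hypothetical names and values used only for this illustration.

```python
import math


def kl(p, q):
    """KL(p || q); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def composed_distill_loss(student_probs, teacher_probs, t, alpha):
    """Entropy-gated KL mixture measured against the TAID target p_t."""
    # TAID step: the student term is a constant here (stop-gradient in autograd).
    target = [(1.0 - t) * q + t * p
              for q, p in zip(student_probs, teacher_probs)]
    reverse = kl(student_probs, target)
    forward = kl(target, student_probs)
    return (1.0 - alpha) * reverse + alpha * forward


def total_loss(l_grpo, l_distill, l_pref,
               lam_grpo=1.0, lam_distill=1.0, lam_pref=1.0):
    """Weighted sum over the three channels."""
    return lam_grpo * l_grpo + lam_distill * l_distill + lam_pref * l_pref


student = [0.5, 0.3, 0.2]
teacher = [0.8, 0.15, 0.05]

# Fallback setting: fixed teacher target (t = 1), gate closed (alpha = 0).
fallback = composed_distill_loss(student, teacher, t=1.0, alpha=0.0)
# Start of the TAID schedule: the target equals the student, so nothing to learn.
start = composed_distill_loss(student, teacher, t=0.0, alpha=0.3)
print(f"fallback={fallback:.6f}  reverse KL={kl(student, teacher):.6f}  start={start:.6f}")
```

In the fallback setting the composed loss equals the plain reverse KL between student and teacher, so switching the new pieces off leaves a well-defined baseline to compare against.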
---
## Sources index
| Paper | arXiv | GitHub | License | Last push | Maturity |
|-------|-------|--------|---------|-----------|----------|
| SimPO | https://arxiv.org/abs/2405.14734 | https://github.com/princeton-nlp/SimPO | MIT | 2024-10-12 | Production |
| TAID | https://arxiv.org/abs/2501.16937 | https://github.com/SakanaAI/TAID | Apache-2.0 | 2025-10-06 | Production |
| Entropy-Aware OPD | n/a (OpenReview WSRQ37tzk1) | code-pending | CC BY 4.0 (paper) | n/a | Paper-only |
| KTO | https://arxiv.org/abs/2402.01306 | huggingface/trl (built-in) | Apache-2.0 | continuous | Production |
| GKD | https://arxiv.org/abs/2306.13649 | (no official repo from authors; reproduced inside SakanaAI/TAID and jongwooko/distillm) | n/a | n/a | Reference only |
| DistiLLM | https://arxiv.org/abs/2402.03898 | https://github.com/jongwooko/distillm | (no LICENSE file at audit time) | 2025-03-13 | Research |
| MiniLLM | https://arxiv.org/abs/2306.08543 | https://github.com/microsoft/LMOps/tree/main/minillm | MIT | 2026-04-08 | Production |
| Self-Rewarding LM | https://arxiv.org/abs/2401.10020 | (no canonical repo; integrated into many forks) | n/a | n/a | Procedure, not a loss |
---
## Notes for ADR-007 author
1. **SimPO and TAID can land independently and without coordination.** They
touch different files and do not compete.
2. **Entropy-Aware OPD should land last.** Wait for the IBM/KAIST authors'
code release; if it's not out by the time we want to ship the change, the
formula is simple enough to re-derive but we should pin a unit test that
reproduces the paper's Figure 3 entropy-vs-divergence behaviour.
3. **Do not also pull in GKD/DistiLLM/MiniLLM.** Their loss contributions are
strict subsets of what (TAID + Entropy-Aware OPD + existing
`generalized_jsd_loss`) covers.
4. **KTO should be added as a backlog item** with a "trigger" condition:
when the trace-replay reward design moves from preference pairs to per-step
binary signals, switch on the `trl.KTOTrainer` path.
---
*Absolute path of this report:* `/mnt/e/CS/HF/composer-replication-framework/docs/research/SELF_DISTILLATION_LANDSCAPE.md`