| # Self-Distillation Landscape Audit (feeds ADR-007) | |
| **Status:** research note, pre-experimental | |
| **Author:** subagent audit | |
| **Date:** 2026-05-25 | |
| **Scope:** identify 2–3 distillation-channel losses worth adding to | |
| `composer_replication` alongside the existing GRPO + SDPO/OPSD `generalized_jsd_loss` + | |
| multi-teacher trace-replay DPO stack. | |
| **Bias:** additivity over novelty. We are looking for losses that COMPOSE with | |
| what is already implemented, not duplicates of it. | |
| --- | |
| ## TL;DR — recommended additions | |
| | Rank | Method | Loss role | License | LOC est. | Why it composes | | |
| |------|--------|-----------|---------|----------|-----------------| | |
| | 1 | **SimPO** (NeurIPS 2024) | Preference, reference-free | MIT | ~80 | Drop-in for trace-replay DPO; removes ref-model VRAM cost; orthogonal to JSD distillation channel | | |
| | 2 | **TAID** (ICLR 2025) | Interpolated-target wrapper around any KL/JSD | Apache-2.0 | ~150 | Wraps the existing `generalized_jsd_loss` — does not replace it. Closes capacity gap on small students | | |
| | 3 | **Entropy-Aware OPD** (ICLR 2026 Spotlight) | Token-gated forward/reverse KL mixture | CC BY 4.0 (paper); code expected | ~120 | Fixes a documented failure mode of the reverse-KL-style SDPO loss when teacher entropy is high — directly addresses a known weakness of channel 2 | | |
| **Honourable mention:** KTO — useful only if the framework wants to ingest | |
| binary thumbs-up/thumbs-down trace signals without preference pairs. | |
| **Not recommended:** GKD, DistiLLM, MiniLLM, Self-Rewarding LM (rationale at end). | |
| --- | |
| ## Audit method | |
| For each candidate paper (the seven the user named, plus 2026 follow-ups | |
| discovered via Exa search restricted to `category=research paper, startPublishedDate=2026-01-01`) | |
| we verified: | |
| 1. **Primary source exists.** arXiv abstract page reachable; HTML body parsed | |
| to extract the actual loss formula (not summarised from secondary sources). | |
| 2. **Code is real.** Official repo's README was fetched, `last push` date and | |
| star count recorded. Forks of MiniLLM/DistiLLM that are no longer maintained | |
| were marked as such. | |
| 3. **License is permissive enough.** MIT, Apache-2.0, BSD, CC BY 4.0 are | |
| acceptable for inclusion. GPL or research-only would be flagged. | |
| 4. **Composability check.** Read the framework's existing | |
| `composer_replication/__init__.py` and `research/05-trace-replay-distillation.md`, | |
| then asked: *does this loss replace something we have, or stack on top?* | |
| --- | |
| ## Candidate 1 — SimPO (Simple Preference Optimization) ⭐ RECOMMENDED | |
| ### Sources | |
| - **arXiv:** https://arxiv.org/abs/2405.14734 (Meng, Xia, Chen — UVA + Princeton, NeurIPS 2024) | |
| - **GitHub:** https://github.com/princeton-nlp/SimPO | |
| - License: **MIT** | |
| - 949 stars, 74 forks, last commit 2024-10-12 (mature, post-acceptance) | |
| - Built on top of `huggingface/alignment-handbook` | |
| - Maturity: **production-ready**. Released checkpoints for Mistral, Llama-3, Gemma-2 base/instruct. Reproducible training configs ship with the repo. | |
| ### Loss core (reference-free preference) | |
| SimPO replaces the DPO log-ratio (which requires keeping `π_ref` in memory) | |
| with the **average log-probability** of the sequence under the policy, plus | |
| a **target reward margin** γ: | |
| ``` | |
| r(x, y) = (β / |y|) · log π_θ(y | x) ← length-normalised implicit reward | |
| (no reference model) | |
| L_SimPO(π_θ) = −E_{(x, y_w, y_l) ~ D} [ | |
| log σ( r(x, y_w) − r(x, y_l) − γ ) | |
| ] | |
| ``` | |
| where `β` is a temperature (typically 2.0–10) and `γ` is the desired margin | |
| between chosen and rejected (the repo recommends `γ/β ≈ 0.5` as a starting | |
| point). Two consequences: (i) no `π_ref` forward pass per step, so roughly | |
| half the model memory, and (ii) the implicit reward is the same | |
| length-normalised log-likelihood that guides decoding, which removes a known | |
| DPO pathology where the training-time reward and the decoding-time | |
| likelihood can rank responses differently. | |
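The formula above can be checked on toy numbers. The following is a minimal, framework-free sketch of the per-pair loss, assuming per-token log-probabilities are already available as plain lists; the names `mean_logprob` and `simpo_pair_loss` are hypothetical and are not the framework's or the SimPO repo's API.

```python
import math


def mean_logprob(token_logprobs):
    """Length-normalised sequence log-probability: (1/|y|) * sum_t log pi(y_t | x, y_<t)."""
    return sum(token_logprobs) / len(token_logprobs)


def simpo_pair_loss(chosen_token_logprobs, rejected_token_logprobs, beta=2.0, gamma=1.0):
    """Loss for one (y_w, y_l) pair: -log sigmoid(r_w - r_l - gamma), with r = beta * mean log-prob."""
    r_w = beta * mean_logprob(chosen_token_logprobs)
    r_l = beta * mean_logprob(rejected_token_logprobs)
    margin = r_w - r_l - gamma
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Illustrative values only: the chosen response has the higher average log-prob.
example_loss = simpo_pair_loss([-0.5, -0.5], [-1.0, -1.0, -1.0])
```

Because the reward is a mean rather than a sum, a longer rejected response is not penalised merely for having more tokens, and no reference-model log-probabilities appear anywhere in the computation.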
| ### Why it composes with the existing stack | |
| - The framework's **channel 3** is multi-teacher trace-replay DPO. SimPO is a | |
| drop-in replacement for the DPO step inside that channel — same `(x, y_w, y_l)` | |
| data contract, different loss head. So the trace-replay harvester does not | |
| change at all. | |
| - It does **not** touch channel 2 (SDPO/OPSD `generalized_jsd_loss`). The two | |
| are complementary: JSD-distillation transfers token-level teacher knowledge, | |
| SimPO sharpens preference structure between trace alternatives. | |
| - It does **not** duplicate GRPO either. GRPO is on-policy RLVR; | |
| SimPO is offline preference optimisation. Different data sources. | |
| - The published Mistral-7B and Llama-3-8B SimPO results beat DPO by 4–6 points | |
| on AlpacaEval-2 LC, which directly translates to "if we already have channel-3 | |
| pairs, SimPO is a free upgrade". | |
| ### Implementation cost | |
| - **~80 LOC** for the trainer hook; the loss itself is ~15 lines (log-probs, | |
| length-normalise, margin, BCE). | |
| - Dependencies: nothing new — `torch`, `transformers` already in repo. | |
| - The reference implementation in `princeton-nlp/SimPO` is small | |
| (`scripts/run_simpo.py` + `alignment/` trainer subclass) and MIT-licensed, so we can | |
| vendor it exactly as we did with OPSD. | |
| --- | |
| ## Candidate 2 — TAID (Temporally Adaptive Interpolated Distillation) ⭐ RECOMMENDED | |
| ### Sources | |
| - **arXiv:** https://arxiv.org/abs/2501.16937 (Shing, Misaki, Bao, Yokoi, Akiba — Sakana AI, ICLR 2025) | |
| - **GitHub:** https://github.com/SakanaAI/TAID | |
| - License: **Apache-2.0** | |
| - 121 stars, last push 2025-10-06 (actively maintained) | |
| - Reference implementations of GKD, DistiLLM, Adaptive-KL, CTKD, DKD are also in `src/distil_losses/` for free | |
| - Released artefacts: `TAID-LLM-1.5B`, `TAID-VLM-2B` on HuggingFace (so the loss is verified at non-trivial scale). | |
| - Maturity: **published**. The repo history shows single-author commits, but the loss was used to train two SoTA compact models reproducibly. | |
| ### Loss core (interpolated teacher target) | |
| Standard distillation losses (forward KL, reverse KL, JSD, including the | |
| `generalized_jsd_loss` we already have) target a **fixed** teacher distribution | |
| `p_T`. TAID replaces this fixed target with a **time-dependent interpolated | |
| target** `p_t` that starts close to the student and moves toward the teacher | |
| as training progresses: | |
| ``` | |
| p_t(y | x) = (1 − t) · q_θ_stop(y | x) + t · p_T(y | x) (1) | |
| J_TAID(θ; t) = D_KL( p_t ‖ q_θ ) (2) | |
| ``` | |
| `q_θ_stop` is the student's own current distribution with stop-gradient. The | |
| interpolation coefficient `t ∈ [t_start, 1]` is updated each step by an | |
| **adaptive momentum schedule** that grows `t` faster when training loss is | |
| falling and slower when it stalls — this is the "temporally adaptive" part. | |
| The Sakana paper proves (Theorem 4.1) that, for the regression analogue, this | |
| schedule prevents the mode-collapse failure mode of pure | |
| self-distillation. | |
| Critically, `D_KL(p_t ‖ q_θ)` is just one choice of divergence applied to the shifted target: you | |
| can equally well plug in JSD, reverse KL, or **the generalized_jsd_loss the | |
| framework already exports**. TAID is therefore a *wrapper around an existing | |
| divergence*, not a competing divergence. | |
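To make equations (1) and (2) concrete, here is a small sketch on a three-token vocabulary. It follows the probability-space mixture exactly as written in this note; the reference implementation may interpolate differently (for example in logit space), so treat this as an illustration of the idea rather than a port. The function names are hypothetical.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def interpolated_target(student_probs, teacher_probs, t):
    """p_t = (1 - t) * q_stop + t * p_T. The student term is a constant (stop-gradient)."""
    return [(1.0 - t) * q + t * p for q, p in zip(student_probs, teacher_probs)]


def taid_objective(student_probs, teacher_probs, t):
    """J_TAID(theta; t) = KL(p_t || q_theta), evaluated for one token position."""
    p_t = interpolated_target(student_probs, teacher_probs, t)
    return kl(p_t, student_probs)


student = [0.7, 0.2, 0.1]
teacher = [0.2, 0.3, 0.5]
# The target moves from the student (t = 0) to the teacher (t = 1).
curve = [taid_objective(student, teacher, t) for t in (0.0, 0.25, 0.75, 1.0)]
```

At `t = 0` the target equals the student and the objective vanishes; at `t = 1` it is the ordinary forward KL against the teacher. Swapping `kl` for another divergence leaves `interpolated_target` untouched, which is the sense in which TAID acts as a wrapper.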
| ### Why it composes with the existing stack | |
| - It **wraps** `composer_replication.opsd.generalized_jsd_loss` rather than | |
| replacing it. The change is "compute the JSD against `p_t` instead of | |
| `p_T`" — a few lines around the existing call site. | |
| - Addresses a documented weakness of OPSD-style self-distillation: when the | |
| teacher's privileged-context distribution is far from the student's | |
| capacity, the JSD signal can be noisy or push the student into mode | |
| averaging. TAID's annealed target gives the student a curriculum. | |
| - Empirical evidence from the Sakana paper's own comparison: TAID + JSD | |
| beats GKD + JSD, which in turn beats DistiLLM + skew-KL, on Phi-3 → TinyLlama distillation, | |
| with **0.7 h / epoch** vs **9.8 h / epoch** for GKD on identical hardware. | |
| The speed comes from not needing student-generated outputs (SGOs) at every | |
| step the way GKD does. | |
| - Composes additively with channel 1 (GRPO) and channel 3 (trace-replay DPO) | |
| because TAID lives strictly inside channel 2. | |
| ### Implementation cost | |
| - **~150 LOC**. The change is: | |
| 1. A `TAIDState` object that holds `t`, the EMA of training loss, and the | |
| momentum coefficient β (default 0.99). | |
| 2. A function `taid_target(student_logits, teacher_logits, t)` that returns | |
| `(1−t)·softmax(student_logits.detach()) + t·softmax(teacher_logits)`. | |
| 3. A scheduler hook that updates `t` after each backward pass per | |
| Algorithm 1 of the paper. | |
| - Dependencies: nothing new. | |
| - Reference implementation in `SakanaAI/TAID/src/distil_losses/taid.py` is | |
| Apache-2.0 — vendor-friendly, same pattern as our OPSD lift. | |
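The scheduler piece of the list above could look roughly like the toy below. It is not a transcription of Algorithm 1: the update rule, the step size `alpha`, and the starting value of `t` are assumptions chosen only to reproduce the qualitative behaviour described in this note (`t` grows faster while the loss is falling, slower when it stalls, and never exceeds 1). `ToyTAIDState` is a hypothetical name.

```python
import math
from dataclasses import dataclass
from typing import Optional


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


@dataclass
class ToyTAIDState:
    t: float = 0.4                     # assumed t_start
    beta: float = 0.99                 # momentum coefficient (default quoted above)
    alpha: float = 5e-4                # assumed step size
    momentum: float = 0.0              # EMA of the relative loss improvement
    prev_loss: Optional[float] = None

    def update(self, loss: float) -> float:
        """Advance t after one optimisation step and return the new value."""
        if self.prev_loss is not None:
            rel_improvement = (self.prev_loss - loss) / (self.prev_loss + 1e-15)
            self.momentum = self.beta * self.momentum + (1.0 - self.beta) * rel_improvement
            # Larger momentum (loss falling) gives a larger step; (1 - t) keeps t <= 1.
            self.t = min(1.0, self.t + self.alpha * sigmoid(self.momentum) * (1.0 - self.t))
        self.prev_loss = loss
        return self.t


falling, stalled = ToyTAIDState(), ToyTAIDState()
for step in range(100):
    falling.update(0.95 ** step)   # loss drops 5% per step
    stalled.update(1.0)            # loss does not move
```

After the loop, `falling.t` is ahead of `stalled.t`, which is the behaviour the real scheduler hook would need a unit test for once it is wired to the training loop.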
| --- | |
| ## Candidate 3 — Entropy-Aware On-Policy Distillation (Entropy-Aware OPD) ⭐ RECOMMENDED | |
| ### Sources | |
| - **OpenReview (ICLR 2026 Spotlight):** https://openreview.net/forum?id=WSRQ37tzk1 | |
| - **IBM Research page:** https://research.ibm.com/publications/entropy-aware-on-policy-distillation-of-language-models | |
| - Authors: Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, Kimin Lee (KAIST + IBM Research) | |
| - Status: **ICLR 2026 Spotlight**, submission #113. License on the OpenReview record is **CC BY 4.0**. | |
| - Code: not yet released on GitHub at the time of audit (paper accepted 2026-03-03). IBM authors typically release within the conference window. **Maturity flag: paper-ready, code-pending.** This is the only candidate where we'd need to re-implement from the paper. | |
| ### Loss core (entropy-gated forward/reverse KL mixture) | |
| The paper diagnoses a failure mode in the reverse-KL-on-policy distillation | |
| recipe used by MiniLLM, OPSD, and (implicitly) by our SDPO channel: when the | |
| **teacher distribution has high entropy at a given token**, reverse KL's | |
| mode-seeking gradient becomes noisy and collapses the student's diversity. | |
| Their fix: at each token `t`, gate between forward and reverse KL based on | |
| the teacher's entropy: | |
| ``` | |
| H_t = − Σ_v p_T(v | x, y_<t) · log p_T(v | x, y_<t) (teacher entropy) | |
| α_t = sigmoid( (H_t − τ) / s ) ∈ (0, 1) | |
| L_EA(θ) = E_{y ~ q_θ} Σ_t [ | |
| (1 − α_t) · D_KL( q_θ(· | x, y_<t) ‖ p_T(· | x, y_<t) ) ← reverse KL | |
| + α_t · D_KL( p_T(· | x, y_<t) ‖ q_θ(· | x, y_<t) ) ← forward KL | |
| ] | |
| ``` | |
| `τ` is an entropy threshold (default ≈ 1.0 nat in their experiments) and `s` | |
| is a temperature controlling how sharp the gate is. When the teacher is | |
| confident (`H_t` small → `α_t ≈ 0`) the loss is pure reverse KL, identical to | |
| MiniLLM/OPSD behaviour. When the teacher is uncertain (`H_t` large → `α_t ≈ 1`) | |
| the loss switches to forward KL, which is mode-covering and preserves | |
| student diversity. | |
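A per-token sketch of the gate and mixture above, using plain lists for the distributions. `τ = 1.0` follows the default quoted in this note; the gate sharpness `s = 0.1` is an assumption, since no default for it is given here. Function names are hypothetical and the paper's reference code (not yet released) may differ.

```python
import math


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def kl(p, q):
    """KL(p || q) for discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def sigmoid(x):
    # Numerically stable for large negative x.
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def entropy_aware_token_loss(student_probs, teacher_probs, tau=1.0, s=0.1):
    """Return (loss, alpha) for one token: (1 - alpha) * reverse KL + alpha * forward KL."""
    alpha = sigmoid((entropy(teacher_probs) - tau) / s)
    reverse_kl = kl(student_probs, teacher_probs)   # mode-seeking
    forward_kl = kl(teacher_probs, student_probs)   # mode-covering
    return (1.0 - alpha) * reverse_kl + alpha * forward_kl, alpha


# Confident teacher: the gate stays near 0, so the loss is almost pure reverse KL.
_, alpha_confident = entropy_aware_token_loss([0.7, 0.1, 0.1, 0.1], [0.97, 0.01, 0.01, 0.01])
# Uniform teacher over 8 tokens: the gate is near 1, so the loss is almost pure forward KL.
_, alpha_uncertain = entropy_aware_token_loss([0.3] + [0.1] * 7, [0.125] * 8)
```

The gate depends only on the teacher distribution, so it can be computed once per token from logits the pipeline already has.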
| Reported gains over baseline reverse-KL OPD on Qwen3-0.6B/1.7B/4B: Pass@8 on | |
| six math benchmarks improves by +1.37 / +2.39 / +5.05 respectively. The | |
| larger gains at larger student size suggest the failure mode reverse KL | |
| exhibits gets *worse* with capacity, not better. | |
| ### Why it composes with the existing stack | |
| - It is **strictly token-wise**: same trajectory, same teacher logits, same | |
| rollout pipeline as the existing channel 2. The only change is the loss | |
| reduction — instead of computing `generalized_jsd_loss` with a single fixed | |
| β, you compute a per-token mixture of forward and reverse KL with weight | |
| given by teacher entropy. | |
| - This is genuinely orthogonal to OPSD/SDPO. OPSD's contribution is | |
| *privileged-context teacher distribution under student rollouts*. EA-OPD's | |
| contribution is *which divergence to use at each token of that distribution*. | |
| The two changes can be applied together. | |
| - Directly addresses a failure mode the framework's roadmap will hit: | |
| multi-teacher trace replay (channel 3) produces high-entropy aggregated | |
| teacher distributions at exactly the steps where teachers disagree. Those | |
| are the steps where reverse KL behaves worst. EA-OPD's entropy gate would | |
| automatically soften the loss on those exact tokens. | |
| - Composes with TAID (Candidate 2) too — they operate on different axes: | |
| TAID anneals the *target distribution*, EA-OPD chooses the *divergence | |
| direction*. Stacking is straightforward and proposed as ADR-007 follow-up. | |
| ### Implementation cost | |
| - **~120 LOC** estimate (no reference code to vendor yet). | |
| - Dependencies: nothing new. Token-level entropy is `−(p * log p).sum(-1)`, | |
| forward KL is the existing teacher-on-student term, reverse KL is the | |
| student-on-teacher term we already compute for the JSD in OPSD. The work is | |
| re-shaping the existing per-token loss to expose both directions. | |
| - **Risk note:** code not yet public. We should hold this candidate behind a | |
| feature flag until the IBM/KAIST team releases reference code (expected by | |
| ICLR 2026 in May). If the implementation ships sooner we should vendor and | |
| match line-for-line; if not, we re-derive from the paper formula and add a | |
| unit test that reproduces their toy entropy-vs-divergence plot. | |
| --- | |
| ## Honourable mention — KTO (Kahneman-Tversky Optimization) | |
| - **arXiv:** https://arxiv.org/abs/2402.01306 | |
| - **Code:** integrated into HuggingFace `trl` library since v0.8 (Apache-2.0). | |
| - License/maturity: **production**. KTO is a standard `trl` trainer alongside DPO. | |
| ### Loss core | |
| KTO replaces preference pairs with **per-output binary desirability** signals. | |
| For a desirable output `y_+` and undesirable output `y_−`: | |
| ``` | |
| r_θ(x, y) = log( π_θ(y|x) / π_ref(y|x) ) | |
| z_0 = E_{x', y' ~ π_θ}[ KL( π_θ(·|x') ‖ π_ref(·|x') ) ] (reference point) | |
| L_KTO = E_{x, y_+} [λ_D · (1 − σ(β · (r_θ(x, y_+) − z_0)))] (desirable) | |
| + E_{x, y_−} [λ_U · (1 − σ(β · (z_0 − r_θ(x, y_−))))] (undesirable) | |
| ``` | |
| with default `λ_D = λ_U = 1`. The derivation is via prospect theory: this is | |
| a Kahneman-Tversky utility function applied to the implicit reward. KTO | |
| is reported to match DPO at 1B–30B even though it sees only per-output binary labels | |
| (the `2n` outputs of `n` pairs, each labelled independently) rather than the pairing itself. | |
| ### Why we down-rank it relative to the top-3 | |
| KTO is the right answer **only if** the framework wants to ingest single-side | |
| trace signals (e.g., "this trace step succeeded" / "this step crashed the | |
| agent") without constructing pairs. The current | |
| `research/05-trace-replay-distillation.md` design **does** construct pairs | |
| from multi-teacher replay (that is the whole point of the multi-teacher | |
| variance signal), so the marginal value of KTO is small *for channel 3 as | |
| specified*. If the trace-replay design pivots toward absolute scores per | |
| step rather than relative pairs, KTO becomes the right loss and is already | |
| free from `trl`. Add to the backlog as conditional. | |
| --- | |
| ## Audited but NOT recommended | |
| ### GKD — Generalized Knowledge Distillation (Agarwal et al., 2023) | |
| - **arXiv:** https://arxiv.org/abs/2306.13649 (Google DeepMind) | |
| - **Loss core:** student samples its own outputs, teacher provides token | |
| probabilities, divergence is generalized JSD with parameter β: | |
| ``` | |
| D_JSD(β)(P‖Q) = β·KL(P ‖ βP+(1−β)Q) + (1−β)·KL(Q ‖ βP+(1−β)Q) | |
| ``` | |
| - **Why excluded:** **this is exactly the formula we already have** as | |
| `composer_replication.opsd.generalized_jsd_loss` (lifted from | |
| `siyan-zhao/OPSD`). GKD's contribution beyond the loss formula is the | |
| on-policy student sampling protocol — which OPSD also does. No incremental | |
| value to add. | |
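For reference, the generalized JSD formula quoted above evaluates as follows on toy distributions. This is a stdlib-only illustration of the formula, not the framework's `generalized_jsd_loss` (whose signature is not reproduced in this note); `toy_generalized_jsd` is a hypothetical name.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def toy_generalized_jsd(p, q, beta=0.5):
    """D_JSD(beta)(P || Q) = beta * KL(P || M) + (1 - beta) * KL(Q || M), with M = beta*P + (1-beta)*Q."""
    m = [beta * pi + (1.0 - beta) * qi for pi, qi in zip(p, q)]
    return beta * kl(p, m) + (1.0 - beta) * kl(q, m)


p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
symmetric = toy_generalized_jsd(p, q, beta=0.5)   # the classic Jensen-Shannon divergence
skewed = toy_generalized_jsd(p, q, beta=0.1)      # mixture weighted toward Q
```

With `beta = 0.5` the divergence is symmetric in its arguments and bounded by `ln 2`; other values of `beta` skew the mixture toward one of the two distributions.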
| ### DistiLLM (Ko et al., ICML 2024) | |
| - **arXiv:** https://arxiv.org/abs/2402.03898 | |
| - **GitHub:** https://github.com/jongwooko/distillm — MIT, last push 2025-03 | |
| - **Loss core:** *Skew KL divergence* `KL(p ‖ λp + (1−λ)q)` plus an *adaptive | |
| off-policy* student-generated-output (SGO) scheduler. | |
| - **Why excluded:** the skew-KL is the building block of generalized JSD | |
| (each JSD term is a KL against the same kind of mixture), so it belongs to the same family the framework already | |
| has. The interesting contribution, the SGO scheduler, is a process | |
| optimisation, not a loss. The TAID paper's own ablation (Table 6) shows | |
| TAID > Skew KL across student sizes, so TAID dominates this candidate. | |
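The "same family" claim can be made precise with the two formulas quoted in this note: writing `SKL_λ(p, q) = KL(p ‖ λp + (1−λ)q)`, the generalized JSD with parameter `β` equals `β · SKL_β(p, q) + (1−β) · SKL_{1−β}(q, p)`. The sketch below checks that identity numerically; the function names are hypothetical.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def skew_kl(p, q, lam):
    """Skew KL: KL(p || lam * p + (1 - lam) * q)."""
    mix = [lam * pi + (1.0 - lam) * qi for pi, qi in zip(p, q)]
    return kl(p, mix)


def generalized_jsd(p, q, beta):
    """beta * KL(P || M) + (1 - beta) * KL(Q || M), with M = beta * P + (1 - beta) * Q."""
    m = [beta * pi + (1.0 - beta) * qi for pi, qi in zip(p, q)]
    return beta * kl(p, m) + (1.0 - beta) * kl(q, m)


p = [0.6, 0.3, 0.1]
q = [0.2, 0.2, 0.6]
beta = 0.3
from_skew_terms = beta * skew_kl(p, q, beta) + (1.0 - beta) * skew_kl(q, p, 1.0 - beta)
direct = generalized_jsd(p, q, beta)
```

Both mixtures in `from_skew_terms` are the same distribution `βp + (1−β)q`, which is why the two quantities agree up to floating-point error.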
| ### MiniLLM (Gu et al., ICLR 2024) | |
| - **arXiv:** https://arxiv.org/abs/2306.08543 | |
| - **GitHub:** https://github.com/microsoft/LMOps/tree/main/minillm — MIT, repo | |
| active (last push 2026-04) | |
| - **Loss core:** reverse KL minimised by policy-gradient on student rollouts, | |
| with three optimisation tricks: single-step decomposition (variance | |
| reduction), teacher-mixed sampling (anti-reward-hacking), length | |
| normalisation. | |
| - **Why excluded:** reverse-KL on-policy distillation **is the same recipe | |
| family as SDPO/OPSD** the framework already implements. Adding MiniLLM | |
| would be a parallel implementation of the same idea, not an addition. | |
| Entropy-Aware OPD (Candidate 3) reports an improvement over MiniLLM-style | |
| pure reverse KL on exactly the failure mode its authors diagnose (collapse of | |
| student diversity in high-entropy regions). | |
| ### Self-Rewarding Language Models (Yuan et al., 2024) | |
| - **arXiv:** https://arxiv.org/abs/2401.10020 (Meta + NYU) | |
| - **Why excluded:** SRLM is a *training procedure* (iterative DPO with the | |
| model judging its own outputs), not a loss. The actual loss is plain DPO, | |
| which the framework already supports. The procedural contribution belongs | |
| in a future ADR on data generation, not in the distillation channel. | |
| ### Existence check: TAID (arXiv 2501.16937) | |
| The user asked us to verify existence. **It exists.** Submitted 2025-01-28, | |
| ICLR 2025, code at https://github.com/SakanaAI/TAID with two released | |
| checkpoints (`TAID-LLM-1.5B`, `TAID-VLM-2B`). Confirmed primary source. | |
| --- | |
| ## 2026 papers found | |
| The targeted Exa search (`category=research paper`, `startPublishedDate=2026-01-01`) | |
| surfaced five 2026 distillation papers worth listing for completeness: | |
| 1. **Entropy-Aware On-Policy Distillation** — ICLR 2026 Spotlight. ⭐ Promoted to top-3 above. | |
| 2. **KL for a KL: On-Policy Distillation with Control Variate Baseline** (arXiv 2605.07865, Oh et al., 2026-05). Variance-reduction trick for on-policy KL distillation. Useful future read but not a new loss — it's a baseline subtraction added to MiniLLM-style policy gradient. | |
| 3. **Rethinking On-Policy Distillation: Phenomenology, Mechanism, and Recipe** (https://github.com/thunlp/OPD, Tsinghua NLP, last push 2026-04). Empirical study, not a new loss formulation. | |
| 4. **Hybrid Policy Distillation for LLMs** (ICML 2026 poster, Zhu et al.). Combines off-policy and on-policy distillation; positioned as a recipe rather than a new loss; abstract suggests strong overlap with TAID's annealing argument. | |
| 5. **Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation** (ICML 2026 poster, Dasgupta et al.). Targets the long-tail of teacher distributions. Interesting but currently only an abstract; deferred until the camera-ready PDF is available. | |
| Apart from Entropy-Aware OPD, none of these is mature enough (released code + | |
| license + reproducible scale) to recommend adding right now, and Entropy-Aware OPD itself is still code-pending (see Candidate 3). | |
| --- | |
| ## Recommended follow-up wiring | |
| For ADR-007 the proposed addition is a `composer_replication.distillation` | |
| sub-package with three pluggable hooks: | |
| > **Realised in v0.1 (Wave 17 update):** ADR-007 shipped a flatter | |
| > layout than the proposal below. Actual exports: | |
| > | |
| > ``` | |
| > composer_replication/ | |
| > distillation/ | |
| > __init__.py | |
| > simpo.py # simpo_loss(chosen_avg_logprobs, rejected_avg_logprobs, *, beta, gamma) | |
| > # avg_sequence_logprob(logprobs, mask) -- helper | |
| > taid.py # taid_loss(student_logits, teacher_logits, t, *, ...) | |
| > # TAIDScheduler -- adaptive momentum schedule per the paper | |
| > entropy_aware_opd.py # entropy_aware_opd_loss(student_logits, teacher_logits, *, h_max, ...) | |
| > ``` | |
| > | |
| > No `targets.py`/`losses.py` split, no top-level `preference/` package, | |
| > and SimPO lives under `distillation/` rather than `preference/` because | |
| > the three losses share a common dispatch surface (`compose_loss`'s | |
| > `dpo_variant` and `sdpo_wrapper` switches). | |
| > | |
| > The composition rule realised in `compose_loss` is per-loss flag-driven, | |
| > not a single composed-function call: | |
| > | |
| > ```python | |
| > compose_loss(model, inputs, | |
| > dpo_variant="simpo", # OR "dpo" (default) | |
| > sdpo_wrapper="taid", # OR "entropy_opd" OR "none" (default) | |
| > taid_t=0.5, # required when sdpo_wrapper="taid" | |
| > simpo_beta=2.0, simpo_gamma=1.0, # used only when dpo_variant="simpo" | |
| > entropy_opd_h_max=..., # used only when sdpo_wrapper="entropy_opd" | |
| > ) | |
| > ``` | |
| > | |
| > The pre-ADR proposal sketch below is preserved as historical context. | |
| > The shipped function names are `simpo_loss`, `taid_loss` + | |
| > `TAIDScheduler`, and `entropy_aware_opd_loss` (not `taid_target` / | |
| > `entropy_aware_kl_loss`). | |
| ``` | |
| composer_replication/ | |
| distillation/ | |
| __init__.py | |
| targets.py # taid_target(...), fixed_target(...) ← Candidate 2 | |
| losses.py # reuses opsd.generalized_jsd_loss | |
| # adds entropy_aware_kl_loss(...) ← Candidate 3 | |
| preference/ | |
| simpo.py # simpo_loss(...) ← Candidate 1 | |
| dpo.py # existing trace-replay path | |
| ``` | |
| The composition rule for the total loss becomes: | |
| ``` | |
| L_total = λ_grpo · L_GRPO (channel 1, unchanged) | |
| + λ_distill · L_distill (channel 2, see below) | |
| + λ_pref · L_pref (channel 3, choose DPO or SimPO) | |
| L_distill = entropy_aware_kl_loss( | |
| target = taid_target(student, teacher, t), | |
| student = student, | |
| teacher_entropy_gate = α_t | |
| ) | |
| ``` | |
| This keeps the existing `generalized_jsd_loss` reachable as a fallback | |
| (set `α_t ≡ 0` and `t ≡ 1` and you recover SDPO/OPSD exactly). | |
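A per-token toy version of the proposal-era composition rule is sketched below, under the assumption that both KL directions are taken against the TAID-style interpolated target. It mirrors the pre-ADR pseudocode, not the shipped `compose_loss`; all names and the default weights are hypothetical.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distill_token_loss(student_probs, teacher_probs, t, alpha):
    """Entropy-gated KL mixture against the interpolated target p_t = (1 - t) * q_stop + t * p_T."""
    target = [(1.0 - t) * q + t * p for q, p in zip(student_probs, teacher_probs)]
    reverse_kl = kl(student_probs, target)
    forward_kl = kl(target, student_probs)
    return (1.0 - alpha) * reverse_kl + alpha * forward_kl


def total_loss(l_grpo, l_distill, l_pref, lam_grpo=1.0, lam_distill=1.0, lam_pref=1.0):
    """L_total = lam_grpo * L_GRPO + lam_distill * L_distill + lam_pref * L_pref."""
    return lam_grpo * l_grpo + lam_distill * l_distill + lam_pref * l_pref


student = [0.7, 0.2, 0.1]
teacher = [0.5, 0.3, 0.2]
# Fallback setting: gate off (alpha = 0) and target fully at the teacher (t = 1).
fallback = distill_token_loss(student, teacher, t=1.0, alpha=0.0)
```

In this sketch the fallback setting reduces to plain reverse KL of the student against the fixed teacher, the reverse-KL-style objective this note associates with channel 2; whether that coincides exactly with `generalized_jsd_loss` depends on the β that function is called with.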
| --- | |
| ## Sources index | |
| | Paper | arXiv | GitHub | License | Last push | Maturity | | |
| |-------|-------|--------|---------|-----------|----------| | |
| | SimPO | https://arxiv.org/abs/2405.14734 | https://github.com/princeton-nlp/SimPO | MIT | 2024-10-12 | Production | | |
| | TAID | https://arxiv.org/abs/2501.16937 | https://github.com/SakanaAI/TAID | Apache-2.0 | 2025-10-06 | Production | | |
| | Entropy-Aware OPD | n/a (OpenReview WSRQ37tzk1) | code-pending | CC BY 4.0 (paper) | n/a | Paper-only | | |
| | KTO | https://arxiv.org/abs/2402.01306 | huggingface/trl (built-in) | Apache-2.0 | continuous | Production | | |
| | GKD | https://arxiv.org/abs/2306.13649 | (no official repo from authors; reproduced inside SakanaAI/TAID and jongwooko/distillm) | n/a | n/a | Reference only | | |
| | DistiLLM | https://arxiv.org/abs/2402.03898 | https://github.com/jongwooko/distillm | (no LICENSE file at audit time) | 2025-03-13 | Research | | |
| | MiniLLM | https://arxiv.org/abs/2306.08543 | https://github.com/microsoft/LMOps/tree/main/minillm | MIT | 2026-04-08 | Production | | |
| | Self-Rewarding LM | https://arxiv.org/abs/2401.10020 | (no canonical repo; integrated into many forks) | n/a | n/a | Procedure, not a loss | | |
| --- | |
| ## Notes for ADR-007 author | |
| 1. **SimPO and TAID can land independently and without coordination.** They | |
| touch different files and do not compete. | |
| 2. **Entropy-Aware OPD should land last.** Wait for the IBM/KAIST authors' | |
| code release; if it's not out by the time we want to ship the change, the | |
| formula is simple enough to re-derive but we should pin a unit test that | |
| reproduces the paper's Figure 3 entropy-vs-divergence behaviour. | |
| 3. **Do not also pull in GKD/DistiLLM/MiniLLM.** Their loss contributions are | |
| strict subsets of what (TAID + Entropy-Aware OPD + existing | |
| `generalized_jsd_loss`) covers. | |
| 4. **KTO should be added as a backlog item** with a "trigger" condition: | |
| when the trace-replay reward design moves from preference pairs to per-step | |
| binary signals, switch on the `trl.KTOTrainer` path. | |
| --- | |
| *Absolute path of this report:* `/mnt/e/CS/HF/composer-replication-framework/docs/research/SELF_DISTILLATION_LANDSCAPE.md` | |