Self-Distillation Landscape Audit (feeds ADR-007)
Status: research note, pre-experimental
Author: subagent audit
Date: 2026-05-25
Scope: identify 2–3 distillation-channel losses worth adding to
composer_replication alongside the existing GRPO + SDPO/OPSD generalized_jsd_loss +
multi-teacher trace-replay DPO stack.
Bias: additivity over novelty. We are looking for losses that COMPOSE with
what is already implemented, not duplicates of it.
TL;DR — recommended additions
| Rank | Method | Loss role | License | LOC est. | Why it composes |
|---|---|---|---|---|---|
| 1 | SimPO (NeurIPS 2024) | Preference, reference-free | MIT | ~80 | Drop-in for trace-replay DPO; removes ref-model VRAM cost; orthogonal to JSD distillation channel |
| 2 | TAID (ICLR 2025) | Interpolated-target wrapper around any KL/JSD | Apache-2.0 | ~150 | Wraps the existing generalized_jsd_loss — does not replace it. Closes capacity gap on small students |
| 3 | Entropy-Aware OPD (ICLR 2026 Spotlight) | Token-gated forward/reverse KL mixture | CC BY 4.0 (paper); code expected | ~120 | Fixes a documented failure mode of the reverse-KL-style SDPO loss when teacher entropy is high — directly addresses a known weakness of channel 2 |
Honourable mention: KTO — useful only if the framework wants to ingest binary thumbs-up/thumbs-down trace signals without preference pairs. Not recommended: GKD, DistiLLM, MiniLLM, Self-Rewarding LM (rationale at end).
Audit method
For each candidate paper (the seven the user named, plus 2026 follow-ups
discovered via Exa search restricted to category=research paper, startPublishedDate=2026-01-01)
we verified:
- Primary source exists. arXiv abstract page reachable; HTML body parsed to extract the actual loss formula (not summarised from secondary sources).
- Code is real. Official repo's README was fetched, last push date and star count recorded. Forks of MiniLLM/DistiLLM that are no longer maintained were marked as such.
- License is permissive enough. MIT, Apache-2.0, BSD, CC BY 4.0 are acceptable for inclusion. GPL or research-only would be flagged.
- Composability check. Read the framework's existing composer_replication/__init__.py and research/05-trace-replay-distillation.md, then asked: does this loss replace something we have, or stack on top?
Candidate 1 — SimPO (Simple Preference Optimization) ⭐ RECOMMENDED
Sources
- arXiv: https://arxiv.org/abs/2405.14734 (Meng, Xia, Chen — UVA + Princeton, NeurIPS 2024)
- GitHub: https://github.com/princeton-nlp/SimPO
- License: MIT
- 949 stars, 74 forks, last commit 2024-10-12 (mature, post-NeurIPS)
- Built on top of huggingface/alignment-handbook
- Maturity: production-ready. Released checkpoints for Mistral, Llama-3, Gemma-2 base/instruct. Reproducible training configs ship with the repo.
Loss core (reference-free preference)
SimPO replaces the DPO log-ratio (which requires keeping π_ref in memory)
with the average log-probability of the sequence under the policy, plus
a target reward margin γ:
r(x, y) = (β / |y|) · log π_θ(y | x) ← length-normalised implicit reward
(no reference model)
L_SimPO(π_θ) = −E_{(x, y_w, y_l) ~ D} [
log σ( r(x, y_w) − r(x, y_l) − γ )
]
where β is a temperature (typically 2.0–10) and γ is the desired margin
between chosen and rejected (the repo recommends γ/β ≈ 0.5 as a starting
point). Two consequences: (i) no π_ref forward pass per step → roughly half
the memory, and (ii) the implicit reward is exactly the quantity the model
generates from at decode time, removing a known DPO pathology where
decoding-time and training-time rewards diverge.
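The reward and loss above can be sketched for a single preference pair in plain Python. This is a minimal illustration, not the shipped implementation: the names `mean_token_logprob` and `simpo_pair_loss` are hypothetical, the inputs are per-token log-probabilities already gathered from the policy, and a real trainer would operate on batched tensors with a padding mask.

```python
import math


def mean_token_logprob(token_logprobs):
    """Length-normalised sequence log-probability: (1/|y|) * sum_t log pi(y_t | x, y_<t)."""
    return sum(token_logprobs) / len(token_logprobs)


def simpo_pair_loss(chosen_token_logprobs, rejected_token_logprobs, beta=2.0, gamma=1.0):
    """-log sigmoid(r_w - r_l - gamma), with r = beta * mean token log-prob.

    No reference-model term appears anywhere: that is the point of SimPO.
    """
    r_w = beta * mean_token_logprob(chosen_token_logprobs)
    r_l = beta * mean_token_logprob(rejected_token_logprobs)
    z = r_w - r_l - gamma
    # -log(sigmoid(z)) == softplus(-z), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Toy pair (made-up numbers): the chosen answer has the higher per-token likelihood.
chosen = [-0.5, -0.5]            # 2 tokens, mean log-prob -0.5
rejected = [-1.5, -1.5, -1.5]    # 3 tokens, mean log-prob -1.5
loss = simpo_pair_loss(chosen, rejected)  # z = 2*(-0.5) - 2*(-1.5) - 1 = 1
print(round(loss, 4))
```

With β = 2 and γ = 1 (so γ/β = 0.5, the starting point mentioned above), the pair is separated by exactly one unit of margin and the loss is log(1 + e^-1). Swapping chosen and rejected increases the loss, which is the behaviour a preference loss should show.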
Why it composes with the existing stack
- The framework's channel 3 is multi-teacher trace-replay DPO. SimPO is a drop-in replacement for the DPO step inside that channel: same (x, y_w, y_l) data contract, different loss head. So the trace-replay harvester does not change at all.
- It does not touch channel 2 (SDPO/OPSD generalized_jsd_loss). The two are complementary: JSD-distillation transfers token-level teacher knowledge, SimPO sharpens preference structure between trace alternatives.
- It does not duplicate GRPO either. GRPO is online-policy RLVR; SimPO is offline preference. Different data sources.
- The published Mistral-7B and Llama-3-8B SimPO results beat DPO by 4–6 points on AlpacaEval-2 LC, which directly translates to "if we already have channel-3 pairs, SimPO is a free upgrade".
Implementation cost
- ~80 LOC for the trainer hook; the loss itself is ~15 lines (log-probs, length-normalise, margin, BCE).
- Dependencies: nothing new; torch, transformers already in repo.
- The reference implementation is a single file in princeton-nlp/SimPO (scripts/run_simpo.py + alignment/ trainer subclass) under MIT, so we can vendor it exactly as we did with OPSD.
Candidate 2 — TAID (Temporally Adaptive Interpolated Distillation) ⭐ RECOMMENDED
Sources
- arXiv: https://arxiv.org/abs/2501.16937 (Shing, Misaki, Bao, Yokoi, Akiba — Sakana AI, ICLR 2025)
- GitHub: https://github.com/SakanaAI/TAID
- License: Apache-2.0
- 121 stars, last push 2025-10-06 (actively maintained)
- Reference implementations of GKD, DistiLLM, Adaptive-KL, CTKD, DKD are also in src/distil_losses/ for free
- Released artefacts: TAID-LLM-1.5B, TAID-VLM-2B on HuggingFace (so the loss is verified at non-trivial scale).
- Maturity: published, single-author commits but reproducibly trained two SoTA compact models with it.
Loss core (interpolated teacher target)
Standard distillation losses (forward KL, reverse KL, JSD, including the
generalized_jsd_loss we already have) target a fixed teacher distribution
p_T. TAID replaces this fixed target with a time-dependent interpolated
target p_t that starts close to the student and moves toward the teacher
as training progresses:
p_t(y | x) = (1 − t) · q_θ_stop(y | x) + t · p_T(y | x) (1)
J_TAID(θ; t) = D_KL( p_t ‖ q_θ ) (2)
q_θ_stop is the student's own current distribution with stop-gradient. The
interpolation coefficient t ∈ [t_start, 1] is updated each step by an
adaptive momentum schedule that grows t faster when training loss is
falling and slower when it stalls — this is the "temporally adaptive" part.
The Sakana paper proves (Theorem 4.1) that, for the regression analogue, this
schedule prevents the mode-collapse failure mode of pure self-distillation.
Critically, D_KL(p_t ‖ q_θ) is just one choice of divergence applied to a
shifted target: you can equally well plug in JSD, reverse KL, or the
generalized_jsd_loss the framework already exports. TAID is therefore a
wrapper around an existing divergence, not a competing divergence.
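Equations (1) and (2) can be checked on a single token position with a few lines of plain Python. This is a sketch under stated assumptions: `interpolated_target` and `taid_objective` are hypothetical names, the logits are made-up numbers, and stop-gradient has no meaning outside an autograd framework, so the student distribution is simply treated as a constant when building the target.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """D_KL(p || q); assumes q is strictly positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def interpolated_target(student_probs, teacher_probs, t):
    """Eq. (1): p_t = (1 - t) * q_stop + t * p_T, with q_stop held constant."""
    return [(1.0 - t) * qs + t * pt for qs, pt in zip(student_probs, teacher_probs)]


def taid_objective(student_logits, teacher_logits, t):
    """Eq. (2): D_KL(p_t || q_theta) at one token position."""
    q = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    return kl(interpolated_target(q, p_teacher, t), q)


student_logits = [2.0, 0.0, -1.0]   # made-up values
teacher_logits = [0.0, 1.0, 3.0]    # made-up values
for t in (0.0, 0.5, 1.0):
    print(t, round(taid_objective(student_logits, teacher_logits, t), 4))
```

At t = 0 the target equals the student and the objective vanishes; at t = 1 it is the ordinary forward KL against the teacher. Swapping `kl` for another divergence leaves `interpolated_target` untouched, which is the wrapper property described above.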
Why it composes with the existing stack
- It wraps composer_replication.opsd.generalized_jsd_loss rather than replacing it. The change is "compute the JSD against p_t instead of p_T", a few lines around the existing call site.
- Addresses a documented weakness of OPSD-style self-distillation: when the teacher's privileged-context distribution is far from the student's capacity, the JSD signal can be noisy or push the student into mode averaging. TAID's annealed target gives the student a curriculum.
- Empirical evidence the Sakana paper directly compares with: TAID + JSD beats GKD + JSD beats DistiLLM + skew-KL on Phi-3 → TinyLlama distillation, with 0.7 h / epoch vs 9.8 h / epoch for GKD on identical hardware. The speed comes from not needing student-generated outputs (SGOs) at every step the way GKD does.
- Composes additively with channel 1 (GRPO) and channel 3 (trace-replay DPO) because TAID lives strictly inside channel 2.
Implementation cost
- ~150 LOC. The change is:
  - A TAIDState object that holds t, the EMA of training loss, and the momentum coefficient β (default 0.99).
  - A function taid_target(student_logits, teacher_logits, t) that returns (1−t)·softmax(student_logits.detach()) + t·softmax(teacher_logits).
  - A scheduler hook that updates t after each backward pass per Algorithm 1 of the paper.
- Dependencies: nothing new.
- Reference implementation in SakanaAI/TAID/src/distil_losses/taid.py is Apache-2.0: vendor-friendly, same pattern as our OPSD lift.
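The scheduler state described in the list above might look roughly like the following. This is an assumption-heavy sketch, not a transcription of Algorithm 1: the class name `TAIDStateSketch`, the defaults for `t` and `alpha`, and the exact update rule (a sigmoid of an EMA of relative loss improvement, scaled by the remaining distance to 1) are illustrative choices made to match the prose description "grow t faster when the loss is falling". Only β = 0.99 is taken from the text; the authoritative rule is the one in the paper and in SakanaAI/TAID.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class TAIDStateSketch:
    t: float = 0.4                  # hypothetical t_start
    alpha: float = 5e-4             # hypothetical step size
    beta: float = 0.99              # momentum coefficient (default quoted above)
    momentum: float = 0.0           # EMA of the relative loss improvement
    prev_loss: Optional[float] = None

    def update(self, loss: float, eps: float = 1e-8) -> float:
        """Advance t after one optimisation step and return the new value."""
        if self.prev_loss is not None:
            rel_improvement = (self.prev_loss - loss) / (self.prev_loss + eps)
            self.momentum = self.beta * self.momentum + (1.0 - self.beta) * rel_improvement
            gate = 1.0 / (1.0 + math.exp(-self.momentum))  # > 0.5 while the loss is falling
            self.t = min(1.0, self.t + self.alpha * gate * (1.0 - self.t))
        self.prev_loss = loss
        return self.t


def run(losses, **kwargs):
    state = TAIDStateSketch(**kwargs)
    return [state.update(loss) for loss in losses]


# Exaggerated alpha/beta so the effect is visible in four steps.
falling = run([4.0, 3.0, 2.0, 1.0], alpha=0.1, beta=0.5)
stalled = run([4.0, 4.0, 4.0, 4.0], alpha=0.1, beta=0.5)
print(falling[-1], stalled[-1])
```

In this sketch t never decreases and is capped at 1, and a falling loss pushes t toward the teacher faster than a stalled one.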
Candidate 3 — Entropy-Aware On-Policy Distillation (Entropy-Aware OPD) ⭐ RECOMMENDED
Sources
- OpenReview (ICLR 2026 Spotlight): https://openreview.net/forum?id=WSRQ37tzk1
- IBM Research page: https://research.ibm.com/publications/entropy-aware-on-policy-distillation-of-language-models
- Authors: Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, Kimin Lee (KAIST + IBM Research)
- Status: ICLR 2026 Spotlight, submission #113. License on the OpenReview record is CC BY 4.0.
- Code: not yet released on GitHub at the time of audit (paper accepted 2026-03-03). IBM authors typically release within the conference window. Maturity flag: paper-ready, code-pending. This is the only candidate where we'd need to re-implement from the paper.
Loss core (entropy-gated forward/reverse KL mixture)
The paper diagnoses a failure mode in the reverse-KL-on-policy distillation
recipe used by MiniLLM, OPSD, and (implicitly) by our SDPO channel: when the
teacher distribution has high entropy at a given token, reverse KL's
mode-seeking gradient becomes noisy and collapses the student's diversity.
Their fix: at each token t, gate between forward and reverse KL based on
the teacher's entropy:
H_t = − Σ_v p_T(v | x, y_<t) · log p_T(v | x, y_<t) (teacher entropy)
α_t = sigmoid( (H_t − τ) / s ) ∈ (0, 1)
L_EA(θ) = E_{y ~ q_θ} Σ_t [
(1 − α_t) · D_KL( q_θ(· | x, y_<t) ‖ p_T(· | x, y_<t) ) ← reverse KL
+ α_t · D_KL( p_T(· | x, y_<t) ‖ q_θ(· | x, y_<t) ) ← forward KL
]
τ is an entropy threshold (default ≈ 1.0 nat in their experiments) and s
is a temperature controlling how sharp the gate is. When the teacher is
confident (H_t small → α_t ≈ 0) the loss is pure reverse KL, identical to
MiniLLM/OPSD behaviour. When the teacher is uncertain (H_t large → α_t ≈ 1)
the loss switches to forward KL, which is mode-covering and preserves
student diversity.
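Because no reference code was public at audit time, the following is only a per-token sketch of the gate and mixture as written above. The function names are hypothetical, the gate temperature s = 0.25 is a made-up illustration value, and the distributions are toy numbers over an 8-token vocabulary. It also differs from the shipped entropy_aware_opd_loss signature, which takes h_max.

```python
import math


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def kl(p, q):
    """D_KL(p || q); assumes q is strictly positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def entropy_gate(teacher_probs, tau=1.0, s=0.25):
    """alpha = sigmoid((H - tau) / s); s = 0.25 is an illustrative assumption."""
    h = entropy(teacher_probs)
    return 1.0 / (1.0 + math.exp(-(h - tau) / s))


def entropy_aware_token_loss(student_probs, teacher_probs, tau=1.0, s=0.25):
    """(1 - alpha) * reverse KL + alpha * forward KL at one token position."""
    alpha = entropy_gate(teacher_probs, tau, s)
    reverse_kl = kl(student_probs, teacher_probs)   # mode-seeking
    forward_kl = kl(teacher_probs, student_probs)   # mode-covering
    return (1.0 - alpha) * reverse_kl + alpha * forward_kl, alpha


student = [0.5, 0.2, 0.1, 0.05, 0.05, 0.04, 0.03, 0.03]
confident_teacher = [0.93] + [0.01] * 7   # low entropy  -> gate near 0
uncertain_teacher = [0.125] * 8           # entropy ln 8 -> gate near 1

loss_conf, alpha_conf = entropy_aware_token_loss(student, confident_teacher)
loss_unc, alpha_unc = entropy_aware_token_loss(student, uncertain_teacher)
print(round(alpha_conf, 3), round(alpha_unc, 3))
```

With these toy inputs the confident teacher yields a gate close to 0 (mostly reverse KL) and the uniform teacher a gate close to 1 (mostly forward KL), matching the two regimes described above.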
Reported gains over baseline reverse-KL OPD on Qwen3-0.6B/1.7B/4B: Pass@8 on six math benchmarks improves by +1.37 / +2.39 / +5.05 respectively. The larger gains at larger student size suggest the failure mode reverse KL exhibits gets worse with capacity, not better.
Why it composes with the existing stack
- It is strictly token-wise: same trajectory, same teacher logits, same rollout pipeline as the existing channel 2. The only change is the loss reduction: instead of computing generalized_jsd_loss with a single fixed β, you compute a per-token mixture of forward and reverse KL with weight given by teacher entropy.
- This is genuinely orthogonal to OPSD/SDPO. OPSD's contribution is privileged-context teacher distribution under student rollouts. EA-OPD's contribution is which divergence to use at each token of that distribution. Both can be true simultaneously.
- Directly addresses a failure mode the framework's roadmap will hit: multi-teacher trace replay (channel 3) produces high-entropy aggregated teacher distributions at exactly the steps where teachers disagree. Those are the steps where reverse KL behaves worst. EA-OPD's entropy gate would automatically soften the loss on those exact tokens.
- Composes with TAID (Candidate 2) too — they operate on different axes: TAID anneals the target distribution, EA-OPD chooses the divergence direction. Stacking is straightforward and proposed as ADR-007 follow-up.
Implementation cost
- ~120 LOC estimate (no reference code to vendor yet).
- Dependencies: nothing new. Token-level entropy is −(p * log p).sum(-1), forward KL is the existing teacher-on-student term, reverse KL is the student-on-teacher term we already compute for the JSD in OPSD. The work is re-shaping the existing per-token loss to expose both directions.
- Risk note: code not yet public. We should hold this candidate behind a feature flag until the IBM/KAIST team releases reference code (expected by ICLR 2026 in May). If the implementation ships sooner we should vendor and match line-for-line; if not, we re-derive from the paper formula and add a unit test that reproduces their toy entropy-vs-divergence plot.
Honourable mention — KTO (Kahneman-Tversky Optimization)
- arXiv: https://arxiv.org/abs/2402.01306
- Code: integrated into HuggingFace trl library since v0.8 (Apache-2.0).
- License/maturity: production. KTO is a standard trl trainer alongside DPO.
Loss core
KTO replaces preference pairs with per-output binary desirability signals.
For a desirable output y_+ and undesirable output y_−:
r_θ(x, y) = β · log( π_θ(y|x) / π_ref(y|x) )
z_0 = E_{x', y' ~ π_θ}[ KL( π_θ(·|x') ‖ π_ref(·|x') ) ] (reference point)
L_KTO = E_{x, y_+} [λ_D · (1 − σ(r_θ(x, y_+) − z_0))] (desirable)
+ E_{x, y_−} [λ_U · (1 − σ(z_0 − r_θ(x, y_−)))] (undesirable)
with default λ_D = λ_U = 1. The derivation is via prospect theory: this is
a Kahneman-Tversky utility function applied to the implicit reward. KTO
matches DPO at 1B–30B even though it sees only 2n binary signals where
DPO sees n pairs.
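A per-example reading of the KTO formula above, in plain Python. This is a sketch, not trl's implementation: `kto_example_loss` is a hypothetical name, β = 0.1 is an illustrative default, and z_0 is passed in as a plain number, whereas in practice it is estimated from a batch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def kto_example_loss(policy_logprob, ref_logprob, desirable, z0,
                     beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Loss for one (x, y) with a binary desirability label.

    policy_logprob / ref_logprob are sequence log-probs log pi(y|x);
    z0 is the reference point (assumed given, not estimated here).
    """
    r = beta * (policy_logprob - ref_logprob)       # implicit reward r_theta(x, y)
    if desirable:
        return lambda_d * (1.0 - sigmoid(r - z0))   # push reward above z0
    return lambda_u * (1.0 - sigmoid(z0 - r))       # push reward below z0


# Made-up numbers: the policy assigns this output more mass than the reference.
as_desirable = kto_example_loss(-10.0, -14.0, desirable=True, z0=0.1)
as_undesirable = kto_example_loss(-10.0, -14.0, desirable=False, z0=0.1)
print(round(as_desirable, 4), round(as_undesirable, 4))
```

The same output is cheap when labelled desirable and costly when labelled undesirable. No pairing is needed, which is what makes the loss fit single-sided trace signals.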
Why we down-rank it relative to the top-3
KTO is the right answer only if the framework wants to ingest single-side
trace signals (e.g., "this trace step succeeded" / "this step crashed the
agent") without constructing pairs. The current
research/05-trace-replay-distillation.md design does construct pairs
from multi-teacher replay (that is the whole point of the multi-teacher
variance signal), so the marginal value of KTO is small for channel 3 as
specified. If the trace-replay design pivots toward absolute scores per
step rather than relative pairs, KTO becomes the right loss and is already
free from trl. Add to the backlog as conditional.
Audited but NOT recommended
GKD — Generalized Knowledge Distillation (Agarwal et al., 2023)
- arXiv: https://arxiv.org/abs/2306.13649 (Google DeepMind)
- Loss core: student samples its own outputs, teacher provides token probabilities, divergence is generalized JSD with parameter β:
  D_JSD(β)(P‖Q) = β·KL(P ‖ βP+(1−β)Q) + (1−β)·KL(Q ‖ βP+(1−β)Q)
- Why excluded: this is exactly the formula we already have as composer_replication.opsd.generalized_jsd_loss (lifted from siyan-zhao/OPSD). GKD's contribution beyond the loss formula is the on-policy student sampling protocol, which OPSD also does. No incremental value to add.
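For reference, the generalized JSD formula quoted above can be evaluated on toy distributions as follows. `generalized_jsd_sketch` is a hypothetical name and is not claimed to match the signature of the framework's generalized_jsd_loss; it only instantiates the formula on plain Python lists.

```python
import math


def kl(p, q):
    """D_KL(p || q); assumes q is strictly positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def generalized_jsd_sketch(p, q, beta=0.5):
    """beta * KL(P || M) + (1 - beta) * KL(Q || M), with M = beta*P + (1 - beta)*Q."""
    mix = [beta * pi + (1.0 - beta) * qi for pi, qi in zip(p, q)]
    return beta * kl(p, mix) + (1.0 - beta) * kl(q, mix)


teacher = [0.7, 0.2, 0.1]   # made-up distributions
student = [0.2, 0.3, 0.5]
print(round(generalized_jsd_sketch(teacher, student, beta=0.5), 4))
print(round(generalized_jsd_sketch(teacher, student, beta=0.9), 4))
```

At β = 0.5 this is the ordinary symmetric Jensen-Shannon divergence, bounded by ln 2. Other values of β weight the two KL terms and the mixture unevenly.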
DistiLLM (Ko et al., ICML 2024)
- arXiv: https://arxiv.org/abs/2402.03898
- GitHub: https://github.com/jongwooko/distillm — MIT, last push 2025-03
- Loss core: Skew KL divergence KL(p ‖ λp + (1−λ)q) plus an adaptive off-policy student-generated-output (SGO) scheduler.
- Why excluded: the skew-KL is a special case of generalized JSD (set the mixture coefficient appropriately), so it belongs to the same family the framework already has. The interesting contribution, the SGO scheduler, is a process optimisation, not a loss. The TAID paper's own ablation (Table 6) shows TAID > Skew KL across student sizes, so TAID dominates this candidate.
MiniLLM (Gu et al., ICLR 2024)
- arXiv: https://arxiv.org/abs/2306.08543
- GitHub: https://github.com/microsoft/LMOps/tree/main/minillm — MIT, repo active (last push 2026-04)
- Loss core: reverse KL minimised by policy-gradient on student rollouts, with three optimisation tricks: single-step decomposition (variance reduction), teacher-mixed sampling (anti-reward-hacking), length normalisation.
- Why excluded: reverse-KL on-policy distillation is the same recipe family as SDPO/OPSD the framework already implements. Adding MiniLLM would be a parallel implementation of the same idea, not an addition. Entropy-Aware OPD (Candidate 3) is a strict improvement over MiniLLM's pure reverse-KL on exactly the failure mode MiniLLM identifies (mode collapse in high-entropy regions).
Self-Rewarding Language Models (Yuan et al., 2024)
- arXiv: https://arxiv.org/abs/2401.10020 (Meta + NYU)
- Why excluded: SRLM is a training procedure (iterative DPO with the model judging its own outputs), not a loss. The actual loss is plain DPO, which the framework already supports. The procedural contribution belongs in a future ADR on data generation, not in the distillation channel.
TAID's relationship to "TAID arXiv 2501.16937 if it exists"
The user asked us to verify existence. It exists. Submitted 2025-01-28,
ICLR 2025, code at https://github.com/SakanaAI/TAID with two released
checkpoints (TAID-LLM-1.5B, TAID-VLM-2B). Confirmed primary source.
2026 papers found
The targeted Exa search (category=research paper, startPublishedDate=2026-01-01)
surfaced five 2026 distillation papers worth listing for completeness:
- Entropy-Aware On-Policy Distillation — ICLR 2026 Spotlight. ⭐ Promoted to top-3 above.
- KL for a KL: On-Policy Distillation with Control Variate Baseline (arXiv 2605.07865, Oh et al., 2026-05). Variance-reduction trick for on-policy KL distillation. Useful future read but not a new loss — it's a baseline subtraction added to MiniLLM-style policy gradient.
- Rethinking On-Policy Distillation: Phenomenology, Mechanism, and Recipe (https://github.com/thunlp/OPD, Tsinghua NLP, last push 2026-04). Empirical study, not a new loss formulation.
- Hybrid Policy Distillation for LLMs (ICML 2026 poster, Zhu et al.). Combines off-policy and on-policy distillation; positioned as a recipe rather than a new loss; abstract suggests strong overlap with TAID's annealing argument.
- Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation (ICML 2026 poster, Dasgupta et al.). Targets the long-tail of teacher distributions. Interesting but currently only an abstract; deferred until the camera-ready PDF is available.
Of these, only Entropy-Aware OPD is recommended now, and it is itself code-pending; none of the others yet meets the bar of released code, a clear license, and results at reproducible scale.
Recommended follow-up wiring
For ADR-007 the proposed addition is a composer_replication.distillation
sub-package with three pluggable hooks:
Realised in v0.1 (Wave 17 update): ADR-007 shipped a flatter layout than the proposal below. Actual exports:
composer_replication/
  distillation/
    __init__.py
    simpo.py              # simpo_loss(chosen_avg_logprobs, rejected_avg_logprobs, *, beta, gamma)
                          # avg_sequence_logprob(logprobs, mask) -- helper
    taid.py               # taid_loss(student_logits, teacher_logits, t, *, ...)
                          # TAIDScheduler -- adaptive momentum schedule per the paper
    entropy_aware_opd.py  # entropy_aware_opd_loss(student_logits, teacher_logits, *, h_max, ...)
No targets.py / losses.py split, no top-level preference/ package, and SimPO lives under distillation/ rather than preference/ because the three losses share a common dispatch surface (compose_loss's dpo_variant and sdpo_wrapper switches).
The composition rule realised in compose_loss is per-loss flag-driven, not a single composed-function call:
compose_loss(model, inputs,
    dpo_variant="simpo",                # OR "dpo" (default)
    sdpo_wrapper="taid",                # OR "entropy_opd" OR "none" (default)
    taid_t=0.5,                         # required when sdpo_wrapper="taid"
    simpo_beta=2.0, simpo_gamma=1.0,    # used only when dpo_variant="simpo"
    entropy_opd_h_max=...,              # used only when sdpo_wrapper="entropy_opd"
)
The pre-ADR proposal sketch below is preserved as historical context. The shipped function names are simpo_loss, taid_loss + TAIDScheduler, and entropy_aware_opd_loss (not taid_target / entropy_aware_kl_loss).
composer_replication/
  distillation/
    __init__.py
    targets.py    # taid_target(...), fixed_target(...)   ← Candidate 2
    losses.py     # reuses opsd.generalized_jsd_loss
                  # adds entropy_aware_kl_loss(...)        ← Candidate 3
  preference/
    simpo.py      # simpo_loss(...)                        ← Candidate 1
    dpo.py        # existing trace-replay path
The composition rule for the total loss becomes:
L_total = λ_grpo · L_GRPO (channel 1, unchanged)
+ λ_distill · L_distill (channel 2, see below)
+ λ_pref · L_pref (channel 3, choose DPO or SimPO)
L_distill = entropy_aware_kl_loss(
target = taid_target(student, teacher, t),
student = student,
teacher_entropy_gate = α_t
)
This keeps the existing generalized_jsd_loss reachable as a fallback
(set α_t ≡ 0 and t ≡ 1 and you recover SDPO/OPSD exactly).
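The proposed composition can be sanity-checked on toy numbers. The sketch below is one possible reading of the pseudocode above, not the shipped compose_loss: `composed_distill_loss` and `total_loss` are hypothetical names, the gate α is passed in directly rather than derived from teacher entropy, and the distributions and λ weights are made up.

```python
import math


def kl(p, q):
    """D_KL(p || q); assumes q is strictly positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def composed_distill_loss(student_probs, teacher_probs, t, alpha):
    """Entropy-gated KL mixture evaluated against a TAID-style interpolated target."""
    target = [(1.0 - t) * qs + t * pt for qs, pt in zip(student_probs, teacher_probs)]
    reverse_kl = kl(student_probs, target)
    forward_kl = kl(target, student_probs)
    return (1.0 - alpha) * reverse_kl + alpha * forward_kl


def total_loss(l_grpo, l_distill, l_pref, lam_grpo=1.0, lam_distill=1.0, lam_pref=1.0):
    """L_total = lam_grpo * L_GRPO + lam_distill * L_distill + lam_pref * L_pref."""
    return lam_grpo * l_grpo + lam_distill * l_distill + lam_pref * l_pref


student = [0.7, 0.2, 0.1]
teacher = [0.2, 0.3, 0.5]

# Fallback corner: fixed teacher target (t = 1) and gate closed (alpha = 0).
fallback = composed_distill_loss(student, teacher, t=1.0, alpha=0.0)
stacked = composed_distill_loss(student, teacher, t=0.5, alpha=0.3)
print(round(fallback, 4), round(stacked, 4))
print(total_loss(1.0, stacked, 0.5, lam_distill=0.5))
```

In this sketch the corner t = 1, α = 0 reduces to plain reverse KL against the fixed teacher, and t = 0 gives zero loss because the target coincides with the student. Whether that corner matches the framework's generalized_jsd_loss numerically depends on how its β is configured, which this example does not model.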
Sources index
| Paper | arXiv | GitHub | License | Last push | Maturity |
|---|---|---|---|---|---|
| SimPO | https://arxiv.org/abs/2405.14734 | https://github.com/princeton-nlp/SimPO | MIT | 2024-10-12 | Production |
| TAID | https://arxiv.org/abs/2501.16937 | https://github.com/SakanaAI/TAID | Apache-2.0 | 2025-10-06 | Production |
| Entropy-Aware OPD | n/a (OpenReview WSRQ37tzk1) | code-pending | CC BY 4.0 (paper) | n/a | Paper-only |
| KTO | https://arxiv.org/abs/2402.01306 | huggingface/trl (built-in) | Apache-2.0 | continuous | Production |
| GKD | https://arxiv.org/abs/2306.13649 | (no official repo from authors; reproduced inside SakanaAI/TAID and jongwooko/distillm) | n/a | n/a | Reference only |
| DistiLLM | https://arxiv.org/abs/2402.03898 | https://github.com/jongwooko/distillm | (no LICENSE file at audit time) | 2025-03-13 | Research |
| MiniLLM | https://arxiv.org/abs/2306.08543 | https://github.com/microsoft/LMOps/tree/main/minillm | MIT | 2026-04-08 | Production |
| Self-Rewarding LM | https://arxiv.org/abs/2401.10020 | (no canonical repo; integrated into many forks) | n/a | n/a | Procedure, not a loss |
Notes for ADR-007 author
- SimPO and TAID can land independently and without coordination. They touch different files and do not compete.
- Entropy-Aware OPD should land last. Wait for the IBM/KAIST authors' code release; if it's not out by the time we want to ship the change, the formula is simple enough to re-derive but we should pin a unit test that reproduces the paper's Figure 3 entropy-vs-divergence behaviour.
- Do not also pull in GKD/DistiLLM/MiniLLM. Their loss contributions are strict subsets of what (TAID + Entropy-Aware OPD + existing generalized_jsd_loss) covers.
- KTO should be added as a backlog item with a "trigger" condition: when the trace-replay reward design moves from preference pairs to per-step binary signals, switch on the trl.KTOTrainer path.
Absolute path of this report: /mnt/e/CS/HF/composer-replication-framework/docs/research/SELF_DISTILLATION_LANDSCAPE.md