	# Self-Distillation Landscape Audit (feeds ADR-007)

	Status: research note, pre-experimental
	Author: subagent audit
	Date: 2026-05-25
	Scope: identify 2–3 distillation-channel losses worth adding to
	`composer_replication` alongside the existing GRPO + SDPO/OPSD `generalized_jsd_loss` +
	multi-teacher trace-replay DPO stack.
	Bias: additivity over novelty. We are looking for losses that COMPOSE with
	what is already implemented, not duplicates of it.

	---

	## TL;DR — recommended additions

	| Rank | Method | Loss role | License | LOC est. | Why it composes |
	|------|--------|-----------|---------|----------|-----------------|
	| 1 | SimPO (NeurIPS 2024) | Preference, reference-free | MIT | ~80 | Drop-in for trace-replay DPO; removes ref-model VRAM cost; orthogonal to JSD distillation channel |
	| 2 | TAID (ICLR 2025) | Interpolated-target wrapper around any KL/JSD | Apache-2.0 | ~150 | Wraps the existing `generalized_jsd_loss`; does not replace it. Closes capacity gap on small students |
	| 3 | Entropy-Aware OPD (ICLR 2026 Spotlight) | Token-gated forward/reverse KL mixture | CC BY 4.0 (paper); code expected | ~120 | Fixes a documented failure mode of the reverse-KL-style SDPO loss when teacher entropy is high; directly addresses a known weakness of channel 2 |

	Honourable mention: KTO — useful only if the framework wants to ingest
	binary thumbs-up/thumbs-down trace signals without preference pairs.
	Not recommended: GKD, DistiLLM, MiniLLM, Self-Rewarding LM (rationale at end).

	---

	## Audit method

	For each candidate paper (the seven the user named, plus 2026 follow-ups
	discovered via Exa search restricted to `category=research paper, startPublishedDate=2026-01-01`)
	we verified:

	1. Primary source exists. arXiv abstract page reachable; HTML body parsed
	to extract the actual loss formula (not summarised from secondary sources).
	2. Code is real. Official repo's README was fetched, `last push` date and
	star count recorded. Forks of MiniLLM/DistiLLM that are no longer maintained
	were marked as such.
	3. License is permissive enough. MIT, Apache-2.0, BSD, CC BY 4.0 are
	acceptable for inclusion. GPL or research-only would be flagged.
	4. Composability check. Read the framework's existing
	`composer_replication/__init__.py` and `research/05-trace-replay-distillation.md`,
	then asked: does this loss replace something we have, or stack on top?

	---

	## Candidate 1 — SimPO (Simple Preference Optimization) ⭐ RECOMMENDED

	### Sources
	- arXiv: https://arxiv.org/abs/2405.14734 (Meng, Xia, Chen — UVA + Princeton, NeurIPS 2024)
	- GitHub: https://github.com/princeton-nlp/SimPO
	- License: MIT
	- 949 stars, 74 forks, last commit 2024-10-12 (mature, post-NeurIPS)
	- Built on top of `huggingface/alignment-handbook`
	- Maturity: production-ready. Released checkpoints for Mistral, Llama-3, Gemma-2 base/instruct. Reproducible training configs ship with the repo.

	### Loss core (reference-free preference)
	SimPO replaces the DPO log-ratio (which requires keeping `π_ref` in memory)
	with the average log-probability of the sequence under the policy, plus
	a target reward margin γ:

	```
	r(x, y) = (β / |y|) · log π_θ(y | x)     ← length-normalised implicit reward
	                                            (no reference model)

	L_SimPO(π_θ) = −E_{(x, y_w, y_l) ~ D} [
	    log σ( r(x, y_w) − r(x, y_l) − γ )
	]
	```

	where `β` is a temperature (typically 2.0–10) and `γ` is the desired margin
	between chosen and rejected (the repo recommends `γ/β ≈ 0.5` as a starting
	point). Two consequences: (i) no `π_ref` weights or forward pass per step,
	which removes the reference model's memory and compute cost (roughly half of
	the weight memory, less of the total once optimizer state and activations are
	counted), and (ii) the implicit reward is the same length-normalised
	log-likelihood that guides generation at decode time, which removes a known
	DPO pathology where the training-time reward and the decoding-time likelihood
	diverge.
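
The formula above can be sketched in a few lines. This is a minimal, dependency-free illustration for a single preference pair, not the vendored trainer: the function names mirror the ones this note uses elsewhere, but the signatures, the default `beta`/`gamma`, and the toy log-probabilities are assumptions made for the example. A real implementation would operate on batched tensors with a padding mask.

```python
import math


def avg_sequence_logprob(token_logprobs):
    """Mean per-token log-probability of one response: log π_θ(y | x) / |y|."""
    return sum(token_logprobs) / len(token_logprobs)


def simpo_loss(chosen_avg_logprob, rejected_avg_logprob, *, beta=2.0, gamma=1.0):
    """SimPO loss for one (chosen, rejected) pair; no reference model involved."""
    # Reward difference minus the target margin: r(x, y_w) - r(x, y_l) - γ.
    margin = beta * (chosen_avg_logprob - rejected_avg_logprob) - gamma
    # -log σ(margin) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Toy per-token log-probs for a chosen and a rejected response (made-up numbers).
chosen = avg_sequence_logprob([-0.2, -0.4, -0.3])
rejected = avg_sequence_logprob([-1.0, -1.2])
loss_good = simpo_loss(chosen, rejected)   # policy already prefers the chosen response
loss_bad = simpo_loss(rejected, chosen)    # preference violated, so the loss is larger
```

Because the reward is averaged over tokens, a longer response does not earn a higher reward just by accumulating more log-probability terms, which is the point of the `1/|y|` factor.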

	### Why it composes with the existing stack
	- The framework's channel 3 is multi-teacher trace-replay DPO. SimPO is a
	drop-in replacement for the DPO step inside that channel — same `(x, y_w, y_l)`
	data contract, different loss head. So the trace-replay harvester does not
	change at all.
	- It does not touch channel 2 (SDPO/OPSD `generalized_jsd_loss`). The two
	are complementary: JSD-distillation transfers token-level teacher knowledge,
	SimPO sharpens preference structure between trace alternatives.
	- It does not duplicate GRPO either. GRPO is on-policy RLVR;
	SimPO is offline preference optimisation. Different data sources.
	- The published Mistral-7B and Llama-3-8B SimPO results beat DPO by 4–6 points
	on AlpacaEval-2 LC, which directly translates to "if we already have channel-3
	pairs, SimPO is a free upgrade".

	### Implementation cost
	- ~80 LOC for the trainer hook; the loss itself is ~15 lines (log-probs,
	length-normalise, margin, BCE).
	- Dependencies: nothing new — `torch`, `transformers` already in repo.
	- The reference implementation in `princeton-nlp/SimPO` is small
	(`scripts/run_simpo.py` + `alignment/` trainer subclass) and MIT-licensed, so
	we can vendor it exactly as we did with OPSD.

	---

	## Candidate 2 — TAID (Temporally Adaptive Interpolated Distillation) ⭐ RECOMMENDED

	### Sources
	- arXiv: https://arxiv.org/abs/2501.16937 (Shing, Misaki, Bao, Yokoi, Akiba — Sakana AI, ICLR 2025)
	- GitHub: https://github.com/SakanaAI/TAID
	- License: Apache-2.0
	- 121 stars, last push 2025-10-06 (actively maintained)
	- Reference implementations of GKD, DistiLLM, Adaptive-KL, CTKD, DKD are also in `src/distil_losses/` for free
	- Released artefacts: `TAID-LLM-1.5B`, `TAID-VLM-2B` on HuggingFace (so the loss is verified at non-trivial scale).
	- Maturity: published; the repo's commits come from a single author, but two SoTA compact models were reproducibly trained with it.

	### Loss core (interpolated teacher target)
	Standard distillation losses (forward KL, reverse KL, JSD, including the
	`generalized_jsd_loss` we already have) target a fixed teacher distribution
	`p_T`. TAID replaces this fixed target with a **time-dependent interpolated
	target** `p_t` that starts close to the student and moves toward the teacher
	as training progresses:

	```
	p_t(y | x) = (1 − t) · q_θ_stop(y | x) + t · p_T(y | x)      (1)

	J_TAID(θ; t) = D_KL( p_t ‖ q_θ )                              (2)
	```

	`q_θ_stop` is the student's own current distribution with stop-gradient. The
	interpolation coefficient `t ∈ [t_start, 1]` is updated each step by an
	adaptive momentum schedule that grows `t` faster when training loss is
	falling and slower when it stalls; this is the "temporally adaptive" part.
	The Sakana paper proves (Theorem 4.1), for the regression analogue, that this
	schedule prevents the mode-collapse failure mode of pure self-distillation.

	Critically, `D_KL(p_t ‖ q_θ)` is just a divergence against a shifted target: you
	can equally well plug in JSD, reverse KL, or **the `generalized_jsd_loss` the
	framework already exports**. TAID is therefore a *wrapper around an existing
	divergence*, not a competing divergence.
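
A small sketch of equations (1) and (2) for a single token position, written in plain Python so it runs without dependencies. It follows the probability-space mixture exactly as written in this note; the reference implementation may differ in detail (for example in where the interpolation is applied), so treat this as an illustration rather than a port. Plain Python has no autograd, so the stop-gradient is only indicated by a comment, and the logits are made-up values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p ‖ q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def taid_target(student_logits, teacher_logits, t):
    """Interpolated target p_t = (1 - t) * q_stop + t * p_T, as in equation (1)."""
    q_stop = softmax(student_logits)     # would be .detach()-ed in an autograd setting
    p_teacher = softmax(teacher_logits)
    return [(1 - t) * q + t * p for q, p in zip(q_stop, p_teacher)]


student_logits = [2.0, 0.5, -1.0]
teacher_logits = [0.1, 1.5, 3.0]
q = softmax(student_logits)

loss_start = kl(taid_target(student_logits, teacher_logits, 0.0), q)  # target == student
loss_mid = kl(taid_target(student_logits, teacher_logits, 0.5), q)    # partway to teacher
loss_full = kl(taid_target(student_logits, teacher_logits, 1.0), q)   # plain forward KL
```

At `t = 0` the target coincides with the student and the loss vanishes; at `t = 1` the loss is the ordinary forward KL to the teacher. Swapping `kl` for another divergence is what "wrapper" means here.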

	### Why it composes with the existing stack
	- It wraps `composer_replication.opsd.generalized_jsd_loss` rather than
	replacing it. The change is "compute the JSD against `p_t` instead of
	`p_T`" — a few lines around the existing call site.
	- Addresses a documented weakness of OPSD-style self-distillation: when the
	teacher's privileged-context distribution is far from the student's
	capacity, the JSD signal can be noisy or push the student into mode
	averaging. TAID's annealed target gives the student a curriculum.
	- Empirical evidence from the Sakana paper's own comparison: TAID + JSD
	beats GKD + JSD, which in turn beats DistiLLM + skew-KL, on Phi-3 → TinyLlama
	distillation, at 0.7 h / epoch versus 9.8 h / epoch for GKD on identical
	hardware. The speed comes from not needing student-generated outputs (SGOs)
	at every step the way GKD does.
	- Composes additively with channel 1 (GRPO) and channel 3 (trace-replay DPO)
	because TAID lives strictly inside channel 2.

	### Implementation cost
	- ~150 LOC. The change is:
	1. A `TAIDState` object that holds `t`, the EMA of training loss, and the
	momentum coefficient β (default 0.99).
	2. A function `taid_target(student_logits, teacher_logits, t)` that returns
	`(1−t)·softmax(student_logits.detach()) + t·softmax(teacher_logits)`.
	3. A scheduler hook that updates `t` after each backward pass per
	Algorithm 1 of the paper.
	- Dependencies: nothing new.
	- Reference implementation in `SakanaAI/TAID/src/distil_losses/taid.py` is
	Apache-2.0 — vendor-friendly, same pattern as our OPSD lift.
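
The state object and scheduler hook in the list above could look roughly like the sketch below. The update rule is an assumption modelled on the prose description (grow `t` faster while the loss is falling, slower when it stalls); Algorithm 1 of the paper remains the authoritative definition. The field names, the defaults for `t` and `alpha`, and the choice to track a momentum of relative loss improvement are illustrative.

```python
import math
from dataclasses import dataclass
from typing import Optional


def _sigmoid(x):
    """Overflow-safe logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


@dataclass
class TAIDState:
    t: float = 0.4             # starting interpolation coefficient (illustrative)
    alpha: float = 5e-4        # base step size for t (illustrative)
    beta: float = 0.99         # momentum coefficient
    momentum: float = 0.0      # running average of relative loss improvement
    prev_loss: Optional[float] = None

    def update(self, loss: float) -> float:
        """Advance t after one optimisation step and return the new value."""
        if self.prev_loss is not None:
            # Positive when the loss fell since the previous step.
            improvement = (self.prev_loss - loss) / (abs(self.prev_loss) + 1e-12)
            self.momentum = self.beta * self.momentum + (1 - self.beta) * improvement
            # The sigmoid keeps the step positive, so t never decreases.
            step = self.alpha * _sigmoid(self.momentum) * (1.0 - self.t)
            self.t = min(1.0, self.t + step)
        self.prev_loss = loss
        return self.t


# Compare a run whose loss keeps falling with one whose loss is flat.
falling, flat = TAIDState(), TAIDState()
loss = 10.0
for _ in range(50):
    falling.update(loss)
    flat.update(10.0)
    loss *= 0.9
```

Under these assumptions `falling.t` ends slightly ahead of `flat.t`, and both stay within `[t_start, 1]`.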

	---

	## Candidate 3 — Entropy-Aware On-Policy Distillation (Entropy-Aware OPD) ⭐ RECOMMENDED

	### Sources
	- OpenReview (ICLR 2026 Spotlight): https://openreview.net/forum?id=WSRQ37tzk1
	- IBM Research page: https://research.ibm.com/publications/entropy-aware-on-policy-distillation-of-language-models
	- Authors: Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, Kimin Lee (KAIST + IBM Research)
	- Status: ICLR 2026 Spotlight, submission #113. License on the OpenReview record is CC BY 4.0.
	- Code: not yet released on GitHub at the time of audit (paper accepted 2026-03-03). IBM authors typically release within the conference window. Maturity flag: paper-ready, code-pending. This is the only candidate where we'd need to re-implement from the paper.

	### Loss core (entropy-gated forward/reverse KL mixture)
	The paper diagnoses a failure mode in the reverse-KL-on-policy distillation
	recipe used by MiniLLM, OPSD, and (implicitly) by our SDPO channel: when the
	teacher distribution has high entropy at a given token, reverse KL's
	mode-seeking gradient becomes noisy and collapses the student's diversity.
	Their fix: at each token `t`, gate between forward and reverse KL based on
	the teacher's entropy:

	```
	H_t = − Σ_v p_T(v | x, y_<t) · log p_T(v | x, y_<t)          (teacher entropy)

	α_t = sigmoid( (H_t − τ) / s ) ∈ (0, 1)

	L_EA(θ) = E_{y ~ q_θ} Σ_t [
	    (1 − α_t) · D_KL( q_θ(· | x, y_<t) ‖ p_T(· | x, y_<t) )   ← reverse KL
	  +      α_t  · D_KL( p_T(· | x, y_<t) ‖ q_θ(· | x, y_<t) )   ← forward KL
	]
	```

	`τ` is an entropy threshold (default ≈ 1.0 nat in their experiments) and `s`
	is a temperature controlling how sharp the gate is. When the teacher is
	confident (`H_t` small → `α_t ≈ 0`) the loss is pure reverse KL, identical to
	MiniLLM/OPSD behaviour. When the teacher is uncertain (`H_t` large → `α_t ≈ 1`)
	the loss switches to forward KL, which is mode-covering and preserves
	student diversity.
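
To make the gate concrete, here is a minimal per-token sketch of the formula as transcribed in this note, in plain Python. Since no reference code is public, the function name, the default `s = 0.1`, and the toy distributions are assumptions; only `τ ≈ 1.0` nat comes from the text above.

```python
import math


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl(p, q):
    """KL(p ‖ q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sigmoid(x):
    """Overflow-safe logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def entropy_gated_kl(student, teacher, *, tau=1.0, s=0.1):
    """Return (loss, alpha) for one token position."""
    alpha = sigmoid((entropy(teacher) - tau) / s)   # gate from teacher entropy
    reverse_kl = kl(student, teacher)               # mode-seeking term
    forward_kl = kl(teacher, student)               # mode-covering term
    return (1 - alpha) * reverse_kl + alpha * forward_kl, alpha


student = [0.7, 0.1, 0.1, 0.1]
confident_teacher = [0.97, 0.01, 0.01, 0.01]   # low entropy, gate stays near 0
uncertain_teacher = [0.25, 0.25, 0.25, 0.25]   # high entropy, gate moves toward 1

loss_conf, alpha_conf = entropy_gated_kl(student, confident_teacher)
loss_unc, alpha_unc = entropy_gated_kl(student, uncertain_teacher)
```

With these toy inputs the confident teacher yields an almost pure reverse-KL loss, while the uniform teacher pushes the loss close to the forward KL.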

	Reported gains over baseline reverse-KL OPD on Qwen3-0.6B/1.7B/4B: Pass@8 on
	six math benchmarks improves by +1.37 / +2.39 / +5.05 respectively. The
	larger gains at larger student size suggest the failure mode reverse KL
	exhibits gets worse with capacity, not better.

	### Why it composes with the existing stack
	- It is strictly token-wise: same trajectory, same teacher logits, same
	rollout pipeline as the existing channel 2. The only change is the loss
	reduction — instead of computing `generalized_jsd_loss` with a single fixed
	β, you compute a per-token mixture of forward and reverse KL with weight
	given by teacher entropy.
	- This is genuinely orthogonal to OPSD/SDPO. OPSD's contribution is
	privileged-context teacher distribution under student rollouts. EA-OPD's
	contribution is which divergence to use at each token of that distribution.
	Both can be true simultaneously.
	- Directly addresses a failure mode the framework's roadmap will hit:
	multi-teacher trace replay (channel 3) produces high-entropy aggregated
	teacher distributions at exactly the steps where teachers disagree. Those
	are the steps where reverse KL behaves worst. EA-OPD's entropy gate would
	automatically soften the loss on those exact tokens.
	- Composes with TAID (Candidate 2) too — they operate on different axes:
	TAID anneals the target distribution, EA-OPD chooses the *divergence
	direction*. Stacking is straightforward and proposed as ADR-007 follow-up.

	### Implementation cost
	- ~120 LOC estimate (no reference code to vendor yet).
	- Dependencies: nothing new. Token-level entropy is `−(p * log p).sum(-1)`,
	forward KL is the existing teacher-on-student term, reverse KL is the
	student-on-teacher term we already compute for the JSD in OPSD. The work is
	re-shaping the existing per-token loss to expose both directions.
	- Risk note: code not yet public. We should hold this candidate behind a
	feature flag until the IBM/KAIST team releases reference code (expected by
	ICLR 2026 in May). If the implementation ships sooner we should vendor and
	match line-for-line; if not, we re-derive from the paper formula and add a
	unit test that reproduces their toy entropy-vs-divergence plot.

	---

	## Honourable mention — KTO (Kahneman-Tversky Optimization)

	- arXiv: https://arxiv.org/abs/2402.01306
	- Code: integrated into HuggingFace `trl` library since v0.8 (Apache-2.0).
	- License/maturity: production. KTO is a standard `trl` trainer alongside DPO.

	### Loss core
	KTO replaces preference pairs with per-output binary desirability signals.
	For a desirable output `y_+` and undesirable output `y_−`:

	```
	r_θ(x, y) = log( π_θ(y|x) / π_ref(y|x) )                         (implicit reward)

	z_0 = E_{x', y' ~ π_θ}[ KL( π_θ(·|x') ‖ π_ref(·|x') ) ]           (reference point)

	L_KTO = E_{x, y_+} [ λ_D · (1 − σ( β · (r_θ(x, y_+) − z_0) )) ]   (desirable)
	      + E_{x, y_−} [ λ_U · (1 − σ( β · (z_0 − r_θ(x, y_−)) )) ]   (undesirable)
	```

	with default `λ_D = λ_U = 1`. The derivation is via prospect theory: this is
	a Kahneman-Tversky value function applied to the implicit reward. The paper
	reports that KTO matches DPO at 1B–30B even though it learns only from
	unpaired binary signals (`n` preference pairs split into `2n` separate
	desirable/undesirable examples) where DPO sees the `n` pairs.
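
A per-example sketch of the two branches, in plain Python. To stay independent of where `β` is applied, the function takes an implicit reward and a reference point that are assumed to be on the same scale already; the function name, argument names, and toy numbers are hypothetical. In practice `trl`'s built-in KTO trainer would be used rather than hand-written code.

```python
import math


def sigmoid(x):
    """Overflow-safe logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def kto_example_loss(reward, reference_point, desirable, *, lambda_d=1.0, lambda_u=1.0):
    """Loss for one output carrying a binary desirability label."""
    if desirable:
        # Desirable outputs are rewarded for sitting above the reference point.
        return lambda_d * (1.0 - sigmoid(reward - reference_point))
    # Undesirable outputs are rewarded for sitting below it.
    return lambda_u * (1.0 - sigmoid(reference_point - reward))


z0 = 0.2  # made-up reference point
batch = [(1.5, True), (-0.5, True), (2.0, False), (-1.0, False)]  # (reward, desirable)
losses = [kto_example_loss(r, z0, d) for r, d in batch]
mean_loss = sum(losses) / len(losses)
```

No pairing is needed: each example contributes on its own, which is why this loss only becomes relevant if trace replay yields per-step binary signals.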

	### Why we down-rank it relative to the top-3
	KTO is the right answer only if the framework wants to ingest single-side
	trace signals (e.g., "this trace step succeeded" / "this step crashed the
	agent") without constructing pairs. The current
	`research/05-trace-replay-distillation.md` design does construct pairs
	from multi-teacher replay (that is the whole point of the multi-teacher
	variance signal), so the marginal value of KTO is small *for channel 3 as
	specified*. If the trace-replay design pivots toward absolute scores per
	step rather than relative pairs, KTO becomes the right loss and is already
	free from `trl`. Add to the backlog as conditional.

	---

	## Audited but NOT recommended

	### GKD — Generalized Knowledge Distillation (Agarwal et al., 2023)
	- arXiv: https://arxiv.org/abs/2306.13649 (Google DeepMind)
	- Loss core: student samples its own outputs, teacher provides token
	probabilities, divergence is generalized JSD with parameter β:
	```
	D_JSD(β)(P‖Q) = β·KL(P ‖ βP+(1−β)Q) + (1−β)·KL(Q ‖ βP+(1−β)Q)
	```
	- Why excluded: this is exactly the formula we already have as
	`composer_replication.opsd.generalized_jsd_loss` (lifted from
	`siyan-zhao/OPSD`). GKD's contribution beyond the loss formula is the
	on-policy student sampling protocol — which OPSD also does. No incremental
	value to add.
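
For reference, the generalized JSD formula above in a few lines of plain Python. This is a standalone illustration, not the framework's `generalized_jsd_loss`: the function name `generalized_jsd` and the toy distributions are assumptions, and the real loss operates on batched logits.

```python
import math


def kl(p, q):
    """KL(p ‖ q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def generalized_jsd(p, q, beta=0.5):
    """β·KL(P ‖ M) + (1−β)·KL(Q ‖ M) with mixture M = βP + (1−β)Q."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    mixture = [beta * pi + (1 - beta) * qi for pi, qi in zip(p, q)]
    return beta * kl(p, mixture) + (1 - beta) * kl(q, mixture)


teacher = [0.6, 0.3, 0.1]
student = [0.2, 0.5, 0.3]

jsd_symmetric = generalized_jsd(teacher, student, beta=0.5)   # classic JSD
jsd_skewed = generalized_jsd(teacher, student, beta=0.9)      # mixture leans toward P
```

At `β = 0.5` the value is symmetric in its arguments and bounded by `ln 2`; other values of `β` shift the mixture toward one side.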

	### DistiLLM (Ko et al., ICML 2024)
	- arXiv: https://arxiv.org/abs/2402.03898
	- GitHub: https://github.com/jongwooko/distillm (no LICENSE file at audit time; last push 2025-03-13)
	- Loss core: Skew KL divergence `KL(p ‖ λp + (1−λ)q)` plus an *adaptive
	off-policy* student-generated-output (SGO) scheduler.
	- Why excluded: the skew KL is one of the two terms of the generalized JSD
	given under GKD (`KL(P ‖ βP+(1−β)Q)`, with `λ` in the role of `β`), so it
	belongs to the same family the framework already has. The interesting
	contribution, the SGO scheduler, is a process
	optimisation, not a loss. The TAID paper's own ablation (Table 6) shows
	TAID > Skew KL across student sizes, so TAID dominates this candidate.

	### MiniLLM (Gu et al., ICLR 2024)
	- arXiv: https://arxiv.org/abs/2306.08543
	- GitHub: https://github.com/microsoft/LMOps/tree/main/minillm — MIT, repo
	active (last push 2026-04)
	- Loss core: reverse KL minimised by policy-gradient on student rollouts,
	with three optimisation tricks: single-step decomposition (variance
	reduction), teacher-mixed sampling (anti-reward-hacking), length
	normalisation.
	- Why excluded: reverse-KL on-policy distillation **is the same recipe
	family as the SDPO/OPSD channel** the framework already implements. Adding
	MiniLLM would be a parallel implementation of the same idea, not an addition.
	Entropy-Aware OPD (Candidate 3) keeps MiniLLM-style reverse KL where the
	teacher is confident and targets exactly the failure mode of pure reverse KL
	(diversity collapse in high-entropy regions).

	### Self-Rewarding Language Models (Yuan et al., 2024)
	- arXiv: https://arxiv.org/abs/2401.10020 (Meta + NYU)
	- Why excluded: SRLM is a training procedure (iterative DPO with the
	model judging its own outputs), not a loss. The actual loss is plain DPO,
	which the framework already supports. The procedural contribution belongs
	in a future ADR on data generation, not in the distillation channel.

	### Existence check: TAID (arXiv 2501.16937)
	The user asked us to verify that this paper exists. It does: submitted
	2025-01-28, ICLR 2025, code at https://github.com/SakanaAI/TAID with two released
	checkpoints (`TAID-LLM-1.5B`, `TAID-VLM-2B`). Confirmed primary source.

	---

	## 2026 papers found

	The targeted Exa search (`category=research paper`, `startPublishedDate=2026-01-01`)
	surfaced five 2026 distillation papers worth listing for completeness:

	1. Entropy-Aware On-Policy Distillation — ICLR 2026 Spotlight. ⭐ Promoted to top-3 above.
	2. KL for a KL: On-Policy Distillation with Control Variate Baseline (arXiv 2605.07865, Oh et al., 2026-05). Variance-reduction trick for on-policy KL distillation. Useful future read but not a new loss — it's a baseline subtraction added to MiniLLM-style policy gradient.
	3. Rethinking On-Policy Distillation: Phenomenology, Mechanism, and Recipe (https://github.com/thunlp/OPD, Tsinghua NLP, last push 2026-04). Empirical study, not a new loss formulation.
	4. Hybrid Policy Distillation for LLMs (ICML 2026 poster, Zhu et al.). Combines off-policy and on-policy distillation; positioned as a recipe rather than a new loss; abstract suggests strong overlap with TAID's annealing argument.
	5. Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation (ICML 2026 poster, Dasgupta et al.). Targets the long-tail of teacher distributions. Interesting but currently only an abstract; deferred until the camera-ready PDF is available.

	Apart from Entropy-Aware OPD, which is itself still code-pending, none of
	these is mature enough (released code + license + reproducible scale) to
	recommend adding right now.

	---

	## Recommended follow-up wiring

	For ADR-007 the proposed addition is a `composer_replication.distillation`
	sub-package with three pluggable hooks:

	> Realised in v0.1 (Wave 17 update): ADR-007 shipped a flatter
	> layout than the proposal below. Actual exports:
	>
	> ```
	> composer_replication/
	>   distillation/
	>     __init__.py
	>     simpo.py              # simpo_loss(chosen_avg_logprobs, rejected_avg_logprobs, *, beta, gamma)
	>                           # avg_sequence_logprob(logprobs, mask) -- helper
	>     taid.py               # taid_loss(student_logits, teacher_logits, t, *, ...)
	>                           # TAIDScheduler -- adaptive momentum schedule per the paper
	>     entropy_aware_opd.py  # entropy_aware_opd_loss(student_logits, teacher_logits, *, h_max, ...)
	> ```
	>
	> No `targets.py`/`losses.py` split, no top-level `preference/` package,
	> and SimPO lives under `distillation/` rather than `preference/` because
	> the three losses share a common dispatch surface (`compose_loss`'s
	> `dpo_variant` and `sdpo_wrapper` switches).
	>
	> The composition rule realised in `compose_loss` is per-loss flag-driven,
	> not a single composed-function call:
	>
	> ```python
	> compose_loss(
	>     model, inputs,
	>     dpo_variant="simpo",              # OR "dpo" (default)
	>     sdpo_wrapper="taid",              # OR "entropy_opd" OR "none" (default)
	>     taid_t=0.5,                       # required when sdpo_wrapper="taid"
	>     simpo_beta=2.0, simpo_gamma=1.0,  # used only when dpo_variant="simpo"
	>     entropy_opd_h_max=...,            # used only when sdpo_wrapper="entropy_opd"
	> )
	> ```
	>
	> The pre-ADR proposal sketch below is preserved as historical context.
	> The shipped function names are `simpo_loss`, `taid_loss` +
	> `TAIDScheduler`, and `entropy_aware_opd_loss` (not `taid_target` /
	> `entropy_aware_kl_loss`).

	```
	composer_replication/
	  distillation/
	    __init__.py
	    targets.py    # taid_target(...), fixed_target(...)    ← Candidate 2
	    losses.py     # reuses opsd.generalized_jsd_loss
	                  # adds entropy_aware_kl_loss(...)         ← Candidate 3
	  preference/
	    simpo.py      # simpo_loss(...)                         ← Candidate 1
	    dpo.py        # existing trace-replay path
	```

	The composition rule for the total loss becomes:

	```
	L_total = λ_grpo    · L_GRPO        (channel 1, unchanged)
	        + λ_distill · L_distill     (channel 2, see below)
	        + λ_pref    · L_pref        (channel 3, choose DPO or SimPO)

	L_distill = entropy_aware_kl_loss(
	    target               = taid_target(student, teacher, t),
	    student              = student,
	    teacher_entropy_gate = α_t
	)
	```

	This keeps the existing `generalized_jsd_loss` reachable as a fallback
	(set `α_t ≡ 0` and `t ≡ 1` and you recover SDPO/OPSD exactly).
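
The fallback claim can be checked numerically with a small sketch of the proposed composition, again in plain Python for a single token position. The function names, the scalar `alpha` argument, and the `lam_*` weights are hypothetical stand-ins for the pseudocode above. The sketch shows that `t = 1` and `α = 0` reduce the distillation term to a plain reverse KL against the fixed teacher; it does not exercise the framework's actual `generalized_jsd_loss`.

```python
import math


def kl(p, q):
    """KL(p ‖ q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distill_loss(student, teacher, *, t, alpha):
    """Gated forward/reverse KL against an interpolated target (one token)."""
    # Interpolated target; the student copy would be detached under autograd.
    target = [(1 - t) * q + t * p for q, p in zip(student, teacher)]
    reverse_kl = kl(student, target)
    forward_kl = kl(target, student)
    return (1 - alpha) * reverse_kl + alpha * forward_kl


def total_loss(l_grpo, l_distill, l_pref, *, lam_grpo=1.0, lam_distill=1.0, lam_pref=1.0):
    """Weighted sum of the three channel losses."""
    return lam_grpo * l_grpo + lam_distill * l_distill + lam_pref * l_pref


student = [0.7, 0.2, 0.1]
teacher = [0.3, 0.4, 0.3]

fallback = distill_loss(student, teacher, t=1.0, alpha=0.0)   # fixed teacher, no gate
no_signal = distill_loss(student, teacher, t=0.0, alpha=0.3)  # target equals student
combined = total_loss(1.0, fallback, 2.0, lam_pref=0.5)
```

Here `fallback` equals `kl(student, teacher)`, and `no_signal` is zero because the target collapses onto the student at `t = 0`.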

	---

	## Sources index

	| Paper | arXiv | GitHub | License | Last push | Maturity |
	|-------|-------|--------|---------|-----------|----------|
	| SimPO | https://arxiv.org/abs/2405.14734 | https://github.com/princeton-nlp/SimPO | MIT | 2024-10-12 | Production |
	| TAID | https://arxiv.org/abs/2501.16937 | https://github.com/SakanaAI/TAID | Apache-2.0 | 2025-10-06 | Production |
	| Entropy-Aware OPD | n/a (OpenReview WSRQ37tzk1) | code-pending | CC BY 4.0 (paper) | n/a | Paper-only |
	| KTO | https://arxiv.org/abs/2402.01306 | huggingface/trl (built-in) | Apache-2.0 | continuous | Production |
	| GKD | https://arxiv.org/abs/2306.13649 | (no official repo from authors; reproduced inside SakanaAI/TAID and jongwooko/distillm) | n/a | n/a | Reference only |
	| DistiLLM | https://arxiv.org/abs/2402.03898 | https://github.com/jongwooko/distillm | (no LICENSE file at audit time) | 2025-03-13 | Research |
	| MiniLLM | https://arxiv.org/abs/2306.08543 | https://github.com/microsoft/LMOps/tree/main/minillm | MIT | 2026-04-08 | Production |
	| Self-Rewarding LM | https://arxiv.org/abs/2401.10020 | (no canonical repo; integrated into many forks) | n/a | n/a | Procedure, not a loss |

	---

	## Notes for ADR-007 author

	1. SimPO and TAID can land independently and without coordination. They
	touch different files and do not compete.
	2. Entropy-Aware OPD should land last. Wait for the IBM/KAIST authors'
	code release; if it's not out by the time we want to ship the change, the
	formula is simple enough to re-derive but we should pin a unit test that
	reproduces the paper's Figure 3 entropy-vs-divergence behaviour.
	3. Do not also pull in GKD/DistiLLM/MiniLLM. Their loss contributions are
	strict subsets of what (TAID + Entropy-Aware OPD + existing
	`generalized_jsd_loss`) covers.
	4. KTO should be added as a backlog item with a "trigger" condition:
	when the trace-replay reward design moves from preference pairs to per-step
	binary signals, switch on the `trl.KTOTrainer` path.

	---

	Absolute path of this report: `/mnt/e/CS/HF/composer-replication-framework/docs/research/SELF_DISTILLATION_LANDSCAPE.md`