	---
	status: accepted
	date: 2026-05-30
	deciders: [Codeseys, ARIA]
	builds-on: [ADR-006 (RL frameworks), ADR-007 (distillation menu), ADR-008 (Dr.GRPO base)]
	---

	# ADR-014: Policy-optimization objective MENU — make RL's base objective selectable

	## Context and Problem Statement

	The framework replicates Composer 2.5's recipe as **base RL objective + distillation
	channel + preference channel**. ADR-007 already gave the **distillation/preference**
	axis a menu (SimPO / TAID / Entropy-Aware-OPD via `compose_loss(dpo_variant=,
	sdpo_wrapper=)`). But the **base policy-optimization objective was hardcoded to
	Dr.GRPO** via `make_dr_grpo_config` — `loss_type="dr_grpo"`, `scale_rewards="none"`,
	`num_iterations=1`, no other options.

	User ask (2026-05-30): *"look at other policy optimization papers like SDPO and
	Dr.GRPO or OPSD ... allow for RL to have multiple options ... [SDPO / what Composer
	2.5 uses] is one of the best instances of setting up post-training to take models to
	massive performance gains."*

	So: give the base RL objective a real menu, grounded in the current PO frontier and
	in what Composer 2.5 actually does.

	## Research (primary-sourced, 2026-05-30)

	Three parallel cross-family research passes (reports in `docs/research/`):

	1. Composer 2.5's actual recipe (cursor.com/blog/composer-2-5 + Composer 2 report
	arXiv:2603.24477): base objective is Dr.GRPO (no length-std normalization,
	single-epoch, k1-discussed/k3-in-TRL KL, async RL with MoE router replay). Its
	headline 2.5 technique, "targeted RL with textual feedback", is **on-policy
	self-distillation** (= our SDPO channel ✓). **No DPO / preference pairs / multiple
	teachers appear in any Composer source**: our trace-replay-DPO channel is the
	framework's own addition, NOT Composer's. Recorded honestly.

	2. GRPO-family PO landscape: vanilla GRPO (2402.03300), Dr.GRPO (2503.20783),
	DAPO (2503.14476, decoupled clip-higher + dynamic sampling + KL-off), GSPO
	(2507.18071, sequence-level importance ratio — Qwen3, MoE-stable), CISPO
	(MiniMax-M1 2506.13585, detached-IS REINFORCE so every token keeps a gradient),
	GMPO (geometric-mean aggregation). Most are surgical edits to advantage-norm /
	length-norm / clip / ratio-granularity / KL.

	3. TRL 1.5.0 capability map — then **verified by introspecting the installed
	package** (not a GitHub snapshot): the installed trl 1.5.0 `GRPOTrainer` already
	implements `loss_type ∈ {grpo, dr_grpo, bnpo, dapo, cispo, luspo, sapo, vespo}`
	and exposes `epsilon`/`epsilon_high` (decoupled clip), `delta`, `beta`,
	`scale_rewards ∈ {group,batch,none}`, `importance_sampling_level ∈ {token,
	sequence}` (= GSPO), `mask_truncated_completions`, `num_iterations`. **Every
	objective we want is therefore PURE CONFIG — no custom `_compute_loss` needed.**
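
The introspection step in item 3 can be sketched roughly as below. This is a minimal illustration, not the script that was actually run: `REQUIRED_KNOBS`, `missing_knobs`, and `check_installed_trl` are hypothetical names, and the sketch assumes `trl.GRPOConfig` is a dataclass in the installed build, so that `dataclasses.fields` can list its fields.

```python
import dataclasses
import importlib

# Knob names taken from the capability list above; hypothetical constant name.
REQUIRED_KNOBS = (
    "loss_type", "epsilon", "epsilon_high", "delta", "beta",
    "scale_rewards", "importance_sampling_level",
    "mask_truncated_completions", "num_iterations",
)


def missing_knobs(config_cls, required=REQUIRED_KNOBS):
    """Return the required field names that config_cls does not declare."""
    declared = {f.name for f in dataclasses.fields(config_cls)}
    return [name for name in required if name not in declared]


def check_installed_trl():
    """Check the installed trl package; return None if trl is not importable."""
    try:
        trl = importlib.import_module("trl")
    except ImportError:
        return None
    # Assumption: GRPOConfig is a dataclass in the installed version.
    return missing_knobs(trl.GRPOConfig)
```

An empty list from `check_installed_trl()` would mean every listed knob is present in the installed package; a non-empty list names the fields that a preset could not rely on.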

	## Decision

	**Add `make_po_config(objective=..., **overrides)`** to `trainer/composer_trainer.py`
	— a named-preset factory over trl 1.5.0's verified GRPOConfig knob-space. Keep
	`make_dr_grpo_config` intact (back-compat); `dr_grpo` is now one preset among the menu.

	Menu (`PO_OBJECTIVES`):

	| objective | loss_type | key knobs | what it gives |
	|---|---|---|---|
	| `grpo` | grpo | scale_rewards=group, IS=token | vanilla GRPO (std-norm advantage) |
	| `dr_grpo` | dr_grpo | scale_rewards=none, IS=token | default; no length-std bias (Composer 2.5 base) |
	| `bnpo` | bnpo | scale_rewards=batch | batch-normalized variant |
	| `dapo` | dapo | epsilon_high=0.28, mask_truncated, beta=0 | decoupled clip-higher, anti-entropy-collapse |
	| `gspo` | grpo | IS=sequence | sequence-level ratio; long-CoT / MoE stable (Qwen3) |
	| `cispo` | cispo | epsilon_high=5.0 | detached-IS REINFORCE; every token keeps gradient |
	Each preset carries literature-recommended settings; any field is overridable via
	`**overrides`. **Drift guards** assert the requested `loss_type` /
	`importance_sampling_level` / `epsilon_high` actually applied, so a preset can never
	silently degrade (e.g. GSPO with `importance_sampling_level` overridden back to token
	raises rather than quietly becoming GRPO).
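
A minimal sketch of the preset-plus-guard idea follows. It is not the code in `trainer/composer_trainer.py`: `make_po_kwargs` and `GUARDED` are hypothetical names, the function returns a plain dict rather than a `GRPOConfig`, and treating every guarded knob a preset sets as non-overridable is an assumption of this sketch (the ADR only spells out the GSPO case).

```python
# Preset values copied from the menu table above.
PO_OBJECTIVES = {
    "grpo": {"loss_type": "grpo", "scale_rewards": "group",
             "importance_sampling_level": "token"},
    "dr_grpo": {"loss_type": "dr_grpo", "scale_rewards": "none",
                "importance_sampling_level": "token"},
    "bnpo": {"loss_type": "bnpo", "scale_rewards": "batch"},
    "dapo": {"loss_type": "dapo", "epsilon_high": 0.28,
             "mask_truncated_completions": True, "beta": 0.0},
    "gspo": {"loss_type": "grpo", "importance_sampling_level": "sequence"},
    "cispo": {"loss_type": "cispo", "epsilon_high": 5.0},
}

# Knobs that define a preset; hypothetical constant name.
GUARDED = ("loss_type", "importance_sampling_level", "epsilon_high")


def make_po_kwargs(objective="dr_grpo", **overrides):
    """Merge a named preset with overrides, refusing to drift off the preset."""
    if objective not in PO_OBJECTIVES:
        raise ValueError(
            f"unknown objective {objective!r}; choose from {sorted(PO_OBJECTIVES)}"
        )
    preset = PO_OBJECTIVES[objective]
    kwargs = {**preset, **overrides}
    for name in GUARDED:
        # Drift guard: a defining knob must still hold the preset's value.
        if name in preset and kwargs[name] != preset[name]:
            raise ValueError(
                f"{objective!r} requires {name}={preset[name]!r}, got {kwargs[name]!r}"
            )
    return kwargs
```

Under these assumptions, `make_po_kwargs("dapo", num_iterations=2)` merges cleanly, while `make_po_kwargs("gspo", importance_sampling_level="token")` raises instead of quietly turning into GRPO.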

	### Consequences

	- Positive: RL now has 6 selectable base objectives at zero custom-loss cost; the
	ladder/runners gain an `objective=` knob orthogonal to the existing SDPO/replay
	channels and the SimPO/TAID/EA-OPD distillation menu. A user can run e.g.
	`objective="dapo"` (clip-higher) or `objective="gspo"` (MoE-stable) instead of only
	Dr.GRPO.
	- Positive: faithful to Composer 2.5 (dr_grpo default + self-distillation) while
	exposing the stronger 2025-26 objectives the report predates.
	- Neutral: `gspo` is `grpo` loss + `importance_sampling_level="sequence"` (trl has
	no literal "gspo"); documented in the preset + guarded.
	- Negative / honest: presets reflect current trl 1.5.0 field names; a trl upgrade
	could rename knobs — the drift guards turn that into a loud failure, not silent
	mis-training. `sapo`/`luspo`/`vespo` exist in this trl build but are NOT in the menu
	yet (less-established; add later if validated).

	## Acceptance gate

	- [x] `make_po_config(objective)` with presets grpo/dr_grpo/bnpo/dapo/gspo/cispo,
	built over trl-1.5.0-verified knobs (introspected, not assumed).
	- [x] `make_dr_grpo_config` preserved; `dr_grpo` preset is equivalent; default objective
	is `dr_grpo`.
	- [x] Unit tests (10) green against real trl 1.5.0: each preset's defining knob asserted,
	unknown-objective raises with the menu, GSPO token-override guard fires, overrides apply.
	- [x] Drift guards on loss_type / IS-level / epsilon_high.
	- [ ] Wire `objective=` through the LMA ladder runners (follow-up; the A1 run used
	dr_grpo — re-runnable with `objective="dapo"` etc. once threaded).

	## More Information

	- `docs/research/SELF_DISTILLATION_LANDSCAPE.md` — the distillation/preference menu (ADR-007).
	- Composer 2.5 recipe report + GRPO-family survey + trl-1.5.0 capability map: research
	pass 2026-05-30 (parallel cross-family).
	- Installed-trl introspection confirming the knob surface: 2026-05-30.