status: accepted
date: 2026-05-30T00:00:00.000Z
deciders:
  - Codeseys
  - ARIA
builds-on:
  - ADR-006 (RL frameworks)
  - ADR-007 (distillation menu)
  - ADR-008 (Dr.GRPO base)

ADR-014: Policy-optimization objective MENU — make RL's base objective selectable

Context and Problem Statement

The framework replicates Composer 2.5's recipe as base RL objective + distillation channel + preference channel. ADR-007 already gave the distillation/preference axis a menu (SimPO / TAID / Entropy-Aware-OPD via compose_loss(dpo_variant=, sdpo_wrapper=)). But the base policy-optimization objective was hardcoded to Dr.GRPO via make_dr_grpo_config — loss_type="dr_grpo", scale_rewards="none", num_iterations=1, no other options.

User ask (2026-05-30): "look at other policy optimization papers like SDPO and Dr.GRPO or OPSD ... allow for RL to have multiple options ... [SDPO / what Composer 2.5 uses] is one of the best instances of setting up post-training to take models to massive performance gains."

So: give the base RL objective a real menu, grounded in the current PO frontier and in what Composer 2.5 actually does.

Research (primary-sourced, 2026-05-30)

Three parallel cross-family research passes (reports in docs/research/):

Composer 2.5's actual recipe (cursor.com/blog/composer-2-5 + Composer 2 report arXiv:2603.24477): the base objective is Dr.GRPO (no length or std normalization, single epoch, KL with the k1 estimator discussed in the report and k3 used in TRL, async RL with MoE router replay). Its headline 2.5 technique, "targeted RL with textual feedback", is on-policy self-distillation (= our SDPO channel ✓). No DPO, preference pairs, or multiple teachers appear in any Composer source; our trace-replay-DPO channel is the framework's own addition, NOT Composer's, and is recorded here as such.
GRPO-family PO landscape: vanilla GRPO (2402.03300), Dr.GRPO (2503.20783), DAPO (2503.14476, decoupled clip-higher + dynamic sampling + KL-off), GSPO (2507.18071, sequence-level importance ratio — Qwen3, MoE-stable), CISPO (MiniMax-M1 2506.13585, detached-IS REINFORCE so every token keeps a gradient), GMPO (geometric-mean aggregation). Most are surgical edits to advantage-norm / length-norm / clip / ratio-granularity / KL.
TRL 1.5.0 capability map — then verified by introspecting the installed package (not a GitHub snapshot): the installed trl 1.5.0 GRPOTrainer already implements loss_type ∈ {grpo, dr_grpo, bnpo, dapo, cispo, luspo, sapo, vespo} and exposes epsilon/epsilon_high (decoupled clip), delta, beta, scale_rewards ∈ {group,batch,none}, importance_sampling_level ∈ {token, sequence} (= GSPO), mask_truncated_completions, num_iterations. Every objective we want is therefore PURE CONFIG — no custom _compute_loss needed.
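The advantage-normalization differences noted in the landscape pass, and the scale_rewards ∈ {group, batch, none} knob from the capability map, can be illustrated with a small pure-Python sketch. This is not trl's implementation: the function name `group_advantages`, the use of the population standard deviation, and the epsilon value are assumptions made for illustration, and the library's exact estimator may differ.

```python
from statistics import mean, pstdev


def group_advantages(rewards, scale_rewards="none", eps=1e-4):
    """Centre rewards per prompt group, then optionally rescale them.

    rewards: list of groups, one inner list of scalar rewards per prompt.
    scale_rewards: "group" (vanilla GRPO), "batch", or "none" (Dr.GRPO).
    """
    if scale_rewards not in {"group", "batch", "none"}:
        raise ValueError(f"unknown scale_rewards: {scale_rewards!r}")

    # Every variant subtracts the per-group mean reward.
    centred = [[r - mean(g) for r in g] for g in rewards]

    if scale_rewards == "none":
        return centred
    if scale_rewards == "group":
        # Each group is divided by its own standard deviation.
        return [[a / (pstdev(g) + eps) for a in c]
                for g, c in zip(rewards, centred)]
    # "batch": one shared scale computed over every reward in the batch.
    batch_std = pstdev([r for g in rewards for r in g])
    return [[a / (batch_std + eps) for a in c] for c in centred]


rewards = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 0.0]]
dr = group_advantages(rewards, "none")
grpo = group_advantages(rewards, "group")
batch = group_advantages(rewards, "batch")
```

With "group", a prompt whose rewards barely vary has its advantages amplified by the small per-group standard deviation; "none" leaves the centred rewards untouched, which is the Dr.GRPO choice.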

Decision

Add make_po_config(objective=..., **overrides) to trainer/composer_trainer.py — a named-preset factory over trl 1.5.0's verified GRPOConfig knob-space. Keep make_dr_grpo_config intact (back-compat); dr_grpo is now one preset among the menu.

Menu (PO_OBJECTIVES):

| objective | loss_type | key knobs | what it gives |
|---|---|---|---|
| `grpo` | grpo | scale_rewards=group, IS=token | vanilla GRPO (std-norm advantage) |
| `dr_grpo` | dr_grpo | scale_rewards=none, IS=token | default; no length-std bias (Composer 2.5 base) |
| `bnpo` | bnpo | scale_rewards=batch | batch-normalized variant |
| `dapo` | dapo | epsilon_high=0.28, mask_truncated, beta=0 | decoupled clip-higher, anti-entropy-collapse |
| `gspo` | grpo | IS=sequence | sequence-level ratio; long-CoT / MoE stable (Qwen3) |
| `cispo` | cispo | epsilon_high=5.0 | detached-IS REINFORCE; every token keeps gradient |
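The "decoupled clip-higher" entry for dapo can be read as a PPO-style clipped surrogate whose lower and upper bounds are set independently. The sketch below is illustrative only: `clipped_surrogate` is a hypothetical helper, not a trl function, and the lower bound of 0.2 is an assumption (the menu only fixes epsilon_high=0.28).

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Per-token clipped objective with separate lower and upper clip bounds."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic minimum of the unclipped and clipped terms, as in PPO.
    return min(ratio * advantage, clipped_ratio * advantage)


# A token whose probability rose by 25% under a positive advantage.
symmetric = clipped_surrogate(1.25, 1.0, eps_low=0.2, eps_high=0.2)
decoupled = clipped_surrogate(1.25, 1.0, eps_low=0.2, eps_high=0.28)
```

Under the symmetric bound the ratio is capped at 1.2, so the token stops contributing gradient; with the higher upper bound the same token is still inside the trust region, which is how clip-higher leaves more room for low-probability tokens to grow.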

Each preset carries literature-recommended settings; any field is overridable via **overrides. Drift guards assert the requested loss_type / importance_sampling_level / epsilon_high actually applied, so a preset can never silently degrade (e.g. GSPO with importance_sampling_level overridden back to token raises rather than quietly becoming GRPO).
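A minimal sketch of how a preset factory with drift guards might look, under stated assumptions. `StubGRPOConfig` is a hypothetical stand-in for trl's GRPOConfig so the example runs without trl installed, its defaults are placeholders, and the split between "defining" knobs and "guarded" knobs is one possible reading of the guard behaviour described above, not the code in trainer/composer_trainer.py.

```python
from dataclasses import dataclass


@dataclass
class StubGRPOConfig:
    """Hypothetical stand-in for trl's GRPOConfig; defaults are placeholders."""
    loss_type: str = "grpo"
    scale_rewards: str = "group"
    importance_sampling_level: str = "token"
    epsilon_high: float = 0.2
    beta: float = 0.0
    mask_truncated_completions: bool = False
    num_iterations: int = 1


# Preset values taken from the menu table.
PO_OBJECTIVES = {
    "grpo": {"loss_type": "grpo", "scale_rewards": "group",
             "importance_sampling_level": "token"},
    "dr_grpo": {"loss_type": "dr_grpo", "scale_rewards": "none",
                "importance_sampling_level": "token"},
    "bnpo": {"loss_type": "bnpo", "scale_rewards": "batch"},
    "dapo": {"loss_type": "dapo", "epsilon_high": 0.28,
             "mask_truncated_completions": True, "beta": 0.0},
    "gspo": {"loss_type": "grpo", "importance_sampling_level": "sequence"},
    "cispo": {"loss_type": "cispo", "epsilon_high": 5.0},
}

# Assumption: overriding these would turn a preset into a different objective.
DEFINING_KNOBS = ("loss_type", "importance_sampling_level")
# Assumption: these are re-read from the built config to catch renamed fields.
GUARDED_KNOBS = ("loss_type", "importance_sampling_level", "epsilon_high")


def make_po_config(objective="dr_grpo", **overrides):
    if objective not in PO_OBJECTIVES:
        raise ValueError(
            f"unknown objective {objective!r}; menu: {sorted(PO_OBJECTIVES)}")
    preset = PO_OBJECTIVES[objective]

    # Guard 1: refuse overrides that contradict a preset's defining knob.
    for knob in DEFINING_KNOBS:
        if knob in preset and overrides.get(knob, preset[knob]) != preset[knob]:
            raise ValueError(
                f"{objective!r} requires {knob}={preset[knob]!r}, "
                f"got {overrides[knob]!r}")

    requested = {**preset, **overrides}
    cfg = StubGRPOConfig(**requested)

    # Guard 2: confirm the built config reports what was requested.
    for knob in GUARDED_KNOBS:
        if knob in requested and getattr(cfg, knob, None) != requested[knob]:
            raise RuntimeError(f"{knob} did not apply for {objective!r}")
    return cfg


cfg = make_po_config("dapo", num_iterations=2)
```

Passing an unknown keyword to the dataclass constructor already raises TypeError, so a renamed knob fails at construction; the second guard covers the case where a field is accepted but its value is changed.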

Consequences

Positive: RL now has 6 selectable base objectives at zero custom-loss cost; the ladder/runners gain an objective= knob orthogonal to the existing SDPO/replay channels and the SimPO/TAID/EA-OPD distillation menu. A user can run e.g. objective="dapo" (clip-higher) or objective="gspo" (MoE-stable) instead of only Dr.GRPO.
Positive: faithful to Composer 2.5 (dr_grpo default + self-distillation) while also exposing other 2025-26 objectives that the Composer recipe does not use.
Neutral: gspo is grpo loss + importance_sampling_level="sequence" (trl has no literal "gspo"); documented in the preset + guarded.
Negative / honest: presets reflect current trl 1.5.0 field names; a trl upgrade could rename knobs — the drift guards turn that into a loud failure, not silent mis-training. sapo/luspo/vespo exist in this trl build but are NOT in the menu yet (less-established; add later if validated).
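The Neutral point above, that gspo is the grpo loss with a sequence-level importance ratio, can be made concrete with a short sketch of the two ratio granularities. It follows the length-normalized sequence ratio described in the GSPO paper; the function names are hypothetical and this does not show how trl computes the ratio internally.

```python
import math


def token_ratios(logp_new, logp_old):
    """Per-token importance ratios pi_new / pi_old from log-probabilities."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


def sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence ratio: geometric mean of the token ratios."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))


# Per-token log-probabilities of one sampled completion (made-up values).
logp_old = [-1.0, -2.0, -0.5]
logp_new = [-0.9, -2.6, -0.4]

per_token = token_ratios(logp_new, logp_old)
per_sequence = sequence_ratio(logp_new, logp_old)
```

At token level each position is clipped on its own ratio, so one outlier token can be clipped while its neighbours are not; at sequence level a single ratio is shared by the whole completion.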

Acceptance gate

make_po_config(objective) with presets grpo/dr_grpo/bnpo/dapo/gspo/cispo, built over trl-1.5.0-verified knobs (introspected, not assumed).
make_dr_grpo_config preserved; dr_grpo preset is equivalent; default objective is dr_grpo.
Unit tests (10) green against real trl 1.5.0: each preset's defining knob asserted, unknown-objective raises with the menu, GSPO token-override guard fires, overrides apply.
Drift guards on loss_type / IS-level / epsilon_high.
Wire objective= through the LMA ladder runners (follow-up; the A1 run used dr_grpo — re-runnable with objective="dapo" etc. once threaded).

More Information

docs/research/SELF_DISTILLATION_LANDSCAPE.md — the distillation/preference menu (ADR-007).
Composer 2.5 recipe report + GRPO-family survey + trl-1.5.0 capability map: research pass 2026-05-30 (parallel cross-family).
Installed-trl introspection confirming the knob surface: 2026-05-30.