feat(trainer): policy-optimization objective MENU (ADR-014)

Adds make_po_config(objective=..., **overrides), so RL's base objective is no
longer hardcoded to Dr.GRPO. Six named presets are selectable over trl 1.5.0's
GRPOConfig knob-space, verified by introspecting the installed package rather
than a GitHub snapshot:

grpo    | vanilla GRPO (std-normalized advantage)
dr_grpo | DEFAULT; no length or std normalization bias (Composer 2.5's base objective)
bnpo    | batch-normalized
dapo    | decoupled clip-higher (epsilon_high=0.28) + overlong mask + KL off
gspo    | sequence-level importance ratio (Qwen3; long-CoT / MoE stable)
cispo   | detached-IS REINFORCE (every token keeps a gradient; MiniMax-M1)

Every preset is PURE CONFIG (trl already implements each loss_type branch +
importance_sampling_level/epsilon_high) — no custom _compute_loss. Drift guards
assert loss_type / IS-level / epsilon_high actually applied so a preset can't
silently degrade (e.g. GSPO with IS overridden back to token raises). 10 unit
tests green against real trl 1.5.0.

Research-grounded: Composer 2.5 = Dr.GRPO + on-policy self-distillation (= our
SDPO channel); its sources mention NO DPO/preference/multi-teacher, so the
trace-replay-DPO channel is documented as the framework's own addition, not
Composer's. make_dr_grpo_config preserved for back-compat (== dr_grpo preset).

Follow-up: thread objective= through the LMA ladder runners (A1 used dr_grpo).
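
A minimal sketch of the preset-merge-plus-guard pattern this commit describes, assuming a stand-in dataclass in place of trl's GRPOConfig so that it runs without trl installed. `StandInConfig`, `PRESETS`, and `build` are hypothetical names used only for illustration; the real factory is make_po_config in composer_trainer.py.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class StandInConfig:
    """Hypothetical stand-in for trl.GRPOConfig (only the fields used here)."""
    loss_type: str = "dapo"
    scale_rewards: str = "group"
    importance_sampling_level: str = "token"
    beta: float = 0.0


# Two illustrative presets; the real menu lives in PO_OBJECTIVES.
PRESETS: dict[str, dict[str, Any]] = {
    "dr_grpo": {"loss_type": "dr_grpo", "scale_rewards": "none"},
    "gspo": {"loss_type": "grpo", "importance_sampling_level": "sequence"},
}


def build(objective: str = "dr_grpo", **overrides: Any) -> StandInConfig:
    """Merge a named preset with caller overrides, then check the result."""
    if objective not in PRESETS:
        raise ValueError(
            f"Unknown objective {objective!r}; choose from {sorted(PRESETS)}"
        )
    merged = {**PRESETS[objective], **overrides}  # overrides win over the preset
    cfg = StandInConfig(**merged)
    # Guard: GSPO is only GSPO while the importance ratio stays sequence-level.
    if objective == "gspo":
        assert cfg.importance_sampling_level == "sequence", (
            "GSPO requires importance_sampling_level='sequence'"
        )
    return cfg
```

Calling `build("dr_grpo", beta=0.05)` keeps the preset's loss_type while applying the override, and `build("gspo", importance_sampling_level="token")` trips the guard instead of quietly returning a GRPO-equivalent config.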

Files changed (3)

composer_replication/trainer/composer_trainer.py +146 -1
composer_replication/trainer/tests/test_po_objective_menu.py +87 -0
docs/adrs/ADR-014-policy-optimization-objective-menu.md +110 -0

composer_replication/trainer/composer_trainer.py CHANGED

@@ -401,4 +401,149 @@ def make_dr_grpo_config(**overrides: Any):
     return cfg
-__all__ = ["ComposerReplicationTrainer", "make_dr_grpo_config"]

+# ---------------------------------------------------------------------------
+# Policy-optimization objective MENU (ADR-014)
+# ---------------------------------------------------------------------------
+#
+# The base RL objective used to be hardcoded to Dr.GRPO (make_dr_grpo_config).
+# make_po_config gives RL a real menu: GRPO-family objectives selectable by name.
+# Verified against the installed trl==1.5.0 (introspected 2026-05-30): its
+# GRPOTrainer already implements these as `loss_type` branches + knobs, so EVERY
+# preset below is pure config — no custom _compute_loss override needed.
+#
+# Knob-space each preset sets (all real GRPOConfig fields in trl 1.5.0):
+#   loss_type ∈ {grpo, dr_grpo, bnpo, dapo, cispo}   (gspo = grpo loss +
+#       importance_sampling_level="sequence"; trl has no literal "gspo")
+#   scale_rewards ∈ {"group"(std-norm), "batch", "none"(no std-norm, Dr.GRPO)}
+#   epsilon / epsilon_high   — symmetric vs decoupled "clip-higher" (DAPO)
+#   importance_sampling_level ∈ {"token", "sequence"(GSPO)}
+#   beta                     — KL-to-ref coef (0.0 = reference-free)
+#   mask_truncated_completions — DAPO overlong masking
+#   num_iterations           — on-policy reuse (1 = strict on-policy)
+
+#: Selectable base policy-optimization objectives (named presets over trl knobs).
+PO_OBJECTIVES: dict[str, dict[str, Any]] = {
+    # Vanilla GRPO (DeepSeekMath, arXiv 2402.03300): group-relative advantage
+    # WITH std normalization + per-sequence length normalization. The KL term is
+    # active only when beta > 0; this preset leaves beta at trl's default.
+    "grpo": {
+        "loss_type": "grpo",
+        "scale_rewards": "group",
+        "importance_sampling_level": "token",
+        "num_iterations": 1,
+    },
+    # Dr.GRPO (arXiv 2503.20783): removes the length- and std-normalization
+    # biases (the advantage is not divided by the group std; aggregation is
+    # length-independent). Framework's historical default
+    # (== make_dr_grpo_config). Composer 2.5's base objective.
+    "dr_grpo": {
+        "loss_type": "dr_grpo",
+        "scale_rewards": "none",
+        "importance_sampling_level": "token",
+        "num_iterations": 1,
+    },
+    # BNPO: batch-normalized variant (trl loss_type), std over the batch.
+    "bnpo": {
+        "loss_type": "bnpo",
+        "scale_rewards": "batch",
+        "importance_sampling_level": "token",
+        "num_iterations": 1,
+    },
+    # DAPO (arXiv 2503.14476): decoupled "clip-higher" (epsilon_high > epsilon)
+    # + token-level loss + overlong masking + KL removed. High-value, low-cost
+    # anti-entropy-collapse objective. epsilon_high=0.28 per the paper.
+    "dapo": {
+        "loss_type": "dapo",
+        "scale_rewards": "none",
+        "epsilon": 0.2,
+        "epsilon_high": 0.28,
+        "mask_truncated_completions": True,
+        "beta": 0.0,
+        "importance_sampling_level": "token",
+        "num_iterations": 1,
+    },
+    # GSPO (Qwen, arXiv 2507.18071): SEQUENCE-level importance ratio (one length-
+    # normalized ratio per response) — stabilizes long-CoT and especially MoE RL.
+    # trl expresses this as the grpo loss + importance_sampling_level="sequence".
+    "gspo": {
+        "loss_type": "grpo",
+        "scale_rewards": "group",
+        "importance_sampling_level": "sequence",
+        "num_iterations": 1,
+    },
+    # CISPO (MiniMax-M1, arXiv 2506.13585): clip the IS weight and detach it as a
+    # constant coefficient on log π — every token keeps a gradient (fixes the
+    # "rare reasoning tokens get zeroed by the clip" pathology). eps_max≈5 (ScaleRL).
+    "cispo": {
+        "loss_type": "cispo",
+        "scale_rewards": "none",
+        "epsilon_high": 5.0,
+        "importance_sampling_level": "token",
+        "num_iterations": 1,
+    },
+}
+
+
+def make_po_config(objective: str = "dr_grpo", **overrides: Any):
+    """Build a `trl.GRPOConfig` for a NAMED policy-optimization objective.
+
+    The menu that gives RL real options beyond the single hardcoded Dr.GRPO
+    recipe. ``objective`` selects a preset from ``PO_OBJECTIVES`` (grpo /
+    dr_grpo / bnpo / dapo / gspo / cispo); ``**overrides`` set or override any
+    GRPOConfig field on top (e.g. ``output_dir=...``, ``beta=...``,
+    ``learning_rate=...``).
+
+    All presets are PURE CONFIG over trl 1.5.0's GRPOTrainer (verified by
+    introspecting the installed package 2026-05-30): the trainer already
+    implements each ``loss_type`` branch and the ``importance_sampling_level`` /
+    ``epsilon_high`` knobs, so no custom ``_compute_loss`` is needed. See ADR-014.
+
+    Raises:
+        ValueError: unknown objective (lists the valid menu).
+        AssertionError: a requested knob silently failed to apply (drift guard).
+    """
+    from trl import GRPOConfig  # local import: only when actually building a config
+
+    key = (objective or "dr_grpo").lower()
+    if key not in PO_OBJECTIVES:
+        raise ValueError(
+            f"Unknown PO objective {objective!r}. Choose from: "
+            f"{sorted(PO_OBJECTIVES)}. (Each is a named preset over trl 1.5.0's "
+            f"GRPOConfig knobs — see PO_OBJECTIVES / ADR-014.)"
+        )
+    preset = dict(PO_OBJECTIVES[key])
+    merged = {**preset, **overrides}
+    cfg = GRPOConfig(**merged)
+
+    # Drift guards: fail loudly if a future trl coerces or repurposes a knob we
+    # set, so a preset can never silently degrade to a different objective. (A
+    # renamed field would already raise TypeError in the constructor above.)
+    if "loss_type" in merged:
+        assert str(cfg.loss_type) == str(merged["loss_type"]), (
+            f"GRPOConfig.loss_type drifted: requested {merged['loss_type']!r}, "
+            f"got {cfg.loss_type!r} — trl may have renamed the knob."
+        )
+    if "importance_sampling_level" in merged and hasattr(cfg, "importance_sampling_level"):
+        assert str(cfg.importance_sampling_level) == str(
+            merged["importance_sampling_level"]
+        ), (
+            f"importance_sampling_level drifted for objective {key!r}: requested "
+            f"{merged['importance_sampling_level']!r}, got {cfg.importance_sampling_level!r}."
+        )
+    if key == "gspo":
+        assert str(getattr(cfg, "importance_sampling_level", "token")) == "sequence", (
+            "GSPO requires importance_sampling_level='sequence'; it was overridden "
+            "to token, which silently degrades GSPO to GRPO. Drop that override."
+        )
+    if merged.get("epsilon_high") is not None:
+        assert abs(
+            float(getattr(cfg, "epsilon_high", merged["epsilon_high"]))
+            - float(merged["epsilon_high"])
+        ) < 1e-9, f"epsilon_high (decoupled clip) drifted for {key!r}."
+    return cfg
+
+
+__all__ = [
+    "ComposerReplicationTrainer",
+    "make_dr_grpo_config",
+    "make_po_config",
+    "PO_OBJECTIVES",
+]
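
To make the knob-space concrete, here is a small pure-Python sketch of what three of the knobs change numerically: `scale_rewards` (advantage scaling), `importance_sampling_level` (token vs sequence ratio), and `epsilon` / `epsilon_high` (symmetric vs decoupled clip). It follows the usual definitions from the cited papers and is not trl's implementation; the function names, the `eps` constant, and the use of the sample standard deviation are illustrative assumptions.

```python
import math
from statistics import mean, stdev


def advantages(groups, scale_rewards="group", eps=1e-4):
    """Group-relative advantages; `groups` holds one reward list per prompt.

    Assumes every group has at least two rewards (stdev needs two points).
    """
    batch_std = stdev([r for g in groups for r in g])
    out = []
    for g in groups:
        mu = mean(g)  # the baseline is the group mean in every mode
        if scale_rewards == "group":    # vanilla GRPO: divide by the group std
            denom = stdev(g) + eps
        elif scale_rewards == "batch":  # divide by the std over the whole batch
            denom = batch_std + eps
        elif scale_rewards == "none":   # Dr.GRPO: no std normalization
            denom = 1.0
        else:
            raise ValueError(f"unknown scale_rewards {scale_rewards!r}")
        out.append([(r - mu) / denom for r in g])
    return out


def importance_ratios(logp_new, logp_old, level="token"):
    """Per-token ratios, or one length-normalized ratio shared by all tokens."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    if level == "token":
        return [math.exp(d) for d in diffs]
    if level == "sequence":  # GSPO: exp of the mean per-token log-ratio
        seq = math.exp(sum(diffs) / len(diffs))
        return [seq] * len(diffs)
    raise ValueError(f"unknown level {level!r}")


def clipped_surrogate(ratio, advantage, epsilon=0.2, epsilon_high=None):
    """PPO-style clipped term; epsilon_high decouples the upper bound (DAPO)."""
    upper = epsilon if epsilon_high is None else epsilon_high
    clipped = min(max(ratio, 1.0 - epsilon), 1.0 + upper)
    return min(ratio * advantage, clipped * advantage)
```

With a ratio of 1.25 and a positive advantage, the symmetric clip caps the term at 1.2 times the advantage, whereas `epsilon_high=0.28` lets the full 1.25 through, which is the "clip-higher" effect the dapo preset relies on.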

composer_replication/trainer/tests/test_po_objective_menu.py ADDED

@@ -0,0 +1,87 @@

+"""Tests for the policy-optimization objective menu (make_po_config, ADR-014).
+These build real trl GRPOConfigs, so they require trl installed (the framework's
+.venv has trl==1.5.0). Skips cleanly if trl is absent.
+"""
+
+from __future__ import annotations
+
+import pytest
+
+trl = pytest.importorskip("trl")
+
+from composer_replication.trainer.composer_trainer import (  # noqa: E402
+    PO_OBJECTIVES,
+    make_po_config,
+)
+
+
+def test_menu_lists_expected_objectives():
+    assert set(PO_OBJECTIVES) == {"grpo", "dr_grpo", "bnpo", "dapo", "gspo", "cispo"}
+
+
+def test_unknown_objective_raises_with_menu(tmp_path):
+    with pytest.raises(ValueError) as ei:
+        make_po_config("nope", output_dir=str(tmp_path))
+    msg = str(ei.value)
+    assert "Unknown PO objective" in msg and "dapo" in msg and "gspo" in msg
+
+
+def test_grpo_preset(tmp_path):
+    cfg = make_po_config("grpo", output_dir=str(tmp_path))
+    assert str(cfg.loss_type) == "grpo"
+    assert str(cfg.importance_sampling_level) == "token"
+    # group scaling = std-normalized advantage (vanilla GRPO)
+    assert str(cfg.scale_rewards).lower() in ("group", "true")
+
+
+def test_dr_grpo_preset_matches_legacy(tmp_path):
+    cfg = make_po_config("dr_grpo", output_dir=str(tmp_path))
+    assert str(cfg.loss_type) == "dr_grpo"
+    # no std-normalization (the Dr.GRPO fix)
+    assert str(cfg.scale_rewards).lower() in ("none", "false")
+
+
+def test_dapo_preset_sets_decoupled_clip(tmp_path):
+    cfg = make_po_config("dapo", output_dir=str(tmp_path))
+    assert str(cfg.loss_type) == "dapo"
+    # clip-higher: epsilon_high strictly above epsilon
+    assert cfg.epsilon_high is not None
+    assert float(cfg.epsilon_high) > float(cfg.epsilon)
+    assert bool(cfg.mask_truncated_completions) is True
+    assert float(cfg.beta) == 0.0  # DAPO removes KL
+
+
+def test_gspo_is_sequence_level(tmp_path):
+    cfg = make_po_config("gspo", output_dir=str(tmp_path))
+    # GSPO = grpo loss + SEQUENCE-level importance ratio
+    assert str(cfg.loss_type) == "grpo"
+    assert str(cfg.importance_sampling_level) == "sequence"
+
+
+def test_gspo_guard_rejects_token_override(tmp_path):
+    # Overriding back to token-level would silently degrade GSPO to GRPO -> guard.
+    with pytest.raises(AssertionError):
+        make_po_config(
+            "gspo", output_dir=str(tmp_path), importance_sampling_level="token"
+        )
+
+
+def test_cispo_preset(tmp_path):
+    cfg = make_po_config("cispo", output_dir=str(tmp_path))
+    assert str(cfg.loss_type) == "cispo"
+    # eps_max (ScaleRL recommended 5.0) carried via epsilon_high
+    assert cfg.epsilon_high is not None and float(cfg.epsilon_high) >= 5.0
+
+
+def test_overrides_apply_on_top(tmp_path):
+    cfg = make_po_config(
+        "dr_grpo", output_dir=str(tmp_path), beta=0.05, num_generations=4
+    )
+    assert float(cfg.beta) == 0.05
+    assert int(cfg.num_generations) == 4
+    assert str(cfg.loss_type) == "dr_grpo"  # preset preserved under overrides
+
+
+def test_default_objective_is_dr_grpo(tmp_path):
+    cfg = make_po_config(output_dir=str(tmp_path))
+    assert str(cfg.loss_type) == "dr_grpo"
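
One caveat relevant to test_gspo_guard_rejects_token_override: the drift guards are written as `assert` statements, and Python removes asserts when run with `-O`, so the guards would not fire in an optimized interpreter. The sketch below reproduces that behaviour with the built-in `compile(..., optimize=1)` and contrasts it with an explicit `raise`, which survives optimization. The two guard snippets are simplified stand-ins, not the repository's code.

```python
def outcome(source: str, optimize: int) -> str:
    """Compile and run `source`; report whether an AssertionError escaped."""
    code = compile(source, "<guard>", "exec", optimize=optimize)
    try:
        exec(code, {})
    except AssertionError:
        return "raised"
    return "passed"


# Guard written as an assert statement (removed at optimize >= 1).
ASSERT_GUARD = (
    "level = 'token'\n"
    "assert level == 'sequence', 'GSPO requires sequence-level IS'\n"
)

# Same check as an explicit raise (kept at every optimization level).
EXPLICIT_GUARD = (
    "level = 'token'\n"
    "if level != 'sequence':\n"
    "    raise AssertionError('GSPO requires sequence-level IS')\n"
)

print(outcome(ASSERT_GUARD, 0))    # asserts enabled
print(outcome(ASSERT_GUARD, 1))    # asserts stripped, as under python -O
print(outcome(EXPLICIT_GUARD, 1))  # explicit raise still fires
```

If the guards ever need to hold under `-O`, raising `AssertionError` explicitly would keep the documented exception type and the existing test unchanged.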

docs/adrs/ADR-014-policy-optimization-objective-menu.md ADDED

@@ -0,0 +1,110 @@

+---
+status: accepted
+date: 2026-05-30
+deciders: [Codeseys, ARIA]
+builds-on: [ADR-006 (RL frameworks), ADR-007 (distillation menu), ADR-008 (Dr.GRPO base)]
+---
+
+# ADR-014: Policy-optimization objective MENU — make RL's base objective selectable
+
+## Context and Problem Statement
+
+The framework replicates Composer 2.5's recipe as **base RL objective + distillation
+channel + preference channel**. ADR-007 already gave the *distillation/preference*
+axis a menu (SimPO / TAID / Entropy-Aware-OPD via `compose_loss(dpo_variant=,
+sdpo_wrapper=)`). But the **base policy-optimization objective was hardcoded to
+Dr.GRPO** via `make_dr_grpo_config` — `loss_type="dr_grpo"`, `scale_rewards="none"`,
+`num_iterations=1`, no other options.
+
+User ask (2026-05-30): *"look at other policy optimization papers like SDPO and
+Dr.GRPO or OPSD ... allow for RL to have multiple options ... [SDPO / what Composer
+2.5 uses] is one of the best instances of setting up post-training to take models to
+massive performance gains."*
+
+So: give the base RL objective a real menu, grounded in the current PO frontier and
+in what Composer 2.5 actually does.
+
+## Research (primary-sourced, 2026-05-30)
+
+Three parallel cross-family research passes (reports in `docs/research/`):
+
+1. **Composer 2.5's actual recipe** (cursor.com/blog/composer-2-5 + Composer 2 report
+   arXiv:2603.24477): base objective is **Dr.GRPO** (no length-std normalization,
+   single-epoch, k1-discussed/k3-in-TRL KL, async RL with MoE router replay). Its
+   headline 2.5 technique — "targeted RL with textual feedback" — is **on-policy
+   self-distillation** (= our SDPO channel ✓). **No DPO / preference pairs / multiple
+   teachers appear in any Composer source** — our trace-replay-DPO channel is the
+   framework's own addition, NOT Composer's. Recorded honestly.
+2. **GRPO-family PO landscape**: vanilla GRPO (2402.03300), Dr.GRPO (2503.20783),
+   DAPO (2503.14476, decoupled clip-higher + dynamic sampling + KL-off), GSPO
+   (2507.18071, sequence-level importance ratio — Qwen3, MoE-stable), CISPO
+   (MiniMax-M1 2506.13585, detached-IS REINFORCE so every token keeps a gradient),
+   GMPO (geometric-mean aggregation). Most are surgical edits to advantage-norm /
+   length-norm / clip / ratio-granularity / KL.
+3. **TRL 1.5.0 capability map** — then **verified by introspecting the installed
+   package** (not a GitHub snapshot): the installed trl 1.5.0 `GRPOTrainer` already
+   implements `loss_type ∈ {grpo, dr_grpo, bnpo, dapo, cispo, luspo, sapo, vespo}`
+   and exposes `epsilon`/`epsilon_high` (decoupled clip), `delta`, `beta`,
+   `scale_rewards ∈ {group,batch,none}`, `importance_sampling_level ∈ {token,
+   sequence}` (= GSPO), `mask_truncated_completions`, `num_iterations`. **Every
+   objective we want is therefore PURE CONFIG — no custom `_compute_loss` needed.**
+
+## Decision
+
+Add **`make_po_config(objective=..., **overrides)`** to `trainer/composer_trainer.py`,
+a named-preset factory over trl 1.5.0's verified GRPOConfig knob-space. Keep
+`make_dr_grpo_config` intact (back-compat); `dr_grpo` is now one preset in the menu.
+
+Menu (`PO_OBJECTIVES`):
+
+| objective | loss_type | key knobs | what it gives |
+|---|---|---|---|
+| `grpo` | grpo | scale_rewards=group, IS=token | vanilla GRPO (std-norm advantage) |
+| `dr_grpo` | dr_grpo | scale_rewards=none, IS=token | **default**; no length-std bias (Composer 2.5 base) |
+| `bnpo` | bnpo | scale_rewards=batch | batch-normalized variant |
+| `dapo` | dapo | epsilon_high=0.28, mask_truncated, beta=0 | decoupled clip-higher, anti-entropy-collapse |
+| `gspo` | grpo | IS=**sequence** | sequence-level ratio; long-CoT / MoE stable (Qwen3) |
+| `cispo` | cispo | epsilon_high=5.0 | detached-IS REINFORCE; every token keeps gradient |
+
+Each preset carries literature-recommended settings; any field is overridable via
+`**overrides`. **Drift guards** assert the requested `loss_type` /
+`importance_sampling_level` / `epsilon_high` actually applied, so a preset can never
+silently degrade (e.g. GSPO with `importance_sampling_level` overridden back to token
+raises rather than quietly becoming GRPO).
+
+### Consequences
+
+- **Positive**: RL now has 6 selectable base objectives at zero custom-loss cost; the
+  ladder/runners gain an `objective=` knob orthogonal to the existing SDPO/replay
+  channels and the SimPO/TAID/EA-OPD distillation menu. A user can run e.g.
+  `objective="dapo"` (clip-higher) or `objective="gspo"` (MoE-stable) instead of only
+  Dr.GRPO.
+- **Positive**: faithful to Composer 2.5 (dr_grpo default + self-distillation) while
+  also exposing other recent (2025-26) objectives as alternatives.
+- **Neutral**: `gspo` is `grpo` loss + `importance_sampling_level="sequence"` (trl has
+  no literal "gspo"); documented in the preset + guarded.
+- **Negative / honest**: presets reflect *current* trl 1.5.0 field names; a trl upgrade
+  could rename or repurpose knobs. A rename fails in the `GRPOConfig` constructor and the
+  drift guards catch a changed value, so either case is a loud failure, not silent
+  mis-training. `sapo`/`luspo`/`vespo` exist in this trl build but are NOT in the menu
+  yet (less-established; add later if validated).
+
+## Acceptance gate
+
+- [x] `make_po_config(objective)` with presets grpo/dr_grpo/bnpo/dapo/gspo/cispo,
+  built over trl-1.5.0-verified knobs (introspected, not assumed).
+- [x] `make_dr_grpo_config` preserved; `dr_grpo` preset is equivalent; default objective
+  is `dr_grpo`.
+- [x] Unit tests (10) green against real trl 1.5.0: defining knobs asserted for every
+  preset except `bnpo` (no dedicated test yet), unknown-objective raises with the menu,
+  GSPO token-override guard fires, overrides apply.
+- [x] Drift guards on loss_type / IS-level / epsilon_high.
+- [ ] Wire `objective=` through the LMA ladder runners (follow-up; the A1 run used
+  dr_grpo — re-runnable with `objective="dapo"` etc. once threaded).
+
+## More Information
+
+- `docs/research/SELF_DISTILLATION_LANDSCAPE.md` — the distillation/preference menu (ADR-007).
+- Composer 2.5 recipe report + GRPO-family survey + trl-1.5.0 capability map: research
+  pass 2026-05-30 (parallel cross-family).
+- Installed-trl introspection confirming the knob surface: 2026-05-30.
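
The "verified by introspecting the installed package" step could be approximated with `dataclasses.fields`, assuming `GRPOConfig` is a dataclass in trl 1.5.0 as it has been in earlier trl releases. The sketch below uses a hypothetical stand-in class so that it runs without trl; `StandInConfig`, `REQUIRED`, and `missing_knobs` are illustrative names, not part of the repository.

```python
import dataclasses


@dataclasses.dataclass
class StandInConfig:
    """Hypothetical stand-in for trl.GRPOConfig (deliberately lacks epsilon_high)."""
    loss_type: str = "dapo"
    scale_rewards: str = "group"
    importance_sampling_level: str = "token"
    beta: float = 0.0


def missing_knobs(config_cls, required):
    """Return the required field names that config_cls does not define."""
    defined = {f.name for f in dataclasses.fields(config_cls)}
    return sorted(set(required) - defined)


# Knobs a preset might set; the stand-in omits one on purpose.
REQUIRED = [
    "loss_type",
    "scale_rewards",
    "importance_sampling_level",
    "beta",
    "epsilon_high",
]

print(missing_knobs(StandInConfig, REQUIRED))  # -> ['epsilon_high']
```

With trl installed, the same helper could be pointed at the real class (`from trl import GRPOConfig`, then `missing_knobs(GRPOConfig, REQUIRED)`); an empty list would indicate that every knob the presets set exists as a config field.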