status: accepted
date: 2026-05-30T00:00:00.000Z
deciders:
- Codeseys
- ARIA
builds-on:
- ADR-006 (RL frameworks)
- ADR-007 (distillation menu)
- ADR-008 (Dr.GRPO base)
ADR-014: Policy-optimization objective MENU — make RL's base objective selectable
Context and Problem Statement
The framework replicates Composer 2.5's recipe as base RL objective + distillation
channel + preference channel. ADR-007 already gave the distillation/preference
axis a menu (SimPO / TAID / Entropy-Aware-OPD via compose_loss(dpo_variant=, sdpo_wrapper=)). But the base policy-optimization objective was hardcoded to
Dr.GRPO via make_dr_grpo_config — loss_type="dr_grpo", scale_rewards="none",
num_iterations=1, no other options.
User ask (2026-05-30): "look at other policy optimization papers like SDPO and Dr.GRPO or OPSD ... allow for RL to have multiple options ... [SDPO / what Composer 2.5 uses] is one of the best instances of setting up post-training to take models to massive performance gains."
So: give the base RL objective a real menu, grounded in the current PO frontier and in what Composer 2.5 actually does.
Research (primary-sourced, 2026-05-30)
Three parallel cross-family research passes (reports in docs/research/):
Composer 2.5's actual recipe (cursor.com/blog/composer-2-5 + Composer 2 report arXiv:2603.24477): base objective is Dr.GRPO (no length-std normalization, single-epoch, k1-discussed/k3-in-TRL KL, async RL with MoE router replay). Its headline 2.5 technique — "targeted RL with textual feedback" — is on-policy self-distillation (= our SDPO channel ✓). No DPO / preference pairs / multiple teachers appear in any Composer source — our trace-replay-DPO channel is the framework's own addition, NOT Composer's. Recorded honestly.
GRPO-family PO landscape: vanilla GRPO (2402.03300), Dr.GRPO (2503.20783), DAPO (2503.14476, decoupled clip-higher + dynamic sampling + KL-off), GSPO (2507.18071, sequence-level importance ratio — Qwen3, MoE-stable), CISPO (MiniMax-M1 2506.13585, detached-IS REINFORCE so every token keeps a gradient), GMPO (geometric-mean aggregation). Most are surgical edits to advantage-norm / length-norm / clip / ratio-granularity / KL.
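To make the "surgical edits" concrete, here is a minimal pure-Python sketch of three of the axes named above: advantage normalization (GRPO vs. Dr.GRPO), ratio granularity (token-level vs. GSPO-style sequence-level), and decoupled clipping (DAPO-style clip-higher). The function names are hypothetical and not part of the framework or of trl. The use of the population standard deviation and the `eps` stabilizer are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, scale="group", eps=1e-4):
    """Advantages for one prompt's group of sampled completions.

    scale="none"  -> mean-centering only (the Dr.GRPO-style choice)
    scale="group" -> also divide by the group's std (vanilla GRPO)
    Population std and eps are illustrative assumptions.
    """
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if scale == "none":
        return centered
    if scale == "group":
        sigma = pstdev(rewards)
        return [c / (sigma + eps) for c in centered]
    raise ValueError(f"unsupported scale: {scale!r}")


def token_ratios(logp_new, logp_old):
    """Token-level importance ratios: one ratio per generated token."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


def sequence_ratio(logp_new, logp_old):
    """Sequence-level ratio in the GSPO style: the length-normalized
    likelihood ratio, i.e. the geometric mean of the token ratios."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))


def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped objective with separate lower and upper bounds.
    Setting eps_high > eps_low is the clip-higher idea."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)
```

With binary rewards such as `[1.0, 0.0, 0.0, 1.0]`, `scale="none"` leaves the centered rewards untouched, while `scale="group"` rescales them by the group's spread, which is the normalization that Dr.GRPO drops.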
TRL 1.5.0 capability map, then verified by introspecting the installed package (not a GitHub snapshot): the installed trl 1.5.0 GRPOTrainer already implements loss_type ∈ {grpo, dr_grpo, bnpo, dapo, cispo, luspo, sapo, vespo} and exposes epsilon / epsilon_high (decoupled clip), delta, beta, scale_rewards ∈ {group, batch, none}, importance_sampling_level ∈ {token, sequence} (= GSPO), mask_truncated_completions, num_iterations. Every objective we want is therefore PURE CONFIG: no custom _compute_loss needed.
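The introspection step can be sketched roughly as follows. The helper `config_knobs` and the stand-in class `StandInConfig` are hypothetical, introduced only so the example runs without trl installed. The sketch assumes the config class is a dataclass, which lets `dataclasses.fields` list its knobs.

```python
import dataclasses


def config_knobs(config_cls, names):
    """Return {knob: default} for the requested fields of a dataclass,
    raising loudly if the installed class lacks any of them."""
    available = {f.name: f for f in dataclasses.fields(config_cls)}
    missing = [n for n in names if n not in available]
    if missing:
        raise AttributeError(f"{config_cls.__name__} lacks fields: {missing}")
    return {n: available[n].default for n in names}


# Stand-in with a few of the knob names listed above; its defaults are
# illustrative only and are NOT claimed to match trl's real defaults.
@dataclasses.dataclass
class StandInConfig:
    loss_type: str = "dr_grpo"
    scale_rewards: str = "none"
    importance_sampling_level: str = "token"


knobs = config_knobs(StandInConfig, ["loss_type", "importance_sampling_level"])
```

With trl installed, the same helper could be pointed at the real class, for example `config_knobs(GRPOConfig, ["loss_type", "epsilon_high"])` after `from trl import GRPOConfig`. A renamed or removed field then surfaces as an `AttributeError` rather than a silently ignored setting.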
Decision
Add make_po_config(objective=..., **overrides) to trainer/composer_trainer.py
— a named-preset factory over trl 1.5.0's verified GRPOConfig knob-space. Keep
make_dr_grpo_config intact (back-compat); dr_grpo is now one preset among the menu.
Menu (PO_OBJECTIVES):
| objective | loss_type | key knobs | what it gives |
|---|---|---|---|
| grpo | grpo | scale_rewards=group, IS=token | vanilla GRPO (std-norm advantage) |
| dr_grpo | dr_grpo | scale_rewards=none, IS=token | default; no length-std bias (Composer 2.5 base) |
| bnpo | bnpo | scale_rewards=batch | batch-normalized variant |
| dapo | dapo | epsilon_high=0.28, mask_truncated, beta=0 | decoupled clip-higher, anti-entropy-collapse |
| gspo | grpo | IS=sequence | sequence-level ratio; long-CoT / MoE stable (Qwen3) |
| cispo | cispo | epsilon_high=5.0 | detached-IS REINFORCE; every token keeps gradient |
Each preset carries literature-recommended settings; any field is overridable via
**overrides. Drift guards assert the requested loss_type /
importance_sampling_level / epsilon_high actually applied, so a preset can never
silently degrade (e.g. GSPO with importance_sampling_level overridden back to token
raises rather than quietly becoming GRPO).
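A preset factory with a drift guard might look roughly like the sketch below. It is not the framework's actual make_po_config: the names `PO_PRESETS`, `GUARDED`, and `po_kwargs` are hypothetical, and the function returns a plain dict of keyword arguments instead of building a trl GRPOConfig, so that it runs without trl. The preset values are copied from the menu table. Guarding only loss_type and importance_sampling_level is a simplifying assumption, because the exact semantics of the epsilon_high guard are not spelled out here.

```python
# Preset values taken from the menu table; the dict layout is illustrative.
PO_PRESETS = {
    "grpo": {"loss_type": "grpo", "scale_rewards": "group",
             "importance_sampling_level": "token"},
    "dr_grpo": {"loss_type": "dr_grpo", "scale_rewards": "none",
                "importance_sampling_level": "token"},
    "bnpo": {"loss_type": "bnpo", "scale_rewards": "batch"},
    "dapo": {"loss_type": "dapo", "epsilon_high": 0.28,
             "mask_truncated_completions": True, "beta": 0.0},
    "gspo": {"loss_type": "grpo", "importance_sampling_level": "sequence"},
    "cispo": {"loss_type": "cispo", "epsilon_high": 5.0},
}

# Knobs that define which objective a preset is (assumption: the real
# guards may check more, or check the built config object instead).
GUARDED = ("loss_type", "importance_sampling_level")


def po_kwargs(objective="dr_grpo", **overrides):
    """Merge a named preset with overrides, refusing overrides that would
    silently turn the preset into a different objective."""
    if objective not in PO_PRESETS:
        raise ValueError(
            f"unknown objective {objective!r}; choose from {sorted(PO_PRESETS)}"
        )
    preset = PO_PRESETS[objective]
    merged = {**preset, **overrides}
    for knob in GUARDED:
        if knob in preset and merged[knob] != preset[knob]:
            raise ValueError(
                f"{objective}: {knob}={merged[knob]!r} conflicts with "
                f"preset value {preset[knob]!r}"
            )
    return merged
```

Under these assumptions, `po_kwargs("dapo", beta=0.01)` applies the override, whereas `po_kwargs("gspo", importance_sampling_level="token")` raises, mirroring the GSPO example above. The resulting dict could then be passed on to the real config constructor along with the usual training arguments.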
Consequences
- Positive: RL now has 6 selectable base objectives at zero custom-loss cost; the ladder/runners gain an objective= knob orthogonal to the existing SDPO/replay channels and the SimPO/TAID/EA-OPD distillation menu. A user can run e.g. objective="dapo" (clip-higher) or objective="gspo" (MoE-stable) instead of only Dr.GRPO.
- Positive: faithful to Composer 2.5 (dr_grpo default + self-distillation) while exposing the stronger 2025-26 objectives the report predates.
- Neutral: gspo is grpo loss + importance_sampling_level="sequence" (trl has no literal "gspo"); documented in the preset + guarded.
- Negative / honest: presets reflect current trl 1.5.0 field names; a trl upgrade could rename knobs, and the drift guards turn that into a loud failure, not silent mis-training. sapo/luspo/vespo exist in this trl build but are NOT in the menu yet (less-established; add later if validated).
Acceptance gate
- make_po_config(objective) with presets grpo/dr_grpo/bnpo/dapo/gspo/cispo, built over trl-1.5.0-verified knobs (introspected, not assumed).
- make_dr_grpo_config preserved; dr_grpo preset is equivalent; default objective is dr_grpo.
- Unit tests (10) green against real trl 1.5.0: each preset's defining knob asserted, unknown-objective raises with the menu, GSPO token-override guard fires, overrides apply.
- Drift guards on loss_type / IS-level / epsilon_high.
- Wire objective= through the LMA ladder runners (follow-up; the A1 run used dr_grpo and is re-runnable with objective="dapo" etc. once threaded).
More Information
- docs/research/SELF_DISTILLATION_LANDSCAPE.md: the distillation/preference menu (ADR-007).
- Composer 2.5 recipe report + GRPO-family survey + trl-1.5.0 capability map: research pass 2026-05-30 (parallel cross-family).
- Installed-trl introspection confirming the knob surface: 2026-05-30.