| field | value |
|---|---|
| status | accepted |
| date | 2026-05-30 |
| deciders | Codeseys, ARIA |
| builds-on | ADR-006 (RL frameworks), ADR-007 (distillation menu), ADR-008 (Dr.GRPO base) |

# ADR-014: Policy-optimization objective MENU - make RL's base objective selectable
## Context and Problem Statement

The framework replicates Composer 2.5's recipe as **base RL objective + distillation
channel + preference channel**. ADR-007 already gave the *distillation/preference*
axis a menu (SimPO / TAID / Entropy-Aware-OPD via `compose_loss(dpo_variant=,
sdpo_wrapper=)`). But the **base policy-optimization objective was hardcoded to
Dr.GRPO** via `make_dr_grpo_config`: `loss_type="dr_grpo"`, `scale_rewards="none"`,
`num_iterations=1`, with no other options.

User ask (2026-05-30): *"look at other policy optimization papers like SDPO and
Dr.GRPO or OPSD ... allow for RL to have multiple options ... [SDPO / what Composer
2.5 uses] is one of the best instances of setting up post-training to take models to
massive performance gains."*

So: give the base RL objective a real menu, grounded in the current PO frontier and
in what Composer 2.5 actually does.
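To make the `scale_rewards` setting concrete, the sketch below computes group-relative advantages for one prompt with and without std scaling. It is an illustration only, not the framework's or TRL's code: the function name `group_advantages`, the population-std estimator, and the `eps` value are assumptions chosen for the example.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, scale_rewards="none", eps=1e-4):
    """Advantages for the completions sampled for ONE prompt (one group).

    scale_rewards="none"  -> Dr.GRPO-style: subtract the group mean only.
    scale_rewards="group" -> vanilla-GRPO-style: also divide by the group std.
    The std estimator and eps are assumptions of this sketch.
    """
    mean = fmean(rewards)
    centered = [r - mean for r in rewards]
    if scale_rewards == "none":
        return centered
    if scale_rewards == "group":
        std = pstdev(rewards)
        return [c / (std + eps) for c in centered]
    raise ValueError(f"unsupported scale_rewards in this sketch: {scale_rewards!r}")


rewards = [1.0, 0.0, 0.0, 0.0]  # one success out of four samples
print(group_advantages(rewards, "none"))
print(group_advantages(rewards, "group"))
```

With std scaling, groups whose rewards barely vary get their advantages magnified, which is the difficulty bias Dr.GRPO removes by keeping only the mean subtraction.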
## Research (primary-sourced, 2026-05-30)

Three parallel cross-family research passes (reports in `docs/research/`):

1. **Composer 2.5's actual recipe** (cursor.com/blog/composer-2-5 + Composer 2 report
   arXiv:2603.24477): the base objective is **Dr.GRPO** (no length or std
   normalization, single-epoch, async RL with MoE router replay; the report discusses
   the k1 KL estimator, while TRL implements k3). Its headline 2.5 technique,
   "targeted RL with textual feedback", is **on-policy self-distillation** (= our SDPO
   channel ✓). **No DPO / preference pairs / multiple teachers appear in any Composer
   source**; our trace-replay-DPO channel is the framework's own addition, NOT
   Composer's. Recorded honestly.
2. **GRPO-family PO landscape**: vanilla GRPO (arXiv:2402.03300), Dr.GRPO
   (arXiv:2503.20783), DAPO (arXiv:2503.14476; decoupled clip-higher + dynamic
   sampling + KL-off), GSPO (arXiv:2507.18071; sequence-level importance ratio, used
   for Qwen3, MoE-stable), CISPO (MiniMax-M1, arXiv:2506.13585; detached-IS REINFORCE
   so every token keeps a gradient), GMPO (geometric-mean aggregation). Most are
   surgical edits to advantage-norm / length-norm / clip / ratio-granularity / KL.
3. **TRL 1.5.0 capability map**, then **verified by introspecting the installed
   package** (not a GitHub snapshot): the installed trl 1.5.0 `GRPOTrainer` already
   implements `loss_type ∈ {grpo, dr_grpo, bnpo, dapo, cispo, luspo, sapo, vespo}`
   and exposes `epsilon`/`epsilon_high` (decoupled clip), `delta`, `beta`,
   `scale_rewards ∈ {group, batch, none}`, `importance_sampling_level ∈ {token,
   sequence}` (sequence = GSPO), `mask_truncated_completions`, `num_iterations`.
   **Every objective we want is therefore PURE CONFIG: no custom `_compute_loss`
   needed.**
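The introspection step described in item 3 can be approximated with a field-name check over a dataclass-based config. The sketch below is hedged: `missing_knobs`, `REQUIRED_KNOBS`, and `StandInConfig` are hypothetical names introduced here so the example runs without trl installed, and the check covers field names only, not the allowed values of each knob.

```python
from dataclasses import dataclass, fields

# Knob names taken from the capability list above.
REQUIRED_KNOBS = (
    "loss_type", "epsilon", "epsilon_high", "delta", "beta", "scale_rewards",
    "importance_sampling_level", "mask_truncated_completions", "num_iterations",
)


def missing_knobs(config_cls, required=REQUIRED_KNOBS):
    """Return the required knob names that the dataclass does not declare."""
    declared = {f.name for f in fields(config_cls)}
    return sorted(set(required) - declared)


@dataclass
class StandInConfig:
    """Hypothetical stand-in for a real config class; deliberately incomplete."""
    loss_type: str = "dr_grpo"
    epsilon: float = 0.2
    beta: float = 0.0
    scale_rewards: str = "none"


print(missing_knobs(StandInConfig))
```

With trl installed, the same call could be pointed at the real class (`from trl import GRPOConfig; missing_knobs(GRPOConfig)`), assuming `GRPOConfig` remains a dataclass; an empty list would then mean every knob name the presets rely on is present in that build.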
## Decision

Add **`make_po_config(objective=..., **overrides)`** to `trainer/composer_trainer.py`:
a named-preset factory over trl 1.5.0's verified GRPOConfig knob-space. Keep
`make_dr_grpo_config` intact (back-compat); `dr_grpo` is now one preset among the menu.

Menu (`PO_OBJECTIVES`); in the table, IS abbreviates `importance_sampling_level`:

| objective | loss_type | key knobs | what it gives |
|---|---|---|---|
| `grpo` | grpo | scale_rewards=group, IS=token | vanilla GRPO (std-norm advantage) |
| `dr_grpo` | dr_grpo | scale_rewards=none, IS=token | **default**; no length or std normalization bias (Composer 2.5 base) |
| `bnpo` | bnpo | scale_rewards=batch | batch-normalized variant |
| `dapo` | dapo | epsilon_high=0.28, mask_truncated_completions=True, beta=0 | decoupled clip-higher, anti-entropy-collapse |
| `gspo` | grpo | IS=**sequence** | sequence-level ratio; long-CoT / MoE stable (Qwen3) |
| `cispo` | cispo | epsilon_high=5.0 | detached-IS REINFORCE; every token keeps gradient |

Each preset carries literature-recommended settings; any field is overridable via
`**overrides`. **Drift guards** assert that the requested `loss_type` /
`importance_sampling_level` / `epsilon_high` were actually applied, so a preset can
never silently degrade (e.g. GSPO with `importance_sampling_level` overridden back to
token raises rather than quietly becoming GRPO).
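A minimal sketch of the preset-plus-drift-guard pattern follows. It is not the contents of `trainer/composer_trainer.py`: `StandInGRPOConfig` is a hypothetical stand-in for the real config class so the example runs without trl, the `config_cls` parameter and the `GUARDED` tuple are illustrative, and the guard semantics (a guarded knob may not end up different from its preset value) are an assumption based on the GSPO example in this ADR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StandInGRPOConfig:
    """Hypothetical stand-in for the real config class; defaults are illustrative."""
    loss_type: str = "grpo"
    scale_rewards: str = "group"
    importance_sampling_level: str = "token"
    epsilon_high: Optional[float] = None
    beta: float = 0.0
    mask_truncated_completions: bool = False


# Preset values copied from the menu table in this ADR.
PO_OBJECTIVES = {
    "grpo": dict(loss_type="grpo", scale_rewards="group", importance_sampling_level="token"),
    "dr_grpo": dict(loss_type="dr_grpo", scale_rewards="none", importance_sampling_level="token"),
    "bnpo": dict(loss_type="bnpo", scale_rewards="batch"),
    "dapo": dict(loss_type="dapo", epsilon_high=0.28, mask_truncated_completions=True, beta=0.0),
    "gspo": dict(loss_type="grpo", importance_sampling_level="sequence"),
    "cispo": dict(loss_type="cispo", epsilon_high=5.0),
}

# Knobs that define what a preset IS; drifting on these changes the objective.
GUARDED = ("loss_type", "importance_sampling_level", "epsilon_high")


def make_po_config(objective="dr_grpo", config_cls=StandInGRPOConfig, **overrides):
    if objective not in PO_OBJECTIVES:
        raise ValueError(f"unknown objective {objective!r}; menu: {sorted(PO_OBJECTIVES)}")
    preset = PO_OBJECTIVES[objective]
    config = config_cls(**{**preset, **overrides})  # overrides win over preset values
    # Drift guard: fail loudly if a defining knob did not end up as the preset says.
    for knob in GUARDED:
        if knob in preset and getattr(config, knob) != preset[knob]:
            raise ValueError(
                f"{objective!r} requires {knob}={preset[knob]!r}, got {getattr(config, knob)!r}"
            )
    return config


print(make_po_config("dapo", beta=0.01))
```

Checking the constructed config rather than the input dictionary is deliberate: if a library upgrade renamed or ignored a knob, the guard would see the wrong value on the object and raise.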
### Consequences

- **Positive**: RL now has 6 selectable base objectives at zero custom-loss cost; the
  ladder/runners can take an `objective=` knob (wiring is a follow-up, see the
  acceptance gate) that is orthogonal to the existing SDPO/replay channels and the
  SimPO/TAID/EA-OPD distillation menu. A user can run e.g. `objective="dapo"`
  (clip-higher) or `objective="gspo"` (MoE-stable) instead of only Dr.GRPO.
- **Positive**: faithful to Composer 2.5 (dr_grpo default + self-distillation) while
  exposing the stronger 2025-26 objectives the report predates.
- **Neutral**: `gspo` is `grpo` loss + `importance_sampling_level="sequence"` (trl has
  no literal "gspo"); documented in the preset + guarded.
- **Negative / honest**: presets reflect *current* trl 1.5.0 field names; a trl upgrade
  could rename knobs. The drift guards turn that into a loud failure, not silent
  mis-training. `sapo`/`luspo`/`vespo` exist in this trl build but are NOT in the menu
  yet (less-established; add later if validated).
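The difference between the `token` and `sequence` importance-sampling levels behind the `gspo` preset can be sketched in a few lines. This assumes the length-normalized sequence ratio from the GSPO paper (the geometric mean of the per-token ratios); the function names are hypothetical and the sketch does not reproduce how trl applies the ratio inside its loss.

```python
import math


def token_ratios(new_logps, old_logps):
    """Token-level importance ratios: one ratio per generated token."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def sequence_ratio(new_logps, old_logps):
    """Sequence-level ratio: exp of the MEAN per-token log-ratio,
    i.e. the geometric mean of the token-level ratios."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))


# Per-token log-probabilities of one completion under the new and old policy.
new_logps = [-1.0, -2.0, -0.5]
old_logps = [-1.2, -1.5, -0.5]

print(token_ratios(new_logps, old_logps))    # three ratios, clipped per token
print(sequence_ratio(new_logps, old_logps))  # one ratio shared by the whole sequence
```

Because a single ratio is shared by every token of a completion, one outlier token cannot dominate the clipping decision, which is the usual argument for sequence-level ratios on long chains of thought and MoE models.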
## Acceptance gate

- [x] `make_po_config(objective)` with presets grpo/dr_grpo/bnpo/dapo/gspo/cispo,
  built over trl-1.5.0-verified knobs (introspected, not assumed).
- [x] `make_dr_grpo_config` preserved; `dr_grpo` preset is equivalent; default objective
  is `dr_grpo`.
- [x] Unit tests (10) green against real trl 1.5.0: each preset's defining knob asserted,
  unknown-objective raises with the menu, GSPO token-override guard fires, overrides apply.
- [x] Drift guards on loss_type / IS-level / epsilon_high.
- [ ] Wire `objective=` through the LMA ladder runners (follow-up; the A1 run used
  dr_grpo and is re-runnable with `objective="dapo"` etc. once threaded).
## More Information

- `docs/research/SELF_DISTILLATION_LANDSCAPE.md`: the distillation/preference menu (ADR-007).
- Composer 2.5 recipe report + GRPO-family survey + trl-1.5.0 capability map: research
  pass 2026-05-30 (parallel cross-family).
- Installed-trl introspection confirming the knob surface: 2026-05-30.