| # Composer 2.5 Replication Framework: A Methodology Paper | |
| > **🚧 PRE-EXPERIMENTAL DRAFT (v0.0)** — methodology + economic-feasibility result + integration architecture. **No model training experiments yet.** Every empirical claim in this paper is one of: (1) a citation to upstream literature, (2) a result from spike 001 (teacher-replay cost measurement), or (3) a unit-test invariant from spike 005 (integration smoke). The full empirical validation (spike 002–004: trace collection, DPO-pair signal density, A/B vs plain GRPO) is the subject of a follow-up paper once GPU budget commits. | |
| > | |
| > **Last updated:** 2026-05-25 | |
| > **Author:** Codeseys ([HF](https://huggingface.co/Codeseys)) | |
| > **Repository:** [`Codeseys/composer-replication-framework`](https://huggingface.co/Codeseys/composer-replication-framework) | |
| > **License:** MIT (methodology + code); upstream papers and code retain their respective licenses | |
| ## Abstract | |
| Cursor's Composer 2.5 is a post-trained Kimi K2.5 that reportedly achieves frontier agentic-coding performance (~69% Terminal-Bench 2.0, parity with GPT-5.5) at 5–10× lower serving cost than peers. By community estimates (see §2.4), about 85% of the recipe's compute goes to post-training. Its central non-obvious technique, *Targeted RL with Textual Feedback*, is mathematically the same construct as the published method **SDPO / OPSD** (Hübotter et al. 2026; Zhao et al. 2026), which Cursor's own footnote cites and for which **MIT-licensed reference code already exists**. | |
| Building on this, we propose a complementary novel reward channel: **multi-teacher trace-replay distillation (TR-DPO)**. After a frozen agentic rollout, replay each step under N pre-trained external teachers (e.g., Claude Opus, GPT-5, DeepSeek V4 Pro), extract DPO preference pairs from teacher–student disagreement, and add the resulting loss term *on top of* both RLVR and SDPO. The three channels do not compete for shared resources and ablate cleanly via independent weights. | |
| We make three pre-experimental contributions: (1) an audited mapping of Cursor's Composer 2.5 blog onto a stack of open infrastructure (TRL / VeRL / OpenEnv / Monarch) with primary-source-verified extension points; (2) an empirical economic feasibility result showing the trace-replay channel costs ~$0.98 per 50-step trace ungated (spike 001, n=150 calls, 0 errors); (3) a working code skeleton with 38 passing unit tests, including an end-to-end gradient-step smoke test on a tiny custom model that empirically verifies the three-channel composition trains without divergence. | |
| This paper deliberately stops short of training-result claims. Section 7 is explicit about what's *not* yet validated and what each follow-up spike will measure. | |
| ## 1. Introduction and motivation | |
| ### 1.1 What Composer 2.5 demonstrates | |
| In their May 2026 release post, Cursor announced [Composer 2.5](https://cursor.com/blog/composer-2-5), a post-trained version of [Moonshot's Kimi K2.5](https://huggingface.co/moonshotai/Kimi-K2-Thinking) that powers Cursor's agentic coding mode. Claims attributed to the release (several come from secondary commentary rather than the blog itself; §2.4 separates the two): | |
| - Substantial improvement over Composer 2 on long-horizon coding tasks | |
| - Frontier-level performance: parity with GPT-5.5 on SWE-bench Multilingual; ~69% on Terminal-Bench 2.0 | |
| - Pricing of $0.50/$2.50 per million input/output tokens — 5–10× cheaper than Opus 4.6 ($5/$25) and GPT-5.4 ($5/$22.50) | |
| - Dominant compute share goes to post-training, not pretraining (community estimate: ~85%) | |
| The blog discloses three training innovations (Section 2 expands each): | |
| 1. **Targeted RL with Textual Feedback**: a per-turn distillation loss that addresses long-horizon credit assignment. | |
| 2. **Synthetic data at 25× scale**: 25× more synthetic tasks than Composer 2; Feature Deletion is the only generator the blog names. | |
| 3. **Sharded Muon + Dual Mesh HSDP**: MoE optimizer infrastructure. | |
| If a small team can reproduce the *shape* of (1) and (2) without K2.5's 1T scale, the path is open to similar performance on smaller bases. Item (3) is MoE-specific infrastructure and irrelevant for dense-base reproductions. | |
| ### 1.2 What's missing from the public recipe | |
| The blog leaves three significant gaps: | |
| - **How are hints generated?** The blog gives one tool-call template ("Reminder: available tools are…") but says nothing about the generator architecture. Hardcoded templates? A separate model? The same model with an introspection prompt? This is the single largest reproducibility gap. | |
| - **What RL algorithm sits underneath?** The targeted-textual-feedback method is described as "an on-policy distillation KL loss [added to] the broader RL objective over the full trajectory." The "broader RL objective" is unspecified. | |
| - **What reward-hacking safeguards work in practice?** Cursor explicitly mentions failure modes (decompiling Java bytecode, reverse-engineering Python type-checking caches) without disclosing the mitigations beyond "agentic monitoring tools." | |
| ### 1.3 What we propose | |
| This paper makes a **methodological** contribution: an open-source replication framework that integrates three reward channels in a single trainer step, on top of any HuggingFace base model, using off-the-shelf open-source infrastructure (TRL or VeRL + OpenEnv). | |
| The three channels: | |
| 1. **RLVR (Channel 1)** — verifiable scalar reward (tests pass, build succeeds). Standard. | |
| 2. **Composer hint-distill = SDPO (Channel 2)** — single-model self-distillation with hint-conditioned context, lifted from `siyan-zhao/OPSD` (MIT). This is Cursor's published method. | |
| 3. **Trace-replay multi-teacher DPO (Channel 3, novel)** — N external teachers replay each step of frozen rollouts; teacher–student disagreement becomes a DPO preference signal. | |
| Channels 2 and 3 are mechanistically *different*, not competing. Both bypass long-horizon credit assignment, but they tap different supervision sources. Section 4 makes the distinction precise. | |
| We provide: | |
| - A primary-source-audited integration architecture across the agentic-RL stack (Section 3, [`docs/INTEGRATION_ARCHITECTURE.md`](../docs/INTEGRATION_ARCHITECTURE.md)) | |
| - A working code skeleton with 38 passing unit tests (Section 6, [`spikes/005-integrated-trainer-skeleton/`](../spikes/005-integrated-trainer-skeleton/)) | |
| - An empirical economic feasibility verdict for Channel 3 (Section 5, [`spikes/001-teacher-replay-cost/verdict.md`](../spikes/001-teacher-replay-cost/verdict.md)) | |
| - A risk-ordered spike plan whose terminal experiment falsifies or validates the novel claim (Section 7, [`spikes/README.md`](../spikes/README.md)) | |
| We do **not** yet provide training results. That is deferred to a follow-up paper after spike 002 (trace collection on a real agentic environment), spike 003 (DPO-pair signal density on real traces), and spike 004 (A/B comparison on SWE-bench-lite). | |
| ## 2. Composer 2.5 in technical detail (audited) | |
| This section reconstructs Cursor's published recipe with explicit `[BLOG-VERIFIED]` / `[INFERRED]` / `[EXTRAPOLATED]` tags, following the audit in [`docs/COMPOSER_RECIPE_MAPPING.md`](../docs/COMPOSER_RECIPE_MAPPING.md). The distinction matters because the initial parallel-research synthesis for this project blurred which claims came from the blog versus secondary sources, and a primary-source audit caught several extrapolations. | |
| ### 2.1 Targeted RL with Textual Feedback `[BLOG-VERIFIED]` | |
| > "For a target model message, we construct a short hint describing the desired improvement, insert that hint into the local context, and use the resulting model distribution as a teacher. We use the policy with the original context as the student and add an on-policy distillation KL loss that moves the student's token probabilities toward the teacher's." — Cursor blog | |
| The mechanism, exactly: | |
| - **Same model** acts as both teacher and student. There is no second model. | |
| - The teacher is "the policy at this turn, *with* a hint inserted into the context." | |
| - The student is "the policy at this turn, *without* the hint." | |
| - Loss = on-policy KL between next-token distributions (not raw logits): `KL(p_teacher(· | context + hint) || p_student(· | context))`, applied **only at the problematic turn**. | |
| - Sits *on top of* an outer RLVR objective; doesn't replace it. | |
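As a rough illustration of the per-turn loss described above, the sketch below computes the mean token-level KL between a hint-conditioned teacher distribution and the unhinted student distribution, restricted to a masked turn. It is a minimal pure-Python sketch, not Cursor's or OPSD's implementation: the function names, the list-of-logit-vectors input format, and the `turn_mask` argument are assumptions made for illustration. In a real trainer both sets of logits would come from the same model (two forward passes), and the teacher side would typically be treated as a constant.

```python
import math

def log_softmax(logits):
    """Log-probabilities for one vector of logits (log-sum-exp for stability)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def hint_distill_loss(teacher_logits, student_logits, turn_mask):
    """Mean KL over the token positions where turn_mask is 1 (hypothetical mask)."""
    per_token = [
        kl_divergence(t, s)
        for t, s, keep in zip(teacher_logits, student_logits, turn_mask)
        if keep
    ]
    return sum(per_token) / len(per_token) if per_token else 0.0

# Toy example: three token positions over a 3-word vocabulary.
student = [[2.0, 0.5, 0.1], [0.3, 0.3, 0.3], [1.0, 2.0, 0.0]]
teacher = [[2.0, 0.5, 0.1], [0.1, 3.0, 0.1], [1.0, 2.0, 0.0]]  # differs only at position 1

loss_at_error_turn = hint_distill_loss(teacher, student, turn_mask=[0, 1, 0])
loss_elsewhere = hint_distill_loss(teacher, student, turn_mask=[1, 0, 1])
print(loss_at_error_turn, loss_elsewhere)
```

Masking to the problematic turn means positions where the hint changes nothing contribute no loss, which is the sense in which the signal is "targeted".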
| Cursor's footnote 1 cites three self-distillation papers as background: | |
| - **OPSD: Self-Distilled Reasoner — On-Policy Self-Distillation for LLMs** (Zhao et al., [arXiv:2601.18734](https://arxiv.org/abs/2601.18734); code at [`siyan-zhao/OPSD`](https://github.com/siyan-zhao/OPSD), MIT). Single LLM as both teacher and student; teacher conditioned on privileged information (e.g., a verified solution), student sees only the question. | |
| - **SDPO: Reinforcement Learning via Self-Distillation** (Hübotter et al., [arXiv:2601.20802](https://arxiv.org/abs/2601.20802), ICLR 2026 Workshop on Scaling Post-training). Generalizes OPSD to RL with rich textual feedback. Quote from the abstract: *"SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy."* This is mathematically the same construct as Composer's targeted-textual-feedback. | |
| - **Self-Distillation Enables Continual Learning** ([arXiv:2601.19897](https://arxiv.org/abs/2601.19897)). | |
| The OPSD reference implementation provides a self-contained `generalized_jsd_loss` static method (signature in [Section 6](#6-implementation)) computing JSD/KL between student and teacher logits — directly liftable into any HF Trainer subclass. | |
| ### 2.2 Synthetic data at 25× scale `[BLOG-VERIFIED]` | |
| > "Composer 2.5 is trained with 25× more synthetic tasks than Composer 2." — Cursor blog | |
| One named generator: **Feature Deletion**. Take a repo with comprehensive tests; delete code such that the codebase remains functional but specific testable features are removed. The agent's task is to reimplement the deleted features so the tests pass. Tests = verifiable reward. | |
| The blog also discloses observed reward-hacking failures: the model learning to decompile Java bytecode to reconstruct deleted APIs, and reverse-engineering Python type-checking caches to recover deleted function signatures. Mitigations are described only as "agentic monitoring tools" — opaque. | |
| ### 2.3 Sharded Muon + Dual Mesh HSDP `[BLOG-VERIFIED]` | |
| MoE optimizer infrastructure. Per-attention-head and per-expert orthogonalization (Newton-Schulz) with asynchronous all-to-all communication. Two HSDP layouts: narrow (intra-node) for non-expert weights, wide for expert weights. | |
| This is **infrastructure**, not algorithm, and is relevant only at MoE-1T scale (Kimi K2.5). For the dense-base reproductions described here (Qwen3-7B in the v0.0 spikes, Qwen3-32B in v0.1), it is irrelevant. It becomes relevant only if the framework is later applied directly to a Kimi K2.5 derivative. | |
| ### 2.4 Claims that are NOT in the Cursor blog (extrapolated) | |
| These claims appear in community commentary and were reproduced uncritically in the initial synthesis for this project. Secondary sources suggest they are likely correct, but Cursor's blog does not state them: | |
| - **"~85% of total compute is post-training"** — community consensus (HN threads, third-party substack analysis). Plausible but unverified. | |
| - **"Anyrun" environment harness with LSP / file I/O / terminal** — name "Anyrun" is not in the 2.5 blog (may be in the [Composer 2 technical report](https://cursor.com/blog/composer-2-technical-report)). | |
| - **CursorBench 69.3%, Terminal-Bench 2.0 parity, SWE-bench Multilingual** — the 2.5 blog does not quote benchmark numbers. | |
| - **"PPO or GRPO variant"** as the underlying RL algorithm — the blog never names the RL algorithm. | |
| - **MLA + 1T total / 32B active + 384 experts + 256K context** — these are Kimi K2.5 base model facts, [verified independently](https://huggingface.co/moonshotai) but inferred rather than blog-stated. | |
| The blog is unambiguous on the three items in §2.1–§2.3 and otherwise terse. | |
| ## 3. Integration architecture across the agentic-RL stack | |
| The three reward channels need to compose inside a single trainer step, regardless of which RL framework hosts them. We audited (via [DeepWiki](https://deepwiki.com/) on 2026-05-25) the extension points of the major open-source agentic-RL frameworks and produced an integration matrix. | |
| ### 3.1 Frameworks surveyed | |
| | Framework | Role | Status | | |
| |---|---|---| | |
| | [HuggingFace TRL](https://github.com/huggingface/trl) | Reference algorithm library; `GRPOTrainer` is the workhorse for RLVR | Mature (v1.0, 2026-03); developer-friendly; OpenEnv integration since 2025-10 | | |
| | [ByteDance VeRL](https://github.com/volcengine/verl) | Production-scale RL via HybridFlow + 3D-HybridEngine; Ray-based | Proven 671B; preferred for ≥70B runs | | |
| | [Meta TorchForge](https://github.com/meta-pytorch/forge) | RL post-training on Monarch; reference recipes | **"Development paused — consolidating into TorchTitan"** (banner). Use as pattern reference only. | | |
| | [Meta Monarch](https://github.com/meta-pytorch/monarch) | Single-controller actor-mesh framework with RDMA data plane | Active; native PyTorch integration | | |
| | [Meta OpenEnv](https://github.com/meta-pytorch/OpenEnv) | Standard for agentic environments (typed reset/step/close + MCP RFC) | First-party TRL integration; HF Hub catalog; growing community | | |
| | [Prime Intellect PRIME-RL](https://github.com/PrimeIntellect-ai/prime-rl) | Decentralized RL substrate (INTELLECT-2, 32B globally distributed) | Production-deployed; `verifiers` env library | | |
| ### 3.2 Extension-point matrix (verified) | |
| The full table is in [`docs/INTEGRATION_ARCHITECTURE.md`](../docs/INTEGRATION_ARCHITECTURE.md). Summary: | |
| | Channel | TRL | VeRL | | |
| |---|---|---| | |
| | **1. RLVR/GRPO** | `GRPOTrainer._compute_loss(model, inputs)` (base behavior) | `@register_adv_est("grpo")` → `core_algos.compute_grpo_outcome_advantage` | | |
| | **2. SDPO** | Subclass override of `_compute_loss`; lift `generalized_jsd_loss` from OPSD | New `@register_adv_est("grpo_sdpo")`; reads `data.batch["sdpo_teacher_logprobs"]`; precedent: distillation already attaches `teacher_log_probs` | | |
| | **3. TR-DPO** | Subclass override; add DPO term using `inputs["dpo_chosen_input_ids"]`, etc. | Custom estimator reading `data.non_tensor_batch["teacher_actions"]` | | |
| | **OpenEnv plumbing** | `environment_factory=` kwarg in trainer init | Custom env worker producing `DataProto`-shaped output | | |
| Both paths allow the integrated trainer to be assembled out of small components — none of the three channels requires modifying the framework's core. The full unified loss is: | |
| ``` | |
| total_loss = grpo_loss + α · sdpo_kl_loss + β · trace_replay_dpo_loss | |
| ``` | |
| where `α = β = 0` recovers plain GRPO (the baseline arm of any ablation). | |
| ### 3.3 Why this matters | |
| The architectural constraint that lets all three channels co-exist: **they don't compete for shared resources.** | |
| | Resource | Channel 1 (RLVR) | Channel 2 (SDPO) | Channel 3 (TR-DPO) | | |
| |---|---|---|---| | |
| | GPU forward pass (rollout) | yes (vLLM, async) | no | no | | |
| | GPU forward pass (training) | yes | one extra per error site (~5% of tokens) | none — uses precomputed logprobs | | |
| | GPU backward pass | yes | yes | yes | | |
| | External API budget | none | none | $0.30–1 per 50-step trace | | |
| | Latency-critical path | yes — gates next rollout | minor | no — async, post-rollout | | |
| Channel 2 is forward-pass-bound (training-side, sparse). Channel 3 is API-bound (offline, post-rollout). They don't fight for the same compute. | |
| ## 4. The novel channel: multi-teacher trace-replay distillation | |
| ### 4.1 Construction | |
| After a frozen agentic rollout (state\_t, action\_t, reward), for each step `t`: | |
| 1. **Replay** the exact state at step `t` against `N` pre-trained external teachers (different model families). | |
| 2. **Extract** each teacher's chosen action (or action distribution) at that state. | |
| 3. **Score** the disagreement: if `k ≥ k_threshold` of `N` teachers agree on action `X` and the student picked `Y ≠ X`, emit a DPO preference pair `(chosen=X, rejected=Y)`. | |
| 4. **Train** with standard DPO loss on the pair set, layered onto GRPO + (optionally) SDPO. | |
| This is offline relative to the rollout — teacher API calls happen post-rollout, after which the DPO pairs are batched into the next training step. No latency-critical coupling. | |
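Steps 2 and 3 above can be sketched as a small consensus filter. This is a minimal illustration, not the repository's `teacher_replay.py`: the `PreferencePair` dataclass, the `consensus_pairs` function, and the representation of actions as plain strings are assumptions made here, and exact string equality stands in for whatever action-matching rule a real implementation would use.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    step: int
    chosen: str    # action the teacher consensus preferred
    rejected: str  # action the student actually took

def consensus_pairs(student_actions, teacher_actions, k_threshold):
    """Emit a pair wherever at least k_threshold teachers agree on an action
    that differs from the student's action at the same step."""
    pairs = []
    for step, (student, votes) in enumerate(zip(student_actions, teacher_actions)):
        if not votes:
            continue
        # With k_threshold > N/2 the consensus action is unique; for lower
        # thresholds, ties resolve to the first action encountered.
        action, count = Counter(votes).most_common(1)[0]
        if count >= k_threshold and action != student:
            pairs.append(PreferencePair(step, chosen=action, rejected=student))
    return pairs

student_actions = ["ls", "cat a.py", "pytest"]
teacher_actions = [
    ["ls", "ls", "ls"],                  # consensus equals the student: no pair
    ["grep foo", "grep foo", "cat a.py"],  # 2 of 3 disagree with the student: pair
    ["pytest", "make test", "tox"],      # no consensus: no pair
]
pairs = consensus_pairs(student_actions, teacher_actions, k_threshold=2)
print(pairs)
```

Raising `k_threshold` trades pair volume for pair quality, which is the quantity spike 003 is meant to measure on real traces.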
| ### 4.2 Distinction from related work | |
| | Work | Mechanism | Difference from TR-DPO | | |
| |---|---|---| | |
| | [rStar / rStar-Math](https://arxiv.org/abs/2408.06195) (Microsoft) | MCTS at training time, single teacher branches at each step | TR-DPO replays pre-existing traces (not MCTS); uses N teachers (not single). | | |
| | [Math-Shepherd / OmegaPRM](https://arxiv.org/abs/2312.08935) | Process reward models from rollout-and-check | Reward signal is teacher *disagreement*, not rollout outcomes. | | |
| | [Magpie](https://arxiv.org/abs/2406.08464) / OpenThoughts | Synthetic data from one strong teacher | Per-step distillation from N teachers on real traces. | | |
| | [Mixture-of-Agents](https://arxiv.org/abs/2406.04692) (Wang et al.) | Multi-teacher *response-level* aggregation | Per-step (sub-response) aggregation. | | |
| | **Composer SDPO / OPSD** | Single-model self-teacher with hint context | TR-DPO uses N *external* teachers; complementary, not competing. | | |
| To our knowledge no published work systematically replays each step of frozen agentic traces with multiple external teachers to harvest step-level supervision. The trace-replay-with-N-teachers construction appears to be open territory. | |
| ### 4.3 Stacking with SDPO | |
| | Property | Composer SDPO | TR-DPO (this work) | | |
| |---|---|---| | |
| | Number of models | 1 | N + 1 | | |
| | Teacher source | Same model with privileged context | External pretrained models | | |
| | Per-step compute | One extra forward pass | None (precomputed) | | |
| | Per-step API cost | Zero | ~$0.02 (3-teacher, ungated) | | |
| | Privileged signal | Hint text in context | None — teachers see same state | | |
| | Bypasses long-horizon credit assignment | Yes (per-turn KL) | Yes (per-step DPO) | | |
| | Published code | Yes — `siyan-zhao/OPSD` | Not yet | | |
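| Both add dense per-step signal on top of RLVR. Each enters the objective as a separate loss term with its own weight, so the two can be weighted and ablated independently; because both still update the same parameters, any interaction between them remains an empirical question. | |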
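For reference, the "standard DPO loss" that Channel 3 applies to each preference pair can be written for a single pair as below. This is a minimal pure-Python sketch of the textbook objective, operating on summed sequence log-probabilities passed in as floats. Two points are assumptions rather than statements from this paper: the reference log-probabilities are taken to come from a frozen copy of the student, and `dpo_temperature` is DPO's own scaling coefficient, which is distinct from the channel weight β in the unified loss.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, dpo_temperature=0.1):
    """-log(sigmoid(margin)) for one (chosen, rejected) pair.

    margin = dpo_temperature * [(log pi(chosen) - log pi_ref(chosen))
                                - (log pi(rejected) - log pi_ref(rejected))]
    """
    margin = dpo_temperature * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: zero margin, loss is log(2).
neutral = dpo_loss(-12.0, -9.0, -12.0, -9.0)
# Policy has shifted probability toward the teacher-preferred action.
improved = dpo_loss(-10.0, -11.0, -12.0, -9.0)
print(neutral, improved)
```

The loss falls as the policy raises the chosen action's log-probability relative to the rejected one, measured against the reference.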
| ## 5. Empirical results so far | |
| ### 5.1 Spike 001 — teacher-replay cost floor (✅ VALIDATED) | |
| **Question:** Given a 50-step agentic-coding trace, what's the API cost and wallclock latency of querying N=3 frontier teachers in parallel for next-action distributions at every step? | |
| **Method.** We hand-crafted 50 synthetic SWE-bench-lite-shaped agentic states (multi-turn function-call decision points; ~250–500 tokens of context each). For each state we issued parallel async requests to three teachers via OpenRouter: | |
| - `anthropic/claude-opus-4.7` (Anthropic frontier) | |
| - `openai/gpt-5` (OpenAI frontier) | |
| - `deepseek/deepseek-v4-pro` (open-weight frontier) | |
| Total: 150 calls. Hard-cap at $20 spend (early abort on overrun). | |
| **Result.** | |
| | Metric | Target | Actual | Pass? | | |
| |---|---|---|---| | |
| | Mean per-trace cost (50 steps × 3 teachers, ungated) | < $5 | **$0.98** | ✅ 5× headroom | | |
| | p95 step latency (max across 3 parallel teachers) | < 30 s | **20.45 s** | ✅ | | |
| | p99 step latency | < 60 s | **23.24 s** | ✅ | | |
| | Errors | 0 expected | **0 / 150** | ✅ | | |
| **Per-teacher breakdown.** | |
| | Teacher | n | p50 lat | p95 lat | mean $/call | total $ | | |
| |---|---|---|---|---|---| | |
| | `anthropic/claude-opus-4.7` | 50 | 3.4 s | 4.6 s | $0.0161 | $0.81 | | |
| | `openai/gpt-5` | 50 | 5.0 s | 10.1 s | $0.0021 | $0.11 | | |
| | `deepseek/deepseek-v4-pro` | 50 | 7.1 s | 16.2 s | $0.0013 | $0.07 | | |
| Opus dominates per-trace cost (~83%). With the value-of-information (VOI) gating planned for v0.1 (query teachers only when student entropy is high; an expected 60–80% reduction in calls), projected per-trace cost falls to ~$0.30. Opus could optionally be dropped or replaced with Sonnet 4.6 for further savings. | |
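The arithmetic behind these figures can be checked directly from the per-teacher table. The sketch below uses only the mean cost-per-call values reported above; the `query_fraction` parameter is a hypothetical stand-in for the share of steps that survive gating, and it assumes gating removes calls uniformly across teachers.

```python
# Mean cost per call, taken from the per-teacher breakdown table.
MEAN_COST_PER_CALL = {
    "anthropic/claude-opus-4.7": 0.0161,
    "openai/gpt-5": 0.0021,
    "deepseek/deepseek-v4-pro": 0.0013,
}
STEPS_PER_TRACE = 50

def per_trace_cost(cost_per_call, steps, query_fraction=1.0):
    """Cost of querying every teacher at `query_fraction` of the steps."""
    return sum(cost_per_call.values()) * steps * query_fraction

ungated = per_trace_cost(MEAN_COST_PER_CALL, STEPS_PER_TRACE)
opus_share = MEAN_COST_PER_CALL["anthropic/claude-opus-4.7"] * STEPS_PER_TRACE / ungated

# A 60-80% reduction in calls means 40-20% of steps are still queried.
gated_high = per_trace_cost(MEAN_COST_PER_CALL, STEPS_PER_TRACE, query_fraction=0.4)
gated_mid = per_trace_cost(MEAN_COST_PER_CALL, STEPS_PER_TRACE, query_fraction=0.3)
gated_low = per_trace_cost(MEAN_COST_PER_CALL, STEPS_PER_TRACE, query_fraction=0.2)

print(f"ungated ${ungated:.3f}, Opus share {opus_share:.1%}")
print(f"gated ${gated_low:.3f} to ${gated_high:.3f} (midpoint ${gated_mid:.3f})")
```

The ungated total comes to about $0.975, which rounds to the reported $0.98, and the 70%-reduction midpoint lands near the projected $0.30.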
| The economic floor is well within budget. **Channel 3 is viable.** | |
| Full data at [`spikes/001-teacher-replay-cost/`](../spikes/001-teacher-replay-cost/) (synthesize_trace.py + replay.py + analyze.py + verdict.md). Code is reproducible — set `OPENROUTER_API_KEY` and run `python synthesize_trace.py && python replay.py && python analyze.py`. | |
| ### 5.2 Spike 005 — integration architecture (✅ COMPOSITION-VERIFIED) | |
| **Question:** Does the proposed three-channel integration (RLVR + SDPO + TR-DPO) compose cleanly in a real PyTorch trainer? Specifically: are the loss terms additive without unwanted interactions; do α/β=0 ablations correctly recover plain GRPO; and does a multi-channel run actually decrease loss on a tiny model? | |
| **Method.** A code skeleton at [`spikes/005-integrated-trainer-skeleton/`](../spikes/005-integrated-trainer-skeleton/) implements: | |
| - **`opsd_loss.py`** — `generalized_jsd_loss` lifted verbatim from `siyan-zhao/OPSD` (MIT). Self-contained static method computing JSD/KL between student and teacher logits. | |
| - **`teacher_replay.py`** — parallel OpenRouter client + DPO-pair extractor (from teacher–student disagreement at the agreement threshold). | |
| - **`hint_generator.py`** — template-based hint dispatcher keyed by error_kind (v0.1 starter; LLM-driven hints in v0.2). | |
| - **`trl_path/data_collator.py`** — `ComposerDataCollator` transforming raw trace + DPO pairs into the trainer batch dict. | |
| - **`trl_path/composer_trainer.py`** — `ComposerReplicationTrainer(GRPOTrainer)` with `_compute_loss` override. | |
| - **`verl_path/composer_adv.py`** — `@register_adv_est("grpo_composer")` for VeRL. | |
| **Result: 38/38 unit tests pass in 3.43 s** (`python3 -m pytest tests/ -v`). | |
| | Test module | Tests | Status | | |
| |---|---|---| | |
| | `test_opsd_loss.py` (lifted OPSD math) | 9 | ✅ all pass | | |
| | `test_teacher_replay.py` (DPO-pair extraction) | 7 | ✅ all pass | | |
| | `test_data_collator.py` (raw trace → batch) | 15 | ✅ all pass | | |
| | `test_loss_composition_smoke.py` (3-channel + ablation + 5-step train) | 7 | ✅ all pass | | |
| The composition smoke test runs all three channels on a `TinyLM(vocab=64, hidden=32)` (~10K parameters). It verifies: | |
| - **Ablation invariants:** `α=0, β=0` reduces exactly to GRPO; α-only adds SDPO; β-only adds DPO; full = sum. | |
| - **Gradient finiteness:** Every model parameter receives a finite gradient with all three channels active. | |
| - **Training behavior:** A 5-step train run with all three channels active *decreases* total loss (an overfitting check on a fixed batch), so the channels do not visibly conflict at this toy scale. | |
| - **Robustness to mixed batches:** When the data collator emits no SDPO inputs (no error sites in batch), the loss correctly bypasses the SDPO term even with α=1. | |
| This empirically tests the integration claim that was purely architectural in earlier draft material. | |
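The ablation and mixed-batch invariants listed above can be stated compactly on scalar losses. The following is a simplified sketch, not the code in `test_loss_composition_smoke.py`: the `total_loss` helper is hypothetical, losses are plain floats rather than tensors, and `None` is used here to represent a batch in which a channel has no inputs.

```python
from typing import Optional

def total_loss(grpo_loss: float,
               sdpo_kl_loss: Optional[float],
               replay_dpo_loss: Optional[float],
               alpha: float,
               beta: float) -> float:
    """grpo + alpha * sdpo + beta * replay, skipping any channel that is
    switched off (weight 0) or has no inputs in this batch (None)."""
    total = grpo_loss
    if alpha != 0.0 and sdpo_kl_loss is not None:
        total += alpha * sdpo_kl_loss
    if beta != 0.0 and replay_dpo_loss is not None:
        total += beta * replay_dpo_loss
    return total

grpo, sdpo, replay = 1.25, 0.40, 0.70

baseline = total_loss(grpo, sdpo, replay, alpha=0.0, beta=0.0)     # plain GRPO
sdpo_only = total_loss(grpo, sdpo, replay, alpha=0.1, beta=0.0)
replay_only = total_loss(grpo, sdpo, replay, alpha=0.0, beta=0.05)
full = total_loss(grpo, sdpo, replay, alpha=0.1, beta=0.05)
no_error_sites = total_loss(grpo, None, replay, alpha=1.0, beta=0.0)  # SDPO bypassed
print(baseline, sdpo_only, replay_only, full, no_error_sites)
```

Expressed this way, the full loss equals the baseline plus the two single-channel increments, which is the additivity property the smoke test checks.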
| ### 5.3 What spike 005 does NOT prove | |
| The smoke test is on a 10K-parameter model with placeholder GRPO loss (cross-entropy on a synthetic target sequence) instead of the real GRPO advantage / group-relative computation. It demonstrates **wiring correctness**, not training quality. The real training experiments are in spikes 002–004. | |
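For context, the group-relative computation that the placeholder stands in for is, in its textbook form, a per-group normalization of rewards. The sketch below is a generic illustration and not TRL's or VeRL's code: it assumes one scalar reward per rollout, a single group of rollouts sharing a prompt, and the population standard deviation (implementations differ on that last choice).

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group:
    advantage_i = (reward_i - mean(group)) / (std(group) + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for the same prompt: two pass the tests, two fail.
mixed_group = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Every rollout gets the same reward: no learning signal from this group.
uniform_group = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(mixed_group, uniform_group)
```

Groups with identical rewards produce zero advantage everywhere, which is one reason dense per-step channels are attractive when outcome rewards are sparse.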
| ## 6. Implementation | |
| ### 6.1 Lifted SDPO loss | |
| Verbatim port from `siyan-zhao/OPSD` (DeepWiki-verified self-contained, MIT licensed): | |
| ```python | |
| import torch | |
| def generalized_jsd_loss( | |
|     student_logits: torch.Tensor,        # (B, T, V) | |
|     teacher_logits: torch.Tensor,        # (B, T, V); same model, hint-conditioned context | |
|     labels: torch.Tensor | None = None,  # positions labeled -100 are ignored | |
|     beta: float = 0.5,                   # 0 = forward KL, 1 = reverse KL, 0.5 = JSD | |
|     temperature: float = 1.0, | |
|     reduction: str = "batchmean", | |
|     top_k: int | None = None, | |
|     token_clip: float | None = None, | |
| ) -> torch.Tensor: ... | |
| ``` | |
| Full implementation at [`spikes/005-integrated-trainer-skeleton/opsd_loss.py`](../spikes/005-integrated-trainer-skeleton/opsd_loss.py). | |
| ### 6.2 TRL trainer subclass | |
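| ```python | |
| from trl import GRPOTrainer | |
| class ComposerReplicationTrainer(GRPOTrainer): | |
|     """GRPO plus two optional auxiliary loss channels.""" | |
|     def __init__(self, *args, alpha_sdpo=0.1, beta_replay=0.05, **kwargs): | |
|         super().__init__(*args, **kwargs) | |
|         self.alpha_sdpo = alpha_sdpo    # weight on the SDPO hint-distillation term | |
|         self.beta_replay = beta_replay  # weight on the trace-replay DPO term | |
|     def _compute_loss(self, model, inputs): | |
|         # Channel 1: base GRPO loss from the parent class. | |
|         grpo_loss = super()._compute_loss(model, inputs) | |
|         # Channels 2 and 3: helper methods defined in the full implementation. | |
|         sdpo_kl = self._compute_sdpo_loss(model, inputs) | |
|         replay_dpo = self._compute_trace_replay_loss(model, inputs) | |
|         return grpo_loss + self.alpha_sdpo * sdpo_kl + self.beta_replay * replay_dpo | |
| ``` | |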
| Full implementation at [`spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py`](../spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py). | |
| ### 6.3 VeRL custom advantage estimator | |
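| ```python | |
| from verl.trainer.ppo import core_algos | |
| from verl.trainer.ppo.core_algos import register_adv_est | |
| @register_adv_est("grpo_composer") | |
| def compute_grpo_composer_advantage(token_level_rewards, eos_mask, index, *, | |
|                                     sdpo_teacher_logprobs=None, alpha_sdpo=0.0, | |
|                                     teacher_consensus_prm=None, beta_replay=0.0, | |
|                                     **kwargs): | |
|     # Channel 1: group-relative outcome advantage; VeRL returns (advantages, returns). | |
|     base_adv, returns = core_algos.compute_grpo_outcome_advantage(token_level_rewards, eos_mask, index) | |
|     # Channel 2: teacher-vs-old-policy log-prob gap, applied only at masked error sites. | |
|     if alpha_sdpo and sdpo_teacher_logprobs is not None: | |
|         base_adv = base_adv + alpha_sdpo * (sdpo_teacher_logprobs - kwargs["old_log_prob"]) * kwargs["sdpo_error_mask"] | |
|     # Channel 3: precomputed per-token teacher-consensus signal. | |
|     if beta_replay and teacher_consensus_prm is not None: | |
|         base_adv = base_adv + beta_replay * teacher_consensus_prm | |
|     return base_adv, returns | |
| ``` | |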
| Full implementation at [`spikes/005-integrated-trainer-skeleton/verl_path/composer_adv.py`](../spikes/005-integrated-trainer-skeleton/verl_path/composer_adv.py). | |
| ### 6.4 Data collator | |
| `ComposerDataCollator` consumes a list of `TraceExample` dicts (with optional `dpo_pairs` from `teacher_replay.extract_dpo_pairs`) and emits the batch dict the trainer expects: | |
| - Channel 1: `input_ids`, `attention_mask`, `response_mask`, `rewards` | |
| - Channel 2: `ctx_teacher_input_ids` (with hint inserted at error-turn boundary), `sdpo_loss_mask` (1 at post-hint tokens, -100 elsewhere) | |
| - Channel 3: `dpo_chosen_input_ids`, `dpo_rejected_input_ids`, `*_response_mask` | |
| Full implementation at [`spikes/005-integrated-trainer-skeleton/trl_path/data_collator.py`](../spikes/005-integrated-trainer-skeleton/trl_path/data_collator.py). | |
| ## 7. What's NOT proven (and how the follow-up spikes will measure it) | |
| This paper is explicit about the gap between "framework + economic feasibility" and "this method works." The following claims **are not** yet validated: | |
| | Claim | Status | Validating spike | | |
| |---|---|---| | |
| | TR-DPO improves SWE-bench-lite pass@1 over plain GRPO | Open | **Spike 004**: A/B Qwen3-7B trained with GRPO vs GRPO+TR-DPO; ≥2 pt pass@1 improvement at p<0.05 over 3 seeds is the success criterion. ~$300 GPU + $50 eval. | | |
| | Teacher disagreement at the step level carries non-trivial preference signal on real traces | Open | **Spike 003**: extract DPO pairs from spike-002 traces; report pairs/trace, KL distance from random pairs. | | |
| | TRL's GRPOTrainer with `environment_factory=` cleanly emits trace JSONL | Open | **Spike 002a**: 100 rollouts on Qwen3-7B + SWE-bench-lite via TRL. | | |
| | PRIME-RL produces equivalently clean trace export | Open | **Spike 002b**: head-to-head with 002a. | | |
| | The trained variant matches Composer 2.5 quality at 32B scale | Open (v0.1) | Out of scope for this paper. v0.1 follow-up. | | |
| **Honest framing for what this paper does and does not contribute:** | |
| - ✅ **Contributes:** an integration architecture verified at the code level, with reusable components, public code, and economic feasibility for the novel channel. A reviewer can use this paper to design and budget a real experiment. | |
| - ❌ **Does not contribute:** evidence that the method actually trains better models. That requires the GPU spike. We are deliberately not making that claim until experiments back it. | |
| ## 8. Reward-hacking safeguards (proposed for v0.1) | |
| Cursor's blog mentions specific reward-hacking failures (Java bytecode decompilation, Python type-cache reverse-engineering) without disclosing mitigations. We propose: | |
| - **Sandbox hardening**: disable `find`, `unzip`, `strings`, `objdump`, and similar introspection tools in the OpenEnv container; clear `__pycache__` between rollouts; randomize `PYTHONHASHSEED`. | |
| - **Static-analysis monitor**: flag rollouts that read or write paths matching `__pycache__/`, `*.pyc`, or `*.class`, or that run unzip operations. | |
| - **Reward-model penalty** — train a small RM on annotated reward-hacking examples; subtract its score from the RLVR signal. | |
| These are pre-experiment proposals; their efficacy is part of the v0.1 follow-up. | |
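A first cut of the static-analysis monitor could be a token-level scan of each shell command in a rollout. The sketch below is illustrative only: the `flag_command` function, the regular expression, and the command set are assumptions derived from the path patterns listed above, not an existing component of the repository.

```python
import re
import shlex

# Paths inside __pycache__, or files ending in .pyc / .class.
SUSPICIOUS_PATH = re.compile(r"(?:^|/)__pycache__(?:/|$)|\.pyc$|\.class$")
SUSPICIOUS_COMMANDS = frozenset({"unzip"})

def flag_command(command):
    """Return the reasons a shell command looks like reward hacking (empty if clean)."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes and similar: surface it rather than guess.
        return ["unparseable command"]
    reasons = []
    for token in tokens:
        if token in SUSPICIOUS_COMMANDS:
            reasons.append(f"command: {token}")
        elif SUSPICIOUS_PATH.search(token):
            reasons.append(f"path: {token}")
    return reasons

rollout = [
    "pytest -q tests/",
    "cat src/__pycache__/utils.cpython-312.pyc",
    "unzip build/app.jar -d /tmp/out",
    "javap -c Foo.class",
]
flagged = {cmd: flag_command(cmd) for cmd in rollout if flag_command(cmd)}
print(flagged)
```

A scan like this is easy to evade (for example through nested `sh -c` strings or file access from inside a Python process), so it would complement sandbox hardening rather than replace it.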
| ## 9. Limitations | |
| - **Single-snapshot research.** All upstream literature was surveyed on 2026-05-25. The ecosystem moves fast: Forge may un-pause; OpenEnv may fork; PRIME-RL may consolidate. The survey should be repeated every 6 months. | |
| - **No primary access to Cursor's pipeline.** All Composer 2.5 details come from the public blog. Critical gaps (hint generator architecture, exact RL algorithm) remain. | |
| - **Trace-replay novelty claim is weak in negation.** Absence of evidence in the surveyed literature is not strong evidence of absence. We may have missed adjacent work (e.g., on the "rich-feedback RLVR" axis the SDPO paper introduced). | |
| - **Economic feasibility ≠ training-improvement evidence.** Spike 001 establishes the method is *affordable*. Whether it produces better models is the spike-004 question. | |
| - **Pre-experimental publication risks.** Releasing a methodology before experimental validation has known failure modes (other groups racing the experiment; methodology errors caught only after publication). We accept this risk in exchange for early-feedback signal from the community. | |
| ## 10. Conclusion | |
| We presented a methodology and integration architecture for replicating Cursor's Composer 2.5 recipe on an open-source stack, plus a novel multi-teacher trace-replay distillation channel. We grounded the Composer-2.5 mechanism in published prior art (OPSD/SDPO with MIT-licensed code), audited integration extension points across TRL/VeRL/OpenEnv with primary-source verification, and empirically validated the economic floor of the novel channel ($0.98/trace, 5× cap headroom) plus the architectural composition claim (38/38 unit tests, 5-step train decreases loss). | |
| The full empirical validation — does TR-DPO actually improve SWE-bench-lite pass@1 over plain GRPO at 7B? — is the subject of the follow-up paper. | |
| We invite collaboration. The repository is public. The integration architecture is component-friendly: any of the three channels can be ablated or substituted independently. If you're working on agentic-coding RL post-training and want to test additional channels (e.g., a fourth using Mixture-of-Agents-style aggregation, or test-time compute), the framework slots that in by adding another loss term with its own weight. | |
| ## Acknowledgements | |
| This work builds on: | |
| - **Cursor team** for the Composer 2.5 release and its three named training innovations. | |
| - **Siyan Zhao et al.** for OPSD and the open `siyan-zhao/OPSD` reference implementation (MIT). | |
| - **Hübotter et al.** for SDPO, which formalizes the Cursor mechanism. | |
| - **HuggingFace TRL team** for `GRPOTrainer` and the OpenEnv integration. | |
| - **ByteDance VeRL team** for the HybridFlow / 3D-HybridEngine architecture. | |
| - **Meta PyTorch team** for Monarch + OpenEnv. | |
| - **Prime Intellect team** for PRIME-RL and the INTELLECT-2 decentralized run. | |
| - The five LLM-family research notes in [`research/`](../research/) (Gemini 3.1 Pro, DeepSeek V4 Pro, GPT-5, Claude Sonnet 4.6, Kimi K2-Thinking) for cross-family verification of the framework synthesis. | |
| ## Citation | |
| If you build on this framework, cite as: | |
| ```bibtex | |
| @misc{composer-replication-framework-2026, | |
| author = {Codeseys}, | |
| title = {Composer 2.5 Replication Framework: Methodology and Integration Architecture for Open Replication of Cursor's Agentic Coding Recipe}, | |
| year = {2026}, | |
| publisher = {HuggingFace}, | |
| howpublished = {\url{https://huggingface.co/Codeseys/composer-replication-framework}}, | |
| note = {Pre-experimental v0.0 draft. Methodology, integration architecture, and economic-feasibility result. Empirical validation in follow-up paper.} | |
| } | |
| ``` | |
| Underlying primary sources: | |
| ```bibtex | |
| @article{cursor2026composer25, | |
| title = {Introducing Composer 2.5}, | |
| author = {{Cursor Team}}, | |
| year = {2026}, | |
| url = {https://cursor.com/blog/composer-2-5} | |
| } | |
| @article{zhao2026opsd, | |
| title = {Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models}, | |
| author = {Zhao, Siyan and Xie, Zhihui and Liu, Mengchen and Huang, Jing and Pang, Guan and Chen, Feiyu and Grover, Aditya}, | |
| year = {2026}, | |
| journal = {arXiv preprint arXiv:2601.18734} | |
| } | |
| @article{hubotter2026sdpo, | |
| title = {Reinforcement Learning via Self-Distillation}, | |
| author = {H{\"u}botter, Jonas and L{\"u}beck, Frederike and Behric, Lejs and Baumann, Anton and Bagatella, Marco and Marta, Daniel and Hakimi, Ido and Shenfeld, Idan and Buening, Thomas Kleine and Guestrin, Carlos and Krause, Andreas}, | |
| year = {2026}, | |
| journal = {arXiv preprint arXiv:2601.20802}, | |
| note = {ICLR 2026 Scaling Post-training Workshop} | |
| } | |
| ``` | |
| ## Repository | |
| Methodology document, audited research notes, integration architecture, working code skeleton, spike plan, and all supporting artifacts: | |
| **🤗 https://huggingface.co/Codeseys/composer-replication-framework** | |
| Discussion: open a [Discussion](https://huggingface.co/Codeseys/composer-replication-framework/discussions) on the repo for technical questions, corrections, or collaboration interest. | |
| --- | |
| *Last revised 2026-05-25 (Wave 4: data collator + loss composition smoke).* | |
| *This is a living document. v0.1 will incorporate spike 002–004 results.* | |