	# Composer 2.5 Replication Framework: A Methodology Paper

	> 🚧 PRE-EXPERIMENTAL DRAFT (v0.0) — methodology + economic-feasibility result + integration architecture. No model training experiments yet. Every empirical claim in this paper is one of: (1) a citation to upstream literature, (2) a result from spike 001 (teacher-replay cost measurement), or (3) a unit-test invariant from spike 005 (integration smoke). The full empirical validation (spike 002–004: trace collection, DPO-pair signal density, A/B vs plain GRPO) is the subject of a follow-up paper once GPU budget commits.
	>
	> Last updated: 2026-05-25
	> Author: Codeseys ([HF](https://huggingface.co/Codeseys))
	> Repository: [`Codeseys/composer-replication-framework`](https://huggingface.co/Codeseys/composer-replication-framework)
	> License: MIT (methodology + code); upstream papers and code retain their respective licenses

	## Abstract

	Cursor's Composer 2.5 is a post-trained Kimi K2.5 that is reported to reach frontier agentic-coding performance (~69% on Terminal-Bench 2.0, parity with GPT-5.5) at 5–10× lower serving cost than peers. By community estimate, post-training accounts for roughly 85% of the recipe's compute. Its central non-obvious technique, Targeted RL with Textual Feedback, turns out to be mathematically equivalent to the published method SDPO / OPSD (Hübotter et al. 2026; Zhao et al. 2026), which Cursor's own footnote cites and for which MIT-licensed reference code already exists.

	Building on this, we propose a complementary novel reward channel: multi-teacher trace-replay distillation (TR-DPO). After a frozen agentic rollout, replay each step under N pre-trained external teachers (e.g., Claude Opus, GPT-5, DeepSeek V4 Pro), extract DPO preference pairs from teacher–student disagreement, and add the resulting loss term on top of both RLVR and SDPO. The three channels do not compete for shared resources and ablate cleanly via independent weights.

	We make three pre-experimental contributions: (1) an audited mapping of Cursor's Composer 2.5 blog onto a stack of open infrastructure (TRL / VeRL / OpenEnv / Monarch) with primary-source-verified extension points; (2) an empirical economic feasibility result showing the trace-replay channel costs ~$0.98 per 50-step trace ungated (spike 001, n=150 calls, 0 errors); (3) a working code skeleton with 38 passing unit tests, including an end-to-end gradient-step smoke test on a tiny custom model that empirically verifies the three-channel composition trains without divergence.

	This paper deliberately stops short of training-result claims. Section 7 is explicit about what's not yet validated and what each follow-up spike will measure.

	## 1. Introduction and motivation

	### 1.1 What Composer 2.5 demonstrates

	In their May 2026 release post, Cursor announced [Composer 2.5](https://cursor.com/blog/composer-2-5), a post-trained version of [Moonshot's Kimi K2.5](https://huggingface.co/moonshotai/Kimi-K2-Thinking) that powers Cursor's agentic coding mode. The headline claims, drawn from the post and from community commentary (Section 2.4 separates the two):

	- Substantial improvement over Composer 2 on long-horizon coding tasks
	- Frontier-level performance: parity with GPT-5.5 on SWE-bench Multilingual; ~69% on Terminal-Bench 2.0
	- Pricing of $0.50/$2.50 per million input/output tokens — 5–10× cheaper than Opus 4.6 ($5/$25) and GPT-5.4 ($5/$22.50)
	- Dominant compute share goes to post-training, not pretraining (community estimate: ~85%)

	The blog discloses three training innovations (Section 2 expands each):

	1. Targeted RL with Textual Feedback — a per-turn distillation loss that addresses long-horizon credit assignment.
	2. Synthetic data at 25× scale — Feature Deletion + 24 other (unnamed) generators.
	3. Sharded Muon + Dual Mesh HSDP — MoE optimizer infrastructure.

	If a small team can reproduce the shape of (1) and (2) without K2.5's 1T scale, the path is open to similar performance on smaller bases. Item (3) is MoE-specific infrastructure and irrelevant for dense-base reproductions.

	### 1.2 What's missing from the public recipe

	The blog leaves three significant gaps:

	- How are hints generated? The blog gives one tool-call template ("Reminder: available tools are…") but says nothing about the generator architecture. Hardcoded templates? A separate model? The same model with an introspection prompt? This is the single largest reproducibility gap.
	- What RL algorithm sits underneath? The targeted-textual-feedback method is described as "an on-policy distillation KL loss [added to] the broader RL objective over the full trajectory." The "broader RL objective" is unspecified.
	- What reward-hacking safeguards work in practice? Cursor explicitly mentions failure modes (decompiling Java bytecode, reverse-engineering Python type-checking caches) without disclosing the mitigations beyond "agentic monitoring tools."

	### 1.3 What we propose

	This paper makes a methodological contribution: an open-source replication framework that integrates three reward channels in a single trainer step, on top of any HuggingFace base model, using off-the-shelf open-source infrastructure (TRL or VeRL + OpenEnv).

	The three channels:

	1. RLVR (Channel 1) — verifiable scalar reward (tests pass, build succeeds). Standard.
	2. Composer hint-distill = SDPO (Channel 2) — single-model self-distillation with hint-conditioned context, lifted from `siyan-zhao/OPSD` (MIT). This is Cursor's published method.
	3. Trace-replay multi-teacher DPO (Channel 3, novel) — N external teachers replay each step of frozen rollouts; teacher–student disagreement becomes a DPO preference signal.

	Channels 2 and 3 are mechanistically different, not competing. Both bypass long-horizon credit assignment, but they tap different supervision sources. Section 4 makes the distinction precise.

	We provide:

	- A primary-source-audited integration architecture across the agentic-RL stack (Section 3, [`docs/INTEGRATION_ARCHITECTURE.md`](../docs/INTEGRATION_ARCHITECTURE.md))
	- A working code skeleton with 38 passing unit tests (Section 6, [`spikes/005-integrated-trainer-skeleton/`](../spikes/005-integrated-trainer-skeleton/))
	- An empirical economic feasibility verdict for Channel 3 (Section 5, [`spikes/001-teacher-replay-cost/verdict.md`](../spikes/001-teacher-replay-cost/verdict.md))
	- A risk-ordered spike plan whose terminal experiment falsifies or validates the novel claim (Section 7, [`spikes/README.md`](../spikes/README.md))

	We do not yet provide training results. That is deferred to a follow-up paper after spike 002 (trace collection on a real agentic environment), spike 003 (DPO-pair signal density on real traces), and spike 004 (A/B comparison on SWE-bench-lite).

	## 2. Composer 2.5 in technical detail (audited)

	This section reconstructs Cursor's published recipe with explicit `[BLOG-VERIFIED]` / `[INFERRED]` / `[EXTRAPOLATED]` tags, following the audit in [`docs/COMPOSER_RECIPE_MAPPING.md`](../docs/COMPOSER_RECIPE_MAPPING.md). The distinction matters because the initial parallel-research synthesis for this project blurred which claims came from the blog versus secondary sources, and a primary-source audit caught several extrapolations.

	### 2.1 Targeted RL with Textual Feedback `[BLOG-VERIFIED]`

	> "For a target model message, we construct a short hint describing the desired improvement, insert that hint into the local context, and use the resulting model distribution as a teacher. We use the policy with the original context as the student and add an on-policy distillation KL loss that moves the student's token probabilities toward the teacher's." — Cursor blog

	The mechanism, exactly:

	- Same model acts as both teacher and student. There is no second model.
	- The teacher is "the policy at this turn, with a hint inserted into the context."
	- The student is "the policy at this turn, without the hint."
	- Loss = on-policy KL: `KL(teacher_logits_at_turn_t || student_logits_at_turn_t)`, applied only at the problematic turn.
	- Sits on top of an outer RLVR objective; doesn't replace it.
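A numeric sketch may help make the loss in the list above concrete. The following framework-free Python computes KL(teacher ‖ student) per token from raw logits and averages it over the tokens of the targeted turn. The helper names, the toy logits, and the `turn_mask` argument are illustrative assumptions, not Cursor's or OPSD's code; a real trainer would compute the same quantity on batched tensors.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) for a single token position."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))

def hint_distill_loss(teacher_seq, student_seq, turn_mask):
    """Mean per-token KL over positions where turn_mask is 1.

    teacher_seq: logits from the policy WITH the hint in context.
    student_seq: logits from the same policy WITHOUT the hint.
    turn_mask:   1 for tokens of the targeted turn, 0 elsewhere (assumed name).
    """
    kls = [token_kl(t, s)
           for t, s, m in zip(teacher_seq, student_seq, turn_mask) if m]
    return sum(kls) / len(kls) if kls else 0.0

# Toy example: 3 token positions, vocabulary of 3; only position 1 is targeted.
teacher = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 1.0, 1.0]]
student = [[2.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 0.0, 5.0]]
loss = hint_distill_loss(teacher, student, turn_mask=[0, 1, 0])
```

Position 2 also disagrees in this toy data, but it contributes nothing because the mask restricts the loss to the targeted turn; untargeted turns are left to the outer RLVR objective.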

	Cursor's footnote 1 cites three self-distillation papers as background:

	- OPSD: Self-Distilled Reasoner — On-Policy Self-Distillation for LLMs (Zhao et al., [arXiv:2601.18734](https://arxiv.org/abs/2601.18734); code at [`siyan-zhao/OPSD`](https://github.com/siyan-zhao/OPSD), MIT). Single LLM as both teacher and student; teacher conditioned on privileged information (e.g., a verified solution), student sees only the question.
	- SDPO: Reinforcement Learning via Self-Distillation (Hübotter et al., [arXiv:2601.20802](https://arxiv.org/abs/2601.20802), ICLR 2026 Workshop on Scaling Post-training). Generalizes OPSD to RL with rich textual feedback. Quote from the abstract: "SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy." This is mathematically the same construct as Composer's targeted-textual-feedback.
	- Self-Distillation Enables Continual Learning ([arXiv:2601.19897](https://arxiv.org/abs/2601.19897)).

	The OPSD reference implementation provides a self-contained `generalized_jsd_loss` static method (signature in [Section 6](#6-implementation)) computing JSD/KL between student and teacher logits — directly liftable into any HF Trainer subclass.

	### 2.2 Synthetic data at 25× scale `[BLOG-VERIFIED]`

	> "Composer 2.5 is trained with 25× more synthetic tasks than Composer 2." — Cursor blog

	One named generator: Feature Deletion. Take a repo with comprehensive tests; delete code such that the codebase remains functional but specific testable features are removed. The agent's task is to reimplement the deleted features so the tests pass. Tests = verifiable reward.
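To illustrate the shape of a Feature Deletion task, here is a minimal sketch under simplifying assumptions: it stubs out one function body instead of deleting code across a repository, and it treats "the tests run without raising" as the verifiable reward. `delete_feature`, `reward`, and the toy source are hypothetical names for illustration; the blog does not describe Cursor's generator at this level of detail.

```python
import ast

def delete_feature(source, function_name):
    """Replace the body of `function_name` with `raise NotImplementedError`."""
    tree = ast.parse(source)
    stub = ast.parse("raise NotImplementedError").body[0]
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == function_name:
            node.body = [stub]
    return ast.unparse(ast.fix_missing_locations(tree))

def reward(source, test_code):
    """Verifiable reward: 1.0 if the tests run without raising, else 0.0."""
    namespace = {}
    try:
        exec(source, namespace)      # load the candidate codebase
        exec(test_code, namespace)   # run the tests against it
        return 1.0
    except Exception:
        return 0.0

ORIGINAL = (
    "def add(a, b):\n"
    "    return a + b\n"
    "\n"
    "def double(x):\n"
    "    return add(x, x)\n"
)
TESTS = "assert add(2, 3) == 5\nassert double(4) == 8\n"

task_source = delete_feature(ORIGINAL, "add")   # what the agent starts from
baseline_reward = reward(task_source, TESTS)    # tests fail before the fix
solved_reward = reward(ORIGINAL, TESTS)         # tests pass once reimplemented
```

The sketch still imports and defines `double` after the deletion, which mirrors the requirement that the codebase stays loadable while a specific testable feature is missing.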

	The blog also discloses observed reward-hacking failures: the model learning to decompile Java bytecode to reconstruct deleted APIs, and reverse-engineering Python type-checking caches to recover deleted function signatures. Mitigations are described only as "agentic monitoring tools" — opaque.

	### 2.3 Sharded Muon + Dual Mesh HSDP `[BLOG-VERIFIED]`

	MoE optimizer infrastructure. Per-attention-head and per-expert orthogonalization (Newton-Schulz) with asynchronous all-to-all communication. Two HSDP layouts: narrow (intra-node) for non-expert weights, wide for expert weights.

	This is infrastructure, not algorithm. It is relevant only at MoE-1T scale (Kimi K2.5). For the dense-base reproductions described here (Qwen3-7B in the v0.0 spikes, Qwen3-32B in v0.1), it is irrelevant; it becomes relevant only if the framework is later applied directly to a Kimi-K2.5 derivative.

	### 2.4 Claims that are NOT in the Cursor blog (extrapolated)

	These claims appear in community commentary and were reproduced uncritically in the initial synthesis for this project. Secondary sources suggest they are likely correct, but the Composer 2.5 blog does not state them:

	- "~85% of total compute is post-training" — community consensus (HN threads, third-party substack analysis). Plausible but unverified.
	- "Anyrun" environment harness with LSP / file I/O / terminal — name "Anyrun" is not in the 2.5 blog (may be in the [Composer 2 technical report](https://cursor.com/blog/composer-2-technical-report)).
	- CursorBench 69.3%, Terminal-Bench 2.0 parity, SWE-bench Multilingual — the 2.5 blog does not quote benchmark numbers.
	- "PPO or GRPO variant" as the underlying RL algorithm — the blog never names the RL algorithm.
	- MLA + 1T total / 32B active + 384 experts + 256K context — these are Kimi K2.5 base model facts, [verified independently](https://huggingface.co/moonshotai) but inferred rather than blog-stated.

	The blog is unambiguous on the three items in §2.1–§2.3 and otherwise terse.

	## 3. Integration architecture across the agentic-RL stack

	The three reward channels need to compose inside a single trainer step, regardless of which RL framework hosts them. We audited (via [DeepWiki](https://deepwiki.com/) on 2026-05-25) the extension points of the major open-source agentic-RL frameworks and produced an integration matrix.

	### 3.1 Frameworks surveyed

	| Framework | Role | Status |
	|---|---|---|
	| [HuggingFace TRL](https://github.com/huggingface/trl) | Reference algorithm library; `GRPOTrainer` is the workhorse for RLVR | Mature (v1.0, 2026-03); developer-friendly; OpenEnv integration since 2025-10 |
	| [ByteDance VeRL](https://github.com/volcengine/verl) | Production-scale RL via HybridFlow + 3D-HybridEngine; Ray-based | Proven 671B; preferred for ≥70B runs |
	| [Meta TorchForge](https://github.com/meta-pytorch/forge) | RL post-training on Monarch; reference recipes | "Development paused — consolidating into TorchTitan" (banner). Use as pattern reference only. |
	| [Meta Monarch](https://github.com/meta-pytorch/monarch) | Single-controller actor-mesh framework with RDMA data plane | Active; native PyTorch integration |
	| [Meta OpenEnv](https://github.com/meta-pytorch/OpenEnv) | Standard for agentic environments (typed reset/step/close + MCP RFC) | First-party TRL integration; HF Hub catalog; growing community |
	| [Prime Intellect PRIME-RL](https://github.com/PrimeIntellect-ai/prime-rl) | Decentralized RL substrate (INTELLECT-2, 32B globally distributed) | Production-deployed; `verifiers` env library |

	### 3.2 Extension-point matrix (verified)

	The full table is in [`docs/INTEGRATION_ARCHITECTURE.md`](../docs/INTEGRATION_ARCHITECTURE.md). Summary:

	| Channel | TRL | VeRL |
	|---|---|---|
	| 1. RLVR/GRPO | `GRPOTrainer._compute_loss(model, inputs)` (base behavior) | `@register_adv_est("grpo")` → `core_algos.compute_grpo_outcome_advantage` |
	| 2. SDPO | Subclass override of `_compute_loss`; lift `generalized_jsd_loss` from OPSD | New `@register_adv_est("grpo_sdpo")`; reads `data.batch["sdpo_teacher_logprobs"]`; precedent: distillation already attaches `teacher_log_probs` |
	| 3. TR-DPO | Subclass override; add DPO term using `inputs["dpo_chosen_input_ids"]`, etc. | Custom estimator reading `data.non_tensor_batch["teacher_actions"]` |
	| OpenEnv plumbing | `environment_factory=` kwarg in trainer init | Custom env worker producing `DataProto`-shaped output |

	Both paths allow the integrated trainer to be assembled out of small components — none of the three channels requires modifying the framework's core. The full unified loss is:

	```
	total_loss = grpo_loss + α · sdpo_kl_loss + β · trace_replay_dpo_loss
	```

	where `α = β = 0` recovers plain GRPO (the baseline arm of any ablation).
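The composition and its ablation arms can be sketched as a small function. This is a minimal illustration that uses plain floats in place of tensors; `composer_total_loss` and the arm names are assumptions made for this example, and skipping a channel whose loss is `None` mirrors the mixed-batch behavior described in Section 5.2.

```python
def composer_total_loss(grpo_loss, sdpo_kl_loss=None, trace_replay_dpo_loss=None,
                        alpha=0.0, beta=0.0):
    """total = grpo + alpha * sdpo_kl + beta * trace_replay_dpo.

    A channel whose loss is None (for example, no error sites in the batch)
    is skipped even if its weight is non-zero.
    """
    total = grpo_loss
    if alpha and sdpo_kl_loss is not None:
        total = total + alpha * sdpo_kl_loss
    if beta and trace_replay_dpo_loss is not None:
        total = total + beta * trace_replay_dpo_loss
    return total

# Four ablation arms over the same per-channel losses (toy values).
ARMS = {
    "grpo_only": dict(alpha=0.0, beta=0.0),
    "grpo+sdpo": dict(alpha=0.5, beta=0.0),
    "grpo+trdpo": dict(alpha=0.0, beta=0.25),
    "full": dict(alpha=0.5, beta=0.25),
}
totals = {name: composer_total_loss(1.0, 2.0, 4.0, **w) for name, w in ARMS.items()}
```

Because the same expression works on tensors that support `+` and `*`, the sketch carries over to a real trainer unchanged apart from the inputs.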

	### 3.3 Why this matters

	The architectural constraint that lets all three channels co-exist: they don't compete for shared resources.

	| Resource | Channel 1 (RLVR) | Channel 2 (SDPO) | Channel 3 (TR-DPO) |
	|---|---|---|---|
	| GPU forward pass (rollout) | yes (vLLM, async) | no | no |
	| GPU forward pass (training) | yes | one extra per error site (~5% of tokens) | policy forward on the DPO pairs only; teacher-side quantities are precomputed, so no teacher forward pass |
	| GPU backward pass | yes | yes | yes |
	| External API budget | none | none | $0.30–1 per 50-step trace |
	| Latency-critical path | yes (gates next rollout) | minor | no (async, post-rollout) |

	Channel 2 is forward-pass-bound (training-side, sparse). Channel 3 is API-bound (offline, post-rollout). They don't fight for the same compute.

	## 4. The novel channel: multi-teacher trace-replay distillation

	### 4.1 Construction

	After a frozen agentic rollout `(state_t, action_t, reward)`, for each step `t`:

	1. Replay the exact state at step `t` against `N` pre-trained external teachers (different model families).
	2. Extract each teacher's chosen action (or action distribution) at that state.
	3. Score the disagreement: if `k ≥ k_threshold` of `N` teachers agree on action `X` and the student picked `Y ≠ X`, emit a DPO preference pair `(chosen=X, rejected=Y)`.
	4. Train with standard DPO loss on the pair set, layered onto GRPO + (optionally) SDPO.

	This is offline relative to the rollout — teacher API calls happen post-rollout, after which the DPO pairs are batched into the next training step. No latency-critical coupling.
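Steps 1 to 3 of the construction can be sketched as follows. This is a minimal illustration, assuming actions are compared by exact string match (real tool calls would need normalization first) and that teacher actions have already been collected; the function names and the trace record layout are hypothetical.

```python
from collections import Counter

def extract_dpo_pair(student_action, teacher_actions, k_threshold=2):
    """Return (chosen, rejected) if at least k_threshold teachers agree on an
    action that differs from the student's; otherwise return None."""
    if not teacher_actions:
        return None
    # Most common teacher action; ties resolve to the first one encountered.
    action, votes = Counter(teacher_actions).most_common(1)[0]
    if votes >= k_threshold and action != student_action:
        return (action, student_action)
    return None

def extract_pairs_from_trace(trace, k_threshold=2):
    """trace: list of dicts with 'state', 'student_action', 'teacher_actions'."""
    pairs = []
    for step, rec in enumerate(trace):
        pair = extract_dpo_pair(rec["student_action"], rec["teacher_actions"],
                                k_threshold)
        if pair is not None:
            chosen, rejected = pair
            pairs.append({"step": step, "prompt": rec["state"],
                          "chosen": chosen, "rejected": rejected})
    return pairs

toy_trace = [
    # Teachers agree with the student: no pair.
    {"state": "s0", "student_action": "run_tests",
     "teacher_actions": ["run_tests", "run_tests", "read_file"]},
    # Two of three teachers agree on a different action: emit a pair.
    {"state": "s1", "student_action": "edit_file",
     "teacher_actions": ["read_file", "read_file", "run_tests"]},
    # No consensus among teachers: no pair.
    {"state": "s2", "student_action": "edit_file",
     "teacher_actions": ["read_file", "run_tests", "grep"]},
]
pairs = extract_pairs_from_trace(toy_trace, k_threshold=2)
```

Only steps with teacher consensus against the student yield supervision, so the number of pairs per trace depends on both `k_threshold` and how often the teachers agree with each other.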

	### 4.2 Distinction from related work

	| Work | Mechanism | Difference from TR-DPO |
	|---|---|---|
	| [rStar / rStar-Math](https://arxiv.org/abs/2408.06195) (Microsoft) | MCTS at training time, single teacher branches at each step | TR-DPO replays pre-existing traces (not MCTS); uses N teachers (not single). |
	| [Math-Shepherd / OmegaPRM](https://arxiv.org/abs/2312.08935) | Process reward models from rollout-and-check | Reward signal is teacher disagreement, not rollout outcomes. |
	| [Magpie](https://arxiv.org/abs/2406.08464) / OpenThoughts | Synthetic data from one strong teacher | Per-step distillation from N teachers on real traces. |
	| [Mixture-of-Agents](https://arxiv.org/abs/2406.04692) (Wang et al.) | Multi-teacher response-level aggregation | Per-step (sub-response) aggregation. |
	| Composer SDPO / OPSD | Single-model self-teacher with hint context | TR-DPO uses N external teachers; complementary, not competing. |

	To our knowledge no published work systematically replays each step of frozen agentic traces with multiple external teachers to harvest step-level supervision. The trace-replay-with-N-teachers construction appears to be open territory.

	### 4.3 Stacking with SDPO

	| Property | Composer SDPO | TR-DPO (this work) |
	|---|---|---|
	| Number of models | 1 | N + 1 |
	| Teacher source | Same model with privileged context | External pretrained models |
	| Per-step compute | One extra forward pass | No teacher forward pass (teacher actions precomputed via API); policy forward on the DPO pairs |
	| Per-step API cost | Zero | ~$0.02 (3-teacher, ungated) |
	| Privileged signal | Hint text in context | None; teachers see the same state |
	| Bypasses long-horizon credit assignment | Yes (per-turn KL) | Yes (per-step DPO) |
	| Published code | Yes: `siyan-zhao/OPSD` | Not yet |

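	Both add dense per-step signal on top of RLVR. They enter the objective as separate loss terms with separate weights, so each can be tuned or ablated independently. Their gradients still sum on the shared policy parameters, so whether they interfere in practice is an empirical question for the follow-up spikes.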
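For reference, the per-step DPO term in the TR-DPO column is the standard DPO objective applied to each extracted pair. The sketch below computes it for one pair from sequence log-probabilities. The argument names are illustrative, `dpo_beta` is the DPO temperature (distinct from the channel weight β in the unified loss), and the choice of reference policy is an assumption this paper does not fix.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, dpo_beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full action under the
    trainable policy or the frozen reference policy.
    """
    margin = dpo_beta * ((policy_chosen_logp - ref_chosen_logp)
                         - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) == softplus(-margin), computed stably.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Policy identical to the reference: loss equals log(2).
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Policy already prefers the teacher-chosen action: loss drops below log(2).
aligned = dpo_loss(-1.0, -3.0, -2.0, -2.0)
# Policy prefers the rejected (student) action: loss rises above log(2).
misaligned = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

The policy log-probabilities require a forward pass through the trainable model on the chosen and rejected actions, whereas the teacher actions themselves come from the precomputed API responses.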

	## 5. Empirical results so far

	### 5.1 Spike 001 — teacher-replay cost floor (✅ VALIDATED)

	Question: Given a 50-step agentic-coding trace, what's the API cost and wallclock latency of querying N=3 frontier teachers in parallel for next-action distributions at every step?

	Method. We synthesized 50 hand-crafted SWE-bench-lite-shaped agentic states (multi-turn function-call decision points; ~250–500 tokens of context each). For each state we issued parallel async requests to three teachers via OpenRouter:

	- `anthropic/claude-opus-4.7` (Anthropic frontier)
	- `openai/gpt-5` (OpenAI frontier)
	- `deepseek/deepseek-v4-pro` (open-weight frontier)

	Total: 150 calls. Hard-cap at $20 spend (early abort on overrun).
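The request pattern described above (parallel teachers per state, a spend cap with early abort) can be sketched with `asyncio`. This is not the spike's `replay.py`: `query_teacher` is a stand-in that sleeps instead of calling OpenRouter, and the canned action and per-call cost are made-up placeholders so the sketch runs offline.

```python
import asyncio
import time

TEACHERS = ["anthropic/claude-opus-4.7", "openai/gpt-5", "deepseek/deepseek-v4-pro"]

async def query_teacher(model, state):
    """Stand-in for a real API call; returns a canned action and a fake cost."""
    await asyncio.sleep(0.01)  # simulated network latency
    return {"model": model, "action": "run_tests", "cost_usd": 0.001}

async def replay_step(state, teachers=TEACHERS):
    """Query all teachers concurrently; step latency is set by the slowest one."""
    start = time.perf_counter()
    results = await asyncio.gather(*(query_teacher(m, state) for m in teachers))
    return {"results": results, "latency_s": time.perf_counter() - start}

async def replay_trace(states, spend_cap_usd=20.0):
    """Replay states in order and stop as soon as the spend cap is reached."""
    spent, steps = 0.0, []
    for state in states:
        step = await replay_step(state)
        spent += sum(r["cost_usd"] for r in step["results"])
        steps.append(step)
        if spent >= spend_cap_usd:
            break  # early abort on overrun
    return steps, spent

steps, spent = asyncio.run(replay_trace([f"state-{i}" for i in range(5)]))
capped_steps, capped_spent = asyncio.run(
    replay_trace([f"state-{i}" for i in range(5)], spend_cap_usd=0.005)
)
```

Because the teachers run concurrently within a step, the step latency reported in the results table is the maximum across teachers, not their sum.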

	Result.

	| Metric | Target | Actual | Pass? |
	|---|---|---|---|
	| Mean per-trace cost (50 steps × 3 teachers, ungated) | < $5 | $0.98 | ✅ 5× headroom |
	| p95 step latency (max across 3 parallel teachers) | < 30 s | 20.45 s | ✅ |
	| p99 step latency | < 60 s | 23.24 s | ✅ |
	| Errors | 0 expected | 0 / 150 | ✅ |

	Per-teacher breakdown.

	| Teacher | n | p50 lat | p95 lat | mean $/call | total $ |
	|---|---|---|---|---|---|
	| `anthropic/claude-opus-4.7` | 50 | 3.4 s | 4.6 s | $0.0161 | $0.81 |
	| `openai/gpt-5` | 50 | 5.0 s | 10.1 s | $0.0021 | $0.11 |
	| `deepseek/deepseek-v4-pro` | 50 | 7.1 s | 16.2 s | $0.0013 | $0.07 |

	Opus dominates per-trace cost (~83%). The value-of-information (VOI) gating planned for v0.1 queries teachers only when student entropy is high. Assuming it removes 60–80% of teacher calls, the per-trace cost would fall to roughly $0.20–0.39 (~$0.30); this reduction is a projection, not a measurement. Opus could optionally be dropped or replaced with Sonnet 4.6 for further savings.
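The gating idea in the paragraph above (query teachers only when student entropy is high) could look like the following sketch. The entropy threshold of 0.5 nats and the function names are hypothetical choices for illustration; the paper does not yet specify the gate.

```python
import math

def entropy_nats(probs):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_query_teachers(student_action_probs, entropy_threshold=0.5):
    """Spend API budget only where the student is uncertain."""
    return entropy_nats(student_action_probs) >= entropy_threshold

# Toy per-step action distributions from the student policy.
step_probs = [
    [1.0, 0.0, 0.0],     # fully confident: skip
    [0.9, 0.05, 0.05],   # mostly confident (about 0.39 nats): skip
    [1/3, 1/3, 1/3],     # maximally uncertain (about 1.10 nats): query
    [0.5, 0.5, 0.0],     # two-way split (about 0.69 nats): query
]
queried = [should_query_teachers(p) for p in step_probs]
fraction_queried = sum(queried) / len(queried)
```

The achieved call reduction is one minus `fraction_queried`, so the projected savings depend entirely on how the student's entropy is distributed over real traces.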

	The economic floor is well within budget. Channel 3 is viable.
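The headline numbers can be re-derived from the per-teacher means in the breakdown table. The short check below uses only figures reported above; the 60–80% gating reduction is the projection from the text, not a measured value.

```python
MEAN_COST_PER_CALL = {
    "anthropic/claude-opus-4.7": 0.0161,
    "openai/gpt-5": 0.0021,
    "deepseek/deepseek-v4-pro": 0.0013,
}
STEPS_PER_TRACE = 50
TARGET_USD = 5.0

# One call per teacher per step.
per_teacher = {m: c * STEPS_PER_TRACE for m, c in MEAN_COST_PER_CALL.items()}
ungated = sum(per_teacher.values())                       # about $0.98
opus_share = per_teacher["anthropic/claude-opus-4.7"] / ungated  # about 0.83
per_step = ungated / STEPS_PER_TRACE                      # about $0.02
headroom = TARGET_USD / ungated                           # about 5x

# Projected cost if gating removes 80% or 60% of teacher calls.
gated_low, gated_high = ungated * (1 - 0.80), ungated * (1 - 0.60)
```

The small gap between the computed total and the sum of the rounded per-teacher totals in the table ($0.81 + $0.11 + $0.07) is a rounding effect.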

	Full data at [`spikes/001-teacher-replay-cost/`](../spikes/001-teacher-replay-cost/) (synthesize_trace.py + replay.py + analyze.py + verdict.md). Code is reproducible — set `OPENROUTER_API_KEY` and run `python synthesize_trace.py && python replay.py && python analyze.py`.

	### 5.2 Spike 005 — integration architecture (✅ COMPOSITION-VERIFIED)

	Question: Does the proposed three-channel integration (RLVR + SDPO + TR-DPO) compose cleanly in a real PyTorch trainer? Specifically: are the loss terms additive without unwanted interactions; do α/β=0 ablations correctly recover plain GRPO; and does a multi-channel run actually decrease loss on a tiny model?

	Method. A code skeleton at [`spikes/005-integrated-trainer-skeleton/`](../spikes/005-integrated-trainer-skeleton/) implements:

	- `opsd_loss.py` — `generalized_jsd_loss` lifted verbatim from `siyan-zhao/OPSD` (MIT). Self-contained static method computing JSD/KL between student and teacher logits.
	- `teacher_replay.py` — parallel OpenRouter client + DPO-pair extractor (from teacher–student disagreement at the agreement threshold).
	- `hint_generator.py` — template-based hint dispatcher keyed by error_kind (v0.1 starter; LLM-driven hints in v0.2).
	- `trl_path/data_collator.py` — `ComposerDataCollator` transforming raw trace + DPO pairs into the trainer batch dict.
	- `trl_path/composer_trainer.py` — `ComposerReplicationTrainer(GRPOTrainer)` with `_compute_loss` override.
	- `verl_path/composer_adv.py` — `@register_adv_est("grpo_composer")` for VeRL.

	Result: 38/38 unit tests pass in 3.43 s (`python3 -m pytest tests/ -v`).

	| Test module | Tests | Status |
	|---|---|---|
	| `test_opsd_loss.py` (lifted OPSD math) | 9 | ✅ all pass |
	| `test_teacher_replay.py` (DPO-pair extraction) | 7 | ✅ all pass |
	| `test_data_collator.py` (raw trace → batch) | 15 | ✅ all pass |
	| `test_loss_composition_smoke.py` (3-channel + ablation + 5-step train) | 7 | ✅ all pass |

	The composition smoke test runs all three channels on a `TinyLM(vocab=64, hidden=32)` (~10K parameters). Verifies:

	- Ablation invariants: `α=0, β=0` reduces exactly to GRPO; α-only adds SDPO; β-only adds DPO; full = sum.
	- Gradient finiteness: Every model parameter receives a finite gradient with all three channels active.
	- Training behavior: A 5-step train run with all three channels active decreases total loss (overfitting check on a fixed batch). The channels do not actively fight each other.
	- Robustness to mixed batches: When the data collator emits no SDPO inputs (no error sites in batch), the loss correctly bypasses the SDPO term even with α=1.

	This empirically tests the integration claim that was purely architectural in earlier draft material.

	### 5.3 What spike 005 does NOT prove

	The smoke test is on a 10K-parameter model with placeholder GRPO loss (cross-entropy on a synthetic target sequence) instead of the real GRPO advantage / group-relative computation. It demonstrates wiring correctness, not training quality. The real training experiments are in spikes 002–004.

	## 6. Implementation

	### 6.1 Lifted SDPO loss

	Verbatim port from `siyan-zhao/OPSD` (DeepWiki-verified self-contained, MIT licensed):

	```python
	import torch

	def generalized_jsd_loss(
	    student_logits: torch.Tensor,        # (B, T, V)
	    teacher_logits: torch.Tensor,        # (B, T, V); same model, hint-conditioned context
	    labels: torch.Tensor | None = None,  # -100 = ignore
	    beta: float = 0.5,                   # 0 = forward KL, 1 = reverse KL, 0.5 = JSD
	    temperature: float = 1.0,
	    reduction: str = "batchmean",
	    top_k: int | None = None,
	    token_clip: float | None = None,
	) -> torch.Tensor: ...
	```

	Full implementation at [`spikes/005-integrated-trainer-skeleton/opsd_loss.py`](../spikes/005-integrated-trainer-skeleton/opsd_loss.py).

	### 6.2 TRL trainer subclass

	```python
	from trl import GRPOTrainer

	class ComposerReplicationTrainer(GRPOTrainer):
	    def __init__(self, *args, alpha_sdpo=0.1, beta_replay=0.05, **kwargs):
	        super().__init__(*args, **kwargs)
	        self.alpha_sdpo = alpha_sdpo    # weight on the SDPO KL term (Channel 2)
	        self.beta_replay = beta_replay  # weight on the trace-replay DPO term (Channel 3)

	    def _compute_loss(self, model, inputs):
	        grpo_loss = super()._compute_loss(model, inputs)
	        # The two helpers below are defined in the full implementation.
	        sdpo_kl = self._compute_sdpo_loss(model, inputs)
	        replay_dpo = self._compute_trace_replay_loss(model, inputs)
	        return grpo_loss + self.alpha_sdpo * sdpo_kl + self.beta_replay * replay_dpo
	```

	Full implementation at [`spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py`](../spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py).

	### 6.3 VeRL custom advantage estimator

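	```python
	# Import paths assume upstream VeRL's layout; adjust to the installed version.
	from verl.trainer.ppo import core_algos
	from verl.trainer.ppo.core_algos import register_adv_est

	@register_adv_est("grpo_composer")
	def compute_grpo_composer_advantage(token_level_rewards, eos_mask, index, *,
	                                    sdpo_teacher_logprobs=None, alpha_sdpo=0.0,
	                                    teacher_consensus_prm=None, beta_replay=0.0,
	                                    **kwargs):
	    # Assumes the GRPO estimator returns an (advantages, returns) tuple, as upstream VeRL does.
	    base_adv, returns = core_algos.compute_grpo_outcome_advantage(token_level_rewards, eos_mask, index)
	    if alpha_sdpo and sdpo_teacher_logprobs is not None:
	        # Dense per-token shaping, applied at error sites only.
	        base_adv = base_adv + alpha_sdpo * (sdpo_teacher_logprobs - kwargs["old_log_prob"]) * kwargs["sdpo_error_mask"]
	    if beta_replay and teacher_consensus_prm is not None:
	        base_adv = base_adv + beta_replay * teacher_consensus_prm
	    return base_adv, returns
	```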

	Full implementation at [`spikes/005-integrated-trainer-skeleton/verl_path/composer_adv.py`](../spikes/005-integrated-trainer-skeleton/verl_path/composer_adv.py).

	### 6.4 Data collator

	`ComposerDataCollator` consumes a list of `TraceExample` dicts (with optional `dpo_pairs` from `teacher_replay.extract_dpo_pairs`) and emits the batch dict the trainer expects:

	- Channel 1: `input_ids`, `attention_mask`, `response_mask`, `rewards`
	- Channel 2: `ctx_teacher_input_ids` (with hint inserted at error-turn boundary), `sdpo_loss_mask` (1 at post-hint tokens, -100 elsewhere)
	- Channel 3: `dpo_chosen_input_ids`, `dpo_rejected_input_ids`, `*_response_mask`

	Full implementation at [`spikes/005-integrated-trainer-skeleton/trl_path/data_collator.py`](../spikes/005-integrated-trainer-skeleton/trl_path/data_collator.py).
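As an illustration of the Channel 2 fields, the sketch below inserts hint tokens at the error-turn boundary and builds the matching loss mask. It is not the repository's collator: `build_teacher_context` is a hypothetical helper, token IDs are plain integers, and "post-hint tokens" is assumed to mean every token after the hint through the end of the sequence.

```python
def build_teacher_context(input_ids, hint_ids, error_turn_start):
    """Insert the hint before the error turn and mark post-hint tokens.

    Returns the teacher-side token IDs and a mask that is 1 on post-hint
    tokens and -100 everywhere else (prefix and the hint itself).
    """
    ctx = input_ids[:error_turn_start] + hint_ids + input_ids[error_turn_start:]
    post_hint_start = error_turn_start + len(hint_ids)
    mask = [-100] * post_hint_start + [1] * (len(ctx) - post_hint_start)
    return {"ctx_teacher_input_ids": ctx, "sdpo_loss_mask": mask}

# Toy example: the error turn starts at index 3; the hint is two tokens long.
batch_fields = build_teacher_context(
    input_ids=[10, 11, 12, 13, 14],
    hint_ids=[900, 901],
    error_turn_start=3,
)
```

The student sequence keeps the original `input_ids`, so teacher and student positions in the error turn are offset by the hint length and need to be aligned before the KL is computed.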

	## 7. What's NOT proven (and how the follow-up spikes will measure it)

	This paper is explicit about the gap between "framework + economic feasibility" and "this method works." The following claims are not yet validated:

	| Claim | Status | Validating spike |
	|---|---|---|
	| TR-DPO improves SWE-bench-lite pass@1 over plain GRPO | Open | Spike 004: A/B Qwen3-7B trained with GRPO vs GRPO+TR-DPO; ≥2 pt pass@1 improvement at p<0.05 over 3 seeds is the success criterion. ~$300 GPU + $50 eval. |
	| Teacher disagreement at the step level carries non-trivial preference signal on real traces | Open | Spike 003: extract DPO pairs from spike-002 traces; report pairs/trace and KL divergence from random pairs. |
	| TRL's GRPOTrainer with `environment_factory=` cleanly emits trace JSONL | Open | Spike 002a: 100 rollouts on Qwen3-7B + SWE-bench-lite via TRL. |
	| PRIME-RL produces equivalently clean trace export | Open | Spike 002b: head-to-head with 002a. |
	| The trained variant matches Composer 2.5 quality at 32B scale | Open (v0.1) | Out of scope for this paper. v0.1 follow-up. |

	Honest framing for what this paper does and does not contribute:

	- ✅ Contributes: an integration architecture verified at the code level, with reusable components, public code, and economic feasibility for the novel channel. A reviewer can use this paper to design and budget a real experiment.
	- ❌ Does not contribute: evidence that the method actually trains better models. That requires the GPU spike. We are deliberately not making that claim until experiments back it.

	## 8. Reward-hacking safeguards (proposed for v0.1)

	Cursor's blog mentions specific reward-hacking failures (Java bytecode decompilation, Python type-cache reverse-engineering) without disclosing mitigations. We propose:

	- Sandbox hardening — disable `find`, `unzip`, `strings`, `objdump`, and similar introspection tools in the OpenEnv container; clear `__pycache__` between rollouts; randomize `PYTHONHASHSEED`.
	- Static-analysis monitor — flag rollouts that read or write paths matching `__pycache__/`, `*.pyc`, or `*.class`, or that invoke unzip operations.
	- Reward-model penalty — train a small RM on annotated reward-hacking examples; subtract its score from the RLVR signal.

	These are pre-experiment proposals; their efficacy is part of the v0.1 follow-up.
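A minimal sketch of the static-analysis monitor is shown below, assuming the environment logs each rollout as a list of path accesses and shell commands. The event format, the pattern lists, and `flag_rollout` are hypothetical; the command set reuses only tools named in the sandbox-hardening bullet.

```python
import fnmatch
import shlex

SUSPICIOUS_PATH_PATTERNS = ["*__pycache__/*", "*.pyc", "*.class"]
SUSPICIOUS_COMMANDS = {"unzip", "strings", "objdump"}

def flag_rollout(events):
    """Return human-readable reasons a rollout looks like reward hacking.

    events: list of dicts, {'kind': 'path' | 'cmd', 'value': str}.
    An empty list means nothing suspicious was seen.
    """
    reasons = []
    for ev in events:
        value = ev["value"]
        if ev["kind"] == "path":
            if any(fnmatch.fnmatchcase(value, pat) for pat in SUSPICIOUS_PATH_PATTERNS):
                reasons.append(f"suspicious path: {value}")
        elif ev["kind"] == "cmd":
            try:
                tokens = shlex.split(value)
            except ValueError:           # unbalanced quotes: fall back to a plain split
                tokens = value.split()
            # Compare the executable's basename so /usr/bin/strings is caught too.
            if tokens and tokens[0].rsplit("/", 1)[-1] in SUSPICIOUS_COMMANDS:
                reasons.append(f"suspicious command: {value}")
    return reasons

rollout_events = [
    {"kind": "cmd", "value": "pytest -q tests/"},
    {"kind": "path", "value": "src/__pycache__/api.cpython-311.pyc"},
    {"kind": "cmd", "value": "/usr/bin/strings build/Api.class"},
    {"kind": "path", "value": "src/api.py"},
]
reasons = flag_rollout(rollout_events)
```

A monitor like this only inspects the first token of each command, so piped or wrapped invocations would need a fuller parser; it is meant as a cheap first filter in front of human or reward-model review.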

	## 9. Limitations

	- Single-snapshot research. All upstream literature surveyed on 2026-05-25. The ecosystem moves fast: Forge may un-pause; OpenEnv may fork; PRIME-RL may consolidate. Re-survey every 6 months.
	- No primary access to Cursor's pipeline. All Composer 2.5 details come from the public blog. Critical gaps (hint generator architecture, exact RL algorithm) remain.
	- The trace-replay novelty claim rests on a negative search result. Absence of evidence in the surveyed literature is not strong evidence of absence. We may have missed adjacent work (e.g., on the "rich-feedback RLVR" axis the SDPO paper introduced).
	- Economic feasibility ≠ training-improvement evidence. Spike 001 establishes the method is affordable. Whether it produces better models is the spike-004 question.
	- Pre-experimental publication risks. Releasing a methodology before experimental validation has known failure modes (other groups racing the experiment; methodology errors caught only after publication). We accept this risk in exchange for early-feedback signal from the community.

	## 10. Conclusion

	We presented a methodology and integration architecture for replicating Cursor's Composer 2.5 recipe on an open-source stack, plus a novel multi-teacher trace-replay distillation channel. We grounded the Composer 2.5 mechanism in published prior art (OPSD/SDPO with MIT-licensed code) and audited integration extension points across TRL/VeRL/OpenEnv with primary-source verification. We also empirically validated the economic floor of the novel channel ($0.98/trace, about 5× headroom under the $5 target) and the architectural composition claim (38/38 unit tests; a 5-step training run decreases loss).

	The full empirical validation — does TR-DPO actually improve SWE-bench-lite pass@1 over plain GRPO at 7B? — is the subject of the follow-up paper.

	We invite collaboration. The repository is public. The integration architecture is component-friendly: any of the three channels can be ablated or substituted independently. If you're working on agentic-coding RL post-training and want to test additional channels (e.g., a fourth using Mixture-of-Agents-style aggregation, or test-time compute), the framework slots that in by adding another loss term with its own weight.

	## Acknowledgements

	This work builds on:

	- Cursor team for the Composer 2.5 release and its three named training innovations.
	- Siyan Zhao et al. for OPSD and the open `siyan-zhao/OPSD` reference implementation (MIT).
	- Hübotter et al. for SDPO, which formalizes the Cursor mechanism.
	- HuggingFace TRL team for `GRPOTrainer` and the OpenEnv integration.
	- ByteDance VeRL team for the HybridFlow / 3D-HybridEngine architecture.
	- Meta PyTorch team for Monarch + OpenEnv.
	- Prime Intellect team for PRIME-RL and the INTELLECT-2 decentralized run.
	- The five LLM-family research notes in [`research/`](../research/) (Gemini 3.1 Pro, DeepSeek V4 Pro, GPT-5, Claude Sonnet 4.6, Kimi K2-Thinking) for cross-family verification of the framework synthesis.

	## Citation

	If you build on this framework, cite as:

	```bibtex
	@misc{composer-replication-framework-2026,
	author = {Codeseys},
	title = {Composer 2.5 Replication Framework: Methodology and Integration Architecture for Open Replication of Cursor's Agentic Coding Recipe},
	year = {2026},
	publisher = {HuggingFace},
	howpublished = {\url{https://huggingface.co/Codeseys/composer-replication-framework}},
	note = {Pre-experimental v0.0 draft. Methodology, integration architecture, and economic-feasibility result. Empirical validation in follow-up paper.}
	}
	```

	Underlying primary sources:

	```bibtex
	@article{cursor2026composer25,
	title = {Introducing Composer 2.5},
	author = {{Cursor Team}},
	year = {2026},
	url = {https://cursor.com/blog/composer-2-5}
	}

	@article{zhao2026opsd,
	title = {Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models},
	author = {Zhao, Siyan and Xie, Zhihui and Liu, Mengchen and Huang, Jing and Pang, Guan and Chen, Feiyu and Grover, Aditya},
	year = {2026},
	journal = {arXiv preprint arXiv:2601.18734}
	}

	@article{hubotter2026sdpo,
	title = {Reinforcement Learning via Self-Distillation},
	author = {H{\"u}botter, Jonas and L{\"u}beck, Frederike and Behric, Lejs and Baumann, Anton and Bagatella, Marco and Marta, Daniel and Hakimi, Ido and Shenfeld, Idan and Buening, Thomas Kleine and Guestrin, Carlos and Krause, Andreas},
	year = {2026},
	journal = {arXiv preprint arXiv:2601.20802},
	note = {ICLR 2026 Scaling Post-training Workshop}
	}
	```

	## Repository

	Methodology document, audited research notes, integration architecture, working code skeleton, spike plan, and all supporting artifacts:

	🤗 https://huggingface.co/Codeseys/composer-replication-framework

	Discussion: open a [Discussion](https://huggingface.co/Codeseys/composer-replication-framework/discussions) on the repo for technical questions, corrections, or collaboration interest.

	---

	Last revised 2026-05-25 (Wave 4: data collator + loss composition smoke).
	This is a living document. v0.1 will incorporate the results of spikes 002–004.