Composer 2.5 Replication Framework: A Methodology Paper

🚧 PRE-EXPERIMENTAL DRAFT (v0.0): methodology, economic-feasibility result, and integration architecture. No model-training experiments have been run yet. Every empirical claim in this paper is one of: (1) a citation to upstream literature, (2) a result from spike 001 (teacher-replay cost measurement), or (3) a unit-test invariant from spike 005 (integration smoke). The full empirical validation (spikes 002–004: trace collection, DPO-pair signal density, A/B vs plain GRPO) is the subject of a follow-up paper once GPU budget is committed.

Last updated: 2026-05-25
Author: Codeseys (HF)
Repository: Codeseys/composer-replication-framework
License: MIT (methodology + code); upstream papers and code retain their respective licenses

Abstract

Cursor's Composer 2.5 is a post-trained Kimi K2.5 that achieves frontier agentic-coding performance (~69% Terminal-Bench 2.0, parity with GPT-5.5) at 5–10× lower serving cost than peers. The recipe is dominated by post-training (~85% of compute, by community estimate), and its central non-obvious technique, Targeted RL with Textual Feedback, turns out to be mathematically equivalent to the published method SDPO / OPSD (Hübotter et al. 2026; Zhao et al. 2026), which Cursor's own footnote cites and for which MIT-licensed reference code already exists.

Building on this, we propose a complementary novel reward channel: multi-teacher trace-replay distillation (TR-DPO). After a frozen agentic rollout, we replay each step under N pre-trained external teachers (e.g., Claude Opus, GPT-5, DeepSeek V4 Pro), extract DPO preference pairs from teacher–student disagreement, and add the resulting loss term on top of both RLVR and SDPO. The three channels are bottlenecked on different resources and can be ablated cleanly via independent weights.

We make three pre-experimental contributions: (1) an audited mapping of Cursor's Composer 2.5 blog onto a stack of open infrastructure (TRL / VeRL / OpenEnv / Monarch) with primary-source-verified extension points; (2) an empirical economic feasibility result showing the trace-replay channel costs ~$0.98 per 50-step trace ungated (spike 001, n=150 calls, 0 errors); (3) a working code skeleton with 38 passing unit tests, including an end-to-end gradient-step smoke test on a tiny custom model that empirically verifies the three-channel composition trains without divergence.

This paper deliberately stops short of training-result claims. Section 7 is explicit about what's not yet validated and what each follow-up spike will measure.

1. Introduction and motivation

1.1 What Composer 2.5 demonstrates

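In their May 2026 release post, Cursor announced Composer 2.5, a post-trained version of Moonshot's Kimi K2.5 that powers Cursor's agentic coding mode. Claims associated with the release (Section 2.4 separates what the blog itself states from what comes from secondary sources):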

Substantial improvement over Composer 2 on long-horizon coding tasks
Frontier-level performance: parity with GPT-5.5 on SWE-bench Multilingual; ~69% on Terminal-Bench 2.0
Pricing of $0.50/$2.50 per million input/output tokens — 5–10× cheaper than Opus 4.6 ($5/$25) and GPT-5.4 ($5/$22.50)
Dominant compute share goes to post-training, not pretraining (community estimate: ~85%)

The blog discloses three training innovations (Section 2 expands each):

Targeted RL with Textual Feedback — a per-turn distillation loss that addresses long-horizon credit assignment.
Synthetic data at 25× scale — Feature Deletion + 24 other (unnamed) generators.
Sharded Muon + Dual Mesh HSDP — MoE optimizer infrastructure.

If a small team can reproduce the shape of (1) and (2) without K2.5's 1T scale, the path is open to similar performance on smaller bases. Item (3) is MoE-specific infrastructure and irrelevant for dense-base reproductions.

1.2 What's missing from the public recipe

The blog leaves three significant gaps:

How are hints generated? The blog gives one tool-call template ("Reminder: available tools are…") but says nothing about the generator architecture. Hardcoded templates? A separate model? The same model with an introspection prompt? This is the single largest reproducibility gap.
What RL algorithm sits underneath? The targeted-textual-feedback method is described as "an on-policy distillation KL loss [added to] the broader RL objective over the full trajectory." The "broader RL objective" is unspecified.
What reward-hacking safeguards work in practice? Cursor explicitly mentions failure modes (decompiling Java bytecode, reverse-engineering Python type-checking caches) without disclosing the mitigations beyond "agentic monitoring tools."

1.3 What we propose

This paper makes a methodological contribution: an open-source replication framework that integrates three reward channels in a single trainer step, on top of any HuggingFace base model, using off-the-shelf open-source infrastructure (TRL or VeRL + OpenEnv).

The three channels:

RLVR (Channel 1) — verifiable scalar reward (tests pass, build succeeds). Standard.
Composer hint-distill = SDPO (Channel 2) — single-model self-distillation with hint-conditioned context, lifted from siyan-zhao/OPSD (MIT). This is Cursor's published method.
Trace-replay multi-teacher DPO (Channel 3, novel) — N external teachers replay each step of frozen rollouts; teacher–student disagreement becomes a DPO preference signal.

Channels 2 and 3 are mechanistically different, not competing. Both bypass long-horizon credit assignment, but they tap different supervision sources. Section 4 makes the distinction precise.

We provide:

A primary-source-audited integration architecture across the agentic-RL stack (Section 3, docs/INTEGRATION_ARCHITECTURE.md)
A working code skeleton with 38 passing unit tests (Section 6, spikes/005-integrated-trainer-skeleton/)
An empirical economic feasibility verdict for Channel 3 (Section 5, spikes/001-teacher-replay-cost/verdict.md)
A risk-ordered spike plan whose terminal experiment falsifies or validates the novel claim (Section 7, spikes/README.md)

We do not yet provide training results. That is deferred to a follow-up paper after spike 002 (trace collection on a real agentic environment), spike 003 (DPO-pair signal density on real traces), and spike 004 (A/B comparison on SWE-bench-lite).

2. Composer 2.5 in technical detail (audited)

This section reconstructs Cursor's published recipe with explicit [BLOG-VERIFIED] / [INFERRED] / [EXTRAPOLATED] tags, following the audit in docs/COMPOSER_RECIPE_MAPPING.md. The distinction matters because the initial parallel-research synthesis for this project blurred which claims came from the blog versus secondary sources, and a primary-source audit caught several extrapolations.

2.1 Targeted RL with Textual Feedback `[BLOG-VERIFIED]`

"For a target model message, we construct a short hint describing the desired improvement, insert that hint into the local context, and use the resulting model distribution as a teacher. We use the policy with the original context as the student and add an on-policy distillation KL loss that moves the student's token probabilities toward the teacher's." — Cursor blog

The mechanism, exactly:

Same model acts as both teacher and student. There is no second model.
The teacher is "the policy at this turn, with a hint inserted into the context."
The student is "the policy at this turn, without the hint."
Loss = on-policy KL between next-token distributions: KL(p_teacher || p_student), where both distributions are the softmax of the respective logits at turn t. It is applied only at the problematic turn.
Sits on top of an outer RLVR objective; doesn't replace it.
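The mechanism above can be illustrated with a small, dependency-free sketch. It is not Cursor's or OPSD's implementation: the function names, the list-of-lists logit layout, and the `error_mask` argument are assumptions made for illustration, and a real trainer would compute the same quantity on batched tensors.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def hint_distill_loss(teacher_logits, student_logits, error_mask):
    """Mean KL(teacher || student) over the tokens flagged in error_mask.

    teacher_logits: per-token logits from the policy WITH the hint in context.
    student_logits: per-token logits from the same policy WITHOUT the hint.
    error_mask:     1 for tokens in the problematic turn, 0 elsewhere
                    (hypothetical name, introduced for this sketch).
    """
    losses = [
        kl_divergence(softmax(t), softmax(s))
        for t, s, m in zip(teacher_logits, student_logits, error_mask)
        if m
    ]
    # With no flagged tokens the channel contributes nothing.
    return sum(losses) / len(losses) if losses else 0.0
```

Because teacher and student share weights, the loss is zero whenever the hint does not change the model's next-token distribution, and positive otherwise.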

Cursor's footnote 1 cites three self-distillation papers as background:

OPSD: Self-Distilled Reasoner — On-Policy Self-Distillation for LLMs (Zhao et al., arXiv:2601.18734; code at siyan-zhao/OPSD, MIT). Single LLM as both teacher and student; teacher conditioned on privileged information (e.g., a verified solution), student sees only the question.
SDPO: Reinforcement Learning via Self-Distillation (Hübotter et al., arXiv:2601.20802, ICLR 2026 Workshop on Scaling Post-training). Generalizes OPSD to RL with rich textual feedback. Quote from the abstract: "SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy." This is mathematically the same construct as Composer's targeted-textual-feedback.
Self-Distillation Enables Continual Learning (arXiv:2601.19897).

The OPSD reference implementation provides a self-contained generalized_jsd_loss static method (signature in Section 6) computing JSD/KL between student and teacher logits — directly liftable into any HF Trainer subclass.

2.2 Synthetic data at 25× scale `[BLOG-VERIFIED]`

"Composer 2.5 is trained with 25× more synthetic tasks than Composer 2." — Cursor blog

One named generator: Feature Deletion. Take a repo with comprehensive tests; delete code such that the codebase remains functional but specific testable features are removed. The agent's task is to reimplement the deleted features so the tests pass. Tests = verifiable reward.

The blog also discloses observed reward-hacking failures: the model learning to decompile Java bytecode to reconstruct deleted APIs, and reverse-engineering Python type-checking caches to recover deleted function signatures. Mitigations are described only as "agentic monitoring tools" — opaque.

2.3 Sharded Muon + Dual Mesh HSDP `[BLOG-VERIFIED]`

MoE optimizer infrastructure. Per-attention-head and per-expert orthogonalization (Newton-Schulz) with asynchronous all-to-all communication. Two HSDP layouts: narrow (intra-node) for non-expert weights, wide for expert weights.

This is infrastructure, not algorithm, and is relevant only at MoE-1T scale (Kimi K2.5). For the dense-base reproductions described here (Qwen3-7B in the v0.0 spike, Qwen3-32B in v0.1), it is irrelevant. It becomes relevant only if the framework is later applied directly to a Kimi-K2.5 derivative.

2.4 Claims that are NOT in the Cursor blog (extrapolated)

These claims appear in community commentary and were reproduced uncritically in the initial synthesis for this project. Secondary sources suggest they are likely correct, but Cursor does not state them in the 2.5 blog:

"~85% of total compute is post-training" — community consensus (HN threads, third-party substack analysis). Plausible but unverified.
"Anyrun" environment harness with LSP / file I/O / terminal — name "Anyrun" is not in the 2.5 blog (may be in the Composer 2 technical report).
CursorBench 69.3%, Terminal-Bench 2.0 parity, SWE-bench Multilingual — the 2.5 blog does not quote benchmark numbers.
"PPO or GRPO variant" as the underlying RL algorithm — the blog never names the RL algorithm.
MLA + 1T total / 32B active + 384 experts + 256K context — these are Kimi K2.5 base model facts, verified independently but inferred rather than blog-stated.

The blog is unambiguous on the three items in §2.1–§2.3 and otherwise terse.

3. Integration architecture across the agentic-RL stack

The three reward channels need to compose inside a single trainer step, regardless of which RL framework hosts them. We audited (via DeepWiki on 2026-05-25) the extension points of the major open-source agentic-RL frameworks and produced an integration matrix.

3.1 Frameworks surveyed

Framework	Role	Status
HuggingFace TRL	Reference algorithm library; `GRPOTrainer` is the workhorse for RLVR	Mature (v1.0, 2026-03); developer-friendly; OpenEnv integration since 2025-10
ByteDance VeRL	Production-scale RL via HybridFlow + 3D-HybridEngine; Ray-based	Proven 671B; preferred for ≥70B runs
Meta TorchForge	RL post-training on Monarch; reference recipes	"Development paused — consolidating into TorchTitan" (banner). Use as pattern reference only.
Meta Monarch	Single-controller actor-mesh framework with RDMA data plane	Active; native PyTorch integration
Meta OpenEnv	Standard for agentic environments (typed reset/step/close + MCP RFC)	First-party TRL integration; HF Hub catalog; growing community
Prime Intellect PRIME-RL	Decentralized RL substrate (INTELLECT-2, 32B globally distributed)	Production-deployed; `verifiers` env library

3.2 Extension-point matrix (verified)

The full table is in docs/INTEGRATION_ARCHITECTURE.md. Summary:

Channel	TRL	VeRL
1. RLVR/GRPO	`GRPOTrainer._compute_loss(model, inputs)` (base behavior)	`@register_adv_est("grpo")` → `core_algos.compute_grpo_outcome_advantage`
2. SDPO	Subclass override of `_compute_loss`; lift `generalized_jsd_loss` from OPSD	New `@register_adv_est("grpo_sdpo")`; reads `data.batch["sdpo_teacher_logprobs"]`; precedent: distillation already attaches `teacher_log_probs`
3. TR-DPO	Subclass override; add DPO term using `inputs["dpo_chosen_input_ids"]`, etc.	Custom estimator reading `data.non_tensor_batch["teacher_actions"]`
OpenEnv plumbing	`environment_factory=` kwarg in trainer init	Custom env worker producing `DataProto`-shaped output

Both paths allow the integrated trainer to be assembled out of small components — none of the three channels requires modifying the framework's core. The full unified loss is:

total_loss = grpo_loss + α · sdpo_kl_loss + β · trace_replay_dpo_loss

where α = β = 0 recovers plain GRPO (the baseline arm of any ablation).
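The unified loss and its ablation arms can be written out as plain arithmetic. This is a scalar sketch only: the function name is hypothetical, the default weights are taken from the trainer constructor in Section 6.2, and treating a missing channel as `None` is an assumption that mirrors the mixed-batch behavior described in Section 5.2.

```python
def compose_total_loss(grpo_loss, sdpo_kl_loss=None, replay_dpo_loss=None,
                       alpha=0.1, beta=0.05):
    """total = grpo + alpha * sdpo_kl + beta * trace_replay_dpo.

    A channel is skipped when its weight is zero or when the batch carries
    no inputs for it (represented here as None).
    """
    total = grpo_loss
    if alpha and sdpo_kl_loss is not None:
        total += alpha * sdpo_kl_loss
    if beta and replay_dpo_loss is not None:
        total += beta * replay_dpo_loss
    return total

if __name__ == "__main__":
    # The four ablation arms on the same per-channel loss values.
    arms = {"grpo": (0.0, 0.0), "sdpo_only": (0.5, 0.0),
            "replay_only": (0.0, 0.25), "full": (0.5, 0.25)}
    for name, (a, b) in arms.items():
        print(name, compose_total_loss(2.0, 1.0, 4.0, alpha=a, beta=b))
```

Each arm differs from the baseline only in its weights, so an ablation needs no code changes beyond the two coefficients.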

3.3 Why this matters

The architectural property that lets all three channels co-exist: apart from the shared backward pass, each channel is bottlenecked on a different resource.

Resource	Channel 1 (RLVR)	Channel 2 (SDPO)	Channel 3 (TR-DPO)
GPU forward pass (rollout)	yes (vLLM, async)	no	no
GPU forward pass (training)	yes	one extra per error site (~5% of tokens)	none — uses precomputed logprobs
GPU backward pass	yes	yes	yes
External API budget	none	none	$0.30–1 per 50-step trace
Latency-critical path	yes — gates next rollout	minor	no — async, post-rollout

Channel 2 is forward-pass-bound (training-side, sparse). Channel 3 is API-bound (offline, post-rollout). They don't fight for the same compute.

4. The novel channel: multi-teacher trace-replay distillation

4.1 Construction

After a frozen agentic rollout (state_t, action_t, reward), for each step t:

Replay the exact state at step t against N pre-trained external teachers (different model families).
Extract each teacher's chosen action (or action distribution) at that state.
Score the disagreement: if k ≥ k_threshold of N teachers agree on action X and the student picked Y ≠ X, emit a DPO preference pair (chosen=X, rejected=Y).
Train with standard DPO loss on the pair set, layered onto GRPO + (optionally) SDPO.

This is offline relative to the rollout — teacher API calls happen post-rollout, after which the DPO pairs are batched into the next training step. No latency-critical coupling.
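The disagreement-scoring step can be sketched as a majority vote per step. This is an illustrative reconstruction, not the repository's `teacher_replay.py`: the names `DPOPair` and `extract_pairs_sketch` are hypothetical, and actions are assumed to be already normalized to comparable strings (in practice, tool calls would need canonicalization before they can be compared).

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class DPOPair:
    step: int       # index of the step in the frozen trace
    chosen: str     # action the teacher consensus picked
    rejected: str   # action the student actually took
    votes: int      # number of teachers backing the chosen action

def extract_pairs_sketch(student_actions, teacher_actions, k_threshold=2):
    """Emit a preference pair wherever >= k_threshold teachers agree on an
    action that differs from the student's.

    student_actions: one action per step.
    teacher_actions: one list of N teacher actions per step.
    """
    pairs = []
    for step, (student, teachers) in enumerate(zip(student_actions, teacher_actions)):
        if not teachers:
            continue  # no teacher answered for this step
        # Ties are broken by first occurrence; irrelevant when k_threshold > N/2.
        action, votes = Counter(teachers).most_common(1)[0]
        if votes >= k_threshold and action != student:
            pairs.append(DPOPair(step, chosen=action, rejected=student, votes=votes))
    return pairs

if __name__ == "__main__":
    student = ["ls", "cat a.py", "pytest"]
    teachers = [
        ["ls", "ls", "ls"],                     # consensus matches student: no pair
        ["grep foo", "grep foo", "cat a.py"],   # 2 of 3 disagree with student: pair
        ["pytest", "make test", "tox"],         # no consensus: no pair
    ]
    print(extract_pairs_sketch(student, teachers))
```

Steps where the teachers split, or where the consensus matches the student, produce no pair, which is why pairs per trace is a quantity spike 003 has to measure rather than assume.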

4.2 Distinction from related work

Work	Mechanism	Difference from TR-DPO
rStar / rStar-Math (Microsoft)	MCTS at training time, single teacher branches at each step	TR-DPO replays pre-existing traces (not MCTS); uses N teachers (not single).
Math-Shepherd / OmegaPRM	Process reward models from rollout-and-check	Reward signal is teacher disagreement, not rollout outcomes.
Magpie / OpenThoughts	Synthetic data from one strong teacher	Per-step distillation from N teachers on real traces.
Mixture-of-Agents (Wang et al.)	Multi-teacher response-level aggregation	Per-step (sub-response) aggregation.
Composer SDPO / OPSD	Single-model self-teacher with hint context	TR-DPO uses N external teachers; complementary, not competing.

To our knowledge no published work systematically replays each step of frozen agentic traces with multiple external teachers to harvest step-level supervision. The trace-replay-with-N-teachers construction appears to be open territory.

4.3 Stacking with SDPO

Property	Composer SDPO	TR-DPO (this work)
Number of models	1	N + 1
Teacher source	Same model with privileged context	External pretrained models
Per-step compute	One extra forward pass	None (precomputed)
Per-step API cost	Zero	~$0.02 (3-teacher, ungated)
Privileged signal	Hint text in context	None — teachers see same state
Bypasses long-horizon credit assignment	Yes (per-turn KL)	Yes (per-step DPO)
Published code	Yes — `siyan-zhao/OPSD`	Not yet

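Both add dense per-step signal on top of RLVR. They enter the objective as separate loss terms with separate weights, so each can be tuned or ablated independently; whether their gradients interfere on a real model is not established by this construction alone.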
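For reference, the per-step DPO term for a single preference pair can be sketched on scalar log-probabilities. This follows the standard DPO objective rather than anything specific to this framework; the function name and the `dpo_beta` default of 0.1 are assumptions, and `dpo_beta` is the DPO temperature, distinct from the channel weight β in the unified loss.

```python
import math

def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, dpo_beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of the full action under the
    trainable policy or the frozen reference model.
    """
    margin = dpo_beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(margin)) == softplus(-margin), written in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_pair_loss(-5.0, -5.0, -5.0, -5.0))
    # Policy already prefers the teacher-consensus action: loss drops below log(2).
    print(dpo_pair_loss(-3.0, -7.0, -5.0, -5.0))
```

The reference log-probabilities can be computed once per pair and cached, which is consistent with the table's claim that Channel 3 adds no extra training-time forward pass beyond the policy's own.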

5. Empirical results so far

5.1 Spike 001 — teacher-replay cost floor (✅ VALIDATED)

Question: Given a 50-step agentic-coding trace, what's the API cost and wallclock latency of querying N=3 frontier teachers in parallel for next-action distributions at every step?

Method. We hand-crafted 50 synthetic SWE-bench-lite-shaped agentic states (multi-turn function-call decision points; ~250–500 tokens of context each). For each state we issued parallel async requests to three teachers via OpenRouter:

anthropic/claude-opus-4.7 (Anthropic frontier)
openai/gpt-5 (OpenAI frontier)
deepseek/deepseek-v4-pro (open-weight frontier)

Total: 150 calls. Hard-cap at $20 spend (early abort on overrun).
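The fan-out pattern described above (one parallel request per teacher at each step, with a hard spend cap) can be sketched with `asyncio`. This is not the repository's `replay.py`: `query_teacher` is a stand-in that simulates a call instead of contacting OpenRouter, `BudgetExceeded` and `replay_trace` are hypothetical names, and the per-call costs are the rounded means from the per-teacher breakdown in this section.

```python
import asyncio

# Rounded mean cost per call, taken from the per-teacher breakdown.
TEACHER_COST_USD = {
    "anthropic/claude-opus-4.7": 0.0161,
    "openai/gpt-5": 0.0021,
    "deepseek/deepseek-v4-pro": 0.0013,
}

class BudgetExceeded(RuntimeError):
    """Raised before a step whose projected cost would pass the spend cap."""

async def query_teacher(name, state, cost_usd):
    # Placeholder for a real API request; yields control like network I/O would.
    await asyncio.sleep(0)
    return {"teacher": name, "state": state, "action": f"action-from-{name}", "cost": cost_usd}

async def replay_trace(states, teacher_costs, budget_usd=20.0):
    """Query every teacher in parallel at each step; abort early on overrun."""
    spent = 0.0
    results = []
    step_cost = sum(teacher_costs.values())
    for state in states:
        if spent + step_cost > budget_usd:
            raise BudgetExceeded(f"spent ${spent:.2f}, cap ${budget_usd:.2f}")
        # Step latency is the max over teachers, since the calls run concurrently.
        answers = await asyncio.gather(
            *(query_teacher(name, state, cost) for name, cost in teacher_costs.items())
        )
        spent += sum(a["cost"] for a in answers)
        results.append(answers)
    return results, spent

if __name__ == "__main__":
    steps, total = asyncio.run(replay_trace([f"state-{i}" for i in range(50)], TEACHER_COST_USD))
    print(len(steps), "steps,", sum(len(s) for s in steps), "calls, $", round(total, 3))
```

Checking the cap before each step, rather than after, means a run can never overshoot the budget by more than rounding error in the estimated per-call cost.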

Result.

Metric	Target	Actual	Pass?
Mean per-trace cost (50 steps × 3 teachers, ungated)	< $5	$0.98	✅ 5× headroom
p95 step latency (max across 3 parallel teachers)	< 30 s	20.45 s	✅
p99 step latency	< 60 s	23.24 s	✅
Errors	0 expected	0 / 150	✅

Per-teacher breakdown.

Teacher	n	p50 lat	p95 lat	mean $/call	total $
`anthropic/claude-opus-4.7`	50	3.4 s	4.6 s	$0.0161	$0.81
`openai/gpt-5`	50	5.0 s	10.1 s	$0.0021	$0.11
`deepseek/deepseek-v4-pro`	50	7.1 s	16.2 s	$0.0013	$0.07

Opus dominates per-trace cost (~83%). With v0.1 VOI gating (only query teachers when student entropy is high; typically 60–80% reduction), projected per-trace cost falls to ~$0.30 (a 60–80% reduction of $0.98 gives roughly $0.20–0.39). Opus could optionally be dropped or replaced with Sonnet 4.6 for further savings.

The economic floor is well within budget. Channel 3 is viable.
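The cost figures above can be re-derived from the per-teacher table. The sketch below uses only the rounded mean cost per call reported there; the function name and the `gating_reduction` parameter are illustrative, and the gated figures are projections from the stated 60–80% range, not measurements.

```python
# Rounded mean cost per call from the per-teacher breakdown (USD).
MEAN_COST_PER_CALL = {
    "anthropic/claude-opus-4.7": 0.0161,
    "openai/gpt-5": 0.0021,
    "deepseek/deepseek-v4-pro": 0.0013,
}
STEPS_PER_TRACE = 50

def per_trace_cost(costs, steps=STEPS_PER_TRACE, gating_reduction=0.0):
    """Cost of replaying one trace when a fraction of teacher calls is gated off."""
    return steps * sum(costs.values()) * (1.0 - gating_reduction)

def cost_share(costs, teacher):
    """Fraction of the per-step cost attributable to one teacher."""
    return costs[teacher] / sum(costs.values())

if __name__ == "__main__":
    print("ungated:", round(per_trace_cost(MEAN_COST_PER_CALL), 3))
    print("Opus share:", round(cost_share(MEAN_COST_PER_CALL, "anthropic/claude-opus-4.7"), 3))
    for reduction in (0.6, 0.7, 0.8):
        gated = per_trace_cost(MEAN_COST_PER_CALL, gating_reduction=reduction)
        print(f"gated at {reduction:.0%}:", round(gated, 3))
    no_opus = {k: v for k, v in MEAN_COST_PER_CALL.items() if not k.startswith("anthropic/")}
    print("ungated without Opus:", round(per_trace_cost(no_opus), 3))
```

With the rounded inputs, the ungated total comes to about $0.975, in line with the reported $0.98, and the 60–80% gating band brackets the ~$0.30 projection.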

Full data at spikes/001-teacher-replay-cost/ (synthesize_trace.py + replay.py + analyze.py + verdict.md). Code is reproducible — set OPENROUTER_API_KEY and run python synthesize_trace.py && python replay.py && python analyze.py.

5.2 Spike 005 — integration architecture (✅ COMPOSITION-VERIFIED)

Question: Does the proposed three-channel integration (RLVR + SDPO + TR-DPO) compose cleanly in a real PyTorch trainer? Specifically: are the loss terms additive without unwanted interactions; do α/β=0 ablations correctly recover plain GRPO; and does a multi-channel run actually decrease loss on a tiny model?

Method. A code skeleton at spikes/005-integrated-trainer-skeleton/ implements:

opsd_loss.py — generalized_jsd_loss lifted verbatim from siyan-zhao/OPSD (MIT). Self-contained static method computing JSD/KL between student and teacher logits.
teacher_replay.py — parallel OpenRouter client + DPO-pair extractor (from teacher–student disagreement at the agreement threshold).
hint_generator.py — template-based hint dispatcher keyed by error_kind (v0.1 starter; LLM-driven hints in v0.2).
trl_path/data_collator.py — ComposerDataCollator transforming raw trace + DPO pairs into the trainer batch dict.
trl_path/composer_trainer.py — ComposerReplicationTrainer(GRPOTrainer) with _compute_loss override.
verl_path/composer_adv.py — @register_adv_est("grpo_composer") for VeRL.
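As an illustration of the template-based hint dispatcher listed above, a minimal version might look like the following. The `error_kind` keys, the template wording, and the function name are all invented for this sketch; they are not the contents of `hint_generator.py`, and they are not Cursor's templates, which the blog does not disclose in full.

```python
# Hypothetical error kinds and hint templates, for illustration only.
HINT_TEMPLATES = {
    "unknown_tool": "The tool '{tool}' is not available in this environment. "
                    "Check the tool list before calling.",
    "test_failure": "The last test run failed in {test}. "
                    "Re-read the failing assertion before editing.",
    "bad_path": "The path '{path}' does not exist. List the directory first.",
}

def generate_hint(error_kind, **fields):
    """Return a short hint for a known error kind, or None if there is no
    template or a required field is missing (no hint means no SDPO term)."""
    template = HINT_TEMPLATES.get(error_kind)
    if template is None:
        return None
    try:
        return template.format(**fields)
    except KeyError:
        return None

if __name__ == "__main__":
    print(generate_hint("unknown_tool", tool="run_sql"))
    print(generate_hint("not_a_known_kind"))
```

Returning `None` for unrecognized errors keeps the dispatcher conservative: a turn without a usable hint simply contributes nothing to Channel 2.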

Result: 38/38 unit tests pass in 3.43 s (python3 -m pytest tests/ -v).

Test module	Tests	Status
`test_opsd_loss.py` (lifted OPSD math)	9	✅ all pass
`test_teacher_replay.py` (DPO-pair extraction)	7	✅ all pass
`test_data_collator.py` (raw trace → batch)	15	✅ all pass
`test_loss_composition_smoke.py` (3-channel + ablation + 5-step train)	7	✅ all pass

The composition smoke test runs all three channels on a TinyLM(vocab=64, hidden=32) (~10K parameters). It verifies:

Ablation invariants: α=0, β=0 reduces exactly to GRPO; α-only adds SDPO; β-only adds DPO; full = sum.
Gradient finiteness: Every model parameter receives a finite gradient with all three channels active.
Training behavior: A 5-step train run with all three channels active decreases total loss (overfitting check on a fixed batch). The channels do not actively fight each other.
Robustness to mixed batches: When the data collator emits no SDPO inputs (no error sites in batch), the loss correctly bypasses the SDPO term even with α=1.

This empirically tests the integration claim that was purely architectural in earlier draft material.

5.3 What spike 005 does NOT prove

The smoke test is on a 10K-parameter model with placeholder GRPO loss (cross-entropy on a synthetic target sequence) instead of the real GRPO advantage / group-relative computation. It demonstrates wiring correctness, not training quality. The real training experiments are in spikes 002–004.

6. Implementation

6.1 Lifted SDPO loss

Signature of generalized_jsd_loss, ported verbatim from siyan-zhao/OPSD (MIT licensed; DeepWiki-verified as self-contained). The body is elided here:

def generalized_jsd_loss(
    student_logits: torch.Tensor,    # (B, T, V)
    teacher_logits: torch.Tensor,    # (B, T, V) — same model, hint-conditioned context
    labels: torch.Tensor | None = None,        # -100 = ignore
    beta: float = 0.5,                          # 0=fwd KL, 1=rev KL, 0.5=JSD
    temperature: float = 1.0,
    reduction: str = "batchmean",
    top_k: int | None = None,
    token_clip: float | None = None,
) -> torch.Tensor: ...

Full implementation at spikes/005-integrated-trainer-skeleton/opsd_loss.py.

6.2 TRL trainer subclass

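from trl import GRPOTrainer

class ComposerReplicationTrainer(GRPOTrainer):
    """GRPO trainer with two extra, independently weighted loss channels."""

    def __init__(self, *args, alpha_sdpo=0.1, beta_replay=0.05, **kwargs):
        super().__init__(*args, **kwargs)
        self.alpha_sdpo = alpha_sdpo    # weight of Channel 2 (SDPO hint-distill KL)
        self.beta_replay = beta_replay  # weight of Channel 3 (trace-replay DPO)

    def _compute_loss(self, model, inputs):
        # Channel 1: base GRPO loss from the parent class.
        grpo_loss = super()._compute_loss(model, inputs)
        # Channels 2 and 3: helper methods defined in the full implementation.
        sdpo_kl = self._compute_sdpo_loss(model, inputs)
        replay_dpo = self._compute_trace_replay_loss(model, inputs)
        # alpha_sdpo = beta_replay = 0 recovers plain GRPO.
        return grpo_loss + self.alpha_sdpo * sdpo_kl + self.beta_replay * replay_dpo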

Full implementation at spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py.

6.3 VeRL custom advantage estimator

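from verl.trainer.ppo import core_algos
from verl.trainer.ppo.core_algos import register_adv_est

@register_adv_est("grpo_composer")
def compute_grpo_composer_advantage(token_level_rewards, eos_mask, index, *,
                                     sdpo_teacher_logprobs=None, alpha_sdpo=0.0,
                                     teacher_consensus_prm=None, beta_replay=0.0,
                                     **kwargs):
    # Assumption: compute_grpo_outcome_advantage returns (advantages, returns),
    # as in recent VeRL releases. Check the installed version.
    adv, returns = core_algos.compute_grpo_outcome_advantage(token_level_rewards, eos_mask, index)
    if alpha_sdpo and sdpo_teacher_logprobs is not None:
        # Channel 2: shift toward the hint-conditioned teacher, only at error sites.
        adv = adv + alpha_sdpo * (sdpo_teacher_logprobs - kwargs["old_log_prob"]) * kwargs["sdpo_error_mask"]
    if beta_replay and teacher_consensus_prm is not None:
        # Channel 3: per-token teacher-consensus signal.
        adv = adv + beta_replay * teacher_consensus_prm
    return adv, returns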

Full implementation at spikes/005-integrated-trainer-skeleton/verl_path/composer_adv.py.

6.4 Data collator

ComposerDataCollator consumes a list of TraceExample dicts (with optional dpo_pairs from teacher_replay.extract_dpo_pairs) and emits the batch dict the trainer expects:

Channel 1: input_ids, attention_mask, response_mask, rewards
Channel 2: ctx_teacher_input_ids (with hint inserted at error-turn boundary), sdpo_loss_mask (1 at post-hint tokens, -100 elsewhere)
Channel 3: dpo_chosen_input_ids, dpo_rejected_input_ids, *_response_mask

Full implementation at spikes/005-integrated-trainer-skeleton/trl_path/data_collator.py.
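To make the Channel 1 part of the batch layout concrete, here is a minimal padding collator in plain Python. It is a sketch under stated assumptions, not `ComposerDataCollator`: the example field names (`input_ids`, `response_mask`, `reward`), right-padding, and `pad_id=0` are all illustrative choices, and the Channel 2 and Channel 3 keys are omitted.

```python
def collate_channel1(examples, pad_id=0):
    """Right-pad variable-length traces into rectangular lists.

    Each example is a dict with (assumed) keys:
      input_ids:     token ids for prompt + response
      response_mask: 1 where the token belongs to the model's response
      reward:        scalar verifiable reward for the rollout
    """
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "response_mask": [], "rewards": []}
    for ex in examples:
        n = len(ex["input_ids"])
        if len(ex["response_mask"]) != n:
            raise ValueError("response_mask must align with input_ids")
        pad = max_len - n
        batch["input_ids"].append(ex["input_ids"] + [pad_id] * pad)
        batch["attention_mask"].append([1] * n + [0] * pad)
        # Padding positions never count as response tokens.
        batch["response_mask"].append(ex["response_mask"] + [0] * pad)
        batch["rewards"].append(float(ex["reward"]))
    return batch

if __name__ == "__main__":
    demo = [
        {"input_ids": [5, 6, 7], "response_mask": [0, 1, 1], "reward": 1.0},
        {"input_ids": [8], "response_mask": [1], "reward": 0},
    ]
    print(collate_channel1(demo))
```

A real collator would return tensors and would also carry the teacher-context and DPO fields, but the alignment check between masks and token ids is the part most worth keeping.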

7. What's NOT proven (and how the follow-up spikes will measure it)

This paper is explicit about the gap between "framework + economic feasibility" and "this method works." The following claims are not yet validated:

Claim	Status	Validating spike
TR-DPO improves SWE-bench-lite pass@1 over plain GRPO	Open	Spike 004: A/B Qwen3-7B trained with GRPO vs GRPO+TR-DPO; ≥2 pt pass@1 improvement at p<0.05 over 3 seeds is the success criterion. ~$300 GPU + $50 eval.
Teacher disagreement at the step level carries non-trivial preference signal on real traces	Open	Spike 003: extract DPO pairs from spike-002 traces; report pairs/trace, KL distance from random pairs.
TRL's GRPOTrainer with `environment_factory=` cleanly emits trace JSONL	Open	Spike 002a: 100 rollouts on Qwen3-7B + SWE-bench-lite via TRL.
PRIME-RL produces equivalently clean trace export	Open	Spike 002b: head-to-head with 002a.
The trained variant matches Composer 2.5 quality at 32B scale	Open (v0.1)	Out of scope for this paper. v0.1 follow-up.

Honest framing for what this paper does and does not contribute:

✅ Contributes: an integration architecture verified at the code level, with reusable components, public code, and economic feasibility for the novel channel. A reviewer can use this paper to design and budget a real experiment.
❌ Does not contribute: evidence that the method actually trains better models. That requires the GPU spike. We are deliberately not making that claim until experiments back it.

8. Reward-hacking safeguards (proposed for v0.1)

Cursor's blog mentions specific reward-hacking failures (Java bytecode decompilation, Python type-cache reverse-engineering) without disclosing mitigations. We propose:

Sandbox hardening: disable find, unzip, strings, objdump, and similar introspection tools in the OpenEnv container; clear __pycache__ between rollouts; randomize PYTHONHASHSEED.
Static-analysis monitor: flag rollouts that run unzip or that read or write paths matching __pycache__/, *.pyc, or *.class.
Reward-model penalty: train a small RM on annotated reward-hacking examples; subtract its score from the RLVR signal.

These are pre-experiment proposals; their efficacy is part of the v0.1 follow-up.
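The static-analysis monitor could start as a simple pattern scan over the commands issued in a rollout. The sketch below is one possible starting point, not a tested mitigation: the pattern list is derived from the proposal above, the function name is hypothetical, and a string match on commands will miss accesses made from inside scripts.

```python
import re

# Patterns taken from the proposal: bytecode caches, compiled classes, unzip.
SUSPICIOUS_PATTERNS = [
    re.compile(r"__pycache__/"),
    re.compile(r"\.pyc\b"),
    re.compile(r"\.class\b"),
    re.compile(r"\bunzip\b"),
]

def flag_suspicious_commands(commands):
    """Return the commands in a rollout that match any suspicious pattern."""
    return [
        cmd for cmd in commands
        if any(pattern.search(cmd) for pattern in SUSPICIOUS_PATTERNS)
    ]

if __name__ == "__main__":
    rollout = [
        "ls src",
        "unzip lib/vendor.jar -d /tmp/x",
        "cat app/__pycache__/models.cpython-312.pyc",
        "javap -c build/Foo.class",
        "pytest -q",
    ]
    for cmd in flag_suspicious_commands(rollout):
        print("FLAGGED:", cmd)
```

Flagged rollouts would be candidates for review or for a reward penalty; how flags feed back into training is left open here, as it is in the proposal.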

9. Limitations

Single-snapshot research. All upstream literature surveyed on 2026-05-25. The ecosystem moves fast: Forge may un-pause; OpenEnv may fork; PRIME-RL may consolidate. Re-survey every 6 months.
No primary access to Cursor's pipeline. All Composer 2.5 details come from the public blog. Critical gaps (hint generator architecture, exact RL algorithm) remain.
Trace-replay novelty claim is weak in negation. Absence of evidence in the surveyed literature is not strong evidence of absence. We may have missed adjacent work (e.g., on the "rich-feedback RLVR" axis the SDPO paper introduced).
Economic feasibility ≠ training-improvement evidence. Spike 001 establishes the method is affordable. Whether it produces better models is the spike-004 question.
Pre-experimental publication risks. Releasing a methodology before experimental validation has known failure modes (other groups racing the experiment; methodology errors caught only after publication). We accept this risk in exchange for early-feedback signal from the community.

10. Conclusion

We presented a methodology and integration architecture for replicating Cursor's Composer 2.5 recipe on an open-source stack, plus a novel multi-teacher trace-replay distillation channel. We grounded the Composer-2.5 mechanism in published prior art (OPSD/SDPO with MIT-licensed code), audited integration extension points across TRL/VeRL/OpenEnv with primary-source verification, and empirically validated the economic floor of the novel channel ($0.98/trace, 5× cap headroom) plus the architectural composition claim (38/38 unit tests, 5-step train decreases loss).

The full empirical validation — does TR-DPO actually improve SWE-bench-lite pass@1 over plain GRPO at 7B? — is the subject of the follow-up paper.

We invite collaboration. The repository is public. The integration architecture is component-friendly: any of the three channels can be ablated or substituted independently. If you're working on agentic-coding RL post-training and want to test additional channels (e.g., a fourth using Mixture-of-Agents-style aggregation, or test-time compute), the framework slots that in by adding another loss term with its own weight.

Acknowledgements

This work builds on:

Cursor team for the Composer 2.5 release and its three named training innovations.
Siyan Zhao et al. for OPSD and the open siyan-zhao/OPSD reference implementation (MIT).
Hübotter et al. for SDPO, which formalizes the Cursor mechanism.
HuggingFace TRL team for GRPOTrainer and the OpenEnv integration.
ByteDance VeRL team for the HybridFlow / 3D-HybridEngine architecture.
Meta PyTorch team for Monarch + OpenEnv.
Prime Intellect team for PRIME-RL and the INTELLECT-2 decentralized run.
The five LLM-family research notes in research/ (Gemini 3.1 Pro, DeepSeek V4 Pro, GPT-5, Claude Sonnet 4.6, Kimi K2-Thinking) for cross-family verification of the framework synthesis.

Citation

If you build on this framework, cite as:

@misc{composer-replication-framework-2026,
  author       = {Codeseys},
  title        = {Composer 2.5 Replication Framework: Methodology and Integration Architecture for Open Replication of Cursor's Agentic Coding Recipe},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Codeseys/composer-replication-framework}},
  note         = {Pre-experimental v0.0 draft. Methodology, integration architecture, and economic-feasibility result. Empirical validation in follow-up paper.}
}

Underlying primary sources:

@article{cursor2026composer25,
  title  = {Introducing Composer 2.5},
  author = {{Cursor Team}},
  year   = {2026},
  url    = {https://cursor.com/blog/composer-2-5}
}

@article{zhao2026opsd,
  title   = {Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models},
  author  = {Zhao, Siyan and Xie, Zhihui and Liu, Mengchen and Huang, Jing and Pang, Guan and Chen, Feiyu and Grover, Aditya},
  year    = {2026},
  journal = {arXiv preprint arXiv:2601.18734}
}

@article{hubotter2026sdpo,
  title   = {Reinforcement Learning via Self-Distillation},
  author  = {H{\"u}botter, Jonas and L{\"u}beck, Frederike and Behric, Lejs and Baumann, Anton and Bagatella, Marco and Marta, Daniel and Hakimi, Ido and Shenfeld, Idan and Buening, Thomas Kleine and Guestrin, Carlos and Krause, Andreas},
  year    = {2026},
  journal = {arXiv preprint arXiv:2601.20802},
  note    = {ICLR 2026 Scaling Post-training Workshop}
}

Repository

Methodology document, audited research notes, integration architecture, working code skeleton, spike plan, and all supporting artifacts:

🤗 https://huggingface.co/Codeseys/composer-replication-framework

Discussion: open a Discussion on the repo for technical questions, corrections, or collaboration interest.

Last revised 2026-05-25 (Wave 4: data collator + loss composition smoke). This is a living document. v0.1 will incorporate spike 002–004 results.