title: >-
  Composer 2.5 from First Principles: An Open Replication Framework + a Novel
  Trace-Replay Distillation Channel
thumbnail: /blog/assets/composer-replication-framework/thumbnail.png
authors:
  - user: Codeseys

Composer 2.5 from First Principles: An Open Replication Framework + a Novel Trace-Replay Distillation Channel

TL;DR. Cursor's Composer 2.5 ships ~5–10× cheaper than peer frontier models, with the bulk of its compute spent on RL post-training rather than pretraining. Its headline trick, "Targeted RL with Textual Feedback", turns out to be mathematically equivalent to a published method (SDPO/OPSD) that has MIT-licensed reference code. Building on that, I'm releasing an open replication framework that integrates three reward channels in a single trainer step (RLVR + SDPO + a novel multi-teacher trace-replay channel) on top of HuggingFace TRL or ByteDance VeRL. This post is pre-experimental: the framework is in place, the integration is verified by 38 unit tests including a 5-step end-to-end gradient run, and the novel channel's economic feasibility is empirically validated at $0.98 per 50-step trace. Training results will come in a follow-up after the GPU-bound spikes run.

What Composer 2.5 actually is, in one paragraph

Cursor announced Composer 2.5 in May 2026. It's a post-trained version of Moonshot's Kimi K2.5 (1T total / 32B active MoE), tuned specifically for agentic coding inside Cursor. The headline numbers: parity with GPT-5.5 on SWE-bench Multilingual, ~69% on Terminal-Bench 2.0, priced at $0.50/$2.50 per million input/output tokens — that's 5–10× cheaper to serve than Opus 4.6 or GPT-5.4. Most of the compute went into post-training, not pretraining. The blog discloses three training innovations and leaves three big reproducibility gaps.

If a small team can reproduce the shape of this recipe without K2.5's 1T scale — say, on Qwen3-7B or Qwen3-32B — that's a path to similar agentic coding capability on a smaller, open base. That's the project this post introduces.

The non-obvious move

Cursor's three named training innovations are:

1. Targeted RL with Textual Feedback: a per-turn distillation loss that addresses long-horizon credit assignment.
2. Synthetic data at 25× scale, including their "Feature Deletion" generator, where the agent has to reimplement deleted code to make tests pass.
3. Sharded Muon + Dual Mesh HSDP: MoE optimizer infrastructure (only relevant at K2.5 scale).

(2) and (3) are well-understood. (1) is the interesting move and it's worth quoting Cursor directly:

"For a target model message, we construct a short hint describing the desired improvement, insert that hint into the local context, and use the resulting model distribution as a teacher. We use the policy with the original context as the student and add an on-policy distillation KL loss that moves the student's token probabilities toward the teacher's."

What's happening: when a 100K-token rollout has a localized error (wrong tool name, style violation, etc.), Cursor doesn't punish the whole trajectory. Instead, they (a) generate a text hint correcting the error, (b) run the model with the hint inserted into the context to get "teacher" logits, (c) run the same model on the original context to get "student" logits, and (d) apply a per-turn KL divergence loss that pulls the student toward the teacher only at that turn. The same model is both teacher and student; the only difference is the hint in the teacher's context.

This sidesteps the credit-assignment nightmare of long-horizon scalar rewards. It's the "make GRPO not poison 100 good steps for 1 bad step" trick.
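To make the per-turn mechanism concrete, here is a minimal framework-free sketch. It is not Cursor's code or the OPSD reference implementation: the function names, the plain-list logits, and the choice of forward KL (teacher relative to student) are all assumptions for illustration. The quoted passage only says the loss moves the student toward the teacher.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) at one token position (direction is an assumption)."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))

def per_turn_kl_loss(teacher_logits, student_logits, turn_mask):
    """Mean KL over the tokens of the corrected turn only.

    teacher_logits: model run with the hint inserted into the context
    student_logits: same model run on the original context
    turn_mask:      1 for tokens inside the targeted turn, 0 elsewhere
    """
    total, count = 0.0, 0
    for t, s, m in zip(teacher_logits, student_logits, turn_mask):
        if m:
            total += kl_divergence(t, s)
            count += 1
    return total / count if count else 0.0

# Two token positions, three-token vocabulary (toy numbers).
teacher = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
student = [[0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]

# Only position 0 belongs to the erroneous turn, so position 1 is ignored
# even though teacher and student disagree there.
loss = per_turn_kl_loss(teacher, student, turn_mask=[1, 0])
```

The mask is what keeps one bad step from penalizing the rest of the rollout: positions outside the targeted turn contribute nothing to this term.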

Wait, this is a published method

Cursor's blog footnote 1 cites three self-distillation papers — and the most relevant one is SDPO: Reinforcement Learning via Self-Distillation (Hübotter et al., ICLR 2026 Workshop). The SDPO paper's abstract:

"SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy."

That's the Cursor mechanism, named and formalized. And the closely-related precursor — OPSD: Self-Distilled Reasoner (Zhao et al.) — has MIT-licensed reference code at siyan-zhao/OPSD.

The single most important takeaway from auditing the Composer 2.5 blog: the secret sauce isn't secret. It's published, it's named, and you can lift the loss function directly into your trainer. The SDPO paper's loss-comparison table puts it bluntly:

Method	Sampling	Signal	Feedback
SFT / Distillation	off-policy	rich	strong teacher
On-Policy Distillation (Agarwal 2024)	on-policy	rich	strong teacher
RLVR / GRPO	on-policy	weak	environment
SDPO (= Cursor's method)	on-policy	rich	environment

What Composer 2.5 actually shows is: SDPO works at production scale on agentic coding. That's an important empirical demonstration even if the algorithm itself is published.

A novel addition: multi-teacher trace-replay distillation (TR-DPO)

If SDPO uses one model as both teacher and student (with the teacher just having a hint inserted into its context), the obvious complementary question is: what if we use N different external pretrained models as teachers? Specifically:

1. After a frozen agentic rollout, replay each step against N teachers from different model families (Claude Opus, GPT-5, DeepSeek V4 Pro, ...).
2. At each step, extract each teacher's chosen action.
3. If most teachers agree on action X but the student picked Y, emit a DPO preference pair: chosen=X, rejected=Y.
4. Train with standard DPO loss on the pair set, layered on top of both RLVR and SDPO.
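The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the repo's `replay.py`: the names `extract_dpo_pair` and `dpo_loss`, the `min_agreement` threshold, counting agreement over successful teacher calls only, and the `dpo_beta` temperature are all hypothetical choices. The loss itself is the standard DPO objective written for a single pair of sequence log-probabilities.

```python
import math
from collections import Counter

def extract_dpo_pair(student_action, teacher_actions, min_agreement=0.5):
    """Return {'chosen', 'rejected'} when teacher consensus disagrees with the student.

    teacher_actions holds one action per teacher; None marks a failed call,
    which is excluded from the vote (an assumption of this sketch).
    """
    valid = [a for a in teacher_actions if a is not None]
    if not valid:
        return None
    action, votes = Counter(valid).most_common(1)[0]
    if votes / len(valid) <= min_agreement:   # no strict majority
        return None
    if action == student_action:              # student already matches consensus
        return None
    return {"chosen": action, "rejected": student_action}

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, dpo_beta=0.1):
    """Standard DPO loss for one preference pair."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = dpo_beta * margin
    # -log(sigmoid(z)) computed stably as softplus(-z)
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Two of three teachers pick "edit_file", the student picked "run_grep".
pair = extract_dpo_pair("run_grep", ["edit_file", "edit_file", "run_grep"])
```

With no preference learned yet (policy equal to reference), the margin is zero and the loss is `log 2`; it falls as the policy raises the chosen action's log-probability relative to the rejected one.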

This isn't competing with SDPO — they're complementary. SDPO uses one model with privileged context; TR-DPO uses N models with no privileged context. The signal sources are different.

The closest published precedents are rStar (single-teacher MCTS counterfactuals), Math-Shepherd (process reward models from rollouts), and Mixture-of-Agents (response-level multi-model aggregation). To my knowledge, no published work systematically replays each step of frozen agentic traces with multiple external teachers to harvest step-level supervision. This appears to be open territory.

The obvious objection: "Won't N teacher API calls per step be expensive?" That's exactly what spike 001 measured.

Spike 001: economic feasibility of the trace-replay channel ($0.98/trace, 5× cap headroom)

Before paying GPU costs to test whether TR-DPO actually improves training, the kill-switch question is: can we afford the teacher API calls at all?

I hand-crafted 50 synthetic SWE-bench-lite-shaped agentic decision states (each ~250–500 tokens of context), and for every state ran parallel async requests to three frontier teachers via OpenRouter:

anthropic/claude-opus-4.7
openai/gpt-5
deepseek/deepseek-v4-pro

That's 150 calls total (50 states × 3 teachers), with a hard spend cap of $20.
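The shape of that measurement loop looks roughly like the sketch below. It is not the spike's actual `replay.py`: `call_teacher` is a hypothetical stand-in that sleeps instead of hitting an API, and its fixed $0.001 cost is made up. What it does show is the structure described above: the three teachers are queried in parallel per step, step latency is the slowest teacher, and the loop stops once spend reaches the cap.

```python
import asyncio
import time

TEACHERS = [
    "anthropic/claude-opus-4.7",
    "openai/gpt-5",
    "deepseek/deepseek-v4-pro",
]
BUDGET_USD = 20.0

async def call_teacher(model, state):
    """Hypothetical stand-in for a real API request; returns (action, cost_usd)."""
    await asyncio.sleep(0.001)
    return f"{model} action for {state}", 0.001

async def replay_trace(states, budget_usd=BUDGET_USD):
    spent, calls, errors, latencies = 0.0, 0, 0, []
    for state in states:
        if spent >= budget_usd:   # hard cap: do not start another step
            break
        start = time.perf_counter()
        # All teachers run concurrently; a failed call comes back as an exception object.
        results = await asyncio.gather(
            *(call_teacher(m, state) for m in TEACHERS),
            return_exceptions=True,
        )
        latencies.append(time.perf_counter() - start)  # max across parallel teachers
        for result in results:
            calls += 1
            if isinstance(result, Exception):
                errors += 1
            else:
                spent += result[1]
    return {"calls": calls, "errors": errors,
            "spent_usd": spent, "latencies": latencies}

summary = asyncio.run(replay_trace([f"state-{i}" for i in range(50)]))
```

Checking the cap before each step means spend can overshoot by at most one step's worth of calls, which is fine for a measurement spike.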

Metric	Target	Actual
Mean per-trace cost (50 steps × 3 teachers, ungated)	< $5	$0.98 ✅
p95 step latency (max across 3 parallel teachers)	< 30 s	20.5 s ✅
p99 step latency	< 60 s	23.2 s ✅
Errors	0	0 / 150 ✅

Per-teacher cost composition: Opus dominates at 83% of the spend ($0.81 of $0.98); the other two teachers together account for the remaining $0.17. With v0.1 VOI gating (only query teachers when student entropy is high), projected per-trace cost falls to ~$0.30. Drop Opus or swap in Sonnet 4.6 and you save another 50–70%.
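A minimal sketch of what entropy gating could look like, assuming the student exposes a probability distribution over candidate actions at each step. The threshold value, the function names, and the uniform per-step cost model are assumptions of this sketch, not details of the planned v0.1 gate.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def gate_steps(step_probs, threshold_nats):
    """Indices of steps where the student is uncertain enough to query teachers."""
    return [i for i, p in enumerate(step_probs) if entropy(p) > threshold_nats]

def gated_cost(ungated_cost, n_steps, n_queried):
    """Projected per-trace cost, assuming every step costs the same."""
    return ungated_cost * n_queried / n_steps

# A confident step is skipped, an uncertain step is sent to the teachers.
queried = gate_steps([[0.98, 0.01, 0.01], [0.4, 0.3, 0.3]], threshold_nats=0.5)

# Under the uniform-cost assumption, querying 15 of 50 steps lands near $0.30.
projected = gated_cost(0.98, n_steps=50, n_queried=15)
```

The 15-of-50 figure is only there to show the arithmetic; the real fraction depends on where the entropy threshold is set and on the student's calibration.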

Channel 3 is economically viable. The full code (synthesize_trace.py, replay.py, analyze.py, plus the 150-call result jsonl) is at spikes/001-teacher-replay-cost/ on the repo.

Spike 005: integration architecture verified by 38 passing tests

The three reward channels — RLVR, SDPO, TR-DPO — need to compose inside a single trainer step. The unified loss:

total_loss = grpo_loss + α · sdpo_kl_loss + β · trace_replay_dpo_loss

α=0, β=0 recovers plain GRPO. Any subset can be ablated by zeroing its weight.
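As a sketch of the ablation property, the composition can be written so that a zero weight skips the channel entirely rather than multiplying its loss by zero. Passing the extra channels as callables is my own design assumption here (the skeleton may compute them differently); it simply makes explicit that an ablated channel costs nothing.

```python
def total_loss(grpo_loss, sdpo_kl_fn, trace_replay_dpo_fn, alpha=0.0, beta=0.0):
    """Compose the three channels; a zero weight skips that channel's computation."""
    loss = grpo_loss
    if alpha != 0.0:
        loss = loss + alpha * sdpo_kl_fn()
    if beta != 0.0:
        loss = loss + beta * trace_replay_dpo_fn()
    return loss

# Toy stand-ins for the two optional channels.
plain_grpo = total_loss(1.0, lambda: 2.0, lambda: 4.0)                        # alpha = beta = 0
all_three = total_loss(1.0, lambda: 2.0, lambda: 4.0, alpha=0.5, beta=0.25)
```

Because the function only uses `+` and `*`, the same code works unchanged on scalar floats or on framework tensors.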

For this to work in practice, you need clean extension points in whatever RL framework you're using. I DeepWiki-audited the major candidates:

Framework	Channel 1 (RLVR)	Channel 2 (SDPO)	Channel 3 (TR-DPO)
TRL	`GRPOTrainer._compute_loss`	Subclass override; lift `generalized_jsd_loss` from OPSD	Subclass override; DPO term using teacher-disagreement pairs
VeRL	`@register_adv_est("grpo")`	New estimator; reads `data.batch["sdpo_teacher_logprobs"]`	Custom estimator reading `data.non_tensor_batch["teacher_actions"]`

TRL is the right v0.0/v0.1 choice (simplest extension, great OpenEnv integration, fits 7B–32B comfortably). VeRL is the right v0.2 choice when scaling to 70B+ on a Ray cluster.

The crucial property: the three channels don't compete for shared resources. Channel 2 is a sparse extra forward pass on the training side (~5% of tokens, at error sites). Channel 3's teacher queries are offline API calls made after the rollout.

I built a working code skeleton at spikes/005-integrated-trainer-skeleton/ implementing both paths. Test results:

$ python3 -m pytest tests/ -v
============================== 38 passed in 3.43s ==============================

Test module	Tests	What it proves
`test_opsd_loss.py`	9	Lifted SDPO loss is differentiable, zero on identical distributions for all β values (the JSD interpolation β, not the channel-3 weight), masking + clipping correct
`test_teacher_replay.py`	7	DPO-pair extraction logic: consensus + threshold + error-call exclusion
`test_data_collator.py`	15	Raw trace → batch dict; hint injection + post-hint masking + DPO tokenization
`test_loss_composition_smoke.py`	7	All three channels compose; α/β=0 ablations recover GRPO; 5-step train decreases loss

The composition smoke test runs all three channels on a 10K-parameter TinyLM. The integration claim — all three channels run simultaneously, ablate cleanly, train without divergence — is now an empirically tested invariant rather than a paper diagram.
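To give a flavor of the invariants in that table, here is a scalar stand-in for the "zero on identical distributions" check. It is not the repo's `test_opsd_loss.py` and not the lifted `generalized_jsd_loss`: the function below is one common form of generalized Jensen-Shannon divergence on plain probability lists, and its weighting convention is an assumption. Note that its `beta` is the JSD interpolation coefficient, unrelated to the β weight in the unified loss.

```python
import math

def generalized_jsd(student_probs, teacher_probs, beta=0.5):
    """beta * KL(teacher || m) + (1 - beta) * KL(student || m),
    where m = beta * teacher + (1 - beta) * student and 0 < beta < 1."""
    def kl(p, q):
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    mixture = [beta * t + (1 - beta) * s
               for s, t in zip(student_probs, teacher_probs)]
    return beta * kl(teacher_probs, mixture) + (1 - beta) * kl(student_probs, mixture)

def test_zero_on_identical_distributions():
    p = [0.7, 0.2, 0.1]
    for beta in (0.1, 0.5, 0.9):
        assert abs(generalized_jsd(p, p, beta)) < 1e-12

def test_positive_on_different_distributions():
    student, teacher = [0.7, 0.2, 0.1], [0.1, 0.2, 0.7]
    for beta in (0.1, 0.5, 0.9):
        assert generalized_jsd(student, teacher, beta) > 0.0
```

Both functions follow the pytest naming convention, so dropping them into a `tests/` file would let the same `python3 -m pytest` command pick them up.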

What I'm NOT claiming

This post is pre-experimental. The full empirical validation — does TR-DPO actually improve SWE-bench-lite pass@1 over plain GRPO at 7B? — is the subject of a follow-up paper after the GPU-bound spikes run:

Spike	What it measures	Status
002a / 002b	Trace collection on Qwen3-7B + SWE-bench-lite via TRL vs PRIME-RL	📋 planned
003	DPO-pair signal density on real traces (≥5 pairs/trace, KL-distance from random)	📋 planned
004	A/B Qwen3-7B trained with GRPO vs GRPO+TR-DPO; success = ≥2 pt pass@1 gain with p<0.05	📋 planned
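For a sense of what the spike 004 success criterion demands, here is a sketch of one way to test it: a one-sided two-proportion z-test. The test choice is an assumption (the plan above does not name one), it treats the two arms as independent samples, and the counts in the example are made up.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """One-sided test that arm B's pass rate exceeds arm A's.

    Returns (difference in percentage points, p-value).
    """
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0.0:                      # both arms all-pass or all-fail
        return 100 * (p_b - p_a), 1.0
    z = (p_b - p_a) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))   # upper-tail normal probability
    return 100 * (p_b - p_a), p_value

# Hypothetical counts: 60/300 solved by the GRPO arm, 66/300 by GRPO+TR-DPO.
gain_pts, p_value = two_proportion_z_test(60, 300, 66, 300)
meets_criterion = gain_pts >= 2.0 - 1e-9 and p_value < 0.05
```

With these made-up numbers the gain is 2 points but the p-value is well above 0.05, which illustrates why a single evaluation pass is unlikely to clear the bar and why the plan budgets multiple training runs. Since both arms are scored on the same instances, a paired test would likely be a better fit than this independent-samples sketch.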

If you have GPU compute and want to run any of these, I'd love a collaborator. Spike 002a is ~$50–100 of Modal A100 time and about half a day of wallclock. Spike 004 (the terminal experiment) is ~$300 GPU + $50 eval over 6 training runs.

Lessons learned during this work

A few methodology lessons worth surfacing because they generalize:

Read primary sources yourself. I initially dispatched a parallel-research subagent to summarize the Composer 2.5 blog. The summary covered the targeted-textual-feedback method correctly but missed Cursor's footnote citing the SDPO/OPSD papers entirely — and added several extrapolations (Anyrun environment name, "85% post-training compute" ratio, specific benchmark scores) that aren't actually in the blog. After I read the blog directly, the integration story changed materially: I went from "we'll need to implement the hint loss from scratch" to "there's MIT-licensed reference code we can lift." I now have an audit notice in the research note flagging which claims are blog-verified vs extrapolated, and a separate COMPOSER_RECIPE_MAPPING.md doc that does the rigorous mapping. This same pattern applies broadly to LLM-driven research synthesis: the orchestrator must verify primary sources, not just relay subagent claims.

Verify framework extension points before designing on them. I used DeepWiki to audit huggingface/trl, volcengine/verl, and siyan-zhao/OPSD directly — getting back actual function names, file paths, and decorator surfaces. The integration architecture document cites each verification. Without that, the design would be plausible-sounding fan-fiction.

Risk-order your spikes. Spike 001 was the kill-switch. If teacher API costs had been $50+/trace, the whole framework would have been dead. I ran it first, before any GPU work. It validated cleanly ($0.98/trace), so the remaining risk is whether the channel improves training, not whether it is affordable.

Pre-experimental publication has trade-offs. I considered waiting for spike 002–004 results before posting anything. I decided against it for three reasons: the integration architecture is independently useful (other groups can plug in different channel-3 ideas), the SDPO/OPSD lift removes work for anyone trying to replicate Composer-style hint distillation, and early community feedback might catch design errors before I burn GPU budget. The cost: someone else may run the experiments first. I'm fine with that trade-off as long as the work is correctly attributed.

What's in the repo

🤗 huggingface.co/Codeseys/composer-replication-framework

composer-replication-framework/
├── README.md                                  ← model card (this post in shorter form)
├── publications/
│   ├── PAPER_v0.md                            ← longform methodology paper (this work)
│   └── BLOG_POST.md                           ← this blog post
├── docs/
│   ├── COMPOSER_RECIPE_MAPPING.md             ← Cursor blog audit, blog-verified vs extrapolated
│   ├── INTEGRATION_ARCHITECTURE.md            ← framework × channel matrix, sequence diagrams
│   ├── METHODOLOGY.md                         ← how the parallel research dispatch was run
│   └── HF_REPO_LAYOUT.md                      ← planned multi-repo split
├── framework/
│   └── composer-replication-framework.md      ← master synthesis (architecture spec)
├── research/                                  ← five deep-dives by five LLM families
│   └── 01..05*.md
└── spikes/
    ├── 001-teacher-replay-cost/               ← ✅ VALIDATED ($0.98/trace)
    ├── 005-integrated-trainer-skeleton/       ← ✅ COMPOSITION-VERIFIED (38/38 tests)
    └── 002a..004/                             ← 📋 planned

All MIT licensed. PRs welcome on the research notes if you find a misattribution.

Acknowledgements

Cursor team for the Composer 2.5 release and naming the technique. Siyan Zhao et al. for OPSD's reference implementation. Hübotter et al. for SDPO's formal treatment. HuggingFace TRL team for the cleanest possible _compute_loss extension surface. ByteDance VeRL team for the HybridFlow architecture. Meta for OpenEnv as the agentic-environment substrate.

If you're working on agentic-coding RL post-training and any of this looks useful, open a discussion on the repo — I'd particularly love to hear from teams running TRL or VeRL at scale who could tell me which extension surfaces I'm misreading.

This post is pre-experimental. v0.1 follow-up will incorporate spike 002–004 results.