Integrate Cursor blog directly + audit research note + add SDPO/OPSD link
User flagged a real gap: I had dispatched a subagent to research Composer 2.5
but never personally read the Cursor blog. Audit findings:
1. The targeted-textual-feedback method = published SDPO (Hübotter et al.,
arXiv:2601.20802, ICLR 2026 Workshop) + OPSD (Zhao et al., arXiv:2601.18734,
code at github.com/siyan-zhao/OPSD, MIT). Cursor cites these directly in
footnote 1 of the blog. The subagent's research note never mentioned them.
This is huge: there's published, MIT-licensed code for Composer's secret sauce.
2. Several claims in research/01-composer-2.5.md are NOT in the Cursor blog
and were extrapolated from secondary sources ("85% post-training compute",
"Anyrun" environment name, specific benchmark scores, "PPO or GRPO variant").
Likely correct via community consensus but not blog-verified.
3. Composer hint-distill (single model, hint-conditioned context) and
trace-replay-distill (N external teachers) are TWO genuinely different
mechanisms, not competing implementations. The original synthesis blurred
them. This commit makes the distinction precise.
Changes:
- NEW docs/COMPOSER_RECIPE_MAPPING.md (16KB) — rigorous stage-by-stage
mapping of Cursor blog onto our spike plan, with [BLOG-VERIFIED] /
[INFERRED] / [EXTRAPOLATED] tags on every claim. Includes:
- The 3 footnote papers cited as primary sources
- Composer hint-distill vs trace-replay-distill comparison table (8 dims)
- Composer recipe -> v0.0/v0.1/v0.2 mapping table
- Why deferring SDPO to v0.1 is the right call
- Concrete v0.1 implementation handles (lift SDPO from siyan-zhao/OPSD)
- research/01-composer-2.5.md: prepended an audit notice flagging which
claims are NOT in the Cursor blog and pointing readers to the mapping doc.
Body preserved unchanged for provenance.
- framework/composer-replication-framework.md: TL;DR row 'reward signal' now
names SDPO/OPSD with arxiv + GitHub links. Trace-replay section now makes
the two-channel distinction crystal clear with cost/cardinality contrast.
- spikes/README.md: 'out of scope for v0.0' section now explains v0.1 SDPO
channel explicitly, with the 4-point rationale for deferring it. Adds the
v0.1 3-arm A/B (RLVR vs RLVR+SDPO vs RLVR+SDPO+trace-replay).
- README.md: headline finding 1 rewritten to identify the published SDPO
paper as the prior art for Composer's secret sauce. Headline 2 now cites
spike 001's empirical economic verdict and the two-channel distinction.
- README.md +10 -10
- docs/COMPOSER_RECIPE_MAPPING.md +155 -0
- framework/composer-replication-framework.md +5 -5
- research/01-composer-2.5.md +10 -0
- spikes/README.md +11 -2
|
@@ -54,24 +54,24 @@ Each of the five research deep-dives was authored by a **different LLM family**
|
|
| 54 |
|
| 55 |
## Headline findings
|
| 56 |
|
| 57 |
-
### 1. Composer 2.5's secret sauce is the *targeted hint-distillation loss*
|
| 58 |
|
| 59 |
-
The 1T MoE base (Kimi K2.5) and the "Feature Deletion" RL environment are the obvious moves. The non-obvious one —
|
| 60 |
|
| 61 |
-
> **
|
| 62 |
|
| 63 |
-
This
|
| 64 |
|
| 65 |
-
### 2. The trace-replay multi-teacher idea is genuinely novel
|
| 66 |
|
| 67 |
-
Closest precedent is rStar-Math (single-teacher MCTS counterfactuals at training time). **Multi-teacher *frozen-trace replay* with disagreement-as-reward is open territory.**
|
| 68 |
|
| 69 |
-
|
| 70 |
|
| 71 |
-
- **Composer hint-distill** =
|
| 72 |
-
- **Trace-replay-distill** = N external teachers pull student at all
|
| 73 |
|
| 74 |
-
|
| 75 |
|
| 76 |
### 3. Recommended stack (verified across all 5 reports)
|
| 77 |
|
|
|
|
| 54 |
|
| 55 |
## Headline findings
|
| 56 |
|
| 57 |
+
### 1. Composer 2.5's secret sauce is the *targeted hint-distillation loss* — and it's published as SDPO
|
| 58 |
|
| 59 |
+
The 1T MoE base (Kimi K2.5) and the "Feature Deletion" RL environment are the obvious moves. The non-obvious one, Cursor's "Targeted RL with Textual Feedback", turns out to be **mathematically the same as the published SDPO method** (Hübotter et al., ICLR 2026 Workshop, [arXiv:2601.20802](https://arxiv.org/abs/2601.20802)), which generalizes OPSD (Zhao et al., [arXiv:2601.18734](https://arxiv.org/abs/2601.18734), [MIT-licensed code at github.com/siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD)). Cursor cites both papers directly in the blog's footnote 1.
|
| 60 |
|
| 61 |
+
> **The mechanism:** when a 100K-token rollout has a localized error, generate a text hint correcting the error, run forward pass *with* the hint to get "Teacher" logits, run forward pass *without* the hint to get "Student" logits, and apply KL divergence loss to pull Student toward Teacher *only at that turn*. **Same model is both teacher and student** — the teacher is just "the model with hint inserted into context."
|
| 62 |
|
| 63 |
+
This sidesteps "GRPO on agentic traces is brittle because one bad step poisons 100 good ones." The biggest reproducibility gap is **how the text hints are generated** — Cursor never tells. v0.1 will use template-based hints first; v0.2 will explore LLM-driven hint generators. See [`docs/COMPOSER_RECIPE_MAPPING.md`](docs/COMPOSER_RECIPE_MAPPING.md) for the rigorous stage-by-stage mapping of Cursor's blog onto our framework.
|
| 64 |
|
| 65 |
+
### 2. The trace-replay multi-teacher idea is genuinely novel — and **economically viable** (verified)
|
| 66 |
|
| 67 |
+
Closest precedent is rStar-Math (single-teacher MCTS counterfactuals at training time). **Multi-teacher *frozen-trace replay* with disagreement-as-reward is open territory.** **Spike 001 (✅ VALIDATED, 2026-05-25)** measured the per-trace cost floor empirically: $0.98/trace ungated (vs. $5 cap), 5x headroom; with VOI gating in v0.1 we project ~$0.30/trace.
|
| 68 |
|
| 69 |
+
**Critical distinction:** Composer's hint-distill (SDPO, single model with hint context) and trace-replay-distill (N external teachers) are **two different mechanisms**, not competing implementations. They stack:
|
| 70 |
|
| 71 |
+
- **Composer hint-distill (SDPO)** = same-model self-teacher with hint context, pulls student at error sites. ~1 extra forward pass. No API cost.
|
| 72 |
+
- **Trace-replay-distill** = N external pretrained teachers, pull student at all steps. ~$0.30/trace with VOI gating. Novel.
|
| 73 |
|
| 74 |
+
v0.1 runs both. v0.0 (current) tests trace-replay alone vs. plain GRPO to falsify the novel claim cheaply.
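
The cost figures in finding 2 can be checked with a few lines of arithmetic. The sketch below uses the $0.98 and $5 figures quoted above; the 70% savings fraction is an assumption (the midpoint of the 60-80% VOI-gating range quoted in the framework document), chosen only to show that it is consistent with the ~$0.30/trace projection. The function names are illustrative.

```python
# Figures quoted in the text above (USD per trace).
UNGATED_COST_PER_TRACE = 0.98  # measured by spike 001
COST_CAP_PER_TRACE = 5.00      # budget cap

# Assumption: midpoint of the 60-80% savings range quoted for VOI gating.
ASSUMED_VOI_SAVINGS = 0.70


def headroom(cost: float, cap: float) -> float:
    """How many times the measured cost fits under the cap."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return cap / cost


def gated_cost(ungated: float, savings_fraction: float) -> float:
    """Per-trace cost after skipping a fraction of teacher queries."""
    if not 0.0 <= savings_fraction <= 1.0:
        raise ValueError("savings_fraction must be in [0, 1]")
    return ungated * (1.0 - savings_fraction)


if __name__ == "__main__":
    print(f"headroom: {headroom(UNGATED_COST_PER_TRACE, COST_CAP_PER_TRACE):.1f}x")
    print(f"gated cost: ${gated_cost(UNGATED_COST_PER_TRACE, ASSUMED_VOI_SAVINGS):.2f}/trace")
```

At the two ends of the quoted range (60% and 80% savings) the same function gives roughly $0.39 and $0.20 per trace, so the projection depends on how aggressive the gate turns out to be.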
|
| 75 |
|
| 76 |
### 3. Recommended stack (verified across all 5 reports)
|
| 77 |
|
|
@@ -0,0 +1,155 @@
|
|
| 1 |
+
# Composer 2.5 Recipe → Replication Framework: Stage-by-Stage Mapping
|
| 2 |
+
|
| 3 |
+
> **Audit date:** 2026-05-25 (post-hoc, after the parallel research dispatch).
|
| 4 |
+
> **Methodology:** Read [Cursor's blog](https://cursor.com/blog/composer-2-5) directly (`mcp_tavily_tavily_extract` advanced mode), then audit `research/01-composer-2.5.md` and `framework/composer-replication-framework.md` against ground truth. Mark every claim as either **`[BLOG-VERIFIED]`** (in the blog), **`[INFERRED]`** (reasonable extrapolation from blog + base-model knowledge), or **`[EXTRAPOLATED]`** (subagent added it, likely correct but not in the blog).
|
| 5 |
+
|
| 6 |
+
This document is the rigorous bridge between Cursor's published recipe and our replication framework. It exists because the initial parallel-research dispatch produced a synthesis that quoted Composer 2.5 at a *high* level but did not rigorously map each Composer stage onto the spike plan.
|
| 7 |
+
|
| 8 |
+
## Composer 2.5's published recipe (3 components, blog-verified)
|
| 9 |
+
|
| 10 |
+
The Cursor blog discusses **only three** training innovations explicitly. Everything else was extrapolated by the subagent. I list the three first, then flag the extrapolations.
|
| 11 |
+
|
| 12 |
+
### 1. **Targeted RL with Textual Feedback** `[BLOG-VERIFIED]`
|
| 13 |
+
|
| 14 |
+
> *"For a target model message, we construct a short hint describing the desired improvement, insert that hint into the local context, and use the resulting model distribution as a teacher. We use the policy with the original context as the student and add an on-policy distillation KL loss that moves the student's token probabilities toward the teacher's."* — Cursor blog
|
| 15 |
+
|
| 16 |
+
**Mechanism, exactly:**
|
| 17 |
+
- **Same model** acts as both teacher and student. Not two separate models.
|
| 18 |
+
- The teacher is "the policy at this turn, *with* a hint inserted into the context."
|
| 19 |
+
- The student is "the policy at this turn, *without* the hint" (the original context).
|
| 20 |
+
- Loss = on-policy KL divergence: `KL( teacher_logits_at_turn_t || student_logits_at_turn_t )`, applied **only at the problematic turn**, not over the full trajectory.
|
| 21 |
+
- Sits **on top of** an outer RLVR (verifiable-reward RL) objective; doesn't replace it.
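
The mechanism above can be sketched in plain Python. This is a minimal illustration, not Cursor's or the SDPO authors' implementation: it assumes the logits from the hint-conditioned pass (teacher) and the original-context pass (student) have already been aligned per token, and it uses a 0/1 `turn_mask` (a hypothetical name) to restrict the loss to the flagged turn. In a real trainer the teacher logits would be treated as constants, with gradients flowing only through the student pass.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(teacher_logits: Sequence[float], student_logits: Sequence[float]) -> float:
    """KL(teacher || student) for a single token position."""
    log_p = log_softmax(teacher_logits)  # teacher: policy with hint in context
    log_q = log_softmax(student_logits)  # student: policy with original context
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def turn_kl_loss(
    teacher_logits: Sequence[Sequence[float]],
    student_logits: Sequence[Sequence[float]],
    turn_mask: Sequence[int],
) -> float:
    """Mean per-token KL over the tokens of the flagged turn only."""
    selected = [
        token_kl(t, s)
        for t, s, keep in zip(teacher_logits, student_logits, turn_mask)
        if keep
    ]
    return sum(selected) / len(selected) if selected else 0.0


if __name__ == "__main__":
    teacher = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    student = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    # Only the first token belongs to the problematic turn.
    print(turn_kl_loss(teacher, student, turn_mask=[1, 0]))
```

Masking is what distinguishes this from a full-trajectory distillation loss: tokens outside the flagged turn contribute nothing, so the outer RLVR objective remains the only signal there.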
|
| 22 |
+
|
| 23 |
+
**Cited prior art** (Cursor's footnote 1):
|
| 24 |
+
- **OPSD: Self-Distilled Reasoner — On-Policy Self-Distillation for LLMs** (Zhao et al., 2026, [arXiv:2601.18734](https://arxiv.org/abs/2601.18734), [GitHub: siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD)). The original on-policy-self-distillation framework: single LLM, teacher conditioned on privileged information (e.g. ground-truth answer), student sees only the question, loss = per-token KL on student's own rollouts.
|
| 25 |
+
- **SDPO: Reinforcement Learning via Self-Distillation** (Hübotter et al., 2026, [arXiv:2601.20802](https://arxiv.org/abs/2601.20802), ICLR 2026 Scaling Post-training Workshop). Generalizes OPSD to RL with rich feedback: *"SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy."* This is **mathematically the same** as Composer's targeted-textual-feedback method. **There is published code.** Comparison table from the SDPO paper:
|
| 26 |
+
|
| 27 |
+
| Method | Sampling | Signal | Feedback |
|
| 28 |
+
|---|---|---|---|
|
| 29 |
+
| SFT / Distillation (Hinton 2015) | off-policy | rich | strong teacher |
|
| 30 |
+
| On-Policy Distillation (Agarwal 2024) | on-policy | rich | strong teacher |
|
| 31 |
+
| RLVR / GRPO (Lambert 2025) | on-policy | weak | environment |
|
| 32 |
+
| **SDPO (this paper / Composer)** | **on-policy** | **rich** | **environment** |
|
| 33 |
+
|
| 34 |
+
- **Self-Distillation Enables Continual Learning** ([arXiv:2601.19897](https://arxiv.org/abs/2601.19897)).
|
| 35 |
+
|
| 36 |
+
**Key reproducibility gap (still unsolved):** *How are the hints generated?* The blog gives one example template ("Reminder: Available tools are…") but doesn't say whether hints come from hardcoded templates, a separate model (Opus?), the same model with an introspection prompt, or a learned hint generator. **This is the single most important open question for replication.**
|
| 37 |
+
|
| 38 |
+
### 2. **Synthetic data at 25× scale** `[BLOG-VERIFIED]`
|
| 39 |
+
|
| 40 |
+
> *"Composer 2.5 is trained with 25x more synthetic tasks than Composer 2."* — Cursor blog
|
| 41 |
+
|
| 42 |
+
- **Feature Deletion** is one named approach: take a repo with passing tests, delete some code, ask the agent to reimplement to pass tests. Tests = verifiable reward.
|
| 43 |
+
- The blog explicitly mentions reward-hacking failures: model decompiled Java bytecode, reverse-engineered Python type-checking caches, to recover deleted APIs. This is a **real risk**, not theoretical.
|
| 44 |
+
- "Agentic monitoring tools" are mentioned as the mitigation, but no specifics.
|
| 45 |
+
|
| 46 |
+
### 3. **Sharded Muon + dual mesh HSDP** `[BLOG-VERIFIED]`
|
| 47 |
+
|
| 48 |
+
> *"For continued pretraining, we use Muon with distributed orthogonalization. After forming the momentum update, we run Newton-Schulz at the model's natural granularity: per attention head for attention projections, and per expert for stacked MoE weights."* — Cursor blog
|
| 49 |
+
|
| 50 |
+
- Two HSDP layouts: narrow (intra-node) for non-expert weights, wide for expert weights.
|
| 51 |
+
- Blackwell-optimized. CP=2 + EP=8 on 8 GPUs (instead of 16 in shared mesh).
|
| 52 |
+
- Optimizer step time on 1T model: **0.2 s**.
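
To make "Newton-Schulz at the model's natural granularity" concrete, here is a small pure-Python sketch. It is an illustration under stated assumptions, not Cursor's code: it uses the classic cubic Newton-Schulz iteration (public Muon implementations generally use a tuned higher-order polynomial), and `orthogonalize_per_block` is a hypothetical helper that applies the iteration separately to each block, where a block stands in for one attention head's projection or one expert's weight matrix.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]


def newton_schulz(g: Matrix, steps: int = 15) -> Matrix:
    """Approximate the orthogonal polar factor of g.

    Cubic iteration X <- 1.5*X - 0.5*X*X^T*X after scaling by the
    Frobenius norm, which puts every singular value in (0, 1].
    """
    norm = math.sqrt(sum(v * v for row in g for v in row))
    if norm == 0.0:
        return [list(row) for row in g]
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * x[i][j] - 0.5 * xxtx[i][j] for j in range(len(x[0]))]
            for i in range(len(x))
        ]
    return x


def orthogonalize_per_block(blocks: List[Matrix], steps: int = 15) -> List[Matrix]:
    """Orthogonalize each block (e.g. one head or one expert) independently."""
    return [newton_schulz(block, steps) for block in blocks]


if __name__ == "__main__":
    momentum_blocks = [[[3.0, 1.0], [1.0, 2.0]], [[0.5, -1.0], [2.0, 0.25]]]
    for q in orthogonalize_per_block(momentum_blocks):
        print(matmul(q, transpose(q)))  # close to the identity matrix
```

Working per block rather than on the full stacked tensor keeps each orthogonalization small and independent, which is presumably what makes the step easy to shard; the blog does not give further detail.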
|
| 53 |
+
|
| 54 |
+
This is **infrastructure, not algorithm**. It only matters at MoE-1T scale; for our v0.0 (Qwen3-7B dense) and v0.1 (Qwen3-32B dense) it's irrelevant. Becomes relevant if we ever train a Kimi-K2.5-derivative directly.
|
| 55 |
+
|
| 56 |
+
## What the subagent added beyond the blog (`[EXTRAPOLATED]` and `[INFERRED]`)
|
| 57 |
+
|
| 58 |
+
`research/01-composer-2.5.md` introduced these claims that are **NOT in the Cursor blog**. Most are likely correct from secondary sources, but they are not blog-verified.
|
| 59 |
+
|
| 60 |
+
| Claim | Source basis | Verdict |
|
| 61 |
+
|---|---|---|
|
| 62 |
+
| "85% of total compute is post-training" | `[EXTRAPOLATED]` — likely from secondary commentary (Jake Handy substack, HN thread cited by subagent). | **Plausible but unverified.** Cursor doesn't publish the ratio. Treat as community consensus, not Cursor-stated. |
|
| 63 |
+
| Anyrun environment harness with LSP/file-I/O/terminal | `[EXTRAPOLATED]` — name "Anyrun" doesn't appear in the 2.5 blog (may be in the Composer-2 technical report). | **Plausible** — Cursor 2.5 does say "asynchronous, sandboxed real-world coding environments" which is consistent. But "Anyrun" as a brand name isn't sourced from the 2.5 post. |
|
| 64 |
+
| MLA + 1T/32B active + 384 experts + 256K ctx | `[INFERRED]` from Kimi K2.5 base-model knowledge. The blog only says "built on Kimi K2.5". | **Verified independently** via [Moonshot's K2.5 model card](https://huggingface.co/moonshotai). Correct. |
|
| 65 |
+
| CursorBench 69.3%, Terminal-Bench 2.0 parity, SWE-bench Multilingual | `[EXTRAPOLATED]` — blog doesn't quote benchmarks. | **Source unclear.** Probably from Cursor's launch comms / Twitter thread / a different blog post. Don't cite as 2.5-blog-verified. |
|
| 66 |
+
| "PPO or GRPO variant" | `[EXTRAPOLATED]` — blog never names the RL algorithm. | **Educated guess.** Composer 2 technical report likely says; the 2.5 blog does not. The cited SDPO paper sits *on top of* an unspecified RLVR algorithm, so this is still open. |
|
| 67 |
+
| "Continued pretraining on heavily code-weighted data" | `[BLOG-VERIFIED]` — blog says exactly this in the Sharded Muon section ("For continued pretraining…"). | Verified. |
|
| 68 |
+
| "Behavioral aspects: communication style, effort calibration" | `[BLOG-VERIFIED]` — blog mentions improving these and notes existing benchmarks don't capture them. | Verified, but blog doesn't say *how* they're trained. The targeted-textual-feedback method is presumably also used here. |
|
| 69 |
+
|
| 70 |
+
## Mapping each blog component → our replication framework
|
| 71 |
+
|
| 72 |
+
| Composer 2.5 stage | Blog mechanism | Our replication target | v0.0 | v0.1 | v0.2 |
|
| 73 |
+
|---|---|---|---|---|---|
|
| 74 |
+
| **(a)** Continued pretraining on code | Standard pretraining, code-weighted | Skip — start from already-code-tuned `Qwen3-Coder-7B` or `Qwen3-Coder-30B-A3B` | ✗ | ✗ | ✗ |
|
| 75 |
+
| **(b)** Synthetic data at scale | Feature Deletion + 24 other (unnamed) generators | Build 1 generator (Feature Deletion) as OpenEnv-compatible env. Use SWE-bench-lite and SWE-Gym as drop-in alternatives. | ✗ (use SWE-bench-lite only) | ✓ (build Feature Deletion) | scale generator suite |
|
| 76 |
+
| **(c)** Realistic-environment RL (RLVR) | Async sandboxes, same tool harness as production | TRL `GRPOTrainer` + verifiers + OpenEnv; SWE-bench-lite env in v0.0; build sandboxed code execution env in v0.1 | ✓ baseline | ✓ + DAPO patches | + decentralized rollouts |
|
| 77 |
+
| **(d)** Targeted RL w/ textual feedback (Composer's secret sauce) | Same-model self-distill: insert hint into context → teacher; original → student; on-policy KL at the turn | **Lift the OPSD/SDPO loss directly from `siyan-zhao/OPSD`** (published code, MIT). Generate hints via templates (v0.1) or LLM (v0.2). | ✗ (deferred) | ✓ (this is the Composer-recipe channel) | + learned hint generator |
|
| 78 |
+
| **(e)** Trace-replay multi-teacher distill (NOVEL — our addition) | N/A (not in Composer) | N=3 teachers (Opus 4.7, GPT-5, DeepSeek V4 Pro) replay each step; disagreement → DPO pairs | ✓ (this is the v0.0 novelty bet) | ✓ + VOI gating | + tiered teachers |
|
| 79 |
+
| **(f)** Sharded Muon / dual-mesh HSDP | MoE optimizer infra | Skip until we go to MoE bases — irrelevant for dense Qwen3-{7,32}B | ✗ | ✗ | ✗ (only if MoE base) |
|
| 80 |
+
| **(g)** Reward-hacking safeguards | "Agentic monitoring tools" — unspecified | Static analysis + bytecode-cache-deletion + a sandboxed shell with no `find` / `strings` / `unzip` access in the env | ✗ (small surface) | ✓ (build the monitor) | + RM-based penalty |
|
| 81 |
+
|
| 82 |
+
## Critical relationship: Composer hint-distill vs. trace-replay-distill
|
| 83 |
+
|
| 84 |
+
These are **two different mechanisms**, not competing implementations of the same idea. Initial framework synthesis blurred them; this section makes the distinction precise.
|
| 85 |
+
|
| 86 |
+
| Property | Composer hint-distill (= SDPO/OPSD) | Trace-replay multi-teacher (NOVEL — ours) |
|
| 87 |
+
|---|---|---|
|
| 88 |
+
| Number of models | **1** (same model is teacher + student) | **N+1** (frozen N teachers + 1 trainable student) |
|
| 89 |
+
| What "teacher" means | Student-with-hint-in-context | External pretrained models from other labs |
|
| 90 |
+
| Per-step cost | ~1 extra forward pass (cheap) | N teacher API calls (~$0.02/step at N=3 per spike 001) |
|
| 91 |
+
| Privileged information | Hint text in context | None — teachers see same state student sees |
|
| 92 |
+
| Source of hint / privileged info | **Open question.** Templates? LLM judge? | Not applicable |
|
| 93 |
+
| Relationship to RLVR | Adds dense per-turn signal *on top of* RLVR scalar reward | Same — adds dense per-step signal on top of RLVR |
|
| 94 |
+
| Bypasses long-horizon credit assignment? | Yes (per-turn KL) | Yes (per-step DPO/PRM) |
|
| 95 |
+
| Published code? | **Yes — `siyan-zhao/OPSD` (MIT)** | Not yet — we're building it |
|
| 96 |
+
| Novel in the framework? | No — this is Composer's published recipe | **Yes — the v0.0 research bet** |
|
| 97 |
+
|
| 98 |
+
**Both channels stack on the same RLVR base.** The full v0.1 trainer has THREE reward channels:
|
| 99 |
+
|
| 100 |
+
1. **RLVR** (verifiable scalar reward — tests pass / build succeeds). Ground truth, never skipped.
|
| 101 |
+
2. **Composer hint-distill** = SDPO loss (one extra forward pass per error site, hint-conditioned).
|
| 102 |
+
3. **Trace-replay-distill** = DPO/PRM from N external teachers (~$0.30/trace with VOI gating, our novelty bet).
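
One way to picture the three channels is as a weighted sum, following the `α * GRPO_loss + β * SDPO_KL_at_error_turns + γ * trace_replay_DPO_loss` form given in the v0.1 implementation handles. The sketch below is an assumption-laden illustration: `ChannelWeights`, `combined_loss`, and the weight values in `ARMS` are hypothetical placeholders for the ablation arms, not tuned settings.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ChannelWeights:
    alpha: float = 1.0  # channel 1: RLVR / GRPO loss
    beta: float = 0.0   # channel 2: SDPO KL at error turns
    gamma: float = 0.0  # channel 3: trace-replay DPO loss


def combined_loss(
    grpo_loss: float,
    sdpo_kl: float,
    trace_replay_dpo: float,
    w: ChannelWeights,
) -> float:
    """Weighted sum of the three per-batch channel losses."""
    return w.alpha * grpo_loss + w.beta * sdpo_kl + w.gamma * trace_replay_dpo


# Ablation arms named in this document; the non-zero weights are placeholders.
ARMS: Dict[str, ChannelWeights] = {
    "RLVR": ChannelWeights(alpha=1.0),
    "RLVR+SDPO": ChannelWeights(alpha=1.0, beta=1.0),
    "RLVR+trace-replay": ChannelWeights(alpha=1.0, gamma=1.0),
    "RLVR+SDPO+trace-replay": ChannelWeights(alpha=1.0, beta=1.0, gamma=1.0),
}

if __name__ == "__main__":
    for name, weights in ARMS.items():
        print(name, combined_loss(2.0, 3.0, 5.0, weights))
```

Setting a weight to zero switches a channel off, so the v0.0 A/B (channel 3 against channel 1) and the later multi-arm comparisons are all instances of the same function.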
|
| 103 |
+
|
| 104 |
+
In v0.0 we test channel 3 in isolation against channel 1 (the spike 004 A/B). We deliberately defer channel 2 to v0.1 to keep the v0.0 experiment small.
|
| 105 |
+
|
| 106 |
+
## Why deferring Composer hint-distill to v0.1 is the right call
|
| 107 |
+
|
| 108 |
+
I considered adding hint-distill to v0.0 to do a 4-arm A/B (RLVR / RLVR+SDPO / RLVR+trace-replay / RLVR+SDPO+trace-replay). Decided against it for v0.0 because:
|
| 109 |
+
|
| 110 |
+
1. **The novel claim is trace-replay.** The Composer recipe is already published; SDPO is already published with code. Validating SDPO at 7B is engineering, not novel research.
|
| 111 |
+
2. **The hint-generator open question is unresolved.** Without that, an SDPO arm is "SDPO with hardcoded tool-name templates" which is the easy case and doesn't validate the harder behavior cases (style, communication).
|
| 112 |
+
3. **Spike 001's economic verdict only gates the trace-replay channel.** SDPO has no per-step API cost — it's just an extra forward pass on the same GPU. Different cost model.
|
| 113 |
+
4. **A 4-arm A/B at 7B costs ~$600 vs. ~$300 for the 2-arm.** Not worth it for v0.0.
|
| 114 |
+
|
| 115 |
+
v0.1 will have the full 4-arm (or at least 3-arm: RLVR / RLVR+SDPO / RLVR+SDPO+trace-replay) at 32B once we know v0.0's trace-replay verdict.
|
| 116 |
+
|
| 117 |
+
## Implementation handles for v0.1 (concrete starting points)
|
| 118 |
+
|
| 119 |
+
When we get to v0.1, the **Composer hint-distill channel** has a clear engineering path:
|
| 120 |
+
|
| 121 |
+
1. **Lift the SDPO loss math from `siyan-zhao/OPSD`.** MIT licensed, ICLR 2026 paper, exact same mechanism Cursor uses. Their code targets HuggingFace transformers; should slot into TRL's GRPO or PRIME-RL with ~50 LoC of glue.
|
| 122 |
+
2. **Hint generator v1: hardcoded templates.** Pattern-match on tool-call errors:
|
| 123 |
+
- `"Tool not found: X"` → hint = `"Reminder: Available tools are: <list of valid tools>"`
|
| 124 |
+
- `"JSONDecodeError: ..."` → hint = `"Reminder: tool arguments must be valid JSON"`
|
| 125 |
+
- `"Type error in args"` → hint = `"Reminder: <tool-name> expects args matching schema: <schema>"`
|
| 126 |
+
This handles the "tool call error" case from Cursor's blog example. Style and communication hints are harder; defer those to v0.2 with an LLM-based hint generator.
|
| 127 |
+
3. **Apply only at error sites,** not every turn. Detect via:
|
| 128 |
+
- Failed tool calls (status != ok)
|
| 129 |
+
- Exception traces in tool output
|
| 130 |
+
- Optional: a lightweight judge model flagging "this turn was wasteful" (matches Cursor's "communication style" use case)
|
| 131 |
+
4. **Loss = `α * GRPO_loss + β * SDPO_KL_at_error_turns + γ * trace_replay_DPO_loss`.** Ablate `(α, β, γ)`.
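
Steps 2 and 3 above can be sketched as a small template matcher plus an error-site detector. This is a hedged illustration only: the function names, the `"ok"` status string, and the Python `Traceback` marker used to spot exception traces are assumptions for the example, and the hint wording simply reuses the templates listed above.

```python
from typing import Optional, Sequence

# Assumption: tool output embeds Python-style tracebacks when a tool raises.
TRACEBACK_MARKER = "Traceback (most recent call last)"


def is_error_site(status: str, tool_output: str) -> bool:
    """Flag a turn if the tool call failed or its output holds an exception trace."""
    return status != "ok" or TRACEBACK_MARKER in tool_output


def template_hint(
    error_text: str,
    valid_tools: Sequence[str],
    tool_name: Optional[str] = None,
    schema: Optional[str] = None,
) -> Optional[str]:
    """Map a tool-call error message to a hint string, or None if no template fits."""
    if "Tool not found" in error_text:
        return "Reminder: Available tools are: " + ", ".join(valid_tools)
    if "JSONDecodeError" in error_text:
        return "Reminder: tool arguments must be valid JSON"
    if "Type error in args" in error_text and tool_name and schema:
        return f"Reminder: {tool_name} expects args matching schema: {schema}"
    return None  # no template applies; skip the SDPO term for this turn


if __name__ == "__main__":
    tools = ["read_file", "run_tests"]
    error = "Tool not found: grep_files"
    if is_error_site("error", error):
        print(template_hint(error, tools))
```

Returning `None` when no template matches keeps the SDPO term restricted to turns where a concrete hint exists, which matches the "apply only at error sites" rule.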
|
| 132 |
+
|
| 133 |
+
## Implementation handles for v0.2 (decentralized scale)
|
| 134 |
+
|
| 135 |
+
If v0.1 validates and we scale, here's what each Composer-stage maps to in a multi-cluster setting:
|
| 136 |
+
|
| 137 |
+
- **Continued pretraining:** Pretrained checkpoint already exists (Qwen3-32B); skip.
|
| 138 |
+
- **Synthetic data:** Generators run on CPU pool, producing OpenEnv tasks pushed to a shared queue. Embarrassingly parallel.
|
| 139 |
+
- **Realistic-env RL:** PRIME-RL's orchestrator/trainer/inference split, vLLM↔FSDP2 weight broadcast (SHARDCAST). v0.2 adds Streaming DiLoCo outer loop only when training spans clusters.
|
| 140 |
+
- **Targeted hint-distill:** Compute is local to each trainer — no decentralization complication.
|
| 141 |
+
- **Trace-replay-distill:** Teacher API calls are independent — embarrassingly parallel across rollout workers. VOI gating becomes more important to control cost at scale.
|
| 142 |
+
- **Sharded Muon / dual-mesh HSDP:** Only if we adopt MoE base. For dense 32B, FSDP2 is fine.
|
| 143 |
+
|
| 144 |
+
## Citations (updated)
|
| 145 |
+
|
| 146 |
+
Primary sources for each Composer-2.5 component, post-audit:
|
| 147 |
+
|
| 148 |
+
- **Cursor blog** — [Introducing Composer 2.5](https://cursor.com/blog/composer-2-5) (2026)
|
| 149 |
+
- **Cursor blog** — [Composer 2 technical report](https://cursor.com/blog/composer-2-technical-report) (predecessor; named the "Anyrun" environment per subagent — verify if needed)
|
| 150 |
+
- **OPSD paper** — Zhao et al., *Self-Distilled Reasoner: On-Policy Self-Distillation for LLMs*, [arXiv:2601.18734](https://arxiv.org/abs/2601.18734), code at [siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD). MIT.
|
| 151 |
+
- **SDPO paper** — Hübotter et al., *Reinforcement Learning via Self-Distillation*, [arXiv:2601.20802](https://arxiv.org/abs/2601.20802), ICLR 2026 Scaling Post-training Workshop. The direct formalization of Composer's hint-distill.
|
| 152 |
+
- **Self-Distillation continual-learning** — [arXiv:2601.19897](https://arxiv.org/abs/2601.19897). Cited by Cursor; less directly relevant.
|
| 153 |
+
- **Moonshot Kimi K2.5** — base model, [HF model card](https://huggingface.co/moonshotai/Kimi-K2-Thinking).
|
| 154 |
+
|
| 155 |
+
The methodology mapping in this document supersedes vague claims in `research/01-composer-2.5.md` where the two conflict; that file is preserved unchanged for provenance (a snapshot of the parallel-research dispatch output) but should not be cited as ground truth on its own.
|
|
@@ -13,7 +13,7 @@
|
|
| 13 |
| **Training framework** | **PRIME-RL** (Prime Intellect) as substrate; **TRL** for algorithm correctness; borrow **VeRL's 3D-HybridEngine** patterns | PRIME-RL ships the orchestrator/trainer/inference split + decentralized story; TRL has cleanest GRPO+OpenEnv; VeRL's reshard logic is the production reference |
|
| 14 |
| **Distributed sync** | **PRIME-RL's vLLM↔FSDP2 weight broadcast (SHARDCAST)** for v0.1; bolt on **Streaming DiLoCo** outer loop only when scaling beyond one cluster | DiLoCo isn't useful when training fits one node. Add it when going multi-DC. |
|
| 15 |
| **Environments** | **OpenEnv + verifiers (Hub-hosted)** with Cursor-style "Anyrun" sandboxes | OpenEnv is the emerging standard; HF + Meta backing; MCP tool-calling RFC landing |
|
| 16 |
-
| **Reward signal** | Three-channel: (1) RLVR (tests pass), (2) **
|
| 17 |
| **Trace-replay novelty** | Genuinely under-explored. Closest precedent: rStar-Math (single-teacher MCTS counterfactuals). Multi-teacher *frozen-trace replay* is open territory | Worth publishing if it works |
|
| 18 |
| **Orchestration** | Monarch (when it matures) or Ray (today) for the actor mesh; **OpenEnv** for the env contract | Forge has been "development-paused" — borrow patterns, don't depend on it |
|
| 19 |
|
|
@@ -119,12 +119,12 @@ From `05-trace-replay-distillation.md`:
|
|
| 119 |
3. Get N candidate `action_t` distributions.
|
| 120 |
4. Use disagreement / agreement as a **per-step reward signal** for the student model.
|
| 121 |
|
| 122 |
-
**This stacks beautifully with Composer's hint-distillation.** Composer's hint-distill
|
| 123 |
|
| 124 |
-
- Composer's hint-loss = **
|
| 125 |
-
- Trace-replay-loss = **N external teachers pull student
|
| 126 |
|
| 127 |
-
These are *complementary*, not competing. Both give per-step KL signals that bypass the long-horizon credit assignment problem.
|
| 128 |
|
| 129 |
**Cost mitigation** (the report does this analysis well):
|
| 130 |
- VOI gating (only query teachers when student entropy is high) → 60-80% savings
|
|
|
|
| 13 |
| **Training framework** | **PRIME-RL** (Prime Intellect) as substrate; **TRL** for algorithm correctness; borrow **VeRL's 3D-HybridEngine** patterns | PRIME-RL ships the orchestrator/trainer/inference split + decentralized story; TRL has cleanest GRPO+OpenEnv; VeRL's reshard logic is the production reference |
|
| 14 |
| **Distributed sync** | **PRIME-RL's vLLM↔FSDP2 weight broadcast (SHARDCAST)** for v0.1; bolt on **Streaming DiLoCo** outer loop only when scaling beyond one cluster | DiLoCo isn't useful when training fits one node. Add it when going multi-DC. |
|
| 15 |
| **Environments** | **OpenEnv + verifiers (Hub-hosted)** with Cursor-style "Anyrun" sandboxes | OpenEnv is the emerging standard; HF + Meta backing; MCP tool-calling RFC landing |
|
| 16 |
+
| **Reward signal** | Three-channel: (1) RLVR (tests pass), (2) **Composer hint-distill = SDPO/OPSD** (single-model, hint-conditioned self-teacher), (3) **Trace-replay multi-teacher PRM** (your novel idea — N external teachers) | Composer's hint-distill is published as SDPO ([arXiv:2601.20802](https://arxiv.org/abs/2601.20802) + [code](https://github.com/siyan-zhao/OPSD)); we lift it for v0.1. Channel (3) is genuinely novel and stacks on top. **They are TWO different mechanisms**, not competing implementations — see `docs/COMPOSER_RECIPE_MAPPING.md` for the precise distinction. |
|
| 17 |
| **Trace-replay novelty** | Genuinely under-explored. Closest precedent: rStar-Math (single-teacher MCTS counterfactuals). Multi-teacher *frozen-trace replay* is open territory | Worth publishing if it works |
|
| 18 |
| **Orchestration** | Monarch (when it matures) or Ray (today) for the actor mesh; **OpenEnv** for the env contract | Forge has been "development-paused" — borrow patterns, don't depend on it |
|
| 19 |
|
|
|
|
| 119 |
3. Get N candidate `action_t` distributions.
|
| 120 |
4. Use disagreement / agreement as a **per-step reward signal** for the student model.
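
As a rough illustration of step 4, the sketch below turns teacher agreement into a per-step reward and, where a strict majority of teachers disagrees with the student, into a (chosen, rejected) preference pair. The source does not specify the scoring rule, so everything here is an assumption: actions are compared as exact strings rather than as distributions, and both function names are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def agreement_reward(student_action: str, teacher_actions: Sequence[str]) -> float:
    """Fraction of teachers whose proposed action matches the student's."""
    if not teacher_actions:
        raise ValueError("need at least one teacher action")
    matches = sum(action == student_action for action in teacher_actions)
    return matches / len(teacher_actions)


def preference_pair(
    student_action: str, teacher_actions: Sequence[str]
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) when a strict teacher majority disagrees with the student."""
    if not teacher_actions:
        return None
    top_action, count = Counter(teacher_actions).most_common(1)[0]
    has_majority = count * 2 > len(teacher_actions)
    if not has_majority or top_action == student_action:
        return None
    return (top_action, student_action)


if __name__ == "__main__":
    teachers = ["edit_file", "edit_file", "run_tests"]
    print(agreement_reward("run_tests", teachers))
    print(preference_pair("run_tests", teachers))
```

Requiring a strict majority is one conservative choice: when the teachers split evenly there is no clear target, so no pair is emitted for that step.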
|
| 121 |
|
| 122 |
+
**This stacks beautifully with Composer's hint-distillation — but they are TWO DIFFERENT MECHANISMS, not competing implementations of the same idea.** Composer's hint-distill (= SDPO / OPSD, [arXiv:2601.20802](https://arxiv.org/abs/2601.20802) + [arXiv:2601.18734](https://arxiv.org/abs/2601.18734), code at [github.com/siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD)) uses **a single model** as both teacher and student, with the teacher just being "the model with a hint inserted into context." Trace-replay-distill uses **N external pretrained models** as teachers. Together:
|
| 123 |
|
| 124 |
+
- Composer's hint-loss = **same-model self-teacher with hint context** pulls student at error sites (~1 extra forward pass / cheap, no API)
|
| 125 |
+
- Trace-replay-loss = **N external teachers** pull student at all sites (or high-uncertainty sites with VOI gating; ~$0.30/trace with gating per spike 001)
|
| 126 |
|
| 127 |
+
These are **complementary**, not competing. Both give dense per-step signals that bypass the long-horizon credit assignment problem (a per-turn KL for hint-distill, per-step DPO/PRM for trace-replay), but they tap different supervision sources. v0.1 of the framework runs both simultaneously. See [`docs/COMPOSER_RECIPE_MAPPING.md`](docs/COMPOSER_RECIPE_MAPPING.md) for the precise mathematical distinction and the implementation-handle table.
|
| 128 |
|
| 129 |
**Cost mitigation** (the report does this analysis well):
|
| 130 |
- VOI gating (only query teachers when student entropy is high) → 60-80% savings
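
A minimal sketch of the entropy-based gate described in the VOI bullet, assuming the student's next-action distribution is available as a list of probabilities. The threshold value and function names are hypothetical; the source only says that teachers are queried when student entropy is high.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_query_teachers(student_probs: Sequence[float], threshold: float) -> bool:
    """Query the external teachers only when the student is uncertain."""
    return entropy(student_probs) > threshold


if __name__ == "__main__":
    THRESHOLD = 1.0  # hypothetical cut-off, to be tuned against the cost budget
    print(should_query_teachers([0.25, 0.25, 0.25, 0.25], THRESHOLD))  # uncertain step
    print(should_query_teachers([0.97, 0.01, 0.01, 0.01], THRESHOLD))  # confident step
```

The fraction of steps that fall below the threshold is the savings fraction, so the 60-80% figure corresponds to a gate that skips roughly that share of teacher queries.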
|
|
@@ -1,5 +1,15 @@
|
|
| 1 |
# Cursor Composer 2.5: Deep Research Report
|
| 2 |
|
| 3 |
## Overview
|
| 4 |
Cursor's Composer 2.5 is an advanced agentic coding model that powers the Cursor IDE. Released in mid-May 2026, it represents a massive leap in agentic capabilities, particularly for long-running, multi-file software engineering tasks. While the base weights are Moonshot AI's open-source **Kimi K2.5** model, roughly 85% of the total compute budget for Composer 2.5 was spent on Cursor's proprietary post-training and Reinforcement Learning (RL) pipeline.
|
| 5 |
|
|
|
|
| 1 |
# Cursor Composer 2.5: Deep Research Report
|
| 2 |
|
| 3 |
+
> **⚠️ Audit notice (added 2026-05-25, post-hoc):** This file is a snapshot of the parallel-research dispatch output (Gemini 3.1 Pro, ~10-min web-research subagent). It was **not** rigorously cross-checked against the Cursor blog at write-time. A rigorous audit and stage-by-stage mapping was added later at [`docs/COMPOSER_RECIPE_MAPPING.md`](../docs/COMPOSER_RECIPE_MAPPING.md). When this file and that file conflict, **trust the mapping document** — it was written after directly reading the Cursor blog with `tavily_extract`. This file is preserved unchanged for research provenance.
|
| 4 |
+
>
|
| 5 |
+
> Specific claims in this file that are **NOT** in the Cursor blog (and are extrapolated from secondary sources):
|
| 6 |
+
> - "85% of total compute is post-training" — community consensus, not Cursor-stated
|
| 7 |
+
> - "Anyrun" environment harness with LSP/file-I/O/terminal — likely from the Composer 2 report, not 2.5
|
| 8 |
+
> - "CursorBench 69.3%, Terminal-Bench 2.0 parity" — not in the 2.5 blog
|
| 9 |
+
> - "PPO or GRPO variant" — blog never names the RL algorithm
|
| 10 |
+
>
|
| 11 |
+
> The targeted-textual-feedback method is correctly described, but this file does **not** cite the three self-distillation papers Cursor cites in footnote 1 (OPSD `arXiv:2601.18734`, SDPO `arXiv:2601.20802`, Self-Distillation Continual Learning `arXiv:2601.19897`). The mapping document does.
|
| 12 |
+
|
| 13 |
## Overview
|
| 14 |
Cursor's Composer 2.5 is an advanced agentic coding model that powers the Cursor IDE. Released in mid-May 2026, it represents a massive leap in agentic capabilities, particularly for long-running, multi-file software engineering tasks. While the base weights are Moonshot AI's open-source **Kimi K2.5** model, roughly 85% of the total compute budget for Composer 2.5 was spent on Cursor's proprietary post-training and Reinforcement Learning (RL) pipeline.
|
| 15 |
|
|
@@ -23,13 +23,22 @@
|
|
| 23 |
|
| 24 |
## Out of scope for v0.0 (deferred to v0.1)
|
| 25 |
|
| 26 |
-
- Composer
|
| 27 |
-
- The Feature Deletion environment (use SWE-bench-lite as the env)
|
| 28 |
- DiLoCo / decentralized training (single-node FSDP2 is fine at 7B)
|
| 29 |
- Monarch / Forge (use Ray + verifiers, the PRIME-RL stack)
|
| 30 |
- MoE base (use dense Qwen3-7B; saner v0.0 target)
|
| 31 |
- VOI gating, tiered teachers (do the full N=3 query at every step in v0.0; cost mitigation is a v0.1 optimization)
|
| 32 |
|
| 33 |
## Budget
|
| 34 |
|
| 35 |
| Item | Estimate | Source |
|
|
|
|
| 23 |
|
| 24 |
## Out of scope for v0.0 (deferred to v0.1)
|
| 25 |
|
| 26 |
+
- **Composer hint-distill = SDPO/OPSD** (per-turn KL from a hint-conditioned forward pass). Cursor's secret sauce. **Code is published** at [github.com/siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD) (OPSD, [arXiv:2601.18734](https://arxiv.org/abs/2601.18734)); the SDPO paper is [arXiv:2601.20802](https://arxiv.org/abs/2601.20802). Lift the loss for v0.1; see `docs/COMPOSER_RECIPE_MAPPING.md` § "Implementation handles for v0.1" for the concrete plan.
|
| 27 |
+
- The Feature Deletion environment (use SWE-bench-lite as the env in v0.0)
|
| 28 |
- DiLoCo / decentralized training (single-node FSDP2 is fine at 7B)
|
| 29 |
- Monarch / Forge (use Ray + verifiers, the PRIME-RL stack)
|
| 30 |
- MoE base (use dense Qwen3-7B; saner v0.0 target)
|
| 31 |
- VOI gating, tiered teachers (do the full N=3 query at every step in v0.0; cost mitigation is a v0.1 optimization)
|
| 32 |
|
| 33 |
+
**Why deferring SDPO/hint-distill to v0.1 is the right call:**
|
| 34 |
+
|
| 35 |
+
1. The novel claim is trace-replay (channel 3). The Composer recipe is already published; SDPO is already published with code. Validating SDPO at 7B is engineering, not novel research.
|
| 36 |
+
2. The hint-generator open question (templates vs. LLM-driven hints) is unresolved. v0.0 with hardcoded tool-call templates only validates the easy case.
|
| 37 |
+
3. Spike 001's economic verdict gates only the trace-replay channel. SDPO has no per-step API cost.
|
| 38 |
+
4. A 4-arm A/B at 7B costs ~$600 vs. ~$300 for the 2-arm. Not worth it for v0.0.
|
| 39 |
+
|
| 40 |
+
v0.1 will run a 3-arm A/B: **RLVR** vs. **RLVR + SDPO** vs. **RLVR + SDPO + trace-replay-DPO** at 32B once we know v0.0's trace-replay verdict.
|
| 41 |
+
|
| 42 |
## Budget
|
| 43 |
|
| 44 |
| Item | Estimate | Source |
|