baladithyab committed on
Commit
1cede23
·
1 Parent(s): 35581fd

Integrate Cursor blog directly + audit research note + add SDPO/OPSD link


User flagged a real gap: I had dispatched a subagent to research Composer 2.5
but never personally read the Cursor blog. Audit findings:

1. The targeted-textual-feedback method = published SDPO (Hübotter et al.,
arXiv:2601.20802, ICLR 2026 Workshop) + OPSD (Zhao et al., arXiv:2601.18734,
code at github.com/siyan-zhao/OPSD, MIT). Cursor cites these directly in
footnote 1 of the blog. The subagent's research note never mentioned them.
This is huge: there's published, MIT-licensed code for Composer's secret sauce.

2. Several claims in research/01-composer-2.5.md are NOT in the Cursor blog
and were extrapolated from secondary sources ("85% post-training compute",
"Anyrun" environment name, specific benchmark scores, "PPO or GRPO variant").
Likely correct via community consensus but not blog-verified.

3. Composer hint-distill (single model, hint-conditioned context) and
trace-replay-distill (N external teachers) are TWO genuinely different
mechanisms, not competing implementations. The original synthesis blurred
them. This commit makes the distinction precise.

Changes:
- NEW docs/COMPOSER_RECIPE_MAPPING.md (16KB) — rigorous stage-by-stage
mapping of Cursor blog onto our spike plan, with [BLOG-VERIFIED] /
[INFERRED] / [EXTRAPOLATED] tags on every claim. Includes:
- The 3 footnote papers cited as primary sources
- Composer hint-distill vs trace-replay-distill comparison table (8 dims)
- Composer recipe -> v0.0/v0.1/v0.2 mapping table
- Why deferring SDPO to v0.1 is the right call
- Concrete v0.1 implementation handles (lift SDPO from siyan-zhao/OPSD)

- research/01-composer-2.5.md: prepended an audit notice flagging which
claims are NOT in the Cursor blog and pointing readers to the mapping doc.
Body preserved unchanged for provenance.

- framework/composer-replication-framework.md: TL;DR row 'reward signal' now
names SDPO/OPSD with arxiv + GitHub links. Trace-replay section now makes
the two-channel distinction crystal clear with cost/cardinality contrast.

- spikes/README.md: 'out of scope for v0.0' section now explains v0.1 SDPO
channel explicitly, with the 4-point rationale for deferring it. Adds the
v0.1 3-arm A/B (RLVR vs RLVR+SDPO vs RLVR+SDPO+trace-replay).

- README.md: headline finding 1 rewritten to identify the published SDPO
paper as the prior art for Composer's secret sauce. Headline 2 now cites
spike 001's empirical economic verdict and the two-channel distinction.

README.md CHANGED
@@ -54,24 +54,24 @@ Each of the five research deep-dives was authored by a **different LLM family**
54
 
55
  ## Headline findings
56
 
57
- ### 1. Composer 2.5's secret sauce is the *targeted hint-distillation loss*
58
 
59
- The 1T MoE base (Kimi K2.5) and the "Feature Deletion" RL environment are the obvious moves. The non-obvious one — and the one Cursor never explains in detail is:
60
 
61
- > **Targeted RL with Textual Feedback (on-policy distillation):** when a 100K-token rollout has a localized error, generate a text hint correcting the error, run forward pass *with* the hint to get "Teacher" logits, run forward pass *without* the hint to get "Student" logits, and apply KL divergence loss to pull Student toward Teacher *only at that turn*. Sidesteps the credit-assignment nightmare of long-horizon scalar rewards.
62
 
63
- This is the fix for "GRPO on agentic traces is brittle because one bad step poisons 100 good ones." The biggest reproducibility gap is **how the text hints are generated** — Cursor never tells. Templates? Smaller model? Same model with introspection prompt? Open question.
64
 
65
- ### 2. The trace-replay multi-teacher idea is genuinely novel
66
 
67
- Closest precedent is rStar-Math (single-teacher MCTS counterfactuals at training time). **Multi-teacher *frozen-trace replay* with disagreement-as-reward is open territory.** Cost analysis works out: with VOI gating + tiered teachers, you get **~$3/trace** instead of **~$64/trace** at the 1000-step / 8-teacher baseline.
68
 
69
- The two distillation channels stack cleanly:
70
 
71
- - **Composer hint-distill** = teacher-self pulls student at error sites (per-turn KL)
72
- - **Trace-replay-distill** = N external teachers pull student at all sites (per-step DPO / PRM)
73
 
74
- Both bypass long-horizon credit assignment.
75
 
76
  ### 3. Recommended stack (verified across all 5 reports)
77
 
 
54
 
55
  ## Headline findings
56
 
57
+ ### 1. Composer 2.5's secret sauce is the *targeted hint-distillation loss* — and it's published as SDPO
58
 
59
+ The 1T MoE base (Kimi K2.5) and the "Feature Deletion" RL environment are the obvious moves. The non-obvious one, Cursor's "Targeted RL with Textual Feedback", turns out to be **mathematically the same as the published SDPO method** (Hübotter et al., ICLR 2026 Workshop, [arXiv:2601.20802](https://arxiv.org/abs/2601.20802)), which generalizes OPSD (Zhao et al., [arXiv:2601.18734](https://arxiv.org/abs/2601.18734), MIT-licensed [code at github.com/siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD)). Cursor cites both papers in the blog's footnote 1.
60
 
61
+ > **The mechanism:** when a 100K-token rollout has a localized error, generate a text hint correcting the error, run forward pass *with* the hint to get "Teacher" logits, run forward pass *without* the hint to get "Student" logits, and apply KL divergence loss to pull Student toward Teacher *only at that turn*. **Same model is both teacher and student** — the teacher is just "the model with hint inserted into context."
62
 
63
+ This sidesteps "GRPO on agentic traces is brittle because one bad step poisons 100 good ones." The biggest reproducibility gap is **how the text hints are generated** — Cursor never tells. v0.1 will use template-based hints first; v0.2 will explore LLM-driven hint generators. See [`docs/COMPOSER_RECIPE_MAPPING.md`](docs/COMPOSER_RECIPE_MAPPING.md) for the rigorous stage-by-stage mapping of Cursor's blog onto our framework.
64
 
65
+ ### 2. The trace-replay multi-teacher idea is genuinely novel — and **economically viable** (verified)
66
 
67
+ Closest precedent is rStar-Math (single-teacher MCTS counterfactuals at training time). **Multi-teacher *frozen-trace replay* with disagreement-as-reward is open territory.** **Spike 001 (✅ VALIDATED, 2026-05-25)** measured the per-trace cost floor empirically: $0.98/trace ungated (vs. $5 cap), 5x headroom; with VOI gating in v0.1 we project ~$0.30/trace.
68
 
69
+ **Critical distinction:** Composer's hint-distill (SDPO, single model with hint context) and trace-replay-distill (N external teachers) are **two different mechanisms**, not competing implementations. They stack:
70
 
71
+ - **Composer hint-distill (SDPO)** = same-model self-teacher with hint context, pulls student at error sites. ~1 extra forward pass. No API cost.
72
+ - **Trace-replay-distill** = N external pretrained teachers, pull student at all steps. ~$0.30/trace with VOI gating. Novel.
73
 
74
+ v0.1 runs both. v0.0 (current) tests trace-replay alone vs. plain GRPO to falsify the novel claim cheaply.
75
 
76
  ### 3. Recommended stack (verified across all 5 reports)
77
 
docs/COMPOSER_RECIPE_MAPPING.md ADDED
@@ -0,0 +1,155 @@
1
+ # Composer 2.5 Recipe → Replication Framework: Stage-by-Stage Mapping
2
+
3
+ > **Audit date:** 2026-05-25 (post-hoc, after the parallel research dispatch).
4
+ > **Methodology:** Read [Cursor's blog](https://cursor.com/blog/composer-2-5) directly (`mcp_tavily_tavily_extract` advanced mode), then audit `research/01-composer-2.5.md` and `framework/composer-replication-framework.md` against ground truth. Mark every claim as either **`[BLOG-VERIFIED]`** (in the blog), **`[INFERRED]`** (reasonable extrapolation from blog + base-model knowledge), or **`[EXTRAPOLATED]`** (subagent added it, likely correct but not in the blog).
5
+
6
+ This document is the rigorous bridge between Cursor's published recipe and our replication framework. It exists because the initial parallel-research dispatch produced a synthesis that quoted Composer 2.5 at a *high* level but did not rigorously map each Composer stage onto the spike plan.
7
+
8
+ ## Composer 2.5's published recipe (3 components, blog-verified)
9
+
10
+ The Cursor blog explicitly discusses **only three** training innovations. Most other claims in the research note were inferred or extrapolated by the subagent. I list the three first, then flag the extrapolations.
11
+
12
+ ### 1. **Targeted RL with Textual Feedback** `[BLOG-VERIFIED]`
13
+
14
+ > *"For a target model message, we construct a short hint describing the desired improvement, insert that hint into the local context, and use the resulting model distribution as a teacher. We use the policy with the original context as the student and add an on-policy distillation KL loss that moves the student's token probabilities toward the teacher's."* — Cursor blog
15
+
16
+ **Mechanism, exactly:**
17
+ - **Same model** acts as both teacher and student. Not two separate models.
18
+ - The teacher is "the policy at this turn, *with* a hint inserted into the context."
19
+ - The student is "the policy at this turn, *without* the hint" (the original context).
20
+ - Loss = on-policy KL divergence between next-token distributions: `KL( p_teacher(· | context_t + hint) || p_student(· | context_t) )`, applied **only at the problematic turn**, not over the full trajectory.
21
+ - Sits **on top of** an outer RLVR (verifiable-reward RL) objective; doesn't replace it.
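The per-turn loss described in the bullets above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Cursor's code and not the `siyan-zhao/OPSD` implementation: the names `softmax`, `kl_divergence`, `hint_distill_loss` and the boolean `error_turn_mask` are hypothetical, and the KL direction simply follows the bullet list. A real trainer would compute the same quantity on tensors, typically treating the hint-conditioned pass as a fixed target.

```python
import math

def softmax(logits):
    """Convert one position's logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def hint_distill_loss(teacher_logits, student_logits, error_turn_mask):
    """Mean per-token KL(teacher || student), counted only on flagged positions.

    teacher_logits[i]: logits at position i from the model WITH the hint in context.
    student_logits[i]: logits at position i from the same model WITHOUT the hint.
    error_turn_mask[i]: True if position i belongs to the problematic turn
                        (hypothetical input; how turns get flagged is left open).
    """
    total, count = 0.0, 0
    for t_log, s_log, flagged in zip(teacher_logits, student_logits, error_turn_mask):
        if not flagged:
            continue  # tokens outside the error turn contribute nothing
        total += kl_divergence(softmax(t_log), softmax(s_log))
        count += 1
    return total / count if count else 0.0
```

Because the mask zeroes out every position outside the flagged turn, the loss adds signal only where the hint applies and leaves the outer RLVR objective to handle the rest of the trajectory.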
22
+
23
+ **Cited prior art** (Cursor's footnote 1):
24
+ - **OPSD: Self-Distilled Reasoner — On-Policy Self-Distillation for LLMs** (Zhao et al., 2026, [arXiv:2601.18734](https://arxiv.org/abs/2601.18734), [GitHub: siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD)). The original on-policy-self-distillation framework: single LLM, teacher conditioned on privileged information (e.g. ground-truth answer), student sees only the question, loss = per-token KL on student's own rollouts.
25
+ - **SDPO: Reinforcement Learning via Self-Distillation** (Hübotter et al., 2026, [arXiv:2601.20802](https://arxiv.org/abs/2601.20802), ICLR 2026 Scaling Post-training Workshop). Generalizes OPSD to RL with rich feedback: *"SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy."* This is **mathematically the same** as Composer's targeted-textual-feedback method. **There is published code.** Comparison table from the SDPO paper:
26
+
27
+ | Method | Sampling | Signal | Feedback |
28
+ |---|---|---|---|
29
+ | SFT / Distillation (Hinton 2015) | off-policy | rich | strong teacher |
30
+ | On-Policy Distillation (Agarwal 2024) | on-policy | rich | strong teacher |
31
+ | RLVR / GRPO (Lambert 2025) | on-policy | weak | environment |
32
+ | **SDPO (this paper / Composer)** | **on-policy** | **rich** | **environment** |
33
+
34
+ - **Self-Distillation Enables Continual Learning** ([arXiv:2601.19897](https://arxiv.org/abs/2601.19897)).
35
+
36
+ **Key reproducibility gap (still unsolved):** *How are the hints generated?* The blog gives one example template ("Reminder: Available tools are…") but doesn't say whether hints come from hardcoded templates, a separate model (Opus?), the same model with an introspection prompt, or a learned hint generator. **This is the single most important open question for replication.**
37
+
38
+ ### 2. **Synthetic data at 25× scale** `[BLOG-VERIFIED]`
39
+
40
+ > *"Composer 2.5 is trained with 25x more synthetic tasks than Composer 2."* — Cursor blog
41
+
42
+ - **Feature Deletion** is one named approach: take a repo with passing tests, delete some code, ask the agent to reimplement to pass tests. Tests = verifiable reward.
43
+ - The blog explicitly mentions reward-hacking failures: model decompiled Java bytecode, reverse-engineered Python type-checking caches, to recover deleted APIs. This is a **real risk**, not theoretical.
44
+ - "Agentic monitoring tools" are mentioned as the mitigation, but no specifics.
45
+
46
+ ### 3. **Sharded Muon + dual mesh HSDP** `[BLOG-VERIFIED]`
47
+
48
+ > *"For continued pretraining, we use Muon with distributed orthogonalization. After forming the momentum update, we run Newton-Schulz at the model's natural granularity: per attention head for attention projections, and per expert for stacked MoE weights."* — Cursor blog
49
+
50
+ - Two HSDP layouts: narrow (intra-node) for non-expert weights, wide for expert weights.
51
+ - Blackwell-optimized. CP=2 + EP=8 on 8 GPUs (instead of 16 in shared mesh).
52
+ - Optimizer step time on 1T model: **0.2 s**.
53
+
54
+ This is **infrastructure, not algorithm**. It only matters at MoE-1T scale; for our v0.0 (Qwen3-7B dense) and v0.1 (Qwen3-32B dense) it's irrelevant. Becomes relevant if we ever train a Kimi-K2.5-derivative directly.
55
+
56
+ ## What the subagent added beyond the blog (`[EXTRAPOLATED]` and `[INFERRED]`)
57
+
58
+ `research/01-composer-2.5.md` introduced these claims that are **NOT in the Cursor blog**. Most are likely correct from secondary sources, but they are not blog-verified.
59
+
60
+ | Claim | Source basis | Verdict |
61
+ |---|---|---|
62
+ | "85% of total compute is post-training" | `[EXTRAPOLATED]` — likely from secondary commentary (Jake Handy substack, HN thread cited by subagent). | **Plausible but unverified.** Cursor doesn't publish the ratio. Treat as community consensus, not Cursor-stated. |
63
+ | Anyrun environment harness with LSP/file-I/O/terminal | `[EXTRAPOLATED]` — name "Anyrun" doesn't appear in the 2.5 blog (may be in the Composer-2 technical report). | **Plausible** — Cursor 2.5 does say "asynchronous, sandboxed real-world coding environments" which is consistent. But "Anyrun" as a brand name isn't sourced from the 2.5 post. |
64
+ | MLA + 1T/32B active + 384 experts + 256K ctx | `[INFERRED]` from Kimi K2.5 base-model knowledge. The blog only says "built on Kimi K2.5". | **Verified independently** via [Moonshot's K2.5 model card](https://huggingface.co/moonshotai). Correct. |
65
+ | CursorBench 69.3%, Terminal-Bench 2.0 parity, SWE-bench Multilingual | `[EXTRAPOLATED]` — blog doesn't quote benchmarks. | **Source unclear.** Probably from Cursor's launch comms / Twitter thread / a different blog post. Don't cite as 2.5-blog-verified. |
66
+ | "PPO or GRPO variant" | `[EXTRAPOLATED]` — blog never names the RL algorithm. | **Educated guess.** Composer 2 technical report likely says; the 2.5 blog does not. The cited SDPO paper sits *on top of* an unspecified RLVR algorithm, so this is still open. |
67
+ | "Continued pretraining on heavily code-weighted data" | `[BLOG-VERIFIED]` — blog says exactly this in the Sharded Muon section ("For continued pretraining…"). | Verified. |
68
+ | "Behavioral aspects: communication style, effort calibration" | `[BLOG-VERIFIED]` — blog mentions improving these and notes existing benchmarks don't capture them. | Verified, but blog doesn't say *how* they're trained. The targeted-textual-feedback method is presumably also used here. |
69
+
70
+ ## Mapping each blog component → our replication framework
71
+
72
+ | Composer 2.5 stage | Blog mechanism | Our replication target | v0.0 | v0.1 | v0.2 |
73
+ |---|---|---|---|---|---|
74
+ | **(a)** Continued pretraining on code | Standard pretraining, code-weighted | Skip — start from already-code-tuned `Qwen3-Coder-7B` or `Qwen3-Coder-30B-A3B` | ✗ | ✗ | ✗ |
75
+ | **(b)** Synthetic data at scale | Feature Deletion plus other (unnamed) generators; 25× more synthetic tasks than Composer 2 | Build 1 generator (Feature Deletion) as OpenEnv-compatible env. Use SWE-bench-lite and SWE-Gym as drop-in alternatives. | ✗ (use SWE-bench-lite only) | ✓ (build Feature Deletion) | scale generator suite |
76
+ | **(c)** Realistic-environment RL (RLVR) | Async sandboxes, same tool harness as production | TRL `GRPOTrainer` + verifiers + OpenEnv; SWE-bench-lite env in v0.0; build sandboxed code execution env in v0.1 | ✓ baseline | ✓ + DAPO patches | + decentralized rollouts |
77
+ | **(d)** Targeted RL w/ textual feedback (Composer's secret sauce) | Same-model self-distill: insert hint into context → teacher; original → student; on-policy KL at the turn | **Lift the OPSD/SDPO loss directly from `siyan-zhao/OPSD`** (published code, MIT). Generate hints via templates (v0.1) or LLM (v0.2). | ✗ (deferred) | ✓ (this is the Composer-recipe channel) | + learned hint generator |
78
+ | **(e)** Trace-replay multi-teacher distill (NOVEL — our addition) | N/A (not in Composer) | N=3 teachers (Opus 4.7, GPT-5, DeepSeek V4 Pro) replay each step; disagreement → DPO pairs | ✓ (this is the v0.0 novelty bet) | ✓ + VOI gating | + tiered teachers |
79
+ | **(f)** Sharded Muon / dual-mesh HSDP | MoE optimizer infra | Skip until we go to MoE bases — irrelevant for dense Qwen3-{7,32}B | ✗ | ✗ | ✗ (only if MoE base) |
80
+ | **(g)** Reward-hacking safeguards | "Agentic monitoring tools" — unspecified | Static analysis + bytecode-cache-deletion + a sandboxed shell with no `find` / `strings` / `unzip` access in the env | ✗ (small surface) | ✓ (build the monitor) | + RM-based penalty |
81
+
82
+ ## Critical relationship: Composer hint-distill vs. trace-replay-distill
83
+
84
+ These are **two different mechanisms**, not competing implementations of the same idea. Initial framework synthesis blurred them; this section makes the distinction precise.
85
+
86
+ | Property | Composer hint-distill (= SDPO/OPSD) | Trace-replay multi-teacher (NOVEL — ours) |
87
+ |---|---|---|
88
+ | Number of models | **1** (same model is teacher + student) | **N+1** (frozen N teachers + 1 trainable student) |
89
+ | What "teacher" means | Student-with-hint-in-context | External pretrained models from other labs |
90
+ | Per-step cost | ~1 extra forward pass (cheap) | N teacher API calls (~$0.02/step at N=3 per spike 001) |
91
+ | Privileged information | Hint text in context | None — teachers see same state student sees |
92
+ | Source of hint / privileged info | **Open question.** Templates? LLM judge? | Not applicable |
93
+ | Relationship to RLVR | Adds dense per-turn signal *on top of* RLVR scalar reward | Same — adds dense per-step signal on top of RLVR |
94
+ | Bypasses long-horizon credit assignment? | Yes (per-turn KL) | Yes (per-step DPO/PRM) |
95
+ | Published code? | **Yes — `siyan-zhao/OPSD` (MIT)** | Not yet — we're building it |
96
+ | Novel in the framework? | No — this is Composer's published recipe | **Yes — the v0.0 research bet** |
97
+
98
+ **Both channels stack on the same RLVR base.** The full v0.1 trainer has THREE reward channels:
99
+
100
+ 1. **RLVR** (verifiable scalar reward — tests pass / build succeeds). Ground truth, never skipped.
101
+ 2. **Composer hint-distill** = SDPO loss (one extra forward pass per error site, hint-conditioned).
102
+ 3. **Trace-replay-distill** = DPO/PRM from N external teachers (~$0.30/trace with VOI gating, our novelty bet).
103
+
104
+ In v0.0 we test channel 3 in isolation against channel 1 (the spike 004 A/B). We deliberately defer channel 2 to v0.1 to keep the v0.0 experiment small.
105
+
106
+ ## Why deferring Composer hint-distill to v0.1 is the right call
107
+
108
+ I considered adding hint-distill to v0.0 to do a 4-arm A/B (RLVR / RLVR+SDPO / RLVR+trace-replay / RLVR+SDPO+trace-replay). Decided against it for v0.0 because:
109
+
110
+ 1. **The novel claim is trace-replay.** The Composer recipe is already published; SDPO is already published with code. Validating SDPO at 7B is engineering, not novel research.
111
+ 2. **The hint-generator open question is unresolved.** Without that, an SDPO arm is "SDPO with hardcoded tool-name templates" which is the easy case and doesn't validate the harder behavior cases (style, communication).
112
+ 3. **Spike 001's economic verdict only gates the trace-replay channel.** SDPO has no per-step API cost — it's just an extra forward pass on the same GPU. Different cost model.
113
+ 4. **A 4-arm A/B at 7B costs ~$600 vs. ~$300 for the 2-arm.** Not worth it for v0.0.
114
+
115
+ v0.1 will have the full 4-arm (or at least 3-arm: RLVR / RLVR+SDPO / RLVR+SDPO+trace-replay) at 32B once we know v0.0's trace-replay verdict.
116
+
117
+ ## Implementation handles for v0.1 (concrete starting points)
118
+
119
+ When we get to v0.1, the **Composer hint-distill channel** has a clear engineering path:
120
+
121
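+ 1. **Lift the self-distillation loss math from `siyan-zhao/OPSD`.** This is the MIT-licensed code for the OPSD paper; SDPO (ICLR 2026 workshop) generalizes the same loss to RL with rich feedback, which is the mechanism Cursor describes. The code targets HuggingFace transformers and should slot into TRL's GRPO or PRIME-RL with roughly 50 LoC of glue.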
122
+ 2. **Hint generator v1: hardcoded templates.** Pattern-match on tool-call errors:
123
+ - `"Tool not found: X"` → hint = `"Reminder: Available tools are: <list of valid tools>"`
124
+ - `"JSONDecodeError: ..."` → hint = `"Reminder: tool arguments must be valid JSON"`
125
+ - `"Type error in args"` → hint = `"Reminder: <tool-name> expects args matching schema: <schema>"`
126
+ This handles the "tool call error" case from Cursor's blog example. Style/communication is harder; defer to v0.2 with an LLM-based hint generator.
127
+ 3. **Apply only at error sites,** not every turn. Detect via:
128
+ - Failed tool calls (status != ok)
129
+ - Exception traces in tool output
130
+ - Optional: a lightweight judge model flagging "this turn was wasteful" (matches Cursor's "communication style" use case)
131
+ 4. **Loss = `α * GRPO_loss + β * SDPO_KL_at_error_turns + γ * trace_replay_DPO_loss`.** Ablate `(α, β, γ)`.
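Steps 2 and 4 above can be sketched together. This is an illustrative sketch, not a committed design: `HINT_TEMPLATES`, `make_hint`, `combined_loss`, the `ctx["valid_tools"]` key, and the default weights are all hypothetical, and the error strings are the ones listed in step 2.

```python
import re

# (pattern, renderer) pairs; the first matching pattern wins.
HINT_TEMPLATES = [
    (re.compile(r"Tool not found: (?P<tool>\S+)"),
     lambda m, ctx: "Reminder: Available tools are: " + ", ".join(ctx["valid_tools"])),
    (re.compile(r"JSONDecodeError"),
     lambda m, ctx: "Reminder: tool arguments must be valid JSON"),
]

def make_hint(tool_output, ctx):
    """Return a hint string for a recognized tool error, or None.

    None means the turn is not treated as an error site, so no
    hint-distill term is added for it.
    """
    for pattern, render in HINT_TEMPLATES:
        match = pattern.search(tool_output)
        if match:
            return render(match, ctx)
    return None

def combined_loss(grpo_loss, sdpo_kl, trace_replay_dpo,
                  alpha=1.0, beta=0.1, gamma=0.1):
    """Weighted sum of the three channels; the default weights are placeholders to ablate."""
    return alpha * grpo_loss + beta * sdpo_kl + gamma * trace_replay_dpo
```

Setting `beta=0` or `gamma=0` recovers the individual arms of the planned A/B, so the same trainer code can serve every arm of the ablation.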
132
+
133
+ ## Implementation handles for v0.2 (decentralized scale)
134
+
135
+ If v0.1 validates and we scale, here's what each Composer-stage maps to in a multi-cluster setting:
136
+
137
+ - **Continued pretraining:** Pretrained checkpoint already exists (Qwen3-32B); skip.
138
+ - **Synthetic data:** Generators run on CPU pool, producing OpenEnv tasks pushed to a shared queue. Embarrassingly parallel.
139
+ - **Realistic-env RL:** PRIME-RL's orchestrator/trainer/inference split, vLLM↔FSDP2 weight broadcast (SHARDCAST). v0.2 adds Streaming DiLoCo outer loop only when training spans clusters.
140
+ - **Targeted hint-distill:** Compute is local to each trainer — no decentralization complication.
141
+ - **Trace-replay-distill:** Teacher API calls are independent — embarrassingly parallel across rollout workers. VOI gating becomes more important to control cost at scale.
142
+ - **Sharded Muon / dual-mesh HSDP:** Only if we adopt MoE base. For dense 32B, FSDP2 is fine.
143
+
144
+ ## Citations (updated)
145
+
146
+ Primary sources for each Composer-2.5 component, post-audit:
147
+
148
+ - **Cursor blog** — [Introducing Composer 2.5](https://cursor.com/blog/composer-2-5) (2026)
149
+ - **Cursor blog** — [Composer 2 technical report](https://cursor.com/blog/composer-2-technical-report) (predecessor; named the "Anyrun" environment per subagent — verify if needed)
150
+ - **OPSD paper** — Zhao et al., *Self-Distilled Reasoner: On-Policy Self-Distillation for LLMs*, [arXiv:2601.18734](https://arxiv.org/abs/2601.18734), code at [siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD). MIT.
151
+ - **SDPO paper** — Hübotter et al., *Reinforcement Learning via Self-Distillation*, [arXiv:2601.20802](https://arxiv.org/abs/2601.20802), ICLR 2026 Scaling Post-training Workshop. The direct formalization of Composer's hint-distill.
152
+ - **Self-Distillation continual-learning** — [arXiv:2601.19897](https://arxiv.org/abs/2601.19897). Cited by Cursor; less directly relevant.
153
+ - **Moonshot Kimi K2.5**: base model, [Moonshot AI on Hugging Face](https://huggingface.co/moonshotai).
154
+
155
+ The methodology mapping in this document supersedes vague claims in `research/01-composer-2.5.md` where the two conflict; that file is preserved unchanged for provenance (a snapshot of the parallel-research dispatch output) but should not be cited as ground truth on its own.
framework/composer-replication-framework.md CHANGED
@@ -13,7 +13,7 @@
13
  | **Training framework** | **PRIME-RL** (Prime Intellect) as substrate; **TRL** for algorithm correctness; borrow **VeRL's 3D-HybridEngine** patterns | PRIME-RL ships the orchestrator/trainer/inference split + decentralized story; TRL has cleanest GRPO+OpenEnv; VeRL's reshard logic is the production reference |
14
  | **Distributed sync** | **PRIME-RL's vLLM↔FSDP2 weight broadcast (SHARDCAST)** for v0.1; bolt on **Streaming DiLoCo** outer loop only when scaling beyond one cluster | DiLoCo isn't useful when training fits one node. Add it when going multi-DC. |
15
  | **Environments** | **OpenEnv + verifiers (Hub-hosted)** with Cursor-style "Anyrun" sandboxes | OpenEnv is the emerging standard; HF + Meta backing; MCP tool-calling RFC landing |
16
- | **Reward signal** | Three-channel: (1) RLVR (tests pass), (2) **Targeted hint distillation** (Composer's secret sauce), (3) **Trace-replay multi-teacher PRM** (your novel idea) | Composer proved (1)+(2) work; (3) is genuinely novel and stacks cleanly |
17
  | **Trace-replay novelty** | Genuinely under-explored. Closest precedent: rStar-Math (single-teacher MCTS counterfactuals). Multi-teacher *frozen-trace replay* is open territory | Worth publishing if it works |
18
  | **Orchestration** | Monarch (when it matures) or Ray (today) for the actor mesh; **OpenEnv** for the env contract | Forge has been "development-paused" — borrow patterns, don't depend on it |
19
 
@@ -119,12 +119,12 @@ From `05-trace-replay-distillation.md`:
119
  3. Get N candidate `action_t` distributions.
120
  4. Use disagreement / agreement as a **per-step reward signal** for the student model.
121
 
122
- **This stacks beautifully with Composer's hint-distillation.** Composer's hint-distill is "when student errs, generate hint, pull student toward hint-conditioned-self." Trace-replay-distill is "at every step, pull student toward the consensus of N teachers." Together:
123
 
124
- - Composer's hint-loss = **teacher-self pulls student** at error sites
125
- - Trace-replay-loss = **N external teachers pull student** at all sites (or high-uncertainty sites with VOI gating)
126
 
127
- These are *complementary*, not competing. Both give per-step KL signals that bypass the long-horizon credit assignment problem.
128
 
129
  **Cost mitigation** (the report does this analysis well):
130
  - VOI gating (only query teachers when student entropy is high) → 60-80% savings
 
13
  | **Training framework** | **PRIME-RL** (Prime Intellect) as substrate; **TRL** for algorithm correctness; borrow **VeRL's 3D-HybridEngine** patterns | PRIME-RL ships the orchestrator/trainer/inference split + decentralized story; TRL has cleanest GRPO+OpenEnv; VeRL's reshard logic is the production reference |
14
  | **Distributed sync** | **PRIME-RL's vLLM↔FSDP2 weight broadcast (SHARDCAST)** for v0.1; bolt on **Streaming DiLoCo** outer loop only when scaling beyond one cluster | DiLoCo isn't useful when training fits one node. Add it when going multi-DC. |
15
  | **Environments** | **OpenEnv + verifiers (Hub-hosted)** with Cursor-style "Anyrun" sandboxes | OpenEnv is the emerging standard; HF + Meta backing; MCP tool-calling RFC landing |
16
+ | **Reward signal** | Three-channel: (1) RLVR (tests pass), (2) **Composer hint-distill = SDPO/OPSD** (single-model, hint-conditioned self-teacher), (3) **Trace-replay multi-teacher PRM** (your novel idea — N external teachers) | Composer's hint-distill is published as SDPO ([arXiv:2601.20802](https://arxiv.org/abs/2601.20802) + [code](https://github.com/siyan-zhao/OPSD)); we lift it for v0.1. Channel (3) is genuinely novel and stacks on top. **They are TWO different mechanisms**, not competing implementations — see `docs/COMPOSER_RECIPE_MAPPING.md` for the precise distinction. |
17
  | **Trace-replay novelty** | Genuinely under-explored. Closest precedent: rStar-Math (single-teacher MCTS counterfactuals). Multi-teacher *frozen-trace replay* is open territory | Worth publishing if it works |
18
  | **Orchestration** | Monarch (when it matures) or Ray (today) for the actor mesh; **OpenEnv** for the env contract | Forge has been "development-paused" — borrow patterns, don't depend on it |
19
 
 
119
  3. Get N candidate `action_t` distributions.
120
  4. Use disagreement / agreement as a **per-step reward signal** for the student model.
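One way step 4 could turn teacher disagreement into training data is sketched below. The majority-vote rule, the `min_agreement` threshold, and the function name `replay_step_to_pair` are assumptions made for illustration; the source only says that disagreement or agreement becomes a per-step signal. The output keys follow the common prompt/chosen/rejected convention for preference data.

```python
from collections import Counter

def replay_step_to_pair(state, student_action, teacher_actions, min_agreement=2):
    """Turn one replayed step into a DPO-style preference pair, or None.

    A pair is emitted only when at least `min_agreement` teachers agree on an
    action AND the student chose something else. Steps where the teachers
    split, or where the student already matches the consensus, yield no pair.
    """
    if not teacher_actions:
        return None
    consensus, votes = Counter(teacher_actions).most_common(1)[0]
    if votes < min_agreement or consensus == student_action:
        return None
    return {"prompt": state, "chosen": consensus, "rejected": student_action}
```

For example, with teachers proposing `["edit", "edit", "grep"]` and the student proposing `"grep"`, the step yields a pair preferring `"edit"`; if all three teachers disagree, the step is skipped.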
121
 
122
+ **This stacks beautifully with Composer's hint-distillation — but they are TWO DIFFERENT MECHANISMS, not competing implementations of the same idea.** Composer's hint-distill (= SDPO / OPSD, [arXiv:2601.20802](https://arxiv.org/abs/2601.20802) + [arXiv:2601.18734](https://arxiv.org/abs/2601.18734), code at [github.com/siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD)) uses **a single model** as both teacher and student, with the teacher just being "the model with a hint inserted into context." Trace-replay-distill uses **N external pretrained models** as teachers. Together:
123
 
124
+ - Composer's hint-loss = **same-model self-teacher with hint context** pulls student at error sites (~1 extra forward pass / cheap, no API)
125
+ - Trace-replay-loss = **N external teachers** pull student at all sites (or high-uncertainty sites with VOI gating; projected ~$0.30/trace with gating, against $0.98/trace measured ungated in spike 001)
126
 
127
+ These are **complementary**, not competing. Both give dense per-step signals (per-turn KL for hint-distill, per-step DPO/PRM for trace-replay) that bypass the long-horizon credit assignment problem, but they tap different supervision sources. v0.1 of the framework runs both simultaneously. See [`docs/COMPOSER_RECIPE_MAPPING.md`](docs/COMPOSER_RECIPE_MAPPING.md) for the comparison table and the v0.1 implementation handles.
128
 
129
  **Cost mitigation** (the report does this analysis well):
130
  - VOI gating (only query teachers when student entropy is high) → 60-80% savings
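The gating rule and its cost effect can be sketched as follows. The entropy threshold of 0.5 nats and the function names are hypothetical; the $0.98 ungated figure and the 60-80% savings range are taken from this commit's README and the bullet above. How the ~$0.30/trace projection was actually derived is not stated, so the arithmetic only shows that it falls inside that range.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of the student's next-action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_query_teachers(student_probs, threshold_nats=0.5):
    """Query the external teachers only when the student is uncertain."""
    return entropy(student_probs) > threshold_nats

def gated_cost_per_trace(ungated_cost, savings_fraction):
    """Per-trace cost after skipping a fraction of teacher queries."""
    return ungated_cost * (1.0 - savings_fraction)

low = gated_cost_per_trace(0.98, 0.80)   # about $0.20 at 80% savings
mid = gated_cost_per_trace(0.98, 0.70)   # about $0.29 at 70% savings
high = gated_cost_per_trace(0.98, 0.60)  # about $0.39 at 60% savings
```

A confident student distribution such as `[0.97, 0.01, 0.01, 0.01]` stays below the threshold and skips the teacher calls, while a uniform distribution over four actions exceeds it and triggers them.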
research/01-composer-2.5.md CHANGED
@@ -1,5 +1,15 @@
1
  # Cursor Composer 2.5: Deep Research Report
2
 
3
  ## Overview
4
  Cursor's Composer 2.5 is an advanced agentic coding model that powers the Cursor IDE. Released in mid-May 2026, it represents a massive leap in agentic capabilities, particularly for long-running, multi-file software engineering tasks. While the base weights are Moonshot AI's open-source **Kimi K2.5** model, roughly 85% of the total compute budget for Composer 2.5 was spent on Cursor's proprietary post-training and Reinforcement Learning (RL) pipeline.
5
 
 
1
  # Cursor Composer 2.5: Deep Research Report
2
 
3
+ > **⚠️ Audit notice (added 2026-05-25, post-hoc):** This file is a snapshot of the parallel-research dispatch output (Gemini 3.1 Pro, ~10-min web-research subagent). It was **not** rigorously cross-checked against the Cursor blog at write-time. A rigorous audit and stage-by-stage mapping was added later at [`docs/COMPOSER_RECIPE_MAPPING.md`](../docs/COMPOSER_RECIPE_MAPPING.md). When this file and that file conflict, **trust the mapping document** — it was written after directly reading the Cursor blog with `tavily_extract`. This file is preserved unchanged for research provenance.
4
+ >
5
+ > Specific claims in this file that are **NOT** in the Cursor blog (and are extrapolated from secondary sources):
6
+ > - "85% of total compute is post-training" — community consensus, not Cursor-stated
7
+ > - "Anyrun" environment harness with LSP/file-I/O/terminal — likely from the Composer 2 report, not 2.5
8
+ > - "CursorBench 69.3%, Terminal-Bench 2.0 parity" — not in the 2.5 blog
9
+ > - "PPO or GRPO variant" — blog never names the RL algorithm
10
+ >
11
+ > The targeted-textual-feedback method is correctly described, but this file does **not** cite the three self-distillation papers Cursor cites in footnote 1 (OPSD `arXiv:2601.18734`, SDPO `arXiv:2601.20802`, Self-Distillation Continual Learning `arXiv:2601.19897`). The mapping document does.
12
+
13
  ## Overview
14
  Cursor's Composer 2.5 is an advanced agentic coding model that powers the Cursor IDE. Released in mid-May 2026, it represents a massive leap in agentic capabilities, particularly for long-running, multi-file software engineering tasks. While the base weights are Moonshot AI's open-source **Kimi K2.5** model, roughly 85% of the total compute budget for Composer 2.5 was spent on Cursor's proprietary post-training and Reinforcement Learning (RL) pipeline.
15
 
spikes/README.md CHANGED
@@ -23,13 +23,22 @@
23
 
24
  ## Out of scope for v0.0 (deferred to v0.1)
25
 
26
- - Composer's hint-distillation loss (the per-turn KL from a hint-conditioned forward pass)
27
- - The Feature Deletion environment (use SWE-bench-lite as the env)
28
  - DiLoCo / decentralized training (single-node FSDP2 is fine at 7B)
29
  - Monarch / Forge (use Ray + verifiers, the PRIME-RL stack)
30
  - MoE base (use dense Qwen3-7B; saner v0.0 target)
31
  - VOI gating, tiered teachers (do the full N=3 query at every step in v0.0; cost mitigation is a v0.1 optimization)
32
 
33
  ## Budget
34
 
35
  | Item | Estimate | Source |
 
23
 
24
  ## Out of scope for v0.0 (deferred to v0.1)
25
 
26
+ - **Composer hint-distill = SDPO/OPSD** (per-turn KL from a hint-conditioned forward pass). Cursor's secret sauce. **Code is published** at [github.com/siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD); paper [arXiv:2601.20802](https://arxiv.org/abs/2601.20802). Lift the loss for v0.1 — see `docs/COMPOSER_RECIPE_MAPPING.md` § "Implementation handles for v0.1" for the concrete plan.
27
+ - The Feature Deletion environment (use SWE-bench-lite as the env in v0.0)
28
  - DiLoCo / decentralized training (single-node FSDP2 is fine at 7B)
29
  - Monarch / Forge (use Ray + verifiers, the PRIME-RL stack)
30
  - MoE base (use dense Qwen3-7B; saner v0.0 target)
31
  - VOI gating, tiered teachers (do the full N=3 query at every step in v0.0; cost mitigation is a v0.1 optimization)
32
 
33
+ **Why deferring SDPO/hint-distill to v0.1 is the right call:**
34
+
35
+ 1. The novel claim is trace-replay (channel 3). The Composer recipe is already published; SDPO is already published with code. Validating SDPO at 7B is engineering, not novel research.
36
+ 2. The hint-generator open question (templates vs. LLM-driven hints) is unresolved. v0.0 with hardcoded tool-call templates only validates the easy case.
37
+ 3. Spike 001's economic verdict gates only the trace-replay channel. SDPO has no per-step API cost.
38
+ 4. A 4-arm A/B at 7B costs ~$600 vs. ~$300 for the 2-arm. Not worth it for v0.0.
39
+
40
+ v0.1 will run a 3-arm A/B: **RLVR** vs. **RLVR + SDPO** vs. **RLVR + SDPO + trace-replay-DPO** at 32B once we know v0.0's trace-replay verdict.
41
+
42
  ## Budget
43
 
44
  | Item | Estimate | Source |