Integrate Cursor blog directly + audit research note + add SDPO/OPSD link

User flagged a real gap: I had dispatched a subagent to research Composer 2.5
but never personally read the Cursor blog. Audit findings:

1. The targeted-textual-feedback method = published SDPO (Hübotter et al.,
arXiv:2601.20802, ICLR 2026 Workshop) + OPSD (Zhao et al., arXiv:2601.18734,
code at github.com/siyan-zhao/OPSD, MIT). Cursor cites these directly in
footnote 1 of the blog. The subagent's research note never mentioned them.
This is huge: there's published, MIT-licensed code for Composer's secret sauce.

2. Several claims in research/01-composer-2.5.md are NOT in the Cursor blog
and were extrapolated from secondary sources ("85% post-training compute",
"Anyrun" environment name, specific benchmark scores, "PPO or GRPO variant").
Likely correct via community consensus but not blog-verified.

3. Composer hint-distill (single model, hint-conditioned context) and
trace-replay-distill (N external teachers) are TWO genuinely different
mechanisms, not competing implementations. The original synthesis blurred
them. This commit makes the distinction precise.

Changes:
- NEW docs/COMPOSER_RECIPE_MAPPING.md (16KB) — rigorous stage-by-stage
mapping of Cursor blog onto our spike plan, with [BLOG-VERIFIED] /
[INFERRED] / [EXTRAPOLATED] tags on every claim. Includes:
- The 3 footnote papers cited as primary sources
- Composer hint-distill vs trace-replay-distill comparison table (9 dims)
- Composer recipe -> v0.0/v0.1/v0.2 mapping table
- Why deferring SDPO to v0.1 is the right call
- Concrete v0.1 implementation handles (lift SDPO from siyan-zhao/OPSD)

- research/01-composer-2.5.md: prepended an audit notice flagging which
claims are NOT in the Cursor blog and pointing readers to the mapping doc.
Body preserved unchanged for provenance.

- framework/composer-replication-framework.md: TL;DR row 'reward signal' now
names SDPO/OPSD with arxiv + GitHub links. Trace-replay section now makes
the two-channel distinction crystal clear with cost/cardinality contrast.

- spikes/README.md: 'out of scope for v0.0' section now explains v0.1 SDPO
channel explicitly, with the 4-point rationale for deferring it. Adds the
v0.1 3-arm A/B (RLVR vs RLVR+SDPO vs RLVR+SDPO+trace-replay).

- README.md: headline finding 1 rewritten to identify the published SDPO
paper as the prior art for Composer's secret sauce. Headline 2 now cites
spike 001's empirical economic verdict and the two-channel distinction.

Files changed (5)

README.md +10 -10
docs/COMPOSER_RECIPE_MAPPING.md +155 -0
framework/composer-replication-framework.md +5 -5
research/01-composer-2.5.md +10 -0
spikes/README.md +11 -2

README.md CHANGED

@@ -54,24 +54,24 @@ Each of the five research deep-dives was authored by a **different LLM family**
 ## Headline findings
-### 1. Composer 2.5's secret sauce is the *targeted hint-distillation loss*
-The 1T MoE base (Kimi K2.5) and the "Feature Deletion" RL environment are the obvious moves. The non-obvious one — and the one Cursor never explains in detail — is:
-> **Targeted RL with Textual Feedback (on-policy distillation):** when a 100K-token rollout has a localized error, generate a text hint correcting the error, run forward pass *with* the hint to get "Teacher" logits, run forward pass *without* the hint to get "Student" logits, and apply KL divergence loss to pull Student toward Teacher *only at that turn*. Sidesteps the credit-assignment nightmare of long-horizon scalar rewards.
-This is the fix for "GRPO on agentic traces is brittle because one bad step poisons 100 good ones." The biggest reproducibility gap is **how the text hints are generated** — Cursor never tells. Templates? Smaller model? Same model with introspection prompt? Open question.
-### 2. The trace-replay multi-teacher idea is genuinely novel
-Closest precedent is rStar-Math (single-teacher MCTS counterfactuals at training time). **Multi-teacher *frozen-trace replay* with disagreement-as-reward is open territory.** Cost analysis works out: with VOI gating + tiered teachers, you get **~$3/trace** instead of **~$64/trace** at the 1000-step / 8-teacher baseline.
-The two distillation channels stack cleanly:
-- **Composer hint-distill** = teacher-self pulls student at error sites (per-turn KL)
-- **Trace-replay-distill** = N external teachers pull student at all sites (per-step DPO / PRM)
-Both bypass long-horizon credit assignment.
 ### 3. Recommended stack (verified across all 5 reports)

 ## Headline findings
+### 1. Composer 2.5's secret sauce is the *targeted hint-distillation loss* — and it's published as SDPO
+The 1T MoE base (Kimi K2.5) and the "Feature Deletion" RL environment are the obvious moves. The non-obvious one, Cursor's "Targeted RL with Textual Feedback", turns out to be **mathematically the same as the published SDPO method** (Hübotter et al., ICLR 2026 Workshop, [arXiv:2601.20802](https://arxiv.org/abs/2601.20802)), which generalizes OPSD (Zhao et al., [arXiv:2601.18734](https://arxiv.org/abs/2601.18734), [code at github.com/siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD), MIT licensed). Cursor cites both papers in the blog's footnote 1.
+> **The mechanism:** when a 100K-token rollout has a localized error, generate a text hint correcting the error, run forward pass *with* the hint to get "Teacher" logits, run forward pass *without* the hint to get "Student" logits, and apply KL divergence loss to pull Student toward Teacher *only at that turn*. **Same model is both teacher and student** — the teacher is just "the model with hint inserted into context."
+This sidesteps the failure mode where GRPO on agentic traces is brittle because one bad step poisons 100 good ones. The biggest reproducibility gap is **how the text hints are generated**, which Cursor never specifies. v0.1 will use template-based hints first; v0.2 will explore LLM-driven hint generators. See [`docs/COMPOSER_RECIPE_MAPPING.md`](docs/COMPOSER_RECIPE_MAPPING.md) for the rigorous stage-by-stage mapping of Cursor's blog onto our framework.
+### 2. The trace-replay multi-teacher idea is genuinely novel — and **economically viable** (verified)
+Closest precedent is rStar-Math (single-teacher MCTS counterfactuals at training time). **Multi-teacher *frozen-trace replay* with disagreement-as-reward is open territory.** **Spike 001 (✅ VALIDATED, 2026-05-25)** measured the per-trace cost floor empirically: $0.98/trace ungated (vs. $5 cap), 5x headroom; with VOI gating in v0.1 we project ~$0.30/trace.
+**Critical distinction:** Composer's hint-distill (SDPO, single model with hint context) and trace-replay-distill (N external teachers) are **two different mechanisms**, not competing implementations. They stack:
+- **Composer hint-distill (SDPO)** = same-model self-teacher with hint context, pulls student at error sites. ~1 extra forward pass. No API cost.
+- **Trace-replay-distill** = N external pretrained teachers, pull student at all steps. ~$0.30/trace with VOI gating. Novel.
+v0.1 runs both. v0.0 (current) tests trace-replay alone vs. plain GRPO to falsify the novel claim cheaply.
 ### 3. Recommended stack (verified across all 5 reports)

docs/COMPOSER_RECIPE_MAPPING.md ADDED

	@@ -0,0 +1,155 @@

+# Composer 2.5 Recipe → Replication Framework: Stage-by-Stage Mapping
+> **Audit date:** 2026-05-25 (post-hoc, after the parallel research dispatch).
+> **Methodology:** Read [Cursor's blog](https://cursor.com/blog/composer-2-5) directly (`mcp_tavily_tavily_extract` advanced mode), then audit `research/01-composer-2.5.md` and `framework/composer-replication-framework.md` against ground truth. Mark every claim as either **`[BLOG-VERIFIED]`** (in the blog), **`[INFERRED]`** (reasonable extrapolation from blog + base-model knowledge), or **`[EXTRAPOLATED]`** (subagent added it, likely correct but not in the blog).
+This document is the rigorous bridge between Cursor's published recipe and our replication framework. It exists because the initial parallel-research dispatch produced a synthesis that quoted Composer 2.5 at a *high* level but did not rigorously map each Composer stage onto the spike plan.
+## Composer 2.5's published recipe (3 components, blog-verified)
+The Cursor blog discusses **only three** training innovations explicitly. Everything else was extrapolated by the subagent. I list the three first, then flag the extrapolations.
+### 1. **Targeted RL with Textual Feedback** `[BLOG-VERIFIED]`
+> *"For a target model message, we construct a short hint describing the desired improvement, insert that hint into the local context, and use the resulting model distribution as a teacher. We use the policy with the original context as the student and add an on-policy distillation KL loss that moves the student's token probabilities toward the teacher's."* — Cursor blog
+**Mechanism, exactly:**
+- **Same model** acts as both teacher and student. Not two separate models.
+- The teacher is "the policy at this turn, *with* a hint inserted into the context."
+- The student is "the policy at this turn, *without* the hint" (the original context).
+- Loss = on-policy KL divergence: `KL( teacher_logits_at_turn_t || student_logits_at_turn_t )`, applied **only at the problematic turn**, not over the full trajectory.
+- Sits **on top of** an outer RLVR (verifiable-reward RL) objective; doesn't replace it.
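The mechanism bullets above can be sketched as a masked per-token KL. This is a minimal pure-Python illustration, not Cursor's or the SDPO authors' implementation: the function names (`token_kl`, `hint_distill_loss`), the list-of-lists logit layout, and the 0/1 `turn_mask` marking the problematic turn are all assumptions made for the example. In a real trainer the teacher logits would come from a no-gradient forward pass with the hint in context.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one token's vocabulary logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_kl(teacher_logits, student_logits):
    # KL(teacher || student) at a single token position.
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))

def hint_distill_loss(teacher_logits, student_logits, turn_mask):
    # Mean KL over positions inside the problematic turn only
    # (turn_mask[i] == 1); every other position contributes nothing.
    total, count = 0.0, 0
    for t_row, s_row, keep in zip(teacher_logits, student_logits, turn_mask):
        if keep:
            total += token_kl(t_row, s_row)
            count += 1
    return total / count if count else 0.0

# Hypothetical 3-token rollout with a 2-word vocabulary; only token 1
# belongs to the turn being corrected.
teacher = [[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]]   # with hint in context
student = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]]   # original context
loss = hint_distill_loss(teacher, student, turn_mask=[0, 1, 0])
```

Because the mask zeroes out everything outside the flagged turn, the disagreement at token 2 does not affect the loss, which is the "only at the problematic turn" property described above.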
+**Cited prior art** (Cursor's footnote 1):
+- **OPSD: Self-Distilled Reasoner — On-Policy Self-Distillation for LLMs** (Zhao et al., 2026, [arXiv:2601.18734](https://arxiv.org/abs/2601.18734), [GitHub: siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD)). The original on-policy-self-distillation framework: single LLM, teacher conditioned on privileged information (e.g. ground-truth answer), student sees only the question, loss = per-token KL on student's own rollouts.
+- **SDPO: Reinforcement Learning via Self-Distillation** (Hübotter et al., 2026, [arXiv:2601.20802](https://arxiv.org/abs/2601.20802), ICLR 2026 Scaling Post-training Workshop). Generalizes OPSD to RL with rich feedback: *"SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy."* This is **mathematically the same** as Composer's targeted-textual-feedback method. **There is published code.** Comparison table from the SDPO paper:
+  | Method | Sampling | Signal | Feedback |
+  |---|---|---|---|
+  | SFT / Distillation (Hinton 2015) | off-policy | rich | strong teacher |
+  | On-Policy Distillation (Agarwal 2024) | on-policy | rich | strong teacher |
+  | RLVR / GRPO (Lambert 2025) | on-policy | weak | environment |
+  | **SDPO (this paper / Composer)** | **on-policy** | **rich** | **environment** |
+- **Self-Distillation Enables Continual Learning** ([arXiv:2601.19897](https://arxiv.org/abs/2601.19897)).
+**Key reproducibility gap (still unsolved):** *How are the hints generated?* The blog gives one example template ("Reminder: Available tools are…") but doesn't say whether hints come from hardcoded templates, a separate model (Opus?), the same model with an introspection prompt, or a learned hint generator. **This is the single most important open question for replication.**
+### 2. **Synthetic data at 25× scale** `[BLOG-VERIFIED]`
+> *"Composer 2.5 is trained with 25x more synthetic tasks than Composer 2."* — Cursor blog
+- **Feature Deletion** is one named approach: take a repo with passing tests, delete some code, ask the agent to reimplement to pass tests. Tests = verifiable reward.
+- The blog explicitly mentions reward-hacking failures: model decompiled Java bytecode, reverse-engineered Python type-checking caches, to recover deleted APIs. This is a **real risk**, not theoretical.
+- "Agentic monitoring tools" are mentioned as the mitigation, but no specifics.
+### 3. **Sharded Muon + dual mesh HSDP** `[BLOG-VERIFIED]`
+> *"For continued pretraining, we use Muon with distributed orthogonalization. After forming the momentum update, we run Newton-Schulz at the model's natural granularity: per attention head for attention projections, and per expert for stacked MoE weights."* — Cursor blog
+- Two HSDP layouts: narrow (intra-node) for non-expert weights, wide for expert weights.
+- Blackwell-optimized. CP=2 + EP=8 on 8 GPUs (instead of 16 in shared mesh).
+- Optimizer step time on 1T model: **0.2 s**.
+This is **infrastructure, not algorithm**. It only matters at MoE-1T scale; for our v0.0 (Qwen3-7B dense) and v0.1 (Qwen3-32B dense) it's irrelevant. Becomes relevant if we ever train a Kimi-K2.5-derivative directly.
+## What the subagent added beyond the blog (`[EXTRAPOLATED]` and `[INFERRED]`)
+`research/01-composer-2.5.md` introduced these claims that are **NOT in the Cursor blog**. Most are likely correct from secondary sources, but they are not blog-verified.
+| Claim | Source basis | Verdict |
+|---|---|---|
+| "85% of total compute is post-training" | `[EXTRAPOLATED]` — likely from secondary commentary (Jake Handy substack, HN thread cited by subagent). | **Plausible but unverified.** Cursor doesn't publish the ratio. Treat as community consensus, not Cursor-stated. |
+| Anyrun environment harness with LSP/file-I/O/terminal | `[EXTRAPOLATED]` — name "Anyrun" doesn't appear in the 2.5 blog (may be in the Composer-2 technical report). | **Plausible** — Cursor 2.5 does say "asynchronous, sandboxed real-world coding environments" which is consistent. But "Anyrun" as a brand name isn't sourced from the 2.5 post. |
+| MLA + 1T/32B active + 384 experts + 256K ctx | `[INFERRED]` from Kimi K2.5 base-model knowledge. The blog only says "built on Kimi K2.5". | **Verified independently** via [Moonshot's K2.5 model card](https://huggingface.co/moonshotai). Correct. |
+| CursorBench 69.3%, Terminal-Bench 2.0 parity, SWE-bench Multilingual | `[EXTRAPOLATED]` — blog doesn't quote benchmarks. | **Source unclear.** Probably from Cursor's launch comms / Twitter thread / a different blog post. Don't cite as 2.5-blog-verified. |
+| "PPO or GRPO variant" | `[EXTRAPOLATED]` — blog never names the RL algorithm. | **Educated guess.** Composer 2 technical report likely says; the 2.5 blog does not. The cited SDPO paper sits *on top of* an unspecified RLVR algorithm, so this is still open. |
+| "Continued pretraining on heavily code-weighted data" | `[BLOG-VERIFIED]` — blog says exactly this in the Sharded Muon section ("For continued pretraining…"). | Verified. |
+| "Behavioral aspects: communication style, effort calibration" | `[BLOG-VERIFIED]` — blog mentions improving these and notes existing benchmarks don't capture them. | Verified, but blog doesn't say *how* they're trained. The targeted-textual-feedback method is presumably also used here. |
+## Mapping each blog component → our replication framework
+| Composer 2.5 stage | Blog mechanism | Our replication target | v0.0 | v0.1 | v0.2 |
+|---|---|---|---|---|---|
+| **(a)** Continued pretraining on code | Standard pretraining, code-weighted | Skip — start from already-code-tuned `Qwen3-Coder-7B` or `Qwen3-Coder-30B-A3B` | ✗ | ✗ | ✗ |
+| **(b)** Synthetic data at scale | Feature Deletion is the one named generator; the others are unnamed (the blog states 25× more synthetic tasks, not a generator count) | Build 1 generator (Feature Deletion) as OpenEnv-compatible env. Use SWE-bench-lite and SWE-Gym as drop-in alternatives. | ✗ (use SWE-bench-lite only) | ✓ (build Feature Deletion) | scale generator suite |
+| **(c)** Realistic-environment RL (RLVR) | Async sandboxes, same tool harness as production | TRL `GRPOTrainer` + verifiers + OpenEnv; SWE-bench-lite env in v0.0; build sandboxed code execution env in v0.1 | ✓ baseline | ✓ + DAPO patches | + decentralized rollouts |
+| **(d)** Targeted RL w/ textual feedback (Composer's secret sauce) | Same-model self-distill: insert hint into context → teacher; original → student; on-policy KL at the turn | **Lift the OPSD/SDPO loss directly from `siyan-zhao/OPSD`** (published code, MIT). Generate hints via templates (v0.1) or LLM (v0.2). | ✗ (deferred) | ✓ (this is the Composer-recipe channel) | + learned hint generator |
+| **(e)** Trace-replay multi-teacher distill (NOVEL — our addition) | N/A (not in Composer) | N=3 teachers (Opus 4.7, GPT-5, DeepSeek V4 Pro) replay each step; disagreement → DPO pairs | ✓ (this is the v0.0 novelty bet) | ✓ + VOI gating | + tiered teachers |
+| **(f)** Sharded Muon / dual-mesh HSDP | MoE optimizer infra | Skip until we go to MoE bases — irrelevant for dense Qwen3-{7,32}B | ✗ | ✗ | ✗ (only if MoE base) |
+| **(g)** Reward-hacking safeguards | "Agentic monitoring tools" — unspecified | Static analysis + bytecode-cache-deletion + a sandboxed shell with no `find` / `strings` / `unzip` access in the env | ✗ (small surface) | ✓ (build the monitor) | + RM-based penalty |
+## Critical relationship: Composer hint-distill vs. trace-replay-distill
+These are **two different mechanisms**, not competing implementations of the same idea. Initial framework synthesis blurred them; this section makes the distinction precise.
+| Property | Composer hint-distill (= SDPO/OPSD) | Trace-replay multi-teacher (NOVEL — ours) |
+|---|---|---|
+| Number of models | **1** (same model is teacher + student) | **N+1** (frozen N teachers + 1 trainable student) |
+| What "teacher" means | Student-with-hint-in-context | External pretrained models from other labs |
+| Per-step cost | ~1 extra forward pass (cheap) | N teacher API calls (~$0.02/step at N=3 per spike 001) |
+| Privileged information | Hint text in context | None — teachers see same state student sees |
+| Source of hint / privileged info | **Open question.** Templates? LLM judge? | Not applicable |
+| Relationship to RLVR | Adds dense per-turn signal *on top of* RLVR scalar reward | Same — adds dense per-step signal on top of RLVR |
+| Bypasses long-horizon credit assignment? | Yes (per-turn KL) | Yes (per-step DPO/PRM) |
+| Published code? | **Yes — `siyan-zhao/OPSD` (MIT)** | Not yet — we're building it |
+| Novel in the framework? | No — this is Composer's published recipe | **Yes — the v0.0 research bet** |
+**Both channels stack on the same RLVR base.** The full v0.1 trainer has THREE reward channels:
+1. **RLVR** (verifiable scalar reward — tests pass / build succeeds). Ground truth, never skipped.
+2. **Composer hint-distill** = SDPO loss (one extra forward pass per error site, hint-conditioned).
+3. **Trace-replay-distill** = DPO/PRM from N external teachers (~$0.30/trace with VOI gating, our novelty bet).
+In v0.0 we test channel 3 in isolation against channel 1 (the spike 004 A/B). We deliberately defer channel 2 to v0.1 to keep the v0.0 experiment small.
+## Why deferring Composer hint-distill to v0.1 is the right call
+I considered adding hint-distill to v0.0 to do a 4-arm A/B (RLVR / RLVR+SDPO / RLVR+trace-replay / RLVR+SDPO+trace-replay). Decided against it for v0.0 because:
+1. **The novel claim is trace-replay.** The Composer recipe is already published; SDPO is already published with code. Validating SDPO at 7B is engineering, not novel research.
+2. **The hint-generator open question is unresolved.** Without that, an SDPO arm is "SDPO with hardcoded tool-name templates" which is the easy case and doesn't validate the harder behavior cases (style, communication).
+3. **Spike 001's economic verdict only gates the trace-replay channel.** SDPO has no per-step API cost — it's just an extra forward pass on the same GPU. Different cost model.
+4. **A 4-arm A/B at 7B costs ~$600 vs. ~$300 for the 2-arm.** Not worth it for v0.0.
+v0.1 will have the full 4-arm (or at least 3-arm: RLVR / RLVR+SDPO / RLVR+SDPO+trace-replay) at 32B once we know v0.0's trace-replay verdict.
+## Implementation handles for v0.1 (concrete starting points)
+When we get to v0.1, the **Composer hint-distill channel** has a clear engineering path:
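+1. **Lift the self-distillation loss math from `siyan-zhao/OPSD`.** The repo is MIT licensed and implements OPSD, which SDPO (the ICLR 2026 workshop paper) generalizes; per the mapping above, this is the mechanism Cursor describes. Their code targets HuggingFace transformers; it should slot into TRL's GRPO or PRIME-RL with ~50 LoC of glue.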
+2. **Hint generator v1: hardcoded templates.** Pattern-match on tool-call errors:
+   - `"Tool not found: X"` → hint = `"Reminder: Available tools are: <list of valid tools>"`
+   - `"JSONDecodeError: ..."` → hint = `"Reminder: tool arguments must be valid JSON"`
+   - `"Type error in args"` → hint = `"Reminder: <tool-name> expects args matching schema: <schema>"`
+   This handles the "tool call error" case from Cursor's blog example. Style/communication is harder; defer it to v0.2 with an LLM-based hint generator.
+3. **Apply only at error sites,** not every turn. Detect via:
+   - Failed tool calls (status != ok)
+   - Exception traces in tool output
+   - Optional: a lightweight judge model flagging "this turn was wasteful" (matches Cursor's "communication style" use case)
+4. **Loss = `α * GRPO_loss + β * SDPO_KL_at_error_turns + γ * trace_replay_DPO_loss`.** Ablate `(α, β, γ)`.
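Steps 2 and 3 above can be sketched together as a template-based hint generator that fires only at error sites. This is a hedged illustration: the `Turn` record, its field names, the `"ok"` status value, and the Python-style traceback marker are hypothetical choices for the example, not part of any existing harness. Only the three error patterns and hint strings come from the list above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Turn:
    # Hypothetical record of one tool-calling turn in a rollout.
    tool_name: str
    status: str   # assumed convention: "ok" on success
    output: str   # raw tool output, including any error text

def is_error_site(turn: Turn) -> bool:
    # Step 3: failed tool call, or an exception trace in the tool output.
    # The traceback marker is Python-specific and is an assumption.
    return turn.status != "ok" or "Traceback (most recent call last)" in turn.output

def make_hint(turn: Turn, valid_tools: List[str],
              schemas: Dict[str, str]) -> Optional[str]:
    # Step 2: map known error patterns to hardcoded hint templates.
    # Returns None when the turn is fine or no template matches.
    if not is_error_site(turn):
        return None
    if "Tool not found" in turn.output:
        return "Reminder: Available tools are: " + ", ".join(valid_tools)
    if "JSONDecodeError" in turn.output:
        return "Reminder: tool arguments must be valid JSON"
    if "Type error in args" in turn.output:
        schema = schemas.get(turn.tool_name)
        if schema is not None:
            return f"Reminder: {turn.tool_name} expects args matching schema: {schema}"
    return None

bad = Turn(tool_name="grep", status="error", output="Tool not found: grep")
hint = make_hint(bad, valid_tools=["read_file", "run_shell"], schemas={})
```

Returning `None` for unmatched errors keeps the SDPO term off turns where no trustworthy hint exists, so those turns fall back to the RLVR signal alone.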
+## Implementation handles for v0.2 (decentralized scale)
+If v0.1 validates and we scale, here's what each Composer-stage maps to in a multi-cluster setting:
+- **Continued pretraining:** Pretrained checkpoint already exists (Qwen3-32B); skip.
+- **Synthetic data:** Generators run on CPU pool, producing OpenEnv tasks pushed to a shared queue. Embarrassingly parallel.
+- **Realistic-env RL:** PRIME-RL's orchestrator/trainer/inference split, vLLM↔FSDP2 weight broadcast (SHARDCAST). v0.2 adds Streaming DiLoCo outer loop only when training spans clusters.
+- **Targeted hint-distill:** Compute is local to each trainer — no decentralization complication.
+- **Trace-replay-distill:** Teacher API calls are independent — embarrassingly parallel across rollout workers. VOI gating becomes more important to control cost at scale.
+- **Sharded Muon / dual-mesh HSDP:** Only if we adopt MoE base. For dense 32B, FSDP2 is fine.
+## Citations (updated)
+Primary sources for each Composer-2.5 component, post-audit:
+- **Cursor blog** — [Introducing Composer 2.5](https://cursor.com/blog/composer-2-5) (2026)
+- **Cursor blog** — [Composer 2 technical report](https://cursor.com/blog/composer-2-technical-report) (predecessor; named the "Anyrun" environment per subagent — verify if needed)
+- **OPSD paper** — Zhao et al., *Self-Distilled Reasoner: On-Policy Self-Distillation for LLMs*, [arXiv:2601.18734](https://arxiv.org/abs/2601.18734), code at [siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD). MIT.
+- **SDPO paper** — Hübotter et al., *Reinforcement Learning via Self-Distillation*, [arXiv:2601.20802](https://arxiv.org/abs/2601.20802), ICLR 2026 Scaling Post-training Workshop. The direct formalization of Composer's hint-distill.
+- **Self-Distillation continual-learning** — [arXiv:2601.19897](https://arxiv.org/abs/2601.19897). Cited by Cursor; less directly relevant.
+- **Moonshot Kimi K2.5** — base model, [Moonshot AI on Hugging Face](https://huggingface.co/moonshotai).
+The methodology mapping in this document supersedes vague claims in `research/01-composer-2.5.md` where the two conflict; that file is preserved unchanged for provenance (a snapshot of the parallel-research dispatch output) but should not be cited as ground truth on its own.

framework/composer-replication-framework.md CHANGED

@@ -13,7 +13,7 @@
 | **Training framework** | **PRIME-RL** (Prime Intellect) as substrate; **TRL** for algorithm correctness; borrow **VeRL's 3D-HybridEngine** patterns | PRIME-RL ships the orchestrator/trainer/inference split + decentralized story; TRL has cleanest GRPO+OpenEnv; VeRL's reshard logic is the production reference |
 | **Distributed sync** | **PRIME-RL's vLLM↔FSDP2 weight broadcast (SHARDCAST)** for v0.1; bolt on **Streaming DiLoCo** outer loop only when scaling beyond one cluster | DiLoCo isn't useful when training fits one node. Add it when going multi-DC. |
 | **Environments** | **OpenEnv + verifiers (Hub-hosted)** with Cursor-style "Anyrun" sandboxes | OpenEnv is the emerging standard; HF + Meta backing; MCP tool-calling RFC landing |
-| **Reward signal** | Three-channel: (1) RLVR (tests pass), (2) **Targeted hint distillation** (Composer's secret sauce), (3) **Trace-replay multi-teacher PRM** (your novel idea) | Composer proved (1)+(2) work; (3) is genuinely novel and stacks cleanly |
 | **Trace-replay novelty** | Genuinely under-explored. Closest precedent: rStar-Math (single-teacher MCTS counterfactuals). Multi-teacher *frozen-trace replay* is open territory | Worth publishing if it works |
 | **Orchestration** | Monarch (when it matures) or Ray (today) for the actor mesh; **OpenEnv** for the env contract | Forge has been "development-paused" — borrow patterns, don't depend on it |
@@ -119,12 +119,12 @@ From `05-trace-replay-distillation.md`:
 3. Get N candidate `action_t` distributions.
 4. Use disagreement / agreement as a **per-step reward signal** for the student model.
-**This stacks beautifully with Composer's hint-distillation.** Composer's hint-distill is "when student errs, generate hint, pull student toward hint-conditioned-self." Trace-replay-distill is "at every step, pull student toward the consensus of N teachers." Together:
-- Composer's hint-loss = **teacher-self pulls student** at error sites
-- Trace-replay-loss = **N external teachers pull student** at all sites (or high-uncertainty sites with VOI gating)
-These are *complementary*, not competing. Both give per-step KL signals that bypass the long-horizon credit assignment problem.
 **Cost mitigation** (the report does this analysis well):
 - VOI gating (only query teachers when student entropy is high) → 60-80% savings

 | **Training framework** | **PRIME-RL** (Prime Intellect) as substrate; **TRL** for algorithm correctness; borrow **VeRL's 3D-HybridEngine** patterns | PRIME-RL ships the orchestrator/trainer/inference split + decentralized story; TRL has cleanest GRPO+OpenEnv; VeRL's reshard logic is the production reference |
 | **Distributed sync** | **PRIME-RL's vLLM↔FSDP2 weight broadcast (SHARDCAST)** for v0.1; bolt on **Streaming DiLoCo** outer loop only when scaling beyond one cluster | DiLoCo isn't useful when training fits one node. Add it when going multi-DC. |
 | **Environments** | **OpenEnv + verifiers (Hub-hosted)** with Cursor-style "Anyrun" sandboxes | OpenEnv is the emerging standard; HF + Meta backing; MCP tool-calling RFC landing |
+| **Reward signal** | Three-channel: (1) RLVR (tests pass), (2) **Composer hint-distill = SDPO/OPSD** (single-model, hint-conditioned self-teacher), (3) **Trace-replay multi-teacher PRM** (your novel idea — N external teachers) | Composer's hint-distill is published as SDPO ([arXiv:2601.20802](https://arxiv.org/abs/2601.20802) + [code](https://github.com/siyan-zhao/OPSD)); we lift it for v0.1. Channel (3) is genuinely novel and stacks on top. **They are TWO different mechanisms**, not competing implementations — see `docs/COMPOSER_RECIPE_MAPPING.md` for the precise distinction. |
 | **Trace-replay novelty** | Genuinely under-explored. Closest precedent: rStar-Math (single-teacher MCTS counterfactuals). Multi-teacher *frozen-trace replay* is open territory | Worth publishing if it works |
 | **Orchestration** | Monarch (when it matures) or Ray (today) for the actor mesh; **OpenEnv** for the env contract | Forge has been "development-paused" — borrow patterns, don't depend on it |
 3. Get N candidate `action_t` distributions.
 4. Use disagreement / agreement as a **per-step reward signal** for the student model.
+**This stacks beautifully with Composer's hint-distillation — but they are TWO DIFFERENT MECHANISMS, not competing implementations of the same idea.** Composer's hint-distill (= SDPO / OPSD, [arXiv:2601.20802](https://arxiv.org/abs/2601.20802) + [arXiv:2601.18734](https://arxiv.org/abs/2601.18734), code at [github.com/siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD)) uses **a single model** as both teacher and student, with the teacher just being "the model with a hint inserted into context." Trace-replay-distill uses **N external pretrained models** as teachers. Together:
+- Composer's hint-loss = **same-model self-teacher with hint context** pulls student at error sites (~1 extra forward pass / cheap, no API)
+- Trace-replay-loss = **N external teachers** pull student at all sites (or high-uncertainty sites with VOI gating; ~$0.30/trace with gating per spike 001)
+These are **complementary**, not competing. Both give dense per-step signals that bypass the long-horizon credit assignment problem (per-turn KL for hint-distill, per-step DPO/PRM for trace-replay), but they tap different supervision sources. v0.1 of the framework runs both simultaneously. See [`docs/COMPOSER_RECIPE_MAPPING.md`](docs/COMPOSER_RECIPE_MAPPING.md) for the precise distinction and the v0.1 implementation handles.
 **Cost mitigation** (the report does this analysis well):
 - VOI gating (only query teachers when student entropy is high) → 60-80% savings
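The VOI gating rule above can be sketched as an entropy threshold on the student's next-action distribution. This is an assumption-laden illustration: the function names, the threshold in nats, and treating the per-step distribution as a plain probability list are all hypothetical, and the per-step price simply reuses the ~$0.02/step figure quoted for N=3 from spike 001.

```python
import math

def entropy(probs):
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_query_teachers(student_probs, threshold_nats):
    # Gate: spend teacher API calls only when the student is uncertain.
    return entropy(student_probs) >= threshold_nats

def gated_cost(step_probs, threshold_nats, cost_per_queried_step):
    # Count the steps that pass the gate and price them.
    queried = sum(should_query_teachers(p, threshold_nats) for p in step_probs)
    return queried, queried * cost_per_queried_step

# Hypothetical 2-step trace: one confident step, one uncertain step.
steps = [
    [0.97, 0.01, 0.01, 0.01],   # low entropy: skip the teachers
    [0.25, 0.25, 0.25, 0.25],   # high entropy: query the teachers
]
queried, cost = gated_cost(steps, threshold_nats=0.5, cost_per_queried_step=0.02)
```

The threshold is the knob that sets the savings: the 60-80% range quoted above corresponds to gating out that fraction of steps, which would have to be measured on real traces rather than assumed.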

research/01-composer-2.5.md CHANGED

@@ -1,5 +1,15 @@
 # Cursor Composer 2.5: Deep Research Report
 ## Overview
 Cursor's Composer 2.5 is an advanced agentic coding model that powers the Cursor IDE. Released in mid-May 2026, it represents a massive leap in agentic capabilities, particularly for long-running, multi-file software engineering tasks. While the base weights are Moonshot AI's open-source **Kimi K2.5** model, roughly 85% of the total compute budget for Composer 2.5 was spent on Cursor's proprietary post-training and Reinforcement Learning (RL) pipeline.

 # Cursor Composer 2.5: Deep Research Report
+> **⚠️ Audit notice (added 2026-05-25, post-hoc):** This file is a snapshot of the parallel-research dispatch output (Gemini 3.1 Pro, ~10-min web-research subagent). It was **not** rigorously cross-checked against the Cursor blog at write-time. A rigorous audit and stage-by-stage mapping was added later at [`docs/COMPOSER_RECIPE_MAPPING.md`](../docs/COMPOSER_RECIPE_MAPPING.md). When this file and that file conflict, **trust the mapping document** — it was written after directly reading the Cursor blog with `tavily_extract`. This file is preserved unchanged for research provenance.
+>
+> Specific claims in this file that are **NOT** in the Cursor blog (and are extrapolated from secondary sources):
+> - "85% of total compute is post-training" — community consensus, not Cursor-stated
+> - "Anyrun" environment harness with LSP/file-I/O/terminal — likely from the Composer 2 report, not 2.5
+> - "CursorBench 69.3%, Terminal-Bench 2.0 parity" — not in the 2.5 blog
+> - "PPO or GRPO variant" — blog never names the RL algorithm
+>
+> The targeted-textual-feedback method is correctly described, but this file does **not** cite the three self-distillation papers Cursor cites in footnote 1 (OPSD `arXiv:2601.18734`, SDPO `arXiv:2601.20802`, Self-Distillation Continual Learning `arXiv:2601.19897`). The mapping document does.
 ## Overview
 Cursor's Composer 2.5 is an advanced agentic coding model that powers the Cursor IDE. Released in mid-May 2026, it represents a massive leap in agentic capabilities, particularly for long-running, multi-file software engineering tasks. While the base weights are Moonshot AI's open-source **Kimi K2.5** model, roughly 85% of the total compute budget for Composer 2.5 was spent on Cursor's proprietary post-training and Reinforcement Learning (RL) pipeline.

spikes/README.md CHANGED

@@ -23,13 +23,22 @@
 ## Out of scope for v0.0 (deferred to v0.1)
-- Composer's hint-distillation loss (the per-turn KL from a hint-conditioned forward pass)
-- The Feature Deletion environment (use SWE-bench-lite as the env)
 - DiLoCo / decentralized training (single-node FSDP2 is fine at 7B)
 - Monarch / Forge (use Ray + verifiers, the PRIME-RL stack)
 - MoE base (use dense Qwen3-7B; saner v0.0 target)
 - VOI gating, tiered teachers (do the full N=3 query at every step in v0.0; cost mitigation is a v0.1 optimization)
 ## Budget
 | Item | Estimate | Source |

 ## Out of scope for v0.0 (deferred to v0.1)
+- **Composer hint-distill = SDPO/OPSD** (per-turn KL from a hint-conditioned forward pass). Cursor's secret sauce. **Code is published** at [github.com/siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD); paper [arXiv:2601.20802](https://arxiv.org/abs/2601.20802). Lift the loss for v0.1 — see `docs/COMPOSER_RECIPE_MAPPING.md` § "Implementation handles for v0.1" for the concrete plan.
+- The Feature Deletion environment (use SWE-bench-lite as the env in v0.0)
 - DiLoCo / decentralized training (single-node FSDP2 is fine at 7B)
 - Monarch / Forge (use Ray + verifiers, the PRIME-RL stack)
 - MoE base (use dense Qwen3-7B; saner v0.0 target)
 - VOI gating, tiered teachers (do the full N=3 query at every step in v0.0; cost mitigation is a v0.1 optimization)
+**Why deferring SDPO/hint-distill to v0.1 is the right call:**
+1. The novel claim is trace-replay (channel 3). The Composer recipe is already published; SDPO is already published with code. Validating SDPO at 7B is engineering, not novel research.
+2. The hint-generator open question (templates vs. LLM-driven hints) is unresolved. v0.0 with hardcoded tool-call templates only validates the easy case.
+3. Spike 001's economic verdict gates only the trace-replay channel. SDPO has no per-step API cost.
+4. A 4-arm A/B at 7B costs ~$600 vs. ~$300 for the 2-arm. Not worth it for v0.0.
+v0.1 will run a 3-arm A/B: **RLVR** vs. **RLVR + SDPO** vs. **RLVR + SDPO + trace-replay-DPO** at 32B once we know v0.0's trace-replay verdict.
 ## Budget
 | Item | Estimate | Source |