Instructions to use Codeseys/composer-replication-framework with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Codeseys/composer-replication-framework with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Codeseys/composer-replication-framework", dtype="auto") - Notebooks
- Google Colab
- Kaggle
research: Composer 2.5 data-gen + targeted-textual-feedback deep-research wave
Browse filesPhase-3 of the deep-work-loop bringing Composer 2.5's dataset-generation and
targeted-RL-with-textual-feedback methods into the framework. 5 new research
docs (130KB), all with primary-source citations:
- 06-feature-deletion-datagen.md: Feature Deletion env design (online pass-rate
difficulty gate, reward-hacking safeguards, 5 OSS substrates w/ HF ids +
licenses, deletion mechanics, FeatureDeletionEnv + TRL reward_fn adapter).
- 07-sdpo-hint-generator.md: layered HintGenerator (template -> raw-error ->
LLM-judge -> introspection -> learned -> SDPO sibling-bootstrap), actual
template strings, judge prompt, slots into existing CollatorConfig hook.
- 08-sdpo-grpo-integration.md: ComposerGRPOTrainer(GRPOTrainer) design adding
SDPO KL at error turns on top of Dr. GRPO; PRIME-RL recipe blocked (log-probs
only), TRL subclass is the host; CPU smoke plan.
- 09-composer-blog-delta-2026.md: blog re-extraction delta; found the Composer 2
arXiv tech report; SDPO successful-rollout-as-implicit-feedback lever.
- 10-composer2-techreport-mining.md: arXiv:2603.24477 mined. RESOLVED RL algo =
Dr. GRPO (length-std removed, no std-norm, k1 KL, Adam, single-epoch, MoE
router-replay). Hint-gen confirmed ABSENT from every Cursor artifact ->
SDPO/OPSD reconstruction is the only path. Corrections: optimizer Adam not
Muon; sharding FSDP+CP+decoupled-EP not HSDP. Hint-free behavior-shaping
alternative (aux scalar rewards + nonlinear length/effort penalty eq).
|
@@ -0,0 +1,346 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Feature-Deletion Data Generation → `FeatureDeletionEnv` Design Brief
|
| 2 |
+
|
| 3 |
+
> **Author date:** 2026-05-28.
|
| 4 |
+
> **Scope:** Turn Composer 2.5's *Feature Deletion* synthetic-task approach (component **#2 "Synthetic data at 25× scale"**, mapping row **(b)**, reward-hacking row **(g)**) into a real, usable data-generation subsystem for this framework. This is the design brief that mapping-table row (b) calls for ("Build 1 generator (Feature Deletion) as OpenEnv-compatible env").
|
| 5 |
+
> **Method:** Live blog re-extraction (`mcp_tavily_tavily_extract` advanced) of [cursor.com/blog/composer-2-5](https://cursor.com/blog/composer-2-5); substrate-dataset cards pulled live from HF/arXiv; TRL `GRPOTrainer` reward-fn convention confirmed against the [TRL source](https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_trainer.py).
|
| 6 |
+
> **Tag convention** (matches `docs/COMPOSER_RECIPE_MAPPING.md`): **`[BLOG-VERIFIED]`** = verbatim in the 2.5 blog; **`[INFERRED]`** = reasonable extrapolation from blog + open-source prior art; **`[EXTRAPOLATED]`** = our design addition, not Cursor-stated.
|
| 7 |
+
> **Reads-before:** `docs/COMPOSER_RECIPE_MAPPING.md` (§2, rows b/g) and `research/09-composer-blog-delta-2026.md` (online-curriculum delta). This file does **not** re-derive the Targeted-RL / SDPO material (that is rows (d) and `research/05`); it is the data-gen side only.
|
| 8 |
+
|
| 9 |
+
---
|
| 10 |
+
|
| 11 |
+
## 0. TL;DR
|
| 12 |
+
|
| 13 |
+
Feature Deletion is a **self-verifying inverse task**: take a repo whose test suite passes, *programmatically remove* a testable feature (so the suite now fails), and reward the agent for reimplementing it until the suite passes again. The reward is the pre-existing test suite — **verifiable, no human labels, no golden patch needed at reward time**. We can stand this up immediately on five open substrates (SWE-Gym, SWE-bench-Lite, R2E-Gym, SWE-rebench, OpenHands/Nemotron trajectories) by *inverting* their `(repo, base_commit, gold_patch, test_patch)` tuples instead of generating deletions from scratch. The two non-obvious requirements the blog forces on us: (1) an **online pass-rate difficulty gate** (the curriculum is dynamic, not a static bank — per the delta note), and (2) **anti-reward-hacking sandboxing** because Cursor observed the model recovering deleted signatures from bytecode/type-check caches. Below: a `FeatureDeletionEnv` Gym/OpenEnv class sketch wired for TRL `GRPOTrainer` (reward = test pass-fraction), the deletion mechanics (AST/file/coverage-mapped), the sandbox lockdown spec, and a CPU-pool cost model.
|
| 14 |
+
|
| 15 |
+
---
|
| 16 |
+
|
| 17 |
+
## 1. What "Feature Deletion" is, exactly `[BLOG-VERIFIED]`
|
| 18 |
+
|
| 19 |
+
Verbatim from the Synthetic-data section of the blog (re-pulled 2026-05-28):
|
| 20 |
+
|
| 21 |
+
> *"During RL training, Composer's coding ability improves substantially to the point where it begins to get most training problems correct. To continue increasing intelligence, **we both select for and create harder tasks dynamically throughout the run**. Composer 2.5 is trained with **25x more synthetic tasks** than Composer 2.*
|
| 22 |
+
>
|
| 23 |
+
> *We use a range of approaches for creating synthetic tasks that are grounded in real codebases. For example, one synthetic approach is **feature deletion**. For these tasks the agent is given a codebase with a large set of tests, and asked to **delete code and files in such a way that the codebase remains functional while specific testable features are removed**. The synthetic task is to **reimplement the feature, and the tests are used as a verifiable reward**."*
|
| 24 |
+
|
| 25 |
+
**Parse of the mechanism (note the two-agent / two-phase structure the blog implies):**
|
| 26 |
+
|
| 27 |
+
| Phase | Actor | Action | Output |
|
| 28 |
+
|---|---|---|---|
|
| 29 |
+
| **Deletion (task-construction)** | a *deleter* (model or program) | "delete code and files in such a way that the codebase remains functional while specific testable features are removed" | a `broken_repo` + the set of tests that now fail |
|
| 30 |
+
| **Reimplementation (the training task)** | the *policy under training* | reimplement the deleted feature | a diff scored by the test suite |
|
| 31 |
+
|
| 32 |
+
- The deletion step is itself non-trivial: it must keep the codebase *otherwise functional* (imports resolve, unrelated tests still pass) while making **specific testable features** fail. `[BLOG-VERIFIED]` that this constraint exists; `[INFERRED]` that in practice this means *partition the test suite into a kept set (`PASS_TO_PASS`) that must still pass and a target set (`FAIL_TO_PASS`) that the deletion must break.*
|
| 33 |
+
- **Verifiable reward = the original test suite.** No golden patch is needed at reward time (only at task-construction time, to know what "done" looks like). This is the key property that makes the env cheap to run in an RL loop.
|
| 34 |
+
- The blog does **not** state: how deletion targets are *selected*, the deleter model, the languages beyond the Python/Java implied by the reward-hacking examples, or the difficulty heuristic. Those are the reproducibility gaps (consistent with `research/09` §1 "NO CHANGE" line).
|
| 35 |
+
|
| 36 |
+
**Relationship to the inverse-of-SWE-bench framing** `[INFERRED]`: A SWE-bench-style instance is `(repo@base_commit, problem_statement, gold_patch, test_patch, FAIL_TO_PASS, PASS_TO_PASS)`. Feature Deletion is the *constructive inverse*: instead of mining a human PR that fixed a bug, we **apply `revert(gold_patch)` (or an AST-deletion) to a passing repo** to manufacture the broken state, then ask the agent to re-derive `gold_patch`. This means **every existing SWE-* instance is already a ready-made Feature-Deletion task** — we get the deletion "for free" by reverting the gold patch. This is the single most important leverage point in this brief (see §4, §5).
|
| 37 |
+
|
| 38 |
+
---
|
| 39 |
+
|
| 40 |
+
## 2. The online difficulty curriculum `[BLOG-VERIFIED]` framing → `[EXTRAPOLATED]` design
|
| 41 |
+
|
| 42 |
+
The delta note (`research/09` §1, DELTA-new-emphasis) is explicit that *"select for and create harder tasks dynamically throughout the run"* is a **dynamic curriculum / online task-selection** signal, not a static bank. Our generator must therefore expose a **pass-rate gate**, not just emit tasks. Design:
|
| 43 |
+
|
| 44 |
+
**Difficulty signal.** For each candidate task `t`, maintain an EMA of the policy's group pass-rate `p̂(t)` (TRL GRPO already samples `G` completions per prompt — we get `G` pass/fail observations per task per step *for free*). Define difficulty `d(t) = 1 − p̂(t)`.
|
| 45 |
+
|
| 46 |
+
**Two levers the blog names — "select for" and "create":**
|
| 47 |
+
|
| 48 |
+
1. **`select for` (online filter / replay weighting)** `[EXTRAPOLATED]`. Sampling weight over the task pool:
|
| 49 |
+
- **Drop solved tasks:** if `p̂(t) > τ_easy` (e.g. 0.9) for `k` consecutive evaluations, retire `t`. This is exactly the blog's "begins to get most training problems correct" symptom.
|
| 50 |
+
- **Drop impossible tasks:** if `p̂(t) < τ_hard` (e.g. 0.02) after `k` exposures, quarantine `t` (likely a broken-task or reward-hack-only task — see §3).
|
| 51 |
+
- **Up-weight the frontier:** sample weight `w(t) ∝ p̂(t)·(1−p̂(t))` (max variance ≈ max learning signal; standard curriculum-RL choice, cf. PLR / TD-error curricula). Keeps the policy on tasks it solves ~50% of the time.
|
| 52 |
+
2. **`create` (difficulty escalation)** `[EXTRAPOLATED]`. When the pool's median `p̂` rises above a band, the *generator* produces harder tasks. Concrete escalation knobs, easiest→hardest:
|
| 53 |
+
- **Deletion span:** single-function → whole-class → whole-file → cross-file feature (more `FAIL_TO_PASS` tests, more LoC to reconstruct).
|
| 54 |
+
- **Hint starvation:** strip docstrings / type hints / the deleted function's signature from the surrounding context (also a reward-hack-surface reduction, §3).
|
| 55 |
+
- **Coupling:** delete a feature that several `PASS_TO_PASS` tests *also* exercise, so the agent must reconstruct it without breaking neighbors.
|
| 56 |
+
- **Multi-feature:** delete `n>1` independent features in one repo (reward = fraction of target tests passing — naturally graded).
|
| 57 |
+
|
| 58 |
+
**Implementation handle (curriculum is a thin layer over the task pool):**
|
| 59 |
+
|
| 60 |
+
```python
|
| 61 |
+
# datagen/curriculum.py [EXTRAPOLATED]
|
| 62 |
+
import math, random, collections
|
| 63 |
+
|
| 64 |
+
class PassRateCurriculum:
|
| 65 |
+
"""Online difficulty gate. Fed (task_id, n_pass, n_total) after each GRPO group."""
|
| 66 |
+
def __init__(self, tau_easy=0.90, tau_hard=0.02, ema=0.3, retire_k=3):
|
| 67 |
+
self.p = collections.defaultdict(lambda: 0.5) # EMA pass-rate
|
| 68 |
+
self.seen = collections.Counter()
|
| 69 |
+
self.retired, self.quarantined = set(), set()
|
| 70 |
+
self.tau_easy, self.tau_hard, self.ema, self.retire_k = tau_easy, tau_hard, ema, retire_k
|
| 71 |
+
|
| 72 |
+
def update(self, task_id, n_pass, n_total):
|
| 73 |
+
r = n_pass / max(n_total, 1)
|
| 74 |
+
self.p[task_id] = (1 - self.ema) * self.p[task_id] + self.ema * r
|
| 75 |
+
self.seen[task_id] += 1
|
| 76 |
+
if self.seen[task_id] >= self.retire_k:
|
| 77 |
+
if self.p[task_id] > self.tau_easy: self.retired.add(task_id)
|
| 78 |
+
elif self.p[task_id] < self.tau_hard: self.quarantined.add(task_id) # likely broken / hack-only
|
| 79 |
+
|
| 80 |
+
def weight(self, task_id):
|
| 81 |
+
if task_id in self.retired or task_id in self.quarantined: return 0.0
|
| 82 |
+
p = self.p[task_id]
|
| 83 |
+
return p * (1 - p) + 1e-3 # frontier (max-variance) weighting
|
| 84 |
+
|
| 85 |
+
def sample(self, task_ids, k):
|
| 86 |
+
live = [t for t in task_ids if self.weight(t) > 0]
|
| 87 |
+
w = [self.weight(t) for t in live]
|
| 88 |
+
return random.choices(live, weights=w, k=k) if live else []
|
| 89 |
+
|
| 90 |
+
def median_pass(self, task_ids): # escalation trigger for the generator
|
| 91 |
+
vals = sorted(self.p[t] for t in task_ids if t not in self.retired)
|
| 92 |
+
return vals[len(vals)//2] if vals else 0.0
|
| 93 |
+
```
|
| 94 |
+
|
| 95 |
+
The trainer feeds `update(...)` from each GRPO group; the generator polls `median_pass(...)` and, when it crosses a band, emits a harder batch (more deletion span / more starvation). This is the minimal realization of "select for + create harder tasks dynamically."
|
| 96 |
+
|
| 97 |
+
---
|
| 98 |
+
|
| 99 |
+
## 3. Reward-hacking failure modes & programmatic safeguards `[BLOG-VERIFIED]` problem, `[EXTRAPOLATED]` mitigations
|
| 100 |
+
|
| 101 |
+
The blog (re-pulled verbatim) is the ground truth on the *problem*:
|
| 102 |
+
|
| 103 |
+
> *"One downstream consequence of large scale synthetic task creation is that it can cause unexpected reward hacking… In one example, the model found a **leftover Python type-checking cache and reverse-engineered the format to find a deleted function signature**. In another, it was able to **find and decompile Java bytecode to reconstruct a third-party API**. We were able to find and diagnose these problems using **agentic monitoring tools**, but they demonstrate the increasing care necessary for large scale RL."*
|
| 104 |
+
|
| 105 |
+
The blog gives **no mitigation specifics** beyond "agentic monitoring tools" (confirmed unchanged in `research/09` §4). So the following are our design `[EXTRAPOLATED]`, consistent with mapping row (g) ("Static analysis + bytecode-cache-deletion + a sandboxed shell with no `find`/`strings`/`unzip`").
|
| 106 |
+
|
| 107 |
+
**Root cause:** Feature Deletion deletes *source*, but compilers/type-checkers/build tools leave **shadow copies of the deleted information** elsewhere in the working tree. The agent recovers the answer instead of reconstructing it. Two defense layers:
|
| 108 |
+
|
| 109 |
+
### 3a. Pre-task scrubbing (eliminate the leak at construction time)
|
| 110 |
+
Run after deletion, before the repo is handed to the agent:
|
| 111 |
+
|
| 112 |
+
| Leak source | Scrub action |
|
| 113 |
+
|---|---|
|
| 114 |
+
| Python bytecode | delete all `**/__pycache__/`, `*.pyc`, `*.pyo` |
|
| 115 |
+
| Type-check caches | delete `.mypy_cache/`, `.pyre/`, `.pytype/`, `.dmypy.json`, `.ruff_cache/`, `.pytest_cache/` |
|
| 116 |
+
| Compiled Java/JVM | delete `*.class`, `target/`, `build/`, `*.jar`/`*.war` containing the deleted API; strip bundled deps |
|
| 117 |
+
| Build/dist artifacts | delete `dist/`, `*.egg-info/`, `*.so`, `build/`, `.tox/` |
|
| 118 |
+
| VCS history | run the agent on a **squashed, detached worktree** — no `.git` (else `git log -p` / `git show` recovers the deletion) |
|
| 119 |
+
| Editor/LSP indexes | delete `.idea/`, `.vscode/`, `*.code-workspace`, ctags/`tags`, `.cache/` |
|
| 120 |
+
| Docs/stubs | delete generated `*.pyi` stubs and built HTML/Sphinx docs that embed signatures |
|
| 121 |
+
|
| 122 |
+
```python
|
| 123 |
+
# datagen/scrub.py [EXTRAPOLATED]
|
| 124 |
+
import shutil, pathlib
|
| 125 |
+
LEAK_DIRS = {"__pycache__",".mypy_cache",".pyre",".pytype",".ruff_cache",
|
| 126 |
+
".pytest_cache","target","build","dist",".tox",".idea",".vscode",".git",".cache"}
|
| 127 |
+
LEAK_GLOBS = ["*.pyc","*.pyo","*.class","*.jar","*.war","*.so","*.pyi",
|
| 128 |
+
".dmypy.json","tags","*.egg-info"]
|
| 129 |
+
def scrub(root: str):
|
| 130 |
+
root = pathlib.Path(root)
|
| 131 |
+
for p in root.rglob("*"):
|
| 132 |
+
if p.is_dir() and p.name in LEAK_DIRS:
|
| 133 |
+
shutil.rmtree(p, ignore_errors=True)
|
| 134 |
+
for g in LEAK_GLOBS:
|
| 135 |
+
for p in root.rglob(g):
|
| 136 |
+
(shutil.rmtree(p, ignore_errors=True) if p.is_dir() else p.unlink(missing_ok=True))
|
| 137 |
+
```
|
| 138 |
+
|
| 139 |
+
### 3b. Runtime sandbox lockdown (block recovery if a leak survives)
|
| 140 |
+
- **Tool denylist in the agent's shell harness** (matches mapping row g): no `find`, `strings`, `unzip`, `jar`, `javap`, `unzip`, `objdump`, `grep -r` over non-source dirs, `uncompyle6`/`decompyle3`/`pycdc`, `cfr`/`procyon`/`fernflower` (Java decompilers), `git`. Implement as an allowlisted command shim, not a blocklist (allowlist is the safe default).
|
| 141 |
+
- **Network egress = none** (can't `pip download` the original package to read the API). Already required for determinism.
|
| 142 |
+
- **Read-only mounts for everything except the editable source tree**; site-packages of the *target package itself* removed from the image.
|
| 143 |
+
|
| 144 |
+
### 3c. Post-hoc monitoring ("agentic monitoring tools" analogue) `[EXTRAPOLATED]`
|
| 145 |
+
A cheap programmatic monitor over the trajectory, run *after* a rollout passes, to retro-reject hacks:
|
| 146 |
+
- **AST diff check:** the agent's accepted diff must contain *new function/class bodies* (AST nodes with statements), not just an import that re-exposes a surviving symbol. Reject solutions whose passing is explained purely by `import`/`from … import *` of a non-scrubbed module.
|
| 147 |
+
- **Provenance scan:** flag any trajectory whose tool calls touched `*.class`, `*.pyc`, `.mypy_cache`, `.git`, or invoked a denied binary (defense-in-depth telemetry even with the shim).
|
| 148 |
+
- **Static byte-similarity gate:** if the agent's reconstructed function is a near-exact byte copy of the (held-out) gold body *and* the agent never "wrote" it incrementally (single paste), flag for review — distinguishes reconstruction from recovery.
|
| 149 |
+
- These produce a **reward mask**: `reward = test_pass_fraction × (0 if hack_detected else 1)`. This is the concrete realization of mapping row (g)'s "+ RM-based penalty" without needing a learned RM in v0.1.
|
| 150 |
+
|
| 151 |
+
---
|
| 152 |
+
|
| 153 |
+
## 4. Open-source drop-in substrates
|
| 154 |
+
|
| 155 |
+
Every substrate below ships SWE-bench-shaped tuples `(repo, base_commit, patch=gold, test_patch, FAIL_TO_PASS, PASS_TO_PASS)`. **The Feature-Deletion mapping is identical for all of them: revert `patch` (or AST-delete the functions it touches) to manufacture `broken_repo`; `FAIL_TO_PASS` is the reward target; `PASS_TO_PASS` is the "stay-functional" guard the blog demands.** Licenses verified live 2026-05-28.
|
| 156 |
+
|
| 157 |
+
| Substrate | HF dataset id | Scale | What it provides | License | FD-env mapping |
|
| 158 |
+
|---|---|---|---|---|---|
|
| 159 |
+
| **SWE-bench / Lite / Verified** | `SWE-bench/SWE-bench`, `SWE-bench/SWE-bench_Lite`, `SWE-bench/SWE-bench_Verified` | 2,294 / 534 / 500 | Real GitHub issue→PR tuples, per-version test envs, pre-built Docker images (`xingyaoww/sweb.eval.*`, `swebench/*`) | dataset CC-BY-4.0; **per-repo source licenses vary** (mostly Apache/MIT/BSD) | Lite/Verified are the **v0.0 smoke-test set** (mapping row b: "use SWE-bench-lite only" in v0.0). Revert gold patch → FD task. Already-built images = no env-construction cost. |
|
| 160 |
+
| **SWE-Gym / SWE-Gym-Raw** | `SWE-Gym/SWE-Gym`, `SWE-Gym/SWE-Gym-Raw` | 2,438 (11 repos) / ~tens-of-k raw | Same schema as SWE-bench but **purpose-built for training** (train split, not a held-out benchmark → no contamination worry); pre-built Docker images; verifier-training support. arXiv:2412.21139 (ICML 2025). | check repo (SWE-Gym tooling MIT; **instances inherit upstream repo licenses**) | **Primary v0.1 FD substrate** (mapping row b: "build Feature Deletion"). 2.4k clean training tasks, each invertible into an FD task with `n` difficulty escalations (§2). |
|
| 161 |
+
| **R2E-Gym (V1 / Subset)** | `R2E-Gym/R2E-Gym-V1`, `R2E-Gym/R2E-Gym-Subset`, `R2E-Gym/SWE-Bench-Lite` | **8.1K** executable envs (13 repos); Subset = non-overlapping w/ SWE-bench | **SWE-GEN engine**: procedurally generates executable envs *directly from commits* w/o human issues, + execution-assisted back-translation for problem statements + **pre-built Docker images**. arXiv (R2E-Gym, Jain et al. 2025). | check repo (Apache-2.0 tooling typical; per-instance upstream licenses) | Best **scale** substrate and the closest open analogue to Composer's "grounded in real codebases" generator. Its commit-diffs *are* feature-deletion candidates by construction (the commit added a feature; revert = delete it). |
|
| 162 |
+
| **SWE-rebench** | `nebius/SWE-rebench` (+ `nebius/SWE-bench-extra`, `nebius/SWE-agent-trajectories`) | **21,336** tasks, 3,468 repos | Fully-automated mining pipeline; ships `install_config`, `requirements`, `environment`, `docker_image`, and **LLM-scored difficulty + clarity annotations** per task; `FAIL_TO_PASS`/`PASS_TO_PASS`/`FAIL_TO_FAIL`/`PASS_TO_FAIL`. arXiv:2505.20411 (NeurIPS 2025). | **dataset CC-BY-4.0**; per-instance `license_name` field provided (56 distinct) — *honor it per instance* | **Largest + the only one with built-in difficulty scores** → seeds the §2 curriculum's cold-start `p̂(t)` prior before any rollouts exist. The per-instance `docker_image` + `install_config` removes the 200-hr env-build bottleneck SWE-Gym reported. |
|
| 163 |
+
| **OpenHands trajectories** (via Nemotron-SWE-v1) | `nvidia/Nemotron-SWE-v1` | 59K agent trajectories | OpenHands-framework SWE trajectories (Qwen3-Coder-480B teacher), issues sourced from SWE-Gym + R2E-Gym-Subset | **CC-BY-4.0** (subsets BSD-3 / Apache-2.0 / MIT per viewer) — "ready for commercial use" | Not an FD-env itself — it's **SFT/distill warm-start + a source of gold trajectories** for the §3c monitor's "what legitimate reconstruction looks like" reference, and for `research/05` trace-replay. Use as cold-start, not as the RL env. |
|
| 164 |
+
|
| 165 |
+
**Practical selection:** v0.0 = SWE-bench-Lite (smoke). v0.1 = SWE-Gym (clean train) + SWE-rebench (scale + difficulty prior). R2E-Gym when we need to push past ~25k tasks toward the "25×" spirit. Nemotron/OpenHands trajectories = SFT warm-start + monitor reference, not the RL env. **License rule baked into the loader:** carry each instance's upstream repo license; filter out copyleft (GPL/AGPL) repos for any artifact we redistribute (we redistribute *deletions/diffs*, which are derivative works).
|
| 166 |
+
|
| 167 |
+
---
|
| 168 |
+
|
| 169 |
+
## 5. Deletion mechanics: producing the `(broken_repo, test_command, golden_diff)` tuple
|
| 170 |
+
|
| 171 |
+
Two construction paths; we implement both and let the curriculum pick granularity (§2).
|
| 172 |
+
|
| 173 |
+
### Path A — Gold-patch reversion (cheap, the default for SWE-* substrates) `[INFERRED]`
|
| 174 |
+
The substrate already tells us *exactly* which lines implement a testable feature: the gold `patch`. So:
|
| 175 |
+
1. `git apply patch` onto `base_commit` → **functional repo, all tests pass** (this is the substrate's "solved" state).
|
| 176 |
+
2. `golden_diff := patch` (what the agent must re-derive); `broken_repo := apply(reverse(patch))` → the feature is gone.
|
| 177 |
+
3. `FAIL_TO_PASS` tests now fail (target); `PASS_TO_PASS` tests still pass (the "remains functional" guard — verify this, see §5c).
|
| 178 |
+
4. **Scrub** (§3a), strip `.git`, freeze image.
|
| 179 |
+
|
| 180 |
+
### Path B — Coverage-mapped AST deletion (true synthetic generation, no human PR needed) `[EXTRAPOLATED]`
|
| 181 |
+
This is the path that generalizes beyond mined PRs and lets us "create harder tasks" at will (R2E-Gym-style):
|
| 182 |
+
1. **Run the suite with coverage** (`coverage.py` / `pytest --cov`) on the passing repo to get a `test → {file:line-ranges}` map.
|
| 183 |
+
2. **Pick a deletion target** by granularity knob:
|
| 184 |
+
- *function-level:* parse with `ast`/`libcst`, choose a `FunctionDef`/`AsyncFunctionDef`/`ClassDef` whose lines are covered by ≥1 test and that has high *test selectivity* (covered by few `PASS_TO_PASS` so the repo stays functional after removal).
|
| 185 |
+
- *file-level:* a module imported by exactly the target tests.
|
| 186 |
+
- *feature-level:* the transitive closure of a public symbol via an import/def graph (`grimp`/`pydeps`), bounded so unrelated tests survive.
|
| 187 |
+
3. **Delete** via CST (replace body with `raise NotImplementedError` *or* remove the node and its now-dead imports). CST (`libcst`) preserves formatting and lets us re-insert a stub signature or not (the §2 "hint starvation" knob).
|
| 188 |
+
4. **`golden_diff` = the removed nodes** (held out for the monitor; never shown to the agent).
|
| 189 |
+
|
| 190 |
+
### 5c. Guaranteeing the remaining tests exercise the deleted code (the blog's hard constraint)
|
| 191 |
+
The blog requires *"the codebase remains functional while specific testable features are removed."* Enforce as a **construction-time validation gate** — a task is only emitted if all four hold:
|
| 192 |
+
|
| 193 |
+
```python
|
| 194 |
+
# datagen/build_task.py [EXTRAPOLATED] (pseudocode over a sandboxed runner)
|
| 195 |
+
def validate_task(repo_passing, broken_repo, target_tests, keep_tests, run):
|
| 196 |
+
# 1. baseline sanity: full suite passes on the unbroken repo
|
| 197 |
+
assert run(repo_passing, target_tests + keep_tests).all_pass
|
| 198 |
+
res = run(broken_repo, target_tests + keep_tests)
|
| 199 |
+
# 2. deletion actually breaks the target feature (tests now FAIL)
|
| 200 |
+
assert all(res.failed(t) for t in target_tests) # FAIL_TO_PASS non-empty & failing
|
| 201 |
+
# 3. deletion left the rest functional (collection works, neighbors pass)
|
| 202 |
+
assert res.collected_ok and all(res.passed(t) for t in keep_tests) # PASS_TO_PASS guard
|
| 203 |
+
# 4. solvability: gold diff restores green (the task is actually achievable)
|
| 204 |
+
assert run(apply(broken_repo, golden_diff), target_tests + keep_tests).all_pass
|
| 205 |
+
return Task(...) # else discard
|
| 206 |
+
```
|
| 207 |
+
|
| 208 |
+
Gate (4) is what prevents the §2 "impossible task" quarantine pile-up — every emitted task is provably solvable by `golden_diff`. Gate (3) is the literal encoding of "remains functional." **Task tuple emitted:**
|
| 209 |
+
|
| 210 |
+
```python
|
| 211 |
+
# datagen/schema.py [EXTRAPOLATED]
|
| 212 |
+
from dataclasses import dataclass, field
|
| 213 |
+
@dataclass(frozen=True)
|
| 214 |
+
class FeatureDeletionTask:
|
| 215 |
+
task_id: str
|
| 216 |
+
repo: str # e.g. "getmoto/moto"
|
| 217 |
+
base_commit: str
|
| 218 |
+
broken_image: str # docker tag of the scrubbed broken repo (frozen env)
|
| 219 |
+
test_command: str # e.g. "python -m pytest -q"
|
| 220 |
+
fail_to_pass: tuple[str, ...] # reward target (must go red→green)
|
| 221 |
+
pass_to_pass: tuple[str, ...] # functional guard (must stay green)
|
| 222 |
+
golden_diff: str = field(repr=False) # HELD OUT — monitor/solvability only, never in obs
|
| 223 |
+
granularity: str = "function" # function|file|feature (curriculum escalation)
|
| 224 |
+
deleted_symbols: tuple[str, ...] = () # for AST-provenance monitor (§3c)
|
| 225 |
+
upstream_license: str = "unknown" # carried from substrate; gates redistribution
|
| 226 |
+
difficulty_prior: float = 0.5 # seeded from SWE-rebench LLM score if available
|
| 227 |
+
```
|
| 228 |
+
|
| 229 |
+
---
|
| 230 |
+
|
| 231 |
+
## 6. `FeatureDeletionEnv` — OpenEnv/Gym-style design for TRL `GRPOTrainer` + verifiers
|
| 232 |
+
|
| 233 |
+
**Integration contract.** TRL's `GRPOTrainer` takes a dataset of prompts and one or more **reward functions** with the calling convention `reward_fn(prompts: list[str], completions: list[str], **kwargs) -> list[float]` (the dataset's non-prompt columns are passed through as `**kwargs`; confirmed against the [TRL `grpo_trainer.py`](https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_trainer.py) source and the `RewardFunc` type-alias fix in TRL PR #5246). So the env exposes **two faces**: a Gym/OpenEnv face (`reset`/`step` for multi-turn agentic rollout via OpenEnv, mapping row c) and a **`reward_fn` adapter** that GRPO calls directly. Reward = **test pass fraction** (`|FAIL_TO_PASS passing| / |FAIL_TO_PASS|`), naturally graded for multi-feature tasks, masked by the hack monitor (§3c).
|
| 234 |
+
|
| 235 |
+
```python
|
| 236 |
+
# envs/feature_deletion_env.py [EXTRAPOLATED]
|
| 237 |
+
# Gym/OpenEnv-style env + a TRL GRPO reward adapter. Execution happens in the
|
| 238 |
+
# locked-down sandbox of §3b; this class is the orchestration shell.
|
| 239 |
+
from dataclasses import dataclass
|
| 240 |
+
from datagen.schema import FeatureDeletionTask
|
| 241 |
+
|
| 242 |
+
@dataclass
|
| 243 |
+
class StepResult:
|
| 244 |
+
observation: str # tool output / test stdout shown to the agent
|
| 245 |
+
reward: float # only nonzero on a terminal "submit" step
|
| 246 |
+
done: bool
|
| 247 |
+
info: dict
|
| 248 |
+
|
| 249 |
+
class FeatureDeletionEnv:
|
| 250 |
+
"""One task per episode. Sandbox = allowlisted shell, no net, scrubbed tree (§3)."""
|
| 251 |
+
def __init__(self, sandbox, monitor, max_turns: int = 40):
|
| 252 |
+
self.sandbox, self.monitor, self.max_turns = sandbox, monitor, max_turns
|
| 253 |
+
self.task: FeatureDeletionTask | None = None
|
| 254 |
+
|
| 255 |
+
# ---- Gym/OpenEnv face (multi-turn agentic rollout) ----
|
| 256 |
+
def reset(self, task: FeatureDeletionTask) -> str:
|
| 257 |
+
self.task, self.turns = task, 0
|
| 258 |
+
self.sandbox.boot(task.broken_image) # read-only except editable src; egress off
|
| 259 |
+
# NOTE: golden_diff / deleted_symbols are NEVER placed in the observation.
|
| 260 |
+
return self._render_prompt(task) # task desc + failing-test names + tool list
|
| 261 |
+
|
| 262 |
+
def step(self, action: dict) -> StepResult:
|
| 263 |
+
self.turns += 1
|
| 264 |
+
if action["type"] == "submit" or self.turns >= self.max_turns:
|
| 265 |
+
return self._grade()
|
| 266 |
+
obs = self.sandbox.exec(action) # edit / run-tests / read-file (allowlisted)
|
| 267 |
+
return StepResult(obs, 0.0, False, {"turn": self.turns})
|
| 268 |
+
|
| 269 |
+
def _grade(self) -> StepResult:
|
| 270 |
+
r = self.sandbox.run_tests(self.task.test_command,
|
| 271 |
+
self.task.fail_to_pass + self.task.pass_to_pass)
|
| 272 |
+
frac = r.n_pass(self.task.fail_to_pass) / max(len(self.task.fail_to_pass), 1)
|
| 273 |
+
guard_ok = r.all_pass(self.task.pass_to_pass) # "remains functional"
|
| 274 |
+
hacked = self.monitor.flag(self.sandbox.trajectory(), # AST + provenance (§3c)
|
| 275 |
+
self.task.deleted_symbols)
|
| 276 |
+
reward = frac * (1.0 if guard_ok and not hacked else 0.0)
|
| 277 |
+
return StepResult(r.stdout, reward, True,
|
| 278 |
+
{"frac": frac, "guard_ok": guard_ok, "hacked": hacked})
|
| 279 |
+
|
| 280 |
+
# ---- TRL GRPOTrainer face (reward_fn(prompts, completions, **kwargs)->list[float]) ----
|
| 281 |
+
def reward_fn(self, prompts, completions, *, task_id=None, **kwargs):
|
| 282 |
+
rewards = []
|
| 283 |
+
for comp, tid in zip(completions, task_id): # task_id passed via dataset column
|
| 284 |
+
task = self.registry[tid]
|
| 285 |
+
self.reset(task)
|
| 286 |
+
res = self._run_completion(comp) # replay agent turns from `comp`
|
| 287 |
+
rewards.append(res.reward)
|
| 288 |
+
self.curriculum.update(tid, n_pass=int(res.reward > 0), n_total=1) # §2 feedback
|
| 289 |
+
return rewards
|
| 290 |
+
```
|
| 291 |
+
|
| 292 |
+
**Wiring to GRPO (the dataset carries `task_id`; curriculum reweights the sampler):**
|
| 293 |
+
```python
|
| 294 |
+
# train/grpo_fd.py [EXTRAPOLATED]
|
| 295 |
+
from trl import GRPOTrainer, GRPOConfig
|
| 296 |
+
env = FeatureDeletionEnv(sandbox=LockedSandbox(...), monitor=HackMonitor(...))
|
| 297 |
+
ds = build_prompt_dataset(tasks) # columns: prompt, task_id (+ curriculum weights)
|
| 298 |
+
trainer = GRPOTrainer(
|
| 299 |
+
model="Qwen/Qwen3-Coder-7B", # v0.0 base (mapping row a)
|
| 300 |
+
args=GRPOConfig(num_generations=8, ...), # G=8 → 8 pass/fail obs per task per step → §2
|
| 301 |
+
reward_funcs=[env.reward_fn], # reward = masked test pass-fraction
|
| 302 |
+
train_dataset=ds,
|
| 303 |
+
)
|
| 304 |
+
trainer.train()
|
| 305 |
+
```
|
| 306 |
+
This slots into the same RLVR base as rows (c)/(d); the **SDPO hint-distill channel (row d, `research/05`) is orthogonal** and stacks on top — Feature Deletion supplies the *verifiable scalar reward* that SDPO's KL rides on. The `verifiers` library can wrap `reward_fn` for env composition if we run multiple generators.
|
| 307 |
+
|
| 308 |
+
---
|
| 309 |
+
|
| 310 |
+
## 7. Cost & feasibility at scale (CPU pools)
|
| 311 |
+
|
| 312 |
+
Feature-Deletion is **embarrassingly parallel and CPU-bound** — no GPU in the data-gen path (matches mapping §"Synthetic data: Generators run on CPU pool… Embarrassingly parallel"). Two cost buckets:
|
| 313 |
+
|
| 314 |
+
**(A) Task construction (one-time per task).** `[EXTRAPOLATED]` estimates:
|
| 315 |
+
- Path A (gold-patch revert) over a pre-built substrate image: `git apply -R` + scrub + one validation suite run. Dominated by the test run: **~30 s–5 min CPU** per task depending on suite size. Validation gate (§5c) needs ~2 suite runs (broken + gold-restored) → call it **~2–10 min CPU/task**.
|
| 316 |
+
- Path B (AST deletion): + a coverage run (~1.5–3× a normal suite run) + AST/CST manipulation (<1 s). **~5–20 min CPU/task.**
|
| 317 |
+
- **Throughput:** a 64-vCPU pool at ~8 min/task and 8 concurrent runners ≈ **~60 tasks/hr/node** → ~1,400 tasks/day/node. Inverting all 21k SWE-rebench instances ≈ **~15 node-days** on one 64-vCPU box, trivially parallel across nodes. Reaching a "25×-spirit" pool of ~50k–60k tasks (R2E-Gym 8.1k + SWE-rebench 21k + multi-feature/granularity escalations) is **<1 week on a modest CPU pool**.
|
| 318 |
+
- **Storage/images:** reuse substrate Docker images (SWE-Gym `xingyaoww/sweb.eval.*`, SWE-rebench per-instance `docker_image`) → **near-zero env-build cost**, sidestepping the "200 hours of manual env setup" bottleneck SWE-Gym reported. We only add a thin scrubbed overlay layer per task (~MBs).
|
| 319 |
+
|
| 320 |
+
**(B) Reward evaluation (recurring, in the RL loop).** This is the real running cost, not construction: each GRPO step runs `G` rollouts × (agent turns + final test run). Test execution is CPU; agent generation is the GPU/inference cost shared with rows (c)/(d). Levers: cache the broken image warm, run only `FAIL_TO_PASS + PASS_TO_PASS` (not the full suite), and retire solved tasks via §2 so we stop paying for tasks the model already aced ("begins to get most training problems correct").
|
| 321 |
+
|
| 322 |
+
**Feasibility verdict:** **Green.** Construction is cheap and one-time; the curriculum keeps the live pool small; the only nontrivial recurring cost (sandboxed test execution) is shared with any RLVR coding env we'd build anyway. The binding constraints are *engineering* (sandbox lockdown §3, validation gate §5c) and *licensing hygiene* (§4), not compute.
|
| 323 |
+
|
| 324 |
+
---
|
| 325 |
+
|
| 326 |
+
## 8. Open questions / reproducibility gaps (carried from blog silence)
|
| 327 |
+
|
| 328 |
+
1. **Deletion-target selection heuristic** — blog silent (`research/09` §1 "NO CHANGE"). We propose coverage-selectivity (§5 Path B); Cursor's actual heuristic is unknown.
|
| 329 |
+
2. **Deleter model vs. program** — blog implies an agent deletes ("asked to delete code… such that the codebase remains functional"); we default to *programmatic* deletion (cheaper, deterministic, no second model). An LLM-deleter is a v0.2 escalation.
|
| 330 |
+
3. **The other ~24 generators** — Feature Deletion is "one synthetic approach… a range of approaches"; the rest are unnamed. Out of scope here; this brief delivers the one named generator.
|
| 331 |
+
4. **"Agentic monitoring tools" internals** — unspecified; our §3c monitor is a best-effort programmatic stand-in.
|
| 332 |
+
5. **Composer2.pdf (arXiv:2603.24477)** — flagged by `research/09` action-item #1 as the likely home of data-mix % and generator inventory; **not yet extracted**. Recommend a follow-up pull before scaling the generator suite.
|
| 333 |
+
|
| 334 |
+
---
|
| 335 |
+
|
| 336 |
+
## Sources
|
| 337 |
+
|
| 338 |
+
- **Cursor blog** — *Introducing Composer 2.5*, [cursor.com/blog/composer-2-5](https://cursor.com/blog/composer-2-5) (re-extracted 2026-05-28; §1/§3 quotes verbatim).
|
| 339 |
+
- **Composer 2 technical report** — [arXiv:2603.24477](https://arxiv.org/abs/2603.24477) / [cursor.com/resources/Composer2.pdf](https://cursor.com/resources/Composer2.pdf) (unread; flagged in `research/09`).
|
| 340 |
+
- **SWE-bench** — datasets guide [swebench.com/SWE-bench/guides/datasets](https://www.swebench.com/SWE-bench/guides/datasets); HF `SWE-bench/SWE-bench`, `SWE-bench/SWE-bench_Lite`, `SWE-bench/SWE-bench_Verified`.
|
| 341 |
+
- **SWE-Gym** — *Training Software Engineering Agents and Verifiers with SWE-Gym*, [arXiv:2412.21139](https://arxiv.org/abs/2412.21139) (ICML 2025); HF [`SWE-Gym/SWE-Gym`](https://huggingface.co/datasets/SWE-Gym/SWE-Gym) (2,438 inst, 11 repos), `SWE-Gym/SWE-Gym-Raw`; [github.com/SWE-Gym/SWE-Gym](https://github.com/SWE-Gym/SWE-Gym).
|
| 342 |
+
- **R2E-Gym** — *Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents* (Jain et al. 2025); [r2e-gym.github.io](https://r2e-gym.github.io); HF org [huggingface.co/R2E-Gym](https://huggingface.co/R2E-Gym) (`R2E-Gym-V1`, `R2E-Gym-Subset`, `SWE-Bench-Lite`); 8.1K executable envs, 13 repos.
|
| 343 |
+
- **SWE-rebench** — *An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents*, [arXiv:2505.20411](https://arxiv.org/pdf/2505.20411) (NeurIPS 2025); HF [`nebius/SWE-rebench`](https://huggingface.co/datasets/nebius/SWE-rebench) (21,336 tasks, 3,468 repos, CC-BY-4.0 + per-instance `license_name`), `nebius/SWE-bench-extra`, `nebius/SWE-agent-trajectories`.
|
| 344 |
+
- **OpenHands trajectories** — HF [`nvidia/Nemotron-SWE-v1`](https://huggingface.co/datasets/nvidia/Nemotron-SWE-v1) (59K OpenHands trajectories, CC-BY-4.0; issues from SWE-Gym + R2E-Gym-Subset).
|
| 345 |
+
- **TRL `GRPOTrainer`** — reward-fn convention `reward_fn(prompts, completions, **kwargs)->list[float]`, [trl/trainer/grpo_trainer.py](https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_trainer.py), [`RewardFunc` alias PR #5246](https://github.com/huggingface/trl/pull/5246), [GRPO docs](https://huggingface.co/docs/trl/main/en/grpo_trainer).
|
| 346 |
+
- **Internal:** `docs/COMPOSER_RECIPE_MAPPING.md` (§2, rows b/g), `research/09-composer-blog-delta-2026.md` (online-curriculum delta), `research/05-trace-replay-distillation.md` (orthogonal distill channel).
|
|
@@ -0,0 +1,366 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# SDPO Hint Generation: How to Build the Teacher's "Privileged Info" for Composer's Targeted RL with Textual Feedback
|
| 2 |
+
|
| 3 |
+
> **Research date:** 2026-05-28.
|
| 4 |
+
> **Scope:** Resolves the **#1 open replication question** flagged in `docs/COMPOSER_RECIPE_MAPPING.md` §1 and `research/09-composer-blog-delta-2026.md` §2: *how are the hints generated?* This doc maps OPSD/SDPO's "privileged information" onto Composer's "hint," builds a cheapest→richest **taxonomy of hint sources**, ships a **concrete template library with actual strings**, specifies the **LLM-judge fallback prompt**, aligns **error-site detection** with `ingestion/trace_examples.py`, and proposes a **layered `HintGenerator` design** that slots into the existing `CollatorConfig.hint_generator` hook.
|
| 5 |
+
> **Method:** Primary-source pulls (Tavily advanced) of the SDPO abstract + method/ablation HTML ([arXiv:2601.20802v2](https://arxiv.org/abs/2601.20802)), the OPSD method HTML + GitHub README ([arXiv:2601.18734v3](https://arxiv.org/abs/2601.18734), [siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD)), audited against the current `hint_generator.py`, `trainer/data_collator.py`, and `ingestion/trace_examples.py`.
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
## TL;DR
|
| 10 |
+
|
| 11 |
+
The hint **is** the teacher's privileged-information conditioning variable. Cursor never says how hints are generated, but the two cited papers bracket the answer precisely:
|
| 12 |
+
|
| 13 |
+
- **OPSD** conditions the teacher on `y⋆` = **ground-truth answer / reference CoT** — the strongest, most "privileged" signal, available only when you hold the solution.
|
| 14 |
+
- **SDPO** generalizes this to **environment feedback** that you *already have for free* at training time, and crucially ablates **three feedback types**: (1) a **successful sibling rollout** ("sample solution"), (2) the **environment output** (runtime errors / judge text), and (3) the **student's own original attempt**. The teacher is the *same weights* conditioned on that feedback; the student is the same weights without it; loss is a per-token KL on the student's trajectory, gradient through the student only, teacher stop-grad.
|
| 15 |
+
|
| 16 |
+
Composer's "hint" is therefore **not one thing** — it is *whatever cheap, locally-available text shifts the teacher distribution toward the correct continuation*. That reframing makes the open question tractable: build a **layered generator** that tries the cheapest source first and escalates only on miss:
|
| 17 |
+
|
| 18 |
+
```
|
| 19 |
+
template-by-error-kind → raw-tool-error-as-hint → LLM-judge hint → SDPO successful-sibling bootstrap
|
| 20 |
+
(free, deterministic) (free, structural) (~$0.0005/site) (free, needs a rollout group)
|
| 21 |
+
```
|
| 22 |
+
|
| 23 |
+
The current `hint_generator.py` implements **only the first layer** (5 templates) and is the right v0.1 starting point. This doc specifies layers 2–4 and a clean `HintGenerator` Protocol so they compose behind the existing `Callable[[str, dict], str | None]` hook with **zero collator changes**.
|
| 24 |
+
|
| 25 |
+
---
|
| 26 |
+
|
| 27 |
+
## 1. How OPSD & SDPO obtain the teacher's "privileged info" → Composer's "hint"
|
| 28 |
+
|
| 29 |
+
Both methods build teacher and student from **a single LLM** and differ only in *what extra text the teacher gets to see*. That extra text is the privileged-information variable. Composer's "hint inserted into the local context" is exactly this variable.
|
| 30 |
+
|
| 31 |
+
### 1.1 OPSD — privileged info = ground-truth answer
|
| 32 |
+
|
| 33 |
+
> *"The teacher policy is provided with privileged information `y⋆`, such as the **ground-truth answer or a reference chain-of-thought**, while the student policy conditions only on the problem `x`. … the teacher policy `p_T(·|x, y⋆)` conditions on both the problem and the privileged answer, whereas the student policy `p_S(·|x)` observes only the problem. We preserve the on-policy training paradigm by sampling trajectories `ŷ` exclusively from the student policy, which then receives dense, token-level supervision from the privileged teacher policy."* — OPSD, [arXiv:2601.18734v3](https://arxiv.org/html/2601.18734v3)
|
| 34 |
+
|
| 35 |
+
OPSD loss (Eq. 8, verbatim structure):
|
| 36 |
+
|
| 37 |
+
```
|
| 38 |
+
L(θ) = E_{(x, y⋆)~S} [ E_{ŷ~p_S(·|x)} [ D( p_T ‖ p_S )(ŷ | x) ] ]
|
| 39 |
+
```
|
| 40 |
+
|
| 41 |
+
> *"Gradients are backpropagated only through the student policy `p_S`, while the teacher `p_T` acts as a fixed full-distribution target conditioned on privileged information `(x, y⋆)`."*
|
| 42 |
+
|
| 43 |
+
**Map to Composer:** `y⋆` ≙ the **hint**. In OPSD the hint is maximally strong (the answer itself). In a coding agent you rarely have the answer at an arbitrary turn — so the OPSD form is the *upper bound* of hint strength, usable only for the subset of error sites where a reference exists (e.g. the deleted code in a Feature-Deletion task, or a known-good tool signature).
|
| 44 |
+
|
| 45 |
+
Two OPSD implementation handles that transfer directly (from the [GitHub README](https://github.com/siyan-zhao/OPSD)):
|
| 46 |
+
- **`--reason_first`**: *"Prepend an explicit rationalization to the teacher context before distillation."* This is OPSD's own knob for **same-model introspection** (taxonomy class (d) below) — the teacher is first asked to rationalize *why* the privileged info implies the answer, then distilled. Evidence the introspection-hint path is real and works.
|
| 47 |
+
- **`--jsd_token_clip`** (default `0.05`): *"Clip the JSD loss for each token … This can improve stability by preventing **stylistic tokens from dominating** the training signal."* Directly relevant to Composer's **style/communication** behavior targets — without clipping, distilling a style hint can be dominated by a few high-divergence stylistic tokens. Our collator's `sdpo_loss_mask` already isolates post-hint tokens; token-clipping is the complementary per-token stabilizer.
|
| 48 |
+
|
| 49 |
+
### 1.2 SDPO — privileged info = environment feedback (three sources, ablated)
|
| 50 |
+
|
| 51 |
+
> *"SDPO treats the current model **conditioned on feedback** as a self-teacher and distills its feedback-informed next-token predictions back into the policy. … SDPO leverages the model's ability to **retrospectively identify its own mistakes in-context**."* — SDPO abstract, [arXiv:2601.20802v2](https://arxiv.org/abs/2601.20802)
|
| 52 |
+
|
| 53 |
+
SDPO explicitly ablates **three feedback types present in a verifiable coding environment** ([SDPO method HTML](https://arxiv.org/html/2601.20802v2)):
|
| 54 |
+
|
| 55 |
+
> *"we ablate the three types of feedback present in a verifiable environment like code generation: **the sample solution** (if a successful rollout is available in the current rollout group), **the environment output** (such as runtime errors), and **the student's original attempt**."*
|
| 56 |
+
|
| 57 |
+
This is the load-bearing finding for our taxonomy. Each maps to a distinct hint source:
|
| 58 |
+
|
| 59 |
+
| SDPO feedback type | What it is | Composer "hint" equivalent | Taxonomy class (§2) |
|
| 60 |
+
|---|---|---|---|
|
| 61 |
+
| **Sample solution** | A *successful sibling rollout* from the same prompt's rollout group | Bootstrap hint: "Here is a working approach: …" | **(f)** SDPO successful-sibling bootstrap |
|
| 62 |
+
| **Environment output** | Runtime error / judge text returned by the env | Raw tool-error text spliced as the hint | **(b)** raw-tool-error-as-hint |
|
| 63 |
+
| **Student's original attempt** | The model's own failed text, re-shown | Self-introspection prompt | **(d)** same-model introspection |
|
| 64 |
+
|
| 65 |
+
The key SDPO lever for the **hint-absent case** (called out in `09-composer-blog-delta-2026.md` §3 action item 3):
|
| 66 |
+
|
| 67 |
+
> *"SDPO also outperforms baselines in standard RLVR environments that only return scalar feedback by **using successful rollouts as implicit feedback for failed attempts**."*
|
| 68 |
+
|
| 69 |
+
i.e. when there is **no external hint source**, you can still manufacture privileged info by letting the teacher condition on a *sibling rollout that passed*. This is free (you already paid for the rollout group under GRPO) and is the natural last fallback before giving up on an error site.
|
| 70 |
+
|
| 71 |
+
### 1.3 The exact mechanism nuance to preserve (from the Cursor blog, via delta doc §2)
|
| 72 |
+
|
| 73 |
+
> *"This hint **changes the probabilities for the teacher, lowering those for the wrong tool and increasing those for a valid replacement**. For that turn only, we then **update the student weights towards the new probabilities**."*
|
| 74 |
+
|
| 75 |
+
Two facts the hint generator must respect:
|
| 76 |
+
1. **Teacher = hint-conditioned forward pass of the same weights** (not a re-rollout, not a separate model). The generator's job is only to *produce the text spliced into the teacher context* — the collator (`_build_hint_injected_trace`) already does the splicing, and the trainer does the forward pass.
|
| 77 |
+
2. **Student weights are trainable; teacher is stop-grad.** The generator never touches the loss; it only conditions the teacher. So **a wrong hint is bounded-bad** — it produces a noisier teacher target at one masked turn, not a corrupted reward. This is why we can afford cheap/heuristic hints and only escalate on miss.
|
| 78 |
+
|
| 79 |
+
---
|
| 80 |
+
|
| 81 |
+
## 2. Taxonomy of hint sources — cheapest → richest
|
| 82 |
+
|
| 83 |
+
For each class: applicability, cost, and which **Composer behavior class** it covers. Composer's three stated behavior targets are **tool use, coding style, and model communication** (`09-composer-blog-delta-2026.md` §2), plus **effort calibration** (blog §"behavioral aspects"). Tool errors are the cheap, structural case; style/communication/effort are the hard cases templates can't reach.
|
| 84 |
+
|
| 85 |
+
| # | Hint source | How obtained | Cost / latency | Determinism | Tool err | Style | Comms | Effort |
|
| 86 |
+
|---|---|---|---|---|:--:|:--:|:--:|:--:|
|
| 87 |
+
| **(a)** | **Hardcoded template by error_kind** | Pattern-match `error_kind`, fill slots (`available_tools`, `tool_schema`) | **Free**, ~µs | Fully deterministic | ✅ strong | ⚠️ rigid | ⚠️ rigid | ❌ |
|
| 88 |
+
| **(b)** | **Raw tool-error text as hint** | Pass the env's error string through (optionally truncated) | **Free**, ~µs | Deterministic | ✅ strong | ❌ | ❌ | ❌ |
|
| 89 |
+
| **(c)** | **LLM-judge natural-language hint** | Call a cheap judge model with `(state, erroring_action, tool_output)` | ~$0.0003–0.001/site, ~0.5–2 s | Stochastic | ✅ | ✅ | ✅ | ✅ |
|
| 90 |
+
| **(d)** | **Same-model introspection** | Re-prompt the *training model* to critique its own failed turn (OPSD `--reason_first`) | **Free GPU** (1 extra gen), ~0.3–1 s | Stochastic | ✅ | ✅ | ✅ | ✅ |
|
| 91 |
+
| **(e)** | **Learned hint generator** | A small fine-tuned model trained to emit hints (defer to v0.2+) | Train-time cost + inference | Stochastic | ✅ | ✅ | ✅ | ✅ |
|
| 92 |
+
| **(f)** | **SDPO successful-sibling bootstrap** | Pick a *passing* rollout from the same prompt's GRPO group; condition teacher on it | **Free** (reuses rollout group), ~µs to select | Deterministic given group | ✅ | ✅ | ✅ | ✅ (shows a *terser* success) |
|
| 93 |
+
|
| 94 |
+
**Reading of the table:**
|
| 95 |
+
- **(a)+(b) cover the tool-use behavior class almost entirely** and are free + deterministic → make them the default first layer. This is the "easy case" the mapping doc warns about (`COMPOSER_RECIPE_MAPPING.md` §"Why deferring … is the right call", point 2): they *do not* validate the harder behavior cases.
|
| 96 |
+
- **Style / communication / effort-calibration are NOT pattern-matchable.** "This explanation was wasteful" or "this code violates house style" requires class **(c)**, **(d)**, or **(f)**. This is the real content of the open question.
|
| 97 |
+
- **(f) is the unique unlock** when no external hint source exists *and* you don't want an API call: it manufactures privileged info from the model's own successes. It is the natural fallback and also the cheapest source for style/comms/effort because a *successful sibling* implicitly demonstrates the desired style/terseness without anyone writing a rule.
|
| 98 |
+
- **(e) learned generator** is explicitly v0.2 (`COMPOSER_RECIPE_MAPPING.md` table row (d): "+ learned hint generator"). Out of scope to build now; the Protocol below makes it a drop-in later.
|
| 99 |
+
|
| 100 |
+
**Recommended escalation order (rationale):** deterministic-and-free before stochastic, structural before semantic, no-API before API. → `(a) → (b) → [(c) xor (d)] → (f)` with `(f)` as the "nothing else fired but we have a passing sibling" backstop.
|
| 101 |
+
|
| 102 |
+
---
|
| 103 |
+
|
| 104 |
+
## 3. Concrete template library (actual strings)
|
| 105 |
+
|
| 106 |
+
This **extends** the current `hint_generator.py` registry (which already ships `tool_not_found`, `json_decode`, `type_error`, `runtime_error`, `repeated_failure`). New/expanded templates below are written to the **same `HintContext` TypedDict** and same `dispatch(error_kind, ctx)` contract, so they register without touching the collator. All keep the blog's *"Reminder: …"* register (the one verbatim example Cursor published was `"Reminder: Available tools are…"`).
|
| 107 |
+
|
| 108 |
+
| error_kind | Trigger | Hint string (template) |
|
| 109 |
+
|---|---|---|
|
| 110 |
+
| `tool_not_found` | invalid tool name | `Reminder: Available tools are: {tool_list}. The tool you called does not exist — use one of these.` |
|
| 111 |
+
| `malformed_args` / `json_decode` | unparseable tool args / JSON | `Reminder: tool arguments must be a single valid JSON object. Common mistakes: single quotes (use double quotes), trailing commas, unescaped newlines inside strings, or wrapping the JSON in markdown fences.` |
|
| 112 |
+
| `schema_mismatch` / `type_error` | args parse but violate schema | `Reminder: \`{tool_name}\` expects arguments matching this schema:\n {tool_schema}\nYour call is missing/mistyped: {bad_fields}. Re-issue with arguments matching the schema.` |
|
| 113 |
+
| `failing_test` | test suite returns non-zero / assertion | `Reminder: the test \`{test_name}\` is still failing: {assertion_excerpt}. Re-read the failing test's expectations and adjust the implementation to satisfy them — do not modify the test.` |
|
| 114 |
+
| `lint_style` | linter/formatter non-zero exit | `Reminder: this change violates the project style ({linter}: {rule_id} — {rule_msg}). Match the surrounding code's conventions (imports, naming, formatting) before proceeding.` |
|
| 115 |
+
| `wasteful_action` | redundant/no-op action (effort calibration) | `Reminder: this step repeated work already done (you already {prior_action}). Skip redundant reads/searches and act on what you know; prefer the most direct path to the goal.` |
|
| 116 |
+
| `repeated_failure` | same error_kind ≥3× consecutively | `Reminder: this approach has failed {n} times. Step back and try a different strategy: read more of the surrounding code, search for an existing working example, or decompose the task differently.` |
|
| 117 |
+
| `verbose_communication` | judge-flagged over-long message (comms) | `Reminder: keep the response concise and focused on the user's request. State what you did and why in 1–2 sentences; omit restating the task and step-by-step narration.` |
|
| 118 |
+
|
| 119 |
+
Notes:
|
| 120 |
+
- `{tool_list}`, `{tool_schema}`, `{bad_fields}`, etc. are filled from `HintContext` (`available_tools`, `tool_schema`, `tool_name`) and from new optional keys (`test_name`, `assertion_excerpt`, `linter`, `rule_id`, `rule_msg`, `prior_action`, `n`).
|
| 121 |
+
- `failing_test`, `lint_style`, `wasteful_action`, `verbose_communication` are **new** and extend coverage from tool-use into the style/comms/effort behavior classes at the *template tier* — but they are deliberately generic; the high-quality versions of these come from the LLM-judge (§4) or sibling-bootstrap (§2 class (f)).
|
| 122 |
+
- Truncate `{assertion_excerpt}` / `{tool_schema}` to ~200 chars (matches the `source_content_excerpt[:200]` convention already used in `trace_examples.py`) to keep the injected hint short — the blog stresses the hint is **local and short**.
|
| 123 |
+
|
| 124 |
+
---
|
| 125 |
+
|
| 126 |
+
## 4. LLM-judge path (class (c))
|
| 127 |
+
|
| 128 |
+
When no template fires, or when the behavior class is style/comms/effort, call a cheap judge to emit a ≤2-sentence corrective hint. The judge sees the *failed* turn and the *environment's* reaction — never the ground truth (we usually don't have it) — and is asked to produce the *minimal corrective nudge* that the teacher will then condition on.
|
| 129 |
+
|
| 130 |
+
### 4.1 Prompt template
|
| 131 |
+
|
| 132 |
+
```text
|
| 133 |
+
SYSTEM:
|
| 134 |
+
You write a single, short corrective hint for a coding agent that just made a
|
| 135 |
+
mistake. The hint will be inserted into the agent's context so it can retry the
|
| 136 |
+
SAME turn. Output ONE hint of AT MOST 2 sentences. Be concrete and actionable.
|
| 137 |
+
Do NOT solve the task, do NOT write code, do NOT explain your reasoning. If the
|
| 138 |
+
action was actually fine, output exactly: NO_HINT.
|
| 139 |
+
|
| 140 |
+
USER:
|
| 141 |
+
## Conversation state (last {k} turns)
|
| 142 |
+
{state}
|
| 143 |
+
|
| 144 |
+
## The action that went wrong
|
| 145 |
+
{erroring_action}
|
| 146 |
+
|
| 147 |
+
## What the environment returned
|
| 148 |
+
{tool_output}
|
| 149 |
+
|
| 150 |
+
## Behavior dimension to correct (one of: tool_use | style | communication | effort)
|
| 151 |
+
{behavior_dim}
|
| 152 |
+
|
| 153 |
+
Write the hint now (≤2 sentences, or NO_HINT):
|
| 154 |
+
```
|
| 155 |
+
|
| 156 |
+
- `{state}` = last `k≈3` turns (truncate to a token budget, e.g. 1.5k tokens).
|
| 157 |
+
- `{erroring_action}` = the assistant turn's tool call / message that failed.
|
| 158 |
+
- `{tool_output}` = the env error or judge text (the same string class (b) would pass raw).
|
| 159 |
+
- `{behavior_dim}` = routed from the error-site detector (§5): structural tool errors → `tool_use`; judge-flagged turns → `style`/`communication`/`effort`.
|
| 160 |
+
- `NO_HINT` sentinel maps to the generator returning `None` (skip the SDPO site), preventing the collator from minting a zero-signal row (the collator already guards "hint AND recovery content" — `data_collator.py` L308).
|
| 161 |
+
|
| 162 |
+
### 4.2 Model tier & rough cost
|
| 163 |
+
|
| 164 |
+
- **Tier:** a *small/cheap* instruct model is sufficient — the task is "spot the obvious mistake and say it in 2 sentences," not solve. Candidates: a 7–8B local model already loaded for rollouts (zero marginal $), or a hosted small model (e.g. Sonnet-class / GPT-mini-class via OpenRouter, consistent with the existing `hint_generator.py` docstring that names "Sonnet 4.6 or Opus 4.7 via OpenRouter" for v0.2).
|
| 165 |
+
- **Cost (hosted small model):** input ≈ 1.5k–2k tok (state + action + output), output ≤ ~60 tok. At ~$0.15/M in + ~$0.60/M out that is **≈ $0.0003–0.0006 per error site**. With error sites at, say, 1–3 per trace, this is **~$0.001–0.002/trace** — an order of magnitude cheaper than the trace-replay channel's ~$0.30/trace (`COMPOSER_RECIPE_MAPPING.md` §"three reward channels"), and only paid on **template misses**.
|
| 166 |
+
- **Cost (local judge):** effectively free GPU time; preferred at scale. Use the hosted path for v0.1 quality calibration, then distill to local.
|
| 167 |
+
- **Caching:** hints are deterministic-enough to cache keyed on `hash(erroring_action + tool_output + behavior_dim)`; repeated identical error sites across a training run reuse the hint for free.
|
| 168 |
+
|
| 169 |
+
---
|
| 170 |
+
|
| 171 |
+
## 5. Error-site detection in a trace (align with `ingestion/trace_examples.py`)
|
| 172 |
+
|
| 173 |
+
The hint generator must only fire at **error sites**. The pipeline already has two layers of structural detection that the generator must align with — do **not** invent a parallel detector.
|
| 174 |
+
|
| 175 |
+
**Existing structural detection (authoritative, do not duplicate):**
|
| 176 |
+
1. **Ingestion → `trace_examples.py`** sets `turn["tool_error"] = <error_kind>` on the assistant turn *immediately after* an error tool-result. It detects errors via:
|
| 177 |
+
- **Structural flag first** (`_user_turn_has_error`): the ingester sets `tool_error: True` on user messages whose source JSONL had `is_error: true`. **This is the source of truth.**
|
| 178 |
+
- **String-tag fallback**: matches `TOOL_ERROR_TAG = "[TOOL_RESULT (ERROR)]"` only when no structural flag is present (older traces).
|
| 179 |
+
- **`error_kind` classification** (`default_classify_error`): keyword regex → `command_not_found`, `file_not_found`, `permission_denied`, `syntax_error`, `connection_error`, else `tool_error`.
|
| 180 |
+
2. **Collator → `data_collator.py`** (`_is_error_turn`): an error site iff `turn.get("tool_error") is not None`, AND it only mints an SDPO row when **both** a hint is produced **and** the recovery turn has content (L308) — so empty-recovery sites are skipped.
|
| 181 |
+
|
| 182 |
+
**Extending the detector for the new behavior/error classes (additions, not replacements).** Keep `error_kind` as the routing key the generator already receives, and broaden the classifier so the new templates (§3) and the judge router (§4) get the right `behavior_dim`:
|
| 183 |
+
|
| 184 |
+
| Signal in trace | Detected via | New `error_kind` → behavior_dim |
|
| 185 |
+
|---|---|---|
|
| 186 |
+
| Failed tool status / `is_error: true` | structural flag (existing) | `tool_error`/`tool_not_found`/… → `tool_use` |
|
| 187 |
+
| Exception traceback in tool output | regex `Traceback (most recent call last)` / `Error:` | `runtime_error` → `tool_use` |
|
| 188 |
+
| Malformed args / JSON | parse failure of the tool-call args | `malformed_args`/`json_decode` → `tool_use` |
|
| 189 |
+
| Test runner non-zero exit / assertion | regex `FAILED|AssertionError|[0-9]+ failed` in output | `failing_test` → `tool_use` (verifiable) |
|
| 190 |
+
| Linter/formatter non-zero exit | regex `{ruff|eslint|flake8|black}.*(error|would reformat)`; nonzero exit code | `lint_style` → `style` |
|
| 191 |
+
| Redundant/no-op action | heuristic: action equals a prior action's signature; or no state delta | `wasteful_action` → `effort` |
|
| 192 |
+
| Over-long / off-task assistant message | **LLM-judge flag only** (no structural signal) | `verbose_communication` → `communication` |
|
| 193 |
+
|
| 194 |
+
Implementation alignment rule: **add these as new `(kind, regex)` rows to `_ERROR_KIND_PATTERNS` in `trace_examples.py`** (same ordered-precedence mechanism already there — note its comment that `command_not_found` must precede `file_not_found`), so detection stays in **one** place and the generator stays a pure `error_kind → hint` function. Style/comms/effort sites that have **no structural signature** are surfaced only by the judge and should be gated (sampled, e.g. 10–20% of clean turns) to bound cost.
|
| 195 |
+
|
| 196 |
+
---
|
| 197 |
+
|
| 198 |
+
## 6. Recommended layered design + `HintGenerator` Protocol
|
| 199 |
+
|
| 200 |
+
### 6.1 Protocol
|
| 201 |
+
|
| 202 |
+
A clean, typed Protocol that subsumes the current `dispatch` and the existing `CollatorConfig.hint_generator: Callable[[str, dict], str | None]` hook. The collator calls `generator(error_kind, error_meta)`; we wrap the Protocol with a tiny adapter so **no collator change is required**.
|
| 203 |
+
|
| 204 |
+
```python
|
| 205 |
+
# composer_replication/hints/protocol.py
|
| 206 |
+
from __future__ import annotations
|
| 207 |
+
from typing import Protocol, TypedDict, runtime_checkable
|
| 208 |
+
|
| 209 |
+
|
| 210 |
+
class ErrorContext(TypedDict, total=False):
|
| 211 |
+
"""Everything a hint source might need. Superset of the current HintContext."""
|
| 212 |
+
error_kind: str # routing key from trace_examples classifier
|
| 213 |
+
behavior_dim: str # "tool_use" | "style" | "communication" | "effort"
|
| 214 |
+
error_message: str # raw env/tool error text (enables class (b))
|
| 215 |
+
available_tools: list[str]
|
| 216 |
+
tool_name: str
|
| 217 |
+
tool_schema: dict
|
| 218 |
+
state_excerpt: str # last-k turns, for the judge (class (c)/(d))
|
| 219 |
+
erroring_action: str # the failed assistant turn
|
| 220 |
+
sibling_rollouts: list[dict] # GRPO group; passing ones enable class (f)
|
| 221 |
+
repeat_count: int # for repeated_failure
|
| 222 |
+
|
| 223 |
+
|
| 224 |
+
@runtime_checkable
|
| 225 |
+
class HintGenerator(Protocol):
|
| 226 |
+
"""A hint source. Returns the hint text, or None to decline this site."""
|
| 227 |
+
def generate(self, ctx: ErrorContext) -> str | None: ...
|
| 228 |
+
```
|
| 229 |
+
|
| 230 |
+
### 6.2 Layered composite (template-first → judge → sibling-bootstrap)
|
| 231 |
+
|
| 232 |
+
```python
|
| 233 |
+
# composer_replication/hints/layered.py
|
| 234 |
+
from dataclasses import dataclass, field
|
| 235 |
+
from .protocol import HintGenerator, ErrorContext
|
| 236 |
+
from . import templates # wraps the existing HINT_TEMPLATES registry
|
| 237 |
+
from . import judge # LLM-judge generator (class (c))
|
| 238 |
+
from . import sibling # SDPO successful-sibling bootstrap (class (f))
|
| 239 |
+
|
| 240 |
+
|
| 241 |
+
@dataclass
|
| 242 |
+
class LayeredHintGenerator:
|
| 243 |
+
"""Try each source in order; first non-None wins. A wrong hint is
|
| 244 |
+
bounded-bad (teacher is stop-grad), so cheap layers go first and we
|
| 245 |
+
only escalate to paid/learned layers on a miss."""
|
| 246 |
+
layers: list[HintGenerator] = field(default_factory=list)
|
| 247 |
+
|
| 248 |
+
def generate(self, ctx: ErrorContext) -> str | None:
|
| 249 |
+
for layer in self.layers:
|
| 250 |
+
hint = layer.generate(ctx)
|
| 251 |
+
if hint: # non-empty, non-None
|
| 252 |
+
return hint
|
| 253 |
+
return None # collator then skips this SDPO site
|
| 254 |
+
|
| 255 |
+
# Adapter for the existing CollatorConfig.hint_generator signature.
|
| 256 |
+
def as_collator_hook(self):
|
| 257 |
+
def hook(error_kind: str, error_meta: dict) -> str | None:
|
| 258 |
+
ctx: ErrorContext = {"error_kind": error_kind, **(error_meta or {})}
|
| 259 |
+
return self.generate(ctx)
|
| 260 |
+
return hook
|
| 261 |
+
|
| 262 |
+
|
| 263 |
+
def default_layered(*, judge_client=None, enable_judge=True) -> LayeredHintGenerator:
|
| 264 |
+
layers: list[HintGenerator] = [
|
| 265 |
+
templates.TemplateHintGenerator(), # (a) free, deterministic
|
| 266 |
+
templates.RawErrorHintGenerator(), # (b) raw env error as hint
|
| 267 |
+
]
|
| 268 |
+
if enable_judge and judge_client is not None:
|
| 269 |
+
layers.append(judge.JudgeHintGenerator(judge_client)) # (c)
|
| 270 |
+
layers.append(sibling.SiblingBootstrapGenerator()) # (f) backstop
|
| 271 |
+
return LayeredHintGenerator(layers=layers)
|
| 272 |
+
```
|
| 273 |
+
|
| 274 |
+
Wiring (unchanged collator contract):
|
| 275 |
+
|
| 276 |
+
```python
|
| 277 |
+
gen = default_layered(judge_client=my_small_model_client)
|
| 278 |
+
config = CollatorConfig(hint_generator=gen.as_collator_hook(), enable_sdpo=True)
|
| 279 |
+
collator = ComposerDataCollator(tokenizer=tok, config=config)
|
| 280 |
+
```
|
| 281 |
+
|
| 282 |
+
### 6.3 The three new layers (sketches)
|
| 283 |
+
|
| 284 |
+
```python
|
| 285 |
+
# (a)+(b) templates.py — reuse the EXISTING registry verbatim
|
| 286 |
+
from composer_replication.hint_generator import dispatch # current module
|
| 287 |
+
|
| 288 |
+
class TemplateHintGenerator:
|
| 289 |
+
def generate(self, ctx):
|
| 290 |
+
return dispatch(ctx.get("error_kind", ""), ctx) # None on unknown kind
|
| 291 |
+
|
| 292 |
+
class RawErrorHintGenerator:
|
| 293 |
+
"""Class (b): SDPO 'environment output' feedback — splice the raw error."""
|
| 294 |
+
def generate(self, ctx):
|
| 295 |
+
msg = (ctx.get("error_message") or "").strip()
|
| 296 |
+
if not msg:
|
| 297 |
+
return None
|
| 298 |
+
return f"Reminder: the previous action returned this error:\n{msg[:200]}\nFix the cause and retry."
|
| 299 |
+
```
|
| 300 |
+
|
| 301 |
+
```python
|
| 302 |
+
# (c) judge.py — class (c), prompt from §4.1
|
| 303 |
+
class JudgeHintGenerator:
|
| 304 |
+
def __init__(self, client, cache=None):
|
| 305 |
+
self.client, self.cache = client, (cache if cache is not None else {})
|
| 306 |
+
def generate(self, ctx):
|
| 307 |
+
key = hash((ctx.get("erroring_action"), ctx.get("error_message"),
|
| 308 |
+
ctx.get("behavior_dim")))
|
| 309 |
+
if key in self.cache:
|
| 310 |
+
return self.cache[key]
|
| 311 |
+
hint = self.client.complete(_build_judge_prompt(ctx)) # §4.1 template
|
| 312 |
+
hint = None if hint.strip() == "NO_HINT" else hint.strip()
|
| 313 |
+
self.cache[key] = hint
|
| 314 |
+
return hint
|
| 315 |
+
```
|
| 316 |
+
|
| 317 |
+
```python
|
| 318 |
+
# (f) sibling.py — SDPO 'successful rollouts as implicit feedback'
|
| 319 |
+
class SiblingBootstrapGenerator:
|
| 320 |
+
"""Class (f): when nothing else fired but a sibling rollout in the same
|
| 321 |
+
GRPO group PASSED, condition the teacher on that success."""
|
| 322 |
+
def generate(self, ctx):
|
| 323 |
+
sibs = ctx.get("sibling_rollouts") or []
|
| 324 |
+
winners = [s for s in sibs if s.get("reward", 0.0) > 0.0]
|
| 325 |
+
if not winners:
|
| 326 |
+
return None
|
| 327 |
+
best = max(winners, key=lambda s: s["reward"])
|
| 328 |
+
snippet = (best.get("solution_excerpt") or "")[:200]
|
| 329 |
+
return ("Reminder: a working approach for this task looks like:\n"
|
| 330 |
+
f"{snippet}\nAdapt this to the current step.")
|
| 331 |
+
```
|
| 332 |
+
|
| 333 |
+
> **Note on class (d):** same-model introspection (OPSD `--reason_first`) is the *training model* critiquing its own turn — best implemented inside the trainer (where the model is loaded) rather than the collator, since it needs a model forward pass. Add it as a fourth layer once the trainer exposes a `self_critique(ctx) -> str` callable; the Protocol already supports it. For v0.1, the judge (c) is the simpler stand-in for the same role.
|
| 334 |
+
|
| 335 |
+
### 6.4 Why this order (decision summary)
|
| 336 |
+
|
| 337 |
+
1. **Templates + raw-error (a/b)** are free, deterministic, and cover the **tool-use** class — the bulk of structural error sites. They reproduce Cursor's one published example (`"Reminder: Available tools are…"`) exactly.
|
| 338 |
+
2. **Judge (c)** is the only layer that *manufactures* a corrective for **style / communication / effort**, the behavior classes the mapping doc flags as the real test of the recipe (`COMPOSER_RECIPE_MAPPING.md` §"point 2"). Gated + cached → ~$0.0005/site, paid only on template miss.
|
| 339 |
+
3. **Sibling-bootstrap (f)** is the SDPO-native fallback when there's no template, no judge (or judge declined), but the rollout group contains a winner — *free* privileged info from the model's own success. This is the lever `09-composer-blog-delta-2026.md` §3 action item 3 told us to record.
|
| 340 |
+
4. **Learned generator (e)** drops in as a new layer in v0.2 (`COMPOSER_RECIPE_MAPPING.md` table row (d): "+ learned hint generator") without touching the Protocol or the collator.
|
| 341 |
+
|
| 342 |
+
---
|
| 343 |
+
|
| 344 |
+
## 7. Implementation handles (v0.1)
|
| 345 |
+
|
| 346 |
+
Concrete, ordered work items. Everything below preserves the existing `CollatorConfig.hint_generator: Callable[[str, dict], str | None]` contract — **no collator surgery**.
|
| 347 |
+
|
| 348 |
+
1. **Keep `hint_generator.py` as layer (a).** It already implements 5 templates with the right `dispatch(error_kind, ctx) -> str | None` signature. Add the four new templates from §3 (`failing_test`, `lint_style`, `wasteful_action`, `verbose_communication`) via `register(...)`. **Actual strings shipped in §3** — copy verbatim.
|
| 349 |
+
2. **Add the new error_kind regexes to `_ERROR_KIND_PATTERNS` in `trace_examples.py`** (§5 table). Single source of detection truth; preserve the ordered-precedence comment pattern (`command_not_found` before `file_not_found`). Route each `error_kind → behavior_dim` so the judge gets correct routing.
|
| 350 |
+
3. **Build `composer_replication/hints/`** with `protocol.py`, `layered.py`, `templates.py`, `judge.py`, `sibling.py` (§6 sketches). `templates.py` *imports the existing `dispatch`* — do not reimplement.
|
| 351 |
+
4. **Wire via the adapter:** `CollatorConfig(hint_generator=default_layered(...).as_collator_hook())`. The `claude_states_to_trace_examples` adapter already populates `error_meta` (`source_content_excerpt[:200]`); extend it to also stash `error_message` (for class (b)) and, when available from the GRPO loop, `sibling_rollouts` (for class (f)).
|
| 352 |
+
5. **Borrow OPSD stabilizers for the loss side:** when distilling style/comms hints, apply **per-token JSD clipping** (OPSD `--jsd_token_clip`, default `0.05`) so "stylistic tokens" don't dominate — the README states this is exactly why it exists. Pair with the collator's existing `sdpo_loss_mask` (post-hint tokens only).
|
| 353 |
+
6. **Gate the judge:** fire (c) only on (i) template miss, or (ii) a sampled fraction (~10–20%) of clean turns flagged for style/comms/effort, with hint caching keyed on `(erroring_action, error_message, behavior_dim)`. Bounds cost at ~$0.001–0.002/trace.
|
| 354 |
+
7. **Eval the generator independently of training** (matches `COMPOSER_RECIPE_MAPPING.md` concern that "SDPO with hardcoded templates is the easy case"): measure (a) % of error sites that get a non-None hint per layer, (b) teacher-vs-student KL *increase* at hinted turns (a good hint should *raise* divergence — it's shifting probability toward the fix, per the blog's "lowering wrong-tool, raising valid-replacement"), and (c) for style/comms, a held-out judge agreeing the hint is corrective. A hint that doesn't move the teacher distribution is a no-op and should be pruned.
|
| 355 |
+
|
| 356 |
+
---
|
| 357 |
+
|
| 358 |
+
## 8. Citations
|
| 359 |
+
|
| 360 |
+
- **SDPO** — Hübotter, Lübeck, Behric, Baumann, Bagatella, Marta, Hakimi, Shenfeld, Kleine Buening, Guestrin, Krause (ETH Zürich), *Reinforcement Learning via Self-Distillation*, [arXiv:2601.20802v2](https://arxiv.org/abs/2601.20802) (v1 28 Jan 2026, v2 16 Feb 2026), CC-BY-4.0. Abstract + method/ablation HTML ([html v2](https://arxiv.org/html/2601.20802v2)). The three-feedback-type ablation (sample solution / environment output / student's original attempt) and the "successful rollouts as implicit feedback" claim are the load-bearing sources for §1.2 and taxonomy classes (b), (d), (f).
|
| 361 |
+
- **OPSD** — Zhao et al., *Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models*, [arXiv:2601.18734v3](https://arxiv.org/abs/2601.18734), code [github.com/siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD) (paper CC-BY-4.0; verify code license on repo). Privileged-info teacher (`y⋆` = ground-truth/reference CoT), Eq. 8 loss, stop-grad teacher, and the `--reason_first` (introspection) + `--jsd_token_clip` (stylistic-token stabilizer) flags are the sources for §1.1 and the OPSD handles in §7.
|
| 362 |
+
- **Cursor blog** — [Introducing Composer 2.5](https://cursor.com/blog/composer-2-5) (2026): the "Reminder: Available tools are…" example, "hint changes the teacher probabilities / update student weights for that turn only," and the three behavior targets (tool use, coding style, model communication). Via `docs/COMPOSER_RECIPE_MAPPING.md` §1 and `research/09-composer-blog-delta-2026.md` §2.
|
| 363 |
+
- **Composer 2 technical report** — [arXiv:2603.24477](https://arxiv.org/abs/2603.24477) / [Composer2.pdf](https://cursor.com/resources/Composer2.pdf) (Rush et al.): flagged in the delta doc as the most likely place to resolve hint-generation directly; **still unread** — a dedicated extraction is the recommended follow-up if this design needs validation against Cursor's actual mechanism.
|
| 364 |
+
- **In-repo (audited):** `composer_replication/hint_generator.py` (current layer (a)), `composer_replication/trainer/data_collator.py` (`CollatorConfig.hint_generator` hook, `_build_hint_injected_trace`, hint-AND-recovery gate at L308), `composer_replication/ingestion/trace_examples.py` (structural error detection, `_ERROR_KIND_PATTERNS`, `default_classify_error`).
|
| 365 |
+
|
| 366 |
+
> **Residual gap:** Cursor still never states which hint source they use; this design *brackets* their unknown choice with the OPSD (privileged-answer) and SDPO (environment-feedback + sibling-bootstrap) endpoints and makes all of them composable. The one unread artifact that could collapse the bracket is Composer2.pdf (arXiv:2603.24477).
|
|
@@ -0,0 +1,499 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# SDPO ⊕ Dr. GRPO: wiring the on-policy KL-at-error-turns channel into a live RL loop
|
| 2 |
+
|
| 3 |
+
> **Design date:** 2026-05-28.
|
| 4 |
+
> **Scope:** A concrete, implementable design for adding the SDPO auxiliary
|
| 5 |
+
> loss channel (on-policy KL at error turns, teacher = same weights conditioned
|
| 6 |
+
> on a hint) as a **second loss head** on a live **Dr. GRPO** update step. Targets
|
| 7 |
+
> the two integration substrates already in this repo: the **PRIME-RL parity
|
| 8 |
+
> recipe** (`recipes/prime_rl/composer_loss.py`) and the **TRL `GRPOTrainer`
|
| 9 |
+
> subclass** (`trainer/composer_trainer.py`). Recommends the TRL subclass as the
|
| 10 |
+
> host and gives a ~70-LoC `ComposerGRPOTrainer` sketch.
|
| 11 |
+
> **Method:** Lead with local-file analysis of `loss.py`, `composer_loss.py`,
|
| 12 |
+
> `composer_trainer.py`, `data_collator.py`, plus `research/07` (HintGenerator)
|
| 13 |
+
> and `research/10` (the Dr. GRPO target). One bounded TRL API lookup
|
| 14 |
+
> (`mcp_exa_get_code_context_exa` on `huggingface/trl@main`) to confirm the
|
| 15 |
+
> `GRPOTrainer` loss-override surface; the DeepWiki follow-up timed out, so the
|
| 16 |
+
> version-robust guard in §4 documents both the `_compute_loss(self, model,
|
| 17 |
+
> inputs)` internal hook (what this repo already overrides) and the public
|
| 18 |
+
> `compute_loss(self, model, inputs, return_outputs=False,
|
| 19 |
+
> num_items_in_batch=None)` HF-Trainer wrapper.
|
| 20 |
+
|
| 21 |
+
---
|
| 22 |
+
|
| 23 |
+
## TL;DR
|
| 24 |
+
|
| 25 |
+
SDPO is **not** the GRPO-KL-to-reference term and must not be folded into it. It
|
| 26 |
+
is a **separate distillation head**: a generalized-JSD between the student's
|
| 27 |
+
on-policy logits and the **same model's** logits when its context has a hint
|
| 28 |
+
spliced in at the error turn, masked to the post-hint recovery tokens. The
|
| 29 |
+
integration is therefore "compute the Dr. GRPO loss as usual, then **add
|
| 30 |
+
`beta_sdpo · JSD_error_turns`** before `.backward()`."
|
| 31 |
+
|
| 32 |
+
- **Host = the TRL `GRPOTrainer` subclass.** It already exists
|
| 33 |
+
(`ComposerReplicationTrainer`), already overrides the loss with exactly this
|
| 34 |
+
`grpo + alpha*sdpo + beta*replay` shape, and — decisively — it has **full
|
| 35 |
+
logits** in `_compute_loss`. The PRIME-RL recipe **cannot** host SDPO today:
|
| 36 |
+
its `LossInputs` exposes per-token **log-probs only, not full vocabulary
|
| 37 |
+
logits**, and `composer_loss.py` correctly raises `NotImplementedError` when
|
| 38 |
+
`alpha_sdpo>0`. SDPO needs the full distribution; PRIME-RL is blocked until
|
| 39 |
+
upstream exposes logits.
|
| 40 |
+
- **Attach point:** inside the Dr. GRPO update step, after the policy-gradient +
|
| 41 |
+
k1-KL loss is computed on the minibatch, run one **student forward (grad)** +
|
| 42 |
+
one **teacher forward (`no_grad`, hint-spliced context)**, take
|
| 43 |
+
`generalized_jsd_loss` masked to `sdpo_loss_mask`, scale by `beta_sdpo`, and
|
| 44 |
+
add. Single-epoch Dr. GRPO makes this clean: the teacher forward happens on
|
| 45 |
+
**the same minibatch being updated**, so the KL is genuinely on-policy.
|
| 46 |
+
- **Dr. GRPO specifics are preserved untouched:** SDPO touches neither the
|
| 47 |
+
advantage estimator (no std-norm, no length-standardization) nor the GRPO
|
| 48 |
+
**k1** (`−log r`) KL-to-ref. It is purely additive.
|
| 49 |
+
- **CPU-testable:** a 1–2 rollout Dr. GRPO step on Qwen2.5-0.5B with the SDPO
|
| 50 |
+
channel on, mirroring the existing `examples/sdpo_real_trace_train_smoke`.
|
| 51 |
+
|
| 52 |
+
---
|
| 53 |
+
|
| 54 |
+
## 1. The two in-repo substrates, and why TRL is the host
|
| 55 |
+
|
| 56 |
+
### 1.1 Substrate A — TRL `GRPOTrainer` subclass (`trainer/composer_trainer.py`)
|
| 57 |
+
|
| 58 |
+
Already in the repo and already the right shape. `ComposerReplicationTrainer`
|
| 59 |
+
subclasses `trl.GRPOTrainer` and overrides:
|
| 60 |
+
|
| 61 |
+
```python
|
| 62 |
+
def _compute_loss(self, model, inputs) -> torch.Tensor:
|
| 63 |
+
grpo_loss = super()._compute_loss(model, inputs) # channel 1
|
| 64 |
+
sdpo_kl = self._compute_sdpo_loss(model, inputs) # channel 2
|
| 65 |
+
replay_dpo = self._compute_trace_replay_loss(model, inputs)
|
| 66 |
+
return grpo_loss + self.alpha_sdpo*sdpo_kl + self.beta_replay*replay_dpo
|
| 67 |
+
```
|
| 68 |
+
|
| 69 |
+
`_compute_sdpo_loss` (lines 133–178) already does the **student forward (grad) +
|
| 70 |
+
teacher forward (`no_grad`) over `ctx_teacher_input_ids`**, the
|
| 71 |
+
`student_logits.shape == teacher_logits.shape` gate, and
|
| 72 |
+
`generalized_jsd_loss(..., labels=inputs["sdpo_loss_mask"], beta, temperature,
|
| 73 |
+
token_clip, reduction="batchmean")`. This is the SDPO channel, intact. **It has
|
| 74 |
+
full logits** — the prerequisite PRIME-RL lacks.
|
| 75 |
+
|
| 76 |
+
**Decisive property:** TRL hands the subclass `model` and `inputs` and lets it
|
| 77 |
+
return any scalar; full `.logits` are available for both the student and the
|
| 78 |
+
hint-conditioned teacher forward. SDPO is a drop-in.
|
| 79 |
+
|
| 80 |
+
### 1.2 Substrate B — PRIME-RL parity recipe (`recipes/prime_rl/composer_loss.py`)
|
| 81 |
+
|
| 82 |
+
PRIME-RL's `CustomLossConfig` takes an importable `loss_fn(inputs:
|
| 83 |
+
LossInputs)` called **once per sample** on **1-D `(seq,)` tensors**. Channel 1
|
| 84 |
+
(DPPO + k1-style KL on the importance ratio) is **byte-for-byte parity-verified**
|
| 85 |
+
against upstream `default_loss_fn` and is an excellent Dr.-GRPO-adjacent PG loss.
|
| 86 |
+
|
| 87 |
+
But SDPO is **deferred by construction**:
|
| 88 |
+
|
| 89 |
+
```python
|
| 90 |
+
# composer_loss.py, lines 257-268
|
| 91 |
+
teacher_lp = getattr(inputs, "teacher_logprobs", None)
|
| 92 |
+
if alpha_sdpo > 0:
|
| 93 |
+
raise NotImplementedError(
|
| 94 |
+
"SDPO channel in the PRIME-RL recipe is deferred. PRIME-RL v0.5 "
|
| 95 |
+
"exposes (seq,) log-probs through LossInputs but not full vocabulary "
|
| 96 |
+
"logits, and SDPO/OPSD requires the full distribution. ...")
|
| 97 |
+
```
|
| 98 |
+
|
| 99 |
+
`generalized_jsd_loss` calls `log_softmax(dim=-1)` over the vocab axis. With
|
| 100 |
+
only a `(seq,)` log-prob vector there is **no vocab axis** — softmax of a
|
| 101 |
+
1-element slice is identically 1.0 and `log` is 0, i.e. a mathematically
|
| 102 |
+
degenerate, silently-zero channel (the Wave-13 finding the docstring cites). So
|
| 103 |
+
SDPO in PRIME-RL is blocked **until upstream exposes per-token full logits**, not
|
| 104 |
+
a thing we can paper over.
|
| 105 |
+
|
| 106 |
+
### 1.3 Recommendation
|
| 107 |
+
|
| 108 |
+
**Host the SDPO aux channel in the TRL `GRPOTrainer` subclass.** Rationale:
|
| 109 |
+
|
| 110 |
+
1. **Logits available** — the one hard requirement SDPO has and PRIME-RL lacks.
|
| 111 |
+
2. **The override already exists** with the exact additive shape; we
|
| 112 |
+
re-point channel 1 at Dr. GRPO and tighten the teacher forward (§4).
|
| 113 |
+
3. **Single-process, CPU-runnable** — matches the existing smoke harness, so the
|
| 114 |
+
SDPO-on Dr.-GRPO step is testable today (§6) without PRIME-RL's 3-actor mesh.
|
| 115 |
+
4. PRIME-RL stays the **scale/parity** path for channel-1-only runs; SDPO lands
|
| 116 |
+
there for free the moment `LossInputs.teacher_logits` (full distribution)
|
| 117 |
+
exists upstream — the adapter is otherwise ready.
|
| 118 |
+
|
| 119 |
+
> One caveat to fix while we're here: the current `ComposerReplicationTrainer`
|
| 120 |
+
> channel 1 is *vanilla* GRPO (`super()._compute_loss`). The Composer target is
|
| 121 |
+
> **Dr. GRPO** (`research/10`): length-standardization removed, **no std-dev
|
| 122 |
+
> advantage normalization**, **k1** (`−log r`) KL, Adam, single-epoch. §3 + §4
|
| 123 |
+
> pin those into the subclass; SDPO rides on top unchanged.
|
| 124 |
+
|
| 125 |
+
---
|
| 126 |
+
|
| 127 |
+
## 2. The exact attach point + data flow
|
| 128 |
+
|
| 129 |
+
SDPO attaches **inside one Dr. GRPO update step, after the PG+KL loss is formed,
|
| 130 |
+
before backward**. It is one extra additive scalar. Concretely, per minibatch:
|
| 131 |
+
|
| 132 |
+
```
|
| 133 |
+
┌─────────────────────── one Dr. GRPO update step (single-epoch) ──────────────────────┐
|
| 134 |
+
rollout ──▶ │ Channel 1 (Dr. GRPO): │
|
| 135 |
+
trajectory │ advantages = (R - group_mean) # NO /std, NO length-standardization │
|
| 136 |
+
(group of K)│ logπ_new = model(input_ids).logprobs # the on-policy student forward (grad) │
|
| 137 |
+
│ log_r = logπ_new - logπ_old # log importance ratio (old = rollout-time) │
|
| 138 |
+
│ pg = -(advantages * exp(log_r))[resp_mask] │
|
| 139 |
+
│ kl = (-log_r)[resp_mask] # k1 estimator, NOT k3 │
|
| 140 |
+
│ L_drgrpo = (pg + beta_kl * kl).sum() │
|
| 141 |
+
│ │
|
| 142 |
+
│ Channel 2 (SDPO) — SAME minibatch, reuses the student forward where possible: │
|
| 143 |
+
│ error sites ◀── reuse ingestion structural `tool_error` (research/07 §5) │
|
| 144 |
+
│ │ (turn.get("tool_error") is not None; single source of truth) │
|
| 145 |
+
│ ▼ │
|
| 146 |
+
│ HintGenerator.generate(ErrorContext) ──▶ hint text (research/07 §6, layered) │
|
| 147 |
+
│ │ │
|
| 148 |
+
│ ▼ data_collator splices hint at the error turn: │
|
| 149 |
+
│ ctx_teacher_input_ids (hint system-msg + recovery turn, chat-template aligned) │
|
| 150 |
+
│ input_ids (placeholder-of-equal-token-length so shapes match) │
|
| 151 |
+
│ sdpo_loss_mask (1 on post-hint recovery tokens only) │
|
| 152 |
+
│ │ │
|
| 153 |
+
│ ▼ │
|
| 154 |
+
│ student_logits = model(input_ids).logits # grad │
|
| 155 |
+
│ with no_grad: teacher_logits = model(ctx_teacher_input_ids).logits # stop-grad │
|
| 156 |
+
│ L_sdpo = generalized_jsd_loss(student, teacher, │
|
| 157 |
+
│ labels=sdpo_loss_mask, beta=jsd_beta, │
|
| 158 |
+
│ temperature=1.0, token_clip=0.05) # masked to error turn │
|
| 159 |
+
│ │
|
| 160 |
+
│ total = L_drgrpo + beta_sdpo * L_sdpo ──▶ .backward() ──▶ Adam.step() │
|
| 161 |
+
└────────────────────────────────────────────────────────────────────────────────────────┘
|
| 162 |
+
```
|
| 163 |
+
|
| 164 |
+
Key flow facts:
|
| 165 |
+
|
| 166 |
+
- **Error-site detection is not re-invented.** The ingestion layer already sets
|
| 167 |
+
`turn["tool_error"] = <error_kind>` (structural `is_error:true` flag first,
|
| 168 |
+
string-tag fallback), and the collator's `_is_error_turn` keys on exactly that
|
| 169 |
+
(`research/07` §5). The trainer **consumes** the collator's
|
| 170 |
+
`ctx_teacher_input_ids` / `sdpo_loss_mask`; it does not detect errors itself.
|
| 171 |
+
- **HintGenerator is called at collation time**, not in the loss. Per
|
| 172 |
+
`research/07` §6.1, the generator's only job is to produce the text spliced
|
| 173 |
+
into the teacher context; the collator's `_build_hint_injected_trace` does the
|
| 174 |
+
splice and the equal-length student alignment
|
| 175 |
+
(`_build_aligned_student_for_sdpo`). The trainer sees finished tensors.
|
| 176 |
+
- **The teacher forward is on the live weights**, hint-conditioned, `no_grad`.
|
| 177 |
+
It is *not* a separate model and *not* a re-rollout (`research/07` §1.3). One
|
| 178 |
+
extra forward per SDPO minibatch.
|
| 179 |
+
- **The JSD is masked to the error turn** via `sdpo_loss_mask` (post-hint
|
| 180 |
+
recovery tokens only), so SDPO supervises *exactly* the turn the hint targets,
|
| 181 |
+
leaving the rest of the trajectory to channel 1.
|
| 182 |
+
|
| 183 |
+
---
|
| 184 |
+
|
| 185 |
+
## 3. Reconciling with Dr. GRPO specifics
|
| 186 |
+
|
| 187 |
+
`research/10` pins the algorithm. SDPO must coexist without perturbing any of it:
|
| 188 |
+
|
| 189 |
+
| Dr. GRPO property (`research/10` §2) | Where it lives | SDPO interaction |
|
| 190 |
+
|---|---|---|
|
| 191 |
+
| **No std-dev advantage normalization** | advantage estimator | **None.** SDPO never touches advantages. Keep `A = R - group_mean` (no `/std`). |
|
| 192 |
+
| **Length-standardization term removed** | PG reduction | **None.** SDPO is a separate head; do not re-introduce a `1/|y|` factor via SDPO's reduction either (use `batchmean` over masked error-turn tokens, which is SDPO's own normalization, independent of trajectory length). |
|
| 193 |
+
| **k1 KL = `−log r`** (NOT k3) | GRPO KL-to-ref term | **Distinct from SDPO.** The GRPO k1 KL regularizes the policy toward the *reference/old* policy on all response tokens. SDPO's JSD pulls the policy toward the *hint-conditioned self-teacher* on error-turn tokens. Two different targets, two different token sets, two different weights (`beta_kl` vs `beta_sdpo`). Never merge them. |
|
| 194 |
+
| **Single-epoch (a prompt is never trained twice)** | outer loop | **This is what makes SDPO clean.** The teacher forward happens on the *same minibatch being updated this step* — the student logits and the hint-conditioned teacher logits are both from the current weights on the current rollout, so the distilled KL is genuinely **on-policy** (SDPO's defining property). No stale-teacher / replay-buffer drift to reconcile. |
|
| 195 |
+
| **Adam, full-parameter, async rollouts** | optimizer / infra | **None.** SDPO adds gradient only through the student forward; Adam consumes the summed gradient transparently. Async/off-policy weight sync (PipelineRL-style) affects channel 1's `logπ_old`; SDPO's teacher is the *current* weights so it is unaffected. |
|
| 196 |
+
|
| 197 |
+
**The one thing to get right:** SDPO's JSD is **SEPARATE** from the GRPO
|
| 198 |
+
KL-to-ref. In the loss expression `total = L_drgrpo + beta_sdpo*L_sdpo`, the
|
| 199 |
+
`L_drgrpo` already *contains* its own `beta_kl * k1_kl`. Do not let `beta_sdpo`
|
| 200 |
+
masquerade as a KL coefficient or vice-versa; they are logged separately
|
| 201 |
+
(`loss/grpo_kl` vs `loss/sdpo_jsd`).
|
| 202 |
+
|
| 203 |
+
---
|
| 204 |
+
|
| 205 |
+
## 4. Implementation handles — `ComposerGRPOTrainer(GRPOTrainer)`
|
| 206 |
+
|
| 207 |
+
A focused subclass that (a) forces channel 1 into the Dr. GRPO regime and (b)
|
| 208 |
+
adds the SDPO head. This refines the existing `ComposerReplicationTrainer`; the
|
| 209 |
+
SDPO method is lifted almost verbatim from `composer_trainer.py:_compute_sdpo_loss`
|
| 210 |
+
(it is already correct), and the Dr. GRPO config is pinned via `GRPOConfig`.
|
| 211 |
+
|
| 212 |
+
### 4.1 The loss-override surface (version-robust)
|
| 213 |
+
|
| 214 |
+
The repo already overrides `_compute_loss(self, model, inputs)` — the internal
|
| 215 |
+
per-step loss hook TRL's `GRPOTrainer` exposes, and what this subclass keeps
|
| 216 |
+
using. Recent TRL wraps that in the HF `Trainer.compute_loss(self, model,
|
| 217 |
+
inputs, return_outputs=False, num_items_in_batch=None)`. To be robust to either
|
| 218 |
+
surface, override **`_compute_loss`** (present across the versions this repo
|
| 219 |
+
targets) and additionally provide a thin `compute_loss` shim that delegates, so
|
| 220 |
+
the subclass works whether TRL calls the internal or the public method:
|
| 221 |
+
|
| 222 |
+
```python
|
| 223 |
+
def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
|
| 224 |
+
loss = self._compute_loss(model, inputs) # our composed loss
|
| 225 |
+
return (loss, None) if return_outputs else loss
|
| 226 |
+
```
|
| 227 |
+
|
| 228 |
+
If a future TRL drops `_compute_loss`, move the channel-1 call to
|
| 229 |
+
`super().compute_loss(model, inputs, return_outputs=True,
|
| 230 |
+
num_items_in_batch=num_items_in_batch)[0]` inside `_compute_loss` — the SDPO
|
| 231 |
+
add-on is unaffected.
|
| 232 |
+
|
| 233 |
+
### 4.2 The sketch (~70 LoC)
|
| 234 |
+
|
| 235 |
+
```python
|
| 236 |
+
# composer_replication/trainer/composer_grpo_trainer.py
|
| 237 |
+
from __future__ import annotations
|
| 238 |
+
from typing import Any
|
| 239 |
+
import logging, torch
|
| 240 |
+
|
| 241 |
+
try:
|
| 242 |
+
from trl import GRPOTrainer, GRPOConfig # noqa: F401
|
| 243 |
+
_TRL = True
|
| 244 |
+
except ImportError: # doc/test import without TRL
|
| 245 |
+
GRPOTrainer = object; _TRL = False
|
| 246 |
+
|
| 247 |
+
from composer_replication.opsd import generalized_jsd_loss
|
| 248 |
+
|
| 249 |
+
logger = logging.getLogger(__name__)
|
| 250 |
+
|
| 251 |
+
|
| 252 |
+
def make_dr_grpo_config(**overrides: Any) -> "GRPOConfig":
|
| 253 |
+
"""Dr. GRPO regime (research/10 §2): no std-norm, no length-standardization,
|
| 254 |
+
k1 KL, single-epoch, Adam. We pin what GRPOConfig exposes and assert the
|
| 255 |
+
rest. TRL flag names drift across versions, so set defensively + log."""
|
| 256 |
+
cfg_kwargs = dict(
|
| 257 |
+
num_iterations=1, # single-epoch: a prompt is never re-trained
|
| 258 |
+
scale_rewards=False, # << NO std-dev advantage normalization (Dr. GRPO)
|
| 259 |
+
loss_type="dr_grpo", # TRL's Dr. GRPO loss_type: drops length-standardization;
|
| 260 |
+
# if absent in your TRL, fall back to "grpo" and
|
| 261 |
+
# override the reduction (see assert below).
|
| 262 |
+
optim="adamw_torch", # Adam(W); Composer 2 uses Adam for RL
|
| 263 |
+
beta=0.0, # GRPO KL-to-ref coeff; set >0 to enable the k1 term
|
| 264 |
+
)
|
| 265 |
+
cfg_kwargs.update(overrides)
|
| 266 |
+
return GRPOConfig(**cfg_kwargs)
|
| 267 |
+
|
| 268 |
+
|
| 269 |
+
class ComposerGRPOTrainer(GRPOTrainer): # type: ignore[misc,valid-type]
|
| 270 |
+
"""Dr. GRPO + SDPO (on-policy KL at error turns). SDPO is an ADDITIVE head;
|
| 271 |
+
it never touches advantages or the GRPO-KL-to-ref term."""
|
| 272 |
+
|
| 273 |
+
def __init__(self, *args: Any, beta_sdpo: float = 0.0, sdpo_jsd_beta: float = 0.5,
|
| 274 |
+
sdpo_temperature: float = 1.0, sdpo_token_clip: float | None = 0.05,
|
| 275 |
+
sdpo_warmup_steps: int = 0, beta_sdpo_max: float | None = None,
|
| 276 |
+
**kwargs: Any):
|
| 277 |
+
if not _TRL:
|
| 278 |
+
raise ImportError("ComposerGRPOTrainer requires TRL: pip install -e .[train]")
|
| 279 |
+
super().__init__(*args, **kwargs)
|
| 280 |
+
self.beta_sdpo = beta_sdpo
|
| 281 |
+
self.beta_sdpo_max = beta_sdpo_max if beta_sdpo_max is not None else beta_sdpo
|
| 282 |
+
self.sdpo_warmup_steps = sdpo_warmup_steps
|
| 283 |
+
self.sdpo_jsd_beta = sdpo_jsd_beta
|
| 284 |
+
self.sdpo_temperature = sdpo_temperature
|
| 285 |
+
self.sdpo_token_clip = sdpo_token_clip
|
| 286 |
+
# Dr. GRPO sanity pins (loud, not silent): if the TRL version ignored a
|
| 287 |
+
# flag, surface it rather than train vanilla GRPO by accident.
|
| 288 |
+
if getattr(self.args, "scale_rewards", True):
|
| 289 |
+
logger.warning("Dr. GRPO requires scale_rewards=False (no std-norm); "
|
| 290 |
+
"GRPOConfig.scale_rewards=%s — advantages may be std-normalized.",
|
| 291 |
+
getattr(self.args, "scale_rewards", None))
|
| 292 |
+
|
| 293 |
+
def _beta_sdpo_now(self) -> float:
|
| 294 |
+
"""Linear warmup so SDPO doesn't swamp the early policy gradient (§5)."""
|
| 295 |
+
step = getattr(getattr(self, "state", None), "global_step", 0) or 0
|
| 296 |
+
if self.sdpo_warmup_steps <= 0:
|
| 297 |
+
return self.beta_sdpo
|
| 298 |
+
frac = min(1.0, step / float(self.sdpo_warmup_steps))
|
| 299 |
+
return self.beta_sdpo + frac * (self.beta_sdpo_max - self.beta_sdpo)
|
| 300 |
+
|
| 301 |
+
def _compute_loss(self, model, inputs):
|
| 302 |
+
drgrpo = super()._compute_loss(model, inputs) # channel 1 (Dr. GRPO, k1 KL)
|
| 303 |
+
sdpo = self._compute_sdpo_loss(model, inputs) # channel 2 (additive)
|
| 304 |
+
beta = self._beta_sdpo_now()
|
| 305 |
+
total = drgrpo + beta * sdpo
|
| 306 |
+
if self.state.global_step % getattr(self.args, "logging_steps", 50) == 0:
|
| 307 |
+
self.log({"loss/grpo": float(drgrpo.detach()),
|
| 308 |
+
"loss/sdpo_jsd": float(sdpo.detach()),
|
| 309 |
+
"loss/beta_sdpo": beta, "loss/total": float(total.detach())})
|
| 310 |
+
return total
|
| 311 |
+
|
| 312 |
+
def _compute_sdpo_loss(self, model, inputs):
|
| 313 |
+
if (self._beta_sdpo_now() == 0.0
|
| 314 |
+
or "ctx_teacher_input_ids" not in inputs
|
| 315 |
+
or inputs["ctx_teacher_input_ids"].numel() == 0):
|
| 316 |
+
return torch.zeros((), device=next(model.parameters()).device, requires_grad=True)
|
| 317 |
+
student = model(input_ids=inputs["input_ids"]).logits # grad
|
| 318 |
+
with torch.no_grad():
|
| 319 |
+
teacher = model(input_ids=inputs["ctx_teacher_input_ids"]).logits # stop-grad
|
| 320 |
+
if student.shape != teacher.shape: # collator alignment guard
|
| 321 |
+
logger.warning("SDPO shape mismatch student=%s teacher=%s; skipping step.",
|
| 322 |
+
student.shape, teacher.shape)
|
| 323 |
+
return torch.zeros((), device=student.device, requires_grad=True)
|
| 324 |
+
return generalized_jsd_loss(student_logits=student, teacher_logits=teacher,
|
| 325 |
+
labels=inputs.get("sdpo_loss_mask"),
|
| 326 |
+
beta=self.sdpo_jsd_beta, temperature=self.sdpo_temperature,
|
| 327 |
+
token_clip=self.sdpo_token_clip, reduction="batchmean")
|
| 328 |
+
|
| 329 |
+
def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
|
| 330 |
+
loss = self._compute_loss(model, inputs)
|
| 331 |
+
return (loss, None) if return_outputs else loss
|
| 332 |
+
```
|
| 333 |
+
|
| 334 |
+
### 4.3 How error-turn batches reach the trainer
|
| 335 |
+
|
| 336 |
+
**Reuse `ComposerDataCollator` verbatim** — it already emits
|
| 337 |
+
`ctx_teacher_input_ids` + `sdpo_loss_mask` and (critically) the
|
| 338 |
+
**equal-length student** via `_build_aligned_student_for_sdpo` (the placeholder
|
| 339 |
+
trick that keeps `student_logits.shape == teacher_logits.shape` so the JSD gate
|
| 340 |
+
passes; the Gemini-W19 alias bug is already handled there). Wiring:
|
| 341 |
+
|
| 342 |
+
```python
|
| 343 |
+
gen = default_layered(judge_client=small_model).as_collator_hook() # research/07 §6
|
| 344 |
+
collator = ComposerDataCollator(tokenizer=tok,
|
| 345 |
+
config=CollatorConfig(hint_generator=gen, enable_sdpo=True,
|
| 346 |
+
enable_replay_dpo=False))
|
| 347 |
+
trainer = ComposerGRPOTrainer(model=model, args=make_dr_grpo_config(...),
|
| 348 |
+
train_dataset=ds, data_collator=collator,
|
| 349 |
+
beta_sdpo=0.1, sdpo_warmup_steps=50, sdpo_token_clip=0.05,
|
| 350 |
+
reward_funcs=[my_rlvr_reward])
|
| 351 |
+
```
|
| 352 |
+
|
| 353 |
+
> **GRPO-rollout vs collator note.** TRL's `GRPOTrainer` generates rollouts
|
| 354 |
+
> internally and forms its own `inputs` (prompt + completions + advantages). For
|
| 355 |
+
> SDPO the error sites come from the *rollout trajectory itself* (tool errors in
|
| 356 |
+
> the just-generated completions), so the SDPO tensors must be built **from the
|
| 357 |
+
> live rollout**, not from a static dataset. Two equivalent integration modes:
|
| 358 |
+
> (1) **post-rollout hook** — override `_generate_and_score_completions` (or the
|
| 359 |
+
> rollout collation step) to run the structural `tool_error` detector +
|
| 360 |
+
> `HintGenerator` + `ComposerDataCollator._build_sdpo_fields` on the generated
|
| 361 |
+
> completions and stash `ctx_teacher_input_ids`/`sdpo_loss_mask` into `inputs`;
|
| 362 |
+
> (2) **offline-trace mode** (what the smoke uses) — feed pre-ingested
|
| 363 |
+
> error-bearing traces through the collator as the dataset, exercising the exact
|
| 364 |
+
> loss path on CPU. Mode (2) is the test; mode (1) is production. The
|
| 365 |
+
> `_compute_sdpo_loss` body is identical for both — it only reads the two SDPO
|
| 366 |
+
> keys.
|
| 367 |
+
|
| 368 |
+
---
|
| 369 |
+
|
| 370 |
+
## 5. Weighting, scheduling, and guardrails
|
| 371 |
+
|
| 372 |
+
So SDPO informs without swamping the policy gradient:
|
| 373 |
+
|
| 374 |
+
1. **Scale.** Start `beta_sdpo = 0.1` (the library default `alpha_sdpo`), not the
|
| 375 |
+
`1.0` the smoke uses (the smoke over-weights deliberately to *prove the path
|
| 376 |
+
fires*). The Dr. GRPO PG loss is a `sum()` over response tokens; SDPO is a
|
| 377 |
+
`batchmean` JSD over error-turn tokens — different magnitudes. **Normalize
|
| 378 |
+
first:** log `loss/grpo` and `loss/sdpo_jsd` separately for the first ~50
|
| 379 |
+
steps and pick `beta_sdpo` so `beta_sdpo·sdpo_jsd ≈ 0.1–0.3 × |grpo|` at
|
| 380 |
+
steady state. Do **not** assume `0.1` is calibrated across reductions.
|
| 381 |
+
2. **Warmup.** Linear `beta_sdpo` warmup over `sdpo_warmup_steps` (50–200) via
|
| 382 |
+
`_beta_sdpo_now()`. Early in training the policy is far from any sensible
|
| 383 |
+
distribution; a strong distillation pull then fights exploration. Let Dr. GRPO
|
| 384 |
+
establish a reward signal, then ramp SDPO in.
|
| 385 |
+
3. **Per-token JSD clip = 0.05** (`sdpo_token_clip`, the OPSD `--jsd_token_clip`
|
| 386 |
+
default, `research/07` §1.1/§7). Prevents a few high-divergence **stylistic**
|
| 387 |
+
tokens at the error turn from dominating the distillation gradient — exactly
|
| 388 |
+
what it exists for.
|
| 389 |
+
4. **Mask discipline.** SDPO supervises **only** `sdpo_loss_mask` tokens
|
| 390 |
+
(post-hint recovery). If the mask is all-ignore (empty-recovery error site,
|
| 391 |
+
~67% of real Claude traces under `strip_thinking`), the collator already drops
|
| 392 |
+
the row (`data_collator.py` L308) — the channel silently no-ops rather than
|
| 393 |
+
emitting a degenerate ~ln(2) signal.
|
| 394 |
+
|
| 395 |
+
**KL-explosion / teacher-student-drift guardrails:**
|
| 396 |
+
|
| 397 |
+
- **SDPO drift is bounded-by-construction.** Teacher = same weights + hint,
|
| 398 |
+
stop-grad. A *wrong* hint produces a noisier target at one masked turn, not a
|
| 399 |
+
corrupted reward (`research/07` §1.3). There is no replay buffer and no
|
| 400 |
+
separate teacher to drift apart — single-epoch keeps teacher and student on the
|
| 401 |
+
same weights.
|
| 402 |
+
- **Watch `loss/sdpo_jsd` for collapse-to-zero or blow-up.** A *good* hint should
|
| 403 |
+
*raise* divergence at the hinted turn (it shifts mass toward the fix); a
|
| 404 |
+
persistently ~0 JSD means the hint isn't moving the teacher (prune that hint
|
| 405 |
+
source, `research/07` §7 item 7). A diverging JSD means the clip is too loose or
|
| 406 |
+
`beta_sdpo` too high — cap `beta_sdpo` and/or lower `token_clip`.
|
| 407 |
+
- **Guard the GRPO k1 KL independently.** Dr. GRPO's own `beta` (KL-to-ref) is
|
| 408 |
+
the explosion guard for the *policy*; keep it at its tuned value. SDPO's `beta_sdpo`
|
| 409 |
+
must not be conflated with it (§3). If total loss NaNs, bisect by zeroing
|
| 410 |
+
`beta_sdpo` — if it persists, the bug is in channel 1, not SDPO.
|
| 411 |
+
- **Shape-gate is a hard stop, logged.** If collator alignment regresses,
|
| 412 |
+
`_compute_sdpo_loss` skips the step with a warning rather than training on
|
| 413 |
+
aliased pad tokens (the silent-degenerate failure mode).
|
| 414 |
+
|
| 415 |
+
---
|
| 416 |
+
|
| 417 |
+
## 6. CPU-testable vs GPU-only, and the smoke plan
|
| 418 |
+
|
| 419 |
+
### What is CPU-testable
|
| 420 |
+
- **The whole SDPO loss path** — student forward + hint-conditioned teacher
|
| 421 |
+
forward + masked JSD + `.backward()` + `Adam.step()` — on **Qwen2.5-0.5B** with
|
| 422 |
+
1–2 error-bearing rollouts. This is *exactly* what
|
| 423 |
+
`examples/sdpo_real_trace_train_smoke/run.py` already proves for the free
|
| 424 |
+
`compose_loss` composer; the new test wraps it in the Dr. GRPO step.
|
| 425 |
+
- **The additive composition** `total = drgrpo + beta_sdpo·sdpo` and the warmup
|
| 426 |
+
schedule (assert `beta_sdpo` ramps, assert `loss/sdpo_jsd>0` on ≥1 step, assert
|
| 427 |
+
a watched param moves).
|
| 428 |
+
- **Dr. GRPO config pins** — assert `scale_rewards=False`, `num_iterations=1`,
|
| 429 |
+
k1-KL path selected (unit-level, no GPU).
|
| 430 |
+
|
| 431 |
+
### What is GPU-only
|
| 432 |
+
- **Real TRL `GRPOTrainer` rollout generation** (vLLM/transformers generation at
|
| 433 |
+
batch size + group size K) — too slow on CPU for a live step; this is the
|
| 434 |
+
production "mode (1)" path in §4.3.
|
| 435 |
+
- **Async weight sync / off-policy control**, MoE router replay, multi-region
|
| 436 |
+
infra (`research/10` §4) — all out of scope for the SDPO channel test.
|
| 437 |
+
- **Convergence / quality** (does SDPO actually improve error-recovery) — needs a
|
| 438 |
+
real RL run.
|
| 439 |
+
|
| 440 |
+
### Minimal smoke plan (`examples/sdpo_drgrpo_step_smoke/run.py`)
|
| 441 |
+
Analogous to the existing SDPO smoke; gates:
|
| 442 |
+
|
| 443 |
+
1. Build a Dr. GRPO minibatch from 1–2 ingested **error-bearing** Qwen traces via
|
| 444 |
+
`ComposerDataCollator` (reuse `_discover_error_sessions` + the layered
|
| 445 |
+
`HintGenerator`); assert `sdpo_loss_mask` has ≥1 in-loss position.
|
| 446 |
+
2. Construct a **synthetic Dr. GRPO channel-1 loss** standing in for
|
| 447 |
+
`super()._compute_loss` (advantages = `R - group_mean`, **no /std**; k1 KL
|
| 448 |
+
`−log r`; `sum()` reduction; no length-standardization) so the test runs
|
| 449 |
+
**without** spinning up TRL's full rollout machinery on CPU — mirrors how the
|
| 450 |
+
existing smoke uses LM-CE as the GRPO stub. Optionally also run a real
|
| 451 |
+
`GRPOTrainer._compute_loss` path under a `@pytest.mark.gpu` guard.
|
| 452 |
+
3. `total = drgrpo_stub + beta_sdpo · _compute_sdpo_loss(...)`; `.backward()`;
|
| 453 |
+
`Adam.step()`.
|
| 454 |
+
4. **Gates (exit 0 = PASS):** (a) all losses finite across steps; (b)
|
| 455 |
+
`loss/sdpo_jsd > 0` on ≥1 step (SDPO fired — shape-gate passed, hint
|
| 456 |
+
contributed real signal); (c) a watched parameter moved; (d) `beta_sdpo`
|
| 457 |
+
warmup increases monotonically; (e) zeroing `beta_sdpo` reproduces the pure
|
| 458 |
+
Dr. GRPO stub loss bit-for-bit (proves SDPO is purely additive). Exit 2 = SKIP
|
| 459 |
+
(no error-bearing sessions / no chat-template model), matching the existing
|
| 460 |
+
smoke's contract.
|
| 461 |
+
|
| 462 |
+
This is ~$0, CPU, single-process, and closes the one unproven edge: **a live
|
| 463 |
+
Dr. GRPO update step with the SDPO channel on**, end-to-end on a real HF model.
|
| 464 |
+
|
| 465 |
+
---
|
| 466 |
+
|
| 467 |
+
## 7. Citations
|
| 468 |
+
|
| 469 |
+
- **In-repo (authoritative substrate):** `composer_replication/loss.py`
|
| 470 |
+
(`compose_loss` 3-channel composer + `generalized_jsd_loss` call);
|
| 471 |
+
`recipes/prime_rl/composer_loss.py` (PRIME-RL adapter; SDPO `NotImplementedError`
|
| 472 |
+
at L257-268; parity-verified channel 1); `recipes/prime_rl/prime_rl_recipe.md`
|
| 473 |
+
(LossInputs shape, log-probs-not-logits limitation);
|
| 474 |
+
`trainer/composer_trainer.py` (`ComposerReplicationTrainer._compute_loss` and
|
| 475 |
+
`_compute_sdpo_loss` — the existing, correct SDPO head);
|
| 476 |
+
`trainer/data_collator.py` (`ctx_teacher_input_ids` + `sdpo_loss_mask` +
|
| 477 |
+
`_build_aligned_student_for_sdpo` equal-length alignment; hint-AND-recovery
|
| 478 |
+
gate L308); `examples/sdpo_real_trace_train_smoke/run.py` (the proven CPU
|
| 479 |
+
forward+backward+step harness this design's smoke extends).
|
| 480 |
+
- **`research/10-composer2-techreport-mining.md`** — the Dr. GRPO target:
|
| 481 |
+
length-standardization removed, no std-dev advantage normalization, **k1**
|
| 482 |
+
(`−log r`) KL not k3, Adam, single-epoch (a prompt never trained twice).
|
| 483 |
+
arXiv:2603.24477 §4.1.
|
| 484 |
+
- **`research/07-sdpo-hint-generator.md`** — `HintGenerator` Protocol + layered
|
| 485 |
+
composite, error-site detection alignment with ingestion `tool_error`, OPSD
|
| 486 |
+
`--jsd_token_clip` stabilizer, the "wrong hint is bounded-bad" property.
|
| 487 |
+
- **SDPO** — arXiv:2601.20802 (on-policy self-distillation; teacher = same model
|
| 488 |
+
conditioned on feedback, student stop-grad-free / teacher stop-grad, per-token
|
| 489 |
+
KL on the student trajectory). **OPSD** — arXiv:2601.18734 (privileged-info
|
| 490 |
+
teacher, generalized-JSD, token clip).
|
| 491 |
+
- **TRL** — `huggingface/trl@main` `trl/trainer/grpo_trainer.py`
|
| 492 |
+
(`GRPOTrainer` loss-override surface; confirmed via
|
| 493 |
+
`mcp_exa_get_code_context_exa`). The `_compute_loss(self, model, inputs)`
|
| 494 |
+
internal hook is what this repo already overrides; the public
|
| 495 |
+
`compute_loss(self, model, inputs, return_outputs=False,
|
| 496 |
+
num_items_in_batch=None)` HF-Trainer wrapper is shimmed in §4.1 for
|
| 497 |
+
version-robustness. (A confirmatory DeepWiki lookup timed out; the §4.1 guard
|
| 498 |
+
is written to work under either surface.)
|
| 499 |
+
```
|
|
@@ -0,0 +1,76 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Composer 2.5 Blog — Delta Note (re-extraction)
|
| 2 |
+
|
| 3 |
+
> **Re-extraction date:** 2026-05-28.
|
| 4 |
+
> **Method:** Live re-pull of [cursor.com/blog/composer-2-5](https://cursor.com/blog/composer-2-5) and [the Composer 2 technical-report blog](https://cursor.com/blog/composer-2-technical-report) via `mcp_tavily_tavily_extract` (advanced); arXiv abstract pulls for the three footnote-1 papers; one Tavily secondary-source sweep (Jake Handy, DataCamp, Pulse2, Kingy, TechTalks).
|
| 5 |
+
> **Scope:** DELTAS ONLY vs `docs/COMPOSER_RECIPE_MAPPING.md` (2026-05-25), focused on (a) data generation / synthetic tasks / data mix / CPT data, and (b) targeted-textual-feedback / on-policy distillation. Verbatim blog text in the mapping doc is **not** re-derived here.
|
| 6 |
+
|
| 7 |
+
## TL;DR
|
| 8 |
+
|
| 9 |
+
The **2.5 blog body is byte-for-byte unchanged** from what the mapping doc captured — no edits to the three method sections. All deltas come from (1) the **Composer 2 technical-report blog**, which is now cited as the K2.5 base-model source and which the mapping doc only listed as a stub "verify if needed," and (2) tighter sourcing on the three self-distillation papers. Net: a handful of real new facts, two corrections, and confirmation of the central reproducibility gap (hint generation) as still unstated.
|
| 10 |
+
|
| 11 |
+
---
|
| 12 |
+
|
| 13 |
+
## 1. Data generation / synthetic tasks / data mix / CPT — deltas
|
| 14 |
+
|
| 15 |
+
**Blog 2.5 verbatim sentences on data-gen (re-confirmed, unchanged from mapping doc):**
|
| 16 |
+
- *"To continue increasing intelligence, we both select for and create harder tasks dynamically throughout the run."*
|
| 17 |
+
- *"Composer 2.5 is trained with 25x more synthetic tasks than Composer 2."*
|
| 18 |
+
- *"We use a range of approaches for creating synthetic tasks that are grounded in real codebases."*
|
| 19 |
+
- Feature-deletion paragraph + reward-hacking examples (Python type-checking cache; Java bytecode decompile) — verbatim as in mapping doc §2.
|
| 20 |
+
|
| 21 |
+
**DELTAS (not in / under-stated in COMPOSER_RECIPE_MAPPING.md):**
|
| 22 |
+
|
| 23 |
+
- **[DELTA — new emphasis]** The phrase *"we both **select for** and **create** harder tasks **dynamically throughout the run**"* is a **dynamic curriculum / online task-selection** signal. The mapping doc captured "Feature Deletion + 24 unnamed generators" but did **not** flag that task difficulty is filtered *online* (the model "begins to get most training problems correct," so hard tasks are up-weighted live). This is a data-*mix*/curriculum detail with direct replication impact: our generator suite needs a difficulty filter / pass-rate gate, not just a static task bank.
|
| 24 |
+
- **[DELTA — new authoritative source for CPT data mix]** The Composer 2 technical-report blog states the CPT data mix explicitly: *"continued pretraining on a data mix that **emphasizes code** to deepen the base model's coding knowledge"* and *"We find that **reducing pretraining loss improves downstream RL performance**, with better base knowledge reliably translating into a better agent."* The mapping doc marked "continued pretraining on heavily code-weighted data" as `[BLOG-VERIFIED]` from the 2.5 Muon section — but the **causal claim (CPT loss ↓ ⇒ RL performance ↑)** is new and is the stated *justification* for doing CPT at all. Relevant to our "skip CPT, start from Qwen3-Coder" decision: Cursor's own evidence says base-knowledge quality gates RL ceiling, which strengthens the case for starting from an already-code-tuned base.
|
| 25 |
+
- **[DELTA — new artifact]** There is now a **full Composer 2 arXiv technical report: [arXiv:2603.24477](https://arxiv.org/abs/2603.24477)** and a downloadable PDF at **`https://cursor.com/resources/Composer2.pdf`** (authored by Sasha Rush et al.). The report explicitly *"covers... ablations on the training recipe, our approach to agent behavior shaping, and the design of our evaluation suite."* The mapping doc cited only the blog stub and never the arXiv ID/PDF. **This PDF is the most likely place to resolve the data-mix weighting %, the RL algorithm name, and the hint-generation mechanism — none of which are in either blog.** → Recommend a dedicated follow-up extraction of Composer2.pdf.
|
| 26 |
+
- **[CONFIRM — "Anyrun"]** Mapping doc flagged "Anyrun" as possibly not Cursor-sourced. **Confirmed real:** the Composer 2 report blog says *"**Anyrun**, our internal compute platform for running hundreds of thousands of sandboxed coding environments."* It is a Composer-**2** artifact (carried into 2.5), correctly attributed. Resolves the mapping doc's open flag.
|
| 27 |
+
- **[NO CHANGE]** No data-mix percentages, token counts, generator inventory, or feature-deletion target-selection heuristic appear in either blog. Still the key data-gen reproducibility gap.
|
| 28 |
+
|
| 29 |
+
---
|
| 30 |
+
|
| 31 |
+
## 2. Targeted textual feedback / on-policy distillation — deltas
|
| 32 |
+
|
| 33 |
+
**Blog 2.5 verbatim (re-confirmed, unchanged):** the full "Targeted RL with textual feedback" section — hint→teacher / original-context→student / on-policy KL "for that turn only," applied "to a variety of model behaviors, from coding style to model communication." Matches mapping doc §1 exactly.
|
| 34 |
+
|
| 35 |
+
**DELTAS / sharper detail:**
|
| 36 |
+
|
| 37 |
+
- **[DELTA — verbatim mechanism nuance the mapping doc compressed]** The blog's causal sentence is more specific than the mapping doc's paraphrase: *"This hint **changes the probabilities for the teacher, lowering those for the wrong tool and increasing those for a valid replacement**. For that turn only, we then **update the student weights towards to the new probabilities**."* Two implementation facts to lift exactly: (i) the teacher distribution is the **hint-conditioned forward pass of the same weights** (not a re-rollout), and (ii) the **student weights are updated** (the KL is a gradient-bearing loss on the student, teacher is stop-grad). Mapping doc had this right conceptually; the verbatim confirms teacher = stop-grad, student = trainable.
|
| 38 |
+
- **[DELTA — secondary-source confirmation, not new fact]** Multiple write-ups (Pulse2, TechTalks) independently describe the mechanism identically ("injects a local textual hint... teacher distribution... student... on-policy KL"). No secondary source reveals **how the hint is generated** — the single most important replication gap from the mapping doc **remains unresolved** across all live sources. No source claims templates vs. LLM-judge vs. learned generator.
|
| 39 |
+
- **[DELTA — coverage breadth]** Blog explicitly lists the behavior targets as **coding style, tool use, and model communication** (Pulse2/DataCamp corroborate). Mapping doc noted style+communication; "tool use" as a distinct third target is worth recording for the v0.1 hint-template taxonomy.
|
| 40 |
+
- **[WATCH — likely secondary-source conflation, flag do-not-cite]** TechTalks (bdtechtalks, 2026-05-25) introduces an **"SDFT"** continued-pretraining self-distillation story ("model generates its own reasoning... distills its own generated logic... constrains weight shift... adapt to a company's coding style without forgetting"). **This is NOT in the Cursor blog** and conflates the footnote-1 *continual-learning* paper (2601.19897) with Composer's RL method. Treat as journalist embellishment, **not** Cursor-stated. Recorded here so a future reader doesn't mistake it for ground truth.
|
| 41 |
+
|
| 42 |
+
---
|
| 43 |
+
|
| 44 |
+
## 3. Footnote-1 self-distillation papers — arXiv IDs resolve, one-line each
|
| 45 |
+
|
| 46 |
+
All three IDs **resolve live** (2026-05-28). Note the mapping doc's footnote ordering differs from the blog's; blog footnote-1 lists them in this order:
|
| 47 |
+
|
| 48 |
+
| arXiv | Title | Resolves? | One-line core method (abstract-level) |
|
| 49 |
+
|---|---|---|---|
|
| 50 |
+
| **[2601.19897](https://arxiv.org/abs/2601.19897)** | *Self-Distillation Enables Continual Learning* | ✅ v1 | Self-distillation as a continual-learning regularizer — anchor updates to the model's own prior outputs to acquire new data without catastrophic forgetting. (Abstract body not exposed in listing; title-level only.) |
|
| 51 |
+
| **[2601.20802](https://arxiv.org/abs/2601.20802)** | *Reinforcement Learning via Self-Distillation* (**SDPO**) | ✅ v2 (sub 28 Jan, rev 16 Feb 2026) | **The direct formalization of Composer's method.** "SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy... converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model." |
|
| 52 |
+
| **[2601.18734](https://arxiv.org/abs/2601.18734)** | *Self-Distilled Reasoner: On-Policy Self-Distillation for LLMs* (**OPSD**) | ✅ v3, code at [github.com/siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD) | Single LLM; teacher = policy conditioned on privileged info, student = policy without it; per-token on-policy KL on the student's own rollouts. The original OPSD framework. |
|
| 53 |
+
|
| 54 |
+
**Corrections to the mapping doc:**
|
| 55 |
+
- **[CORRECTION — SDPO authorship]** Mapping doc says "Hübotter et al., 2026." **Confirmed and now precise:** Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, **Andreas Krause** (ETH Zürich group). 11 authors, CC-BY-4.0, v2 dated 2026-02-16.
|
| 56 |
+
- **[CORRECTION — SDPO abstract claim]** Mapping doc's quoted SDPO comparison-table framing ("environment / rich / on-policy") is a **reasonable gloss but not a verbatim abstract quote**. The actual abstract's strongest reproducibility-relevant claim is the **test-time** result: *"applying SDPO to individual questions at test time... achieving the same discovery probability as best-of-k... with 3x fewer attempts"* and that SDPO *"also outperforms baselines in standard RLVR environments that only return scalar feedback by using successful rollouts as implicit feedback for failed attempts."* The "successful-rollouts-as-implicit-feedback" trick is a **new lever** the mapping doc didn't capture — relevant if our hint generator is weak/absent (you can bootstrap hints from the model's own successful sibling rollouts). Benchmark: **LiveCodeBench v6**.
|
| 57 |
+
- **[CONFIRM — OPSD code]** `siyan-zhao/OPSD` link confirmed live in the arXiv comments field. The mapping doc's "lift the SDPO loss from OPSD (MIT)" plan stands; verify license on the repo directly (arXiv page shows CC-BY-4.0 for the *paper*, not necessarily the code).
|
| 58 |
+
|
| 59 |
+
---
|
| 60 |
+
|
| 61 |
+
## 4. RL algorithm / reward-hacking / behavioral reward — deltas
|
| 62 |
+
|
| 63 |
+
- **[NO CHANGE — RL algorithm name still absent]** Neither blog names the outer RLVR algorithm (PPO/GRPO/DAPO). Mapping doc's "PPO or GRPO variant `[EXTRAPOLATED]`" verdict stands. The Composer 2 report blog adds only that *"RL training improves both **average and best-of-K** performance, suggesting the model is learning new solution paths rather than just concentrating on known ones"* — an **algorithm-agnostic** observation (best-of-K ↑ implies exploration, not just sharpening). **arXiv:2603.24477 / Composer2.pdf is the place to find the algorithm name.**
|
| 64 |
+
- **[DELTA — reward-hacking mitigation, slightly sharper]** Blog wording: hacks were found *"using **agentic monitoring tools**"* and they *"demonstrate the increasing care necessary for large scale RL."* Still no specifics (no static-analysis/sandbox-lockdown detail). Mapping doc's "build the monitor in v0.1" plan is unaffected; no new implementation handle.
|
| 65 |
+
- **[DELTA — behavioral reward framing]** 2.5 blog: *"we improved behavioral aspects of the model like **communication style and effort calibration**. These dimensions are not well captured by existing benchmarks, but we find that they matter for real-world usefulness."* Mapping doc captured this as `[BLOG-VERIFIED]`. Delta: the blog **strongly implies** (via "we applied this method to a variety of model behaviors, from coding style to model communication") that **behavioral rewards are trained via the targeted-textual-feedback channel itself**, not a separate RM. The Composer 2 report blog also promises *"our approach to **agent behavior shaping**"* as a report section → another reason to pull Composer2.pdf.
|
| 66 |
+
- **[DELTA — net-new context, out of scope but record]** 2.5 blog adds a **SpaceXAI / Colossus 2** paragraph: *"Together with SpaceXAI, we're training a significantly larger model from scratch, using **10x more total compute**. With Colossus 2's **million H100-equivalents**..."* This is a *future from-scratch* model, **not** Composer 2.5's recipe — irrelevant to replication but absent from the mapping doc; recorded so it isn't mistaken for a 2.5 training fact.
|
| 67 |
+
|
| 68 |
+
---
|
| 69 |
+
|
| 70 |
+
## Action items surfaced by this delta pass
|
| 71 |
+
|
| 72 |
+
1. **Pull `https://cursor.com/resources/Composer2.pdf` (arXiv:2603.24477)** — highest-value unread artifact; likely resolves data-mix %, RL-algo name, hint-generation, behavior-shaping. (Recommend a dedicated subagent.)
|
| 73 |
+
2. **Add an online difficulty filter / pass-rate gate** to the synthetic-task generator plan (the "select for... dynamically" delta), not just a static bank.
|
| 74 |
+
3. **Record the SDPO "successful-rollout-as-implicit-feedback" trick** as a hint-bootstrapping fallback for v0.1 when no external hint source exists.
|
| 75 |
+
4. **Update mapping doc citations**: resolve the Anyrun flag (confirmed), add arXiv:2603.24477 + Composer2.pdf, correct SDPO author list, add LiveCodeBench-v6 as SDPO's eval, and append a do-not-cite note on the TechTalks "SDFT" conflation.
|
| 76 |
+
5. Hint-generation mechanism **remains the #1 reproducibility gap** — unresolved by every live source checked.
|
|
@@ -0,0 +1,136 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Composer 2 Technical Report — Mining Notes (arXiv:2603.24477)
|
| 2 |
+
|
| 3 |
+
> **Extraction date:** 2026-05-28.
|
| 4 |
+
> **Primary source:** Full text of the **Composer 2 Technical Report** (Cursor Research Team; corresponding author Alexander M. "Sasha" Rush), PDF at `https://cursor.com/resources/Composer2.pdf` and arXiv `2603.24477` (v1 25 Mar 2026, v2 26 Mar 2026; cs.SE / cs.LG; "Aaron Chan and 53 other authors").
|
| 5 |
+
> **Method:** `mcp_tavily_tavily_extract` (advanced) on the PDF returned the **complete report body incl. References + Appendices A–C** (~148 KB). Cross-checked against `mcp_exa_crawling_exa` (full re-pull, identical text) and a `mcp_tavily_tavily_search` confirming the arXiv ID, abstract, "Dr. GRPO" passage, and the technical-report blog.
|
| 6 |
+
> **Tagging:** **[REPORT-VERIFIED]** = verbatim/paraphrase from the arXiv report. **[SECONDARY]** = blog/third-party. **[ABSENT]** = explicitly looked for, not in the report.
|
| 7 |
+
> **Scope note:** This report is **Composer 2**, not Composer **2.5**. Several recipe items the 2.5 blog advertises (targeted textual-feedback/hint distillation, "25× synthetic tasks", Sharded Muon) are **not** in this document — see §3 and the corrections box.
|
| 8 |
+
|
| 9 |
+
---
|
| 10 |
+
|
| 11 |
+
## TL;DR — did it resolve the three open questions?
|
| 12 |
+
|
| 13 |
+
| Open question (from delta note 09) | Resolved? | Answer |
|
| 14 |
+
|---|---|---|
|
| 15 |
+
| **RL algorithm NAME** | ✅ **YES** | A multi-sample policy-gradient (GRPO-family) algorithm built explicitly on **Dr. GRPO** [34]: GRPO with the **length-standardization term removed** and **no std-dev advantage normalization**. Optimizer = **Adam**, single-epoch, fixed group size, full-parameter. KL via the **k1 estimator (−log r)**. |
|
| 16 |
+
| **Data-mix weighting % / generator inventory / token counts** | ⚠️ **PARTIAL** | CPT is a **3-phase code-dominated mix** (32k → 256k → SFT) but **no %s and no token counts** are given. RL task mix is given only as a **category histogram (Fig. 3)**, not generator names or weights. No "Feature Deletion" generator inventory (that was 2.5-blog). |
|
| 17 |
+
| **HINT-generation mechanism (targeted textual feedback)** | ❌ **ABSENT** | **The hint/teacher-student textual-feedback mechanism is NOT in the Composer 2 report at all.** It is a Composer **2.5** feature. Composer 2 shapes behavior with **auxiliary scalar rewards + a nonlinear length penalty**, not hint distillation. The #1 reproducibility gap remains unresolved by this artifact. |
|
| 18 |
+
|
| 19 |
+
**Net:** The report fully answers the RL-algorithm question (the single biggest win), partially answers data-mix, and does **not** touch hint generation. It also delivers a large amount of previously-unstated **infrastructure** detail (Anyrun internals, async RL stack, MoE router replay, precision recipe) and a **correction** to two prior assumptions (optimizer is Adam not Muon; base is Kimi K2.5 1.04T/32B).
|
| 20 |
+
|
| 21 |
+
---
|
| 22 |
+
|
| 23 |
+
## 1. Data generation / CPT data-mix / curriculum [§3, §4]
|
| 24 |
+
|
| 25 |
+
### 1.1 Continued pretraining (CPT) — [REPORT-VERIFIED]
|
| 26 |
+
- Base model = **Kimi K2.5** [67], a **1.04T-param / 32B-active MoE** (Appendix B; selected over GLM-5 and DeepSeek V3.2 on internal *FreshBench* knowledge, *State Tracking* (LoCoDiff-style), and *codebase perplexity*; agentic benchmarks **deliberately excluded** from base-model selection "as agentic and long-horizon capabilities can drastically change during the RL stage").
|
| 27 |
+
- CPT is **"a large code-dominated data mix"** done in **three phases**:
|
| 28 |
+
1. **Bulk of compute at 32k sequence length**,
|
| 29 |
+
2. a shorter **long-context extension phase to 256k**,
|
| 30 |
+
3. a short **SFT phase on targeted coding tasks**.
|
| 31 |
+
- Training: **MXFP8 on NVIDIA B300s**, **AdamW** optimizer. Eval loss on internal codebase **"decreases log-linearly"** over the run.
|
| 32 |
+
- **Causal CPT→RL claim (the justification for doing CPT):** they replicate the recipe on **Qwen3-Coder-30B-A3B** at **three log-spaced compute levels (small/medium/large)**, each + identical SFT + identical RL run, and show **"cross-entropy loss is … predictive of downstream RL performance"** (Fig. 2). → Direct support for our "start from an already-code-strong base" decision.
|
| 33 |
+
- **Multi-Token Prediction (MTP):** extra MTP layers [17,11] trained from scratch on the same mix for speculative decoding, via **self-distillation** to the main LM head's logits; MTP layers cut from the **middle** of the CPT run and trained jointly during the long-context + SFT phases. *(This is the only "self-distillation" in the report — it is for MTP/spec-decode, NOT for hints.)*
|
| 34 |
+
- **[ABSENT]** No data-mix percentages, no token/byte counts, no list of CPT data sources.
|
| 35 |
+
|
| 36 |
+
### 1.2 RL task distribution & dynamic curriculum — [REPORT-VERIFIED]
|
| 37 |
+
- RL tasks **"run in environments that emulate real Cursor sessions as closely as possible."** Problem distribution **"reflects the most common use cases"**; **Fig. 3** gives the category breakdown (x-axis "% of Problems", ~0–40%): **Iterate On Feature, Debugging, New Feature, Refactor, Understanding Codebase, Documentation, Testing, Code Review, Optimize, Devops, Migration, Deletion, Other.** *(This is the closest the report gets to a "data mix" — categorical, not weighted %s, no generator names.)*
|
| 38 |
+
- **Dynamic difficulty curriculum (verbatim):** *"In later stages of training, we use simple heuristics—such as **number of turns and thinking tokens of rollouts**—to **upsample increasingly harder data points**."* → Confirms delta note 09's "select for harder tasks dynamically" as an **online up-sampling gate keyed on turns + thinking-token count**. Replication handle: rank tasks by rollout length/turn-count, up-weight the long-tail late in training.
|
| 39 |
+
- **[ABSENT]** No synthetic-task **generator inventory** (no "Feature Deletion" et al.), no "25× synthetic tasks" figure, no synthetic-vs-real split. Those are Composer **2.5**-blog claims and are **not** in this report.
|
| 40 |
+
|
| 41 |
+
---
|
| 42 |
+
|
| 43 |
+
## 2. RL ALGORITHM [§4.1] — [REPORT-VERIFIED], the headline result
|
| 44 |
+
|
| 45 |
+
**Algorithm family:** *"a policy gradient algorithm with multiple samples per prompt [53 = DeepSeekMath/GRPO, 2 = REINFORCE-style RLOO] and a fixed group size."* Operates in the **single-epoch regime** (a prompt is **never trained on twice**). **Adam** optimizer; **full-parameter** update. Highly **asynchronous** (independent train + rollout workers).
|
| 46 |
+
|
| 47 |
+
**Specific GRPO modifications (the "name" + the deltas):**
|
| 48 |
+
- Built on **Dr. GRPO** [34 = Liu et al., *Understanding R1-Zero-like training*, arXiv 2503.20783]: verbatim *"As in Dr. GRPO, … crucial to minimize the bias in the gradients that can arise from transforming the underlying advantage."*
|
| 49 |
+
- **Remove the length-standardization term from GRPO** (it "introduces a length bias").
|
| 50 |
+
- **Do NOT normalize group advantages by their standard deviation** — std-norm "results in the degenerate case where small behavioral differences get massively upweighted within a group where every rollout achieves equal correctness."
|
| 51 |
+
- **Overlong-rollout masking [78 = DAPO/Yu et al.]: NOT used.** They *"did not see benefits with overlong masking at small scale and opted not to mask rollouts that exceed the maximum sequence length"*; the self-summary system limits overlong cases anyway. *(So: Dr. GRPO-style, explicitly NOT DAPO's overlong masking; DAPO [78] and GSPO [82] are cited but as related work / for router-replay, not adopted wholesale.)*
|
| 52 |
+
|
| 53 |
+
**KL regularization — exact formulation [§4.1, Fig. 4]:**
|
| 54 |
+
- Uses **KL(q‖p) = E_{x∼q}[−log r(x)], r(x)=p(x)/q(x)** for regularization (like DeepSeekMath [53] and Kimi k1.5 [66]).
|
| 55 |
+
- **Chooses the k1 estimator `k1 = −log r`** over the popular **k3 = (r−1) − log r** [Schulman 52], because (citing Amini et al. [6]) k3's variance "increases drastically as p and q diverge" — at large KL the k3 estimate variance is "extremely large." (k2 is unbiased-ish but biased per their note.) → **Replication handle: use the simple `−log r` KL penalty, not the k3 unbiased estimator, for agentic long-horizon RL.**
|
| 56 |
+
|
| 57 |
+
**Async-rollout infra / off-policy control [§4.1, §6.2]:**
|
| 58 |
+
- Minimize off-policyness via **fast weight sync + in-flight (mid-rollout) weight updates**, *"similar to **PipelineRL** [48]"* — inference workers update weights mid-rollout so later tokens are less off-policy.
|
| 59 |
+
- **MoE router replay [38, 82]:** inference engine returns selected expert indices per token per MoE layer; training forward pass **overrides the router's expert assignment to match** (router still computes gating scores so gradients flow). They **extend** replay by **filtering replayed experts whose gating scores fall below a plausibility threshold from the router's own top-k, replacing them with the router's candidates** — reduces p99 numerics mismatch between inference and training forward passes. *(Critical for MoE-base RL stability; directly relevant if we RL a MoE.)*
|
| 60 |
+
|
| 61 |
+
**Reward structure [§4.1–4.2]:**
|
| 62 |
+
- Reward based on **"code's correctness, succinctness, and conformance to software engineering principles."**
|
| 63 |
+
- **best-of-K does NOT trade off vs average:** both rise together over training (Fig. 5) → RL is *expanding* solution coverage, not just sharpening (notable vs the "RL only concentrates mass" literature [79,32,8,74,61]).
|
| 64 |
+
|
| 65 |
+
**Reward-hacking safeguards — [ABSENT/THIN]:** This report does **not** contain the Python-typecheck-cache / Java-bytecode reward-hack anecdotes (those are 2.5-blog). The only related safeguards here are **strict tool-argument checks** and **tool removal for steerability** in training environments (§6.2), and general monitoring for **emergent behaviors** (§4.2). No dedicated "agentic monitoring tool" section.
|
| 66 |
+
|
| 67 |
+
---
|
| 68 |
+
|
| 69 |
+
## 3. Targeted textual feedback / hint distillation — **[ABSENT]**
|
| 70 |
+
|
| 71 |
+
**Finding: The Composer 2 technical report contains NO hint-generation / teacher-student textual-feedback / on-policy KL-to-hint-conditioned-teacher mechanism.** Searched the full text for hint / teacher / student / textual feedback / distill — the only "distillation" is **MTP self-distillation to the LM head's logits** (§3.1, spec-decode), unrelated to behavior shaping.
|
| 72 |
+
|
| 73 |
+
**What Composer 2 does for behavior shaping instead [§4.2 "Agent Behavior"] — [REPORT-VERIFIED]:**
|
| 74 |
+
- **Auxiliary scalar rewards**, not hints: *"we apply an array of auxiliary rewards … rewards for coding style, communication, and product-specific penalties for poor tool calls, such as creating to-do list items and then leaving them unfinished."*
|
| 75 |
+
- **Reactive reward addition:** they "monitor the model for emergent behaviors and occasionally introduce additional behavior rewards" (examples observed: leaving long CoT in code comments; collapsing to terminal-tool-only).
|
| 76 |
+
- **Nonlinear length / effort penalty (exact equation):**
|
| 77 |
+
`C_length{k,q}(x) = ((1 + k·x)^{1−q} − 1) / (k·(1−q))`, concave-down & increasing, where **x = a weighted combination of {thinking tokens, tool-calling tokens, tool-output tokens, final-message tokens, # tool calls, # turns}** and `k, q` are curvature hyperparameters (Fig. 6). Goal: be quick on easy tasks, think longer on hard tasks; observed to induce **parallel tool calls**.
|
| 78 |
+
- **Self-Summarization [§4.1, from Composer 1.5 [64]]:** rollouts are chains joined by self-summaries; **final reward is assigned to all tokens in the chain** (up-weights good agent turns *and* the summaries that enabled them; down-weights lossy summaries). Reduces error vs prompt-based compaction while using fewer tokens and reusing KV cache.
|
| 79 |
+
|
| 80 |
+
> **Implication for the replication framework:** To reproduce Composer **2.5**'s hint mechanism we still must look elsewhere — the **SDPO (arXiv 2601.20802) / OPSD (2601.18734)** papers from delta note 09 remain the only formalizations, and **how Cursor generates the hint text itself is still unstated in every Cursor artifact.** Composer 2's behavior shaping (auxiliary rewards + the length-penalty equation above) is a **fully reproducible, hint-free alternative** we can adopt for v0.1.
|
| 81 |
+
|
| 82 |
+
---
|
| 83 |
+
|
| 84 |
+
## 4. Other replication-relevant detail [§6 Infrastructure, §5 CursorBench, App.] — [REPORT-VERIFIED]
|
| 85 |
+
|
| 86 |
+
**Optimizer — CORRECTION:** report says **AdamW (CPT) / Adam (RL)**. **There is NO "Sharded Muon" in the Composer 2 report** — the Muon claim came from the 2.5 blog and should be tagged 2.5-only / re-verified, not assumed for Composer 2.
|
| 87 |
+
|
| 88 |
+
**Parallelism / sharding layout — CORRECTION to "HSDP":**
|
| 89 |
+
- Prior stacks used **FSDP + EP + TP** (EP coupled to TP). **Composer 2 decouples EP from TP** and uses **Context Parallelism (CP)** as the primary long-context axis (less comm than TP; CP folded into the FSDP dim). **No mention of "HSDP"** — the doc says **FSDP/ZeRO [50,81] + CP + decoupled EP**, **DeepEP** [80] for token dispatch/combine.
|
| 90 |
+
- **Exact degrees:** **EP=8, CP=2 for CPT**; **EP=8, CP=8 for RL.** MLA attention with latent-vector all-gather trick; Llama-style 2×CP chunk load-balancing [33].
|
| 91 |
+
- **Global sequence packing** before each RL step to balance DP compute across variable-length rollouts (accounts for quadratic attention cost).
|
| 92 |
+
|
| 93 |
+
**Precision recipe [§6.1]:**
|
| 94 |
+
- **MoE forward = a novel NVFP4 variant**: BF16→FP4E2M1 with **FP8E4M3 per-block scales (block 16) + FP32 per-token scales** (per-tensor FP32 scales were "fragile" → batch-variance collapse + future-token leakage/biased grads). **MoE backward = standard MXFP8** (FP8E4M3 values, FP8E8M0 scales per 32-elt block) — afford higher precision since backward runs only on the train cluster. Trainer forward must **numerically match inference** for stability. IEEE `__fdiv_rn` **critical** for NVFP4 (fast-approx diverges ~100 RL steps); fast-approx OK for MXFP8.
|
| 95 |
+
- Kernels in **CUDA/PTX/ThunderKittens-ParallelKittens** [56,59]; FA4 backward (DeepSeek QK192/V128 shapes) co-developed w/ Colfax; GEMMs open-sourced into ThunderKittens [21].
|
| 96 |
+
|
| 97 |
+
**RL infra [§6.2] — 4 decoupled services (training / environments / inference / evals):**
|
| 98 |
+
- **Training:** fully async on **Ray [42] + PyTorch**, centralized **reconciler** w/ slot-based sample lifecycle + staleness-balancing scheduler; **futures**-based eager exec; Ray object store w/ NVMe spill; fault-tolerant to process-group level, warm-standby nodes, live code updates; **policy-aware rollout-level + group-level checkpointing** (codebase memory snapshots; advantage-tagged sequences w/ policy versions to NFS). Production run spanned **3 GPU regions + 4 CPU regions.**
|
| 99 |
+
- **Anyrun (verbatim internals):** *"an internal compute platform built for running untrusted code at scale … the same platform that powers Cloud Agents and Automations."* Global router → multiple Anyrun clusters; each cluster schedules **>500 pods/sec**, manages **hundreds of thousands of pods/cluster**; **each pod = a dedicated Firecracker VM** (full dev env incl. browser/GUI for computer use); x86+ARM mix; pressure-aware bin-packing. **Forking & snapshotting at filesystem + memory level** (→ mid-trajectory checkpoint, post-rollout introspection); same-node fork preferred else live-migrate. **Anygress** egress proxy (TCP-layer redirect via injected root CA, header stripping). **Shadow deployment of the Cursor backend** for faithful tools; tools dynamically per-environment (stricter arg checks / tool removal in training).
|
| 100 |
+
- **Inference:** **partner = Fireworks AI.** Every step, weights synced to inference via **S3 with per-rank delta compression** (RL diffs compress to "a handful of GB" for the 1T model); sharded upload/download; geo-distributed US+EU clusters reconstruct from the shared delta chain (no direct train↔inference connectivity).
|
| 101 |
+
- **Online evals:** pinned production backend + Cursor client per eval job; lease an eval deployment, move GPUs, cross-region weight sync.
|
| 102 |
+
|
| 103 |
+
**CursorBench (eval-suite design) [§5]:**
|
| 104 |
+
- Internal suite from **real Cursor engineering-team agent sessions** (avoids train-set contamination). Motivated by 4 failure modes of public benchmarks (domain mismatch, prompt over-specification, contamination/overfit, narrow scope).
|
| 105 |
+
- **Quantified hardness vs public sets:** median **181 lines changed** (vs 7–10 for SWE-bench Verified/Multilingual) and median prompt length **390 chars** (vs 1,185–3,055) → larger + more under-specified. Versioned (**CursorBench-3** > 2× the median task size of v1; Table 1 uses CursorBench-3).
|
| 106 |
+
- **Targeted sub-evals:** intent, instruction-following, **eager-editing** (don't edit when you shouldn't), code-quality (LLM-judge rubrics), **interruption** (mid-rollout user feedback). Built by "identifying dimensions, selecting eliciting data points, writing rubrics."
|
| 107 |
+
- **Headline results (Table 1):** Composer 2 = **CursorBench 61.3 / SWE-bench Multilingual 73.7 / Terminal-Bench 61.7**; Kimi K2.5 base = 36.0 / 65.1 / 47.3 → large RL+CPT lift.
|
| 108 |
+
|
| 109 |
+
**Ablations actually present (for "ablations on the training recipe"):**
|
| 110 |
+
1. **CPT→RL** (Qwen3-Coder-30B, 3 compute levels; Fig. 2) — CE loss predicts RL reward.
|
| 111 |
+
2. **KL estimator** k1 vs k3 (Fig. 4) — variance argument for k1.
|
| 112 |
+
3. **GRPO term removals** — length-standardization & std-norm removed (qualitative justification, no head-to-head curve).
|
| 113 |
+
4. **Overlong masking** — tried, no benefit at small scale, dropped.
|
| 114 |
+
5. **NVFP4 scaling scheme** (per-token vs per-tensor) and **IEEE vs fast-approx division** — stability ablations.
|
| 115 |
+
6. **best-of-K vs average** over training (Fig. 5).
|
| 116 |
+
*(No single consolidated "leave-one-out recipe component" ablation table; ablations are distributed and partly qualitative.)*
|
| 117 |
+
|
| 118 |
+
---
|
| 119 |
+
|
| 120 |
+
## Corrections / cautions for the mapping doc
|
| 121 |
+
|
| 122 |
+
- **[CORRECTION] Optimizer:** Composer **2** uses **Adam/AdamW**, **not Muon**. Treat "Sharded Muon" as a **2.5-blog-only, unverified-for-2** claim.
|
| 123 |
+
- **[CORRECTION] Sharding:** report describes **FSDP+CP+decoupled-EP (EP=8/CP=2 CPT, EP=8/CP=8 RL)**, **not "HSDP."**
|
| 124 |
+
- **[CORRECTION] "RL algorithm = PPO/GRPO `[EXTRAPOLATED]`"** → now **[REPORT-VERIFIED] Dr. GRPO-style** (length-std removed, no std-norm, k1 KL, Adam, single-epoch, MoE router-replay). DAPO overlong-masking explicitly rejected.
|
| 125 |
+
- **[CONFIRM] Anyrun** real, with full internals (Firecracker VMs, >500 pods/s, fork/snapshot, Anygress).
|
| 126 |
+
- **[CONFIRM] base model = Kimi K2.5 1.04T/32B** (over GLM-5, DeepSeek V3.2).
|
| 127 |
+
- **[CAUTION] Hint mechanism, "25× synthetic tasks", Feature-Deletion generator, reward-hack anecdotes are NOT in this (Composer 2) report** — do not cite this PDF for them; they are Composer 2.5-blog material.
|
| 128 |
+
|
| 129 |
+
---
|
| 130 |
+
|
| 131 |
+
## Sources
|
| 132 |
+
|
| 133 |
+
- **[PRIMARY, REPORT-VERIFIED]** Cursor Research Team, *Composer 2 Technical Report*, arXiv:**2603.24477** (v1 2026-03-25, v2 2026-03-26; cs.SE/cs.LG; corr. Alexander M. Rush). Full text via PDF `https://cursor.com/resources/Composer2.pdf` (Tavily advanced extract, full body+refs+App. A–C) and cross-checked via Exa full crawl (identical). HTML/TeX also available at `https://arxiv.org/abs/2603.24477`, `https://arxiv.org/pdf/2603.24477`.
|
| 134 |
+
- **[SECONDARY]** Cursor blog, *A technical report on Composer 2* (Sasha Rush) — `https://cursor.com/blog/composer-2-technical-report` (abstract-level; confirms Kimi K2.5 base + CPT-loss→RL claim).
|
| 135 |
+
- **[CONTEXT]** Key cited methods: Dr. GRPO (Liu et al., arXiv 2503.20783 [34]); DAPO (Yu et al. [78], 2503.14476/NeurIPS'25); GSPO (Zheng et al., 2507.18071 [82]); DeepSeekMath/GRPO [53]; PipelineRL (2509.19128 [48]); MoE router alignment (Ma et al., 2510.11370 [38]); KL-estimator variance (Amini et al. [6]); Schulman KL note [52]; DeepEP [80]; ThunderKittens/ParallelKittens [56,59].
|
| 136 |
+
- **Prior internal note:** `research/09-composer-blog-delta-2026.md` (read first; this note discharges its action item #1 and supplies corrections to the RL-algorithm/optimizer/sharding rows of `docs/COMPOSER_RECIPE_MAPPING.md`).
|