Codeseys committed on
Commit
6049d00
·
1 Parent(s): 7090729

research: Composer 2.5 data-gen + targeted-textual-feedback deep-research wave

Phase-3 of the deep-work-loop bringing Composer 2.5's dataset-generation and
targeted-RL-with-textual-feedback methods into the framework. 5 new research
docs (130KB), all with primary-source citations:

- 06-feature-deletion-datagen.md: Feature Deletion env design (online pass-rate
difficulty gate, reward-hacking safeguards, 5 OSS substrates w/ HF ids +
licenses, deletion mechanics, FeatureDeletionEnv + TRL reward_fn adapter).
- 07-sdpo-hint-generator.md: layered HintGenerator (template -> raw-error ->
LLM-judge -> introspection -> learned -> SDPO sibling-bootstrap), actual
template strings, judge prompt, slots into existing CollatorConfig hook.
- 08-sdpo-grpo-integration.md: ComposerGRPOTrainer(GRPOTrainer) design adding
SDPO KL at error turns on top of Dr. GRPO; PRIME-RL recipe blocked (log-probs
only), TRL subclass is the host; CPU smoke plan.
- 09-composer-blog-delta-2026.md: blog re-extraction delta; found the Composer 2
arXiv tech report; SDPO successful-rollout-as-implicit-feedback lever.
- 10-composer2-techreport-mining.md: arXiv:2603.24477 mined. RESOLVED RL algo =
Dr. GRPO (length-std removed, no std-norm, k1 KL, Adam, single-epoch, MoE
router-replay). Hint-gen confirmed ABSENT from every Cursor artifact ->
SDPO/OPSD reconstruction is the only path. Corrections: optimizer Adam not
Muon; sharding FSDP+CP+decoupled-EP not HSDP. Hint-free behavior-shaping
alternative (aux scalar rewards + nonlinear length/effort penalty eq).

research/06-feature-deletion-datagen.md ADDED
@@ -0,0 +1,346 @@
1
+ # Feature-Deletion Data Generation → `FeatureDeletionEnv` Design Brief
2
+
3
+ > **Author date:** 2026-05-28.
4
+ > **Scope:** Turn Composer 2.5's *Feature Deletion* synthetic-task approach (component **#2 "Synthetic data at 25× scale"**, mapping row **(b)**, reward-hacking row **(g)**) into a real, usable data-generation subsystem for this framework. This is the design brief that mapping-table row (b) calls for ("Build 1 generator (Feature Deletion) as OpenEnv-compatible env").
5
+ > **Method:** Live blog re-extraction (`mcp_tavily_tavily_extract` advanced) of [cursor.com/blog/composer-2-5](https://cursor.com/blog/composer-2-5); substrate-dataset cards pulled live from HF/arXiv; TRL `GRPOTrainer` reward-fn convention confirmed against the [TRL source](https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_trainer.py).
6
+ > **Tag convention** (matches `docs/COMPOSER_RECIPE_MAPPING.md`): **`[BLOG-VERIFIED]`** = verbatim in the 2.5 blog; **`[INFERRED]`** = reasonable extrapolation from blog + open-source prior art; **`[EXTRAPOLATED]`** = our design addition, not Cursor-stated.
7
+ > **Reads-before:** `docs/COMPOSER_RECIPE_MAPPING.md` (§2, rows b/g) and `research/09-composer-blog-delta-2026.md` (online-curriculum delta). This file does **not** re-derive the Targeted-RL / SDPO material (that is row (d) and `research/05`); it is the data-gen side only.
8
+
9
+ ---
10
+
11
+ ## 0. TL;DR
12
+
13
+ Feature Deletion is a **self-verifying inverse task**: take a repo whose test suite passes, *programmatically remove* a testable feature (so the suite now fails), and reward the agent for reimplementing it until the suite passes again. The reward is the pre-existing test suite — **verifiable, no human labels, no golden patch needed at reward time**. We can stand this up immediately on five open substrates (SWE-Gym, SWE-bench-Lite, R2E-Gym, SWE-rebench, OpenHands/Nemotron trajectories) by *inverting* their `(repo, base_commit, gold_patch, test_patch)` tuples instead of generating deletions from scratch. The two non-obvious requirements the blog forces on us: (1) an **online pass-rate difficulty gate** (the curriculum is dynamic, not a static bank — per the delta note), and (2) **anti-reward-hacking sandboxing** because Cursor observed the model recovering deleted signatures from bytecode/type-check caches. Below: a `FeatureDeletionEnv` Gym/OpenEnv class sketch wired for TRL `GRPOTrainer` (reward = test pass-fraction), the deletion mechanics (AST/file/coverage-mapped), the sandbox lockdown spec, and a CPU-pool cost model.
14
+
15
+ ---
16
+
17
+ ## 1. What "Feature Deletion" is, exactly `[BLOG-VERIFIED]`
18
+
19
+ Verbatim from the Synthetic-data section of the blog (re-pulled 2026-05-28):
20
+
21
+ > *"During RL training, Composer's coding ability improves substantially to the point where it begins to get most training problems correct. To continue increasing intelligence, **we both select for and create harder tasks dynamically throughout the run**. Composer 2.5 is trained with **25x more synthetic tasks** than Composer 2.*
22
+ >
23
+ > *We use a range of approaches for creating synthetic tasks that are grounded in real codebases. For example, one synthetic approach is **feature deletion**. For these tasks the agent is given a codebase with a large set of tests, and asked to **delete code and files in such a way that the codebase remains functional while specific testable features are removed**. The synthetic task is to **reimplement the feature, and the tests are used as a verifiable reward**."*
24
+
25
+ **Parse of the mechanism (note the two-agent / two-phase structure the blog implies):**
26
+
27
+ | Phase | Actor | Action | Output |
28
+ |---|---|---|---|
29
+ | **Deletion (task-construction)** | a *deleter* (model or program) | "delete code and files in such a way that the codebase remains functional while specific testable features are removed" | a `broken_repo` + the set of tests that now fail |
30
+ | **Reimplementation (the training task)** | the *policy under training* | reimplement the deleted feature | a diff scored by the test suite |
31
+
32
+ - The deletion step is itself non-trivial: it must keep the codebase *otherwise functional* (imports resolve, unrelated tests still pass) while making **specific testable features** fail. `[BLOG-VERIFIED]` that this constraint exists; `[INFERRED]` that in practice this means *partition the test suite into a kept set (`PASS_TO_PASS`) that must still pass and a target set (`FAIL_TO_PASS`) that the deletion must break.*
33
+ - **Verifiable reward = the original test suite.** No golden patch is needed at reward time (only at task-construction time, to know what "done" looks like). This is the key property that makes the env cheap to run in an RL loop.
34
+ - The blog does **not** state: how deletion targets are *selected*, the deleter model, the languages beyond the Python/Java implied by the reward-hacking examples, or the difficulty heuristic. Those are the reproducibility gaps (consistent with `research/09` §1 "NO CHANGE" line).
35
+
36
+ **Relationship to the inverse-of-SWE-bench framing** `[INFERRED]`: A SWE-bench-style instance is `(repo@base_commit, problem_statement, gold_patch, test_patch, FAIL_TO_PASS, PASS_TO_PASS)`. Feature Deletion is the *constructive inverse*: instead of mining a human PR that fixed a bug, we **apply `revert(gold_patch)` (or an AST-deletion) to a passing repo** to manufacture the broken state, then ask the agent to re-derive `gold_patch`. This means **every existing SWE-* instance is already a ready-made Feature-Deletion task** — we get the deletion "for free" by reverting the gold patch. This is the single most important leverage point in this brief (see §4, §5).
37
+
38
+ ---
39
+
40
+ ## 2. The online difficulty curriculum `[BLOG-VERIFIED]` framing → `[EXTRAPOLATED]` design
41
+
42
+ The delta note (`research/09` §1, DELTA-new-emphasis) is explicit that *"select for and create harder tasks dynamically throughout the run"* is a **dynamic curriculum / online task-selection** signal, not a static bank. Our generator must therefore expose a **pass-rate gate**, not just emit tasks. Design:
43
+
44
+ **Difficulty signal.** For each candidate task `t`, maintain an EMA of the policy's group pass-rate `p̂(t)` (TRL GRPO already samples `G` completions per prompt — we get `G` pass/fail observations per task per step *for free*). Define difficulty `d(t) = 1 − p̂(t)`.
45
+
46
+ **Two levers the blog names — "select for" and "create":**
47
+
48
+ 1. **`select for` (online filter / replay weighting)** `[EXTRAPOLATED]`. Sampling weight over the task pool:
+    - **Drop solved tasks:** if `p̂(t) > τ_easy` (e.g. 0.9) for `k` consecutive evaluations, retire `t`. This is exactly the blog's "begins to get most training problems correct" symptom.
+    - **Drop impossible tasks:** if `p̂(t) < τ_hard` (e.g. 0.02) after `k` exposures, quarantine `t` (likely a broken-task or reward-hack-only task, see §3).
+    - **Up-weight the frontier:** sample weight `w(t) ∝ p̂(t)·(1−p̂(t))` (max variance ≈ max learning signal; standard curriculum-RL choice, cf. PLR / TD-error curricula). Keeps the policy on tasks it solves ~50% of the time.
+ 2. **`create` (difficulty escalation)** `[EXTRAPOLATED]`. When the pool's median `p̂` rises above a band, the *generator* produces harder tasks. Concrete escalation knobs, easiest→hardest:
+    - **Deletion span:** single-function → whole-class → whole-file → cross-file feature (more `FAIL_TO_PASS` tests, more LoC to reconstruct).
+    - **Hint starvation:** strip docstrings / type hints / the deleted function's signature from the surrounding context (also a reward-hack-surface reduction, §3).
+    - **Coupling:** delete a feature that several `PASS_TO_PASS` tests *also* exercise, so the agent must reconstruct it without breaking neighbors.
+    - **Multi-feature:** delete `n>1` independent features in one repo (reward = fraction of target tests passing, so it is naturally graded).
57
+
58
+ **Implementation handle (curriculum is a thin layer over the task pool):**
59
+
60
+ ```python
+ # datagen/curriculum.py  [EXTRAPOLATED]
+ import collections
+ import random
+
+ class PassRateCurriculum:
+     """Online difficulty gate. Fed (task_id, n_pass, n_total) after each GRPO group."""
+     def __init__(self, tau_easy=0.90, tau_hard=0.02, ema=0.3, retire_k=3):
+         self.p = collections.defaultdict(lambda: 0.5)   # EMA pass-rate, neutral prior
+         self.seen = collections.Counter()                # number of group updates per task
+         self.retired, self.quarantined = set(), set()
+         self.tau_easy, self.tau_hard, self.ema, self.retire_k = tau_easy, tau_hard, ema, retire_k
+
+     def update(self, task_id, n_pass, n_total):
+         r = n_pass / max(n_total, 1)
+         self.p[task_id] = (1 - self.ema) * self.p[task_id] + self.ema * r
+         self.seen[task_id] += 1
+         if self.seen[task_id] >= self.retire_k:
+             if self.p[task_id] > self.tau_easy:
+                 self.retired.add(task_id)
+             elif self.p[task_id] < self.tau_hard:
+                 self.quarantined.add(task_id)            # likely broken / hack-only
+
+     def weight(self, task_id):
+         if task_id in self.retired or task_id in self.quarantined:
+             return 0.0
+         p = self.p[task_id]
+         return p * (1 - p) + 1e-3                        # frontier (max-variance) weighting
+
+     def sample(self, task_ids, k):
+         live = [t for t in task_ids if self.weight(t) > 0]
+         w = [self.weight(t) for t in live]
+         return random.choices(live, weights=w, k=k) if live else []
+
+     def median_pass(self, task_ids):                     # escalation trigger for the generator
+         # Only live tasks count: quarantined tasks (p ≈ 0) would otherwise drag the median down.
+         vals = sorted(self.p[t] for t in task_ids
+                       if t not in self.retired and t not in self.quarantined)
+         return vals[len(vals) // 2] if vals else 0.0
+ ```
94
+
95
+ The trainer feeds `update(...)` from each GRPO group; the generator polls `median_pass(...)` and, when it crosses a band, emits a harder batch (more deletion span / more starvation). This is the minimal realization of "select for + create harder tasks dynamically."
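The generator side of that loop can be sketched as a small ladder that consumes the value returned by `median_pass(...)`. This is only an illustration: the class name `EscalationLadder`, the `upper_band=0.6` threshold, and the choice to move one rung at a time are assumptions, not anything the blog or the mapping table specifies. The rung order follows the "Deletion span" knob above.

```python
# Hypothetical escalation ladder driven by the curriculum's median pass-rate.
# The band value and the one-rung-per-poll policy are illustrative assumptions.
GRANULARITIES = ("function", "class", "file", "feature")  # easiest -> hardest


class EscalationLadder:
    def __init__(self, upper_band: float = 0.6):
        self.upper_band = upper_band
        self.level = 0  # index into GRANULARITIES

    def step(self, median_pass_rate: float) -> str:
        """Move one rung harder when the pool's median pass-rate exceeds the band."""
        if median_pass_rate > self.upper_band and self.level < len(GRANULARITIES) - 1:
            self.level += 1
        return GRANULARITIES[self.level]


ladder = EscalationLadder()
print(ladder.step(0.40))  # below the band: stays at "function"
print(ladder.step(0.75))  # above the band: escalates to "class"
```

The ladder never steps back down in this sketch; whether de-escalation is useful when the median collapses after a jump is an open design choice.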
96
+
97
+ ---
98
+
99
+ ## 3. Reward-hacking failure modes & programmatic safeguards `[BLOG-VERIFIED]` problem, `[EXTRAPOLATED]` mitigations
100
+
101
+ The blog (re-pulled verbatim) is the ground truth on the *problem*:
102
+
103
+ > *"One downstream consequence of large scale synthetic task creation is that it can cause unexpected reward hacking… In one example, the model found a **leftover Python type-checking cache and reverse-engineered the format to find a deleted function signature**. In another, it was able to **find and decompile Java bytecode to reconstruct a third-party API**. We were able to find and diagnose these problems using **agentic monitoring tools**, but they demonstrate the increasing care necessary for large scale RL."*
104
+
105
+ The blog gives **no mitigation specifics** beyond "agentic monitoring tools" (confirmed unchanged in `research/09` §4). So the following are our design `[EXTRAPOLATED]`, consistent with mapping row (g) ("Static analysis + bytecode-cache-deletion + a sandboxed shell with no `find`/`strings`/`unzip`").
106
+
107
+ **Root cause:** Feature Deletion deletes *source*, but compilers/type-checkers/build tools leave **shadow copies of the deleted information** elsewhere in the working tree. The agent recovers the answer instead of reconstructing it. Two defense layers:
108
+
109
+ ### 3a. Pre-task scrubbing (eliminate the leak at construction time)
110
+ Run after deletion, before the repo is handed to the agent:
111
+
112
+ | Leak source | Scrub action |
113
+ |---|---|
114
+ | Python bytecode | delete all `**/__pycache__/`, `*.pyc`, `*.pyo` |
115
+ | Type-check caches | delete `.mypy_cache/`, `.pyre/`, `.pytype/`, `.dmypy.json`, `.ruff_cache/`, `.pytest_cache/` |
116
+ | Compiled Java/JVM | delete `*.class`, `target/`, `build/`, `*.jar`/`*.war` containing the deleted API; strip bundled deps |
117
+ | Build/dist artifacts | delete `dist/`, `*.egg-info/`, `*.so`, `build/`, `.tox/` |
118
+ | VCS history | run the agent on a **squashed, detached worktree** — no `.git` (else `git log -p` / `git show` recovers the deletion) |
119
+ | Editor/LSP indexes | delete `.idea/`, `.vscode/`, `*.code-workspace`, ctags/`tags`, `.cache/` |
120
+ | Docs/stubs | delete generated `*.pyi` stubs and built HTML/Sphinx docs that embed signatures |
121
+
122
+ ```python
+ # datagen/scrub.py  [EXTRAPOLATED]
+ import pathlib
+ import shutil
+
+ LEAK_DIRS = {"__pycache__", ".mypy_cache", ".pyre", ".pytype", ".ruff_cache",
+              ".pytest_cache", "target", "build", "dist", ".tox", ".idea", ".vscode", ".git", ".cache"}
+ LEAK_GLOBS = ["*.pyc", "*.pyo", "*.class", "*.jar", "*.war", "*.so", "*.pyi",
+               ".dmypy.json", "tags", "*.egg-info"]
+
+ def scrub(root: str):
+     root = pathlib.Path(root)
+     # Materialize each walk first: removing directories while rglob is still iterating is unsafe.
+     for p in list(root.rglob("*")):
+         if p.name in LEAK_DIRS and p.is_dir():
+             shutil.rmtree(p, ignore_errors=True)
+     for g in LEAK_GLOBS:
+         for p in list(root.rglob(g)):
+             if p.is_dir():                      # e.g. *.egg-info directories
+                 shutil.rmtree(p, ignore_errors=True)
+             else:
+                 p.unlink(missing_ok=True)
+ ```
138
+
139
+ ### 3b. Runtime sandbox lockdown (block recovery if a leak survives)
140
+ - **Tool denylist in the agent's shell harness** (matches mapping row g): no `find`, `strings`, `unzip`, `jar`, `javap`, `objdump`, `grep -r` over non-source dirs, `uncompyle6`/`decompyle3`/`pycdc`, `cfr`/`procyon`/`fernflower` (Java decompilers), `git`. Enforce it with an allowlisted command shim rather than a blocklist: the list above names what must be unreachable, and an allowlist is the safe default because anything not explicitly permitted is denied.
141
+ - **Network egress = none** (can't `pip download` the original package to read the API). Already required for determinism.
142
+ - **Read-only mounts for everything except the editable source tree**; site-packages of the *target package itself* removed from the image.
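A minimal sketch of the allowlist shim idea, assuming the harness receives each agent command as a single string. The `ALLOWED` set and the function name `is_allowed` are hypothetical; the real allowlist would be chosen per substrate. Only the standard-library `shlex` module is used.

```python
import shlex

# Assumed, illustrative allowlist: anything not named here is denied.
ALLOWED = {"ls", "cat", "head", "tail", "sed", "pytest"}
# Reject shell metacharacters outright so one allowed command cannot chain a denied one.
SHELL_META = set(";|&`$<>()\n")


def is_allowed(command_line: str) -> bool:
    """Return True only for a single, plainly-invoked command on the allowlist."""
    if any(ch in SHELL_META for ch in command_line):
        return False
    try:
        argv = shlex.split(command_line)
    except ValueError:  # unbalanced quotes and similar parse failures
        return False
    return bool(argv) and argv[0] in ALLOWED


print(is_allowed("pytest -q tests/test_api.py"))  # True
print(is_allowed("find . -name '*.pyc'"))         # False
```

A string-level check like this is only one layer. If an interpreter such as `python` is on the allowlist, the agent can re-create any denied tool from inside it, so the denied binaries and leak files still have to be absent from the image (§3a) for the shim to mean anything.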
143
+
144
+ ### 3c. Post-hoc monitoring ("agentic monitoring tools" analogue) `[EXTRAPOLATED]`
145
+ A cheap programmatic monitor over the trajectory, run *after* a rollout passes, to retro-reject hacks:
146
+ - **AST diff check:** the agent's accepted diff must contain *new function/class bodies* (AST nodes with statements), not just an import that re-exposes a surviving symbol. Reject solutions whose passing is explained purely by `import`/`from … import *` of a non-scrubbed module.
147
+ - **Provenance scan:** flag any trajectory whose tool calls touched `*.class`, `*.pyc`, `.mypy_cache`, `.git`, or invoked a denied binary (defense-in-depth telemetry even with the shim).
148
+ - **Static byte-similarity gate:** if the agent's reconstructed function is a near-exact byte copy of the (held-out) gold body *and* the agent never "wrote" it incrementally (single paste), flag for review — distinguishes reconstruction from recovery.
149
+ - These produce a **reward mask**: `reward = test_pass_fraction × (0 if hack_detected else 1)`. This is the concrete realization of mapping row (g)'s "+ RM-based penalty" without needing a learned RM in v0.1.
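The AST diff check and the reward mask can be sketched with the standard-library `ast` module. This is an assumption-laden simplification: it compares whole-file sources instead of diffs, matches symbols by bare name, and treats a body made only of a docstring, `pass`, `...`, or `raise` as "not a real body". The function names are hypothetical.

```python
import ast


def _is_trivial(stmt: ast.stmt) -> bool:
    """Docstrings, bare `...`, `pass`, and `raise` do not count as an implementation."""
    if isinstance(stmt, (ast.Pass, ast.Raise)):
        return True
    return isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant)


def real_definitions(source: str) -> set:
    """Names of functions/classes that contain at least one non-trivial statement."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if any(not _is_trivial(stmt) for stmt in node.body):
                names.add(node.name)
    return names


def masked_reward(test_pass_fraction, broken_src, submitted_src, deleted_symbols):
    """reward = test_pass_fraction x (0 if hack_detected else 1)."""
    newly_defined = real_definitions(submitted_src) - real_definitions(broken_src)
    hack_detected = not set(deleted_symbols) <= newly_defined
    return test_pass_fraction * (0.0 if hack_detected else 1.0)


broken = "def keep():\n    return 1\n"
rebuilt = broken + "def parse(x):\n    return int(x)\n"
reimported = broken + "from leftover_cache import parse\n"
print(masked_reward(1.0, broken, rebuilt, ["parse"]))     # 1.0
print(masked_reward(1.0, broken, reimported, ["parse"]))  # 0.0
```

The check is deliberately strict and will produce false positives, for example when the agent legitimately restores the feature under a different decomposition, so in practice it is better suited to flagging for review than to silent rejection.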
150
+
151
+ ---
152
+
153
+ ## 4. Open-source drop-in substrates
154
+
155
+ Every substrate below ships SWE-bench-shaped tuples `(repo, base_commit, patch=gold, test_patch, FAIL_TO_PASS, PASS_TO_PASS)`. **The Feature-Deletion mapping is identical for all of them: revert `patch` (or AST-delete the functions it touches) to manufacture `broken_repo`; `FAIL_TO_PASS` is the reward target; `PASS_TO_PASS` is the "stay-functional" guard the blog demands.** Licenses verified live 2026-05-28.
156
+
157
+ | Substrate | HF dataset id | Scale | What it provides | License | FD-env mapping |
158
+ |---|---|---|---|---|---|
159
+ | **SWE-bench / Lite / Verified** | `SWE-bench/SWE-bench`, `SWE-bench/SWE-bench_Lite`, `SWE-bench/SWE-bench_Verified` | 2,294 / 300 / 500 | Real GitHub issue→PR tuples, per-version test envs, pre-built Docker images (`xingyaoww/sweb.eval.*`, `swebench/*`) | dataset CC-BY-4.0; **per-repo source licenses vary** (mostly Apache/MIT/BSD) | Lite/Verified are the **v0.0 smoke-test set** (mapping row b: "use SWE-bench-lite only" in v0.0). Revert gold patch → FD task. Already-built images = no env-construction cost. |
160
+ | **SWE-Gym / SWE-Gym-Raw** | `SWE-Gym/SWE-Gym`, `SWE-Gym/SWE-Gym-Raw` | 2,438 (11 repos) / ~tens-of-k raw | Same schema as SWE-bench but **purpose-built for training** (train split, not a held-out benchmark → no contamination worry); pre-built Docker images; verifier-training support. arXiv:2412.21139 (ICML 2025). | check repo (SWE-Gym tooling MIT; **instances inherit upstream repo licenses**) | **Primary v0.1 FD substrate** (mapping row b: "build Feature Deletion"). 2.4k clean training tasks, each invertible into an FD task with `n` difficulty escalations (§2). |
161
+ | **R2E-Gym (V1 / Subset)** | `R2E-Gym/R2E-Gym-V1`, `R2E-Gym/R2E-Gym-Subset`, `R2E-Gym/SWE-Bench-Lite` | **8.1K** executable envs (13 repos); Subset = non-overlapping w/ SWE-bench | **SWE-GEN engine**: procedurally generates executable envs *directly from commits* w/o human issues, + execution-assisted back-translation for problem statements + **pre-built Docker images**. arXiv (R2E-Gym, Jain et al. 2025). | check repo (Apache-2.0 tooling typical; per-instance upstream licenses) | Best **scale** substrate and the closest open analogue to Composer's "grounded in real codebases" generator. Its commit-diffs *are* feature-deletion candidates by construction (the commit added a feature; revert = delete it). |
162
+ | **SWE-rebench** | `nebius/SWE-rebench` (+ `nebius/SWE-bench-extra`, `nebius/SWE-agent-trajectories`) | **21,336** tasks, 3,468 repos | Fully-automated mining pipeline; ships `install_config`, `requirements`, `environment`, `docker_image`, and **LLM-scored difficulty + clarity annotations** per task; `FAIL_TO_PASS`/`PASS_TO_PASS`/`FAIL_TO_FAIL`/`PASS_TO_FAIL`. arXiv:2505.20411 (NeurIPS 2025). | **dataset CC-BY-4.0**; per-instance `license_name` field provided (56 distinct) — *honor it per instance* | **Largest + the only one with built-in difficulty scores** → seeds the §2 curriculum's cold-start `p̂(t)` prior before any rollouts exist. The per-instance `docker_image` + `install_config` removes the 200-hr env-build bottleneck SWE-Gym reported. |
163
+ | **OpenHands trajectories** (via Nemotron-SWE-v1) | `nvidia/Nemotron-SWE-v1` | 59K agent trajectories | OpenHands-framework SWE trajectories (Qwen3-Coder-480B teacher), issues sourced from SWE-Gym + R2E-Gym-Subset | **CC-BY-4.0** (subsets BSD-3 / Apache-2.0 / MIT per viewer) — "ready for commercial use" | Not an FD-env itself — it's **SFT/distill warm-start + a source of gold trajectories** for the §3c monitor's "what legitimate reconstruction looks like" reference, and for `research/05` trace-replay. Use as cold-start, not as the RL env. |
164
+
165
+ **Practical selection:** v0.0 = SWE-bench-Lite (smoke). v0.1 = SWE-Gym (clean train) + SWE-rebench (scale + difficulty prior). R2E-Gym when we need to push past ~25k tasks toward the "25×" spirit. Nemotron/OpenHands trajectories = SFT warm-start + monitor reference, not the RL env. **License rule baked into the loader:** carry each instance's upstream repo license; filter out copyleft (GPL/AGPL) repos for any artifact we redistribute (we redistribute *deletions/diffs*, which are derivative works).
166
+
167
+ ---
168
+
169
+ ## 5. Deletion mechanics: producing the `(broken_repo, test_command, golden_diff)` tuple
170
+
171
+ Two construction paths; we implement both and let the curriculum pick granularity (§2).
172
+
173
+ ### Path A — Gold-patch reversion (cheap, the default for SWE-* substrates) `[INFERRED]`
174
+ The substrate already tells us *exactly* which lines implement a testable feature: the gold `patch`. So:
175
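+ 1. `git apply` both `patch` and `test_patch` onto `base_commit` → **functional repo, all tests pass** (this is the substrate's "solved" state). In SWE-bench-shaped data the `FAIL_TO_PASS` tests are added or updated by `test_patch`, so it must stay applied after the revert in step 2.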
176
+ 2. `golden_diff := patch` (what the agent must re-derive); `broken_repo := apply(reverse(patch))` → the feature is gone.
177
+ 3. `FAIL_TO_PASS` tests now fail (target); `PASS_TO_PASS` tests still pass (the "remains functional" guard — verify this, see §5c).
178
+ 4. **Scrub** (§3a), strip `.git`, freeze image.
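The Path A steps can be sketched as a thin wrapper over `git`. This assumes SWE-bench-shaped inputs where `test_patch` carries the target tests, and it assumes the patch files are given as absolute paths (because `git -C` changes directory before resolving them). The function name `build_broken_repo` and the injectable `run` parameter are hypothetical; `git checkout --detach`, `git apply`, and `git apply -R` are the only git commands used.

```python
import subprocess


def build_broken_repo(repo_dir, base_commit, patch_file, test_patch_file,
                      run=subprocess.run):
    """Sketch of Path A: solved state first, then revert the gold patch only."""
    executed = []

    def git(*args):
        cmd = ["git", "-C", str(repo_dir), *args]
        run(cmd, check=True)  # raises CalledProcessError on a non-zero exit
        executed.append(cmd)

    git("checkout", "--detach", base_commit)
    git("apply", str(test_patch_file))   # keeps the FAIL_TO_PASS tests in the tree
    git("apply", str(patch_file))        # solved state: validate "all tests pass" here
    git("apply", "-R", str(patch_file))  # broken state: feature gone, tests kept
    return executed
```

Scrubbing (§3a), removing `.git`, and freezing the image happen after this function returns; they are left out so the sketch stays focused on the patch ordering.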
179
+
180
+ ### Path B — Coverage-mapped AST deletion (true synthetic generation, no human PR needed) `[EXTRAPOLATED]`
181
+ This is the path that generalizes beyond mined PRs and lets us "create harder tasks" at will (R2E-Gym-style):
182
+ 1. **Run the suite with coverage** (`coverage.py` / `pytest --cov`) on the passing repo to get a `test → {file:line-ranges}` map.
183
+ 2. **Pick a deletion target** by granularity knob:
+    - *function-level:* parse with `ast`/`libcst`, choose a `FunctionDef`/`AsyncFunctionDef`/`ClassDef` whose lines are covered by ≥1 test and that has high *test selectivity* (covered by few `PASS_TO_PASS` tests, so the repo stays functional after removal).
+    - *file-level:* a module imported by exactly the target tests.
+    - *feature-level:* the transitive closure of a public symbol via an import/def graph (`grimp`/`pydeps`), bounded so unrelated tests survive.
187
+ 3. **Delete** via CST (replace body with `raise NotImplementedError` *or* remove the node and its now-dead imports). CST (`libcst`) preserves formatting and lets us re-insert a stub signature or not (the §2 "hint starvation" knob).
188
+ 4. **`golden_diff` = the removed nodes** (held out for the monitor; never shown to the agent).
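Step 3 can be illustrated with the standard-library `ast` module (Python 3.9+ for `ast.unparse`). This is a stand-in for the `libcst` approach named above, not a replacement: `ast.unparse` drops comments and original formatting, which is exactly why the brief prefers a CST. The names `StubOut` and `delete_function` are hypothetical.

```python
import ast


class StubOut(ast.NodeTransformer):
    """Stub or remove every function named `target`, recording what was removed."""

    def __init__(self, target: str, keep_signature: bool = True):
        self.target, self.keep_signature = target, keep_signature
        self.removed = []  # held-out source of the deleted definitions

    def _handle(self, node):
        if node.name != self.target:
            return self.generic_visit(node)
        self.removed.append(ast.unparse(node))
        if not self.keep_signature:
            return None  # drop the node entirely ("hint starvation")
        node.body = [ast.Raise(exc=ast.Name(id="NotImplementedError", ctx=ast.Load()),
                               cause=None)]
        return node

    visit_FunctionDef = visit_AsyncFunctionDef = _handle


def delete_function(source: str, target: str, keep_signature: bool = True):
    """Return (broken_source, removed_definitions)."""
    transformer = StubOut(target, keep_signature)
    tree = ast.fix_missing_locations(transformer.visit(ast.parse(source)))
    return ast.unparse(tree), transformer.removed


src = "def keep():\n    return 1\n\ndef parse(x):\n    return int(x)\n"
broken, removed = delete_function(src, "parse")
print(broken)
```

One caveat of this sketch: with `keep_signature=False`, removing the only statement in a class or module leaves an empty body that no longer parses, so a real generator would need to insert `pass` or skip such targets. Dead-import cleanup is also omitted.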
189
+
190
+ ### 5c. Guaranteeing the remaining tests exercise the deleted code (the blog's hard constraint)
191
+ The blog requires *"the codebase remains functional while specific testable features are removed."* Enforce as a **construction-time validation gate** — a task is only emitted if all four hold:
192
+
193
+ ```python
+ # datagen/build_task.py  [EXTRAPOLATED]  (sketch over a sandboxed runner)
+ def validate_task(repo_passing, broken_repo, target_tests, keep_tests, golden_diff, run, apply):
+     """Return True only if all four gates hold; the caller discards the task otherwise.
+
+     `run(repo, tests)` and `apply(repo, diff)` are caller-supplied sandbox callables.
+     """
+     tests = list(target_tests) + list(keep_tests)
+     # 1. baseline sanity: full suite passes on the unbroken repo
+     if not run(repo_passing, tests).all_pass:
+         return False
+     res = run(broken_repo, tests)
+     # 2. deletion actually breaks the target feature (FAIL_TO_PASS non-empty and failing)
+     if not target_tests or not all(res.failed(t) for t in target_tests):
+         return False
+     # 3. deletion left the rest functional (collection works, neighbors pass): PASS_TO_PASS guard
+     if not (res.collected_ok and all(res.passed(t) for t in keep_tests)):
+         return False
+     # 4. solvability: gold diff restores green (the task is actually achievable)
+     return bool(run(apply(broken_repo, golden_diff), tests).all_pass)
+ ```
207
+
208
+ Gate (4) is what prevents the §2 "impossible task" quarantine pile-up — every emitted task is provably solvable by `golden_diff`. Gate (3) is the literal encoding of "remains functional." **Task tuple emitted:**
209
+
210
+ ```python
+ # datagen/schema.py  [EXTRAPOLATED]
+ from dataclasses import dataclass, field
+
+ @dataclass(frozen=True)
+ class FeatureDeletionTask:
+     task_id: str
+     repo: str                               # e.g. "getmoto/moto"
+     base_commit: str
+     broken_image: str                       # docker tag of the scrubbed broken repo (frozen env)
+     test_command: str                       # e.g. "python -m pytest -q"
+     fail_to_pass: tuple[str, ...]           # reward target (must go red→green)
+     pass_to_pass: tuple[str, ...]           # functional guard (must stay green)
+     golden_diff: str = field(repr=False)    # HELD OUT: monitor/solvability only, never in obs
+     granularity: str = "function"           # function|file|feature (curriculum escalation)
+     deleted_symbols: tuple[str, ...] = ()   # for AST-provenance monitor (§3c)
+     upstream_license: str = "unknown"       # carried from substrate; gates redistribution
+     difficulty_prior: float = 0.5           # seeded from SWE-rebench LLM score if available
+ ```
228
+
229
+ ---
230
+
231
+ ## 6. `FeatureDeletionEnv` — OpenEnv/Gym-style design for TRL `GRPOTrainer` + verifiers
232
+
233
+ **Integration contract.** TRL's `GRPOTrainer` takes a dataset of prompts and one or more **reward functions** with the calling convention `reward_fn(prompts: list[str], completions: list[str], **kwargs) -> list[float]` (the dataset's non-prompt columns are passed through as `**kwargs`; confirmed against the [TRL `grpo_trainer.py`](https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_trainer.py) source and the `RewardFunc` type-alias fix in TRL PR #5246). So the env exposes **two faces**: a Gym/OpenEnv face (`reset`/`step` for multi-turn agentic rollout via OpenEnv, mapping row c) and a **`reward_fn` adapter** that GRPO calls directly. Reward = **test pass fraction** (`|FAIL_TO_PASS passing| / |FAIL_TO_PASS|`), naturally graded for multi-feature tasks, masked by the hack monitor (§3c).
234
+
235
+ ```python
+ # envs/feature_deletion_env.py  [EXTRAPOLATED]
+ # Gym/OpenEnv-style env + a TRL GRPO reward adapter. Execution happens in the
+ # locked-down sandbox of §3b; this class is the orchestration shell.
+ # `_render_prompt` and `_run_completion` are helpers not shown here.
+ import collections
+ from dataclasses import dataclass
+ from datagen.schema import FeatureDeletionTask
+
+ @dataclass
+ class StepResult:
+     observation: str   # tool output / test stdout shown to the agent
+     reward: float      # only nonzero on a terminal "submit" step
+     done: bool
+     info: dict
+
+ class FeatureDeletionEnv:
+     """One task per episode. Sandbox = allowlisted shell, no net, scrubbed tree (§3)."""
+     def __init__(self, sandbox, monitor, max_turns: int = 40, registry=None, curriculum=None):
+         self.sandbox, self.monitor, self.max_turns = sandbox, monitor, max_turns
+         self.registry = registry if registry is not None else {}   # task_id -> FeatureDeletionTask
+         self.curriculum = curriculum                                # optional PassRateCurriculum (§2)
+         self.task: FeatureDeletionTask | None = None
+         self.turns = 0
+
+     # ---- Gym/OpenEnv face (multi-turn agentic rollout) ----
+     def reset(self, task: FeatureDeletionTask) -> str:
+         self.task, self.turns = task, 0
+         self.sandbox.boot(task.broken_image)   # read-only except editable src; egress off
+         # NOTE: golden_diff / deleted_symbols are NEVER placed in the observation.
+         return self._render_prompt(task)       # task desc + failing-test names + tool list
+
+     def step(self, action: dict) -> StepResult:
+         self.turns += 1
+         if action["type"] == "submit" or self.turns >= self.max_turns:
+             return self._grade()
+         obs = self.sandbox.exec(action)        # edit / run-tests / read-file (allowlisted)
+         return StepResult(obs, 0.0, False, {"turn": self.turns})
+
+     def _grade(self) -> StepResult:
+         r = self.sandbox.run_tests(self.task.test_command,
+                                    self.task.fail_to_pass + self.task.pass_to_pass)
+         frac = r.n_pass(self.task.fail_to_pass) / max(len(self.task.fail_to_pass), 1)
+         guard_ok = r.all_pass(self.task.pass_to_pass)              # "remains functional"
+         hacked = self.monitor.flag(self.sandbox.trajectory(),      # AST + provenance (§3c)
+                                    self.task.deleted_symbols)
+         reward = frac * (1.0 if guard_ok and not hacked else 0.0)
+         return StepResult(r.stdout, reward, True,
+                           {"frac": frac, "guard_ok": guard_ok, "hacked": hacked})
+
+     # ---- TRL GRPOTrainer face (reward_fn(prompts, completions, **kwargs) -> list[float]) ----
+     def reward_fn(self, prompts, completions, *, task_id=None, **kwargs):
+         rewards = []
+         per_task = collections.defaultdict(lambda: [0, 0])   # task_id -> [n_pass, n_total]
+         for comp, tid in zip(completions, task_id):          # task_id passed via dataset column
+             self.reset(self.registry[tid])
+             res = self._run_completion(comp)                 # replay agent turns from `comp`
+             rewards.append(res.reward)
+             per_task[tid][0] += int(res.reward > 0)
+             per_task[tid][1] += 1
+         if self.curriculum is not None:                      # §2 feedback: one update per task per group
+             for tid, (n_pass, n_total) in per_task.items():
+                 self.curriculum.update(tid, n_pass, n_total)
+         return rewards
+ ```
291
+
292
+ **Wiring to GRPO (the dataset carries `task_id`; curriculum reweights the sampler):**
293
+ ```python
+ # train/grpo_fd.py  [EXTRAPOLATED]
+ # LockedSandbox, HackMonitor, build_prompt_dataset, PassRateCurriculum and `tasks`
+ # are project-local names defined elsewhere in the framework.
+ from trl import GRPOConfig, GRPOTrainer
+
+ env = FeatureDeletionEnv(sandbox=LockedSandbox(...), monitor=HackMonitor(...))
+ env.registry = {t.task_id: t for t in tasks}   # reward_fn looks tasks up by task_id
+ env.curriculum = PassRateCurriculum()          # §2 gate, fed from reward_fn
+ ds = build_prompt_dataset(tasks)   # columns: prompt, task_id (+ curriculum weights)
+ trainer = GRPOTrainer(
+     model="Qwen/Qwen3-Coder-7B",             # v0.0 base (mapping row a)
+     args=GRPOConfig(num_generations=8),      # G=8 → 8 pass/fail obs per task per step → §2; other config omitted
+     reward_funcs=[env.reward_fn],            # reward = masked test pass-fraction
+     train_dataset=ds,
+ )
+ trainer.train()
+ ```
306
+ This slots into the same RLVR base as rows (c)/(d); the **SDPO hint-distill channel (row d, `research/05`) is orthogonal** and stacks on top — Feature Deletion supplies the *verifiable scalar reward* that SDPO's KL rides on. The `verifiers` library can wrap `reward_fn` for env composition if we run multiple generators.
307
+
308
+ ---
309
+
310
+ ## 7. Cost & feasibility at scale (CPU pools)
311
+
312
+ Feature-Deletion is **embarrassingly parallel and CPU-bound** — no GPU in the data-gen path (matches mapping §"Synthetic data: Generators run on CPU pool… Embarrassingly parallel"). Two cost buckets:
313
+
314
+ **(A) Task construction (one-time per task).** `[EXTRAPOLATED]` estimates:
315
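+ - Path A (gold-patch revert) over a pre-built substrate image: `git apply -R` + scrub + the validation suite runs. Dominated by the test runs: **~30 s–5 min CPU** per run depending on suite size. The validation gate (§5c) runs the suite on the passing, broken, and gold-restored repos; if the passing-repo result is reused from the substrate's own validation, that leaves ~2 fresh runs (broken + gold-restored) → call it **~2–10 min CPU/task**.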
316
+ - Path B (AST deletion): + a coverage run (~1.5–3× a normal suite run) + AST/CST manipulation (<1 s). **~5–20 min CPU/task.**
317
+ - **Throughput:** a 64-vCPU pool at ~8 min/task and 8 concurrent runners ≈ **~60 tasks/hr/node** → ~1,400 tasks/day/node. Inverting all 21k SWE-rebench instances ≈ **~15 node-days** on one 64-vCPU box, trivially parallel across nodes. Reaching a "25×-spirit" pool of ~50k–60k tasks (R2E-Gym 8.1k + SWE-rebench 21k + multi-feature/granularity escalations) is **<1 week on a modest CPU pool**.
318
+ - **Storage/images:** reuse substrate Docker images (SWE-Gym `xingyaoww/sweb.eval.*`, SWE-rebench per-instance `docker_image`) → **near-zero env-build cost**, sidestepping the "200 hours of manual env setup" bottleneck SWE-Gym reported. We only add a thin scrubbed overlay layer per task (~MBs).
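The throughput figures above follow from simple arithmetic, reproduced here so the assumptions (8 concurrent runners, ~8 min per task, 24 h/day utilization) are easy to vary. The function names are illustrative only.

```python
import math


def tasks_per_hour(concurrent_runners: int, minutes_per_task: float) -> float:
    return concurrent_runners * 60 / minutes_per_task


def node_days(n_tasks: int, concurrent_runners: int = 8, minutes_per_task: float = 8) -> float:
    return n_tasks / (tasks_per_hour(concurrent_runners, minutes_per_task) * 24)


def nodes_needed(n_tasks: int, deadline_days: float, **kw) -> int:
    return math.ceil(node_days(n_tasks, **kw) / deadline_days)


print(tasks_per_hour(8, 8))          # 60.0 tasks/hr/node -> 1,440 tasks/day/node
print(round(node_days(21_336), 1))   # 14.8 node-days for all of SWE-rebench
print(nodes_needed(60_000, 7))       # 6 nodes to build a 60k-task pool in a week
```

Under these assumptions a 60k-task pool is about 42 node-days, so "<1 week" holds for a pool of roughly six or more such nodes.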
319
+
320
+ **(B) Reward evaluation (recurring, in the RL loop).** This is the real running cost, not construction: each GRPO step runs `G` rollouts × (agent turns + final test run). Test execution is CPU; agent generation is the GPU/inference cost shared with rows (c)/(d). Levers: cache the broken image warm, run only `FAIL_TO_PASS + PASS_TO_PASS` (not the full suite), and retire solved tasks via §2 so we stop paying for tasks the model already aced ("begins to get most training problems correct").
321
+
322
+ **Feasibility verdict:** **Green.** Construction is cheap and one-time; the curriculum keeps the live pool small; the only nontrivial recurring cost (sandboxed test execution) is shared with any RLVR coding env we'd build anyway. The binding constraints are *engineering* (sandbox lockdown §3, validation gate §5c) and *licensing hygiene* (§4), not compute.
323
+
324
+ ---
325
+
326
+ ## 8. Open questions / reproducibility gaps (carried from blog silence)
327
+
328
+ 1. **Deletion-target selection heuristic** — blog silent (`research/09` §1 "NO CHANGE"). We propose coverage-selectivity (§5 Path B); Cursor's actual heuristic is unknown.
329
+ 2. **Deleter model vs. program** — blog implies an agent deletes ("asked to delete code… such that the codebase remains functional"); we default to *programmatic* deletion (cheaper, deterministic, no second model). An LLM-deleter is a v0.2 escalation.
330
+ 3. **The other ~24 generators** — Feature Deletion is "one synthetic approach… a range of approaches"; the rest are unnamed. Out of scope here; this brief delivers the one named generator.
331
+ 4. **"Agentic monitoring tools" internals** — unspecified; our §3c monitor is a best-effort programmatic stand-in.
332
+ 5. **Composer2.pdf (arXiv:2603.24477)** — flagged by `research/09` action-item #1 as the likely home of data-mix % and generator inventory; **not yet extracted**. Recommend a follow-up pull before scaling the generator suite.
333
+
334
+ ---
335
+
336
+ ## Sources
337
+
338
+ - **Cursor blog** — *Introducing Composer 2.5*, [cursor.com/blog/composer-2-5](https://cursor.com/blog/composer-2-5) (re-extracted 2026-05-28; §1/§3 quotes verbatim).
339
+ - **Composer 2 technical report** — [arXiv:2603.24477](https://arxiv.org/abs/2603.24477) / [cursor.com/resources/Composer2.pdf](https://cursor.com/resources/Composer2.pdf) (unread; flagged in `research/09`).
340
+ - **SWE-bench** — datasets guide [swebench.com/SWE-bench/guides/datasets](https://www.swebench.com/SWE-bench/guides/datasets); HF `SWE-bench/SWE-bench`, `SWE-bench/SWE-bench_Lite`, `SWE-bench/SWE-bench_Verified`.
341
+ - **SWE-Gym** — *Training Software Engineering Agents and Verifiers with SWE-Gym*, [arXiv:2412.21139](https://arxiv.org/abs/2412.21139) (ICML 2025); HF [`SWE-Gym/SWE-Gym`](https://huggingface.co/datasets/SWE-Gym/SWE-Gym) (2,438 inst, 11 repos), `SWE-Gym/SWE-Gym-Raw`; [github.com/SWE-Gym/SWE-Gym](https://github.com/SWE-Gym/SWE-Gym).
342
+ - **R2E-Gym** — *Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents* (Jain et al. 2025); [r2e-gym.github.io](https://r2e-gym.github.io); HF org [huggingface.co/R2E-Gym](https://huggingface.co/R2E-Gym) (`R2E-Gym-V1`, `R2E-Gym-Subset`, `SWE-Bench-Lite`); 8.1K executable envs, 13 repos.
343
+ - **SWE-rebench** — *An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents*, [arXiv:2505.20411](https://arxiv.org/pdf/2505.20411) (NeurIPS 2025); HF [`nebius/SWE-rebench`](https://huggingface.co/datasets/nebius/SWE-rebench) (21,336 tasks, 3,468 repos, CC-BY-4.0 + per-instance `license_name`), `nebius/SWE-bench-extra`, `nebius/SWE-agent-trajectories`.
344
+ - **OpenHands trajectories** — HF [`nvidia/Nemotron-SWE-v1`](https://huggingface.co/datasets/nvidia/Nemotron-SWE-v1) (59K OpenHands trajectories, CC-BY-4.0; issues from SWE-Gym + R2E-Gym-Subset).
345
+ - **TRL `GRPOTrainer`** — reward-fn convention `reward_fn(prompts, completions, **kwargs)->list[float]`, [trl/trainer/grpo_trainer.py](https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_trainer.py), [`RewardFunc` alias PR #5246](https://github.com/huggingface/trl/pull/5246), [GRPO docs](https://huggingface.co/docs/trl/main/en/grpo_trainer).
346
+ - **Internal:** `docs/COMPOSER_RECIPE_MAPPING.md` (§2, rows b/g), `research/09-composer-blog-delta-2026.md` (online-curriculum delta), `research/05-trace-replay-distillation.md` (orthogonal distill channel).
research/07-sdpo-hint-generator.md ADDED
@@ -0,0 +1,366 @@
1
+ # SDPO Hint Generation: How to Build the Teacher's "Privileged Info" for Composer's Targeted RL with Textual Feedback
2
+
3
+ > **Research date:** 2026-05-28.
4
+ > **Scope:** Resolves the **#1 open replication question** flagged in `docs/COMPOSER_RECIPE_MAPPING.md` §1 and `research/09-composer-blog-delta-2026.md` §2: *how are the hints generated?* This doc maps OPSD/SDPO's "privileged information" onto Composer's "hint," builds a cheapest→richest **taxonomy of hint sources**, ships a **concrete template library with actual strings**, specifies the **LLM-judge fallback prompt**, aligns **error-site detection** with `ingestion/trace_examples.py`, and proposes a **layered `HintGenerator` design** that slots into the existing `CollatorConfig.hint_generator` hook.
5
+ > **Method:** Primary-source pulls (Tavily advanced) of the SDPO abstract + method/ablation HTML ([arXiv:2601.20802v2](https://arxiv.org/abs/2601.20802)), the OPSD method HTML + GitHub README ([arXiv:2601.18734v3](https://arxiv.org/abs/2601.18734), [siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD)), audited against the current `hint_generator.py`, `trainer/data_collator.py`, and `ingestion/trace_examples.py`.
6
+
7
+ ---
8
+
9
+ ## TL;DR
10
+
11
+ The hint **is** the teacher's privileged-information conditioning variable. Cursor never says how hints are generated, but the two cited papers bracket the answer precisely:
12
+
13
+ - **OPSD** conditions the teacher on `y⋆` = **ground-truth answer / reference CoT** — the strongest, most "privileged" signal, available only when you hold the solution.
14
+ - **SDPO** generalizes this to **environment feedback** that you *already have for free* at training time, and crucially ablates **three feedback types**: (1) a **successful sibling rollout** ("sample solution"), (2) the **environment output** (runtime errors / judge text), and (3) the **student's own original attempt**. The teacher is the *same weights* conditioned on that feedback; the student is the same weights without it; loss is a per-token KL on the student's trajectory, gradient through the student only, teacher stop-grad.
15
+
16
+ Composer's "hint" is therefore **not one thing** — it is *whatever cheap, locally-available text shifts the teacher distribution toward the correct continuation*. That reframing makes the open question tractable: build a **layered generator** that tries the cheapest source first and escalates only on miss:
17
+
18
+ ```
19
+ template-by-error-kind (free, deterministic)
+ → raw-tool-error-as-hint (free, structural)
+ → LLM-judge hint (~$0.0005/site)
+ → SDPO successful-sibling bootstrap (free, needs a rollout group)
21
+ ```
22
+
23
+ The current `hint_generator.py` implements **only the first layer** (5 templates) and is the right v0.1 starting point. This doc specifies layers 2–4 and a clean `HintGenerator` Protocol so they compose behind the existing `Callable[[str, dict], str | None]` hook with **zero collator changes**.
24
+
25
+ ---
26
+
27
+ ## 1. How OPSD & SDPO obtain the teacher's "privileged info" → Composer's "hint"
28
+
29
+ Both methods build teacher and student from **a single LLM** and differ only in *what extra text the teacher gets to see*. That extra text is the privileged-information variable. Composer's "hint inserted into the local context" is exactly this variable.
30
+
31
+ ### 1.1 OPSD — privileged info = ground-truth answer
32
+
33
+ > *"The teacher policy is provided with privileged information `y⋆`, such as the **ground-truth answer or a reference chain-of-thought**, while the student policy conditions only on the problem `x`. … the teacher policy `p_T(·|x, y⋆)` conditions on both the problem and the privileged answer, whereas the student policy `p_S(·|x)` observes only the problem. We preserve the on-policy training paradigm by sampling trajectories `ŷ` exclusively from the student policy, which then receives dense, token-level supervision from the privileged teacher policy."* — OPSD, [arXiv:2601.18734v3](https://arxiv.org/html/2601.18734v3)
34
+
35
+ OPSD loss (Eq. 8, verbatim structure):
36
+
37
+ ```
38
+ L(θ) = E_{(x, y⋆)~S} [ E_{ŷ~p_S(·|x)} [ D( p_T ‖ p_S )(ŷ | x) ] ]
39
+ ```
40
+
41
+ > *"Gradients are backpropagated only through the student policy `p_S`, while the teacher `p_T` acts as a fixed full-distribution target conditioned on privileged information `(x, y⋆)`."*
42
+
43
+ **Map to Composer:** `y⋆` ≙ the **hint**. In OPSD the hint is maximally strong (the answer itself). In a coding agent you rarely have the answer at an arbitrary turn — so the OPSD form is the *upper bound* of hint strength, usable only for the subset of error sites where a reference exists (e.g. the deleted code in a Feature-Deletion task, or a known-good tool signature).
44
+
45
+ Two OPSD implementation handles that transfer directly (from the [GitHub README](https://github.com/siyan-zhao/OPSD)):
46
+ - **`--reason_first`**: *"Prepend an explicit rationalization to the teacher context before distillation."* This is OPSD's own knob for **same-model introspection** (taxonomy class (d) below) — the teacher is first asked to rationalize *why* the privileged info implies the answer, then distilled. Evidence the introspection-hint path is real and works.
47
+ - **`--jsd_token_clip`** (default `0.05`): *"Clip the JSD loss for each token … This can improve stability by preventing **stylistic tokens from dominating** the training signal."* Directly relevant to Composer's **style/communication** behavior targets — without clipping, distilling a style hint can be dominated by a few high-divergence stylistic tokens. Our collator's `sdpo_loss_mask` already isolates post-hint tokens; token-clipping is the complementary per-token stabilizer.
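
To make the token-clipping idea concrete, here is a minimal pure-Python sketch of a per-token Jensen-Shannon divergence with a cap. It assumes the clip is a simple per-token upper bound on the divergence; whether OPSD clamps, masks, or rescales is not stated in the quoted README text, so treat `clipped_token_losses` as a hypothetical illustration rather than OPSD's implementation.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions
    over the same vocabulary, given as equal-length probability lists."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to KL(a || b).
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)

    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def clipped_token_losses(teacher_dists, student_dists, clip=0.05):
    """Per-token JSD, each capped at `clip` (assumed reading of the flag)."""
    return [min(jsd(t, s), clip) for t, s in zip(teacher_dists, student_dists)]


# Token 0: teacher and student disagree sharply; token 1: they agree.
teacher = [[0.9, 0.1], [0.5, 0.5]]
student = [[0.1, 0.9], [0.5, 0.5]]
print(clipped_token_losses(teacher, student))  # [0.05, 0.0]
```

The first token's raw divergence (about 0.37) would otherwise dominate the sum; with the cap it contributes no more than any other clipped token.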
48
+
49
+ ### 1.2 SDPO — privileged info = environment feedback (three sources, ablated)
50
+
51
+ > *"SDPO treats the current model **conditioned on feedback** as a self-teacher and distills its feedback-informed next-token predictions back into the policy. … SDPO leverages the model's ability to **retrospectively identify its own mistakes in-context**."* — SDPO abstract, [arXiv:2601.20802v2](https://arxiv.org/abs/2601.20802)
52
+
53
+ SDPO explicitly ablates **three feedback types present in a verifiable coding environment** ([SDPO method HTML](https://arxiv.org/html/2601.20802v2)):
54
+
55
+ > *"we ablate the three types of feedback present in a verifiable environment like code generation: **the sample solution** (if a successful rollout is available in the current rollout group), **the environment output** (such as runtime errors), and **the student's original attempt**."*
56
+
57
+ This is the load-bearing finding for our taxonomy. Each maps to a distinct hint source:
58
+
59
+ | SDPO feedback type | What it is | Composer "hint" equivalent | Taxonomy class (§2) |
60
+ |---|---|---|---|
61
+ | **Sample solution** | A *successful sibling rollout* from the same prompt's rollout group | Bootstrap hint: "Here is a working approach: …" | **(f)** SDPO successful-sibling bootstrap |
62
+ | **Environment output** | Runtime error / judge text returned by the env | Raw tool-error text spliced as the hint | **(b)** raw-tool-error-as-hint |
63
+ | **Student's original attempt** | The model's own failed text, re-shown | Self-introspection prompt | **(d)** same-model introspection |
64
+
65
+ The key SDPO lever for the **hint-absent case** (called out in `09-composer-blog-delta-2026.md` §3 action item 3):
66
+
67
+ > *"SDPO also outperforms baselines in standard RLVR environments that only return scalar feedback by **using successful rollouts as implicit feedback for failed attempts**."*
68
+
69
+ i.e. when there is **no external hint source**, you can still manufacture privileged info by letting the teacher condition on a *sibling rollout that passed*. This is free (you already paid for the rollout group under GRPO) and is the natural last fallback before giving up on an error site.
70
+
71
+ ### 1.3 The exact mechanism nuance to preserve (from the Cursor blog, via delta doc §2)
72
+
73
+ > *"This hint **changes the probabilities for the teacher, lowering those for the wrong tool and increasing those for a valid replacement**. For that turn only, we then **update the student weights towards the new probabilities**."*
74
+
75
+ Two facts the hint generator must respect:
76
+ 1. **Teacher = hint-conditioned forward pass of the same weights** (not a re-rollout, not a separate model). The generator's job is only to *produce the text spliced into the teacher context* — the collator (`_build_hint_injected_trace`) already does the splicing, and the trainer does the forward pass.
77
+ 2. **Student weights are trainable; teacher is stop-grad.** The generator never touches the loss; it only conditions the teacher. So **a wrong hint is bounded-bad** — it produces a noisier teacher target at one masked turn, not a corrupted reward. This is why we can afford cheap/heuristic hints and only escalate on miss.
78
+
79
+ ---
80
+
81
+ ## 2. Taxonomy of hint sources — cheapest → richest
82
+
83
+ For each class: applicability, cost, and which **Composer behavior class** it covers. Composer's three stated behavior targets are **tool use, coding style, and model communication** (`09-composer-blog-delta-2026.md` §2), plus **effort calibration** (blog §"behavioral aspects"). Tool errors are the cheap, structural case; style/communication/effort are the hard cases templates can't reach.
84
+
85
+ | # | Hint source | How obtained | Cost / latency | Determinism | Tool err | Style | Comms | Effort |
86
+ |---|---|---|---|---|:--:|:--:|:--:|:--:|
87
+ | **(a)** | **Hardcoded template by error_kind** | Pattern-match `error_kind`, fill slots (`available_tools`, `tool_schema`) | **Free**, ~µs | Fully deterministic | ✅ strong | ⚠️ rigid | ⚠️ rigid | ❌ |
88
+ | **(b)** | **Raw tool-error text as hint** | Pass the env's error string through (optionally truncated) | **Free**, ~µs | Deterministic | ✅ strong | ❌ | ❌ | ❌ |
89
+ | **(c)** | **LLM-judge natural-language hint** | Call a cheap judge model with `(state, erroring_action, tool_output)` | ~$0.0003–0.001/site, ~0.5–2 s | Stochastic | ✅ | ✅ | ✅ | ✅ |
90
+ | **(d)** | **Same-model introspection** | Re-prompt the *training model* to critique its own failed turn (OPSD `--reason_first`) | **Free GPU** (1 extra gen), ~0.3–1 s | Stochastic | ✅ | ✅ | ✅ | ✅ |
91
+ | **(e)** | **Learned hint generator** | A small fine-tuned model trained to emit hints (defer to v0.2+) | Train-time cost + inference | Stochastic | ✅ | ✅ | ✅ | ✅ |
92
+ | **(f)** | **SDPO successful-sibling bootstrap** | Pick a *passing* rollout from the same prompt's GRPO group; condition teacher on it | **Free** (reuses rollout group), ~µs to select | Deterministic given group | ✅ | ✅ | ✅ | ✅ (shows a *terser* success) |
93
+
94
+ **Reading of the table:**
95
+ - **(a)+(b) cover the tool-use behavior class almost entirely** and are free + deterministic → make them the default first layer. This is the "easy case" the mapping doc warns about (`COMPOSER_RECIPE_MAPPING.md` §"Why deferring … is the right call", point 2): they *do not* validate the harder behavior cases.
96
+ - **Style / communication / effort-calibration are NOT pattern-matchable.** "This explanation was wasteful" or "this code violates house style" requires class **(c)**, **(d)**, or **(f)**. This is the real content of the open question.
97
+ - **(f) is the unique unlock** when no external hint source exists *and* you don't want an API call: it manufactures privileged info from the model's own successes. It is the natural fallback and also the cheapest source for style/comms/effort because a *successful sibling* implicitly demonstrates the desired style/terseness without anyone writing a rule.
98
+ - **(e) learned generator** is explicitly v0.2 (`COMPOSER_RECIPE_MAPPING.md` table row (d): "+ learned hint generator"). Out of scope to build now; the Protocol below makes it a drop-in later.
99
+
100
+ **Recommended escalation order (rationale):** deterministic-and-free before stochastic, structural before semantic, no-API before API. → `(a) → (b) → [(c) xor (d)] → (f)` with `(f)` as the "nothing else fired but we have a passing sibling" backstop.
101
+
102
+ ---
103
+
104
+ ## 3. Concrete template library (actual strings)
105
+
106
+ This **extends** the current `hint_generator.py` registry (which already ships `tool_not_found`, `json_decode`, `type_error`, `runtime_error`, `repeated_failure`). New/expanded templates below are written to the **same `HintContext` TypedDict** and same `dispatch(error_kind, ctx)` contract, so they register without touching the collator. All keep the blog's *"Reminder: …"* register (the one verbatim example Cursor published was `"Reminder: Available tools are…"`).
107
+
108
+ | error_kind | Trigger | Hint string (template) |
109
+ |---|---|---|
110
+ | `tool_not_found` | invalid tool name | `Reminder: Available tools are: {tool_list}. The tool you called does not exist — use one of these.` |
111
+ | `malformed_args` / `json_decode` | unparseable tool args / JSON | `Reminder: tool arguments must be a single valid JSON object. Common mistakes: single quotes (use double quotes), trailing commas, unescaped newlines inside strings, or wrapping the JSON in markdown fences.` |
112
+ | `schema_mismatch` / `type_error` | args parse but violate schema | `Reminder: \`{tool_name}\` expects arguments matching this schema:\n {tool_schema}\nYour call is missing/mistyped: {bad_fields}. Re-issue with arguments matching the schema.` |
113
+ | `failing_test` | test suite returns non-zero / assertion | `Reminder: the test \`{test_name}\` is still failing: {assertion_excerpt}. Re-read the failing test's expectations and adjust the implementation to satisfy them — do not modify the test.` |
114
+ | `lint_style` | linter/formatter non-zero exit | `Reminder: this change violates the project style ({linter}: {rule_id} — {rule_msg}). Match the surrounding code's conventions (imports, naming, formatting) before proceeding.` |
115
+ | `wasteful_action` | redundant/no-op action (effort calibration) | `Reminder: this step repeated work already done (you already {prior_action}). Skip redundant reads/searches and act on what you know; prefer the most direct path to the goal.` |
116
+ | `repeated_failure` | same error_kind ≥3× consecutively | `Reminder: this approach has failed {n} times. Step back and try a different strategy: read more of the surrounding code, search for an existing working example, or decompose the task differently.` |
117
+ | `verbose_communication` | judge-flagged over-long message (comms) | `Reminder: keep the response concise and focused on the user's request. State what you did and why in 1–2 sentences; omit restating the task and step-by-step narration.` |
118
+
119
+ Notes:
120
+ - `{tool_list}`, `{tool_schema}`, `{bad_fields}`, etc. are filled from `HintContext` (`available_tools`, `tool_schema`, `tool_name`) and from new optional keys (`test_name`, `assertion_excerpt`, `linter`, `rule_id`, `rule_msg`, `prior_action`, `n`).
121
+ - `failing_test`, `lint_style`, `wasteful_action`, `verbose_communication` are **new** and extend coverage from tool-use into the style/comms/effort behavior classes at the *template tier* — but they are deliberately generic; the high-quality versions of these come from the LLM-judge (§4) or sibling-bootstrap (§2 class (f)).
122
+ - Truncate `{assertion_excerpt}` / `{tool_schema}` to ~200 chars (matches the `source_content_excerpt[:200]` convention already used in `trace_examples.py`) to keep the injected hint short — the blog stresses the hint is **local and short**.
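
As an illustration of the slot-filling and truncation rules in the notes, here is a minimal sketch for the `schema_mismatch` row. The helper names (`truncate`, `render_schema_mismatch`) and the "decline when a slot is missing" behavior are assumptions made for this example; the real `dispatch` in `hint_generator.py` may handle missing keys differently.

```python
from __future__ import annotations

import json

MAX_SLOT_CHARS = 200  # assumed budget, mirroring the ~200-char convention

SCHEMA_MISMATCH_TEMPLATE = (
    "Reminder: `{tool_name}` expects arguments matching this schema:\n"
    "  {tool_schema}\n"
    "Your call is missing/mistyped: {bad_fields}. "
    "Re-issue with arguments matching the schema."
)


def truncate(text: str, limit: int = MAX_SLOT_CHARS) -> str:
    """Keep at most `limit` characters so the injected hint stays short."""
    return text if len(text) <= limit else text[:limit]


def render_schema_mismatch(ctx: dict) -> str | None:
    """Fill the template from a HintContext-like dict; decline if a slot is missing."""
    required = ("tool_name", "tool_schema", "bad_fields")
    if any(key not in ctx for key in required):
        return None
    return SCHEMA_MISMATCH_TEMPLATE.format(
        tool_name=ctx["tool_name"],
        tool_schema=truncate(json.dumps(ctx["tool_schema"], sort_keys=True)),
        bad_fields=", ".join(ctx["bad_fields"]),
    )


hint = render_schema_mismatch(
    {"tool_name": "read_file", "tool_schema": {"path": "string"}, "bad_fields": ["path"]}
)
print(hint)
```

Returning `None` on a missing slot lets a later layer take over instead of emitting a hint with an unfilled placeholder.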
123
+
124
+ ---
125
+
126
+ ## 4. LLM-judge path (class (c))
127
+
128
+ When no template fires, or when the behavior class is style/comms/effort, call a cheap judge to emit a ≤2-sentence corrective hint. The judge sees the *failed* turn and the *environment's* reaction — never the ground truth (we usually don't have it) — and is asked to produce the *minimal corrective nudge* that the teacher will then condition on.
129
+
130
+ ### 4.1 Prompt template
131
+
132
+ ```text
133
+ SYSTEM:
134
+ You write a single, short corrective hint for a coding agent that just made a
135
+ mistake. The hint will be inserted into the agent's context so it can retry the
136
+ SAME turn. Output ONE hint of AT MOST 2 sentences. Be concrete and actionable.
137
+ Do NOT solve the task, do NOT write code, do NOT explain your reasoning. If the
138
+ action was actually fine, output exactly: NO_HINT.
139
+
140
+ USER:
141
+ ## Conversation state (last {k} turns)
142
+ {state}
143
+
144
+ ## The action that went wrong
145
+ {erroring_action}
146
+
147
+ ## What the environment returned
148
+ {tool_output}
149
+
150
+ ## Behavior dimension to correct (one of: tool_use | style | communication | effort)
151
+ {behavior_dim}
152
+
153
+ Write the hint now (≤2 sentences, or NO_HINT):
154
+ ```
155
+
156
+ - `{state}` = last `k≈3` turns (truncate to a token budget, e.g. 1.5k tokens).
157
+ - `{erroring_action}` = the assistant turn's tool call / message that failed.
158
+ - `{tool_output}` = the env error or judge text (the same string class (b) would pass raw).
159
+ - `{behavior_dim}` = routed from the error-site detector (§5): structural tool errors → `tool_use`; judge-flagged turns → `style`/`communication`/`effort`.
160
+ - `NO_HINT` sentinel maps to the generator returning `None` (skip the SDPO site), preventing the collator from minting a zero-signal row (the collator already guards "hint AND recovery content" — `data_collator.py` L308).
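
The prompt assembly and the `NO_HINT` handling described above could look roughly like this. It is a sketch under stated assumptions: `build_judge_prompt` and `parse_judge_reply` are hypothetical names, the context keys follow the §4.1 placeholders, and the character cap is a crude stand-in for the token budget (about 4 characters per token is assumed, not measured).

```python
from __future__ import annotations

SYSTEM_PROMPT = (
    "You write a single, short corrective hint for a coding agent that just made a\n"
    "mistake. The hint will be inserted into the agent's context so it can retry the\n"
    "SAME turn. Output ONE hint of AT MOST 2 sentences. Be concrete and actionable.\n"
    "Do NOT solve the task, do NOT write code, do NOT explain your reasoning. If the\n"
    "action was actually fine, output exactly: NO_HINT."
)

USER_TEMPLATE = """## Conversation state (last {k} turns)
{state}

## The action that went wrong
{erroring_action}

## What the environment returned
{tool_output}

## Behavior dimension to correct (one of: tool_use | style | communication | effort)
{behavior_dim}

Write the hint now (≤2 sentences, or NO_HINT):"""


def build_judge_prompt(ctx: dict, k: int = 3, max_state_chars: int = 6000) -> str:
    """Render the judge prompt; keeps only the most recent part of the state."""
    state = ctx.get("state_excerpt", "")[-max_state_chars:]
    user = USER_TEMPLATE.format(
        k=k,
        state=state,
        erroring_action=ctx.get("erroring_action", ""),
        tool_output=ctx.get("error_message", ""),
        behavior_dim=ctx.get("behavior_dim", "tool_use"),
    )
    return f"SYSTEM:\n{SYSTEM_PROMPT}\n\nUSER:\n{user}"


def parse_judge_reply(reply: str) -> str | None:
    """Map the NO_HINT sentinel (or an empty reply) to None so the site is skipped."""
    text = reply.strip()
    return None if text in ("", "NO_HINT") else text
```

Slicing the state from the end keeps the turns closest to the error, which are the ones the judge most needs to see.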
161
+
162
+ ### 4.2 Model tier & rough cost
163
+
164
+ - **Tier:** a *small/cheap* instruct model is sufficient — the task is "spot the obvious mistake and say it in 2 sentences," not solve. Candidates: a 7–8B local model already loaded for rollouts (zero marginal $), or a hosted small model (e.g. Sonnet-class / GPT-mini-class via OpenRouter, consistent with the existing `hint_generator.py` docstring that names "Sonnet 4.6 or Opus 4.7 via OpenRouter" for v0.2).
165
+ - **Cost (hosted small model):** input ≈ 1.5k–2k tok (state + action + output), output ≤ ~60 tok. At ~$0.15/M in + ~$0.60/M out that is **≈ $0.0003–0.0006 per error site**. With error sites at, say, 1–3 per trace, this is **~$0.001–0.002/trace**, more than two orders of magnitude cheaper than the trace-replay channel's ~$0.30/trace (`COMPOSER_RECIPE_MAPPING.md` §"three reward channels"), and only paid on **template misses**.
166
+ - **Cost (local judge):** effectively free GPU time; preferred at scale. Use the hosted path for v0.1 quality calibration, then distill to local.
167
+ - **Caching:** hints are deterministic-enough to cache keyed on `hash(erroring_action + tool_output + behavior_dim)`; repeated identical error sites across a training run reuse the hint for free.
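
The per-site cost can be checked with a few lines of arithmetic. This sketch uses the token counts and prices quoted in the hosted-model bullet as inputs; the function name is illustrative, and the prices are the document's rough figures rather than a quote from any provider.

```python
def judge_cost_per_site(input_tokens: int, output_tokens: int,
                        usd_per_m_in: float = 0.15,
                        usd_per_m_out: float = 0.60) -> float:
    """Dollar cost of one judge call at per-million-token prices."""
    return input_tokens * usd_per_m_in / 1e6 + output_tokens * usd_per_m_out / 1e6


low = judge_cost_per_site(1_500, 60)    # smaller prompt
high = judge_cost_per_site(2_000, 60)   # larger prompt
per_trace_worst = 3 * high              # assuming 3 error sites in one trace

print(f"per site:  ${low:.6f} to ${high:.6f}")
print(f"per trace: ${per_trace_worst:.6f} (3 sites)")
print(f"trace-replay at $0.30 is about {0.30 / per_trace_worst:.0f}x more expensive")
```

With these inputs the raw token cost sits near the low end of the quoted per-site range, so the range leaves headroom for retries or longer state excerpts.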
168
+
169
+ ---
170
+
171
+ ## 5. Error-site detection in a trace (align with `ingestion/trace_examples.py`)
172
+
173
+ The hint generator must only fire at **error sites**. The pipeline already has two layers of structural detection that the generator must align with — do **not** invent a parallel detector.
174
+
175
+ **Existing structural detection (authoritative, do not duplicate):**
176
+ 1. **Ingestion → `trace_examples.py`** sets `turn["tool_error"] = <error_kind>` on the assistant turn *immediately after* an error tool-result. It detects errors via:
177
+ - **Structural flag first** (`_user_turn_has_error`): the ingester sets `tool_error: True` on user messages whose source JSONL had `is_error: true`. **This is the source of truth.**
178
+ - **String-tag fallback**: matches `TOOL_ERROR_TAG = "[TOOL_RESULT (ERROR)]"` only when no structural flag is present (older traces).
179
+ - **`error_kind` classification** (`default_classify_error`): keyword regex → `command_not_found`, `file_not_found`, `permission_denied`, `syntax_error`, `connection_error`, else `tool_error`.
180
+ 2. **Collator → `data_collator.py`** (`_is_error_turn`): an error site iff `turn.get("tool_error") is not None`, AND it only mints an SDPO row when **both** a hint is produced **and** the recovery turn has content (L308) — so empty-recovery sites are skipped.
181
+
182
+ **Extending the detector for the new behavior/error classes (additions, not replacements).** Keep `error_kind` as the routing key the generator already receives, and broaden the classifier so the new templates (§3) and the judge router (§4) get the right `behavior_dim`:
183
+
184
+ | Signal in trace | Detected via | New `error_kind` → behavior_dim |
185
+ |---|---|---|
186
+ | Failed tool status / `is_error: true` | structural flag (existing) | `tool_error`/`tool_not_found`/… → `tool_use` |
187
+ | Exception traceback in tool output | regex `Traceback (most recent call last)` / `Error:` | `runtime_error` → `tool_use` |
188
+ | Malformed args / JSON | parse failure of the tool-call args | `malformed_args`/`json_decode` → `tool_use` |
189
+ | Test runner non-zero exit / assertion | regex `FAILED\|AssertionError\|[0-9]+ failed` in output | `failing_test` → `tool_use` (verifiable) |
190
+ | Linter/formatter non-zero exit | regex `(ruff\|eslint\|flake8\|black).*(error\|would reformat)`; nonzero exit code | `lint_style` → `style` |
191
+ | Redundant/no-op action | heuristic: action equals a prior action's signature; or no state delta | `wasteful_action` → `effort` |
192
+ | Over-long / off-task assistant message | **LLM-judge flag only** (no structural signal) | `verbose_communication` → `communication` |
193
+
194
+ Implementation alignment rule: **add these as new `(kind, regex)` rows to `_ERROR_KIND_PATTERNS` in `trace_examples.py`** (same ordered-precedence mechanism already there — note its comment that `command_not_found` must precede `file_not_found`), so detection stays in **one** place and the generator stays a pure `error_kind → hint` function. Style/comms/effort sites that have **no structural signature** are surfaced only by the judge and should be gated (sampled, e.g. 10–20% of clean turns) to bound cost.
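
The ordered-precedence mechanism can be sketched as a first-match scan over `(kind, regex)` rows. The rows and the `classify` helper below are hypothetical stand-ins written for this example; the actual contents and shape of `_ERROR_KIND_PATTERNS` in `trace_examples.py` are not reproduced here.

```python
import re

# Hypothetical new rows. Order matters: "AssertionError: ..." also contains
# "Error:", so failing_test must be tried before runtime_error.
NEW_PATTERNS = [
    ("failing_test", re.compile(r"FAILED|AssertionError|[0-9]+ failed")),
    ("lint_style", re.compile(r"(ruff|eslint|flake8|black).*(error|would reformat)")),
    ("runtime_error", re.compile(r"Traceback \(most recent call last\)|Error:")),
]

# Routing from error_kind to the behavior dimension, per the table above.
BEHAVIOR_DIM = {
    "failing_test": "tool_use",
    "lint_style": "style",
    "runtime_error": "tool_use",
    "tool_error": "tool_use",
}


def classify(tool_output: str, patterns=NEW_PATTERNS, default: str = "tool_error") -> str:
    """Return the first kind whose pattern matches, else the default kind."""
    for kind, pattern in patterns:
        if pattern.search(tool_output):
            return kind
    return default


# Illustrative tool outputs (synthetic strings, not captured logs).
samples = [
    "FAILED tests/test_x.py::test_y - AssertionError: assert 1 == 2",
    "black --check: would reformat src/app.py",
    "Traceback (most recent call last):\n  ...\nValueError: bad input",
    "command exited with status 1",
]
for text in samples:
    kind = classify(text)
    print(kind, "->", BEHAVIOR_DIM[kind])
```

Because the first match wins, reordering the rows changes the result for outputs that satisfy more than one pattern, which is the same reason the existing comment insists `command_not_found` precede `file_not_found`.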
195
+
196
+ ---
197
+
198
+ ## 6. Recommended layered design + `HintGenerator` Protocol
199
+
200
+ ### 6.1 Protocol
201
+
202
+ A clean, typed Protocol that subsumes the current `dispatch` and the existing `CollatorConfig.hint_generator: Callable[[str, dict], str | None]` hook. The collator calls `generator(error_kind, error_meta)`; we wrap the Protocol with a tiny adapter so **no collator change is required**.
203
+
204
+ ```python
205
+ # composer_replication/hints/protocol.py
+ from __future__ import annotations
+ from typing import Protocol, TypedDict, runtime_checkable
+
+
+ class ErrorContext(TypedDict, total=False):
+     """Everything a hint source might need. Superset of the current HintContext."""
+     error_kind: str  # routing key from trace_examples classifier
+     behavior_dim: str  # "tool_use" | "style" | "communication" | "effort"
+     error_message: str  # raw env/tool error text (enables class (b))
+     available_tools: list[str]
+     tool_name: str
+     tool_schema: dict
+     state_excerpt: str  # last-k turns, for the judge (class (c)/(d))
+     erroring_action: str  # the failed assistant turn
+     sibling_rollouts: list[dict]  # GRPO group; passing ones enable class (f)
+     repeat_count: int  # for repeated_failure
+
+
+ @runtime_checkable
+ class HintGenerator(Protocol):
+     """A hint source. Returns the hint text, or None to decline this site."""
+     def generate(self, ctx: ErrorContext) -> str | None: ...
228
+ ```
229
+
230
+ ### 6.2 Layered composite (template-first → judge → sibling-bootstrap)
231
+
232
+ ```python
233
+ # composer_replication/hints/layered.py
+ from dataclasses import dataclass, field
+ from .protocol import HintGenerator, ErrorContext
+ from . import templates  # wraps the existing HINT_TEMPLATES registry
+ from . import judge  # LLM-judge generator (class (c))
+ from . import sibling  # SDPO successful-sibling bootstrap (class (f))
+
+
+ @dataclass
+ class LayeredHintGenerator:
+     """Try each source in order; first non-None wins. A wrong hint is
+     bounded-bad (teacher is stop-grad), so cheap layers go first and we
+     only escalate to paid/learned layers on a miss."""
+     layers: list[HintGenerator] = field(default_factory=list)
+
+     def generate(self, ctx: ErrorContext) -> str | None:
+         for layer in self.layers:
+             hint = layer.generate(ctx)
+             if hint:  # non-empty, non-None
+                 return hint
+         return None  # collator then skips this SDPO site
+
+     # Adapter for the existing CollatorConfig.hint_generator signature.
+     def as_collator_hook(self):
+         def hook(error_kind: str, error_meta: dict) -> str | None:
+             ctx: ErrorContext = {"error_kind": error_kind, **(error_meta or {})}
+             return self.generate(ctx)
+         return hook
+
+
+ def default_layered(*, judge_client=None, enable_judge=True) -> LayeredHintGenerator:
+     layers: list[HintGenerator] = [
+         templates.TemplateHintGenerator(),  # (a) free, deterministic
+         templates.RawErrorHintGenerator(),  # (b) raw env error as hint
+     ]
+     if enable_judge and judge_client is not None:
+         layers.append(judge.JudgeHintGenerator(judge_client))  # (c)
+     layers.append(sibling.SiblingBootstrapGenerator())  # (f) backstop
+     return LayeredHintGenerator(layers=layers)
272
+ ```
273
+
274
+ Wiring (unchanged collator contract):
275
+
276
+ ```python
277
+ gen = default_layered(judge_client=my_small_model_client)
278
+ config = CollatorConfig(hint_generator=gen.as_collator_hook(), enable_sdpo=True)
279
+ collator = ComposerDataCollator(tokenizer=tok, config=config)
280
+ ```
281
+
282
+ ### 6.3 The three new layers (sketches)
283
+
284
+ ```python
285
+ # (a)+(b) templates.py — reuse the EXISTING registry verbatim
+ from composer_replication.hint_generator import dispatch  # current module
+
+ class TemplateHintGenerator:
+     def generate(self, ctx):
+         return dispatch(ctx.get("error_kind", ""), ctx)  # None on unknown kind
+
+ class RawErrorHintGenerator:
+     """Class (b): SDPO 'environment output' feedback — splice the raw error."""
+     def generate(self, ctx):
+         msg = (ctx.get("error_message") or "").strip()
+         if not msg:
+             return None
+         return f"Reminder: the previous action returned this error:\n{msg[:200]}\nFix the cause and retry."
299
+ ```
300
+
301
+ ```python
302
+ # (c) judge.py — class (c), prompt from §4.1
+ class JudgeHintGenerator:
+     def __init__(self, client, cache=None):
+         self.client, self.cache = client, (cache if cache is not None else {})
+
+     def generate(self, ctx):
+         # Key on the tuple itself rather than hash(...): hash values can
+         # collide, and str hashes differ between processes by default.
+         key = (ctx.get("erroring_action"), ctx.get("error_message"),
+                ctx.get("behavior_dim"))
+         if key in self.cache:
+             return self.cache[key]
+         reply = (self.client.complete(_build_judge_prompt(ctx)) or "").strip()  # §4.1 template
+         hint = None if reply in ("", "NO_HINT") else reply  # sentinel or empty reply: decline
+         self.cache[key] = hint
+         return hint
315
+ ```
316
+
317
+ ```python
+ # (f) sibling.py: SDPO 'successful rollouts as implicit feedback'
+ class SiblingBootstrapGenerator:
+     """Class (f): when nothing else fired but a sibling rollout in the same
+     GRPO group PASSED, condition the teacher on that success."""
+     def generate(self, ctx):
+         sibs = ctx.get("sibling_rollouts") or []
+         winners = [s for s in sibs if s.get("reward", 0.0) > 0.0]
+         if not winners:
+             return None
+         best = max(winners, key=lambda s: s["reward"])
+         snippet = (best.get("solution_excerpt") or "")[:200]
+         return ("Reminder: a working approach for this task looks like:\n"
+                 f"{snippet}\nAdapt this to the current step.")
+ ```
332
+
333
+ > **Note on class (d):** same-model introspection (OPSD `--reason_first`) is the *training model* critiquing its own turn. It is best implemented inside the trainer (where the model is loaded) rather than the collator, since it needs a model forward pass. Add it as an additional layer (alongside the four that `default_layered` already wires) once the trainer exposes a `self_critique(ctx) -> str` callable; the Protocol already supports it. For v0.1, the judge (c) is the simpler stand-in for the same role.
334
+
335
+ ### 6.4 Why this order (decision summary)
336
+
337
+ 1. **Templates + raw-error (a/b)** are free, deterministic, and cover the **tool-use** class — the bulk of structural error sites. They reproduce Cursor's one published example (`"Reminder: Available tools are…"`) exactly.
338
+ 2. **Judge (c)** is the only layer that *manufactures* a corrective for **style / communication / effort**, the behavior classes the mapping doc flags as the real test of the recipe (`COMPOSER_RECIPE_MAPPING.md` §"point 2"). Gated + cached → ~$0.0005/site, paid only on template miss.
339
+ 3. **Sibling-bootstrap (f)** is the SDPO-native fallback when there's no template, no judge (or judge declined), but the rollout group contains a winner — *free* privileged info from the model's own success. This is the lever `09-composer-blog-delta-2026.md` §3 action item 3 told us to record.
340
+ 4. **Learned generator (e)** drops in as a new layer in v0.2 (`COMPOSER_RECIPE_MAPPING.md` table row (d): "+ learned hint generator") without touching the Protocol or the collator.
341
+
342
+ ---
343
+
344
+ ## 7. Implementation handles (v0.1)
345
+
346
+ Concrete, ordered work items. Everything below preserves the existing `CollatorConfig.hint_generator: Callable[[str, dict], str | None]` contract — **no collator surgery**.
347
+
348
+ 1. **Keep `hint_generator.py` as layer (a).** It already implements 5 templates with the right `dispatch(error_kind, ctx) -> str | None` signature. Add the four new templates from §3 (`failing_test`, `lint_style`, `wasteful_action`, `verbose_communication`) via `register(...)`. **Actual strings shipped in §3** — copy verbatim.
349
+ 2. **Add the new error_kind regexes to `_ERROR_KIND_PATTERNS` in `trace_examples.py`** (§5 table). Single source of detection truth; preserve the ordered-precedence comment pattern (`command_not_found` before `file_not_found`). Route each `error_kind → behavior_dim` so the judge gets correct routing.
350
+ 3. **Build `composer_replication/hints/`** with `protocol.py`, `layered.py`, `templates.py`, `judge.py`, `sibling.py` (§6 sketches). `templates.py` *imports the existing `dispatch`* — do not reimplement.
351
+ 4. **Wire via the adapter:** `CollatorConfig(hint_generator=default_layered(...).as_collator_hook())`. The `claude_states_to_trace_examples` adapter already populates `error_meta` (`source_content_excerpt[:200]`); extend it to also stash `error_message` (for class (b)) and, when available from the GRPO loop, `sibling_rollouts` (for class (f)).
352
+ 5. **Borrow OPSD stabilizers for the loss side:** when distilling style/comms hints, apply **per-token JSD clipping** (OPSD `--jsd_token_clip`, default `0.05`) so "stylistic tokens" don't dominate — the README states this is exactly why it exists. Pair with the collator's existing `sdpo_loss_mask` (post-hint tokens only).
353
+ 6. **Gate the judge:** fire (c) only on (i) template miss, or (ii) a sampled fraction (~10–20%) of clean turns flagged for style/comms/effort, with hint caching keyed on `(erroring_action, error_message, behavior_dim)`. Bounds cost at ~$0.001–0.002/trace.
354
+ 7. **Eval the generator independently of training** (matches `COMPOSER_RECIPE_MAPPING.md` concern that "SDPO with hardcoded templates is the easy case"): measure (a) % of error sites that get a non-None hint per layer, (b) teacher-vs-student KL *increase* at hinted turns (a good hint should *raise* divergence — it's shifting probability toward the fix, per the blog's "lowering wrong-tool, raising valid-replacement"), and (c) for style/comms, a held-out judge agreeing the hint is corrective. A hint that doesn't move the teacher distribution is a no-op and should be pruned.
355
+
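Metric (a) from item 7 can be sketched in a few lines. This is a minimal illustration, not the repo's evaluation code: `layer_coverage`, the two toy layers, and the `unknown_tool` error kind are hypothetical names, and "first non-None wins" is assumed to match how `LayeredHintGenerator` resolves layers.

```python
from collections import Counter
from typing import Callable, Optional

# A layer is anything that maps an error context to a hint or None (hypothetical alias).
Layer = Callable[[dict], Optional[str]]


def layer_coverage(layers: dict[str, Layer], error_sites: list[dict]) -> dict[str, float]:
    """Fraction of error sites whose hint came from each layer (first non-None wins)."""
    hits = Counter()
    for ctx in error_sites:
        for name, layer in layers.items():
            if layer(ctx) is not None:
                hits[name] += 1
                break
        else:  # no layer produced a hint for this site
            hits["<no hint>"] += 1
    total = max(len(error_sites), 1)
    return {name: hits[name] / total for name in [*layers, "<no hint>"]}


# Toy stand-ins for layers (a) and (b); the real ones live in templates.py.
def template_layer(ctx: dict) -> Optional[str]:
    return "Reminder: ..." if ctx.get("error_kind") == "unknown_tool" else None


def raw_error_layer(ctx: dict) -> Optional[str]:
    msg = (ctx.get("error_message") or "").strip()
    return f"Reminder: the previous action returned this error:\n{msg[:200]}" if msg else None


sites = [
    {"error_kind": "unknown_tool"},
    {"error_kind": "other", "error_message": "exit code 1"},
    {"error_kind": "other"},
    {"error_kind": "unknown_tool", "error_message": "x"},
]
coverage = layer_coverage({"template": template_layer, "raw_error": raw_error_layer}, sites)
```

A layer whose share stays near zero on real traces is a candidate for pruning; a large `<no hint>` share marks the error kinds that still need a template or a judge call.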
356
+ ---
357
+
358
+ ## 8. Citations
359
+
360
+ - **SDPO** — Hübotter, Lübeck, Behric, Baumann, Bagatella, Marta, Hakimi, Shenfeld, Kleine Buening, Guestrin, Krause (ETH Zürich), *Reinforcement Learning via Self-Distillation*, [arXiv:2601.20802v2](https://arxiv.org/abs/2601.20802) (v1 28 Jan 2026, v2 16 Feb 2026), CC-BY-4.0. Abstract + method/ablation HTML ([html v2](https://arxiv.org/html/2601.20802v2)). The three-feedback-type ablation (sample solution / environment output / student's original attempt) and the "successful rollouts as implicit feedback" claim are the load-bearing sources for §1.2 and taxonomy classes (b), (d), (f).
361
+ - **OPSD** — Zhao et al., *Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models*, [arXiv:2601.18734v3](https://arxiv.org/abs/2601.18734), code [github.com/siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD) (paper CC-BY-4.0; verify code license on repo). Privileged-info teacher (`y⋆` = ground-truth/reference CoT), Eq. 8 loss, stop-grad teacher, and the `--reason_first` (introspection) + `--jsd_token_clip` (stylistic-token stabilizer) flags are the sources for §1.1 and the OPSD handles in §7.
362
+ - **Cursor blog** — [Introducing Composer 2.5](https://cursor.com/blog/composer-2-5) (2026): the "Reminder: Available tools are…" example, "hint changes the teacher probabilities / update student weights for that turn only," and the three behavior targets (tool use, coding style, model communication). Via `docs/COMPOSER_RECIPE_MAPPING.md` §1 and `research/09-composer-blog-delta-2026.md` §2.
363
+ - **Composer 2 technical report** — [arXiv:2603.24477](https://arxiv.org/abs/2603.24477) / [Composer2.pdf](https://cursor.com/resources/Composer2.pdf) (Rush et al.): flagged in the delta doc as the most likely place to resolve hint-generation directly; **still unread** — a dedicated extraction is the recommended follow-up if this design needs validation against Cursor's actual mechanism.
364
+ - **In-repo (audited):** `composer_replication/hint_generator.py` (current layer (a)), `composer_replication/trainer/data_collator.py` (`CollatorConfig.hint_generator` hook, `_build_hint_injected_trace`, hint-AND-recovery gate at L308), `composer_replication/ingestion/trace_examples.py` (structural error detection, `_ERROR_KIND_PATTERNS`, `default_classify_error`).
365
+
366
+ > **Residual gap:** Cursor still never states which hint source they use; this design *brackets* their unknown choice with the OPSD (privileged-answer) and SDPO (environment-feedback + sibling-bootstrap) endpoints and makes all of them composable. The one unread artifact that could collapse the bracket is Composer2.pdf (arXiv:2603.24477).
research/08-sdpo-grpo-integration.md ADDED
@@ -0,0 +1,499 @@
1
+ # SDPO ⊕ Dr. GRPO: wiring the on-policy KL-at-error-turns channel into a live RL loop
2
+
3
+ > **Design date:** 2026-05-28.
4
+ > **Scope:** A concrete, implementable design for adding the SDPO auxiliary
5
+ > loss channel (on-policy KL at error turns, teacher = same weights conditioned
6
+ > on a hint) as a **second loss head** on a live **Dr. GRPO** update step. Targets
7
+ > the two integration substrates already in this repo: the **PRIME-RL parity
8
+ > recipe** (`recipes/prime_rl/composer_loss.py`) and the **TRL `GRPOTrainer`
9
+ > subclass** (`trainer/composer_trainer.py`). Recommends the TRL subclass as the
10
+ > host and gives a ~70-LoC `ComposerGRPOTrainer` sketch.
11
+ > **Method:** Lead with local-file analysis of `loss.py`, `composer_loss.py`,
12
+ > `composer_trainer.py`, `data_collator.py`, plus `research/07` (HintGenerator)
13
+ > and `research/10` (the Dr. GRPO target). One bounded TRL API lookup
14
+ > (`mcp_exa_get_code_context_exa` on `huggingface/trl@main`) to confirm the
15
+ > `GRPOTrainer` loss-override surface; the DeepWiki follow-up timed out, so the
16
+ > version-robust guard in §4 documents both the `_compute_loss(self, model,
17
+ > inputs)` internal hook (what this repo already overrides) and the public
18
+ > `compute_loss(self, model, inputs, return_outputs=False,
19
+ > num_items_in_batch=None)` HF-Trainer wrapper.
20
+
21
+ ---
22
+
23
+ ## TL;DR
24
+
25
+ SDPO is **not** the GRPO-KL-to-reference term and must not be folded into it. It
26
+ is a **separate distillation head**: a generalized-JSD between the student's
27
+ on-policy logits and the **same model's** logits when its context has a hint
28
+ spliced in at the error turn, masked to the post-hint recovery tokens. The
29
+ integration is therefore "compute the Dr. GRPO loss as usual, then **add
30
+ `beta_sdpo · JSD_error_turns`** before `.backward()`."
31
+
32
+ - **Host = the TRL `GRPOTrainer` subclass.** It already exists
33
+ (`ComposerReplicationTrainer`), already overrides the loss with exactly this
34
+ `grpo + alpha*sdpo + beta*replay` shape, and — decisively — it has **full
35
+ logits** in `_compute_loss`. The PRIME-RL recipe **cannot** host SDPO today:
36
+ its `LossInputs` exposes per-token **log-probs only, not full vocabulary
37
+ logits**, and `composer_loss.py` correctly raises `NotImplementedError` when
38
+ `alpha_sdpo>0`. SDPO needs the full distribution; PRIME-RL is blocked until
39
+ upstream exposes logits.
40
+ - **Attach point:** inside the Dr. GRPO update step, after the policy-gradient +
41
+ k1-KL loss is computed on the minibatch, run one **student forward (grad)** +
42
+ one **teacher forward (`no_grad`, hint-spliced context)**, take
43
+ `generalized_jsd_loss` masked to `sdpo_loss_mask`, scale by `beta_sdpo`, and
44
+ add. Single-epoch Dr. GRPO makes this clean: the teacher forward happens on
45
+ **the same minibatch being updated**, so the KL is genuinely on-policy.
46
+ - **Dr. GRPO specifics are preserved untouched:** SDPO touches neither the
47
+ advantage estimator (no std-norm, no length-standardization) nor the GRPO
48
+ **k1** (`−log r`) KL-to-ref. It is purely additive.
49
+ - **CPU-testable:** a 1–2 rollout Dr. GRPO step on Qwen2.5-0.5B with the SDPO
50
+ channel on, mirroring the existing `examples/sdpo_real_trace_train_smoke`.
51
+
52
+ ---
53
+
54
+ ## 1. The two in-repo substrates, and why TRL is the host
55
+
56
+ ### 1.1 Substrate A — TRL `GRPOTrainer` subclass (`trainer/composer_trainer.py`)
57
+
58
+ Already in the repo and already the right shape. `ComposerReplicationTrainer`
59
+ subclasses `trl.GRPOTrainer` and overrides:
60
+
61
+ ```python
+ def _compute_loss(self, model, inputs) -> torch.Tensor:
+     grpo_loss = super()._compute_loss(model, inputs)  # channel 1
+     sdpo_kl = self._compute_sdpo_loss(model, inputs)  # channel 2
+     replay_dpo = self._compute_trace_replay_loss(model, inputs)
+     return grpo_loss + self.alpha_sdpo*sdpo_kl + self.beta_replay*replay_dpo
+ ```
68
+
69
+ `_compute_sdpo_loss` (lines 133–178) already does the **student forward (grad) +
70
+ teacher forward (`no_grad`) over `ctx_teacher_input_ids`**, the
71
+ `student_logits.shape == teacher_logits.shape` gate, and
72
+ `generalized_jsd_loss(..., labels=inputs["sdpo_loss_mask"], beta, temperature,
73
+ token_clip, reduction="batchmean")`. This is the SDPO channel, intact. **It has
74
+ full logits** — the prerequisite PRIME-RL lacks.
75
+
76
+ **Decisive property:** TRL hands the subclass `model` and `inputs` and lets it
77
+ return any scalar; full `.logits` are available for both the student and the
78
+ hint-conditioned teacher forward. SDPO is a drop-in.
79
+
80
+ ### 1.2 Substrate B — PRIME-RL parity recipe (`recipes/prime_rl/composer_loss.py`)
81
+
82
+ PRIME-RL's `CustomLossConfig` takes an importable `loss_fn(inputs:
83
+ LossInputs)` called **once per sample** on **1-D `(seq,)` tensors**. Channel 1
84
+ (DPPO + k1-style KL on the importance ratio) is **byte-for-byte parity-verified**
85
+ against upstream `default_loss_fn` and is an excellent Dr.-GRPO-adjacent PG loss.
86
+
87
+ But SDPO is **deferred by construction**:
88
+
89
+ ```python
+ # composer_loss.py, lines 257-268
+ teacher_lp = getattr(inputs, "teacher_logprobs", None)
+ if alpha_sdpo > 0:
+     raise NotImplementedError(
+         "SDPO channel in the PRIME-RL recipe is deferred. PRIME-RL v0.5 "
+         "exposes (seq,) log-probs through LossInputs but not full vocabulary "
+         "logits, and SDPO/OPSD requires the full distribution. ...")
+ ```
98
+
99
+ `generalized_jsd_loss` calls `log_softmax(dim=-1)` over the vocab axis. A
+ `(seq,)` log-prob vector has **no vocab axis**: treated as `(seq, 1)`, every
+ position is a 1-element distribution whose softmax is identically 1.0 and whose
+ `log` is 0, for student and teacher alike. The divergence between them is
+ therefore zero everywhere, i.e. a mathematically degenerate, silently-zero
+ channel (the Wave-13 finding the docstring cites). So SDPO in PRIME-RL is
+ blocked **until upstream exposes per-token full logits**; it is not something
+ we can paper over.
105
+
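The degeneracy is easy to reproduce without any framework. The sketch below is a pure-Python illustration (the `log_softmax` helper and the toy numbers are ours, not PRIME-RL's or the repo's): normalizing over a real vocab axis gives a proper distribution, while normalizing a lone per-token log-prob gives exactly zero regardless of its value.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


# Full logits: one row of vocab scores per position, i.e. shape (seq, vocab).
student_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
full = [log_softmax(row) for row in student_logits]

# Log-probs only: one scalar per position. Viewed as (seq, 1) there is nothing
# to normalize over, so every entry collapses to log(1.0) = 0.0.
student_logprobs = [-0.3, -1.2]
teacher_logprobs = [-2.5, -0.1]
student_degenerate = [log_softmax([lp]) for lp in student_logprobs]
teacher_degenerate = [log_softmax([lp]) for lp in teacher_logprobs]
```

Because `student_degenerate` and `teacher_degenerate` are identical, any divergence computed between them is zero, which is why the recipe raises instead of returning a loss that looks healthy but carries no gradient signal.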
106
+ ### 1.3 Recommendation
107
+
108
+ **Host the SDPO aux channel in the TRL `GRPOTrainer` subclass.** Rationale:
109
+
110
+ 1. **Logits available** — the one hard requirement SDPO has and PRIME-RL lacks.
111
+ 2. **The override already exists** with the exact additive shape; we
112
+ re-point channel 1 at Dr. GRPO and tighten the teacher forward (§4).
113
+ 3. **Single-process, CPU-runnable** — matches the existing smoke harness, so the
114
+ SDPO-on Dr.-GRPO step is testable today (§6) without PRIME-RL's 3-actor mesh.
115
+ 4. PRIME-RL stays the **scale/parity** path for channel-1-only runs; SDPO lands
116
+ there for free the moment `LossInputs.teacher_logits` (full distribution)
117
+ exists upstream — the adapter is otherwise ready.
118
+
119
+ > One caveat to fix while we're here: the current `ComposerReplicationTrainer`
120
+ > channel 1 is *vanilla* GRPO (`super()._compute_loss`). The Composer target is
121
+ > **Dr. GRPO** (`research/10`): length-standardization removed, **no std-dev
122
+ > advantage normalization**, **k1** (`−log r`) KL, Adam, single-epoch. §3 + §4
123
+ > pin those into the subclass; SDPO rides on top unchanged.
124
+
125
+ ---
126
+
127
+ ## 2. The exact attach point + data flow
128
+
129
+ SDPO attaches **inside one Dr. GRPO update step, after the PG+KL loss is formed,
130
+ before backward**. It is one extra additive scalar. Concretely, per minibatch:
131
+
132
+ ```
+ rollout trajectory (group of K) ──▶ one Dr. GRPO update step (single-epoch)
+
+ Channel 1 (Dr. GRPO):
+   advantages = (R - group_mean)           # NO /std, NO length-standardization
+   logπ_new   = model(input_ids).logprobs  # the on-policy student forward (grad)
+   log_r      = logπ_new - logπ_old        # log importance ratio (old = rollout-time)
+   pg         = -(advantages * exp(log_r))[resp_mask]
+   kl         = (-log_r)[resp_mask]        # k1 estimator, NOT k3
+   L_drgrpo   = (pg + beta_kl * kl).sum()
+
+ Channel 2 (SDPO): SAME minibatch, reuses the student forward where possible:
+   error sites ◀── reuse ingestion structural `tool_error` (research/07 §5)
+     │             (turn.get("tool_error") is not None; single source of truth)
+     ▼
+   HintGenerator.generate(ErrorContext) ──▶ hint text (research/07 §6, layered)
+     │
+     ▼ data_collator splices hint at the error turn:
+   ctx_teacher_input_ids  (hint system-msg + recovery turn, chat-template aligned)
+   input_ids              (placeholder-of-equal-token-length so shapes match)
+   sdpo_loss_mask         (1 on post-hint recovery tokens only)
+     │
+     ▼
+   student_logits = model(input_ids).logits                            # grad
+   with no_grad: teacher_logits = model(ctx_teacher_input_ids).logits  # stop-grad
+   L_sdpo = generalized_jsd_loss(student, teacher,
+                                 labels=sdpo_loss_mask, beta=jsd_beta,
+                                 temperature=1.0, token_clip=0.05)     # masked to error turn
+
+ total = L_drgrpo + beta_sdpo * L_sdpo ──▶ .backward() ──▶ Adam.step()
+ ```
163
+
164
+ Key flow facts:
165
+
166
+ - **Error-site detection is not re-invented.** The ingestion layer already sets
167
+ `turn["tool_error"] = <error_kind>` (structural `is_error:true` flag first,
168
+ string-tag fallback), and the collator's `_is_error_turn` keys on exactly that
169
+ (`research/07` §5). The trainer **consumes** the collator's
170
+ `ctx_teacher_input_ids` / `sdpo_loss_mask`; it does not detect errors itself.
171
+ - **HintGenerator is called at collation time**, not in the loss. Per
172
+ `research/07` §6.1, the generator's only job is to produce the text spliced
173
+ into the teacher context; the collator's `_build_hint_injected_trace` does the
174
+ splice and the equal-length student alignment
175
+ (`_build_aligned_student_for_sdpo`). The trainer sees finished tensors.
176
+ - **The teacher forward is on the live weights**, hint-conditioned, `no_grad`.
177
+ It is *not* a separate model and *not* a re-rollout (`research/07` §1.3). One
178
+ extra forward per SDPO minibatch.
179
+ - **The JSD is masked to the error turn** via `sdpo_loss_mask` (post-hint
180
+ recovery tokens only), so SDPO supervises *exactly* the turn the hint targets,
181
+ leaving the rest of the trajectory to channel 1.
182
+
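To make the masked, clipped JSD concrete, here is a small pure-Python sketch on toy logits. It is an illustration under assumptions, not the repo's `generalized_jsd_loss`: the mixture convention (`beta` weights the teacher term), the mean-over-masked-tokens reduction, and the clip-as-upper-bound reading of `token_clip` are our choices for the example, and `token_jsd` / `masked_jsd` are hypothetical names.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def token_jsd(student_logits, teacher_logits, beta=0.5):
    """beta*KL(T||M) + (1-beta)*KL(S||M) with M = beta*T + (1-beta)*S (assumed convention)."""
    s, t = softmax(student_logits), softmax(teacher_logits)
    m = [beta * ti + (1.0 - beta) * si for si, ti in zip(s, t)]
    return beta * kl(t, m) + (1.0 - beta) * kl(s, m)


def masked_jsd(student, teacher, mask, beta=0.5, token_clip=None):
    """Mean per-token JSD over positions where mask == 1; 0.0 if nothing is masked in."""
    vals = []
    for s_row, t_row, keep in zip(student, teacher, mask):
        if not keep:
            continue  # outside the post-hint recovery span: left to channel 1
        v = token_jsd(s_row, t_row, beta)
        vals.append(min(v, token_clip) if token_clip is not None else v)
    return sum(vals) / len(vals) if vals else 0.0


# Three positions over a 3-token vocab; only the last two are recovery tokens.
student = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
teacher = [[5.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
mask = [0, 1, 1]

unclipped = masked_jsd(student, teacher, mask)
clipped = masked_jsd(student, teacher, mask, token_clip=0.05)
```

Position 0 disagrees strongly but is masked out, position 1 agrees and contributes zero, and position 2 is the only one the hint actually moved; the clip caps its contribution so one sharp token cannot dominate the gradient.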
183
+ ---
184
+
185
+ ## 3. Reconciling with Dr. GRPO specifics
186
+
187
+ `research/10` pins the algorithm. SDPO must coexist without perturbing any of it:
188
+
189
+ | Dr. GRPO property (`research/10` §2) | Where it lives | SDPO interaction |
190
+ |---|---|---|
191
+ | **No std-dev advantage normalization** | advantage estimator | **None.** SDPO never touches advantages. Keep `A = R - group_mean` (no `/std`). |
192
+ | **Length-standardization term removed** | PG reduction | **None.** SDPO is a separate head; do not re-introduce a `1/\|y\|` factor via SDPO's reduction either (use `batchmean` over masked error-turn tokens, which is SDPO's own normalization, independent of trajectory length). |
193
+ | **k1 KL = `−log r`** (NOT k3) | GRPO KL-to-ref term | **Distinct from SDPO.** The GRPO k1 KL regularizes the policy toward the *reference/old* policy on all response tokens. SDPO's JSD pulls the policy toward the *hint-conditioned self-teacher* on error-turn tokens. Two different targets, two different token sets, two different weights (`beta_kl` vs `beta_sdpo`). Never merge them. |
194
+ | **Single-epoch (a prompt is never trained twice)** | outer loop | **This is what makes SDPO clean.** The teacher forward happens on the *same minibatch being updated this step* — the student logits and the hint-conditioned teacher logits are both from the current weights on the current rollout, so the distilled KL is genuinely **on-policy** (SDPO's defining property). No stale-teacher / replay-buffer drift to reconcile. |
195
+ | **Adam, full-parameter, async rollouts** | optimizer / infra | **None.** SDPO adds gradient only through the student forward; Adam consumes the summed gradient transparently. Async/off-policy weight sync (PipelineRL-style) affects channel 1's `logπ_old`; SDPO's teacher is the *current* weights so it is unaffected. |
196
+
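The first table row is worth a worked example. The sketch below contrasts the mean-centred Dr. GRPO advantage with a std-normalized one on toy reward groups; it is illustrative only (function names are ours, and population std with a small `eps` is an assumption made for simplicity, not a claim about any library's exact estimator).

```python
from statistics import mean, pstdev


def dr_grpo_advantages(rewards):
    """Dr. GRPO: centre on the group mean, no division by std."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def std_normalized_advantages(rewards, eps=1e-4):
    """Vanilla-GRPO-style advantages: centre, then divide by the group std."""
    mu, sd = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


# A nearly solved prompt: seven of eight rollouts pass.
easy_group = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
dr = dr_grpo_advantages(easy_group)
normed = std_normalized_advantages(easy_group)
```

With std normalization the single failure in `easy_group` receives an advantage of roughly -2.6, versus -0.875 under Dr. GRPO. Low-variance (very easy or very hard) groups get amplified by the division, which is the bias the no-`/std` rule removes. SDPO never reads these values either way.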
197
+ **The one thing to get right:** SDPO's JSD is **SEPARATE** from the GRPO
198
+ KL-to-ref. In the loss expression `total = L_drgrpo + beta_sdpo*L_sdpo`, the
199
+ `L_drgrpo` already *contains* its own `beta_kl * k1_kl`. Do not let `beta_sdpo`
200
+ masquerade as a KL coefficient or vice-versa; they are logged separately
201
+ (`loss/grpo_kl` vs `loss/sdpo_jsd`).
202
+
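For readers who want the k1/k3 distinction spelled out: both are per-sample estimators of a KL divergence built from the log-ratio `log_r`. The sketch below uses the standard definitions (k1 = `-log r`, k3 = `(r - 1) - log r`); the function names are ours, and nothing here is taken from the TRL or PRIME-RL source.

```python
import math


def k1(log_r: float) -> float:
    """k1 estimator: -log r. Unbiased, but individual samples can be negative."""
    return -log_r


def k3(log_r: float) -> float:
    """k3 estimator: (r - 1) - log r. Non-negative for every sample."""
    return math.expm1(log_r) - log_r


log_ratios = [-0.5, -0.1, 0.0, 0.1, 0.5]
k1_values = [k1(x) for x in log_ratios]
k3_values = [k3(x) for x in log_ratios]
```

The two agree in sign only when `log_r < 0`; for tokens whose probability went up, k1 is negative while k3 stays positive. That difference is why the estimator choice has to be checked in the channel-1 implementation rather than assumed, and why it must be logged separately from `loss/sdpo_jsd`.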
203
+ ---
204
+
205
+ ## 4. Implementation handles — `ComposerGRPOTrainer(GRPOTrainer)`
206
+
207
+ A focused subclass that (a) forces channel 1 into the Dr. GRPO regime and (b)
208
+ adds the SDPO head. This refines the existing `ComposerReplicationTrainer`; the
209
+ SDPO method is lifted almost verbatim from `composer_trainer.py:_compute_sdpo_loss`
210
+ (it is already correct), and the Dr. GRPO config is pinned via `GRPOConfig`.
211
+
212
+ ### 4.1 The loss-override surface (version-robust)
213
+
214
+ The repo already overrides `_compute_loss(self, model, inputs)` — the internal
215
+ per-step loss hook TRL's `GRPOTrainer` exposes, and what this subclass keeps
216
+ using. Recent TRL wraps that in the HF `Trainer.compute_loss(self, model,
217
+ inputs, return_outputs=False, num_items_in_batch=None)`. To be robust to either
218
+ surface, override **`_compute_loss`** (present across the versions this repo
219
+ targets) and additionally provide a thin `compute_loss` shim that delegates, so
220
+ the subclass works whether TRL calls the internal or the public method:
221
+
222
+ ```python
+ def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
+     loss = self._compute_loss(model, inputs)  # our composed loss
+     return (loss, None) if return_outputs else loss
+ ```
227
+
228
+ If a future TRL drops `_compute_loss`, move the channel-1 call to
229
+ `super().compute_loss(model, inputs, return_outputs=True,
230
+ num_items_in_batch=num_items_in_batch)[0]` inside `_compute_loss` — the SDPO
231
+ add-on is unaffected.
232
+
233
+ ### 4.2 The sketch (~70 LoC)
234
+
235
+ ```python
+ # composer_replication/trainer/composer_grpo_trainer.py
+ from __future__ import annotations
+ from typing import Any
+ import logging
+ import torch
+
+ try:
+     from trl import GRPOTrainer, GRPOConfig  # noqa: F401
+     _TRL = True
+ except ImportError:  # doc/test import without TRL
+     GRPOTrainer = object
+     _TRL = False
+
+ from composer_replication.opsd import generalized_jsd_loss
+
+ logger = logging.getLogger(__name__)
+
+
+ def make_dr_grpo_config(**overrides: Any) -> "GRPOConfig":
+     """Dr. GRPO regime (research/10 §2): no std-norm, no length-standardization,
+     k1 KL, single-epoch, Adam. We pin what GRPOConfig exposes and assert the
+     rest. TRL flag names drift across versions, so set defensively + log."""
+     cfg_kwargs = dict(
+         num_iterations=1,     # single-epoch: a prompt is never re-trained
+         scale_rewards=False,  # << NO std-dev advantage normalization (Dr. GRPO)
+         loss_type="dr_grpo",  # TRL's Dr. GRPO loss_type: drops length-standardization;
+                               # if absent in your TRL, fall back to "grpo" and
+                               # override the reduction.
+         optim="adamw_torch",  # Adam(W); Composer 2 uses Adam for RL
+         beta=0.0,             # GRPO KL-to-ref coeff; set >0 to enable the KL term.
+                               # Caveat: to our knowledge stock TRL estimates this KL
+                               # with a k3-style form, so a true k1 (-log r) term needs
+                               # an override; verify against your TRL version.
+     )
+     cfg_kwargs.update(overrides)
+     return GRPOConfig(**cfg_kwargs)
+
+
+ class ComposerGRPOTrainer(GRPOTrainer):  # type: ignore[misc,valid-type]
+     """Dr. GRPO + SDPO (on-policy KL at error turns). SDPO is an ADDITIVE head;
+     it never touches advantages or the GRPO-KL-to-ref term."""
+
+     def __init__(self, *args: Any, beta_sdpo: float = 0.0, sdpo_jsd_beta: float = 0.5,
+                  sdpo_temperature: float = 1.0, sdpo_token_clip: float | None = 0.05,
+                  sdpo_warmup_steps: int = 0, beta_sdpo_max: float | None = None,
+                  **kwargs: Any):
+         if not _TRL:
+             raise ImportError("ComposerGRPOTrainer requires TRL: pip install -e .[train]")
+         super().__init__(*args, **kwargs)
+         self.beta_sdpo = beta_sdpo
+         self.beta_sdpo_max = beta_sdpo_max if beta_sdpo_max is not None else beta_sdpo
+         self.sdpo_warmup_steps = sdpo_warmup_steps
+         self.sdpo_jsd_beta = sdpo_jsd_beta
+         self.sdpo_temperature = sdpo_temperature
+         self.sdpo_token_clip = sdpo_token_clip
+         # Dr. GRPO sanity pins (loud, not silent): if the TRL version ignored a
+         # flag, surface it rather than train vanilla GRPO by accident. Some TRL
+         # versions spell "off" as the string "none", so accept both.
+         if getattr(self.args, "scale_rewards", True) not in (False, "none"):
+             logger.warning("Dr. GRPO requires scale_rewards=False (no std-norm); "
+                            "GRPOConfig.scale_rewards=%s; advantages may be std-normalized.",
+                            getattr(self.args, "scale_rewards", None))
+
+     def _beta_sdpo_now(self) -> float:
+         """Linear warmup so SDPO doesn't swamp the early policy gradient (§5).
+         With warmup on, ramps 0 -> beta_sdpo_max (which defaults to beta_sdpo)."""
+         step = getattr(getattr(self, "state", None), "global_step", 0) or 0
+         if self.sdpo_warmup_steps <= 0:
+             return self.beta_sdpo
+         frac = min(1.0, step / float(self.sdpo_warmup_steps))
+         return frac * self.beta_sdpo_max
+
+     def _compute_loss(self, model, inputs):
+         drgrpo = super()._compute_loss(model, inputs)  # channel 1 (Dr. GRPO + KL-to-ref)
+         sdpo = self._compute_sdpo_loss(model, inputs)  # channel 2 (additive)
+         beta = self._beta_sdpo_now()
+         total = drgrpo + beta * sdpo
+         if self.state.global_step % getattr(self.args, "logging_steps", 50) == 0:
+             self.log({"loss/grpo": float(drgrpo.detach()),
+                       "loss/sdpo_jsd": float(sdpo.detach()),
+                       "loss/beta_sdpo": beta, "loss/total": float(total.detach())})
+         return total
+
+     def _compute_sdpo_loss(self, model, inputs):
+         if (self._beta_sdpo_now() == 0.0
+                 or "ctx_teacher_input_ids" not in inputs
+                 or inputs["ctx_teacher_input_ids"].numel() == 0):
+             return torch.zeros((), device=next(model.parameters()).device, requires_grad=True)
+         student = model(input_ids=inputs["input_ids"]).logits  # grad
+         with torch.no_grad():
+             teacher = model(input_ids=inputs["ctx_teacher_input_ids"]).logits  # stop-grad
+         if student.shape != teacher.shape:  # collator alignment guard
+             logger.warning("SDPO shape mismatch student=%s teacher=%s; skipping step.",
+                            student.shape, teacher.shape)
+             return torch.zeros((), device=student.device, requires_grad=True)
+         return generalized_jsd_loss(student_logits=student, teacher_logits=teacher,
+                                     labels=inputs.get("sdpo_loss_mask"),
+                                     beta=self.sdpo_jsd_beta, temperature=self.sdpo_temperature,
+                                     token_clip=self.sdpo_token_clip, reduction="batchmean")
+
+     def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
+         loss = self._compute_loss(model, inputs)
+         return (loss, None) if return_outputs else loss
+ ```
333
+
334
+ ### 4.3 How error-turn batches reach the trainer
335
+
336
+ **Reuse `ComposerDataCollator` verbatim** — it already emits
337
+ `ctx_teacher_input_ids` + `sdpo_loss_mask` and (critically) the
338
+ **equal-length student** via `_build_aligned_student_for_sdpo` (the placeholder
339
+ trick that keeps `student_logits.shape == teacher_logits.shape` so the JSD gate
340
+ passes; the Gemini-W19 alias bug is already handled there). Wiring:
341
+
342
+ ```python
+ gen = default_layered(judge_client=small_model).as_collator_hook()  # research/07 §6
+ collator = ComposerDataCollator(tokenizer=tok,
+                                 config=CollatorConfig(hint_generator=gen, enable_sdpo=True,
+                                                       enable_replay_dpo=False))
+ trainer = ComposerGRPOTrainer(model=model, args=make_dr_grpo_config(...),
+                               train_dataset=ds, data_collator=collator,
+                               beta_sdpo=0.1, sdpo_warmup_steps=50, sdpo_token_clip=0.05,
+                               reward_funcs=[my_rlvr_reward])
+ ```
352
+
353
+ > **GRPO-rollout vs collator note.** TRL's `GRPOTrainer` generates rollouts
354
+ > internally and forms its own `inputs` (prompt + completions + advantages). For
355
+ > SDPO the error sites come from the *rollout trajectory itself* (tool errors in
356
+ > the just-generated completions), so the SDPO tensors must be built **from the
357
+ > live rollout**, not from a static dataset. Two equivalent integration modes:
358
+ > (1) **post-rollout hook** — override `_generate_and_score_completions` (or the
359
+ > rollout collation step) to run the structural `tool_error` detector +
360
+ > `HintGenerator` + `ComposerDataCollator._build_sdpo_fields` on the generated
361
+ > completions and stash `ctx_teacher_input_ids`/`sdpo_loss_mask` into `inputs`;
362
+ > (2) **offline-trace mode** (what the smoke uses) — feed pre-ingested
363
+ > error-bearing traces through the collator as the dataset, exercising the exact
364
+ > loss path on CPU. Mode (2) is the test; mode (1) is production. The
365
+ > `_compute_sdpo_loss` body is identical for both — it only reads the two SDPO
366
+ > keys.
367
+
368
+ ---
369
+
370
+ ## 5. Weighting, scheduling, and guardrails
371
+
372
+ So SDPO informs without swamping the policy gradient:
373
+
374
+ 1. **Scale.** Start `beta_sdpo = 0.1` (the library default `alpha_sdpo`), not the
375
+ `1.0` the smoke uses (the smoke over-weights deliberately to *prove the path
376
+ fires*). The Dr. GRPO PG loss is a `sum()` over response tokens; SDPO is a
377
+ `batchmean` JSD over error-turn tokens — different magnitudes. **Normalize
378
+ first:** log `loss/grpo` and `loss/sdpo_jsd` separately for the first ~50
379
+ steps and pick `beta_sdpo` so `beta_sdpo·sdpo_jsd ≈ 0.1–0.3 × |grpo|` at
380
+ steady state. Do **not** assume `0.1` is calibrated across reductions.
381
+ 2. **Warmup.** Linear `beta_sdpo` warmup over `sdpo_warmup_steps` (50–200) via
382
+ `_beta_sdpo_now()`. Early in training the policy is far from any sensible
383
+ distribution; a strong distillation pull then fights exploration. Let Dr. GRPO
384
+ establish a reward signal, then ramp SDPO in.
385
+ 3. **Per-token JSD clip = 0.05** (`sdpo_token_clip`, the OPSD `--jsd_token_clip`
386
+ default, `research/07` §1.1/§7). Prevents a few high-divergence **stylistic**
387
+ tokens at the error turn from dominating the distillation gradient — exactly
388
+ what it exists for.
389
+ 4. **Mask discipline.** SDPO supervises **only** `sdpo_loss_mask` tokens
390
+ (post-hint recovery). If the mask is all-ignore (empty-recovery error site,
391
+ ~67% of real Claude traces under `strip_thinking`), the collator already drops
392
+ the row (`data_collator.py` L308) — the channel silently no-ops rather than
393
+ emitting a degenerate ~ln(2) signal.
394
+
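The "normalize first" advice in item 1 amounts to a one-line calculation once the two losses have been logged. A minimal sketch, assuming you have collected the first ~50 values of `loss/grpo` and `loss/sdpo_jsd`; `calibrate_beta_sdpo` is a hypothetical helper, and using medians (rather than means) is our choice to blunt early-step outliers.

```python
from statistics import median


def calibrate_beta_sdpo(grpo_losses, sdpo_jsds, target_ratio=0.2, eps=1e-8):
    """Pick beta so that beta * median(sdpo_jsd) ~= target_ratio * median(|grpo|)."""
    if not grpo_losses or len(grpo_losses) != len(sdpo_jsds):
        raise ValueError("need equally long, non-empty loss histories")
    grpo_scale = median(abs(x) for x in grpo_losses)
    jsd_scale = median(sdpo_jsds)
    if jsd_scale <= eps:
        # The hint is not moving the teacher at all; scaling cannot fix that.
        return 0.0
    return target_ratio * grpo_scale / jsd_scale


# Toy histories: a summed PG loss in the single digits, a batchmean JSD near 0.03.
beta = calibrate_beta_sdpo([4.0, -6.0, 5.0], [0.02, 0.04, 0.03], target_ratio=0.2)
```

On these toy numbers the helper returns a value around 33, far from the `0.1` starting point, which illustrates why a fixed coefficient should not be assumed to transfer between a `sum()` reduction and a `batchmean` one.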
395
+ **KL-explosion / teacher-student-drift guardrails:**
396
+
397
+ - **SDPO drift is bounded-by-construction.** Teacher = same weights + hint,
398
+ stop-grad. A *wrong* hint produces a noisier target at one masked turn, not a
399
+ corrupted reward (`research/07` §1.3). There is no replay buffer and no
400
+ separate teacher to drift apart — single-epoch keeps teacher and student on the
401
+ same weights.
402
+ - **Watch `loss/sdpo_jsd` for collapse-to-zero or blow-up.** A *good* hint should
403
+ *raise* divergence at the hinted turn (it shifts mass toward the fix); a
404
+ persistently ~0 JSD means the hint isn't moving the teacher (prune that hint
405
+ source, `research/07` §7 item 7). A diverging JSD means the clip is too loose or
406
+ `beta_sdpo` too high: cap `beta_sdpo` and/or lower `sdpo_token_clip`.
407
+ - **Guard the GRPO k1 KL independently.** Dr. GRPO's own `beta` (KL-to-ref) is
408
+ the explosion guard for the *policy*; keep it at its tuned value. SDPO's `beta_sdpo`
409
+ must not be conflated with it (§3). If total loss NaNs, bisect by zeroing
410
+ `beta_sdpo` — if it persists, the bug is in channel 1, not SDPO.
411
+ - **Shape-gate is a hard stop, logged.** If collator alignment regresses,
412
+ `_compute_sdpo_loss` skips the step with a warning rather than training on
413
+ aliased pad tokens (the silent-degenerate failure mode).
414
+
415
+ ---
416
+
417
+ ## 6. CPU-testable vs GPU-only, and the smoke plan
418
+
419
+ ### What is CPU-testable
420
+ - **The whole SDPO loss path** — student forward + hint-conditioned teacher
421
+ forward + masked JSD + `.backward()` + `Adam.step()` — on **Qwen2.5-0.5B** with
422
+ 1–2 error-bearing rollouts. This is *exactly* what
423
+ `examples/sdpo_real_trace_train_smoke/run.py` already proves for the free
424
+ `compose_loss` composer; the new test wraps it in the Dr. GRPO step.
425
+ - **The additive composition** `total = drgrpo + beta_sdpo·sdpo` and the warmup
426
+ schedule (assert `beta_sdpo` ramps, assert `loss/sdpo_jsd>0` on ≥1 step, assert
427
+ a watched param moves).
428
+ - **Dr. GRPO config pins** — assert `scale_rewards=False`, `num_iterations=1`,
429
+ k1-KL path selected (unit-level, no GPU).
430
+
431
+ ### What is GPU-only
432
+ - **Real TRL `GRPOTrainer` rollout generation** (vLLM/transformers generation at
433
+ batch size + group size K) — too slow on CPU for a live step; this is the
434
+ production "mode (1)" path in §4.3.
435
+ - **Async weight sync / off-policy control**, MoE router replay, multi-region
436
+ infra (`research/10` §4) — all out of scope for the SDPO channel test.
437
+ - **Convergence / quality** (does SDPO actually improve error-recovery) — needs a
438
+ real RL run.
439
+
440
+ ### Minimal smoke plan (`examples/sdpo_drgrpo_step_smoke/run.py`)
441
+ Analogous to the existing SDPO smoke; gates:
442
+
443
+ 1. Build a Dr. GRPO minibatch from 1–2 ingested **error-bearing** Qwen traces via
444
+ `ComposerDataCollator` (reuse `_discover_error_sessions` + the layered
445
+ `HintGenerator`); assert `sdpo_loss_mask` has ≥1 in-loss position.
446
+ 2. Construct a **synthetic Dr. GRPO channel-1 loss** standing in for
447
+ `super()._compute_loss` (advantages = `R - group_mean`, **no /std**; k1 KL
448
+ `−log r`; `sum()` reduction; no length-standardization) so the test runs
449
+ **without** spinning up TRL's full rollout machinery on CPU — mirrors how the
450
+ existing smoke uses LM-CE as the GRPO stub. Optionally also run a real
451
+ `GRPOTrainer._compute_loss` path under a `@pytest.mark.gpu` guard.
452
+ 3. `total = drgrpo_stub + beta_sdpo · _compute_sdpo_loss(...)`; `.backward()`;
453
+ `Adam.step()`.
454
+ 4. **Gates (exit 0 = PASS):** (a) all losses finite across steps; (b)
455
+ `loss/sdpo_jsd > 0` on ≥1 step (SDPO fired — shape-gate passed, hint
456
+ contributed real signal); (c) a watched parameter moved; (d) `beta_sdpo`
457
+ warmup increases monotonically; (e) zeroing `beta_sdpo` reproduces the pure
458
+ Dr. GRPO stub loss bit-for-bit (proves SDPO is purely additive). Exit 2 = SKIP
459
+ (no error-bearing sessions / no chat-template model), matching the existing
460
+ smoke's contract.
461
+
462
+ This is ~$0, CPU, single-process, and closes the one unproven edge: **a live
463
+ Dr. GRPO update step with the SDPO channel on**, end-to-end on a real HF model.
464
+
465
+ ---
466
+
467
+ ## 7. Citations
468
+
469
+ - **In-repo (authoritative substrate):** `composer_replication/loss.py`
470
+ (`compose_loss` 3-channel composer + `generalized_jsd_loss` call);
471
+ `recipes/prime_rl/composer_loss.py` (PRIME-RL adapter; SDPO `NotImplementedError`
472
+ at L257-268; parity-verified channel 1); `recipes/prime_rl/prime_rl_recipe.md`
473
+ (LossInputs shape, log-probs-not-logits limitation);
474
+ `trainer/composer_trainer.py` (`ComposerReplicationTrainer._compute_loss` and
475
+ `_compute_sdpo_loss` — the existing, correct SDPO head);
476
+ `trainer/data_collator.py` (`ctx_teacher_input_ids` + `sdpo_loss_mask` +
477
+ `_build_aligned_student_for_sdpo` equal-length alignment; hint-AND-recovery
478
+ gate L308); `examples/sdpo_real_trace_train_smoke/run.py` (the proven CPU
479
+ forward+backward+step harness this design's smoke extends).
480
+ - **`research/10-composer2-techreport-mining.md`** — the Dr. GRPO target:
481
+ length-standardization removed, no std-dev advantage normalization, **k1**
482
+ (`−log r`) KL not k3, Adam, single-epoch (a prompt never trained twice).
483
+ arXiv:2603.24477 §4.1.
484
+ - **`research/07-sdpo-hint-generator.md`** — `HintGenerator` Protocol + layered
485
+ composite, error-site detection alignment with ingestion `tool_error`, OPSD
486
+ `--jsd_token_clip` stabilizer, the "wrong hint is bounded-bad" property.
487
+ - **SDPO** — arXiv:2601.20802 (on-policy self-distillation; teacher = same model
488
+ conditioned on feedback; student trainable, teacher stop-grad; per-token
489
+ KL on the student trajectory). **OPSD** — arXiv:2601.18734 (privileged-info
490
+ teacher, generalized-JSD, token clip).
491
+ - **TRL** — `huggingface/trl@main` `trl/trainer/grpo_trainer.py`
492
+ (`GRPOTrainer` loss-override surface; confirmed via
493
+ `mcp_exa_get_code_context_exa`). The `_compute_loss(self, model, inputs)`
494
+ internal hook is what this repo already overrides; the public
495
+ `compute_loss(self, model, inputs, return_outputs=False,
496
+ num_items_in_batch=None)` HF-Trainer wrapper is shimmed in §4.1 for
497
+ version-robustness. (A confirmatory DeepWiki lookup timed out; the §4.1 guard
498
+ is written to work under either surface.)
499
+ ```
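The warmup schedule (§5 item 2) and the additivity check (smoke gates (d) and (e)) from the SDPO/Dr. GRPO integration note above can be sketched in plain Python. This is a minimal sketch, not the repo's `_beta_sdpo_now()`: the helper names `beta_sdpo_at` and `total_loss`, their arguments, and the scalar stand-in losses are assumptions made for illustration.

```python
def beta_sdpo_at(step: int, beta_max: float = 0.1, warmup_steps: int = 100) -> float:
    """Linear warmup of the SDPO weight from 0 to beta_max (hypothetical helper)."""
    if warmup_steps <= 0:
        return beta_max
    return beta_max * min(1.0, step / warmup_steps)


def total_loss(drgrpo_loss: float, sdpo_jsd: float, beta: float) -> float:
    """Additive composition: channel-1 loss plus the weighted SDPO term."""
    return drgrpo_loss + beta * sdpo_jsd


# Weight at steps 0, 50, 100, 150, 200 for a 100-step warmup.
schedule = [beta_sdpo_at(s) for s in range(0, 201, 50)]

# With beta = 0 the total must equal the Dr. GRPO stub loss exactly.
stub_loss = 1.2345
pure = total_loss(stub_loss, sdpo_jsd=0.42, beta=0.0)
```

Because the composition is a plain sum, setting the weight to zero recovers the channel-1 loss unchanged, which is the property gate (e) checks.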
research/09-composer-blog-delta-2026.md ADDED
@@ -0,0 +1,76 @@
1
+ # Composer 2.5 Blog — Delta Note (re-extraction)
2
+
3
+ > **Re-extraction date:** 2026-05-28.
4
+ > **Method:** Live re-pull of [cursor.com/blog/composer-2-5](https://cursor.com/blog/composer-2-5) and [the Composer 2 technical-report blog](https://cursor.com/blog/composer-2-technical-report) via `mcp_tavily_tavily_extract` (advanced); arXiv abstract pulls for the three footnote-1 papers; one Tavily secondary-source sweep (Jake Handy, DataCamp, Pulse2, Kingy, TechTalks).
5
+ > **Scope:** DELTAS ONLY vs `docs/COMPOSER_RECIPE_MAPPING.md` (2026-05-25), focused on (a) data generation / synthetic tasks / data mix / CPT data, and (b) targeted-textual-feedback / on-policy distillation. Verbatim blog text in the mapping doc is **not** re-derived here.
6
+
7
+ ## TL;DR
8
+
9
+ The **2.5 blog body is byte-for-byte unchanged** from what the mapping doc captured — no edits to the three method sections. All deltas come from (1) the **Composer 2 technical-report blog**, which is now cited as the K2.5 base-model source and which the mapping doc only listed as a stub "verify if needed," and (2) tighter sourcing on the three self-distillation papers. Net: a handful of real new facts, two corrections, and confirmation of the central reproducibility gap (hint generation) as still unstated.
10
+
11
+ ---
12
+
13
+ ## 1. Data generation / synthetic tasks / data mix / CPT — deltas
14
+
15
+ **Blog 2.5 verbatim sentences on data-gen (re-confirmed, unchanged from mapping doc):**
16
+ - *"To continue increasing intelligence, we both select for and create harder tasks dynamically throughout the run."*
17
+ - *"Composer 2.5 is trained with 25x more synthetic tasks than Composer 2."*
18
+ - *"We use a range of approaches for creating synthetic tasks that are grounded in real codebases."*
19
+ - Feature-deletion paragraph + reward-hacking examples (Python type-checking cache; Java bytecode decompile) — verbatim as in mapping doc §2.
20
+
21
+ **DELTAS (not in / under-stated in COMPOSER_RECIPE_MAPPING.md):**
22
+
23
+ - **[DELTA — new emphasis]** The phrase *"we both **select for** and **create** harder tasks **dynamically throughout the run**"* is a **dynamic curriculum / online task-selection** signal. The mapping doc captured "Feature Deletion + 24 unnamed generators" but did **not** flag that task difficulty is filtered *online* (the model "begins to get most training problems correct," so hard tasks are up-weighted live). This is a data-*mix*/curriculum detail with direct replication impact: our generator suite needs a difficulty filter / pass-rate gate, not just a static task bank.
24
+ - **[DELTA — new authoritative source for CPT data mix]** The Composer 2 technical-report blog states the CPT data mix explicitly: *"continued pretraining on a data mix that **emphasizes code** to deepen the base model's coding knowledge"* and *"We find that **reducing pretraining loss improves downstream RL performance**, with better base knowledge reliably translating into a better agent."* The mapping doc marked "continued pretraining on heavily code-weighted data" as `[BLOG-VERIFIED]` from the 2.5 Muon section — but the **causal claim (CPT loss ↓ ⇒ RL performance ↑)** is new and is the stated *justification* for doing CPT at all. Relevant to our "skip CPT, start from Qwen3-Coder" decision: Cursor's own evidence says base-knowledge quality gates RL ceiling, which strengthens the case for starting from an already-code-tuned base.
25
+ - **[DELTA — new artifact]** There is now a **full Composer 2 arXiv technical report: [arXiv:2603.24477](https://arxiv.org/abs/2603.24477)** and a downloadable PDF at **`https://cursor.com/resources/Composer2.pdf`** (authored by Sasha Rush et al.). The report explicitly *"covers... ablations on the training recipe, our approach to agent behavior shaping, and the design of our evaluation suite."* The mapping doc cited only the blog stub and never the arXiv ID/PDF. **This PDF is the most likely place to resolve the data-mix weighting %, the RL algorithm name, and the hint-generation mechanism — none of which are in either blog.** → Recommend a dedicated follow-up extraction of Composer2.pdf.
26
+ - **[CONFIRM — "Anyrun"]** Mapping doc flagged "Anyrun" as possibly not Cursor-sourced. **Confirmed real:** the Composer 2 report blog says *"**Anyrun**, our internal compute platform for running hundreds of thousands of sandboxed coding environments."* It is a Composer-**2** artifact (carried into 2.5), correctly attributed. Resolves the mapping doc's open flag.
27
+ - **[NO CHANGE]** No data-mix percentages, token counts, generator inventory, or feature-deletion target-selection heuristic appear in either blog. Still the key data-gen reproducibility gap.
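The online difficulty filter / pass-rate gate mentioned in the first delta could look roughly like the sketch below. Neither blog describes Cursor's actual gate, so the class name `PassRateGate`, the band limits, the window size, and the minimum-attempt rule are all assumptions.

```python
from collections import defaultdict, deque


class PassRateGate:
    """Keep tasks whose recent pass rate lies inside a target band (illustrative)."""

    def __init__(self, low=0.1, high=0.8, window=16, min_attempts=4):
        self.low, self.high = low, high
        self.min_attempts = min_attempts
        # One bounded history of 0/1 outcomes per task id.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, task_id, passed):
        self.history[task_id].append(1.0 if passed else 0.0)

    def keep(self, task_id):
        h = self.history[task_id]
        if len(h) < self.min_attempts:
            return True  # not enough evidence yet: keep sampling the task
        rate = sum(h) / len(h)
        return self.low <= rate <= self.high


gate = PassRateGate()
for _ in range(8):
    gate.record("easy", True)    # always solved
    gate.record("hard", False)   # never solved
for passed in [True, False, False, True, False, False, False, True]:
    gate.record("mid", passed)   # solved 3 times out of 8
```

Dropping both saturated and never-solved tasks is a design choice of this sketch: with group-relative advantages, a group in which every rollout gets the same reward has zero advantage everywhere and contributes no policy-gradient signal.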
28
+
29
+ ---
30
+
31
+ ## 2. Targeted textual feedback / on-policy distillation — deltas
32
+
33
+ **Blog 2.5 verbatim (re-confirmed, unchanged):** the full "Targeted RL with textual feedback" section — hint→teacher / original-context→student / on-policy KL "for that turn only," applied "to a variety of model behaviors, from coding style to model communication." Matches mapping doc §1 exactly.
34
+
35
+ **DELTAS / sharper detail:**
36
+
37
+ - **[DELTA — verbatim mechanism nuance the mapping doc compressed]** The blog's causal sentence is more specific than the mapping doc's paraphrase: *"This hint **changes the probabilities for the teacher, lowering those for the wrong tool and increasing those for a valid replacement**. For that turn only, we then **update the student weights towards to the new probabilities**."* Two implementation facts to lift exactly: (i) the teacher distribution is the **hint-conditioned forward pass of the same weights** (not a re-rollout), and (ii) the **student weights are updated** (the KL is a gradient-bearing loss on the student, teacher is stop-grad). Mapping doc had this right conceptually; the verbatim confirms teacher = stop-grad, student = trainable.
38
+ - **[DELTA — secondary-source confirmation, not new fact]** Multiple write-ups (Pulse2, TechTalks) independently describe the mechanism identically ("injects a local textual hint... teacher distribution... student... on-policy KL"). No secondary source reveals **how the hint is generated** — the single most important replication gap from the mapping doc **remains unresolved** across all live sources. No source claims templates vs. LLM-judge vs. learned generator.
39
+ - **[DELTA — coverage breadth]** Blog explicitly lists the behavior targets as **coding style, tool use, and model communication** (Pulse2/DataCamp corroborate). Mapping doc noted style+communication; "tool use" as a distinct third target is worth recording for the v0.1 hint-template taxonomy.
40
+ - **[WATCH — likely secondary-source conflation, flag do-not-cite]** TechTalks (bdtechtalks, 2026-05-25) introduces an **"SDFT"** continued-pretraining self-distillation story ("model generates its own reasoning... distills its own generated logic... constrains weight shift... adapt to a company's coding style without forgetting"). **This is NOT in the Cursor blog** and conflates the footnote-1 *continual-learning* paper (2601.19897) with Composer's RL method. Treat as journalist embellishment, **not** Cursor-stated. Recorded here so a future reader doesn't mistake it for ground truth.
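The teacher/student mechanics in the first delta can be illustrated with a toy example in plain Python. It is a sketch under stated assumptions: the three-token vocabulary, the logits, and the choice of KL(teacher ‖ student) as the divergence are illustrative (the blog does not state the direction), and the "teacher" logits stand in for a hint-conditioned forward pass of the same weights.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def turn_masked_kl(teacher_logits, student_logits, turn_mask):
    """Mean KL(teacher || student) over positions where turn_mask is 1."""
    terms = [
        kl(softmax(t), softmax(s))
        for t, s, m in zip(teacher_logits, student_logits, turn_mask)
        if m
    ]
    return sum(terms) / len(terms) if terms else 0.0


def student_step(teacher_logits, student_logits, turn_mask, lr=1.0):
    """One gradient step on the student logits; the teacher is a constant (stop-grad)."""
    updated = []
    for t, s, m in zip(teacher_logits, student_logits, turn_mask):
        if not m:
            updated.append(list(s))  # positions outside the hinted turn are untouched
            continue
        p_t, p_s = softmax(t), softmax(s)
        # d KL(teacher || student) / d student_logit_i = p_s[i] - p_t[i]
        updated.append([si - lr * (ps - pt) for si, ps, pt in zip(s, p_s, p_t)])
    return updated


# Toy vocabulary: [wrong_tool, valid_tool, other]; two token positions.
student = [[2.0, 0.5, 0.0], [1.0, 1.0, 0.0]]
teacher = [[0.0, 2.5, 0.0], [1.0, 1.0, 0.0]]  # hint lowers wrong_tool, raises valid_tool
mask = [1, 0]                                  # only position 0 is in the hinted turn

before = turn_masked_kl(teacher, student, mask)
student_after = student_step(teacher, student, mask)
after = turn_masked_kl(teacher, student_after, mask)
```

The update touches only the masked turn and moves the student's distribution toward the teacher's, mirroring the "for that turn only" wording.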
41
+
42
+ ---
43
+
44
+ ## 3. Footnote-1 self-distillation papers — arXiv IDs resolve, one-line each
45
+
46
+ All three IDs **resolve live** (2026-05-28). Note the mapping doc's footnote ordering differs from the blog's; blog footnote-1 lists them in this order:
47
+
48
+ | arXiv | Title | Resolves? | One-line core method (abstract-level) |
49
+ |---|---|---|---|
50
+ | **[2601.19897](https://arxiv.org/abs/2601.19897)** | *Self-Distillation Enables Continual Learning* | ✅ v1 | Self-distillation as a continual-learning regularizer — anchor updates to the model's own prior outputs to acquire new data without catastrophic forgetting. (Abstract body not exposed in listing; title-level only.) |
51
+ | **[2601.20802](https://arxiv.org/abs/2601.20802)** | *Reinforcement Learning via Self-Distillation* (**SDPO**) | ✅ v2 (sub 28 Jan, rev 16 Feb 2026) | **The direct formalization of Composer's method.** "SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy... converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model." |
52
+ | **[2601.18734](https://arxiv.org/abs/2601.18734)** | *Self-Distilled Reasoner: On-Policy Self-Distillation for LLMs* (**OPSD**) | ✅ v3, code at [github.com/siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD) | Single LLM; teacher = policy conditioned on privileged info, student = policy without it; per-token on-policy KL on the student's own rollouts. The original OPSD framework. |
53
+
54
+ **Corrections to the mapping doc:**
55
+ - **[CORRECTION — SDPO authorship]** Mapping doc says "Hübotter et al., 2026." **Confirmed and now precise:** Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, **Andreas Krause** (ETH Zürich group). 11 authors, CC-BY-4.0, v2 dated 2026-02-16.
56
+ - **[CORRECTION — SDPO abstract claim]** Mapping doc's quoted SDPO comparison-table framing ("environment / rich / on-policy") is a **reasonable gloss but not a verbatim abstract quote**. The actual abstract's strongest reproducibility-relevant claim is the **test-time** result: *"applying SDPO to individual questions at test time... achieving the same discovery probability as best-of-k... with 3x fewer attempts"* and that SDPO *"also outperforms baselines in standard RLVR environments that only return scalar feedback by using successful rollouts as implicit feedback for failed attempts."* The "successful-rollouts-as-implicit-feedback" trick is a **new lever** the mapping doc didn't capture — relevant if our hint generator is weak/absent (you can bootstrap hints from the model's own successful sibling rollouts). Benchmark: **LiveCodeBench v6**.
57
+ - **[CONFIRM — OPSD code]** `siyan-zhao/OPSD` link confirmed live in the arXiv comments field. The mapping doc's "lift the SDPO loss from OPSD (MIT)" plan stands; verify license on the repo directly (arXiv page shows CC-BY-4.0 for the *paper*, not necessarily the code).
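The "successful rollouts as implicit feedback" lever can be sketched as a hint-bootstrapping fallback. This is not the SDPO authors' implementation: the `Rollout` type, the `sibling_hint` function, the success threshold, and the hint wording are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rollout:
    text: str
    reward: float


def sibling_hint(group: list[Rollout], failed: Rollout,
                 success_threshold: float = 1.0) -> Optional[str]:
    """Build a hint for a failed rollout from the best successful sibling, if any."""
    winners = [r for r in group if r is not failed and r.reward >= success_threshold]
    if not winners:
        return None  # no successful sibling: this group yields no implicit feedback
    best = max(winners, key=lambda r: r.reward)
    # Hypothetical template; the real feedback format is not specified in the abstract.
    return "A sibling attempt on this task succeeded with:\n" + best.text


group = [
    Rollout("call read_file, then apply_patch", reward=1.0),
    Rollout("call missing_tool", reward=0.0),
    Rollout("give up", reward=0.0),
]
hint = sibling_hint(group, failed=group[1])
no_hint = sibling_hint(group[1:], failed=group[1])
```

When every sibling fails the function returns `None`, so the caller has to fall back to another hint source or skip the distillation term for that group.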
58
+
59
+ ---
60
+
61
+ ## 4. RL algorithm / reward-hacking / behavioral reward — deltas
62
+
63
+ - **[NO CHANGE — RL algorithm name still absent]** Neither blog names the outer RLVR algorithm (PPO/GRPO/DAPO). Mapping doc's "PPO or GRPO variant `[EXTRAPOLATED]`" verdict stands. The Composer 2 report blog adds only that *"RL training improves both **average and best-of-K** performance, suggesting the model is learning new solution paths rather than just concentrating on known ones"* — an **algorithm-agnostic** observation (best-of-K ↑ implies exploration, not just sharpening). **arXiv:2603.24477 / Composer2.pdf is the place to find the algorithm name.**
64
+ - **[DELTA — reward-hacking mitigation, slightly sharper]** Blog wording: hacks were found *"using **agentic monitoring tools**"* and they *"demonstrate the increasing care necessary for large scale RL."* Still no specifics (no static-analysis/sandbox-lockdown detail). Mapping doc's "build the monitor in v0.1" plan is unaffected; no new implementation handle.
65
+ - **[DELTA — behavioral reward framing]** 2.5 blog: *"we improved behavioral aspects of the model like **communication style and effort calibration**. These dimensions are not well captured by existing benchmarks, but we find that they matter for real-world usefulness."* Mapping doc captured this as `[BLOG-VERIFIED]`. Delta: the blog **strongly implies** (via "we applied this method to a variety of model behaviors, from coding style to model communication") that **behavioral rewards are trained via the targeted-textual-feedback channel itself**, not a separate RM. The Composer 2 report blog also promises *"our approach to **agent behavior shaping**"* as a report section → another reason to pull Composer2.pdf.
66
+ - **[DELTA — net-new context, out of scope but record]** 2.5 blog adds a **SpaceXAI / Colossus 2** paragraph: *"Together with SpaceXAI, we're training a significantly larger model from scratch, using **10x more total compute**. With Colossus 2's **million H100-equivalents**..."* This is a *future from-scratch* model, **not** Composer 2.5's recipe — irrelevant to replication but absent from the mapping doc; recorded so it isn't mistaken for a 2.5 training fact.
67
+
68
+ ---
69
+
70
+ ## Action items surfaced by this delta pass
71
+
72
+ 1. **Pull `https://cursor.com/resources/Composer2.pdf` (arXiv:2603.24477)** — highest-value unread artifact; likely resolves data-mix %, RL-algo name, hint-generation, behavior-shaping. (Recommend a dedicated subagent.)
73
+ 2. **Add an online difficulty filter / pass-rate gate** to the synthetic-task generator plan (the "select for... dynamically" delta), not just a static bank.
74
+ 3. **Record the SDPO "successful-rollout-as-implicit-feedback" trick** as a hint-bootstrapping fallback for v0.1 when no external hint source exists.
75
+ 4. **Update mapping doc citations**: resolve the Anyrun flag (confirmed), add arXiv:2603.24477 + Composer2.pdf, correct SDPO author list, add LiveCodeBench-v6 as SDPO's eval, and append a do-not-cite note on the TechTalks "SDFT" conflation.
76
+ 5. Hint-generation mechanism **remains the #1 reproducibility gap** — unresolved by every live source checked.
research/10-composer2-techreport-mining.md ADDED
@@ -0,0 +1,136 @@
1
+ # Composer 2 Technical Report — Mining Notes (arXiv:2603.24477)
2
+
3
+ > **Extraction date:** 2026-05-28.
4
+ > **Primary source:** Full text of the **Composer 2 Technical Report** (Cursor Research Team; corresponding author Alexander M. "Sasha" Rush), PDF at `https://cursor.com/resources/Composer2.pdf` and arXiv `2603.24477` (v1 25 Mar 2026, v2 26 Mar 2026; cs.SE / cs.LG; "Aaron Chan and 53 other authors").
5
+ > **Method:** `mcp_tavily_tavily_extract` (advanced) on the PDF returned the **complete report body incl. References + Appendices A–C** (~148 KB). Cross-checked against `mcp_exa_crawling_exa` (full re-pull, identical text) and a `mcp_tavily_tavily_search` confirming the arXiv ID, abstract, "Dr. GRPO" passage, and the technical-report blog.
6
+ > **Tagging:** **[REPORT-VERIFIED]** = verbatim/paraphrase from the arXiv report. **[SECONDARY]** = blog/third-party. **[ABSENT]** = explicitly looked for, not in the report.
7
+ > **Scope note:** This report is **Composer 2**, not Composer **2.5**. Several recipe items the 2.5 blog advertises (targeted textual-feedback/hint distillation, "25× synthetic tasks", Sharded Muon) are **not** in this document — see §3 and the corrections box.
8
+
9
+ ---
10
+
11
+ ## TL;DR — did it resolve the three open questions?
12
+
13
+ | Open question (from delta note 09) | Resolved? | Answer |
14
+ |---|---|---|
15
+ | **RL algorithm NAME** | ✅ **YES** | A multi-sample policy-gradient (GRPO-family) algorithm built explicitly on **Dr. GRPO** [34]: GRPO with the **length-standardization term removed** and **no std-dev advantage normalization**. Optimizer = **Adam**, single-epoch, fixed group size, full-parameter. KL via the **k1 estimator (−log r)**. |
16
+ | **Data-mix weighting % / generator inventory / token counts** | ⚠️ **PARTIAL** | CPT is a **3-phase code-dominated mix** (32k → 256k → SFT) but **no %s and no token counts** are given. RL task mix is given only as a **category histogram (Fig. 3)**, not generator names or weights. No "Feature Deletion" generator inventory (that was 2.5-blog). |
17
+ | **HINT-generation mechanism (targeted textual feedback)** | ❌ **ABSENT** | **The hint/teacher-student textual-feedback mechanism is NOT in the Composer 2 report at all.** It is a Composer **2.5** feature. Composer 2 shapes behavior with **auxiliary scalar rewards + a nonlinear length penalty**, not hint distillation. The #1 reproducibility gap remains unresolved by this artifact. |
18
+
19
+ **Net:** The report fully answers the RL-algorithm question (the single biggest win), partially answers data-mix, and does **not** touch hint generation. It also delivers a large amount of previously-unstated **infrastructure** detail (Anyrun internals, async RL stack, MoE router replay, precision recipe) and a **correction** to two prior assumptions (optimizer is Adam not Muon; base is Kimi K2.5 1.04T/32B).
20
+
21
+ ---
22
+
23
+ ## 1. Data generation / CPT data-mix / curriculum [§3, §4]
24
+
25
+ ### 1.1 Continued pretraining (CPT) — [REPORT-VERIFIED]
26
+ - Base model = **Kimi K2.5** [67], a **1.04T-param / 32B-active MoE** (Appendix B; selected over GLM-5 and DeepSeek V3.2 on internal *FreshBench* knowledge, *State Tracking* (LoCoDiff-style), and *codebase perplexity*; agentic benchmarks **deliberately excluded** from base-model selection "as agentic and long-horizon capabilities can drastically change during the RL stage").
27
+ - CPT is **"a large code-dominated data mix"** done in **three phases**:
28
+ 1. **Bulk of compute at 32k sequence length**,
29
+ 2. a shorter **long-context extension phase to 256k**,
30
+ 3. a short **SFT phase on targeted coding tasks**.
31
+ - Training: **MXFP8 on NVIDIA B300s**, **AdamW** optimizer. Eval loss on internal codebase **"decreases log-linearly"** over the run.
32
+ - **Causal CPT→RL claim (the justification for doing CPT):** they replicate the recipe on **Qwen3-Coder-30B-A3B** at **three log-spaced compute levels (small/medium/large)**, each + identical SFT + identical RL run, and show **"cross-entropy loss is … predictive of downstream RL performance"** (Fig. 2). → Direct support for our "start from an already-code-strong base" decision.
33
+ - **Multi-Token Prediction (MTP):** extra MTP layers [17,11] trained from scratch on the same mix for speculative decoding, via **self-distillation** to the main LM head's logits; MTP layers cut from the **middle** of the CPT run and trained jointly during the long-context + SFT phases. *(This is the only "self-distillation" in the report — it is for MTP/spec-decode, NOT for hints.)*
34
+ - **[ABSENT]** No data-mix percentages, no token/byte counts, no list of CPT data sources.
35
+
36
+ ### 1.2 RL task distribution & dynamic curriculum — [REPORT-VERIFIED]
37
+ - RL tasks **"run in environments that emulate real Cursor sessions as closely as possible."** Problem distribution **"reflects the most common use cases"**; **Fig. 3** gives the category breakdown (x-axis "% of Problems", ~0–40%): **Iterate On Feature, Debugging, New Feature, Refactor, Understanding Codebase, Documentation, Testing, Code Review, Optimize, Devops, Migration, Deletion, Other.** *(This is the closest the report gets to a "data mix" — categorical, not weighted %s, no generator names.)*
38
+ - **Dynamic difficulty curriculum (verbatim):** *"In later stages of training, we use simple heuristics—such as **number of turns and thinking tokens of rollouts**—to **upsample increasingly harder data points**."* → Confirms delta note 09's "select for harder tasks dynamically" as an **online up-sampling gate keyed on turns + thinking-token count**. Replication handle: rank tasks by rollout length/turn-count, up-weight the long-tail late in training.
39
+ - **[ABSENT]** No synthetic-task **generator inventory** (no "Feature Deletion" et al.), no "25× synthetic tasks" figure, no synthetic-vs-real split. Those are Composer **2.5**-blog claims and are **not** in this report.
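The replication handle for the dynamic curriculum (rank by turns and thinking tokens, then up-weight the long tail late in training) could be sketched as follows. The report only says "simple heuristics", so the scoring weights, the `temperature` knob, and the task records are assumptions.

```python
import random


def difficulty_score(num_turns, thinking_tokens, turn_weight=1.0, token_weight=0.001):
    """Heuristic difficulty from rollout statistics (weights are illustrative)."""
    return turn_weight * num_turns + token_weight * thinking_tokens


def sampling_weights(tasks, temperature=1.0):
    """Weights proportional to score ** temperature; temperature 0 means uniform."""
    scores = [difficulty_score(t["turns"], t["thinking_tokens"]) for t in tasks]
    raw = [s ** temperature for s in scores]
    total = sum(raw)
    return [w / total for w in raw]


tasks = [
    {"id": "rename-var", "turns": 3, "thinking_tokens": 500},
    {"id": "fix-race", "turns": 20, "thinking_tokens": 12000},
    {"id": "migrate-db", "turns": 45, "thinking_tokens": 30000},
]

early = sampling_weights(tasks, temperature=0.0)  # uniform sampling early in training
late = sampling_weights(tasks, temperature=1.0)   # harder tasks up-weighted later
rng = random.Random(0)
batch = rng.choices(tasks, weights=late, k=8)     # draw a late-stage batch
```

Raising `temperature` over the run gives a smooth transition from uniform sampling to difficulty-weighted sampling; a hard cutoff would be an equally plausible reading of the report.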
40
+
41
+ ---
42
+
43
+ ## 2. RL ALGORITHM [§4.1] — [REPORT-VERIFIED], the headline result
44
+
45
+ **Algorithm family:** *"a policy gradient algorithm with multiple samples per prompt [53 = DeepSeekMath/GRPO, 2 = REINFORCE-style RLOO] and a fixed group size."* Operates in the **single-epoch regime** (a prompt is **never trained on twice**). **Adam** optimizer; **full-parameter** update. Highly **asynchronous** (independent train + rollout workers).
46
+
47
+ **Specific GRPO modifications (the "name" + the deltas):**
48
+ - Built on **Dr. GRPO** [34 = Liu et al., *Understanding R1-Zero-like training*, arXiv 2503.20783]: verbatim *"As in Dr. GRPO, … crucial to minimize the bias in the gradients that can arise from transforming the underlying advantage."*
49
+ - **Remove the length-standardization term from GRPO** (it "introduces a length bias").
50
+ - **Do NOT normalize group advantages by their standard deviation** — std-norm "results in the degenerate case where small behavioral differences get massively upweighted within a group where every rollout achieves equal correctness."
51
+ - **Overlong-rollout masking [78 = DAPO/Yu et al.]: NOT used.** They *"did not see benefits with overlong masking at small scale and opted not to mask rollouts that exceed the maximum sequence length"*; the self-summary system limits overlong cases anyway. *(So: Dr. GRPO-style, explicitly NOT DAPO's overlong masking; DAPO [78] and GSPO [82] are cited but as related work / for router-replay, not adopted wholesale.)*
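The two Dr. GRPO modifications above can be contrasted with the original GRPO formulas in a few lines. This is a simplified sketch (no clipping, no KL term); the function names and the constant-normalizer choice `max_len` are assumptions, not code from the report.

```python
def grpo_advantages(rewards, eps=1e-6):
    """Original GRPO: mean-centred and divided by the group standard deviation."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr. GRPO style: mean-centred only, no division by the standard deviation."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def rollout_term(token_losses, length_normalize, max_len):
    """Aggregate per-token losses for one rollout."""
    if length_normalize:
        return sum(token_losses) / len(token_losses)  # GRPO: 1/|o_i| factor
    return sum(token_losses) / max_len                # constant normalizer instead


# A group where every rollout is essentially equally correct.
rewards = [1.0, 1.0, 1.0, 1.001]
with_std = grpo_advantages(rewards)
without_std = dr_grpo_advantages(rewards)

short, long_ = [1.0, 1.0], [1.0, 1.0, 1.0, 1.0]
```

In this group the std-normalized advantages are of order one even though the rewards differ by 0.001, which is the degenerate up-weighting the report describes. With length normalization a two-token and a four-token rollout with the same per-token loss contribute equally, so the per-token weight depends on rollout length; with a constant normalizer it does not.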
52
+
53
+ **KL regularization — exact formulation [§4.1, Fig. 4]:**
54
+ - Uses **KL(q‖p) = E_{x∼q}[−log r(x)], r(x)=p(x)/q(x)** for regularization (like DeepSeekMath [53] and Kimi k1.5 [66]).
55
+ - **Chooses the k1 estimator `k1 = −log r`** over the popular **k3 = (r−1) − log r** [Schulman 52], because (citing Amini et al. [6]) k3's variance "increases drastically as p and q diverge": at large KL the variance of the k3 estimate is "extremely large." (Their note adds that k2 is low-bias but still biased.) → **Replication handle: use the simple `−log r` KL penalty rather than k3 for agentic long-horizon RL; both are unbiased under x∼q, so the choice is about variance.**
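The two estimators can be compared exactly on a small discrete example, using the definitions above (`r(x) = p(x)/q(x)`, samples drawn from `q`). The distributions and the helper `estimator_moments` are illustrative assumptions; the sketch computes exact moments rather than Monte Carlo estimates.

```python
import math


def kl_exact(q, p):
    """KL(q || p) = E_{x~q}[log q(x) - log p(x)]."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


def estimator_moments(q, p, estimator):
    """Exact mean and variance of a per-sample estimator under x ~ q."""
    values = [estimator(pi / qi) for qi, pi in zip(q, p)]
    mean = sum(qi * v for qi, v in zip(q, values))
    var = sum(qi * (v - mean) ** 2 for qi, v in zip(q, values))
    return mean, var


def k1(r):
    return -math.log(r)


def k3(r):
    return (r - 1.0) - math.log(r)


q = [0.7, 0.2, 0.1]  # sampling distribution (x ~ q)
p = [0.5, 0.3, 0.2]  # reference distribution

true_kl = kl_exact(q, p)
k1_mean, k1_var = estimator_moments(q, p, k1)
k3_mean, k3_var = estimator_moments(q, p, k3)
```

Both means equal the true KL because `E_q[r − 1] = 0`. The variances are computed for inspection only: the report's claim concerns the regime where `p` and `q` diverge strongly, which this small example is not meant to reproduce.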
56
+
57
+ **Async-rollout infra / off-policy control [§4.1, §6.2]:**
58
+ - Minimize off-policyness via **fast weight sync + in-flight (mid-rollout) weight updates**, *"similar to **PipelineRL** [48]"* — inference workers update weights mid-rollout so later tokens are less off-policy.
59
+ - **MoE router replay [38, 82]:** inference engine returns selected expert indices per token per MoE layer; training forward pass **overrides the router's expert assignment to match** (router still computes gating scores so gradients flow). They **extend** replay by **filtering replayed experts whose gating scores fall below a plausibility threshold from the router's own top-k, replacing them with the router's candidates** — reduces p99 numerics mismatch between inference and training forward passes. *(Critical for MoE-base RL stability; directly relevant if we RL a MoE.)*
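The extended router replay can be sketched for a single token of a single MoE layer. The report gives no formula for the plausibility threshold, so the rule used here (a replayed expert is kept if its training-side gating score is at least `rel_threshold` times the weakest score in the router's own top-k) is an assumption, as are the function names and scores.

```python
def top_k_indices(scores, k):
    """Indices of the k largest gating scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def replay_experts(gate_scores, replayed, k, rel_threshold=0.1):
    """Choose experts for one token in the training forward pass.

    gate_scores: the training router's non-negative gating score per expert.
    replayed:    expert indices the inference engine actually used.
    """
    own_top_k = top_k_indices(gate_scores, k)
    floor = rel_threshold * gate_scores[own_top_k[-1]]  # assumed plausibility rule
    kept = [e for e in replayed if gate_scores[e] >= floor]
    # Fill the slots of rejected experts with the router's own best candidates.
    for e in own_top_k:
        if len(kept) == k:
            break
        if e not in kept:
            kept.append(e)
    return kept


scores = [0.40, 0.30, 0.15, 0.10, 0.04, 0.01]
plausible = replay_experts(scores, replayed=[0, 2], k=2)    # both replayed experts kept
implausible = replay_experts(scores, replayed=[0, 5], k=2)  # expert 5 is replaced
```

Gating scores are still computed by the training router in this scheme, which is what lets gradients flow through the gate even when the expert assignment is overridden.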
60
+
61
+ **Reward structure [§4.1–4.2]:**
62
+ - Reward based on **"code's correctness, succinctness, and conformance to software engineering principles."**
63
+ - **best-of-K does NOT trade off vs average:** both rise together over training (Fig. 5) → RL is *expanding* solution coverage, not just sharpening (notable vs the "RL only concentrates mass" literature [79,32,8,74,61]).
64
+
65
+ **Reward-hacking safeguards — [ABSENT/THIN]:** This report does **not** contain the Python-typecheck-cache / Java-bytecode reward-hack anecdotes (those are 2.5-blog). The only related safeguards here are **strict tool-argument checks** and **tool removal for steerability** in training environments (§6.2), and general monitoring for **emergent behaviors** (§4.2). No dedicated "agentic monitoring tool" section.
66
+
67
+ ---
68
+
69
+ ## 3. Targeted textual feedback / hint distillation — **[ABSENT]**
70
+
71
+ **Finding: The Composer 2 technical report contains NO hint-generation / teacher-student textual-feedback / on-policy KL-to-hint-conditioned-teacher mechanism.** Searched the full text for hint / teacher / student / textual feedback / distill — the only "distillation" is **MTP self-distillation to the LM head's logits** (§3.1, spec-decode), unrelated to behavior shaping.
72
+
73
+ **What Composer 2 does for behavior shaping instead [§4.2 "Agent Behavior"] — [REPORT-VERIFIED]:**
74
+ - **Auxiliary scalar rewards**, not hints: *"we apply an array of auxiliary rewards … rewards for coding style, communication, and product-specific penalties for poor tool calls, such as creating to-do list items and then leaving them unfinished."*
75
+ - **Reactive reward addition:** they "monitor the model for emergent behaviors and occasionally introduce additional behavior rewards" (examples observed: leaving long CoT in code comments; collapsing to terminal-tool-only).
76
+ - **Nonlinear length / effort penalty (exact equation):**
77
+ `C_length{k,q}(x) = ((1 + k·x)^{1−q} − 1) / (k·(1−q))`, concave-down & increasing, where **x = a weighted combination of {thinking tokens, tool-calling tokens, tool-output tokens, final-message tokens, # tool calls, # turns}** and `k, q` are curvature hyperparameters (Fig. 6). Goal: be quick on easy tasks, think longer on hard tasks; observed to induce **parallel tool calls**.
78
+ - **Self-Summarization [§4.1, from Composer 1.5 [64]]:** rollouts are chains joined by self-summaries; **final reward is assigned to all tokens in the chain** (up-weights good agent turns *and* the summaries that enabled them; down-weights lossy summaries). Reduces error vs prompt-based compaction while using fewer tokens and reusing KV cache.
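The length / effort penalty equation above is easy to evaluate directly. In this sketch the formula is taken from the report as transcribed, while the weights in the combination `x`, the statistics, and the values of `k` and `q` are illustrative assumptions (the report does not give them here).

```python
def length_penalty(x, k, q):
    """C_{k,q}(x) = ((1 + k*x)**(1 - q) - 1) / (k * (1 - q)), for q != 1."""
    if q == 1.0:
        raise ValueError("q = 1 is the log limit and is not handled in this sketch")
    return ((1.0 + k * x) ** (1.0 - q) - 1.0) / (k * (1.0 - q))


def effort(stats, weights):
    """x: weighted combination of rollout statistics (weights are illustrative)."""
    return sum(weights[name] * stats[name] for name in weights)


weights = {
    "thinking_tokens": 1.0,
    "tool_call_tokens": 1.0,
    "num_tool_calls": 50.0,
    "num_turns": 100.0,
}
stats = {
    "thinking_tokens": 4000,
    "tool_call_tokens": 600,
    "num_tool_calls": 12,
    "num_turns": 8,
}

x = effort(stats, weights)
penalty = length_penalty(x, k=0.001, q=0.5)

# Marginal cost of the first and second 1000 units of effort.
first = length_penalty(1000, 0.001, 0.5) - length_penalty(0, 0.001, 0.5)
second = length_penalty(2000, 0.001, 0.5) - length_penalty(1000, 0.001, 0.5)
```

For positive `k` and `q` the derivative is `(1 + k·x)^(−q)`, so each additional unit of effort costs less than the previous one; that shape is what makes the penalty bite hardest on short, easy tasks.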
79
+
80
+ > **Implication for the replication framework:** To reproduce Composer **2.5**'s hint mechanism we still must look elsewhere — the **SDPO (arXiv 2601.20802) / OPSD (2601.18734)** papers from delta note 09 remain the only formalizations, and **how Cursor generates the hint text itself is still unstated in every Cursor artifact.** Composer 2's behavior shaping (auxiliary rewards + the length-penalty equation above) is a **fully reproducible, hint-free alternative** we can adopt for v0.1.
81
+
82
+ ---
83
+
84
+ ## 4. Other replication-relevant detail [§6 Infrastructure, §5 CursorBench, App.] — [REPORT-VERIFIED]
85
+
86
+ **Optimizer — CORRECTION:** report says **AdamW (CPT) / Adam (RL)**. **There is NO "Sharded Muon" in the Composer 2 report** — the Muon claim came from the 2.5 blog and should be tagged 2.5-only / re-verified, not assumed for Composer 2.
87
+
88
+ **Parallelism / sharding layout — CORRECTION to "HSDP":**
89
+ - Prior stacks used **FSDP + EP + TP** (EP coupled to TP). **Composer 2 decouples EP from TP** and uses **Context Parallelism (CP)** as the primary long-context axis (less communication than TP; CP folded into the FSDP dimension). **No mention of "HSDP"**: the report describes **FSDP/ZeRO [50,81] + CP + decoupled EP**, with **DeepEP** [80] for token dispatch/combine.
90
+ - **Exact degrees:** **EP=8, CP=2 for CPT**; **EP=8, CP=8 for RL.** MLA attention with latent-vector all-gather trick; Llama-style 2×CP chunk load-balancing [33].
91
+ - **Global sequence packing** before each RL step to balance DP compute across variable-length rollouts (accounts for quadratic attention cost).
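To make the packing bullet concrete, here is a minimal sketch of balancing variable-length rollouts across data-parallel ranks with a cost that grows quadratically in sequence length. The report does not describe its packing algorithm; the greedy longest-first heuristic, the cost coefficients, and the function names below are all assumptions chosen for illustration.

```python
import heapq


def attention_aware_cost(length: int, linear: float = 1.0, quadratic: float = 1e-4) -> float:
    """Hypothetical per-sequence cost: a linear term plus a quadratic attention term."""
    return linear * length + quadratic * length * length


def balance_rollouts(lengths: list, num_ranks: int) -> list:
    """Greedy longest-processing-time assignment of rollouts to DP ranks.

    Returns one list of rollout indices per rank. Each rollout goes to the
    rank with the smallest accumulated cost so far.
    """
    heap = [(0.0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]
    # Place the most expensive rollouts first so small ones can fill the gaps.
    order = sorted(
        range(len(lengths)),
        key=lambda i: attention_aware_cost(lengths[i]),
        reverse=True,
    )
    for i in order:
        load, rank = heapq.heappop(heap)
        assignment[rank].append(i)
        heapq.heappush(heap, (load + attention_aware_cost(lengths[i]), rank))
    return assignment


if __name__ == "__main__":
    print(balance_rollouts([8000, 1000, 1000, 1000, 1000, 4000], num_ranks=2))
```

Balancing by token count alone would treat one 8,000-token rollout like eight 1,000-token ones; the quadratic term is what makes the long rollout count for more.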
92
+
93
+ **Precision recipe [§6.1]:**
94
+ - **MoE forward = a novel NVFP4 variant**: BF16→FP4E2M1 with **FP8E4M3 per-block scales (block 16) + FP32 per-token scales** (per-tensor FP32 scales were "fragile" → batch-variance collapse + future-token leakage/biased grads). **MoE backward = standard MXFP8** (FP8E4M3 values, FP8E8M0 scales per 32-element block); the higher precision is affordable because the backward pass runs only on the training cluster. Trainer forward must **numerically match inference** for stability. IEEE `__fdiv_rn` is **critical** for NVFP4 (fast-approximate division diverges after ~100 RL steps); fast-approximate division is OK for MXFP8.
95
+ - Kernels in **CUDA/PTX/ThunderKittens-ParallelKittens** [56,59]; FA4 backward (DeepSeek QK192/V128 shapes) co-developed w/ Colfax; GEMMs open-sourced into ThunderKittens [21].
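The two-level scaling in the forward recipe can be illustrated with a pure-Python fake-quantization of one token's activations. This is a rough sketch under stated assumptions, not the report's kernel: it follows the usual NVFP4 two-level layout with the per-tensor scale swapped for a per-token one, it keeps the block scale in full precision instead of rounding it to FP8E4M3, and it breaks rounding ties toward the smaller magnitude rather than to-nearest-even. All names are hypothetical.

```python
# Non-negative magnitudes representable in FP4 E2M1.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_MAX = 6.0
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value
BLOCK = 16


def round_to_e2m1(v: float) -> float:
    """Clamp to the E2M1 range and snap to the nearest grid magnitude."""
    sign = -1.0 if v < 0 else 1.0
    mag = min(abs(v), E2M1_MAX)
    return sign * min(E2M1_GRID, key=lambda g: abs(g - mag))


def fake_quantize_token(row: list) -> list:
    """Quantize then dequantize one token (row) with block-16 and per-token scales."""
    amax = max(abs(v) for v in row)
    if amax == 0.0:
        return [0.0 for _ in row]
    # Per-token FP32 scale, chosen so every block scale fits the E4M3 range.
    token_scale = amax / (E2M1_MAX * E4M3_MAX)
    out = []
    for start in range(0, len(row), BLOCK):
        block = row[start:start + BLOCK]
        block_amax = max(abs(v) for v in block)
        # A real kernel would round this to FP8 E4M3; omitted in this sketch.
        block_scale = block_amax / (E2M1_MAX * token_scale)
        step = block_scale * token_scale
        if step == 0.0:
            out.extend(0.0 for _ in block)
            continue
        out.extend(round_to_e2m1(v / step) * step for v in block)
    return out


if __name__ == "__main__":
    row = [0.1 * i - 1.5 for i in range(32)]
    print(fake_quantize_token(row)[:8])
```

Since both scales are derived from a single token, quantizing one token never depends on other tokens in the batch or on later positions, which is one way to read the report's remark about per-tensor scales and future-token leakage.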
96
+
97
+ **RL infra [§6.2] — 4 decoupled services (training / environments / inference / evals):**
98
+ - **Training:** fully async on **Ray [42] + PyTorch**, centralized **reconciler** w/ slot-based sample lifecycle + staleness-balancing scheduler; **futures**-based eager exec; Ray object store w/ NVMe spill; fault-tolerant to process-group level, warm-standby nodes, live code updates; **policy-aware rollout-level + group-level checkpointing** (codebase memory snapshots; advantage-tagged sequences w/ policy versions to NFS). Production run spanned **3 GPU regions + 4 CPU regions.**
99
+ - **Anyrun (verbatim internals):** *"an internal compute platform built for running untrusted code at scale … the same platform that powers Cloud Agents and Automations."* Global router → multiple Anyrun clusters; each cluster schedules **>500 pods/sec**, manages **hundreds of thousands of pods/cluster**; **each pod = a dedicated Firecracker VM** (full dev env incl. browser/GUI for computer use); x86+ARM mix; pressure-aware bin-packing. **Forking & snapshotting at filesystem + memory level** (→ mid-trajectory checkpoint, post-rollout introspection); same-node fork preferred else live-migrate. **Anygress** egress proxy (TCP-layer redirect via injected root CA, header stripping). **Shadow deployment of the Cursor backend** for faithful tools; tools dynamically per-environment (stricter arg checks / tool removal in training).
100
+ - **Inference:** **partner = Fireworks AI.** Every step, weights synced to inference via **S3 with per-rank delta compression** (RL diffs compress to "a handful of GB" for the 1T model); sharded upload/download; geo-distributed US+EU clusters reconstruct from the shared delta chain (no direct train↔inference connectivity).
101
+ - **Online evals:** pinned production backend + Cursor client per eval job; lease an eval deployment, move GPUs, cross-region weight sync.
102
+
103
+ **CursorBench (eval-suite design) [§5]:**
104
+ - Internal suite from **real Cursor engineering-team agent sessions** (avoids train-set contamination). Motivated by 4 failure modes of public benchmarks (domain mismatch, prompt over-specification, contamination/overfit, narrow scope).
105
+ - **Quantified hardness vs public sets:** median **181 lines changed** (vs 7–10 for SWE-bench Verified/Multilingual) and median prompt length **390 chars** (vs 1,185–3,055) → larger + more under-specified. Versioned (**CursorBench-3** > 2× the median task size of v1; Table 1 uses CursorBench-3).
106
+ - **Targeted sub-evals:** intent, instruction-following, **eager-editing** (don't edit when you shouldn't), code-quality (LLM-judge rubrics), **interruption** (mid-rollout user feedback). Built by "identifying dimensions, selecting eliciting data points, writing rubrics."
107
+ - **Headline results (Table 1):** Composer 2 = **CursorBench 61.3 / SWE-bench Multilingual 73.7 / Terminal-Bench 61.7**; Kimi K2.5 base = 36.0 / 65.1 / 47.3 → large RL+CPT lift.
108
+
109
+ **Ablations actually present (for "ablations on the training recipe"):**
110
+ 1. **CPT→RL** (Qwen3-Coder-30B, 3 compute levels; Fig. 2) — CE loss predicts RL reward.
111
+ 2. **KL estimator** k1 vs k3 (Fig. 4) — variance argument for k1.
112
+ 3. **GRPO term removals** — length-standardization & std-norm removed (qualitative justification, no head-to-head curve).
113
+ 4. **Overlong masking** — tried, no benefit at small scale, dropped.
114
+ 5. **NVFP4 scaling scheme** (per-token vs per-tensor) and **IEEE vs fast-approx division** — stability ablations.
115
+ 6. **best-of-K vs average** over training (Fig. 5).
116
+ *(No single consolidated "leave-one-out recipe component" ablation table; ablations are distributed and partly qualitative.)*
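For readers unfamiliar with the k1 / k3 naming in ablation 2, the sketch below writes out the two per-sample KL estimators from the Schulman KL note [52], for samples drawn from the current policy. It only shows the definitions and checks that both have the same expectation on a toy categorical distribution; it does not reproduce the report's variance argument, and the helper names and toy probabilities are assumptions.

```python
import math


def k1(logp_policy: float, logp_ref: float) -> float:
    """k1 estimator of KL(policy || ref): the plain log-ratio. Can be negative per sample."""
    return logp_policy - logp_ref


def k3(logp_policy: float, logp_ref: float) -> float:
    """k3 estimator of KL(policy || ref): (r - 1) - log r with r = ref / policy. Always >= 0."""
    log_ratio = logp_ref - logp_policy
    return math.expm1(log_ratio) - log_ratio


def expected(estimator, policy: list, ref: list) -> float:
    """Exact expectation of a per-sample estimator under samples from `policy`."""
    return sum(p * estimator(math.log(p), math.log(r)) for p, r in zip(policy, ref))


if __name__ == "__main__":
    policy = [0.7, 0.2, 0.1]  # toy distributions, illustrative only
    ref = [0.5, 0.3, 0.2]
    print(expected(k1, policy, ref), expected(k3, policy, ref))
```

Both estimators are unbiased for the same KL; they differ in their per-sample values, which is where the choice between them matters.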
117
+
118
+ ---
119
+
120
+ ## Corrections / cautions for the mapping doc
121
+
122
+ - **[CORRECTION] Optimizer:** Composer **2** uses **Adam/AdamW**, **not Muon**. Treat "Sharded Muon" as a **2.5-blog-only, unverified-for-2** claim.
123
+ - **[CORRECTION] Sharding:** report describes **FSDP+CP+decoupled-EP (EP=8/CP=2 CPT, EP=8/CP=8 RL)**, **not "HSDP."**
124
+ - **[CORRECTION] "RL algorithm = PPO/GRPO `[EXTRAPOLATED]`"** → now **[REPORT-VERIFIED] Dr. GRPO-style** (length-std removed, no std-norm, k1 KL, Adam, single-epoch, MoE router-replay). DAPO overlong-masking explicitly rejected.
125
+ - **[CONFIRM] Anyrun** real, with full internals (Firecracker VMs, >500 pods/s, fork/snapshot, Anygress).
126
+ - **[CONFIRM] base model = Kimi K2.5 1.04T/32B** (over GLM-5, DeepSeek V3.2).
127
+ - **[CAUTION] Hint mechanism, "25× synthetic tasks", Feature-Deletion generator, reward-hack anecdotes are NOT in this (Composer 2) report** — do not cite this PDF for them; they are Composer 2.5-blog material.
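As a reference point for the Dr. GRPO correction above, here is a minimal sketch of the two removals it names: advantages are centered on the group mean without dividing by the group standard deviation, and the token-level objective is normalized by a constant instead of each sequence's own length. The function names, the `max_tokens` constant, and the flat list inputs are illustrative assumptions; clipping, the KL term, and router replay are left out.

```python
def dr_grpo_advantages(rewards: list) -> list:
    """Group-relative advantages: reward minus group mean, with no std normalization."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def dr_grpo_objective(token_terms: list, advantages: list, max_tokens: int) -> float:
    """Aggregate per-token surrogate terms with a constant normalizer.

    token_terms[i] holds the per-token terms (e.g. importance ratios) of
    rollout i. Dividing by group_size * max_tokens, rather than by each
    rollout's length, avoids down-weighting tokens of long rollouts.
    """
    total = 0.0
    for terms, adv in zip(token_terms, advantages):
        total += adv * sum(terms)
    return total / (len(token_terms) * max_tokens)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]
    adv = dr_grpo_advantages(rewards)
    terms = [[1.0, 1.0], [1.0], [1.0, 1.0, 1.0], [1.0]]
    print(adv, dr_grpo_objective(terms, adv, max_tokens=4))
```

In a TRL-based replication the same behavior would normally come from trainer configuration rather than hand-written code; the sketch is only meant to pin down what the two removals change.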
128
+
129
+ ---
130
+
131
+ ## Sources
132
+
133
+ - **[PRIMARY, REPORT-VERIFIED]** Cursor Research Team, *Composer 2 Technical Report*, arXiv:**2603.24477** (v1 2026-03-25, v2 2026-03-26; cs.SE/cs.LG; corr. Alexander M. Rush). Full text via PDF `https://cursor.com/resources/Composer2.pdf` (Tavily advanced extract, full body+refs+App. A–C) and cross-checked via Exa full crawl (identical). HTML/TeX also available at `https://arxiv.org/abs/2603.24477`, `https://arxiv.org/pdf/2603.24477`.
134
+ - **[SECONDARY]** Cursor blog, *A technical report on Composer 2* (Sasha Rush) — `https://cursor.com/blog/composer-2-technical-report` (abstract-level; confirms Kimi K2.5 base + CPT-loss→RL claim).
135
+ - **[CONTEXT]** Key cited methods: Dr. GRPO (Liu et al., arXiv 2503.20783 [34]); DAPO (Yu et al. [78], 2503.14476/NeurIPS'25); GSPO (Zheng et al., 2507.18071 [82]); DeepSeekMath/GRPO [53]; PipelineRL (2509.19128 [48]); MoE router alignment (Ma et al., 2510.11370 [38]); KL-estimator variance (Amini et al. [6]); Schulman KL note [52]; DeepEP [80]; ThunderKittens/ParallelKittens [56,59].
136
+ - **Prior internal note:** `research/09-composer-blog-delta-2026.md` (read first; this note discharges its action item #1 and supplies corrections to the RL-algorithm/optimizer/sharding rows of `docs/COMPOSER_RECIPE_MAPPING.md`).