research: Composer 2.5 data-gen + targeted-textual-feedback deep-research wave

Phase-3 of the deep-work-loop bringing Composer 2.5's dataset-generation and
targeted-RL-with-textual-feedback methods into the framework. 5 new research
docs (130KB), all with primary-source citations:

- 06-feature-deletion-datagen.md: Feature Deletion env design (online pass-rate
difficulty gate, reward-hacking safeguards, 5 OSS substrates w/ HF ids +
licenses, deletion mechanics, FeatureDeletionEnv + TRL reward_fn adapter).
- 07-sdpo-hint-generator.md: layered HintGenerator (template -> raw-error ->
LLM-judge -> introspection -> learned -> SDPO sibling-bootstrap), actual
template strings, judge prompt, slots into existing CollatorConfig hook.
- 08-sdpo-grpo-integration.md: ComposerGRPOTrainer(GRPOTrainer) design adding
SDPO KL at error turns on top of Dr. GRPO; PRIME-RL recipe blocked (log-probs
only), TRL subclass is the host; CPU smoke plan.
- 09-composer-blog-delta-2026.md: blog re-extraction delta; found the Composer 2
arXiv tech report; SDPO successful-rollout-as-implicit-feedback lever.
- 10-composer2-techreport-mining.md: arXiv:2603.24477 mined. RESOLVED RL algo =
Dr. GRPO (length-std removed, no std-norm, k1 KL, Adam, single-epoch, MoE
router-replay). Hint-gen confirmed ABSENT from every Cursor artifact ->
SDPO/OPSD reconstruction is the only path. Corrections: optimizer Adam not
Muon; sharding FSDP+CP+decoupled-EP not HSDP. Hint-free behavior-shaping
alternative (aux scalar rewards + nonlinear length/effort penalty eq).

Files changed (5)

research/06-feature-deletion-datagen.md +346 -0
research/07-sdpo-hint-generator.md +366 -0
research/08-sdpo-grpo-integration.md +499 -0
research/09-composer-blog-delta-2026.md +76 -0
research/10-composer2-techreport-mining.md +136 -0

research/06-feature-deletion-datagen.md ADDED

	@@ -0,0 +1,346 @@

+# Feature-Deletion Data Generation → `FeatureDeletionEnv` Design Brief
+> **Author date:** 2026-05-28.
+> **Scope:** Turn Composer 2.5's *Feature Deletion* synthetic-task approach (component **#2 "Synthetic data at 25× scale"**, mapping row **(b)**, reward-hacking row **(g)**) into a real, usable data-generation subsystem for this framework. This is the design brief that mapping-table row (b) calls for ("Build 1 generator (Feature Deletion) as OpenEnv-compatible env").
+> **Method:** Live blog re-extraction (`mcp_tavily_tavily_extract` advanced) of [cursor.com/blog/composer-2-5](https://cursor.com/blog/composer-2-5); substrate-dataset cards pulled live from HF/arXiv; TRL `GRPOTrainer` reward-fn convention confirmed against the [TRL source](https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_trainer.py).
+> **Tag convention** (matches `docs/COMPOSER_RECIPE_MAPPING.md`): **`[BLOG-VERIFIED]`** = verbatim in the 2.5 blog; **`[INFERRED]`** = reasonable extrapolation from blog + open-source prior art; **`[EXTRAPOLATED]`** = our design addition, not Cursor-stated.
+> **Reads-before:** `docs/COMPOSER_RECIPE_MAPPING.md` (§2, rows b/g) and `research/09-composer-blog-delta-2026.md` (online-curriculum delta). This file does **not** re-derive the Targeted-RL / SDPO material (that is row (d) and `research/05`); it is the data-gen side only.
+---
+## 0. TL;DR
+Feature Deletion is a **self-verifying inverse task**: take a repo whose test suite passes, *programmatically remove* a testable feature (so the suite now fails), and reward the agent for reimplementing it until the suite passes again. The reward is the pre-existing test suite — **verifiable, no human labels, no golden patch needed at reward time**. We can stand this up immediately on five open substrates (SWE-Gym, SWE-bench-Lite, R2E-Gym, SWE-rebench, OpenHands/Nemotron trajectories) by *inverting* their `(repo, base_commit, gold_patch, test_patch)` tuples instead of generating deletions from scratch. The two non-obvious requirements the blog forces on us: (1) an **online pass-rate difficulty gate** (the curriculum is dynamic, not a static bank — per the delta note), and (2) **anti-reward-hacking sandboxing** because Cursor observed the model recovering deleted signatures from bytecode/type-check caches. Below: a `FeatureDeletionEnv` Gym/OpenEnv class sketch wired for TRL `GRPOTrainer` (reward = test pass-fraction), the deletion mechanics (AST/file/coverage-mapped), the sandbox lockdown spec, and a CPU-pool cost model.
+---
+## 1. What "Feature Deletion" is, exactly `[BLOG-VERIFIED]`
+Verbatim from the Synthetic-data section of the blog (re-pulled 2026-05-28):
+> *"During RL training, Composer's coding ability improves substantially to the point where it begins to get most training problems correct. To continue increasing intelligence, **we both select for and create harder tasks dynamically throughout the run**. Composer 2.5 is trained with **25x more synthetic tasks** than Composer 2.*
+>
+> *We use a range of approaches for creating synthetic tasks that are grounded in real codebases. For example, one synthetic approach is **feature deletion**. For these tasks the agent is given a codebase with a large set of tests, and asked to **delete code and files in such a way that the codebase remains functional while specific testable features are removed**. The synthetic task is to **reimplement the feature, and the tests are used as a verifiable reward**."*
+**Parse of the mechanism (note the two-agent / two-phase structure the blog implies):**
+| Phase | Actor | Action | Output |
+|---|---|---|---|
+| **Deletion (task-construction)** | a *deleter* (model or program) | "delete code and files in such a way that the codebase remains functional while specific testable features are removed" | a `broken_repo` + the set of tests that now fail |
+| **Reimplementation (the training task)** | the *policy under training* | reimplement the deleted feature | a diff scored by the test suite |
+- The deletion step is itself non-trivial: it must keep the codebase *otherwise functional* (imports resolve, unrelated tests still pass) while making **specific testable features** fail. `[BLOG-VERIFIED]` that this constraint exists; `[INFERRED]` that in practice this means *partition the test suite into a kept set (`PASS_TO_PASS`) that must still pass and a target set (`FAIL_TO_PASS`) that the deletion must break.*
+- **Verifiable reward = the original test suite.** No golden patch is needed at reward time (only at task-construction time, to know what "done" looks like). This is the key property that makes the env cheap to run in an RL loop.
+- The blog does **not** state: how deletion targets are *selected*, the deleter model, the languages beyond the Python/Java implied by the reward-hacking examples, or the difficulty heuristic. Those are the reproducibility gaps (consistent with `research/09` §1 "NO CHANGE" line).
+**Relationship to the inverse-of-SWE-bench framing** `[INFERRED]`: A SWE-bench-style instance is `(repo@base_commit, problem_statement, gold_patch, test_patch, FAIL_TO_PASS, PASS_TO_PASS)`. Feature Deletion is the *constructive inverse*: instead of mining a human PR that fixed a bug, we **apply `revert(gold_patch)` (or an AST-deletion) to a passing repo** to manufacture the broken state, then ask the agent to re-derive `gold_patch`. This means **every existing SWE-* instance is already a ready-made Feature-Deletion task** — we get the deletion "for free" by reverting the gold patch. This is the single most important leverage point in this brief (see §4, §5).
+---
+## 2. The online difficulty curriculum `[BLOG-VERIFIED]` framing → `[EXTRAPOLATED]` design
+The delta note (`research/09` §1, DELTA-new-emphasis) is explicit that *"select for and create harder tasks dynamically throughout the run"* is a **dynamic curriculum / online task-selection** signal, not a static bank. Our generator must therefore expose a **pass-rate gate**, not just emit tasks. Design:
+**Difficulty signal.** For each candidate task `t`, maintain an EMA of the policy's group pass-rate `p̂(t)` (TRL GRPO already samples `G` completions per prompt — we get `G` pass/fail observations per task per step *for free*). Define difficulty `d(t) = 1 − p̂(t)`.
+**Two levers the blog names — "select for" and "create":**
+1. **`select for` (online filter / replay weighting)** `[EXTRAPOLATED]`. Sampling weight over the task pool:
+   - **Drop solved tasks:** if `p̂(t) > τ_easy` (e.g. 0.9) for `k` consecutive evaluations, retire `t`. This is exactly the blog's "begins to get most training problems correct" symptom.
+   - **Drop impossible tasks:** if `p̂(t) < τ_hard` (e.g. 0.02) after `k` exposures, quarantine `t` (likely a broken-task or reward-hack-only task — see §3).
+   - **Up-weight the frontier:** sample weight `w(t) ∝ p̂(t)·(1−p̂(t))` (max variance ≈ max learning signal; standard curriculum-RL choice, cf. PLR / TD-error curricula). Keeps the policy on tasks it solves ~50% of the time.
+2. **`create` (difficulty escalation)** `[EXTRAPOLATED]`. When the pool's median `p̂` rises above a band, the *generator* produces harder tasks. Concrete escalation knobs, easiest→hardest:
+   - **Deletion span:** single-function → whole-class → whole-file → cross-file feature (more `FAIL_TO_PASS` tests, more LoC to reconstruct).
+   - **Hint starvation:** strip docstrings / type hints / the deleted function's signature from the surrounding context (also a reward-hack-surface reduction, §3).
+   - **Coupling:** delete a feature that several `PASS_TO_PASS` tests *also* exercise, so the agent must reconstruct it without breaking neighbors.
+   - **Multi-feature:** delete `n>1` independent features in one repo (reward = fraction of target tests passing — naturally graded).
+**Implementation handle (curriculum is a thin layer over the task pool):**
+```python
+# datagen/curriculum.py  [EXTRAPOLATED]
+import math, random, collections
+class PassRateCurriculum:
+    """Online difficulty gate. Fed (task_id, n_pass, n_total) after each GRPO group."""
+    def __init__(self, tau_easy=0.90, tau_hard=0.02, ema=0.3, retire_k=3):
+        self.p   = collections.defaultdict(lambda: 0.5)   # EMA pass-rate
+        self.seen = collections.Counter()
+        self.retired, self.quarantined = set(), set()
+        self.tau_easy, self.tau_hard, self.ema, self.retire_k = tau_easy, tau_hard, ema, retire_k
+    def update(self, task_id, n_pass, n_total):
+        r = n_pass / max(n_total, 1)
+        self.p[task_id] = (1 - self.ema) * self.p[task_id] + self.ema * r
+        self.seen[task_id] += 1
+        if self.seen[task_id] >= self.retire_k:
+            if self.p[task_id] > self.tau_easy: self.retired.add(task_id)
+            elif self.p[task_id] < self.tau_hard: self.quarantined.add(task_id)  # likely broken / hack-only
+    def weight(self, task_id):
+        if task_id in self.retired or task_id in self.quarantined: return 0.0
+        p = self.p[task_id]
+        return p * (1 - p) + 1e-3          # frontier (max-variance) weighting
+    def sample(self, task_ids, k):
+        live = [t for t in task_ids if self.weight(t) > 0]
+        w = [self.weight(t) for t in live]
+        return random.choices(live, weights=w, k=k) if live else []
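+    def median_pass(self, task_ids):       # escalation trigger for the generator
+        # Median over the live pool only: retired and quarantined tasks would skew the trigger.
+        vals = sorted(self.p[t] for t in task_ids
+                      if t not in self.retired and t not in self.quarantined)
+        return vals[len(vals)//2] if vals else 0.0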
+```
+The trainer feeds `update(...)` from each GRPO group; the generator polls `median_pass(...)` and, when it crosses a band, emits a harder batch (more deletion span / more starvation). This is the minimal realization of "select for + create harder tasks dynamically."
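The generator side of that loop can be sketched as a small escalation controller. This is an illustrative assumption, not something the blog specifies: the names `EscalationState` and `escalate`, the 0.7 band, and the order in which the knobs are turned (hint starvation first, then a wider deletion span) are all hypothetical.

```python
# Hypothetical escalation controller; names, band, and knob order are assumptions.
from dataclasses import dataclass

GRANULARITY_LADDER = ("function", "class", "file", "feature")  # easiest -> hardest

@dataclass
class EscalationState:
    level: int = 0               # index into GRANULARITY_LADDER
    strip_hints: bool = False    # the "hint starvation" knob

def escalate(state, median_pass, upper=0.7):
    """Return the next generator setting once the pool's median pass-rate exceeds `upper`."""
    if median_pass <= upper:
        return state                                   # pool is still hard enough
    if not state.strip_hints:
        return EscalationState(state.level, True)      # first starve hints at the same span
    if state.level == len(GRANULARITY_LADDER) - 1:
        return state                                   # already at the hardest setting
    return EscalationState(state.level + 1, False)     # then widen the deletion span

state = EscalationState()
for median in (0.4, 0.8, 0.85, 0.9):
    state = escalate(state, median)
    print(median, GRANULARITY_LADDER[state.level], state.strip_hints)
```

In this sketch the value passed as `median_pass` would come from `PassRateCurriculum.median_pass(...)`; one escalation step is taken per poll, so difficulty rises gradually rather than jumping to the hardest setting.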
+---
+## 3. Reward-hacking failure modes & programmatic safeguards `[BLOG-VERIFIED]` problem, `[EXTRAPOLATED]` mitigations
+The blog (re-pulled verbatim) is the ground truth on the *problem*:
+> *"One downstream consequence of large scale synthetic task creation is that it can cause unexpected reward hacking… In one example, the model found a **leftover Python type-checking cache and reverse-engineered the format to find a deleted function signature**. In another, it was able to **find and decompile Java bytecode to reconstruct a third-party API**. We were able to find and diagnose these problems using **agentic monitoring tools**, but they demonstrate the increasing care necessary for large scale RL."*
+The blog gives **no mitigation specifics** beyond "agentic monitoring tools" (confirmed unchanged in `research/09` §4). So the following are our design `[EXTRAPOLATED]`, consistent with mapping row (g) ("Static analysis + bytecode-cache-deletion + a sandboxed shell with no `find`/`strings`/`unzip`").
+**Root cause:** Feature Deletion deletes *source*, but compilers/type-checkers/build tools leave **shadow copies of the deleted information** elsewhere in the working tree. The agent recovers the answer instead of reconstructing it. Two defense layers:
+### 3a. Pre-task scrubbing (eliminate the leak at construction time)
+Run after deletion, before the repo is handed to the agent:
+| Leak source | Scrub action |
+|---|---|
+| Python bytecode | delete all `**/__pycache__/`, `*.pyc`, `*.pyo` |
+| Type-check caches | delete `.mypy_cache/`, `.pyre/`, `.pytype/`, `.dmypy.json`, `.ruff_cache/`, `.pytest_cache/` |
+| Compiled Java/JVM | delete `*.class`, `target/`, `build/`, `*.jar`/`*.war` containing the deleted API; strip bundled deps |
+| Build/dist artifacts | delete `dist/`, `*.egg-info/`, `*.so`, `build/`, `.tox/` |
+| VCS history | run the agent on a **squashed, detached worktree** — no `.git` (else `git log -p` / `git show` recovers the deletion) |
+| Editor/LSP indexes | delete `.idea/`, `.vscode/`, `*.code-workspace`, ctags/`tags`, `.cache/` |
+| Docs/stubs | delete generated `*.pyi` stubs and built HTML/Sphinx docs that embed signatures |
+```python
+# datagen/scrub.py  [EXTRAPOLATED]
+import shutil, pathlib
+LEAK_DIRS  = {"__pycache__",".mypy_cache",".pyre",".pytype",".ruff_cache",
+              ".pytest_cache","target","build","dist",".tox",".idea",".vscode",".git",".cache"}
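+LEAK_GLOBS = ["*.pyc","*.pyo","*.class","*.jar","*.war","*.so","*.pyi",
+              ".dmypy.json","tags","*.egg-info","*.code-workspace"]
+def scrub(root: str):
+    """Delete known leak directories and files under `root` (run after deletion, before the agent sees the tree)."""
+    root = pathlib.Path(root)
+    # Materialize each listing first so deletions never mutate a directory mid-iteration.
+    for p in list(root.rglob("*")):
+        if p.is_dir() and p.name in LEAK_DIRS:
+            shutil.rmtree(p, ignore_errors=True)
+    for g in LEAK_GLOBS:
+        for p in list(root.rglob(g)):
+            if p.is_dir():
+                shutil.rmtree(p, ignore_errors=True)   # e.g. *.egg-info directories
+            else:
+                p.unlink(missing_ok=True)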
+```
+### 3b. Runtime sandbox lockdown (block recovery if a leak survives)
+- **Tool denylist in the agent's shell harness** (matches mapping row g): no `find`, `strings`, `unzip`, `jar`, `javap`, `objdump`, `grep -r` over non-source dirs, `uncompyle6`/`decompyle3`/`pycdc`, `cfr`/`procyon`/`fernflower` (Java decompilers), `git`. Implement as an allowlisted command shim, not a blocklist (allowlist is the safe default).
+- **Network egress = none** (can't `pip download` the original package to read the API). Already required for determinism.
+- **Read-only mounts for everything except the editable source tree**; site-packages of the *target package itself* removed from the image.
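A minimal sketch of the allowlisted command shim, under stated assumptions: the `ALLOWED` set and the function name `is_allowed` are hypothetical, and the check is purely lexical. It only decides whether a command string may be forwarded to the sandbox; it does not execute anything.

```python
# Hypothetical allowlist shim; the allowed set below is an illustrative assumption.
import shlex

ALLOWED = {"ls", "cat", "sed", "python", "pytest"}
SHELL_META = set(";|&`$<>(){}\n")   # chaining, pipes, substitution, redirection

def is_allowed(command):
    """Accept only a single simple command whose binary name is on the allowlist."""
    if any(ch in SHELL_META for ch in command):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:               # e.g. unbalanced quotes
        return False
    if not argv:
        return False
    return argv[0] in ALLOWED        # bare names only; "/tmp/x/python" is rejected

for cmd in ("pytest -q tests/test_api.py", "find . -name '*.pyc'", "cat a.py | strings"):
    print(cmd, "->", is_allowed(cmd))
```

A lexical shim like this is only a first line of defense: an allowlisted interpreter such as `python` can still open arbitrary files, so the read-only mounts and the scrubbing in §3a remain necessary.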
+### 3c. Post-hoc monitoring ("agentic monitoring tools" analogue) `[EXTRAPOLATED]`
+A cheap programmatic monitor over the trajectory, run *after* a rollout passes, to retro-reject hacks:
+- **AST diff check:** the agent's accepted diff must contain *new function/class bodies* (AST nodes with statements), not just an import that re-exposes a surviving symbol. Reject solutions whose passing is explained purely by `import`/`from … import *` of a non-scrubbed module.
+- **Provenance scan:** flag any trajectory whose tool calls touched `*.class`, `*.pyc`, `.mypy_cache`, `.git`, or invoked a denied binary (defense-in-depth telemetry even with the shim).
+- **Static byte-similarity gate:** if the agent's reconstructed function is a near-exact byte copy of the (held-out) gold body *and* the agent never "wrote" it incrementally (single paste), flag for review — distinguishes reconstruction from recovery.
+- These produce a **reward mask**: `reward = test_pass_fraction × (0 if hack_detected else 1)`. This is the concrete realization of mapping row (g)'s "+ RM-based penalty" without needing a learned RM in v0.1.
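The AST diff check and the reward mask can be sketched with the standard library alone. This is an assumption-laden illustration rather than a description of Cursor's monitor: the helper names are hypothetical, matching is by bare function name (no class qualification), and a body consisting only of a docstring, `pass`, `...`, or a `raise` is treated as a stub.

```python
import ast

def _is_stub(fn):
    """True if a function body is only a docstring, `pass`, `...`, or a raise."""
    for stmt in fn.body:
        if isinstance(stmt, (ast.Pass, ast.Raise)):
            continue
        if isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant):
            continue                      # docstring or Ellipsis
        return False
    return True

def implemented_functions(source):
    """Names of functions whose bodies contain real statements."""
    tree = ast.parse(source)
    return {n.name for n in ast.walk(tree)
            if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef)) and not _is_stub(n)}

def adds_real_bodies(before, after, deleted_symbols):
    """Accept only if every deleted symbol gained a non-stub body in `after`."""
    gained = implemented_functions(after) - implemented_functions(before)
    return set(deleted_symbols) <= gained

def masked_reward(test_pass_fraction, hack_detected):
    """reward = test_pass_fraction x (0 if hack_detected else 1)."""
    return test_pass_fraction * (0.0 if hack_detected else 1.0)

before = "def area(r):\n    raise NotImplementedError\n"
honest = "def area(r):\n    return 3.14159 * r * r\n"
reexport = "from cached_mod import area\n"   # passes tests without writing a body
print(adds_real_bodies(before, honest, {"area"}), adds_real_bodies(before, reexport, {"area"}))
```

The re-export case is the one the AST diff check is meant to catch: the tests may go green, but no new function body was written, so the mask zeroes the reward.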
+---
+## 4. Open-source drop-in substrates
+Every substrate below ships SWE-bench-shaped tuples `(repo, base_commit, patch=gold, test_patch, FAIL_TO_PASS, PASS_TO_PASS)`. **The Feature-Deletion mapping is identical for all of them: revert `patch` (or AST-delete the functions it touches) to manufacture `broken_repo`; `FAIL_TO_PASS` is the reward target; `PASS_TO_PASS` is the "stay-functional" guard the blog demands.** Licenses verified live 2026-05-28.
+| Substrate | HF dataset id | Scale | What it provides | License | FD-env mapping |
+|---|---|---|---|---|---|
+| **SWE-bench / Lite / Verified** | `SWE-bench/SWE-bench`, `SWE-bench/SWE-bench_Lite`, `SWE-bench/SWE-bench_Verified` | 2,294 / 300 / 500 (test splits) | Real GitHub issue→PR tuples, per-version test envs, pre-built Docker images (`xingyaoww/sweb.eval.*`, `swebench/*`) | dataset CC-BY-4.0; **per-repo source licenses vary** (mostly Apache/MIT/BSD) | Lite/Verified are the **v0.0 smoke-test set** (mapping row b: "use SWE-bench-lite only" in v0.0). Revert gold patch → FD task. Already-built images = no env-construction cost. |
+| **SWE-Gym / SWE-Gym-Raw** | `SWE-Gym/SWE-Gym`, `SWE-Gym/SWE-Gym-Raw` | 2,438 (11 repos) / ~tens-of-k raw | Same schema as SWE-bench but **purpose-built for training** (train split, not a held-out benchmark → no contamination worry); pre-built Docker images; verifier-training support. arXiv:2412.21139 (ICML 2025). | check repo (SWE-Gym tooling MIT; **instances inherit upstream repo licenses**) | **Primary v0.1 FD substrate** (mapping row b: "build Feature Deletion"). 2.4k clean training tasks, each invertible into an FD task with `n` difficulty escalations (§2). |
+| **R2E-Gym (V1 / Subset)** | `R2E-Gym/R2E-Gym-V1`, `R2E-Gym/R2E-Gym-Subset`, `R2E-Gym/SWE-Bench-Lite` | **8.1K** executable envs (13 repos); Subset = non-overlapping w/ SWE-bench | **SWE-GEN engine**: procedurally generates executable envs *directly from commits* w/o human issues, + execution-assisted back-translation for problem statements + **pre-built Docker images**. arXiv (R2E-Gym, Jain et al. 2025). | check repo (Apache-2.0 tooling typical; per-instance upstream licenses) | Best **scale** substrate and the closest open analogue to Composer's "grounded in real codebases" generator. Its commit-diffs *are* feature-deletion candidates by construction (the commit added a feature; revert = delete it). |
+| **SWE-rebench** | `nebius/SWE-rebench` (+ `nebius/SWE-bench-extra`, `nebius/SWE-agent-trajectories`) | **21,336** tasks, 3,468 repos | Fully-automated mining pipeline; ships `install_config`, `requirements`, `environment`, `docker_image`, and **LLM-scored difficulty + clarity annotations** per task; `FAIL_TO_PASS`/`PASS_TO_PASS`/`FAIL_TO_FAIL`/`PASS_TO_FAIL`. arXiv:2505.20411 (NeurIPS 2025). | **dataset CC-BY-4.0**; per-instance `license_name` field provided (56 distinct) — *honor it per instance* | **Largest + the only one with built-in difficulty scores** → seeds the §2 curriculum's cold-start `p̂(t)` prior before any rollouts exist. The per-instance `docker_image` + `install_config` removes the 200-hr env-build bottleneck SWE-Gym reported. |
+| **OpenHands trajectories** (via Nemotron-SWE-v1) | `nvidia/Nemotron-SWE-v1` | 59K agent trajectories | OpenHands-framework SWE trajectories (Qwen3-Coder-480B teacher), issues sourced from SWE-Gym + R2E-Gym-Subset | **CC-BY-4.0** (subsets BSD-3 / Apache-2.0 / MIT per viewer) — "ready for commercial use" | Not an FD-env itself — it's **SFT/distill warm-start + a source of gold trajectories** for the §3c monitor's "what legitimate reconstruction looks like" reference, and for `research/05` trace-replay. Use as cold-start, not as the RL env. |
+**Practical selection:** v0.0 = SWE-bench-Lite (smoke). v0.1 = SWE-Gym (clean train) + SWE-rebench (scale + difficulty prior). R2E-Gym when we need to push past ~25k tasks toward the "25×" spirit. Nemotron/OpenHands trajectories = SFT warm-start + monitor reference, not the RL env. **License rule baked into the loader:** carry each instance's upstream repo license; filter out copyleft (GPL/AGPL) repos for any artifact we redistribute (we redistribute *deletions/diffs*, which are derivative works).
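The license rule could look roughly like the filter below. It is a sketch under assumptions: the marker list, the function names, and the choice to treat a missing license as non-redistributable are ours, and the substring match is stricter than the GPL/AGPL rule in the text because it also excludes LGPL. The `license_name` key mirrors the per-instance field that SWE-rebench is described as providing.

```python
# Hypothetical loader-side license gate; markers and policy are assumptions.
COPYLEFT_MARKERS = ("gpl", "general public license")   # also matches AGPL and LGPL

def is_redistributable(license_name):
    """Conservative: unknown or copyleft licenses are kept out of redistributed artifacts."""
    if not license_name:
        return False
    name = license_name.lower()
    return not any(marker in name for marker in COPYLEFT_MARKERS)

def split_by_license(instances):
    """Partition instances into (redistributable, internal_only) lists."""
    ok, internal = [], []
    for inst in instances:
        (ok if is_redistributable(inst.get("license_name")) else internal).append(inst)
    return ok, internal

rows = [
    {"instance_id": "a", "license_name": "MIT License"},
    {"instance_id": "b", "license_name": "GNU General Public License v3.0"},
    {"instance_id": "c", "license_name": None},
]
ok, internal = split_by_license(rows)
print([r["instance_id"] for r in ok], [r["instance_id"] for r in internal])
```

Instances in the internal-only list can still be used for training; under this rule the gate only governs what is redistributed.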
+---
+## 5. Deletion mechanics: producing the `(broken_repo, test_command, golden_diff)` tuple
+Two construction paths; we implement both and let the curriculum pick granularity (§2).
+### Path A — Gold-patch reversion (cheap, the default for SWE-* substrates) `[INFERRED]`
+The substrate already tells us *exactly* which lines implement a testable feature: the gold `patch`. So:
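+1. `git apply` both `patch` and `test_patch` onto `base_commit` → **functional repo, all tests pass** (this is the substrate's "solved" state; the `FAIL_TO_PASS` tests usually live in `test_patch`, so it must be applied too).
+2. `golden_diff := patch` (what the agent must re-derive); `broken_repo := apply(reverse(patch))` → the feature is gone. For SWE-bench-shaped substrates this is simply `base_commit` plus `test_patch`.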
+3. `FAIL_TO_PASS` tests now fail (target); `PASS_TO_PASS` tests still pass (the "remains functional" guard — verify this, see §5c).
+4. **Scrub** (§3a), strip `.git`, freeze image.
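As a data-level illustration of Path A, the sketch below maps one SWE-bench-shaped row to a task description. It performs no git operations; `invert_instance` and the output keys are hypothetical, and the `instance_id` input field is assumed from the SWE-bench schema. The test lists are accepted either as Python lists or as JSON-encoded strings, since substrates differ in how they store them.

```python
import json

def _as_list(value):
    """Accept a list or a JSON-encoded string of test ids."""
    return json.loads(value) if isinstance(value, str) else list(value)

def invert_instance(inst):
    """Map a SWE-bench-shaped row to a Path A feature-deletion task description."""
    fail_to_pass = _as_list(inst["FAIL_TO_PASS"])
    if not fail_to_pass:
        raise ValueError("no target tests: nothing to use as reward")
    return {
        "task_id": "fd-" + inst["instance_id"],
        "repo": inst["repo"],
        "base_commit": inst["base_commit"],
        "test_patch": inst["test_patch"],        # applied in both solved and broken states
        "golden_diff": inst["patch"],            # held out, never shown to the agent
        "fail_to_pass": tuple(fail_to_pass),     # reward target
        "pass_to_pass": tuple(_as_list(inst["PASS_TO_PASS"])),  # functional guard
    }

example = {
    "instance_id": "demo__demo-1",
    "repo": "demo/demo",
    "base_commit": "abc123",
    "patch": "diff --git a/demo.py b/demo.py\n",
    "test_patch": "diff --git a/test_demo.py b/test_demo.py\n",
    "FAIL_TO_PASS": '["test_demo.py::test_feature"]',
    "PASS_TO_PASS": ["test_demo.py::test_other"],
}
task = invert_instance(example)
print(task["task_id"], task["fail_to_pass"], task["pass_to_pass"])
```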
+### Path B — Coverage-mapped AST deletion (true synthetic generation, no human PR needed) `[EXTRAPOLATED]`
+This is the path that generalizes beyond mined PRs and lets us "create harder tasks" at will (R2E-Gym-style):
+1. **Run the suite with coverage** (`coverage.py` / `pytest --cov`) on the passing repo to get a `test → {file:line-ranges}` map.
+2. **Pick a deletion target** by granularity knob:
+   - *function-level:* parse with `ast`/`libcst`, choose a `FunctionDef`/`AsyncFunctionDef`/`ClassDef` whose lines are covered by ≥1 test and that has high *test selectivity* (covered by few `PASS_TO_PASS` so the repo stays functional after removal).
+   - *file-level:* a module imported by exactly the target tests.
+   - *feature-level:* the transitive closure of a public symbol via an import/def graph (`grimp`/`pydeps`), bounded so unrelated tests survive.
+3. **Delete** via CST (replace body with `raise NotImplementedError` *or* remove the node and its now-dead imports). CST (`libcst`) preserves formatting and lets us re-insert a stub signature or not (the §2 "hint starvation" knob).
+4. **`golden_diff` = the removed nodes** (held out for the monitor; never shown to the agent).
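Step 3 can be approximated with the standard-library `ast` module, shown here as a stand-in for `libcst`. Unlike a CST rewrite, `ast.unparse` (Python 3.9+) discards comments and original formatting, so this is a sketch of the mechanics only; `stub_function` is a hypothetical helper name and it handles synchronous `def` nodes only.

```python
import ast

class _StubFunction(ast.NodeTransformer):
    """Replace the body of one named function with `raise NotImplementedError`."""
    def __init__(self, target):
        self.target = target
        self.removed = None                      # source text of the removed definition

    def visit_FunctionDef(self, node):
        if node.name != self.target:
            return self.generic_visit(node)      # keep looking inside nested defs
        self.removed = ast.unparse(node)         # held out as golden material
        node.body = [ast.Raise(exc=ast.Name(id="NotImplementedError", ctx=ast.Load()),
                               cause=None)]
        return node

def stub_function(source, name):
    """Return (broken_source, removed_source); raises KeyError if `name` is absent."""
    tree = ast.parse(source)
    transformer = _StubFunction(name)
    broken = ast.fix_missing_locations(transformer.visit(tree))
    if transformer.removed is None:
        raise KeyError(name)
    return ast.unparse(broken), transformer.removed

SRC = "def area(r):\n    return 3 * r * r\n\ndef perimeter(r):\n    return 2 * r\n"
broken_src, removed_src = stub_function(SRC, "area")
print(broken_src)
```

Keeping the signature and stubbing the body corresponds to the easier end of the hint-starvation knob; removing the node entirely would be the harder variant.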
+### 5c. Guaranteeing the target tests exercise the deleted code while the rest stays functional (the blog's hard constraint)
+The blog requires *"the codebase remains functional while specific testable features are removed."* Enforce as a **construction-time validation gate** — a task is only emitted if all four hold:
+```python
+# datagen/build_task.py  [EXTRAPOLATED]  (pseudocode over a sandboxed runner)
+def validate_task(repo_passing, broken_repo, golden_diff, target_tests, keep_tests, run, apply):
+    """Return True only if all four gates hold; the caller discards the task otherwise.
+    `run(repo, tests)` and `apply(repo, diff)` are sandbox-runner callables supplied by the caller."""
+    tests = target_tests + keep_tests
+    # 1. baseline sanity: full suite passes on the unbroken repo
+    if not run(repo_passing, tests).all_pass:
+        return False
+    res = run(broken_repo, tests)
+    # 2. deletion actually breaks the target feature (FAIL_TO_PASS non-empty and failing)
+    if not target_tests or not all(res.failed(t) for t in target_tests):
+        return False
+    # 3. deletion left the rest functional (collection works, neighbors pass): PASS_TO_PASS guard
+    if not (res.collected_ok and all(res.passed(t) for t in keep_tests)):
+        return False
+    # 4. solvability: gold diff restores green (the task is actually achievable)
+    return run(apply(broken_repo, golden_diff), tests).all_pass
+```
+Gate (4) is what prevents the §2 "impossible task" quarantine pile-up — every emitted task is provably solvable by `golden_diff`. Gate (3) is the literal encoding of "remains functional." **Task tuple emitted:**
+```python
+# datagen/schema.py  [EXTRAPOLATED]
+from dataclasses import dataclass, field
+@dataclass(frozen=True)
+class FeatureDeletionTask:
+    task_id: str
+    repo: str                       # e.g. "getmoto/moto"
+    base_commit: str
+    broken_image: str               # docker tag of the scrubbed broken repo (frozen env)
+    test_command: str               # e.g. "python -m pytest -q"
+    fail_to_pass: tuple[str, ...]   # reward target (must go red→green)
+    pass_to_pass: tuple[str, ...]   # functional guard (must stay green)
+    golden_diff: str = field(repr=False)   # HELD OUT — monitor/solvability only, never in obs
+    granularity: str = "function"   # function|file|feature  (curriculum escalation)
+    deleted_symbols: tuple[str, ...] = ()  # for AST-provenance monitor (§3c)
+    upstream_license: str = "unknown"      # carried from substrate; gates redistribution
+    difficulty_prior: float = 0.5          # seeded from SWE-rebench LLM score if available
+```
+---
+## 6. `FeatureDeletionEnv` — OpenEnv/Gym-style design for TRL `GRPOTrainer` + verifiers
+**Integration contract.** TRL's `GRPOTrainer` takes a dataset of prompts and one or more **reward functions** with the calling convention `reward_fn(prompts: list[str], completions: list[str], **kwargs) -> list[float]` (the dataset's non-prompt columns are passed through as `**kwargs`; confirmed against the [TRL `grpo_trainer.py`](https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_trainer.py) source and the `RewardFunc` type-alias fix in TRL PR #5246). So the env exposes **two faces**: a Gym/OpenEnv face (`reset`/`step` for multi-turn agentic rollout via OpenEnv, mapping row c) and a **`reward_fn` adapter** that GRPO calls directly. Reward = **test pass fraction** (`|FAIL_TO_PASS passing| / |FAIL_TO_PASS|`), naturally graded for multi-feature tasks, masked by the hack monitor (§3c).
+```python
+# envs/feature_deletion_env.py  [EXTRAPOLATED]
+# Gym/OpenEnv-style env + a TRL GRPO reward adapter. Execution happens in the
+# locked-down sandbox of §3b; this class is the orchestration shell.
+from dataclasses import dataclass
+from datagen.schema import FeatureDeletionTask
+@dataclass
+class StepResult:
+    observation: str            # tool output / test stdout shown to the agent
+    reward: float               # only nonzero on a terminal "submit" step
+    done: bool
+    info: dict
+class FeatureDeletionEnv:
+    """One task per episode. Sandbox = allowlisted shell, no net, scrubbed tree (§3)."""
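+    def __init__(self, sandbox, monitor, registry=None, curriculum=None, max_turns: int = 40):
+        self.sandbox, self.monitor, self.max_turns = sandbox, monitor, max_turns
+        self.registry = registry or {}      # task_id -> FeatureDeletionTask, read by reward_fn
+        self.curriculum = curriculum        # optional PassRateCurriculum (§2)
+        self.turns = 0
+        self.task: FeatureDeletionTask | None = None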
+    # ---- Gym/OpenEnv face (multi-turn agentic rollout) ----
+    def reset(self, task: FeatureDeletionTask) -> str:
+        self.task, self.turns = task, 0
+        self.sandbox.boot(task.broken_image)           # read-only except editable src; egress off
+        # NOTE: golden_diff / deleted_symbols are NEVER placed in the observation.
+        return self._render_prompt(task)               # task desc + failing-test names + tool list
+    def step(self, action: dict) -> StepResult:
+        self.turns += 1
+        if action["type"] == "submit" or self.turns >= self.max_turns:
+            return self._grade()
+        obs = self.sandbox.exec(action)                # edit / run-tests / read-file (allowlisted)
+        return StepResult(obs, 0.0, False, {"turn": self.turns})
+    def _grade(self) -> StepResult:
+        r = self.sandbox.run_tests(self.task.test_command,
+                                   self.task.fail_to_pass + self.task.pass_to_pass)
+        frac = r.n_pass(self.task.fail_to_pass) / max(len(self.task.fail_to_pass), 1)
+        guard_ok = r.all_pass(self.task.pass_to_pass)          # "remains functional"
+        hacked = self.monitor.flag(self.sandbox.trajectory(),  # AST + provenance (§3c)
+                                   self.task.deleted_symbols)
+        reward = frac * (1.0 if guard_ok and not hacked else 0.0)
+        return StepResult(r.stdout, reward, True,
+                          {"frac": frac, "guard_ok": guard_ok, "hacked": hacked})
+    # ---- TRL GRPOTrainer face (reward_fn(prompts, completions, **kwargs)->list[float]) ----
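+    def reward_fn(self, prompts, completions, *, task_id=None, **kwargs):
+        rewards, tally = [], {}                                # tally: task_id -> [n_pass, n_total]
+        for comp, tid in zip(completions, task_id):            # task_id passed via dataset column
+            task = self.registry[tid]
+            self.reset(task)
+            res = self._run_completion(comp)                   # replay agent turns from `comp`
+            rewards.append(res.reward)
+            counts = tally.setdefault(tid, [0, 0])
+            counts[0] += int(res.reward > 0)
+            counts[1] += 1
+        if self.curriculum is not None:
+            for tid, (n_pass, n_total) in tally.items():       # one update per GRPO group (§2 feedback)
+                self.curriculum.update(tid, n_pass=n_pass, n_total=n_total)
+        return rewards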
+```
+**Wiring to GRPO (the dataset carries `task_id`; curriculum reweights the sampler):**
+```python
+# train/grpo_fd.py  [EXTRAPOLATED]
+from trl import GRPOTrainer, GRPOConfig
+env = FeatureDeletionEnv(sandbox=LockedSandbox(...), monitor=HackMonitor(...))
+ds  = build_prompt_dataset(tasks)                  # columns: prompt, task_id  (+ curriculum weights)
+trainer = GRPOTrainer(
+    model="Qwen/Qwen3-Coder-7B",                   # v0.0 base (mapping row a)
+    args=GRPOConfig(num_generations=8, ...),       # G=8 → 8 pass/fail obs per task per step → §2
+    reward_funcs=[env.reward_fn],                  # reward = masked test pass-fraction
+    train_dataset=ds,
+)
+trainer.train()
+```
+This slots into the same RLVR base as rows (c)/(d); the **SDPO hint-distill channel (row d, `research/05`) is orthogonal** and stacks on top — Feature Deletion supplies the *verifiable scalar reward* that SDPO's KL rides on. The `verifiers` library can wrap `reward_fn` for env composition if we run multiple generators.
+---
+## 7. Cost & feasibility at scale (CPU pools)
+Feature-Deletion is **embarrassingly parallel and CPU-bound** — no GPU in the data-gen path (matches mapping §"Synthetic data: Generators run on CPU pool… Embarrassingly parallel"). Two cost buckets:
+**(A) Task construction (one-time per task).** `[EXTRAPOLATED]` estimates:
+- Path A (gold-patch revert) over a pre-built substrate image: `git apply -R` + scrub + validation. Dominated by the test runs: **~30 s–5 min CPU** per suite run depending on suite size. The validation gate (§5c) needs ~2 distinct suite runs for Path A (broken + gold-restored; the baseline check is the same tree as the gold-restored one, so its result can be reused) → call it **~2–10 min CPU/task**.
+- Path B (AST deletion): + a coverage run (~1.5–3× a normal suite run) + AST/CST manipulation (<1 s). **~5–20 min CPU/task.**
+- **Throughput:** a 64-vCPU pool at ~8 min/task and 8 concurrent runners ≈ **~60 tasks/hr/node** → ~1,400 tasks/day/node. Inverting all 21k SWE-rebench instances ≈ **~15 node-days** on one 64-vCPU box, trivially parallel across nodes. Reaching a "25×-spirit" pool of ~50k–60k tasks (R2E-Gym 8.1k + SWE-rebench 21k + multi-feature/granularity escalations) is **<1 week on a modest CPU pool**.
+- **Storage/images:** reuse substrate Docker images (SWE-Gym `xingyaoww/sweb.eval.*`, SWE-rebench per-instance `docker_image`) → **near-zero env-build cost**, sidestepping the "200 hours of manual env setup" bottleneck SWE-Gym reported. We only add a thin scrubbed overlay layer per task (~MBs).
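The throughput figures above follow from simple arithmetic, reproduced here so the assumptions (8 min/task, 8 concurrent runners per node, 24 h/day utilization) can be varied. The function names are illustrative, and the 60k pool size is the upper end of the range quoted above.

```python
import math

def tasks_per_hour(minutes_per_task, concurrent_runners):
    """Construction throughput of one node."""
    return concurrent_runners * 60.0 / minutes_per_task

def node_days(n_tasks, minutes_per_task, concurrent_runners):
    """Wall-clock days for one node to construct `n_tasks` tasks."""
    return n_tasks / (tasks_per_hour(minutes_per_task, concurrent_runners) * 24.0)

rate = tasks_per_hour(8, 8)                  # 60 tasks/hr/node
per_day = rate * 24                          # 1,440 tasks/day/node
rebench_days = node_days(21_336, 8, 8)       # about 14.8 node-days
pool_days = node_days(60_000, 8, 8)          # about 41.7 node-days
nodes_for_one_week = math.ceil(pool_days / 7)

print(rate, per_day, round(rebench_days, 1), round(pool_days, 1), nodes_for_one_week)
```

Under these assumptions the 60k-task pool fits in one week on about six such nodes, which is the sense in which "a modest CPU pool" is used above.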
+**(B) Reward evaluation (recurring, in the RL loop).** This is the real running cost, not construction: each GRPO step runs `G` rollouts × (agent turns + final test run). Test execution is CPU; agent generation is the GPU/inference cost shared with rows (c)/(d). Levers: cache the broken image warm, run only `FAIL_TO_PASS + PASS_TO_PASS` (not the full suite), and retire solved tasks via §2 so we stop paying for tasks the model already aced ("begins to get most training problems correct").
+**Feasibility verdict:** **Green.** Construction is cheap and one-time; the curriculum keeps the live pool small; the only nontrivial recurring cost (sandboxed test execution) is shared with any RLVR coding env we'd build anyway. The binding constraints are *engineering* (sandbox lockdown §3, validation gate §5c) and *licensing hygiene* (§4), not compute.
+---
+## 8. Open questions / reproducibility gaps (carried from blog silence)
+1. **Deletion-target selection heuristic** — blog silent (`research/09` §1 "NO CHANGE"). We propose coverage-selectivity (§5 Path B); Cursor's actual heuristic is unknown.
+2. **Deleter model vs. program** — blog implies an agent deletes ("asked to delete code… such that the codebase remains functional"); we default to *programmatic* deletion (cheaper, deterministic, no second model). An LLM-deleter is a v0.2 escalation.
+3. **The other ~24 generators** — Feature Deletion is "one synthetic approach… a range of approaches"; the rest are unnamed. Out of scope here; this brief delivers the one named generator.
+4. **"Agentic monitoring tools" internals** — unspecified; our §3c monitor is a best-effort programmatic stand-in.
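+5. **Composer2.pdf (arXiv:2603.24477)** — flagged by `research/09` action-item #1 as the likely home of data-mix % and generator inventory; **not yet extracted for this brief** (the report is mined separately in `research/10-composer2-techreport-mining.md`). Recommend reconciling this brief against that extraction before scaling the generator suite.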
+---
+## Sources
+- **Cursor blog** — *Introducing Composer 2.5*, [cursor.com/blog/composer-2-5](https://cursor.com/blog/composer-2-5) (re-extracted 2026-05-28; §1/§3 quotes verbatim).
+- **Composer 2 technical report** — [arXiv:2603.24477](https://arxiv.org/abs/2603.24477) / [cursor.com/resources/Composer2.pdf](https://cursor.com/resources/Composer2.pdf) (unread; flagged in `research/09`).
+- **SWE-bench** — datasets guide [swebench.com/SWE-bench/guides/datasets](https://www.swebench.com/SWE-bench/guides/datasets); HF `SWE-bench/SWE-bench`, `SWE-bench/SWE-bench_Lite`, `SWE-bench/SWE-bench_Verified`.
+- **SWE-Gym** — *Training Software Engineering Agents and Verifiers with SWE-Gym*, [arXiv:2412.21139](https://arxiv.org/abs/2412.21139) (ICML 2025); HF [`SWE-Gym/SWE-Gym`](https://huggingface.co/datasets/SWE-Gym/SWE-Gym) (2,438 inst, 11 repos), `SWE-Gym/SWE-Gym-Raw`; [github.com/SWE-Gym/SWE-Gym](https://github.com/SWE-Gym/SWE-Gym).
+- **R2E-Gym** — *Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents* (Jain et al. 2025); [r2e-gym.github.io](https://r2e-gym.github.io); HF org [huggingface.co/R2E-Gym](https://huggingface.co/R2E-Gym) (`R2E-Gym-V1`, `R2E-Gym-Subset`, `SWE-Bench-Lite`); 8.1K executable envs, 13 repos.
+- **SWE-rebench** — *An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents*, [arXiv:2505.20411](https://arxiv.org/pdf/2505.20411) (NeurIPS 2025); HF [`nebius/SWE-rebench`](https://huggingface.co/datasets/nebius/SWE-rebench) (21,336 tasks, 3,468 repos, CC-BY-4.0 + per-instance `license_name`), `nebius/SWE-bench-extra`, `nebius/SWE-agent-trajectories`.
+- **OpenHands trajectories** — HF [`nvidia/Nemotron-SWE-v1`](https://huggingface.co/datasets/nvidia/Nemotron-SWE-v1) (59K OpenHands trajectories, CC-BY-4.0; issues from SWE-Gym + R2E-Gym-Subset).
+- **TRL `GRPOTrainer`** — reward-fn convention `reward_fn(prompts, completions, **kwargs)->list[float]`, [trl/trainer/grpo_trainer.py](https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_trainer.py), [`RewardFunc` alias PR #5246](https://github.com/huggingface/trl/pull/5246), [GRPO docs](https://huggingface.co/docs/trl/main/en/grpo_trainer).
+- **Internal:** `docs/COMPOSER_RECIPE_MAPPING.md` (§2, rows b/g), `research/09-composer-blog-delta-2026.md` (online-curriculum delta), `research/05-trace-replay-distillation.md` (orthogonal distill channel).

research/07-sdpo-hint-generator.md ADDED

	@@ -0,0 +1,366 @@

+# SDPO Hint Generation: How to Build the Teacher's "Privileged Info" for Composer's Targeted RL with Textual Feedback
+> **Research date:** 2026-05-28.
+> **Scope:** Resolves the **#1 open replication question** flagged in `docs/COMPOSER_RECIPE_MAPPING.md` §1 and `research/09-composer-blog-delta-2026.md` §2: *how are the hints generated?* This doc maps OPSD/SDPO's "privileged information" onto Composer's "hint," builds a cheapest→richest **taxonomy of hint sources**, ships a **concrete template library with actual strings**, specifies the **LLM-judge fallback prompt**, aligns **error-site detection** with `ingestion/trace_examples.py`, and proposes a **layered `HintGenerator` design** that slots into the existing `CollatorConfig.hint_generator` hook.
+> **Method:** Primary-source pulls (Tavily advanced) of the SDPO abstract + method/ablation HTML ([arXiv:2601.20802v2](https://arxiv.org/abs/2601.20802)), the OPSD method HTML + GitHub README ([arXiv:2601.18734v3](https://arxiv.org/abs/2601.18734), [siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD)), audited against the current `hint_generator.py`, `trainer/data_collator.py`, and `ingestion/trace_examples.py`.
+---
+## TL;DR
+The hint **is** the teacher's privileged-information conditioning variable. Cursor never says how hints are generated, but the two cited papers bracket the answer precisely:
+- **OPSD** conditions the teacher on `y⋆` = **ground-truth answer / reference CoT** — the strongest, most "privileged" signal, available only when you hold the solution.
+- **SDPO** generalizes this to **environment feedback** that you *already have for free* at training time, and crucially ablates **three feedback types**: (1) a **successful sibling rollout** ("sample solution"), (2) the **environment output** (runtime errors / judge text), and (3) the **student's own original attempt**. The teacher is the *same weights* conditioned on that feedback; the student is the same weights without it; loss is a per-token KL on the student's trajectory, gradient through the student only, teacher stop-grad.
+Composer's "hint" is therefore **not one thing** — it is *whatever cheap, locally-available text shifts the teacher distribution toward the correct continuation*. That reframing makes the open question tractable: build a **layered generator** that tries the cheapest source first and escalates only on miss:
+```
+template-by-error-kind  →  raw-tool-error-as-hint  →  LLM-judge hint  →  SDPO successful-sibling bootstrap
+   (free, deterministic)      (free, structural)        (~$0.0005/site)      (free, needs a rollout group)
+```
+The current `hint_generator.py` implements **only the first layer** (5 templates) and is the right v0.1 starting point. This doc specifies layers 2–4 and a clean `HintGenerator` Protocol so they compose behind the existing `Callable[[str, dict], str | None]` hook with **zero collator changes**.
+---
+## 1. How OPSD & SDPO obtain the teacher's "privileged info" → Composer's "hint"
+Both methods build teacher and student from **a single LLM** and differ only in *what extra text the teacher gets to see*. That extra text is the privileged-information variable. Composer's "hint inserted into the local context" is exactly this variable.
+### 1.1 OPSD — privileged info = ground-truth answer
+> *"The teacher policy is provided with privileged information `y⋆`, such as the **ground-truth answer or a reference chain-of-thought**, while the student policy conditions only on the problem `x`. … the teacher policy `p_T(·|x, y⋆)` conditions on both the problem and the privileged answer, whereas the student policy `p_S(·|x)` observes only the problem. We preserve the on-policy training paradigm by sampling trajectories `ŷ` exclusively from the student policy, which then receives dense, token-level supervision from the privileged teacher policy."* — OPSD, [arXiv:2601.18734v3](https://arxiv.org/html/2601.18734v3)
+OPSD loss (Eq. 8, verbatim structure):
+```
+L(θ) = E_{(x, y⋆)~S} [ E_{ŷ~p_S(·|x)} [ D( p_T ‖ p_S )(ŷ | x) ] ]
+```
+> *"Gradients are backpropagated only through the student policy `p_S`, while the teacher `p_T` acts as a fixed full-distribution target conditioned on privileged information `(x, y⋆)`."*
+**Map to Composer:** `y⋆` ≙ the **hint**. In OPSD the hint is maximally strong (the answer itself). In a coding agent you rarely have the answer at an arbitrary turn — so the OPSD form is the *upper bound* of hint strength, usable only for the subset of error sites where a reference exists (e.g. the deleted code in a Feature-Deletion task, or a known-good tool signature).
+Two OPSD implementation handles that transfer directly (from the [GitHub README](https://github.com/siyan-zhao/OPSD)):
+- **`--reason_first`**: *"Prepend an explicit rationalization to the teacher context before distillation."* This is OPSD's own knob for **same-model introspection** (taxonomy class (d) below) — the teacher is first asked to rationalize *why* the privileged info implies the answer, then distilled. Evidence the introspection-hint path is real and works.
+- **`--jsd_token_clip`** (default `0.05`): *"Clip the JSD loss for each token … This can improve stability by preventing **stylistic tokens from dominating** the training signal."* Directly relevant to Composer's **style/communication** behavior targets — without clipping, distilling a style hint can be dominated by a few high-divergence stylistic tokens. Our collator's `sdpo_loss_mask` already isolates post-hint tokens; token-clipping is the complementary per-token stabilizer.
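To make the token-clipping idea concrete, here is a small pure-Python sketch. It assumes "clip" means capping each token's Jensen-Shannon divergence at the threshold before averaging; whether OPSD implements the flag exactly this way is not confirmed here, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def jsd(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence between two distributions over one vocabulary."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def clipped_token_loss(
    teacher: List[Sequence[float]],
    student: List[Sequence[float]],
    clip: float = 0.05,
) -> Tuple[float, List[float]]:
    """Mean of per-token JSD, with each token's contribution capped at `clip`."""
    per_token = [min(jsd(t, s), clip) for t, s in zip(teacher, student)]
    return sum(per_token) / len(per_token), per_token


# Token 0 diverges strongly (e.g. a stylistic token); token 1 agrees exactly.
teacher_probs = [[0.9, 0.1], [0.5, 0.5]]
student_probs = [[0.1, 0.9], [0.5, 0.5]]
loss, per_token = clipped_token_loss(teacher_probs, student_probs)
```

With the cap in place, the high-divergence token contributes at most `clip` to the mean, so a handful of such tokens cannot dominate the signal from the rest of the turn.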
+### 1.2 SDPO — privileged info = environment feedback (three sources, ablated)
+> *"SDPO treats the current model **conditioned on feedback** as a self-teacher and distills its feedback-informed next-token predictions back into the policy. … SDPO leverages the model's ability to **retrospectively identify its own mistakes in-context**."* — SDPO abstract, [arXiv:2601.20802v2](https://arxiv.org/abs/2601.20802)
+SDPO explicitly ablates **three feedback types present in a verifiable coding environment** ([SDPO method HTML](https://arxiv.org/html/2601.20802v2)):
+> *"we ablate the three types of feedback present in a verifiable environment like code generation: **the sample solution** (if a successful rollout is available in the current rollout group), **the environment output** (such as runtime errors), and **the student's original attempt**."*
+This is the load-bearing finding for our taxonomy. Each maps to a distinct hint source:
+| SDPO feedback type | What it is | Composer "hint" equivalent | Taxonomy class (§2) |
+|---|---|---|---|
+| **Sample solution** | A *successful sibling rollout* from the same prompt's rollout group | Bootstrap hint: "Here is a working approach: …" | **(f)** SDPO successful-sibling bootstrap |
+| **Environment output** | Runtime error / judge text returned by the env | Raw tool-error text spliced as the hint | **(b)** raw-tool-error-as-hint |
+| **Student's original attempt** | The model's own failed text, re-shown | Self-introspection prompt | **(d)** same-model introspection |
+The key SDPO lever for the **hint-absent case** (called out in `09-composer-blog-delta-2026.md` §3 action item 3):
+> *"SDPO also outperforms baselines in standard RLVR environments that only return scalar feedback by **using successful rollouts as implicit feedback for failed attempts**."*
+i.e. when there is **no external hint source**, you can still manufacture privileged info by letting the teacher condition on a *sibling rollout that passed*. This is free (you already paid for the rollout group under GRPO) and is the natural last fallback before giving up on an error site.
+### 1.3 The exact mechanism nuance to preserve (from the Cursor blog, via delta doc §2)
+> *"This hint **changes the probabilities for the teacher, lowering those for the wrong tool and increasing those for a valid replacement**. For that turn only, we then **update the student weights towards the new probabilities**."*
+Two facts the hint generator must respect:
+1. **Teacher = hint-conditioned forward pass of the same weights** (not a re-rollout, not a separate model). The generator's job is only to *produce the text spliced into the teacher context* — the collator (`_build_hint_injected_trace`) already does the splicing, and the trainer does the forward pass.
+2. **Student weights are trainable; teacher is stop-grad.** The generator never touches the loss; it only conditions the teacher. So **a wrong hint is bounded-bad** — it produces a noisier teacher target at one masked turn, not a corrupted reward. This is why we can afford cheap/heuristic hints and only escalate on miss.
+---
+## 2. Taxonomy of hint sources — cheapest → richest
+For each class: applicability, cost, and which **Composer behavior class** it covers. Composer's three stated behavior targets are **tool use, coding style, and model communication** (`09-composer-blog-delta-2026.md` §2), plus **effort calibration** (blog §"behavioral aspects"). Tool errors are the cheap, structural case; style/communication/effort are the hard cases templates can't reach.
+| # | Hint source | How obtained | Cost / latency | Determinism | Tool err | Style | Comms | Effort |
+|---|---|---|---|---|:--:|:--:|:--:|:--:|
+| **(a)** | **Hardcoded template by error_kind** | Pattern-match `error_kind`, fill slots (`available_tools`, `tool_schema`) | **Free**, ~µs | Fully deterministic | ✅ strong | ⚠️ rigid | ⚠️ rigid | ❌ |
+| **(b)** | **Raw tool-error text as hint** | Pass the env's error string through (optionally truncated) | **Free**, ~µs | Deterministic | ✅ strong | ❌ | ❌ | ❌ |
+| **(c)** | **LLM-judge natural-language hint** | Call a cheap judge model with `(state, erroring_action, tool_output)` | ~$0.0003–0.001/site, ~0.5–2 s | Stochastic | ✅ | ✅ | ✅ | ✅ |
+| **(d)** | **Same-model introspection** | Re-prompt the *training model* to critique its own failed turn (OPSD `--reason_first`) | **Free GPU** (1 extra gen), ~0.3–1 s | Stochastic | ✅ | ✅ | ✅ | ✅ |
+| **(e)** | **Learned hint generator** | A small fine-tuned model trained to emit hints (defer to v0.2+) | Train-time cost + inference | Stochastic | ✅ | ✅ | ✅ | ✅ |
+| **(f)** | **SDPO successful-sibling bootstrap** | Pick a *passing* rollout from the same prompt's GRPO group; condition teacher on it | **Free** (reuses rollout group), ~µs to select | Deterministic given group | ✅ | ✅ | ✅ | ✅ (shows a *terser* success) |
+**Reading of the table:**
+- **(a)+(b) cover the tool-use behavior class almost entirely** and are free + deterministic → make them the default first layer. This is the "easy case" the mapping doc warns about (`COMPOSER_RECIPE_MAPPING.md` §"Why deferring … is the right call", point 2): they *do not* validate the harder behavior cases.
+- **Style / communication / effort-calibration are NOT pattern-matchable.** "This explanation was wasteful" or "this code violates house style" requires class **(c)**, **(d)**, or **(f)**. This is the real content of the open question.
+- **(f) is the unique unlock** when no external hint source exists *and* you don't want an API call: it manufactures privileged info from the model's own successes. It is the natural fallback and also the cheapest source for style/comms/effort because a *successful sibling* implicitly demonstrates the desired style/terseness without anyone writing a rule.
+- **(e) learned generator** is explicitly v0.2 (`COMPOSER_RECIPE_MAPPING.md` table row (d): "+ learned hint generator"). Out of scope to build now; the Protocol below makes it a drop-in later.
+**Recommended escalation order (rationale):** deterministic-and-free before stochastic, structural before semantic, no-API before API. → `(a) → (b) → [(c) xor (d)] → (f)` with `(f)` as the "nothing else fired but we have a passing sibling" backstop.
+---
+## 3. Concrete template library (actual strings)
+This **extends** the current `hint_generator.py` registry (which already ships `tool_not_found`, `json_decode`, `type_error`, `runtime_error`, `repeated_failure`). New/expanded templates below are written to the **same `HintContext` TypedDict** and same `dispatch(error_kind, ctx)` contract, so they register without touching the collator. All keep the blog's *"Reminder: …"* register (the one verbatim example Cursor published was `"Reminder: Available tools are…"`).
+| error_kind | Trigger | Hint string (template) |
+|---|---|---|
+| `tool_not_found` | invalid tool name | `Reminder: Available tools are: {tool_list}. The tool you called does not exist — use one of these.` |
+| `malformed_args` / `json_decode` | unparseable tool args / JSON | `Reminder: tool arguments must be a single valid JSON object. Common mistakes: single quotes (use double quotes), trailing commas, unescaped newlines inside strings, or wrapping the JSON in markdown fences.` |
+| `schema_mismatch` / `type_error` | args parse but violate schema | ``Reminder: `{tool_name}` expects arguments matching this schema:\n  {tool_schema}\nYour call is missing/mistyped: {bad_fields}. Re-issue with arguments matching the schema.`` |
+| `failing_test` | test suite returns non-zero / assertion | ``Reminder: the test `{test_name}` is still failing: {assertion_excerpt}. Re-read the failing test's expectations and adjust the implementation to satisfy them — do not modify the test.`` |
+| `lint_style` | linter/formatter non-zero exit | `Reminder: this change violates the project style ({linter}: {rule_id} — {rule_msg}). Match the surrounding code's conventions (imports, naming, formatting) before proceeding.` |
+| `wasteful_action` | redundant/no-op action (effort calibration) | `Reminder: this step repeated work already done (you already {prior_action}). Skip redundant reads/searches and act on what you know; prefer the most direct path to the goal.` |
+| `repeated_failure` | same error_kind ≥3× consecutively | `Reminder: this approach has failed {n} times. Step back and try a different strategy: read more of the surrounding code, search for an existing working example, or decompose the task differently.` |
+| `verbose_communication` | judge-flagged over-long message (comms) | `Reminder: keep the response concise and focused on the user's request. State what you did and why in 1–2 sentences; omit restating the task and step-by-step narration.` |
+Notes:
+- `{tool_list}`, `{tool_schema}`, `{bad_fields}`, etc. are filled from `HintContext` (`available_tools`, `tool_schema`, `tool_name`) and from new optional keys (`test_name`, `assertion_excerpt`, `linter`, `rule_id`, `rule_msg`, `prior_action`, `n`).
+- `failing_test`, `lint_style`, `wasteful_action`, `verbose_communication` are **new** and extend coverage from tool-use into the style/comms/effort behavior classes at the *template tier* — but they are deliberately generic; the high-quality versions of these come from the LLM-judge (§4) or sibling-bootstrap (§2 class (f)).
+- Truncate `{assertion_excerpt}` / `{tool_schema}` to ~200 chars (matches the `source_content_excerpt[:200]` convention already used in `trace_examples.py`) to keep the injected hint short — the blog stresses the hint is **local and short**.
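As an illustration of the slot-filling and truncation notes above, the sketch below renders the `schema_mismatch` template. The function name, the choice to decline when a slot is missing, and treating `bad_fields` as a list of strings are assumptions; the real `dispatch` in `hint_generator.py` may behave differently.

```python
import json
from typing import Optional

MAX_SLOT_CHARS = 200  # mirrors the ~200-char truncation convention

SCHEMA_MISMATCH = (
    "Reminder: `{tool_name}` expects arguments matching this schema:\n"
    "  {tool_schema}\n"
    "Your call is missing/mistyped: {bad_fields}. "
    "Re-issue with arguments matching the schema."
)


def render_schema_mismatch(ctx: dict) -> Optional[str]:
    """Fill the template; decline (None) if a required slot is missing."""
    required = ("tool_name", "tool_schema", "bad_fields")
    if any(key not in ctx for key in required):
        return None
    # Serialize the schema deterministically, then cap its length so the
    # injected hint stays short.
    schema_text = json.dumps(ctx["tool_schema"], sort_keys=True)[:MAX_SLOT_CHARS]
    return SCHEMA_MISMATCH.format(
        tool_name=ctx["tool_name"],
        tool_schema=schema_text,
        bad_fields=", ".join(ctx["bad_fields"]),
    )


hint = render_schema_mismatch(
    {
        "tool_name": "read_file",
        "tool_schema": {"path": "x" * 500},   # oversized on purpose
        "bad_fields": ["path"],
    }
)
```

Returning `None` on a missing slot lets a later layer handle the site instead of emitting a hint with an unfilled placeholder.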
+---
+## 4. LLM-judge path (class (c))
+When no template fires, or when the behavior class is style/comms/effort, call a cheap judge to emit a ≤2-sentence corrective hint. The judge sees the *failed* turn and the *environment's* reaction — never the ground truth (we usually don't have it) — and is asked to produce the *minimal corrective nudge* that the teacher will then condition on.
+### 4.1 Prompt template
+```text
+SYSTEM:
+You write a single, short corrective hint for a coding agent that just made a
+mistake. The hint will be inserted into the agent's context so it can retry the
+SAME turn. Output ONE hint of AT MOST 2 sentences. Be concrete and actionable.
+Do NOT solve the task, do NOT write code, do NOT explain your reasoning. If the
+action was actually fine, output exactly: NO_HINT.
+USER:
+## Conversation state (last {k} turns)
+{state}
+## The action that went wrong
+{erroring_action}
+## What the environment returned
+{tool_output}
+## Behavior dimension to correct (one of: tool_use | style | communication | effort)
+{behavior_dim}
+Write the hint now (≤2 sentences, or NO_HINT):
+```
+- `{state}` = last `k≈3` turns (truncate to a token budget, e.g. 1.5k tokens).
+- `{erroring_action}` = the assistant turn's tool call / message that failed.
+- `{tool_output}` = the env error or judge text (the same string class (b) would pass raw).
+- `{behavior_dim}` = routed from the error-site detector (§5): structural tool errors → `tool_use`; judge-flagged turns → `style`/`communication`/`effort`.
+- `NO_HINT` sentinel maps to the generator returning `None` (skip the SDPO site), preventing the collator from minting a zero-signal row (the collator already guards "hint AND recovery content" — `data_collator.py` L308).
+### 4.2 Model tier & rough cost
+- **Tier:** a *small/cheap* instruct model is sufficient — the task is "spot the obvious mistake and say it in 2 sentences," not solve. Candidates: a 7–8B local model already loaded for rollouts (zero marginal $), or a hosted small model (e.g. Sonnet-class / GPT-mini-class via OpenRouter, consistent with the existing `hint_generator.py` docstring that names "Sonnet 4.6 or Opus 4.7 via OpenRouter" for v0.2).
+- **Cost (hosted small model):** input ≈ 1.5k–2k tok (state + action + output), output ≤ ~60 tok. At ~$0.15/M in + ~$0.60/M out that is **≈ $0.0003–0.0006 per error site**. With error sites at, say, 1–3 per trace, this is **~$0.001–0.002/trace**, roughly two orders of magnitude cheaper than the trace-replay channel's ~$0.30/trace (`COMPOSER_RECIPE_MAPPING.md` §"three reward channels"), and only paid on **template misses**.
+- **Cost (local judge):** effectively free GPU time; preferred at scale. Use the hosted path for v0.1 quality calibration, then distill to local.
+- **Caching:** hints are deterministic-enough to cache keyed on `hash(erroring_action + tool_output + behavior_dim)`; repeated identical error sites across a training run reuse the hint for free.
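The hosted-judge cost estimate can be replayed directly from the quoted token counts and prices. This is only arithmetic on the figures above; the per-million-token prices are the doc's rough placeholders, not quotes from any provider, and `cost_per_site` is a hypothetical helper.

```python
def cost_per_site(
    in_tokens: int,
    out_tokens: int,
    usd_per_m_in: float = 0.15,
    usd_per_m_out: float = 0.60,
) -> float:
    """USD cost of one judge call at the given per-million-token prices."""
    return (in_tokens * usd_per_m_in + out_tokens * usd_per_m_out) / 1_000_000


low_site = cost_per_site(1_500, 60)    # short state excerpt
high_site = cost_per_site(2_000, 60)   # full 2k-token input

# 1 to 3 error sites per trace, every one of them a template miss.
low_trace = 1 * low_site
high_trace = 3 * high_site

TRACE_REPLAY_USD = 0.30  # trace-replay channel figure quoted above
ratio = TRACE_REPLAY_USD / high_trace

print(f"per site:  ${low_site:.6f} to ${high_site:.6f}")
print(f"per trace: ${low_trace:.6f} to ${high_trace:.6f}")
print(f"trace-replay is at least {ratio:.0f}x more expensive")
```

At these prices the raw token cost lands at the low end of the quoted per-site range, so the per-trace figures leave headroom for retries or longer state excerpts.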
+---
+## 5. Error-site detection in a trace (align with `ingestion/trace_examples.py`)
+The hint generator must only fire at **error sites**. The pipeline already has two layers of structural detection that the generator must align with — do **not** invent a parallel detector.
+**Existing structural detection (authoritative, do not duplicate):**
+1. **Ingestion → `trace_examples.py`** sets `turn["tool_error"] = <error_kind>` on the assistant turn *immediately after* an error tool-result. It detects errors via:
+   - **Structural flag first** (`_user_turn_has_error`): the ingester sets `tool_error: True` on user messages whose source JSONL had `is_error: true`. **This is the source of truth.**
+   - **String-tag fallback**: matches `TOOL_ERROR_TAG = "[TOOL_RESULT (ERROR)]"` only when no structural flag is present (older traces).
+   - **`error_kind` classification** (`default_classify_error`): keyword regex → `command_not_found`, `file_not_found`, `permission_denied`, `syntax_error`, `connection_error`, else `tool_error`.
+2. **Collator → `data_collator.py`** (`_is_error_turn`): an error site iff `turn.get("tool_error") is not None`, AND it only mints an SDPO row when **both** a hint is produced **and** the recovery turn has content (L308) — so empty-recovery sites are skipped.
+**Extending the detector for the new behavior/error classes (additions, not replacements).** Keep `error_kind` as the routing key the generator already receives, and broaden the classifier so the new templates (§3) and the judge router (§4) get the right `behavior_dim`:
+| Signal in trace | Detected via | New `error_kind` → behavior_dim |
+|---|---|---|
+| Failed tool status / `is_error: true` | structural flag (existing) | `tool_error`/`tool_not_found`/… → `tool_use` |
+| Exception traceback in tool output | regex `Traceback \(most recent call last\)` / `Error:` | `runtime_error` → `tool_use` |
+| Malformed args / JSON | parse failure of the tool-call args | `malformed_args`/`json_decode` → `tool_use` |
+| Test runner non-zero exit / assertion | regex `FAILED\|AssertionError\|[0-9]+ failed` in output | `failing_test` → `tool_use` (verifiable) |
+| Linter/formatter non-zero exit | regex `(ruff\|eslint\|flake8\|black).*(error\|would reformat)`; nonzero exit code | `lint_style` → `style` |
+| Redundant/no-op action | heuristic: action equals a prior action's signature; or no state delta | `wasteful_action` → `effort` |
+| Over-long / off-task assistant message | **LLM-judge flag only** (no structural signal) | `verbose_communication` → `communication` |
+Implementation alignment rule: **add these as new `(kind, regex)` rows to `_ERROR_KIND_PATTERNS` in `trace_examples.py`** (same ordered-precedence mechanism already there — note its comment that `command_not_found` must precede `file_not_found`), so detection stays in **one** place and the generator stays a pure `error_kind → hint` function. Style/comms/effort sites that have **no structural signature** are surfaced only by the judge and should be gated (sampled, e.g. 10–20% of clean turns) to bound cost.
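The ordered-precedence rule can be sketched as a first-match-wins list of `(kind, regex)` rows. The list and function names below are hypothetical and the actual shape of `_ERROR_KIND_PATTERNS` in `trace_examples.py` may differ; the point is only that `failing_test` must precede `runtime_error`, because `Error:` also matches `AssertionError:`.

```python
import re
from typing import Tuple

# Hypothetical additions; first match wins, so order encodes precedence.
NEW_PATTERNS = [
    ("failing_test", re.compile(r"FAILED|AssertionError|[0-9]+ failed")),
    ("lint_style", re.compile(r"(ruff|eslint|flake8|black).*(error|would reformat)")),
    ("runtime_error", re.compile(r"Traceback \(most recent call last\)|Error:")),
]

# Routing from error_kind to the behavior dimension the judge receives.
BEHAVIOR_DIM = {
    "failing_test": "tool_use",
    "lint_style": "style",
    "runtime_error": "tool_use",
}


def classify(tool_output: str) -> Tuple[str, str]:
    """Return (error_kind, behavior_dim) for a tool output string."""
    for kind, pattern in NEW_PATTERNS:
        if pattern.search(tool_output):
            return kind, BEHAVIOR_DIM[kind]
    return "tool_error", "tool_use"   # existing catch-all kind


pytest_out = classify("===== 2 failed, 10 passed =====")
assert_out = classify("AssertionError: expected 3, got 4")
lint_out = classify("ruff check: error E501 line too long")
trace_out = classify("Traceback (most recent call last):\nKeyError: 'x'")
other_out = classify("permission denied")
```

Swapping the first and third rows would route assertion failures to `runtime_error`, which is the kind of ordering bug the precedence comment in `trace_examples.py` guards against.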
+---
+## 6. Recommended layered design + `HintGenerator` Protocol
+### 6.1 Protocol
+A clean, typed Protocol that subsumes the current `dispatch` and the existing `CollatorConfig.hint_generator: Callable[[str, dict], str | None]` hook. The collator calls `generator(error_kind, error_meta)`; we wrap the Protocol with a tiny adapter so **no collator change is required**.
+```python
+# composer_replication/hints/protocol.py
+from __future__ import annotations
+from typing import Protocol, TypedDict, runtime_checkable
+class ErrorContext(TypedDict, total=False):
+    """Everything a hint source might need. Superset of the current HintContext."""
+    error_kind: str            # routing key from trace_examples classifier
+    behavior_dim: str          # "tool_use" | "style" | "communication" | "effort"
+    error_message: str         # raw env/tool error text  (enables class (b))
+    available_tools: list[str]
+    tool_name: str
+    tool_schema: dict
+    state_excerpt: str         # last-k turns, for the judge (class (c)/(d))
+    erroring_action: str       # the failed assistant turn
+    sibling_rollouts: list[dict]  # GRPO group; passing ones enable class (f)
+    repeat_count: int          # for repeated_failure
+@runtime_checkable
+class HintGenerator(Protocol):
+    """A hint source. Returns the hint text, or None to decline this site."""
+    def generate(self, ctx: ErrorContext) -> str | None: ...
+```
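One consequence of making the Protocol `@runtime_checkable` is that layers need no common base class: any object with a `generate` method passes an `isinstance` check. The sketch below re-declares a trimmed copy of the Protocol so it runs standalone; `AlwaysDecline`, `NotAGenerator`, and `validate_layers` are hypothetical names. Note that the runtime check only verifies that the method exists, not its signature.

```python
from typing import List, Optional, Protocol, TypedDict, runtime_checkable


class ErrorContext(TypedDict, total=False):
    error_kind: str
    error_message: str


@runtime_checkable
class HintGenerator(Protocol):
    def generate(self, ctx: ErrorContext) -> Optional[str]: ...


class AlwaysDecline:
    """Conforms structurally: has generate(), no inheritance needed."""

    def generate(self, ctx: ErrorContext) -> Optional[str]:
        return None


class NotAGenerator:
    """Does not conform: the method has a different name."""

    def produce(self, ctx: ErrorContext) -> Optional[str]:
        return "x"


def validate_layers(layers: List[object]) -> List[str]:
    """Return the class names of layers that do not satisfy the Protocol."""
    return [type(l).__name__ for l in layers if not isinstance(l, HintGenerator)]


bad = validate_layers([AlwaysDecline(), NotAGenerator()])
```

A check like this at construction time catches a mis-named method before training starts, rather than at the first error site.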
+### 6.2 Layered composite (template-first → judge → sibling-bootstrap)
+```python
+# composer_replication/hints/layered.py
+from __future__ import annotations   # keeps `str | None` annotations valid on Python < 3.10
+from dataclasses import dataclass, field
+from .protocol import HintGenerator, ErrorContext
+from . import templates          # wraps the existing HINT_TEMPLATES registry
+from . import judge              # LLM-judge generator (class (c))
+from . import sibling            # SDPO successful-sibling bootstrap (class (f))
+@dataclass
+class LayeredHintGenerator:
+    """Try each source in order; first non-None wins. A wrong hint is
+    bounded-bad (teacher is stop-grad), so cheap layers go first and we
+    only escalate to paid/learned layers on a miss."""
+    layers: list[HintGenerator] = field(default_factory=list)
+    def generate(self, ctx: ErrorContext) -> str | None:
+        for layer in self.layers:
+            hint = layer.generate(ctx)
+            if hint:                      # non-empty, non-None
+                return hint
+        return None                       # collator then skips this SDPO site
+    # Adapter for the existing CollatorConfig.hint_generator signature.
+    def as_collator_hook(self):
+        def hook(error_kind: str, error_meta: dict) -> str | None:
+            ctx: ErrorContext = {"error_kind": error_kind, **(error_meta or {})}
+            return self.generate(ctx)
+        return hook
+def default_layered(*, judge_client=None, enable_judge=True) -> LayeredHintGenerator:
+    layers: list[HintGenerator] = [
+        templates.TemplateHintGenerator(),          # (a) free, deterministic
+        templates.RawErrorHintGenerator(),          # (b) raw env error as hint
+    ]
+    if enable_judge and judge_client is not None:
+        layers.append(judge.JudgeHintGenerator(judge_client))   # (c)
+    layers.append(sibling.SiblingBootstrapGenerator())          # (f) backstop
+    return LayeredHintGenerator(layers=layers)
+```
+Wiring (unchanged collator contract):
+```python
+gen = default_layered(judge_client=my_small_model_client)
+config = CollatorConfig(hint_generator=gen.as_collator_hook(), enable_sdpo=True)
+collator = ComposerDataCollator(tokenizer=tok, config=config)
+```
+### 6.3 The three new layers (sketches)
+```python
+# (a)+(b) templates.py — reuse the EXISTING registry verbatim
+from composer_replication.hint_generator import dispatch   # current module
+class TemplateHintGenerator:
+    def generate(self, ctx):
+        return dispatch(ctx.get("error_kind", ""), ctx)     # None on unknown kind
+class RawErrorHintGenerator:
+    """Class (b): SDPO 'environment output' feedback — splice the raw error."""
+    def generate(self, ctx):
+        msg = (ctx.get("error_message") or "").strip()
+        if not msg:
+            return None
+        return f"Reminder: the previous action returned this error:\n{msg[:200]}\nFix the cause and retry."
+```
+```python
+# (c) judge.py — class (c), prompt from §4.1
+class JudgeHintGenerator:
+    def __init__(self, client, cache=None):
+        self.client, self.cache = client, (cache if cache is not None else {})
+    def generate(self, ctx):
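+        # Key on the tuple itself: hash() values can collide, and str hashes
+        # are salted per process, so they are unsafe as persistent cache keys.
+        key = (ctx.get("erroring_action"), ctx.get("error_message"),
+               ctx.get("behavior_dim"))
+        if key in self.cache:
+            return self.cache[key]
+        # _build_judge_prompt is an assumed helper (not defined in this sketch)
+        # that fills the §4.1 template from ctx.
+        raw = (self.client.complete(_build_judge_prompt(ctx)) or "").strip()
+        hint = None if raw in ("", "NO_HINT") else raw   # empty reply also declines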
+        self.cache[key] = hint
+        return hint
+```
+```python
+# (f) sibling.py — SDPO 'successful rollouts as implicit feedback'
+class SiblingBootstrapGenerator:
+    """Class (f): when nothing else fired but a sibling rollout in the same
+    GRPO group PASSED, condition the teacher on that success."""
+    def generate(self, ctx):
+        sibs = ctx.get("sibling_rollouts") or []
+        winners = [s for s in sibs if s.get("reward", 0.0) > 0.0]
+        if not winners:
+            return None
+        best = max(winners, key=lambda s: s["reward"])
+        snippet = (best.get("solution_excerpt") or "")[:200]
+        if not snippet.strip():
+            return None                   # no excerpt to show; avoid a zero-content hint
+        return ("Reminder: a working approach for this task looks like:\n"
+                f"{snippet}\nAdapt this to the current step.")
+```
+> **Note on class (d):** same-model introspection (OPSD `--reason_first`) is the *training model* critiquing its own turn — best implemented inside the trainer (where the model is loaded) rather than the collator, since it needs a model forward pass. Add it as a fourth layer once the trainer exposes a `self_critique(ctx) -> str` callable; the Protocol already supports it. For v0.1, the judge (c) is the simpler stand-in for the same role.
+### 6.4 Why this order (decision summary)
+1. **Templates + raw-error (a/b)** are free, deterministic, and cover the **tool-use** class — the bulk of structural error sites. They reproduce Cursor's one published example (`"Reminder: Available tools are…"`) exactly.
+2. **Judge (c)** is the only layer that *manufactures* a corrective for **style / communication / effort**, the behavior classes the mapping doc flags as the real test of the recipe (`COMPOSER_RECIPE_MAPPING.md` §"Why deferring … is the right call", point 2). Gated + cached → ~$0.0005/site, paid only on template miss or on the sampled clean turns described in §5.
+3. **Sibling-bootstrap (f)** is the SDPO-native fallback when there's no template, no judge (or judge declined), but the rollout group contains a winner — *free* privileged info from the model's own success. This is the lever `09-composer-blog-delta-2026.md` §3 action item 3 told us to record.
+4. **Learned generator (e)** drops in as a new layer in v0.2 (`COMPOSER_RECIPE_MAPPING.md` table row (d): "+ learned hint generator") without touching the Protocol or the collator.
+---
+## 7. Implementation handles (v0.1)
+Concrete, ordered work items. Everything below preserves the existing `CollatorConfig.hint_generator: Callable[[str, dict], str | None]` contract — **no collator surgery**.
+1. **Keep `hint_generator.py` as layer (a).** It already implements 5 templates with the right `dispatch(error_kind, ctx) -> str | None` signature. Add the four new templates from §3 (`failing_test`, `lint_style`, `wasteful_action`, `verbose_communication`) via `register(...)`. **Actual strings shipped in §3** — copy verbatim.
+2. **Add the new error_kind regexes to `_ERROR_KIND_PATTERNS` in `trace_examples.py`** (§5 table). Single source of detection truth; preserve the ordered-precedence comment pattern (`command_not_found` before `file_not_found`). Route each `error_kind → behavior_dim` so the judge gets correct routing.
+3. **Build `composer_replication/hints/`** with `protocol.py`, `layered.py`, `templates.py`, `judge.py`, `sibling.py` (§6 sketches). `templates.py` *imports the existing `dispatch`* — do not reimplement.
+4. **Wire via the adapter:** `CollatorConfig(hint_generator=default_layered(...).as_collator_hook())`. The `claude_states_to_trace_examples` adapter already populates `error_meta` (`source_content_excerpt[:200]`); extend it to also stash `error_message` (for class (b)) and, when available from the GRPO loop, `sibling_rollouts` (for class (f)).
+5. **Borrow OPSD stabilizers for the loss side:** when distilling style/comms hints, apply **per-token JSD clipping** (OPSD `--jsd_token_clip`, default `0.05`) so "stylistic tokens" don't dominate — the README states this is exactly why it exists. Pair with the collator's existing `sdpo_loss_mask` (post-hint tokens only).
+6. **Gate the judge:** fire (c) only on (i) template miss, or (ii) a sampled fraction (~10–20%) of clean turns flagged for style/comms/effort, with hint caching keyed on `(erroring_action, error_message, behavior_dim)`. Bounds cost at ~$0.001–0.002/trace.
+7. **Evaluate the generator independently of training** (this addresses the `COMPOSER_RECIPE_MAPPING.md` concern that "SDPO with hardcoded templates is the easy case"): measure (a) the % of error sites that get a non-None hint per layer, (b) the teacher-vs-student KL *increase* at hinted turns (a good hint should *raise* divergence because it shifts probability toward the fix, per the blog's "lowering wrong-tool, raising valid-replacement"), and (c) for style/comms, whether a held-out judge agrees the hint is corrective. A hint that doesn't move the teacher distribution is a no-op and should be pruned.
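The judge gate and cache from item 6 can be sketched as below. This is a minimal illustration under stated assumptions, not the repo's `judge.py`: the class name `GatedJudge`, the `SOFT_DIMS` labels, the `clean_sample_rate` parameter, and the judge call signature are all hypothetical. Only the cache key `(erroring_action, error_message, behavior_dim)` and the two firing conditions come from the text above.

```python
import hashlib
import random
from typing import Callable, Dict, Optional

# Assumed behavior_dim labels for the style / communication / effort classes.
SOFT_DIMS = {"style", "communication", "effort"}


class GatedJudge:
    """Hypothetical wrapper that gates and caches an expensive judge call."""

    def __init__(self, judge: Callable[[str, str, str], Optional[str]],
                 clean_sample_rate: float = 0.15, seed: int = 0) -> None:
        self.judge = judge
        self.clean_sample_rate = clean_sample_rate  # ~10-20% per item 6
        self.rng = random.Random(seed)
        self.cache: Dict[str, Optional[str]] = {}
        self.judge_calls = 0

    @staticmethod
    def cache_key(erroring_action: str, error_message: str, behavior_dim: str) -> str:
        # Unit-separator join so ("a", "bc") and ("ab", "c") never collide.
        raw = "\x1f".join((erroring_action, error_message, behavior_dim))
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def hint(self, erroring_action: str, error_message: str, behavior_dim: str,
             template_hint: Optional[str], is_error_site: bool = True) -> Optional[str]:
        # A template hit always wins; the judge is never consulted.
        if template_hint is not None:
            return template_hint
        # Condition (ii): clean turns fire only for soft dims, and only when sampled.
        if not is_error_site:
            sampled = self.rng.random() < self.clean_sample_rate
            if behavior_dim not in SOFT_DIMS or not sampled:
                return None
        # Condition (i) or a sampled clean turn: consult the cache, then the judge.
        key = self.cache_key(erroring_action, error_message, behavior_dim)
        if key not in self.cache:
            self.judge_calls += 1
            self.cache[key] = self.judge(erroring_action, error_message, behavior_dim)
        return self.cache[key]
```

Because the cache also stores `None` (judge declined), a declined site is not re-billed on the next identical error.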
+---
+## 8. Citations
+- **SDPO** — Hübotter, Lübeck, Behric, Baumann, Bagatella, Marta, Hakimi, Shenfeld, Kleine Buening, Guestrin, Krause (ETH Zürich), *Reinforcement Learning via Self-Distillation*, [arXiv:2601.20802v2](https://arxiv.org/abs/2601.20802) (v1 28 Jan 2026, v2 16 Feb 2026), CC-BY-4.0. Abstract + method/ablation HTML ([html v2](https://arxiv.org/html/2601.20802v2)). The three-feedback-type ablation (sample solution / environment output / student's original attempt) and the "successful rollouts as implicit feedback" claim are the load-bearing sources for §1.2 and taxonomy classes (b), (d), (f).
+- **OPSD** — Zhao et al., *Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models*, [arXiv:2601.18734v3](https://arxiv.org/abs/2601.18734), code [github.com/siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD) (paper CC-BY-4.0; verify code license on repo). Privileged-info teacher (`y⋆` = ground-truth/reference CoT), Eq. 8 loss, stop-grad teacher, and the `--reason_first` (introspection) + `--jsd_token_clip` (stylistic-token stabilizer) flags are the sources for §1.1 and the OPSD handles in §7.
+- **Cursor blog** — [Introducing Composer 2.5](https://cursor.com/blog/composer-2-5) (2026): the "Reminder: Available tools are…" example, "hint changes the teacher probabilities / update student weights for that turn only," and the three behavior targets (tool use, coding style, model communication). Via `docs/COMPOSER_RECIPE_MAPPING.md` §1 and `research/09-composer-blog-delta-2026.md` §2.
+- **Composer 2 technical report** — [arXiv:2603.24477](https://arxiv.org/abs/2603.24477) / [Composer2.pdf](https://cursor.com/resources/Composer2.pdf) (Rush et al.): flagged in the delta doc as the most likely place to resolve hint-generation directly; **still unread** — a dedicated extraction is the recommended follow-up if this design needs validation against Cursor's actual mechanism.
+- **In-repo (audited):** `composer_replication/hint_generator.py` (current layer (a)), `composer_replication/trainer/data_collator.py` (`CollatorConfig.hint_generator` hook, `_build_hint_injected_trace`, hint-AND-recovery gate at L308), `composer_replication/ingestion/trace_examples.py` (structural error detection, `_ERROR_KIND_PATTERNS`, `default_classify_error`).
+> **Residual gap:** Cursor still never states which hint source they use; this design *brackets* their unknown choice with the OPSD (privileged-answer) and SDPO (environment-feedback + sibling-bootstrap) endpoints and makes all of them composable. The one unread artifact that could collapse the bracket is Composer2.pdf (arXiv:2603.24477).

research/08-sdpo-grpo-integration.md ADDED

	@@ -0,0 +1,499 @@

+# SDPO ⊕ Dr. GRPO: wiring the on-policy KL-at-error-turns channel into a live RL loop
+> **Design date:** 2026-05-28.
+> **Scope:** A concrete, implementable design for adding the SDPO auxiliary
+> loss channel (on-policy KL at error turns, teacher = same weights conditioned
+> on a hint) as a **second loss head** on a live **Dr. GRPO** update step. Targets
+> the two integration substrates already in this repo: the **PRIME-RL parity
+> recipe** (`recipes/prime_rl/composer_loss.py`) and the **TRL `GRPOTrainer`
+> subclass** (`trainer/composer_trainer.py`). Recommends the TRL subclass as the
+> host and gives a ~70-LoC `ComposerGRPOTrainer` sketch.
+> **Method:** Lead with local-file analysis of `loss.py`, `composer_loss.py`,
+> `composer_trainer.py`, `data_collator.py`, plus `research/07` (HintGenerator)
+> and `research/10` (the Dr. GRPO target). One bounded TRL API lookup
+> (`mcp_exa_get_code_context_exa` on `huggingface/trl@main`) to confirm the
+> `GRPOTrainer` loss-override surface; the DeepWiki follow-up timed out, so the
+> version-robust guard in §4 documents both the `_compute_loss(self, model,
+> inputs)` internal hook (what this repo already overrides) and the public
+> `compute_loss(self, model, inputs, return_outputs=False,
+> num_items_in_batch=None)` HF-Trainer wrapper.
+---
+## TL;DR
+SDPO is **not** the GRPO-KL-to-reference term and must not be folded into it. It
+is a **separate distillation head**: a generalized-JSD between the student's
+on-policy logits and the **same model's** logits when its context has a hint
+spliced in at the error turn, masked to the post-hint recovery tokens. The
+integration is therefore "compute the Dr. GRPO loss as usual, then **add
+`beta_sdpo · JSD_error_turns`** before `.backward()`."
+- **Host = the TRL `GRPOTrainer` subclass.** It already exists
+  (`ComposerReplicationTrainer`), already overrides the loss with exactly this
+  `grpo + alpha*sdpo + beta*replay` shape, and — decisively — it has **full
+  logits** in `_compute_loss`. The PRIME-RL recipe **cannot** host SDPO today:
+  its `LossInputs` exposes per-token **log-probs only, not full vocabulary
+  logits**, and `composer_loss.py` correctly raises `NotImplementedError` when
+  `alpha_sdpo>0`. SDPO needs the full distribution; PRIME-RL is blocked until
+  upstream exposes logits.
+- **Attach point:** inside the Dr. GRPO update step, after the policy-gradient +
+  k1-KL loss is computed on the minibatch, run one **student forward (grad)** +
+  one **teacher forward (`no_grad`, hint-spliced context)**, take
+  `generalized_jsd_loss` masked to `sdpo_loss_mask`, scale by `beta_sdpo`, and
+  add. Single-epoch Dr. GRPO makes this clean: the teacher forward happens on
+  **the same minibatch being updated**, so the KL is genuinely on-policy.
+- **Dr. GRPO specifics are preserved untouched:** SDPO touches neither the
+  advantage estimator (no std-norm, no length-standardization) nor the GRPO
+  **k1** (`−log r`) KL-to-ref. It is purely additive.
+- **CPU-testable:** a 1–2 rollout Dr. GRPO step on Qwen2.5-0.5B with the SDPO
+  channel on, mirroring the existing `examples/sdpo_real_trace_train_smoke`.
+---
+## 1. The two in-repo substrates, and why TRL is the host
+### 1.1 Substrate A — TRL `GRPOTrainer` subclass (`trainer/composer_trainer.py`)
+Already in the repo and already the right shape. `ComposerReplicationTrainer`
+subclasses `trl.GRPOTrainer` and overrides:
+```python
+def _compute_loss(self, model, inputs) -> torch.Tensor:
+    grpo_loss  = super()._compute_loss(model, inputs)         # channel 1
+    sdpo_kl    = self._compute_sdpo_loss(model, inputs)       # channel 2
+    replay_dpo = self._compute_trace_replay_loss(model, inputs)
+    return grpo_loss + self.alpha_sdpo*sdpo_kl + self.beta_replay*replay_dpo
+```
+`_compute_sdpo_loss` (lines 133–178) already does the **student forward (grad) +
+teacher forward (`no_grad`) over `ctx_teacher_input_ids`**, the
+`student_logits.shape == teacher_logits.shape` gate, and
+`generalized_jsd_loss(..., labels=inputs["sdpo_loss_mask"], beta, temperature,
+token_clip, reduction="batchmean")`. This is the SDPO channel, intact. **It has
+full logits** — the prerequisite PRIME-RL lacks.
+**Decisive property:** TRL hands the subclass `model` and `inputs` and lets it
+return any scalar; full `.logits` are available for both the student and the
+hint-conditioned teacher forward. SDPO is a drop-in.
+### 1.2 Substrate B — PRIME-RL parity recipe (`recipes/prime_rl/composer_loss.py`)
+PRIME-RL's `CustomLossConfig` takes an importable `loss_fn(inputs:
+LossInputs)` called **once per sample** on **1-D `(seq,)` tensors**. Channel 1
+(DPPO + k1-style KL on the importance ratio) is **byte-for-byte parity-verified**
+against upstream `default_loss_fn` and is an excellent Dr.-GRPO-adjacent PG loss.
+But SDPO is **deferred by construction**:
+```python
+# composer_loss.py, lines 257-268
+teacher_lp = getattr(inputs, "teacher_logprobs", None)
+if alpha_sdpo > 0:
+    raise NotImplementedError(
+        "SDPO channel in the PRIME-RL recipe is deferred. PRIME-RL v0.5 "
+        "exposes (seq,) log-probs through LossInputs but not full vocabulary "
+        "logits, and SDPO/OPSD requires the full distribution. ...")
+```
+`generalized_jsd_loss` calls `log_softmax(dim=-1)` over the vocab axis. With
+only a `(seq,)` log-prob vector there is **no vocab axis** — softmax of a
+1-element slice is identically 1.0 and `log` is 0, i.e. a mathematically
+degenerate, silently-zero channel (the Wave-13 finding the docstring cites). So
+SDPO in PRIME-RL is blocked **until upstream exposes per-token full logits**; this
+is not something we can paper over.
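The degeneracy can be checked by hand. The sketch below uses a plain-Python `log_softmax` (a stand-in written for this example, not the repo's code) and treats a single per-token log-prob as a "vocab axis" of size 1, which is all a `(seq,)` tensor can offer.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


# A real vocab row: the outputs are log-probabilities that sum to 1.
full = log_softmax([2.0, 0.5, -1.0])

# One log-prob per token, treated as a size-1 vocab axis. Whatever the
# student and teacher values were, both normalize to log(1.0) = 0.0.
student = log_softmax([-0.7])
teacher = log_softmax([-3.2])
```

Since `student` and `teacher` both come out as `[0.0]`, any divergence computed between them is zero regardless of how far apart the original log-probs were, which is the silently-zero channel described above.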
+### 1.3 Recommendation
+**Host the SDPO aux channel in the TRL `GRPOTrainer` subclass.** Rationale:
+1. **Logits available** — the one hard requirement SDPO has and PRIME-RL lacks.
+2. **The override already exists** with the exact additive shape; we
+   re-point channel 1 at Dr. GRPO and tighten the teacher forward (§4).
+3. **Single-process, CPU-runnable** — matches the existing smoke harness, so the
+   SDPO-on Dr.-GRPO step is testable today (§6) without PRIME-RL's 3-actor mesh.
+4. PRIME-RL stays the **scale/parity** path for channel-1-only runs; SDPO lands
+   there for free the moment `LossInputs.teacher_logits` (full distribution)
+   exists upstream — the adapter is otherwise ready.
+> One caveat to fix while we're here: the current `ComposerReplicationTrainer`
+> channel 1 is *vanilla* GRPO (`super()._compute_loss`). The Composer target is
+> **Dr. GRPO** (`research/10`): length-standardization removed, **no std-dev
+> advantage normalization**, **k1** (`−log r`) KL, Adam, single-epoch. §3 + §4
+> pin those into the subclass; SDPO rides on top unchanged.
+---
+## 2. The exact attach point + data flow
+SDPO attaches **inside one Dr. GRPO update step, after the PG+KL loss is formed,
+before backward**. It is one extra additive scalar. Concretely, per minibatch:
+```
+            ┌─────────────────────── one Dr. GRPO update step (single-epoch) ──────────────────────┐
+rollout ──▶ │  Channel 1 (Dr. GRPO):                                                                │
+trajectory  │    advantages = (R - group_mean)          # NO /std, NO length-standardization        │
+(group of K)│    logπ_new = model(input_ids).logprobs   # the on-policy student forward (grad)      │
+            │    log_r     = logπ_new - logπ_old         # log importance ratio (old = rollout-time) │
+            │    pg   = -(advantages * exp(log_r))[resp_mask]                                        │
+            │    kl   = (-log_r)[resp_mask]              # k1 estimator, NOT k3                       │
+            │    L_drgrpo = (pg + beta_kl * kl).sum()                                                │
+            │                                                                                        │
+            │  Channel 2 (SDPO) — SAME minibatch, reuses the student forward where possible:         │
+            │    error sites  ◀── reuse ingestion structural `tool_error` (research/07 §5)           │
+            │       │            (turn.get("tool_error") is not None; single source of truth)        │
+            │       ▼                                                                                 │
+            │    HintGenerator.generate(ErrorContext)  ──▶ hint text  (research/07 §6, layered)      │
+            │       │                                                                                 │
+            │       ▼  data_collator splices hint at the error turn:                                 │
+            │    ctx_teacher_input_ids  (hint system-msg + recovery turn, chat-template aligned)     │
+            │    input_ids              (placeholder-of-equal-token-length so shapes match)          │
+            │    sdpo_loss_mask         (1 on post-hint recovery tokens only)                         │
+            │       │                                                                                 │
+            │       ▼                                                                                 │
+            │    student_logits = model(input_ids).logits                 # grad                      │
+            │    with no_grad: teacher_logits = model(ctx_teacher_input_ids).logits   # stop-grad     │
+            │    L_sdpo = generalized_jsd_loss(student, teacher,                                      │
+            │                labels=sdpo_loss_mask, beta=jsd_beta,                                    │
+            │                temperature=1.0, token_clip=0.05)           # masked to error turn       │
+            │                                                                                        │
+            │  total = L_drgrpo + beta_sdpo * L_sdpo      ──▶ .backward()  ──▶ Adam.step()            │
+            └────────────────────────────────────────────────────────────────────────────────────────┘
+```
+Key flow facts:
+- **Error-site detection is not re-invented.** The ingestion layer already sets
+  `turn["tool_error"] = <error_kind>` (structural `is_error:true` flag first,
+  string-tag fallback), and the collator's `_is_error_turn` keys on exactly that
+  (`research/07` §5). The trainer **consumes** the collator's
+  `ctx_teacher_input_ids` / `sdpo_loss_mask`; it does not detect errors itself.
+- **HintGenerator is called at collation time**, not in the loss. Per
+  `research/07` §6.1, the generator's only job is to produce the text spliced
+  into the teacher context; the collator's `_build_hint_injected_trace` does the
+  splice and the equal-length student alignment
+  (`_build_aligned_student_for_sdpo`). The trainer sees finished tensors.
+- **The teacher forward is on the live weights**, hint-conditioned, `no_grad`.
+  It is *not* a separate model and *not* a re-rollout (`research/07` §1.3). One
+  extra forward per SDPO minibatch.
+- **The JSD is masked to the error turn** via `sdpo_loss_mask` (post-hint
+  recovery tokens only), so SDPO supervises *exactly* the turn the hint targets,
+  leaving the rest of the trajectory to channel 1.
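For intuition, the masked, clipped JSD in channel 2 can be sketched on explicit probability rows. This assumes the TRL-GKD-style definition (mixture `M = (1 - beta) * student + beta * teacher`, loss `beta * KL(teacher || M) + (1 - beta) * KL(student || M)`) and a mean over masked tokens; the repo's `generalized_jsd_loss` may differ in detail, and the function names here are illustrative only.

```python
import math


def _kl(p, q):
    """KL(p || q) for two probability rows; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def token_jsd(student, teacher, beta=0.5):
    """Generalized JSD for one token position (assumes 0 < beta < 1)."""
    assert 0.0 < beta < 1.0
    mix = [(1.0 - beta) * s + beta * t for s, t in zip(student, teacher)]
    return beta * _kl(teacher, mix) + (1.0 - beta) * _kl(student, mix)


def masked_jsd(student_rows, teacher_rows, mask, beta=0.5, token_clip=None):
    """Mean per-token JSD over positions where mask is truthy.

    token_clip caps each token's contribution so a few high-divergence
    tokens cannot dominate; an all-zero mask yields 0.0 (channel no-ops).
    """
    total, count = 0.0, 0
    for s, t, keep in zip(student_rows, teacher_rows, mask):
        if not keep:
            continue
        d = token_jsd(s, t, beta)
        if token_clip is not None:
            d = min(d, token_clip)
        total += d
        count += 1
    return total / count if count else 0.0
```

With `beta=0.5` the per-token value is bounded by ln 2, reached when student and teacher put all mass on different tokens; the 0.05 clip sits well below that bound.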
+---
+## 3. Reconciling with Dr. GRPO specifics
+`research/10` pins the algorithm. SDPO must coexist without perturbing any of it:
+| Dr. GRPO property (`research/10` §2) | Where it lives | SDPO interaction |
+|---|---|---|
+| **No std-dev advantage normalization** | advantage estimator | **None.** SDPO never touches advantages. Keep `A = R - group_mean` (no `/std`). |
+| **Length-standardization term removed** | PG reduction | **None.** SDPO is a separate head; do not re-introduce a `1/\|y\|` factor via SDPO's reduction either (use `batchmean` over masked error-turn tokens, which is SDPO's own normalization, independent of trajectory length). |
+| **k1 KL = `−log r`** (NOT k3) | GRPO KL-to-ref term | **Distinct from SDPO.** The GRPO k1 KL regularizes the policy toward the *reference/old* policy on all response tokens. SDPO's JSD pulls the policy toward the *hint-conditioned self-teacher* on error-turn tokens. Two different targets, two different token sets, two different weights (`beta_kl` vs `beta_sdpo`). Never merge them. |
+| **Single-epoch (a prompt is never trained twice)** | outer loop | **This is what makes SDPO clean.** The teacher forward happens on the *same minibatch being updated this step* — the student logits and the hint-conditioned teacher logits are both from the current weights on the current rollout, so the distilled KL is genuinely **on-policy** (SDPO's defining property). No stale-teacher / replay-buffer drift to reconcile. |
+| **Adam, full-parameter, async rollouts** | optimizer / infra | **None.** SDPO adds gradient only through the student forward; Adam consumes the summed gradient transparently. Async/off-policy weight sync (PipelineRL-style) affects channel 1's `logπ_old`; SDPO's teacher is the *current* weights so it is unaffected. |
+**The one thing to get right:** SDPO's JSD is **SEPARATE** from the GRPO
+KL-to-ref. In the loss expression `total = L_drgrpo + beta_sdpo*L_sdpo`, the
+`L_drgrpo` already *contains* its own `beta_kl * k1_kl`. Do not let `beta_sdpo`
+masquerade as a KL coefficient or vice-versa; they are logged separately
+(`loss/grpo_kl` vs `loss/sdpo_jsd`).
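Two of the table rows are easy to make concrete. The sketch below is illustrative only (function names are not from the repo): it contrasts the Dr. GRPO advantage `R - group_mean` with the std-normalized form, and the k1 estimator `-log r` with k3, using the estimator naming from Schulman's "Approximating KL Divergence" note.

```python
import math


def dr_grpo_advantages(rewards):
    """Dr. GRPO: center by the group mean, with NO division by std."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def std_normalized_advantages(rewards, eps=1e-6):
    """Vanilla-GRPO-style advantages, shown only for contrast."""
    centered = dr_grpo_advantages(rewards)
    std = math.sqrt(sum(c * c for c in centered) / len(rewards))
    return [c / (std + eps) for c in centered]


def k1(log_r):
    """k1 estimator: -log r. Unbiased, but can be negative per sample."""
    return -log_r


def k3(log_r):
    """k3 estimator: (r - 1) - log r. Always >= 0 per sample."""
    return math.exp(log_r) - 1.0 - log_r


# A near-tied group: std normalization inflates a 0.02 reward gap to about +/-1.
near_tie = [0.51, 0.49]
plain = dr_grpo_advantages(near_tie)
scaled = std_normalized_advantages(near_tie)
```

The near-tie case shows why the std division matters: `plain` keeps the advantages at the scale of the actual reward gap, while `scaled` treats a marginal difference as a full-strength signal.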
+---
+## 4. Implementation handles — `ComposerGRPOTrainer(GRPOTrainer)`
+A focused subclass that (a) forces channel 1 into the Dr. GRPO regime and (b)
+adds the SDPO head. This refines the existing `ComposerReplicationTrainer`; the
+SDPO method is lifted almost verbatim from `composer_trainer.py:_compute_sdpo_loss`
+(it is already correct), and the Dr. GRPO config is pinned via `GRPOConfig`.
+### 4.1 The loss-override surface (version-robust)
+The repo already overrides `_compute_loss(self, model, inputs)` — the internal
+per-step loss hook TRL's `GRPOTrainer` exposes, and what this subclass keeps
+using. Recent TRL wraps that in the HF `Trainer.compute_loss(self, model,
+inputs, return_outputs=False, num_items_in_batch=None)`. To be robust to either
+surface, override **`_compute_loss`** (present across the versions this repo
+targets) and additionally provide a thin `compute_loss` shim that delegates, so
+the subclass works whether TRL calls the internal or the public method:
+```python
+def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
+    loss = self._compute_loss(model, inputs)            # our composed loss
+    return (loss, None) if return_outputs else loss
+```
+If a future TRL drops `_compute_loss`, move the channel-1 call to
+`super().compute_loss(model, inputs, return_outputs=True,
+num_items_in_batch=num_items_in_batch)[0]` inside `_compute_loss` — the SDPO
+add-on is unaffected.
+### 4.2 The sketch (~70 LoC)
+```python
+# composer_replication/trainer/composer_grpo_trainer.py
+from __future__ import annotations
+from typing import Any
+import logging, torch
+try:
+    from trl import GRPOTrainer, GRPOConfig            # noqa: F401
+    _TRL = True
+except ImportError:                                    # doc/test import without TRL
+    GRPOTrainer = object; _TRL = False
+from composer_replication.opsd import generalized_jsd_loss
+logger = logging.getLogger(__name__)
+def make_dr_grpo_config(**overrides: Any) -> "GRPOConfig":
+    """Dr. GRPO regime (research/10 §2): no std-norm, no length-standardization,
+    k1 KL, single-epoch, Adam. We pin what GRPOConfig exposes and assert the
+    rest. TRL flag names drift across versions, so set defensively + log."""
+    cfg_kwargs = dict(
+        num_iterations=1,            # single-epoch: a prompt is never re-trained
+        scale_rewards=False,         # << NO std-dev advantage normalization (Dr. GRPO)
+        loss_type="dr_grpo",         # TRL's Dr. GRPO loss_type: drops length-standardization;
+                                     #   if absent in your TRL, fall back to "grpo" and
+                                     #   override the reduction (see assert below).
+        optim="adamw_torch",         # Adam(W); Composer 2 uses Adam for RL
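+        beta=0.0,                    # GRPO KL-to-ref coeff; set >0 to enable the KL term.
+                                     #   Caveat: to our knowledge stock TRL computes this term
+                                     #   with the k3 estimator, so the k1 form needs an override.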
+    )
+    cfg_kwargs.update(overrides)
+    return GRPOConfig(**cfg_kwargs)
+class ComposerGRPOTrainer(GRPOTrainer):  # type: ignore[misc,valid-type]
+    """Dr. GRPO + SDPO (on-policy KL at error turns). SDPO is an ADDITIVE head;
+    it never touches advantages or the GRPO-KL-to-ref term."""
+    def __init__(self, *args: Any, beta_sdpo: float = 0.0, sdpo_jsd_beta: float = 0.5,
+                 sdpo_temperature: float = 1.0, sdpo_token_clip: float | None = 0.05,
+                 sdpo_warmup_steps: int = 0, beta_sdpo_max: float | None = None,
+                 **kwargs: Any):
+        if not _TRL:
+            raise ImportError("ComposerGRPOTrainer requires TRL: pip install -e .[train]")
+        super().__init__(*args, **kwargs)
+        self.beta_sdpo = beta_sdpo
+        self.beta_sdpo_max = beta_sdpo_max if beta_sdpo_max is not None else beta_sdpo
+        self.sdpo_warmup_steps = sdpo_warmup_steps
+        self.sdpo_jsd_beta = sdpo_jsd_beta
+        self.sdpo_temperature = sdpo_temperature
+        self.sdpo_token_clip = sdpo_token_clip
+        # Dr. GRPO sanity pins (loud, not silent): if the TRL version ignored a
+        # flag, surface it rather than train vanilla GRPO by accident.
+        if getattr(self.args, "scale_rewards", True):
+            logger.warning("Dr. GRPO requires scale_rewards=False (no std-norm); "
+                           "GRPOConfig.scale_rewards=%s — advantages may be std-normalized.",
+                           getattr(self.args, "scale_rewards", None))
+    def _beta_sdpo_now(self) -> float:
+        """Linear warmup so SDPO doesn't swamp the early policy gradient (§5).
+        Ramps from 0 to beta_sdpo_max (which defaults to beta_sdpo)."""
+        step = getattr(getattr(self, "state", None), "global_step", 0) or 0
+        if self.sdpo_warmup_steps <= 0:
+            return self.beta_sdpo
+        frac = min(1.0, step / float(self.sdpo_warmup_steps))
+        # Start at 0: ramping from beta_sdpo to a beta_sdpo_max that defaults to
+        # beta_sdpo would be a constant schedule, i.e. no warmup at all.
+        return frac * self.beta_sdpo_max
+    def _compute_loss(self, model, inputs):
+        drgrpo = super()._compute_loss(model, inputs)          # channel 1 (Dr. GRPO, k1 KL)
+        sdpo   = self._compute_sdpo_loss(model, inputs)        # channel 2 (additive)
+        beta   = self._beta_sdpo_now()
+        total  = drgrpo + beta * sdpo
+        if self.state.global_step % getattr(self.args, "logging_steps", 50) == 0:
+            self.log({"loss/grpo": float(drgrpo.detach()),
+                      "loss/sdpo_jsd": float(sdpo.detach()),
+                      "loss/beta_sdpo": beta, "loss/total": float(total.detach())})
+        return total
+    def _compute_sdpo_loss(self, model, inputs):
+        if (self._beta_sdpo_now() == 0.0
+                or "ctx_teacher_input_ids" not in inputs
+                or inputs["ctx_teacher_input_ids"].numel() == 0):
+            return torch.zeros((), device=next(model.parameters()).device, requires_grad=True)
+        student = model(input_ids=inputs["input_ids"]).logits           # grad
+        with torch.no_grad():
+            teacher = model(input_ids=inputs["ctx_teacher_input_ids"]).logits   # stop-grad
+        if student.shape != teacher.shape:                              # collator alignment guard
+            logger.warning("SDPO shape mismatch student=%s teacher=%s; skipping step.",
+                           student.shape, teacher.shape)
+            return torch.zeros((), device=student.device, requires_grad=True)
+        return generalized_jsd_loss(student_logits=student, teacher_logits=teacher,
+                                    labels=inputs.get("sdpo_loss_mask"),
+                                    beta=self.sdpo_jsd_beta, temperature=self.sdpo_temperature,
+                                    token_clip=self.sdpo_token_clip, reduction="batchmean")
+    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
+        loss = self._compute_loss(model, inputs)
+        return (loss, None) if return_outputs else loss
+```
+### 4.3 How error-turn batches reach the trainer
+**Reuse `ComposerDataCollator` verbatim** — it already emits
+`ctx_teacher_input_ids` + `sdpo_loss_mask` and (critically) the
+**equal-length student** via `_build_aligned_student_for_sdpo` (the placeholder
+trick that keeps `student_logits.shape == teacher_logits.shape` so the JSD gate
+passes; the Gemini-W19 alias bug is already handled there). Wiring:
+```python
+gen = default_layered(judge_client=small_model).as_collator_hook()   # research/07 §6
+collator = ComposerDataCollator(tokenizer=tok,
+             config=CollatorConfig(hint_generator=gen, enable_sdpo=True,
+                                   enable_replay_dpo=False))
+trainer = ComposerGRPOTrainer(model=model, args=make_dr_grpo_config(...),
+             train_dataset=ds, data_collator=collator,
+             beta_sdpo=0.1, sdpo_warmup_steps=50, sdpo_token_clip=0.05,
+             reward_funcs=[my_rlvr_reward])
+```
+> **GRPO-rollout vs collator note.** TRL's `GRPOTrainer` generates rollouts
+> internally and forms its own `inputs` (prompt + completions + advantages). For
+> SDPO the error sites come from the *rollout trajectory itself* (tool errors in
+> the just-generated completions), so the SDPO tensors must be built **from the
+> live rollout**, not from a static dataset. Two equivalent integration modes:
+> (1) **post-rollout hook** — override `_generate_and_score_completions` (or the
+> rollout collation step) to run the structural `tool_error` detector +
+> `HintGenerator` + `ComposerDataCollator._build_sdpo_fields` on the generated
+> completions and stash `ctx_teacher_input_ids`/`sdpo_loss_mask` into `inputs`;
+> (2) **offline-trace mode** (what the smoke uses) — feed pre-ingested
+> error-bearing traces through the collator as the dataset, exercising the exact
+> loss path on CPU. Mode (2) is the test; mode (1) is production. The
+> `_compute_sdpo_loss` body is identical for both — it only reads the two SDPO
+> keys.
+---
+## 5. Weighting, scheduling, and guardrails
+So SDPO informs without swamping the policy gradient:
+1. **Scale.** Start `beta_sdpo = 0.1` (the library default `alpha_sdpo`), not the
+   `1.0` the smoke uses (the smoke over-weights deliberately to *prove the path
+   fires*). The Dr. GRPO PG loss is a `sum()` over response tokens; SDPO is a
+   `batchmean` JSD over error-turn tokens — different magnitudes. **Normalize
+   first:** log `loss/grpo` and `loss/sdpo_jsd` separately for the first ~50
+   steps and pick `beta_sdpo` so `beta_sdpo·sdpo_jsd ≈ 0.1–0.3 × |grpo|` at
+   steady state. Do **not** assume `0.1` is calibrated across reductions.
+2. **Warmup.** Linear `beta_sdpo` warmup over `sdpo_warmup_steps` (50–200) via
+   `_beta_sdpo_now()`. Early in training the policy is far from any sensible
+   distribution; a strong distillation pull then fights exploration. Let Dr. GRPO
+   establish a reward signal, then ramp SDPO in.
+3. **Per-token JSD clip = 0.05** (`sdpo_token_clip`, the OPSD `--jsd_token_clip`
+   default, `research/07` §1.1/§7). Prevents a few high-divergence **stylistic**
+   tokens at the error turn from dominating the distillation gradient — exactly
+   what it exists for.
+4. **Mask discipline.** SDPO supervises **only** `sdpo_loss_mask` tokens
+   (post-hint recovery). If the mask is all-ignore (empty-recovery error site,
+   ~67% of real Claude traces under `strip_thinking`), the collator already drops
+   the row (`data_collator.py` L308) — the channel silently no-ops rather than
+   emitting a degenerate ~ln(2) signal.
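Items 1 and 2 above can be sketched as two pure functions. Both are hypothetical helpers written for this note: `beta_sdpo_at` assumes the ramp starts at 0 and ends at the target weight, and `calibrate_beta_sdpo` assumes the logged `loss/grpo` and `loss/sdpo_jsd` values from the first ~50 steps are available as plain lists.

```python
def beta_sdpo_at(step, warmup_steps, beta_target):
    """Linear warmup from 0 to beta_target over warmup_steps (assumed shape)."""
    if warmup_steps <= 0:
        return beta_target
    return beta_target * min(1.0, step / float(warmup_steps))


def calibrate_beta_sdpo(grpo_losses, sdpo_losses, target_ratio=0.2):
    """Pick beta so that beta * mean(sdpo_jsd) ~= target_ratio * mean(|grpo|).

    target_ratio should sit in the 0.1-0.3 band suggested in item 1.
    """
    mean_grpo = sum(abs(x) for x in grpo_losses) / len(grpo_losses)
    mean_sdpo = sum(sdpo_losses) / len(sdpo_losses)
    if mean_sdpo <= 0.0:
        raise ValueError("SDPO never fired; nothing to calibrate against")
    return target_ratio * mean_grpo / mean_sdpo
```

For example, if the summed channel-1 loss averages 10 in magnitude and the `batchmean` JSD averages 0.5, a target ratio of 0.2 gives a weight of 4.0, far from the 0.1 default. That is the reduction mismatch item 1 warns about.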
+**KL-explosion / teacher-student-drift guardrails:**
+- **SDPO drift is bounded-by-construction.** Teacher = same weights + hint,
+  stop-grad. A *wrong* hint produces a noisier target at one masked turn, not a
+  corrupted reward (`research/07` §1.3). There is no replay buffer and no
+  separate teacher to drift apart — single-epoch keeps teacher and student on the
+  same weights.
+- **Watch `loss/sdpo_jsd` for collapse-to-zero or blow-up.** A *good* hint should
+  *raise* divergence at the hinted turn (it shifts mass toward the fix); a
+  persistently ~0 JSD means the hint isn't moving the teacher (prune that hint
+  source, `research/07` §7 item 7). A diverging JSD means the clip is too loose or
+  `beta_sdpo` too high — cap `beta_sdpo` and/or lower `token_clip`.
+- **Guard the GRPO k1 KL independently.** Dr. GRPO's own `beta` (KL-to-ref) is
+  the explosion guard for the *policy*; keep it at its tuned value. SDPO's `beta_sdpo`
+  must not be conflated with it (§3). If total loss NaNs, bisect by zeroing
+  `beta_sdpo` — if it persists, the bug is in channel 1, not SDPO.
+- **Shape-gate is a hard stop, logged.** If collator alignment regresses,
+  `_compute_sdpo_loss` skips the step with a warning rather than training on
+  aliased pad tokens (the silent-degenerate failure mode).
+---
+## 6. CPU-testable vs GPU-only, and the smoke plan
+### What is CPU-testable
+- **The whole SDPO loss path** — student forward + hint-conditioned teacher
+  forward + masked JSD + `.backward()` + `Adam.step()` — on **Qwen2.5-0.5B** with
+  1–2 error-bearing rollouts. This is *exactly* what
+  `examples/sdpo_real_trace_train_smoke/run.py` already proves for the free-function
+  `compose_loss` composer; the new test wraps it in the Dr. GRPO step.
+- **The additive composition** `total = drgrpo + beta_sdpo·sdpo` and the warmup
+  schedule (assert `beta_sdpo` ramps, assert `loss/sdpo_jsd>0` on ≥1 step, assert
+  a watched param moves).
+- **Dr. GRPO config pins** — assert `scale_rewards=False`, `num_iterations=1`,
+  k1-KL path selected (unit-level, no GPU).
+### What is GPU-only
+- **Real TRL `GRPOTrainer` rollout generation** (vLLM/transformers generation at
+  batch size + group size K) — too slow on CPU for a live step; this is the
+  production "mode (1)" path in §4.3.
+- **Async weight sync / off-policy control**, MoE router replay, multi-region
+  infra (`research/10` §4) — all out of scope for the SDPO channel test.
+- **Convergence / quality** (does SDPO actually improve error-recovery) — needs a
+  real RL run.
+### Minimal smoke plan (`examples/sdpo_drgrpo_step_smoke/run.py`)
+Analogous to the existing SDPO smoke; gates:
+1. Build a Dr. GRPO minibatch from 1–2 ingested **error-bearing** Qwen traces via
+   `ComposerDataCollator` (reuse `_discover_error_sessions` + the layered
+   `HintGenerator`); assert `sdpo_loss_mask` has ≥1 in-loss position.
+2. Construct a **synthetic Dr. GRPO channel-1 loss** standing in for
+   `super()._compute_loss` (advantages = `R - group_mean`, **no /std**; k1 KL
+   `−log r`; `sum()` reduction; no length-standardization) so the test runs
+   **without** spinning up TRL's full rollout machinery on CPU — mirrors how the
+   existing smoke uses LM-CE as the GRPO stub. Optionally also run a real
+   `GRPOTrainer._compute_loss` path under a `@pytest.mark.gpu` guard.
+3. `total = drgrpo_stub + beta_sdpo · _compute_sdpo_loss(...)`; `.backward()`;
+   `Adam.step()`.
+4. **Gates (exit 0 = PASS):** (a) all losses finite across steps; (b)
+   `loss/sdpo_jsd > 0` on ≥1 step (SDPO fired — shape-gate passed, hint
+   contributed real signal); (c) a watched parameter moved; (d) `beta_sdpo`
+   warmup increases monotonically; (e) zeroing `beta_sdpo` reproduces the pure
+   Dr. GRPO stub loss bit-for-bit (proves SDPO is purely additive). Exit 2 = SKIP
+   (no error-bearing sessions / no chat-template model), matching the existing
+   smoke's contract.
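A minimal sketch of the synthetic channel-1 stub from step 2, in plain Python so the arithmetic is visible without tensors. It assumes the fully on-policy case (importance ratio of 1), so the policy term reduces to `-advantage * logp`; `kl_coef` and the function name are hypothetical, and the real smoke would compute the same quantities on tensors.

```python
def drgrpo_stub_loss(rewards, logp_policy, logp_ref, kl_coef=0.01):
    """Channel-1 stand-in for one prompt group.

    rewards:     one scalar reward R per rollout in the group
    logp_policy: per-rollout lists of per-token log-probs under the policy
    logp_ref:    the same tokens scored by the reference policy
    """
    group_mean = sum(rewards) / len(rewards)
    total = 0.0
    for reward, policy_lps, ref_lps in zip(rewards, logp_policy, logp_ref):
        advantage = reward - group_mean      # mean-centered only, no /std
        for lp, lp_ref in zip(policy_lps, ref_lps):
            k1 = lp - lp_ref                 # -log r with r = p_ref / p_policy
            total += -(advantage * lp) + kl_coef * k1
    return total                             # plain sum, no per-length division


loss = drgrpo_stub_loss(
    rewards=[1.0, 0.0],
    logp_policy=[[-1.0, -2.0], [-0.5]],
    logp_ref=[[-1.0, -2.0], [-0.5]],
)
```

Because the reduction is a plain sum, a longer rollout contributes more terms rather than being rescaled by its length, which is the behavior the stub is meant to mirror.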
+This is ~$0, CPU, single-process, and closes the one unproven edge: **a live
+Dr. GRPO update step with the SDPO channel on**, end-to-end on a real HF model.
+---
+## 7. Citations
+- **In-repo (authoritative substrate):** `composer_replication/loss.py`
+  (`compose_loss` 3-channel composer + `generalized_jsd_loss` call);
+  `recipes/prime_rl/composer_loss.py` (PRIME-RL adapter; SDPO `NotImplementedError`
+  at L257-268; parity-verified channel 1); `recipes/prime_rl/prime_rl_recipe.md`
+  (LossInputs shape, log-probs-not-logits limitation);
+  `trainer/composer_trainer.py` (`ComposerReplicationTrainer._compute_loss` and
+  `_compute_sdpo_loss` — the existing, correct SDPO head);
+  `trainer/data_collator.py` (`ctx_teacher_input_ids` + `sdpo_loss_mask` +
+  `_build_aligned_student_for_sdpo` equal-length alignment; hint-AND-recovery
+  gate L308); `examples/sdpo_real_trace_train_smoke/run.py` (the proven CPU
+  forward+backward+step harness this design's smoke extends).
+- **`research/10-composer2-techreport-mining.md`** — the Dr. GRPO target:
+  length-standardization removed, no std-dev advantage normalization, **k1**
+  (`−log r`) KL not k3, Adam, single-epoch (a prompt never trained twice).
+  arXiv:2603.24477 §4.1.
+- **`research/07-sdpo-hint-generator.md`** — `HintGenerator` Protocol + layered
+  composite, error-site detection alignment with ingestion `tool_error`, OPSD
+  `--jsd_token_clip` stabilizer, the "wrong hint is bounded-bad" property.
+- **SDPO** — arXiv:2601.20802 (on-policy self-distillation; teacher = same model
+  conditioned on feedback; student is gradient-bearing, teacher is stop-grad; per-token
+  KL on the student trajectory). **OPSD** — arXiv:2601.18734 (privileged-info
+  teacher, generalized-JSD, token clip).
+- **TRL** — `huggingface/trl@main` `trl/trainer/grpo_trainer.py`
+  (`GRPOTrainer` loss-override surface; confirmed via
+  `mcp_exa_get_code_context_exa`). The `_compute_loss(self, model, inputs)`
+  internal hook is what this repo already overrides; the public
+  `compute_loss(self, model, inputs, return_outputs=False,
+  num_items_in_batch=None)` HF-Trainer wrapper is shimmed in §4.1 for
+  version-robustness. (A confirmatory DeepWiki lookup timed out; the §4.1 guard
+  is written to work under either surface.)
+```

research/09-composer-blog-delta-2026.md ADDED

	@@ -0,0 +1,76 @@

+# Composer 2.5 Blog — Delta Note (re-extraction)
+> **Re-extraction date:** 2026-05-28.
+> **Method:** Live re-pull of [cursor.com/blog/composer-2-5](https://cursor.com/blog/composer-2-5) and [the Composer 2 technical-report blog](https://cursor.com/blog/composer-2-technical-report) via `mcp_tavily_tavily_extract` (advanced); arXiv abstract pulls for the three footnote-1 papers; one Tavily secondary-source sweep (Jake Handy, DataCamp, Pulse2, Kingy, TechTalks).
+> **Scope:** DELTAS ONLY vs `docs/COMPOSER_RECIPE_MAPPING.md` (2026-05-25), focused on (a) data generation / synthetic tasks / data mix / CPT data, and (b) targeted-textual-feedback / on-policy distillation. Verbatim blog text in the mapping doc is **not** re-derived here.
+## TL;DR
+The **2.5 blog body is byte-for-byte unchanged** from what the mapping doc captured — no edits to the three method sections. All deltas come from (1) the **Composer 2 technical-report blog**, which is now cited as the K2.5 base-model source and which the mapping doc only listed as a stub "verify if needed," and (2) tighter sourcing on the three self-distillation papers. Net: a handful of real new facts, two corrections, and confirmation of the central reproducibility gap (hint generation) as still unstated.
+---
+## 1. Data generation / synthetic tasks / data mix / CPT — deltas
+**Blog 2.5 verbatim sentences on data-gen (re-confirmed, unchanged from mapping doc):**
+- *"To continue increasing intelligence, we both select for and create harder tasks dynamically throughout the run."*
+- *"Composer 2.5 is trained with 25x more synthetic tasks than Composer 2."*
+- *"We use a range of approaches for creating synthetic tasks that are grounded in real codebases."*
+- Feature-deletion paragraph + reward-hacking examples (Python type-checking cache; Java bytecode decompile) — verbatim as in mapping doc §2.
+**DELTAS (not in / under-stated in COMPOSER_RECIPE_MAPPING.md):**
+- **[DELTA — new emphasis]** The phrase *"we both **select for** and **create** harder tasks **dynamically throughout the run**"* is a **dynamic curriculum / online task-selection** signal. The mapping doc captured "Feature Deletion + 24 unnamed generators" but did **not** flag that task difficulty is filtered *online* (the model "begins to get most training problems correct," so hard tasks are up-weighted live). This is a data-*mix*/curriculum detail with direct replication impact: our generator suite needs a difficulty filter / pass-rate gate, not just a static task bank.
+- **[DELTA — new authoritative source for CPT data mix]** The Composer 2 technical-report blog states the CPT data mix explicitly: *"continued pretraining on a data mix that **emphasizes code** to deepen the base model's coding knowledge"* and *"We find that **reducing pretraining loss improves downstream RL performance**, with better base knowledge reliably translating into a better agent."* The mapping doc marked "continued pretraining on heavily code-weighted data" as `[BLOG-VERIFIED]` from the 2.5 Muon section — but the **causal claim (CPT loss ↓ ⇒ RL performance ↑)** is new and is the stated *justification* for doing CPT at all. Relevant to our "skip CPT, start from Qwen3-Coder" decision: Cursor's own evidence says base-knowledge quality gates RL ceiling, which strengthens the case for starting from an already-code-tuned base.
+- **[DELTA — new artifact]** There is now a **full Composer 2 arXiv technical report: [arXiv:2603.24477](https://arxiv.org/abs/2603.24477)** and a downloadable PDF at **`https://cursor.com/resources/Composer2.pdf`** (authored by Sasha Rush et al.). The report explicitly *"covers... ablations on the training recipe, our approach to agent behavior shaping, and the design of our evaluation suite."* The mapping doc cited only the blog stub and never the arXiv ID/PDF. **This PDF is the most likely place to resolve the data-mix weighting %, the RL algorithm name, and the hint-generation mechanism — none of which are in either blog.** → Recommend a dedicated follow-up extraction of Composer2.pdf.
+- **[CONFIRM — "Anyrun"]** Mapping doc flagged "Anyrun" as possibly not Cursor-sourced. **Confirmed real:** the Composer 2 report blog says *"**Anyrun**, our internal compute platform for running hundreds of thousands of sandboxed coding environments."* It is a Composer-**2** artifact (carried into 2.5), correctly attributed. Resolves the mapping doc's open flag.
+- **[NO CHANGE]** No data-mix percentages, token counts, generator inventory, or feature-deletion target-selection heuristic appear in either blog. Still the key data-gen reproducibility gap.
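One way to read the "difficulty filter / pass-rate gate" called for above is a running pass-rate band per task. The sketch below is an assumption-heavy illustration: the class name, the `low`/`high` band, and `min_attempts` are all hypothetical, and neither blog states how Cursor selects tasks.

```python
from collections import defaultdict


class PassRateGate:
    """Keeps tasks whose running pass rate sits inside a target band."""

    def __init__(self, low=0.1, high=0.8, min_attempts=4):
        self.low, self.high, self.min_attempts = low, high, min_attempts
        self.attempts = defaultdict(int)
        self.passes = defaultdict(int)

    def record(self, task_id, passed):
        """Log one rollout outcome for a task."""
        self.attempts[task_id] += 1
        self.passes[task_id] += int(passed)

    def keep(self, task_id):
        """True if the task should stay in the sampling pool."""
        n = self.attempts[task_id]
        if n < self.min_attempts:
            return True          # too few attempts to judge difficulty yet
        rate = self.passes[task_id] / n
        return self.low <= rate <= self.high


gate = PassRateGate()
for outcome in (True, True, True, True):
    gate.record("easy-task", outcome)
for outcome in (True, False, False, False):
    gate.record("hard-task", outcome)
```

Dropping tasks above the `high` bound removes problems the model already solves; the `low` bound is an extra guard against tasks that yield no learning signal at all.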
+---
+## 2. Targeted textual feedback / on-policy distillation — deltas
+**Blog 2.5 verbatim (re-confirmed, unchanged):** the full "Targeted RL with textual feedback" section — hint→teacher / original-context→student / on-policy KL "for that turn only," applied "to a variety of model behaviors, from coding style to model communication." Matches mapping doc §1 exactly.
+**DELTAS / sharper detail:**
+- **[DELTA — verbatim mechanism nuance the mapping doc compressed]** The blog's causal sentence is more specific than the mapping doc's paraphrase: *"This hint **changes the probabilities for the teacher, lowering those for the wrong tool and increasing those for a valid replacement**. For that turn only, we then **update the student weights towards to the new probabilities**."* Two implementation facts to lift exactly: (i) the teacher distribution is the **hint-conditioned forward pass of the same weights** (not a re-rollout), and (ii) the **student weights are updated** (the KL is a gradient-bearing loss on the student, teacher is stop-grad). Mapping doc had this right conceptually; the verbatim confirms teacher = stop-grad, student = trainable.
+- **[DELTA — secondary-source confirmation, not new fact]** Multiple write-ups (Pulse2, TechTalks) independently describe the mechanism identically ("injects a local textual hint... teacher distribution... student... on-policy KL"). No secondary source reveals **how the hint is generated** — the single most important replication gap from the mapping doc **remains unresolved** across all live sources. No source claims templates vs. LLM-judge vs. learned generator.
+- **[DELTA — coverage breadth]** Blog explicitly lists the behavior targets as **coding style, tool use, and model communication** (Pulse2/DataCamp corroborate). Mapping doc noted style+communication; "tool use" as a distinct third target is worth recording for the v0.1 hint-template taxonomy.
+- **[WATCH — likely secondary-source conflation, flag do-not-cite]** TechTalks (bdtechtalks, 2026-05-25) introduces an **"SDFT"** continued-pretraining self-distillation story ("model generates its own reasoning... distills its own generated logic... constrains weight shift... adapt to a company's coding style without forgetting"). **This is NOT in the Cursor blog** and conflates the footnote-1 *continual-learning* paper (2601.19897) with Composer's RL method. Treat as journalist embellishment, **not** Cursor-stated. Recorded here so a future reader doesn't mistake it for ground truth.
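The "for that turn only" mechanism can be sketched as a masked per-token divergence between two forward passes of the same weights. This plain-Python version only computes the loss value; the KL direction (teacher to student) and all names are assumptions, since the blog says "on-policy KL" without giving a formula.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def turn_kl(teacher_logits, student_logits, turn_mask):
    """Mean KL(teacher || student) over the masked positions only.

    teacher_logits: logits from the hint-conditioned pass (fixed target)
    student_logits: logits from the original-context pass (trainable side)
    turn_mask:      1 for tokens in the targeted turn, 0 elsewhere
    """
    total, count = 0.0, 0
    for t_row, s_row, in_turn in zip(teacher_logits, student_logits, turn_mask):
        if not in_turn:
            continue
        t_lp, s_lp = log_softmax(t_row), log_softmax(s_row)
        total += sum(math.exp(t) * (t - s) for t, s in zip(t_lp, s_lp))
        count += 1
    return total / count if count else 0.0


teacher = [[2.0, 0.0], [0.0, 0.0]]   # hint shifts mass at position 0
student = [[0.0, 0.0], [5.0, 0.0]]   # position 1 differs but is outside the turn
value = turn_kl(teacher, student, turn_mask=[1, 0])
```

In a tensor implementation the teacher logits would be detached so that gradients reach only the student pass, matching the stop-grad reading above.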
+---
+## 3. Footnote-1 self-distillation papers — arXiv IDs resolve, one-line each
+All three IDs **resolve live** (2026-05-28). Note the mapping doc's footnote ordering differs from the blog's; blog footnote-1 lists them in this order:
+| arXiv | Title | Resolves? | One-line core method (abstract-level) |
+|---|---|---|---|
+| **[2601.19897](https://arxiv.org/abs/2601.19897)** | *Self-Distillation Enables Continual Learning* | ✅ v1 | Self-distillation as a continual-learning regularizer — anchor updates to the model's own prior outputs to acquire new data without catastrophic forgetting. (Abstract body not exposed in listing; title-level only.) |
+| **[2601.20802](https://arxiv.org/abs/2601.20802)** | *Reinforcement Learning via Self-Distillation* (**SDPO**) | ✅ v2 (sub 28 Jan, rev 16 Feb 2026) | **The direct formalization of Composer's method.** "SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy... converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model." |
+| **[2601.18734](https://arxiv.org/abs/2601.18734)** | *Self-Distilled Reasoner: On-Policy Self-Distillation for LLMs* (**OPSD**) | ✅ v3, code at [github.com/siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD) | Single LLM; teacher = policy conditioned on privileged info, student = policy without it; per-token on-policy KL on the student's own rollouts. The original OPSD framework. |
+**Corrections to the mapping doc:**
+- **[CORRECTION — SDPO authorship]** Mapping doc says "Hübotter et al., 2026." **Confirmed and now precise:** Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, **Andreas Krause** (ETH Zürich group). 11 authors, CC-BY-4.0, v2 dated 2026-02-16.
+- **[CORRECTION — SDPO abstract claim]** Mapping doc's quoted SDPO comparison-table framing ("environment / rich / on-policy") is a **reasonable gloss but not a verbatim abstract quote**. The actual abstract's strongest reproducibility-relevant claim is the **test-time** result: *"applying SDPO to individual questions at test time... achieving the same discovery probability as best-of-k... with 3x fewer attempts"* and that SDPO *"also outperforms baselines in standard RLVR environments that only return scalar feedback by using successful rollouts as implicit feedback for failed attempts."* The "successful-rollouts-as-implicit-feedback" trick is a **new lever** the mapping doc didn't capture — relevant if our hint generator is weak/absent (you can bootstrap hints from the model's own successful sibling rollouts). Benchmark: **LiveCodeBench v6**.
+- **[CONFIRM — OPSD code]** `siyan-zhao/OPSD` link confirmed live in the arXiv comments field. The mapping doc's "lift the SDPO loss from OPSD (MIT)" plan stands; verify license on the repo directly (arXiv page shows CC-BY-4.0 for the *paper*, not necessarily the code).
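The "successful rollouts as implicit feedback" lever can be sketched as a per-group lookup: each failed rollout borrows a successful sibling as its feedback text. The record layout (`reward`, `text` keys), the threshold, and the choice of the highest-reward sibling are assumptions for illustration, not details from the SDPO abstract.

```python
def sibling_hints(group, success_threshold=1.0):
    """Map each failed rollout's index to a successful sibling's text.

    group: list of dicts with hypothetical keys "reward" and "text",
           all sampled from the same prompt.
    """
    successes = [r for r in group if r["reward"] >= success_threshold]
    if not successes:
        return {}                      # no implicit feedback for this group
    best = max(successes, key=lambda r: r["reward"])
    return {
        i: best["text"]
        for i, r in enumerate(group)
        if r["reward"] < success_threshold
    }


group = [
    {"reward": 0.0, "text": "called missing tool"},
    {"reward": 1.0, "text": "used grep then edited file"},
    {"reward": 0.0, "text": "timed out"},
]
hints = sibling_hints(group)
```

Groups with no success return an empty mapping, so a caller can fall through to another hint source for those prompts.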
+---
+## 4. RL algorithm / reward-hacking / behavioral reward — deltas
+- **[NO CHANGE — RL algorithm name still absent]** Neither blog names the outer RLVR algorithm (PPO/GRPO/DAPO). Mapping doc's "PPO or GRPO variant `[EXTRAPOLATED]`" verdict stands. The Composer 2 report blog adds only that *"RL training improves both **average and best-of-K** performance, suggesting the model is learning new solution paths rather than just concentrating on known ones"* — an **algorithm-agnostic** observation (best-of-K ↑ implies exploration, not just sharpening). **arXiv:2603.24477 / Composer2.pdf is the place to find the algorithm name.**
+- **[DELTA — reward-hacking mitigation, slightly sharper]** Blog wording: hacks were found *"using **agentic monitoring tools**"* and they *"demonstrate the increasing care necessary for large scale RL."* Still no specifics (no static-analysis/sandbox-lockdown detail). Mapping doc's "build the monitor in v0.1" plan is unaffected; no new implementation handle.
+- **[DELTA — behavioral reward framing]** 2.5 blog: *"we improved behavioral aspects of the model like **communication style and effort calibration**. These dimensions are not well captured by existing benchmarks, but we find that they matter for real-world usefulness."* Mapping doc captured this as `[BLOG-VERIFIED]`. Delta: the blog **strongly implies** (via "we applied this method to a variety of model behaviors, from coding style to model communication") that **behavioral rewards are trained via the targeted-textual-feedback channel itself**, not a separate RM. The Composer 2 report blog also promises *"our approach to **agent behavior shaping**"* as a report section → another reason to pull Composer2.pdf.
+- **[DELTA — net-new context, out of scope but record]** 2.5 blog adds a **SpaceXAI / Colossus 2** paragraph: *"Together with SpaceXAI, we're training a significantly larger model from scratch, using **10x more total compute**. With Colossus 2's **million H100-equivalents**..."* This is a *future from-scratch* model, **not** Composer 2.5's recipe — irrelevant to replication but absent from the mapping doc; recorded so it isn't mistaken for a 2.5 training fact.
+---
+## Action items surfaced by this delta pass
+1. **Pull `https://cursor.com/resources/Composer2.pdf` (arXiv:2603.24477)** — highest-value unread artifact; likely resolves data-mix %, RL-algo name, hint-generation, behavior-shaping. (Recommend a dedicated subagent.)
+2. **Add an online difficulty filter / pass-rate gate** to the synthetic-task generator plan (the "select for... dynamically" delta), not just a static bank.
+3. **Record the SDPO "successful-rollout-as-implicit-feedback" trick** as a hint-bootstrapping fallback for v0.1 when no external hint source exists.
+4. **Update mapping doc citations**: resolve the Anyrun flag (confirmed), add arXiv:2603.24477 + Composer2.pdf, correct SDPO author list, add LiveCodeBench-v6 as SDPO's eval, and append a do-not-cite note on the TechTalks "SDFT" conflation.
+5. Hint-generation mechanism **remains the #1 reproducibility gap** — unresolved by every live source checked.

research/10-composer2-techreport-mining.md ADDED

	@@ -0,0 +1,136 @@

+# Composer 2 Technical Report — Mining Notes (arXiv:2603.24477)
+> **Extraction date:** 2026-05-28.
+> **Primary source:** Full text of the **Composer 2 Technical Report** (Cursor Research Team; corresponding author Alexander M. "Sasha" Rush), PDF at `https://cursor.com/resources/Composer2.pdf` and arXiv `2603.24477` (v1 25 Mar 2026, v2 26 Mar 2026; cs.SE / cs.LG; "Aaron Chan and 53 other authors").
+> **Method:** `mcp_tavily_tavily_extract` (advanced) on the PDF returned the **complete report body incl. References + Appendices A–C** (~148 KB). Cross-checked against `mcp_exa_crawling_exa` (full re-pull, identical text) and a `mcp_tavily_tavily_search` confirming the arXiv ID, abstract, "Dr. GRPO" passage, and the technical-report blog.
+> **Tagging:** **[REPORT-VERIFIED]** = verbatim/paraphrase from the arXiv report. **[SECONDARY]** = blog/third-party. **[ABSENT]** = explicitly looked for, not in the report.
+> **Scope note:** This report is **Composer 2**, not Composer **2.5**. Several recipe items the 2.5 blog advertises (targeted textual-feedback/hint distillation, "25× synthetic tasks", Sharded Muon) are **not** in this document — see §3 and the corrections box.
+---
+## TL;DR — did it resolve the three open questions?
+| Open question (from delta note 09) | Resolved? | Answer |
+|---|---|---|
+| **RL algorithm NAME** | ✅ **YES** | A multi-sample policy-gradient (GRPO-family) algorithm built explicitly on **Dr. GRPO** [34]: GRPO with the **length-standardization term removed** and **no std-dev advantage normalization**. Optimizer = **Adam**, single-epoch, fixed group size, full-parameter. KL via the **k1 estimator (−log r)**. |
+| **Data-mix weighting % / generator inventory / token counts** | ⚠️ **PARTIAL** | CPT is a **3-phase code-dominated mix** (32k → 256k → SFT) but **no %s and no token counts** are given. RL task mix is given only as a **category histogram (Fig. 3)**, not generator names or weights. No "Feature Deletion" generator inventory (that was 2.5-blog). |
+| **HINT-generation mechanism (targeted textual feedback)** | ❌ **ABSENT** | **The hint/teacher-student textual-feedback mechanism is NOT in the Composer 2 report at all.** It is a Composer **2.5** feature. Composer 2 shapes behavior with **auxiliary scalar rewards + a nonlinear length penalty**, not hint distillation. The #1 reproducibility gap remains unresolved by this artifact. |
+**Net:** The report fully answers the RL-algorithm question (the single biggest win), partially answers the data-mix question, and does **not** touch hint generation. It also delivers a large amount of previously unstated **infrastructure** detail (Anyrun internals, async RL stack, MoE router replay, precision recipe) and **corrections** to two prior assumptions (the optimizer is Adam, not Muon; the base is Kimi K2.5 1.04T/32B).
+---
+## 1. Data generation / CPT data-mix / curriculum  [§3, §4]
+### 1.1 Continued pretraining (CPT) — [REPORT-VERIFIED]
+- Base model = **Kimi K2.5** [67], a **1.04T-param / 32B-active MoE** (Appendix B; selected over GLM-5 and DeepSeek V3.2 on internal *FreshBench* knowledge, *State Tracking* (LoCoDiff-style), and *codebase perplexity*; agentic benchmarks **deliberately excluded** from base-model selection "as agentic and long-horizon capabilities can drastically change during the RL stage").
+- CPT is **"a large code-dominated data mix"** done in **three phases**:
+  1. **Bulk of compute at 32k sequence length**,
+  2. a shorter **long-context extension phase to 256k**,
+  3. a short **SFT phase on targeted coding tasks**.
+- Training: **MXFP8 on NVIDIA B300s**, **AdamW** optimizer. Eval loss on internal codebase **"decreases log-linearly"** over the run.
+- **Causal CPT→RL claim (the justification for doing CPT):** they replicate the recipe on **Qwen3-Coder-30B-A3B** at **three log-spaced compute levels (small/medium/large)**, each + identical SFT + identical RL run, and show **"cross-entropy loss is … predictive of downstream RL performance"** (Fig. 2). → Direct support for our "start from an already-code-strong base" decision.
+- **Multi-Token Prediction (MTP):** extra MTP layers [17,11] trained from scratch on the same mix for speculative decoding, via **self-distillation** to the main LM head's logits; MTP layers cut from the **middle** of the CPT run and trained jointly during the long-context + SFT phases. *(This is the only "self-distillation" in the report — it is for MTP/spec-decode, NOT for hints.)*
+- **[ABSENT]** No data-mix percentages, no token/byte counts, no list of CPT data sources.
+### 1.2 RL task distribution & dynamic curriculum — [REPORT-VERIFIED]
+- RL tasks **"run in environments that emulate real Cursor sessions as closely as possible."** Problem distribution **"reflects the most common use cases"**; **Fig. 3** gives the category breakdown (x-axis "% of Problems", ~0–40%): **Iterate On Feature, Debugging, New Feature, Refactor, Understanding Codebase, Documentation, Testing, Code Review, Optimize, Devops, Migration, Deletion, Other.** *(This is the closest the report gets to a "data mix" — categorical, not weighted %s, no generator names.)*
+- **Dynamic difficulty curriculum (verbatim):** *"In later stages of training, we use simple heuristics—such as **number of turns and thinking tokens of rollouts**—to **upsample increasingly harder data points**."* → Confirms delta note 09's "select for harder tasks dynamically" as an **online up-sampling gate keyed on turns + thinking-token count**. Replication handle: rank tasks by rollout length/turn-count, up-weight the long-tail late in training.
+- **[ABSENT]** No synthetic-task **generator inventory** (no "Feature Deletion" et al.), no "25× synthetic tasks" figure, no synthetic-vs-real split. Those are Composer **2.5**-blog claims and are **not** in this report.
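The replication handle above (rank by turns and thinking tokens, up-weight the long tail late in training) can be sketched as a sampling-weight function. The score formula, the two scale constants, and the `progress` exponent are assumptions; the report only says "simple heuristics" are used.

```python
def difficulty_weights(stats, progress, turn_scale=10.0, token_scale=2000.0):
    """Sampling weights over tasks.

    stats:    list of (num_turns, thinking_tokens) observed per task
    progress: training progress in [0, 1]; 0 gives uniform sampling
    """
    scores = [turns / turn_scale + tokens / token_scale for turns, tokens in stats]
    # The exponent grows with progress, so harder tasks gain weight over time.
    raw = [(1.0 + s) ** progress for s in scores]
    norm = sum(raw)
    return [w / norm for w in raw]


stats = [(2, 100), (10, 4000), (30, 20000)]
early = difficulty_weights(stats, progress=0.0)
late = difficulty_weights(stats, progress=1.0)
```

Tying the exponent to progress keeps early sampling uniform and shifts mass toward long rollouts only in later stages, in line with the quoted sentence.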
+---
+## 2. RL ALGORITHM  [§4.1] — [REPORT-VERIFIED], the headline result
+**Algorithm family:** *"a policy gradient algorithm with multiple samples per prompt [53 = DeepSeekMath/GRPO, 2 = REINFORCE-style RLOO] and a fixed group size."* Operates in the **single-epoch regime** (a prompt is **never trained on twice**). **Adam** optimizer; **full-parameter** update. Highly **asynchronous** (independent train + rollout workers).
+**Specific GRPO modifications (the "name" + the deltas):**
+- Built on **Dr. GRPO** [34 = Liu et al., *Understanding R1-Zero-like training*, arXiv 2503.20783]: verbatim *"As in Dr. GRPO, … crucial to minimize the bias in the gradients that can arise from transforming the underlying advantage."*
+- **Remove the length-standardization term from GRPO** (it "introduces a length bias").
+- **Do NOT normalize group advantages by their standard deviation** — std-norm "results in the degenerate case where small behavioral differences get massively upweighted within a group where every rollout achieves equal correctness."
+- **Overlong-rollout masking [78 = DAPO/Yu et al.]: NOT used.** They *"did not see benefits with overlong masking at small scale and opted not to mask rollouts that exceed the maximum sequence length"*; the self-summary system limits overlong cases anyway. *(So: Dr. GRPO-style, explicitly NOT DAPO's overlong masking; DAPO [78] and GSPO [82] are cited but as related work / for router-replay, not adopted wholesale.)*
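The degenerate case quoted above is easy to reproduce numerically. The sketch below compares std-normalized advantages with mean-centered ones on a group where every rollout is equally correct; the reward values and `eps` are made up for illustration.

```python
import statistics


def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style: mean-centered, then divided by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def drgrpo_advantages(rewards):
    """Dr. GRPO-style: mean-centered only."""
    mean = statistics.fmean(rewards)
    return [r - mean for r in rewards]


# All rollouts are correct (1.0); they differ only by a tiny auxiliary term.
rewards = [1.000, 1.001, 1.000, 1.001]
normalized = grpo_advantages(rewards)     # magnitudes close to 1
centered = drgrpo_advantages(rewards)     # magnitudes close to 0.0005
```

With std normalization a 0.001 reward gap produces advantages of roughly unit size, the same scale a pass/fail difference would produce, while mean-centering leaves the gap small.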
+**KL regularization — exact formulation [§4.1, Fig. 4]:**
+- Uses **KL(q‖p) = E_{x∼q}[−log r(x)], r(x)=p(x)/q(x)** for regularization (like DeepSeekMath [53] and Kimi k1.5 [66]).
+- **Chooses the k1 estimator `k1 = −log r`** over the popular **k3 = (r−1) − log r** [Schulman 52], because (citing Amini et al. [6]) k3's variance "increases drastically as p and q diverge": at large KL the variance of the k3 estimate is "extremely large." (Per their note, k2 is nearly unbiased but still biased.) → **Replication handle: use the simple `−log r` KL penalty, not the k3 unbiased estimator, for agentic long-horizon RL.**
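The variance argument can be checked with a small Monte Carlo experiment. The sketch assumes two unit-variance Gaussians, q = N(0, 1) and p = N(mu, 1), chosen only because KL(q‖p) = mu²/2 is known in closed form; this is an illustration, not the report's Fig. 4 setup.

```python
import math
import random
import statistics


def estimator_variances(mu, n=20000, seed=0):
    """Sample x ~ q = N(0, 1) and return (var of k1, var of k3)."""
    rng = random.Random(seed)
    k1, k3 = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        log_r = mu * x - 0.5 * mu * mu          # log p(x) - log q(x)
        k1.append(-log_r)                       # k1 = -log r
        k3.append(math.expm1(log_r) - log_r)    # k3 = (r - 1) - log r
    return statistics.variance(k1), statistics.variance(k3)


near_k1, near_k3 = estimator_variances(mu=0.1)   # p and q close together
far_k1, far_k3 = estimator_variances(mu=3.0)     # p and q far apart
```

When the distributions are close, k3 has the smaller variance, which is why it is popular; once they diverge, the `r − 1` term is dominated by rare large ratios and the k3 variance overtakes k1.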
+**Async-rollout infra / off-policy control [§4.1, §6.2]:**
+- Minimize off-policyness via **fast weight sync + in-flight (mid-rollout) weight updates**, *"similar to **PipelineRL** [48]"* — inference workers update weights mid-rollout so later tokens are less off-policy.
+- **MoE router replay [38, 82]:** inference engine returns selected expert indices per token per MoE layer; training forward pass **overrides the router's expert assignment to match** (router still computes gating scores so gradients flow). They **extend** replay by **filtering replayed experts whose gating scores fall below a plausibility threshold from the router's own top-k, replacing them with the router's candidates** — reduces p99 numerics mismatch between inference and training forward passes. *(Critical for MoE-base RL stability; directly relevant if we RL a MoE.)*
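A rough sketch of router replay with the plausibility filter, for one token in one MoE layer. The report does not give the threshold rule, so `min_ratio` (a floor relative to the lowest score in the router's own top-k), the function name, and the assumption of non-negative gating scores are all hypothetical.

```python
def replay_experts(gate_scores, replayed, top_k, min_ratio=0.5):
    """Pick the experts used in the training forward pass.

    gate_scores: the training router's gating score per expert
    replayed:    expert indices reported by the inference engine (len == top_k)
    min_ratio:   hypothetical plausibility threshold, relative to the lowest
                 score inside the training router's own top-k
    """
    ranked = sorted(range(len(gate_scores)), key=lambda e: -gate_scores[e])
    own_top_k = ranked[:top_k]
    floor = min_ratio * gate_scores[own_top_k[-1]]

    # Keep replayed experts that the training router also finds plausible.
    chosen = [e for e in replayed if gate_scores[e] >= floor]
    # Fill the freed slots with the router's own best unused candidates.
    for e in own_top_k:
        if len(chosen) == top_k:
            break
        if e not in chosen:
            chosen.append(e)
    return chosen


scores = [0.5, 0.3, 0.2, 0.0]
plausible = replay_experts(scores, replayed=[0, 2], top_k=2)
implausible = replay_experts(scores, replayed=[0, 3], top_k=2)
```

In the second call expert 3 has a gating score of zero under the training router, so it is replaced by the router's own candidate instead of being forced through.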
+**Reward structure [§4.1–4.2]:**
+- Reward based on **"code's correctness, succinctness, and conformance to software engineering principles."**
+- **best-of-K does NOT trade off vs average:** both rise together over training (Fig. 5) → RL is *expanding* solution coverage, not just sharpening (notable vs the "RL only concentrates mass" literature [79,32,8,74,61]).
+**Reward-hacking safeguards — [ABSENT/THIN]:** This report does **not** contain the Python-typecheck-cache / Java-bytecode reward-hack anecdotes (those are 2.5-blog). The only related safeguards here are **strict tool-argument checks** and **tool removal for steerability** in training environments (§6.2), and general monitoring for **emergent behaviors** (§4.2). No dedicated "agentic monitoring tool" section.
+---
+## 3. Targeted textual feedback / hint distillation  — **[ABSENT]**
+**Finding: The Composer 2 technical report contains NO hint-generation / teacher-student textual-feedback / on-policy KL-to-hint-conditioned-teacher mechanism.** Searched the full text for hint / teacher / student / textual feedback / distill — the only "distillation" is **MTP self-distillation to the LM head's logits** (§3.1, spec-decode), unrelated to behavior shaping.
+**What Composer 2 does for behavior shaping instead [§4.2 "Agent Behavior"] — [REPORT-VERIFIED]:**
+- **Auxiliary scalar rewards**, not hints: *"we apply an array of auxiliary rewards … rewards for coding style, communication, and product-specific penalties for poor tool calls, such as creating to-do list items and then leaving them unfinished."*
+- **Reactive reward addition:** they "monitor the model for emergent behaviors and occasionally introduce additional behavior rewards" (examples observed: leaving long CoT in code comments; collapsing to terminal-tool-only).
+- **Nonlinear length / effort penalty (exact equation):**
+  `C_length{k,q}(x) = ((1 + k·x)^{1−q} − 1) / (k·(1−q))`, concave-down & increasing, where **x = a weighted combination of {thinking tokens, tool-calling tokens, tool-output tokens, final-message tokens, # tool calls, # turns}** and `k, q` are curvature hyperparameters (Fig. 6). Goal: be quick on easy tasks, think longer on hard tasks; observed to induce **parallel tool calls**.
+- **Self-Summarization [§4.1, from Composer 1.5 [64]]:** rollouts are chains joined by self-summaries; **final reward is assigned to all tokens in the chain** (up-weights good agent turns *and* the summaries that enabled them; down-weights lossy summaries). Reduces error vs prompt-based compaction while using fewer tokens and reusing KV cache.
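The length-penalty equation above translates directly into code. The sketch below adds a `q = 1` branch using the standard limit `log(1 + k·x) / k`; the sample values of `k`, `q`, and `x` are arbitrary, and how the report weights the components of `x` is not stated.

```python
import math


def c_length(x: float, k: float, q: float) -> float:
    """C_length{k,q}(x) = ((1 + k*x)**(1 - q) - 1) / (k * (1 - q))."""
    if abs(1.0 - q) < 1e-12:
        # Limit of the expression above as q -> 1.
        return math.log1p(k * x) / k
    return ((1.0 + k * x) ** (1.0 - q) - 1.0) / (k * (1.0 - q))


k, q = 0.01, 0.5
values = [c_length(x, k, q) for x in (0.0, 100.0, 200.0, 300.0)]
steps = [b - a for a, b in zip(values, values[1:])]
# Increasing penalty with shrinking increments, i.e. concave-down.
assert all(s > 0 for s in steps)
assert all(s2 < s1 for s1, s2 in zip(steps, steps[1:]))
```

With `q = 0` the penalty is linear in `x`; larger `q` flattens the curve, so extra effort on an already long rollout costs less than the same effort on a short one.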
+> **Implication for the replication framework:** To reproduce Composer **2.5**'s hint mechanism we still must look elsewhere — the **SDPO (arXiv 2601.20802) / OPSD (2601.18734)** papers from delta note 09 remain the only formalizations, and **how Cursor generates the hint text itself is still unstated in every Cursor artifact.** Composer 2's behavior shaping (auxiliary rewards + the length-penalty equation above) is a **fully reproducible, hint-free alternative** we can adopt for v0.1.
+---
+## 4. Other replication-relevant detail [§6 Infrastructure, §5 CursorBench, App.] — [REPORT-VERIFIED]
+**Optimizer — CORRECTION:** report says **AdamW (CPT) / Adam (RL)**. **There is NO "Sharded Muon" in the Composer 2 report** — the Muon claim came from the 2.5 blog and should be tagged 2.5-only / re-verified, not assumed for Composer 2.
+**Parallelism / sharding layout — CORRECTION to "HSDP":**
+- Prior stacks used **FSDP + EP + TP** (EP coupled to TP). **Composer 2 decouples EP from TP** and uses **Context Parallelism (CP)** as the primary long-context axis (less comm than TP; CP folded into the FSDP dim). **No mention of "HSDP"** — the doc says **FSDP/ZeRO [50,81] + CP + decoupled EP**, **DeepEP** [80] for token dispatch/combine.
+- **Exact degrees:** **EP=8, CP=2 for CPT**; **EP=8, CP=8 for RL.** MLA attention with latent-vector all-gather trick; Llama-style 2×CP chunk load-balancing [33].
+- **Global sequence packing** before each RL step to balance DP compute across variable-length rollouts (accounts for quadratic attention cost).
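One plausible shape for cost-aware global packing is a greedy longest-first assignment under a per-sequence cost with a quadratic term. The report does not describe its packing algorithm, so the heuristic, the cost coefficients, and the function name are assumptions.

```python
import heapq


def pack_by_cost(seq_lens, num_ranks, linear=1.0, quadratic=1e-3):
    """Assign sequences to data-parallel ranks, longest first, by modeled cost."""

    def cost(n):
        # Hypothetical cost model: linear work plus quadratic attention work.
        return linear * n + quadratic * n * n

    heap = [(0.0, rank, []) for rank in range(num_ranks)]
    heapq.heapify(heap)
    for n in sorted(seq_lens, reverse=True):
        load, rank, items = heapq.heappop(heap)   # least-loaded rank so far
        items.append(n)
        heapq.heappush(heap, (load + cost(n), rank, items))
    return {rank: (load, items) for load, rank, items in heap}


packed = pack_by_cost([8000, 1000, 1000, 1000, 1000, 4000], num_ranks=2)
```

Both ranks end up with 8000 tokens here, yet the modeled cost of the single 8000-token rollout is much higher than that of the five shorter ones, which is the imbalance a token-count-only packer would miss.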
+**Precision recipe [§6.1]:**
+- **MoE forward = a novel NVFP4 variant**: BF16→FP4E2M1 with **FP8E4M3 per-block scales (block 16) + FP32 per-token scales** (per-tensor FP32 scales were "fragile" → batch-variance collapse + future-token leakage/biased grads). **MoE backward = standard MXFP8** (FP8E4M3 values, FP8E8M0 scales per 32-elt block); the higher precision is affordable because backward runs only on the train cluster. Trainer forward must **numerically match inference** for stability. IEEE `__fdiv_rn` is **critical** for NVFP4 (fast-approx division diverges after ~100 RL steps); fast-approx is OK for MXFP8.
+- Kernels in **CUDA/PTX/ThunderKittens-ParallelKittens** [56,59]; FA4 backward (DeepSeek QK192/V128 shapes) co-developed w/ Colfax; GEMMs open-sourced into ThunderKittens [21].
+**RL infra [§6.2] — 4 decoupled services (training / environments / inference / evals):**
+- **Training:** fully async on **Ray [42] + PyTorch**, centralized **reconciler** w/ slot-based sample lifecycle + staleness-balancing scheduler; **futures**-based eager exec; Ray object store w/ NVMe spill; fault-tolerant to process-group level, warm-standby nodes, live code updates; **policy-aware rollout-level + group-level checkpointing** (codebase memory snapshots; advantage-tagged sequences w/ policy versions to NFS). Production run spanned **3 GPU regions + 4 CPU regions.**
+- **Anyrun (verbatim internals):** *"an internal compute platform built for running untrusted code at scale … the same platform that powers Cloud Agents and Automations."* Global router → multiple Anyrun clusters; each cluster schedules **>500 pods/sec**, manages **hundreds of thousands of pods/cluster**; **each pod = a dedicated Firecracker VM** (full dev env incl. browser/GUI for computer use); x86+ARM mix; pressure-aware bin-packing. **Forking & snapshotting at filesystem + memory level** (→ mid-trajectory checkpoint, post-rollout introspection); same-node fork preferred else live-migrate. **Anygress** egress proxy (TCP-layer redirect via injected root CA, header stripping). **Shadow deployment of the Cursor backend** for faithful tools; tools dynamically per-environment (stricter arg checks / tool removal in training).
+- **Inference:** **partner = Fireworks AI.** Every step, weights synced to inference via **S3 with per-rank delta compression** (RL diffs compress to "a handful of GB" for the 1T model); sharded upload/download; geo-distributed US+EU clusters reconstruct from the shared delta chain (no direct train↔inference connectivity).
+- **Online evals:** pinned production backend + Cursor client per eval job; lease an eval deployment, move GPUs, cross-region weight sync.
+**CursorBench (eval-suite design) [§5]:**
+- Internal suite from **real Cursor engineering-team agent sessions** (avoids train-set contamination). Motivated by 4 failure modes of public benchmarks (domain mismatch, prompt over-specification, contamination/overfit, narrow scope).
+- **Quantified hardness vs public sets:** median **181 lines changed** (vs 7–10 for SWE-bench Verified/Multilingual) and median prompt length **390 chars** (vs 1,185–3,055) → larger + more under-specified. Versioned (**CursorBench-3** has more than 2× the median task size of v1; Table 1 uses CursorBench-3).
+- **Targeted sub-evals:** intent, instruction-following, **eager-editing** (don't edit when you shouldn't), code-quality (LLM-judge rubrics), **interruption** (mid-rollout user feedback). Built by "identifying dimensions, selecting eliciting data points, writing rubrics."
+- **Headline results (Table 1):** Composer 2 = **CursorBench 61.3 / SWE-bench Multilingual 73.7 / Terminal-Bench 61.7**; Kimi K2.5 base = 36.0 / 65.1 / 47.3 → large RL+CPT lift.
+**Ablations actually present (for "ablations on the training recipe"):**
+1. **CPT→RL** (Qwen3-Coder-30B, 3 compute levels; Fig. 2) — CE loss predicts RL reward.
+2. **KL estimator** k1 vs k3 (Fig. 4) — variance argument for k1.
+3. **GRPO term removals** — length-standardization & std-norm removed (qualitative justification, no head-to-head curve).
+4. **Overlong masking** — tried, no benefit at small scale, dropped.
+5. **NVFP4 scaling scheme** (per-token vs per-tensor) and **IEEE vs fast-approx division** — stability ablations.
+6. **best-of-K vs average** over training (Fig. 5).
+*(No single consolidated "leave-one-out recipe component" ablation table; ablations are distributed and partly qualitative.)*
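For ablation 2, the two estimators can be written out as a minimal sketch. It follows the standard definitions from the cited Schulman KL note, with samples drawn from the current policy and `r = p_ref / p_policy`; the function names and the exact-moment helper are illustrative assumptions, and the code does not reproduce Fig. 4.

```python
import math
from typing import Callable, Sequence, Tuple


def k1(logp_policy: float, logp_ref: float) -> float:
    """k1 = -log r = log p_policy - log p_ref. Unbiased; can be negative per sample."""
    return logp_policy - logp_ref


def k3(logp_policy: float, logp_ref: float) -> float:
    """k3 = (r - 1) - log r. Unbiased and non-negative per sample."""
    log_r = logp_ref - logp_policy
    return math.expm1(log_r) - log_r


def exact_moments(
    policy: Sequence[float],
    ref: Sequence[float],
    estimator: Callable[[float, float], float],
) -> Tuple[float, float]:
    """Exact mean and variance of an estimator under the policy distribution.

    Assumes both distributions share a finite support with strictly
    positive probabilities, so no sampling noise enters the comparison.
    """
    values = [estimator(math.log(q), math.log(p)) for q, p in zip(policy, ref)]
    mean = sum(q * v for q, v in zip(policy, values))
    second = sum(q * v * v for q, v in zip(policy, values))
    return mean, second - mean * mean


if __name__ == "__main__":
    policy = [0.5, 0.3, 0.2]
    ref = [0.4, 0.4, 0.2]
    true_kl = sum(q * math.log(q / p) for q, p in zip(policy, ref))
    print("true KL(policy || ref):", true_kl)
    for name, est in (("k1", k1), ("k3", k3)):
        mean, var = exact_moments(policy, ref, est)
        print(name, "mean:", mean, "variance:", var)
```

Both estimators have the true KL as their expectation, so any comparison between them comes down to per-sample variance, and that variance depends on how far the two distributions have drifted apart.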
+---
+## Corrections / cautions for the mapping doc
+- **[CORRECTION] Optimizer:** Composer **2** uses **Adam/AdamW**, **not Muon**. Treat "Sharded Muon" as a **2.5-blog-only, unverified-for-2** claim.
+- **[CORRECTION] Sharding:** report describes **FSDP+CP+decoupled-EP (EP=8/CP=2 CPT, EP=8/CP=8 RL)**, **not "HSDP."**
+- **[CORRECTION] "RL algorithm = PPO/GRPO `[EXTRAPOLATED]`"** → now **[REPORT-VERIFIED] Dr. GRPO-style** (length-std removed, no std-norm, k1 KL, Adam, single-epoch, MoE router-replay). DAPO overlong-masking explicitly rejected.
+- **[CONFIRM] Anyrun** real, with full internals (Firecracker VMs, >500 pods/s, fork/snapshot, Anygress).
+- **[CONFIRM] base model = Kimi K2.5 1.04T/32B** (over GLM-5, DeepSeek V3.2).
+- **[CAUTION] Hint mechanism, "25× synthetic tasks", Feature-Deletion generator, reward-hack anecdotes are NOT in this (Composer 2) report** — do not cite this PDF for them; they are Composer 2.5-blog material.
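The Dr. GRPO-style changes named in the RL-algorithm correction above (no std-norm, length term removed) can be sketched as follows. This is a toy comparison under stated assumptions: the function names are hypothetical, the constant `max_tokens` normalizer is one common way to replace per-sequence length division and is not confirmed by this note, and clipping, the KL term, and router replay are left out.

```python
import math
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Vanilla GRPO: center by the group mean, then divide by the group std."""
    mu = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
    return [(r - mu) / (std + eps) for r in rewards]


def dr_grpo_advantages(rewards: List[float]) -> List[float]:
    """Dr. GRPO-style: center by the group mean only (no std normalization)."""
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]


def aggregate_loss(
    token_losses: List[List[float]],
    max_tokens: int,
    length_normalize: bool,
) -> float:
    """Reduce per-token losses for a group of rollouts to one scalar.

    length_normalize=True divides each rollout by its own length (the
    term Dr. GRPO removes); False divides by a constant budget instead
    (assumed here), so long and short rollouts weight tokens equally.
    """
    per_seq = []
    for losses in token_losses:
        denom = len(losses) if length_normalize else max_tokens
        per_seq.append(sum(losses) / denom)
    return sum(per_seq) / len(per_seq)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]
    print("GRPO advantages:   ", grpo_advantages(rewards))
    print("Dr. GRPO advantages:", dr_grpo_advantages(rewards))
    losses = [[1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
    print("length-normalized: ", aggregate_loss(losses, 4, True))
    print("constant-normalized:", aggregate_loss(losses, 4, False))
```

In a TRL subclass such as the `ComposerGRPOTrainer` design in `research/08-sdpo-grpo-integration.md`, these two switches would sit in the advantage computation and the loss reduction respectively; where exactly depends on the TRL version and is not specified here.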
+---
+## Sources
+- **[PRIMARY, REPORT-VERIFIED]** Cursor Research Team, *Composer 2 Technical Report*, arXiv:**2603.24477** (v1 2026-03-25, v2 2026-03-26; cs.SE/cs.LG; corr. Alexander M. Rush). Full text via PDF `https://cursor.com/resources/Composer2.pdf` (Tavily advanced extract, full body+refs+App. A–C) and cross-checked via Exa full crawl (identical). HTML/TeX also available at `https://arxiv.org/abs/2603.24477`, `https://arxiv.org/pdf/2603.24477`.
+- **[SECONDARY]** Cursor blog, *A technical report on Composer 2* (Sasha Rush) — `https://cursor.com/blog/composer-2-technical-report` (abstract-level; confirms Kimi K2.5 base + CPT-loss→RL claim).
+- **[CONTEXT]** Key cited methods: Dr. GRPO (Liu et al., arXiv 2503.20783 [34]); DAPO (Yu et al., arXiv 2503.14476, NeurIPS'25 [78]); GSPO (Zheng et al., arXiv 2507.18071 [82]); DeepSeekMath/GRPO [53]; PipelineRL (arXiv 2509.19128 [48]); MoE router alignment (Ma et al., arXiv 2510.11370 [38]); KL-estimator variance (Amini et al. [6]); Schulman KL note [52]; DeepEP [80]; ThunderKittens/ParallelKittens [56,59].
+- **Prior internal note:** `research/09-composer-blog-delta-2026.md` (read first; this note discharges its action item #1 and supplies corrections to the RL-algorithm/optimizer/sharding rows of `docs/COMPOSER_RECIPE_MAPPING.md`).