docs(adr): add ADR-008/009/010 (Dr.GRPO+SDPO, layered hints, FeatureDeletionEnv)

Phase-4 of the deep-work-loop. Three proposed ADRs translating the Composer 2.5
data-gen + targeted-textual-feedback research into durable decisions, each with
mechanized acceptance gates (status stays `proposed` until the gates are green):

- ADR-008: target Dr. GRPO (length-std off, no std-norm, k1 KL, Adam,
single-epoch — per arXiv:2603.24477) and host the live SDPO channel in the
TRL ComposerReplicationTrainer. Amends ADR-006 (SDPO needs full logits ->
TRL-only; PRIME-RL hosts channels 1+3). Preserves ADR-007.
- ADR-009: layered HintGenerator (template -> raw-error -> LLM-judge ->
SDPO sibling-bootstrap) behind a typed Protocol; preserves the existing
templates + CollatorConfig.hint_generator hook.
- ADR-010: FeatureDeletionEnv synthetic-data subsystem inverting OSS SWE
substrates (SWE-bench-Lite/SWE-Gym/R2E-Gym/SWE-rebench) + online difficulty
gate + reward-hacking safeguards.

ADR-006 patched: Amended-by header + stale matrix row qualified. Index added.

Prior-ADR audit (adr-methodology 3a): 008 amends 006 / preserves 007;
009+010 preserve all. Greenfield-check finding: ComposerReplicationTrainer
already exists and is complete (008 = validate + configure, not author); hint_generator.py
exists as flat dispatch (009 = extend to layered); FeatureDeletionEnv absent
(010 = build).

Note: cross-family adversarial ADR review was blocked by OpenRouter 402
(out of credits); substituted a focused self-administered pre-mortem
(found + fixed ADR-006 stale-matrix-row). Re-run the cross-family review once
credits are restored.

Files changed (5)

docs/adrs/ADR-006-rl-frameworks.md +2 -1
docs/adrs/ADR-008-drgrpo-sdpo-live-channel.md +118 -0
docs/adrs/ADR-009-layered-hint-generator.md +106 -0
docs/adrs/ADR-010-feature-deletion-datagen.md +108 -0
docs/adrs/README.md +16 -0

docs/adrs/ADR-006-rl-frameworks.md CHANGED

@@ -3,6 +3,7 @@
 **Status**: Accepted
 **Date**: 2026-05-26
 **Wave**: 13
 ## Context
@@ -100,7 +101,7 @@ rewarder) maps naturally onto Monarch primitives.
 |---|---|
 | Quick start, single-cluster, ≤7B | TRL Recipe A |
 | Production multi-node, ≤32B | VeRL Recipe B |
-| Decentralized / DiLoCo-shape, any size | PRIME-RL recipe (NEW) |
 | Coordination-heavy multi-actor RL | Monarch + any of the above |
 ### Trade-offs explicitly accepted

 **Status**: Accepted
 **Date**: 2026-05-26
 **Wave**: 13
+**Amended-by**: ADR-008 — the implication that any of the three recipes can host the full 3-channel loss is amended: the **SDPO channel requires full vocabulary logits and is TRL-hosted only**. PRIME-RL hosts channels 1+3 (PG + trace-replay-DPO); its `LossInputs` exposes log-probs, not logits, so `recipes/prime_rl/composer_loss.py` raises `NotImplementedError` for `alpha_sdpo>0` until upstream PRIME-RL exposes logits. The framework selection, Monarch decision, and three-recipe matrix below are preserved.
 ## Context
 |---|---|
 | Quick start, single-cluster, ≤7B | TRL Recipe A |
 | Production multi-node, ≤32B | VeRL Recipe B |
+| Decentralized / DiLoCo-shape, any size | PRIME-RL recipe (NEW) — channels 1+3 only; **SDPO channel TRL-hosted, see ADR-008** |
 | Coordination-heavy multi-actor RL | Monarch + any of the above |
 ### Trade-offs explicitly accepted

docs/adrs/ADR-008-drgrpo-sdpo-live-channel.md ADDED

	@@ -0,0 +1,118 @@

+---
+status: proposed
+date: 2026-05-29
+amends: ADR-006
+deciders: [Codeseys, ARIA]
+---
+# ADR-008: Target Dr. GRPO as the RL base and host the live SDPO channel in the TRL trainer
+## Context and Problem Statement
+Composer 2.5's "Targeted RL with Textual Feedback" (= SDPO / on-policy
+self-distillation) is the recipe's secret sauce. The framework already
+*proved the SDPO loss math end-to-end* (`compose_loss` + a real
+forward/backward/optimizer step on Qwen2.5-0.5B, `examples/sdpo_real_trace_train_smoke`),
+and `composer_replication/trainer/composer_trainer.py::ComposerReplicationTrainer`
+already overrides `_compute_loss` to add `α·sdpo + β·replay` on top of the
+parent GRPO loss. What is **not** done: (a) the trainer has never been
+instantiated against a real `trl.GRPOTrainer` and smoke-tested; (b) the
+parent RL objective is unspecified ("GRPO + DAPO patches" in the docstring);
+(c) the SDPO channel carries a documented trust-gap (composer_trainer.py
+lines 158-160 assume the collator aligns student/teacher lengths).
+Research doc `research/10-composer2-techreport-mining.md` (mining the newly
+available Composer 2 technical report, arXiv:2603.24477) **resolves the RL
+algorithm**: it is **Dr. GRPO** — GRPO with the length-standardization term
+removed, **no std-dev advantage normalization**, Adam, single-epoch (a
+prompt is never trained twice), KL-to-reference via the **k1 estimator
+(`−log r`)** not k3, with DAPO overlong-masking explicitly tried and
+rejected. Research doc `research/08-sdpo-grpo-integration.md` establishes
+that the SDPO channel needs **full vocabulary logits** and therefore must
+live in the TRL `GRPOTrainer` subclass — the PRIME-RL recipe's `LossInputs`
+exposes per-token log-probs only and `recipes/prime_rl/composer_loss.py`
+correctly raises `NotImplementedError` for `alpha_sdpo>0`.
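The two Dr. GRPO ingredients named above, mean-only advantage centering and the k1 KL estimator, can be illustrated with a minimal pure-Python sketch. The function names are hypothetical and come from neither the repository nor TRL; the k3 form is included only for contrast.

```python
import math
from statistics import mean


def dr_grpo_advantages(rewards):
    """Group-relative advantages: subtract the group mean, do NOT divide by std."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def kl_k1(logp_policy, logp_ref):
    """k1 estimate of KL(policy || ref) for one token sampled from the policy.

    With r = p_ref / p_policy this is -log r. It is unbiased, but a single
    sample can be negative.
    """
    return logp_policy - logp_ref


def kl_k3(logp_policy, logp_ref):
    """k3 estimate, (r - 1) - log r, shown for contrast; never negative."""
    log_r = logp_ref - logp_policy
    return math.exp(log_r) - 1.0 - log_r
```

In a group where every rollout earns the same reward, the mean-centered advantages are all zero, whereas dividing by a near-zero standard deviation would amplify whatever small differences remain.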
+### Relationship to ADR-006 (RL framework strategy) — `amends`
+ADR-006 chose TRL + VeRL + PRIME-RL as three co-equal RL recipe hosts and
+its three-recipe production matrix **remains in force**. This ADR amends one
+specific point: ADR-006 implied any of the three recipes could host the full
+3-channel loss. That is **not true for the SDPO channel** — it requires full
+logits, which only the TRL host exposes. This ADR records that the **SDPO
+channel is TRL-hosted**; PRIME-RL hosts channels 1+3 (PG + trace-replay-DPO)
+and is blocked on SDPO until upstream PRIME-RL exposes logits in `LossInputs`.
+ADR-006's framework selection, Monarch decision, and matrix are preserved.
+### Relationship to ADR-007 (self-distillation losses) — `preserves`
+ADR-007 decided *which* distillation losses exist (`generalized_jsd_loss`,
+SimPO, TAID, Entropy-Aware OPD) and their `compose_loss` kwarg surface. This
+ADR is about *wiring the SDPO JSD live into a Dr. GRPO rollout→update loop*
+— orthogonal. ADR-007 is fully preserved; the SDPO channel here calls the
+exact `generalized_jsd_loss` ADR-007 governs.
+## Decision Drivers
+- The actual Composer RL algorithm is now known (Dr. GRPO); we should match
+  it rather than ship an unspecified "GRPO + DAPO patches" base.
+- SDPO needs full logits → host constraint is real, not preference.
+- The trainer class already exists; the cost is validation + Dr-GRPO config
+  + closing the alignment trust-gap, not authoring from scratch.
+- Must be CPU-smoke-testable so we validate without GPU spend.
+## Considered Options
+- **A. Target Dr. GRPO, host SDPO in the TRL `ComposerReplicationTrainer`** (chosen)
+- **B. Keep the unspecified "GRPO + DAPO" base, host SDPO in PRIME-RL**
+- **C. Vanilla GRPO base (with std-norm + length-standardization), TRL host**
+## Decision Outcome
+Chosen: **Option A** — configure the parent TRL `GRPOTrainer` to the Dr. GRPO
+variant (`scale_rewards` off / no std-norm, no length-standardization, k1 KL,
+Adam, single-epoch via `num_iterations=1`) via a `make_dr_grpo_config` helper,
+keep the SDPO + trace-replay channels in `ComposerReplicationTrainer`, close
+the student/teacher alignment trust-gap with an explicit assertion + collator
+guarantee, and add a CPU smoke test that instantiates the trainer and runs a
+1–2 rollout Dr. GRPO step with the SDPO channel on.
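A possible shape for the config helper with its version guard is sketched below. It is an illustration under assumptions, not the repository's `make_dr_grpo_config`: the knob names and values in `DR_GRPO_KWARGS` are assumed and must be confirmed against the pinned TRL release, `StandInConfig` stands in for `trl.GRPOConfig` so the sketch runs without TRL, and the k1 KL and Adam settings are left out because their knobs are not named in this ADR.

```python
from dataclasses import dataclass, fields

# Assumed knob names and values; confirm each against the pinned TRL release.
DR_GRPO_KWARGS = {
    "scale_rewards": False,  # no std-dev advantage normalization
    "loss_type": "dr_grpo",  # no per-sequence length normalization
    "num_iterations": 1,     # one optimizer pass per generated batch
}


def make_dr_grpo_kwargs(config_cls, **overrides):
    """Return Dr. GRPO kwargs for config_cls, failing loudly on knob-name drift."""
    kwargs = {**overrides, **DR_GRPO_KWARGS}  # Dr. GRPO values win over overrides
    known = {f.name for f in fields(config_cls)}
    missing = sorted(name for name in kwargs if name not in known)
    if missing:
        raise RuntimeError(
            f"{config_cls.__name__} has no field(s) {missing}; check the TRL pin"
        )
    return kwargs


@dataclass
class StandInConfig:
    """Hypothetical stand-in for trl.GRPOConfig (same field names assumed)."""
    scale_rewards: bool = True
    loss_type: str = "grpo"
    num_iterations: int = 1
    learning_rate: float = 1e-6


config = StandInConfig(**make_dr_grpo_kwargs(StandInConfig, learning_rate=5e-7))
```

Checking field names before constructing the config turns a silent knob rename in a future TRL release into an immediate, named failure, which is the behavior the acceptance gate asks for.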
+### Consequences
+- **Positive**: Matches the *actual* Composer 2 RL algorithm (primary-sourced), not a guess.
+- **Positive**: Reuses the already-written trainer; small, validatable delta.
+- **Positive**: SDPO channel is exercised live, closing the gap between "loss math proven" and "loss runs inside RL."
+- **Negative**: SDPO is TRL-only for now; the PRIME-RL decentralized path (ADR-006's headline for DiLoCo-shape runs) cannot use the SDPO channel until upstream exposes logits. This narrows the decentralized story to channels 1+3.
+- **Negative**: A second teacher forward per error-site minibatch ~doubles the forward cost on error-bearing batches (acceptable; error batches are a minority and the teacher forward is `no_grad`).
+- **Neutral**: TRL's GRPOTrainer config knobs for Dr. GRPO (e.g. `scale_rewards`) are version-sensitive; pinned + guarded.
+## Pros and Cons of the Options
+### A. Dr. GRPO + TRL host
+- Good: primary-sourced algorithm match; reuses existing trainer; full logits available for SDPO.
+- Good: CPU-smoke-testable end-to-end.
+- Bad: SDPO confined to TRL until PRIME-RL exposes logits.
+### B. Unspecified base + PRIME-RL host
+- Good: PRIME-RL is the decentralized/DiLoCo-shape host (ADR-006).
+- Bad: **PRIME-RL cannot host SDPO** (log-probs only) — `NotImplementedError` today. Fatal for the SDPO channel.
+- Bad: leaves the RL algorithm unspecified despite the tech report resolving it.
+### C. Vanilla GRPO + TRL host
+- Good: TRL default; least config work.
+- Bad: std-dev advantage normalization "massively upweights small behavioral differences within equal-correctness groups" (tech report); length-standardization injects length bias. Diverges from the known-good recipe for no benefit.
+## Acceptance gate (must be green before status flips to accepted)
+- [ ] `make_dr_grpo_config(...)` helper exists and sets: no std-norm advantage scaling, no length-standardization, Adam, `num_iterations=1` (single-epoch), k1 KL — each value asserted in a unit test against the resulting `GRPOConfig`.
+- [ ] `ComposerReplicationTrainer._compute_sdpo_loss` lines 158-160 trust-gap closed: an explicit shape-alignment assertion (not just a warning-and-skip) with a unit test that a misaligned batch raises rather than silently zeroes.
+- [ ] CPU smoke test instantiates `ComposerReplicationTrainer` with a real `trl.GRPOTrainer` parent on Qwen2.5-0.5B, runs ≥1 Dr. GRPO update step with `alpha_sdpo>0` on an error-bearing batch, asserts: total loss finite, `loss/sdpo_kl > 0` logged on ≥1 step, a param moved. (Mirrors `examples/sdpo_real_trace_train_smoke`.)
+- [ ] TRL version pinned in `[train]` extra; the Dr-GRPO config knobs guarded with a version check that fails loudly if the knob names drift.
+- [ ] `recipes/prime_rl/composer_loss.py` `NotImplementedError(alpha_sdpo>0)` path has a test asserting it raises with a message pointing to the TRL host (documents the ADR-006 amendment in code).
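The "raise rather than silently zero" gate could look like the following sketch. The function name and the assumed layout (logits shaped `(batch, seq, vocab)`, mask shaped `(batch, seq)`) are illustrative assumptions; the actual check inside `_compute_sdpo_loss` may differ.

```python
def assert_sdpo_aligned(student_shape, teacher_shape, mask_shape):
    """Raise ValueError unless student/teacher logits and the SDPO mask line up.

    Shapes are passed as plain sequences (for example tuple(tensor.shape)) so
    the check itself has no tensor-library dependency.
    """
    student = tuple(student_shape)
    teacher = tuple(teacher_shape)
    mask = tuple(mask_shape)
    if len(student) != 3:
        raise ValueError(f"expected (batch, seq, vocab) logits, got {student}")
    if student != teacher:
        raise ValueError(
            f"student logits {student} and teacher logits {teacher} are misaligned"
        )
    if mask != student[:2]:
        raise ValueError(f"sdpo mask {mask} does not match (batch, seq) {student[:2]}")
```

A unit test for the gate would then pass a batch whose teacher sequence is one token longer and assert that `ValueError` is raised.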
+## More Information
+- `research/08-sdpo-grpo-integration.md` — full integration design + `ComposerGRPOTrainer` sketch.
+- `research/10-composer2-techreport-mining.md` — Dr. GRPO resolution (arXiv:2603.24477).
+- Dr. GRPO: Liu et al., *Understanding R1-Zero-like training*, arXiv:2503.20783.
+- SDPO: Hübotter et al., arXiv:2601.20802; OPSD: Zhao et al., arXiv:2601.18734.

docs/adrs/ADR-009-layered-hint-generator.md ADDED

	@@ -0,0 +1,106 @@

+---
+status: proposed
+date: 2026-05-29
+deciders: [Codeseys, ARIA]
+---
+# ADR-009: Adopt a layered HintGenerator architecture for SDPO textual feedback
+## Context and Problem Statement
+Composer 2.5's targeted-textual-feedback method inserts a short natural-language
+hint at each error turn; the hint-conditioned forward pass becomes the SDPO
+teacher. **How Cursor generates that hint is unstated in every Cursor artifact**
+— both blogs and the full Composer 2 technical report
+(`research/10-composer2-techreport-mining.md` confirms ABSENT). The cited papers
+bracket the answer: OPSD (arXiv:2601.18734) conditions the teacher on
+ground-truth/reference; SDPO (arXiv:2601.20802) generalizes to environment
+feedback and ablates three feedback sources, including the "successful sibling
+rollout as implicit feedback" trick. So hint generation is genuinely *our*
+design problem, and the project's own audit calls it "the single most important
+open question for replication."
+The framework today has `composer_replication/hint_generator.py`: a flat
+registry of 5 templates (`tool_not_found`, `json_decode`, `type_error`,
+`runtime_error`, `repeated_failure`) behind a `dispatch(error_kind, ctx)`
+function. This covers the easy tool-error case but: (a) is not a typed
+interface the collator can compose against (the collator's
+`CollatorConfig.hint_generator` hook takes a callable, but there's no
+`Protocol`); (b) has no path for style/communication errors (which need
+natural language, not templates); (c) has no fallback when no template matches;
+(d) cannot use the SDPO sibling-bootstrap lever. Research doc
+`research/07-sdpo-hint-generator.md` designs the upgrade.
+### Relationship to existing code — `preserves` + extends
+This ADR **preserves** the existing 5 templates and `dispatch`/`register`
+functions (they become the `TemplateHintGenerator` layer) and the
+`CollatorConfig.hint_generator` hook signature. It adds a typed
+`HintGenerator` Protocol and composite layering on top. No prior ADR governs
+hint generation, so there is nothing to supersede.
+## Decision Drivers
+- Hint generation is the #1 replication gap; the design must be honest about
+  which behavior classes each layer can actually cover.
+- Templates are free and cover tool errors; style/communication need an LLM;
+  some sites have no external hint source at all.
+- Must slot into the existing collator hook with zero collator changes.
+- Cost-awareness: most error sites should hit the free template path.
+## Considered Options
+- **A. Layered HintGenerator: template → raw-error → LLM-judge → SDPO sibling-bootstrap, behind a typed Protocol** (chosen)
+- **B. Keep flat template `dispatch` only** (status quo)
+- **C. LLM-judge only (one strong model generates every hint)**
+## Decision Outcome
+Chosen: **Option A** — a `HintGenerator` Protocol with `.generate(error_context) -> str | None`,
+implemented as a `CompositeHintGenerator` that tries layers cheapest-first:
+(1) `TemplateHintGenerator` (the existing 5 templates + more), (2) raw
+tool-error text as the hint, (3) `LLMJudgeHintGenerator` (OpenRouter, ~$0.0005/site,
+for style/communication/effort sites templates can't cover), (4) SDPO
+sibling-bootstrap (use a successful sibling rollout as implicit feedback when
+no external hint source exists). The composite exposes `as_collator_hook()`
+returning a callable matching the existing `CollatorConfig.hint_generator`
+signature.
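The cheapest-first layering described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the `HintContext` fields, the `TemplateLayer` and `RawErrorLayer` class names, and the hook signature returned by `as_collator_hook()` are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, Protocol, Sequence


@dataclass(frozen=True)
class HintContext:
    """Hypothetical error context; the real fields may differ."""
    error_kind: str
    message: str = ""


class HintGenerator(Protocol):
    def generate(self, error_context: HintContext) -> Optional[str]: ...


class TemplateLayer:
    """Layer 1: free lookup of a fixed template by error kind."""

    def __init__(self, templates: Mapping[str, str]) -> None:
        self._templates = dict(templates)

    def generate(self, error_context: HintContext) -> Optional[str]:
        return self._templates.get(error_context.error_kind)


class RawErrorLayer:
    """Layer 2: use the raw tool-error text, if any, as the hint."""

    def generate(self, error_context: HintContext) -> Optional[str]:
        return error_context.message or None


class CompositeHintGenerator:
    """Try each layer in order and return the first non-None hint."""

    def __init__(self, layers: Sequence[HintGenerator]) -> None:
        self._layers = list(layers)

    def generate(self, error_context: HintContext) -> Optional[str]:
        for layer in self._layers:
            hint = layer.generate(error_context)
            if hint is not None:
                return hint
        return None

    def as_collator_hook(self) -> Callable[[HintContext], Optional[str]]:
        # Assumed hook shape: a callable from context to optional hint.
        return self.generate
```

Because the layers are matched structurally through the Protocol, a mocked judge layer in a test only needs a `generate` method; it does not have to inherit from anything.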
+### Consequences
+- **Positive**: Covers all four Composer behavior classes (tool use / style / communication / effort), not just tool errors.
+- **Positive**: Cost-bounded — free template path handles the majority; LLM-judge only on the residual.
+- **Positive**: Zero collator change (drops into the existing hook).
+- **Negative**: The LLM-judge layer introduces a network dependency + per-site cost + nondeterminism into data prep; must be optional and cached. Mitigated by ordering it after the free layers and adding a disk cache keyed on the error context hash.
+- **Negative**: Sibling-bootstrap requires multiple rollouts per prompt to have a successful sibling — only available in the RL-rollout path, not in offline-trace ingestion. Documented as RL-path-only.
+- **Neutral**: More moving parts than a flat dict; mitigated by the Protocol + per-layer unit tests.
+## Pros and Cons of the Options
+### A. Layered Protocol
+- Good: honest behavior-class coverage; cost-bounded; composable; testable per layer.
+- Good: preserves existing templates as the first layer.
+- Bad: more surface area; LLM layer adds nondeterminism (cached + optional).
+### B. Flat templates only (status quo)
+- Good: free, deterministic, already shipping.
+- Bad: cannot cover style/communication/effort sites at all — exactly the behaviors Composer's blog says the method targets.
+### C. LLM-judge only
+- Good: maximal coverage, natural language for every site.
+- Bad: cost on every site (templates would have been free); nondeterministic data prep; network dependency on the hot path. Wasteful for the common tool-error case a template nails.
+## Acceptance gate (must be green before status flips to accepted)
+- [ ] `HintGenerator` Protocol defined with `.generate(error_context: HintContext) -> str | None`; `mypy`/pyright clean.
+- [ ] `TemplateHintGenerator` wraps the existing 5 templates; a test asserts byte-identical output to the current `dispatch()` for all 5 kinds (no regression).
+- [ ] `CompositeHintGenerator` tries layers in cost order; a test asserts a tool_not_found site is served by the template layer (no LLM call) and a style site falls through to the judge layer (mocked).
+- [ ] `LLMJudgeHintGenerator` has a disk cache keyed on error-context hash; a test asserts a second identical call hits the cache (zero network).
+- [ ] `as_collator_hook()` returns a callable accepted by `CollatorConfig.hint_generator` without collator changes; an end-to-end test runs ingestion → collator-with-composite-hook → a non-empty `sdpo_loss_mask` on a real error trace.
+- [ ] Sibling-bootstrap layer is gated behind an explicit `enable_sibling_bootstrap` flag and documented as RL-rollout-path-only; a unit test asserts it returns None in the offline-trace path.
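The disk cache keyed on an error-context hash might be sketched as below. The wrapper class name, the JSON file layout, and the assumption that the context is a JSON-serializable dict are illustrative choices, not taken from the repository.

```python
import hashlib
import json
from pathlib import Path


class CachedHintGenerator:
    """Wrap an expensive hint callable with a one-file-per-context disk cache."""

    def __init__(self, inner, cache_dir):
        self._inner = inner  # callable: dict -> str | None (e.g. an LLM judge)
        self._dir = Path(cache_dir)
        self._dir.mkdir(parents=True, exist_ok=True)

    def _path(self, error_context):
        # sort_keys makes the hash independent of dict insertion order.
        payload = json.dumps(error_context, sort_keys=True).encode("utf-8")
        return self._dir / (hashlib.sha256(payload).hexdigest() + ".json")

    def generate(self, error_context):
        path = self._path(error_context)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))["hint"]
        hint = self._inner(error_context)
        path.write_text(json.dumps({"hint": hint}), encoding="utf-8")
        return hint
```

Storing the hint inside a small JSON object means a `None` result is cached as well, so a site the judge declined once is not re-queried on every data-prep run.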
+## More Information
+- `research/07-sdpo-hint-generator.md` — full taxonomy, template strings, judge prompt, cost analysis.
+- `research/09-composer-blog-delta-2026.md` — the SDPO sibling-bootstrap lever.
+- Existing: `composer_replication/hint_generator.py`, `composer_replication/trainer/data_collator.py` (`CollatorConfig.hint_generator`).

docs/adrs/ADR-010-feature-deletion-datagen.md ADDED

	@@ -0,0 +1,108 @@

+---
+status: proposed
+date: 2026-05-29
+deciders: [Codeseys, ARIA]
+---
+# ADR-010: Build a FeatureDeletionEnv synthetic-data subsystem over OSS SWE substrates
+## Context and Problem Statement
+Composer 2.5 trained on "25× more synthetic tasks than Composer 2," with
+**Feature Deletion** as a named generator: take a repo with passing tests,
+delete code/features, the agent must reimplement to make the tests pass —
+tests are the verifiable reward. The Cursor blog also notes the data is a
+**dynamic difficulty curriculum** ("we both select for and create harder
+tasks dynamically throughout the run"); the Composer 2 tech report adds that the
+curriculum is keyed on rollout #turns + thinking-token count
+(`research/10-composer2-techreport-mining.md`). The blog reports real
+reward-hacking (decompiling Java bytecode, reverse-engineering Python
+type-check caches to recover deleted signatures) mitigated by "agentic
+monitoring tools."
+The framework has **no synthetic-data-generation subsystem at all** — this is
+genuinely greenfield. Codeseys' ask is to bring Composer's dataset-generation
+approach into the framework as a real, reusable subsystem (useful beyond this
+project). Research doc `research/06-feature-deletion-datagen.md` designs it
+and identifies ready OSS substrates: SWE-bench-Lite, SWE-Gym (2.4k), R2E-Gym
+(8.1k), SWE-rebench (21.3k), Nemotron/OpenHands (59k) — each ships
+`(repo, gold_patch, FAIL_TO_PASS, PASS_TO_PASS, test_cmd)` tuples that invert
+directly into Feature-Deletion tasks (revert the gold patch → broken repo;
+`FAIL_TO_PASS` tests = the reward signal; `PASS_TO_PASS` = the don't-break-
+existing-behavior guard).
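The inversion of a substrate row into a task record can be sketched as follows. The dataclass fields and the `invert_row` helper are hypothetical, and the row keys simply mirror the tuple named above; the column names in the actual datasets may differ.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Tuple


@dataclass(frozen=True)
class FeatureDeletionTask:
    """Hypothetical task record; the real dataclass may carry more fields."""
    repo: str
    reference_patch: str           # gold patch; reverting it gives the broken repo
    reward_tests: Tuple[str, ...]  # FAIL_TO_PASS: must go green to earn reward
    guard_tests: Tuple[str, ...]   # PASS_TO_PASS: must stay green throughout
    test_cmd: str


def invert_row(row: Mapping[str, Any]) -> FeatureDeletionTask:
    """Turn one solved-issue row into a reimplement-to-pass task."""
    if not row["FAIL_TO_PASS"]:
        raise ValueError("row has no FAIL_TO_PASS tests, so there is no reward signal")
    return FeatureDeletionTask(
        repo=row["repo"],
        reference_patch=row["gold_patch"],
        reward_tests=tuple(row["FAIL_TO_PASS"]),
        guard_tests=tuple(row["PASS_TO_PASS"]),
        test_cmd=row["test_cmd"],
    )
```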
+### Relationship to prior ADRs — `preserves`
+No prior ADR governs synthetic data generation. ADR-002 (trace source) chose
+Claude Code JSONL for *ingestion of real traces* — orthogonal to *generating*
+tasks. This ADR preserves everything; it adds a new `composer_replication.datagen`
+package.
+## Decision Drivers
+- Reuse existing verified OSS substrates rather than scraping repos from scratch — they already guarantee test-exercises-the-code via `FAIL_TO_PASS`.
+- The env must fit the framework's RL side (TRL GRPOTrainer + verifiers / OpenEnv-compatible) so generated tasks feed the same training loop.
+- Reward-hacking is a *real* reported failure, not theoretical — safeguards must be in the design, not deferred.
+- The online difficulty curriculum is part of the recipe, not optional.
+- Subsystem should be reusable for the owner's other project (general "verifiable-reward task generation over a code repo").
+## Considered Options
+- **A. `FeatureDeletionEnv` that inverts OSS SWE substrates (revert gold patch) + online pass-rate difficulty gate + sandbox/AST reward-hacking safeguards** (chosen)
+- **B. Greenfield repo-scraping generator (clone arbitrary GitHub repos, delete AST nodes, hope tests cover them)**
+- **C. Skip generation; reuse SWE-bench-lite tasks as-is without a deletion/inversion layer**
+## Decision Outcome
+Chosen: **Option A** — a `composer_replication.datagen` package with a
+`FeatureDeletionTask` dataclass, a `FeatureDeletionEnv` (Gym/OpenEnv-style
+`reset`/`step`/`reward` where reward = masked test-pass fraction), substrate
+adapters that invert the 5 OSS datasets by reverting their gold patch, a
+4-gate solvability validator (broken repo fails `FAIL_TO_PASS`, passes
+`PASS_TO_PASS`, gold patch restores green, deletion is reachable from tests),
+an online pass-rate difficulty gate, and reward-hacking safeguards
+(pre-task scrub of `__pycache__`/`.mypy_cache`/`.class`/`.git`; allowlisted
+sandbox without `find`/`strings`/`unzip`/decompilers; AST provenance monitor
+that masks reward when deleted symbols reappear via non-implementation paths).
+A TRL `reward_fn(prompts, completions, **kwargs) -> list[float]` adapter wires
+it to the RL loop.
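The reward rule and the TRL-style adapter could be sketched as below. Two things here are assumptions rather than decisions recorded in this ADR: that a failing `PASS_TO_PASS` guard zeroes the reward, and that test execution is injected as a hypothetical `evaluate` callable returning `(fail_to_pass_results, pass_to_pass_results, hack_suspected)`.

```python
def masked_pass_fraction(fail_to_pass_results, pass_to_pass_results, hack_suspected):
    """Fraction of FAIL_TO_PASS tests now passing, masked to 0.0 when unsafe.

    The reward is masked when the provenance monitor suspects a hack, when any
    PASS_TO_PASS guard test broke (an assumption of this sketch), or when
    there are no reward tests at all.
    """
    if hack_suspected or not fail_to_pass_results:
        return 0.0
    if not all(pass_to_pass_results):
        return 0.0
    return sum(bool(r) for r in fail_to_pass_results) / len(fail_to_pass_results)


def make_reward_fn(evaluate):
    """Build a reward_fn(prompts, completions, **kwargs) -> list[float].

    `evaluate` is a hypothetical callable that applies one completion in the
    sandbox and returns (fail_to_pass_results, pass_to_pass_results, hack_flag).
    """
    def reward_fn(prompts, completions, **kwargs):
        return [masked_pass_fraction(*evaluate(c)) for c in completions]

    return reward_fn
```

Keeping the sandbox behind an injected callable lets the adapter be unit-tested with a fake evaluator, with Docker needed only for the substrate-level tests.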
+### Consequences
+- **Positive**: Verifiable reward for free (tests already exist + are known to exercise the code via `FAIL_TO_PASS`); no need to generate or trust new tests.
+- **Positive**: Reusable general subsystem — "invert a solved-repo dataset into a reimplement-to-pass task" works for the owner's other project.
+- **Positive**: Online difficulty gate matches the actual recipe.
+- **Negative**: Bounded to what the OSS substrates cover (Python-dominant; the original SWE-bench is drawn entirely from Python repositories). Other languages need new substrates. Documented as a known coverage limit.
+- **Negative**: Running tests in a sandbox requires Docker images per substrate; CPU-pool generation has real wall-clock cost (~15 node-days to invert all 21k SWE-rebench tasks per research/06). Mitigated by reusing the substrates' published Docker images and generating lazily.
+- **Negative**: Reward-hacking safeguards are a moving target; the AST provenance monitor is heuristic and will have false negatives. Mitigated by treating it as defense-in-depth (sandbox lockdown is the primary control) and logging suspected hacks for review.
+- **Neutral**: Adds a `[datagen]` optional extra (datasets, docker SDK).
+## Pros and Cons of the Options
+### A. Invert OSS substrates
+- Good: tests guaranteed to exercise the deleted code; verified datasets; reusable; matches recipe.
+- Bad: language/domain coverage bounded by the substrates; sandbox/Docker cost.
+### B. Greenfield repo scraping
+- Good: unbounded repo universe.
+- Bad: no guarantee remaining tests exercise the deleted code (the hard part SWE-bench already solved); huge validation burden; reinvents SWE-Gym. High effort, low marginal value.
+### C. Reuse SWE-bench-lite as-is
+- Good: zero build.
+- Bad: not *Feature Deletion* — it's issue-fixing; no controlled difficulty knob; no deletion mechanic; doesn't bring Composer's data-gen method in, just consumes an existing benchmark. Fails the actual ask.
+## Acceptance gate (must be green before status flips to accepted)
+- [ ] `FeatureDeletionTask` dataclass + `FeatureDeletionEnv` (`reset`/`step`/`reward`) implemented; reward = masked test-pass fraction with a unit test on a synthetic mini-repo.
+- [ ] One substrate adapter (SWE-bench-Lite, smallest) inverts ≥1 real task: revert gold patch → broken repo, a test asserts the broken repo FAILS `FAIL_TO_PASS` and PASSES `PASS_TO_PASS`, and applying the gold patch restores green. (Runs in the substrate's Docker image; gated `skipif` on docker availability for CI.)
+- [ ] 4-gate solvability validator implemented; a test asserts a task with an unreachable deletion (no test exercises it) is rejected.
+- [ ] Reward-hacking safeguard: a test asserts the sandbox lacks `find`/`strings`/`unzip` and that `__pycache__`/`.mypy_cache` are scrubbed pre-task; the AST provenance monitor masks reward on a crafted "symbol reappears via import of a sibling cache" hack.
+- [ ] Online difficulty gate: a unit test asserts tasks are rankable by a difficulty signal (turns/thinking-token proxy) and the gate up-weights the hard tail.
+- [ ] TRL `reward_fn(prompts, completions, **kwargs) -> list[float]` adapter exists; a test asserts it returns one float in [0,1] per completion = test-pass fraction.
+- [ ] `[datagen]` optional extra added to `pyproject.toml`; `pip install -e .[datagen]` resolves.
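One way the online gate could up-weight the hard tail is sketched below. It assumes the difficulty signal is an observed per-task pass rate and that sampling weight is proportional to `1 - pass_rate` with a small floor; both the rule and the `sampling_weights` name are illustrative, and the turns/thinking-token proxy mentioned above is not modeled.

```python
def sampling_weights(pass_rates, floor=0.05):
    """Map per-task pass rates in [0, 1] to normalized sampling weights.

    Harder tasks (lower pass rate) get more weight; the floor keeps
    already-solved tasks at a small nonzero probability.
    """
    raw = {task: max(1.0 - rate, floor) for task, rate in pass_rates.items()}
    total = sum(raw.values())
    return {task: weight / total for task, weight in raw.items()}
```

Recomputing the weights from fresh pass rates after each batch of rollouts is what would make the gate online rather than a fixed, precomputed curriculum.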
+## More Information
+- `research/06-feature-deletion-datagen.md` — full design: substrates w/ HF ids + licenses, deletion mechanics, safeguards, env + reward_fn sketches, cost verdict.
+- Substrates: SWE-bench (arXiv:2310.06770), SWE-Gym (arXiv:2412.21139), R2E-Gym, SWE-rebench.
+- Composer data-gen: `docs/COMPOSER_RECIPE_MAPPING.md` §2, `research/10-composer2-techreport-mining.md` §1.2 (curriculum).

docs/adrs/README.md ADDED

	@@ -0,0 +1,16 @@

+# Architecture Decision Records
+| # | Title | Status | Date |
+|---|-------|--------|------|
+| [ADR-001](ADR-001-gpu-venue.md) | GPU venue | accepted | — |
+| [ADR-002](ADR-002-trace-source.md) | Trace source | accepted | — |
+| [ADR-003](ADR-003-diloco-impl.md) | DiLoCo implementation | accepted | — |
+| [ADR-004](ADR-004-replaysim-normalization.md) | ReplaySim normalization | accepted | — |
+| [ADR-005](ADR-005-serverless-diloco.md) | Serverless DiLoCo | accepted | — |
+| [ADR-006](ADR-006-rl-frameworks.md) | RL framework strategy: TRL + VeRL + PRIME-RL | accepted (amended-by ADR-008) | 2026-05-26 |
+| [ADR-007](ADR-007-self-distillation-losses.md) | Self-distillation losses landscape | accepted | 2026-05-26 |
+| [ADR-008](ADR-008-drgrpo-sdpo-live-channel.md) | Target Dr. GRPO + host live SDPO channel in TRL trainer | proposed | 2026-05-29 |
+| [ADR-009](ADR-009-layered-hint-generator.md) | Layered HintGenerator for SDPO textual feedback | proposed | 2026-05-29 |
+| [ADR-010](ADR-010-feature-deletion-datagen.md) | FeatureDeletionEnv synthetic-data subsystem over OSS SWE substrates | proposed | 2026-05-29 |
+Sorted by number ascending. ADRs are immutable after `accepted`; supersede or amend rather than edit.