Codeseys committed on
Commit
36ab61e
·
1 Parent(s): 6049d00

docs(adr): add ADR-008/009/010 (Dr.GRPO+SDPO, layered hints, FeatureDeletionEnv)


Phase 4 of the deep-work loop: three proposed ADRs that translate the Composer 2.5
data-gen and targeted-textual-feedback research into durable decisions, each with
mechanized acceptance gates (status stays `proposed` until the gates are green):

- ADR-008: target Dr. GRPO (length-standardization off, no std-norm, k1 KL, Adam,
single-epoch; per arXiv:2603.24477) and host the live SDPO channel in the
TRL ComposerReplicationTrainer. Amends ADR-006 (SDPO needs full logits, so it is
TRL-only; PRIME-RL hosts channels 1+3). Preserves ADR-007.
- ADR-009: layered HintGenerator (template -> raw-error -> LLM-judge ->
SDPO sibling-bootstrap) behind a typed Protocol; preserves the existing
templates + CollatorConfig.hint_generator hook.
- ADR-010: FeatureDeletionEnv synthetic-data subsystem inverting OSS SWE
substrates (SWE-bench-Lite/SWE-Gym/R2E-Gym/SWE-rebench) + online difficulty
gate + reward-hacking safeguards.

ADR-006 patched: Amended-by header + stale matrix row qualified. Index added.

Prior-ADR audit (adr-methodology 3a): 008 amends 006 and preserves 007;
009 and 010 preserve all. Greenfield-check findings: ComposerReplicationTrainer
already exists and is complete (008 = validate + configure, not author); hint_generator.py
exists as a flat dispatch (009 = extend to layered); FeatureDeletionEnv is absent
(010 = build).

Note: the cross-family adversarial ADR review was blocked by an OpenRouter 402
(out of credits); a focused self-administered pre-mortem was substituted
(it found and fixed the stale ADR-006 matrix row). Re-run the cross-family review
when credits are restored.

docs/adrs/ADR-006-rl-frameworks.md CHANGED
@@ -3,6 +3,7 @@
3
  **Status**: Accepted
4
  **Date**: 2026-05-26
5
  **Wave**: 13
 
6
 
7
  ## Context
8
 
@@ -100,7 +101,7 @@ rewarder) maps naturally onto Monarch primitives.
100
  |---|---|
101
  | Quick start, single-cluster, ≤7B | TRL Recipe A |
102
  | Production multi-node, ≤32B | VeRL Recipe B |
103
- | Decentralized / DiLoCo-shape, any size | PRIME-RL recipe (NEW) |
104
  | Coordination-heavy multi-actor RL | Monarch + any of the above |
105
 
106
  ### Trade-offs explicitly accepted
 
3
  **Status**: Accepted
4
  **Date**: 2026-05-26
5
  **Wave**: 13
6
+ **Amended-by**: ADR-008 — the implication that any of the three recipes can host the full 3-channel loss is amended: the **SDPO channel requires full vocabulary logits and is TRL-hosted only**. PRIME-RL hosts channels 1+3 (PG + trace-replay-DPO); its `LossInputs` exposes log-probs, not logits, so `recipes/prime_rl/composer_loss.py` raises `NotImplementedError` for `alpha_sdpo>0` until upstream PRIME-RL exposes logits. The framework selection, Monarch decision, and three-recipe matrix below are preserved.
7
 
8
  ## Context
9
 
 
101
  |---|---|
102
  | Quick start, single-cluster, ≤7B | TRL Recipe A |
103
  | Production multi-node, ≤32B | VeRL Recipe B |
104
+ | Decentralized / DiLoCo-shape, any size | PRIME-RL recipe (NEW) — channels 1+3 only; **SDPO channel TRL-hosted, see ADR-008** |
105
  | Coordination-heavy multi-actor RL | Monarch + any of the above |
106
 
107
  ### Trade-offs explicitly accepted
docs/adrs/ADR-008-drgrpo-sdpo-live-channel.md ADDED
@@ -0,0 +1,118 @@
1
+ ---
2
+ status: proposed
3
+ date: 2026-05-29
4
+ amends: ADR-006
5
+ deciders: [Codeseys, ARIA]
6
+ ---
7
+
8
+ # ADR-008: Target Dr. GRPO as the RL base and host the live SDPO channel in the TRL trainer
9
+
10
+ ## Context and Problem Statement
11
+
12
+ Composer 2.5's "Targeted RL with Textual Feedback" (= SDPO / on-policy
13
+ self-distillation) is the recipe's secret sauce. The framework already
14
+ *proved the SDPO loss math end-to-end* (`compose_loss` + a real
15
+ forward/backward/optimizer step on Qwen2.5-0.5B, `examples/sdpo_real_trace_train_smoke`),
16
+ and `composer_replication/trainer/composer_trainer.py::ComposerReplicationTrainer`
17
+ already overrides `_compute_loss` to add `α·sdpo + β·replay` on top of the
18
+ parent GRPO loss. What is **not** done: (a) the trainer has never been
19
+ instantiated against a real `trl.GRPOTrainer` and smoke-tested; (b) the
20
+ parent RL objective is unspecified ("GRPO + DAPO patches" in the docstring);
21
+ (c) the SDPO channel carries a documented trust-gap (composer_trainer.py
22
+ lines 158-160 assume the collator aligns student/teacher lengths).
23
+
24
+ Research doc `research/10-composer2-techreport-mining.md` (mining the newly
25
+ available Composer 2 technical report, arXiv:2603.24477) **resolves the RL
26
+ algorithm**: it is **Dr. GRPO** — GRPO with the length-standardization term
27
+ removed, **no std-dev advantage normalization**, Adam, single-epoch (a
28
+ prompt is never trained twice), KL-to-reference via the **k1 estimator
29
+ (`−log r`)** not k3, with DAPO overlong-masking explicitly tried and
30
+ rejected. Research doc `research/08-sdpo-grpo-integration.md` establishes
31
+ that the SDPO channel needs **full vocabulary logits** and therefore must
32
+ live in the TRL `GRPOTrainer` subclass — the PRIME-RL recipe's `LossInputs`
33
+ exposes per-token log-probs only and `recipes/prime_rl/composer_loss.py`
34
+ correctly raises `NotImplementedError` for `alpha_sdpo>0`.
35
+
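The difference between the Dr. GRPO variant described above and vanilla GRPO can be sketched in a few lines. This is an illustration only, not the trainer's code: the function names are hypothetical, the `eps` default is arbitrary, and the ratio convention `r = pi_ref / pi_theta` (with samples drawn from the policy) is an assumption taken from the usual definition of the k1 and k3 estimators.

```python
import math
from typing import Sequence


def dr_grpo_advantages(rewards: Sequence[float]) -> list[float]:
    """Dr. GRPO style: centre rewards on the group mean, no std-dev scaling."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def vanilla_grpo_advantages(rewards: Sequence[float], eps: float = 1e-4) -> list[float]:
    """Vanilla GRPO style: centre on the mean, then divide by the group std-dev."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def k1_kl(logp_policy: Sequence[float], logp_ref: Sequence[float]) -> float:
    """k1 estimator: mean of -log r, i.e. mean of (log pi_theta - log pi_ref)."""
    diffs = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    return sum(diffs) / len(diffs)


def k3_kl(logp_policy: Sequence[float], logp_ref: Sequence[float]) -> float:
    """k3 estimator, shown for contrast: mean of (r - 1) - log r."""
    terms = []
    for lp, lr in zip(logp_policy, logp_ref):
        log_r = lr - lp
        terms.append(math.exp(log_r) - 1.0 - log_r)
    return sum(terms) / len(terms)


# A group whose rollouts are almost equally good: one reward differs by 0.01.
near_tie = [1.0, 1.0, 1.0, 0.99]
dr = dr_grpo_advantages(near_tie)
vanilla = vanilla_grpo_advantages(near_tie)
```

In the near-tie group the mean-centred advantages stay on the order of the 0.01 reward gap, while the std-normalized ones are rescaled to order one. That is the upweighting of small within-group differences that the tech report is quoted as objecting to.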
36
+ ### Relationship to ADR-006 (RL framework strategy) — `amends`
37
+
38
+ ADR-006 chose TRL + VeRL + PRIME-RL as three co-equal RL recipe hosts and
39
+ its three-recipe production matrix **remains in force**. This ADR amends one
40
+ specific point: ADR-006 implied any of the three recipes could host the full
41
+ 3-channel loss. That is **not true for the SDPO channel** — it requires full
42
+ logits, which only the TRL host exposes. This ADR records that the **SDPO
43
+ channel is TRL-hosted**; PRIME-RL hosts channels 1+3 (PG + trace-replay-DPO)
44
+ and is blocked on SDPO until upstream PRIME-RL exposes logits in `LossInputs`.
45
+ ADR-006's framework selection, Monarch decision, and matrix are preserved.
46
+
47
+ ### Relationship to ADR-007 (self-distillation losses) — `preserves`
48
+
49
+ ADR-007 decided *which* distillation losses exist (`generalized_jsd_loss`,
50
+ SimPO, TAID, Entropy-Aware OPD) and their `compose_loss` kwarg surface. This
51
+ ADR is about *wiring the SDPO JSD live into a Dr. GRPO rollout→update loop*
52
+ — orthogonal. ADR-007 is fully preserved; the SDPO channel here calls the
53
+ exact `generalized_jsd_loss` ADR-007 governs.
54
+
55
+ ## Decision Drivers
56
+
57
+ - The actual Composer RL algorithm is now known (Dr. GRPO); we should match
58
+ it rather than ship an unspecified "GRPO + DAPO patches" base.
59
+ - SDPO needs full logits → host constraint is real, not preference.
60
+ - The trainer class already exists; the cost is validation + Dr-GRPO config
61
+ + closing the alignment trust-gap, not authoring from scratch.
62
+ - Must be CPU-smoke-testable so we validate without GPU spend.
63
+
64
+ ## Considered Options
65
+
66
+ - **A. Target Dr. GRPO, host SDPO in the TRL `ComposerReplicationTrainer`** (chosen)
67
+ - **B. Keep the unspecified "GRPO + DAPO" base, host SDPO in PRIME-RL**
68
+ - **C. Vanilla GRPO base (with std-norm + length-standardization), TRL host**
69
+
70
+ ## Decision Outcome
71
+
72
+ Chosen: **Option A** — configure the parent TRL `GRPOTrainer` to the Dr. GRPO
73
+ variant (scale_rewards off / no std-norm, no length-standardization, k1 KL,
74
+ Adam, single-epoch via `num_iterations=1`) via a `make_dr_grpo_config` helper,
75
+ keep the SDPO + trace-replay channels in `ComposerReplicationTrainer`, close
76
+ the student/teacher alignment trust-gap with an explicit assertion + collator
77
+ guarantee, and add a CPU smoke test that instantiates the trainer and runs a
78
+ 1–2 rollout Dr. GRPO step with the SDPO channel on.
79
+
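A minimal sketch of what the `make_dr_grpo_config` helper might collect, written as a plain dictionary of keyword arguments so it runs without TRL installed. The function names here are hypothetical. `loss_type`, `scale_rewards`, `num_iterations`, `beta`, `optim` and `num_train_epochs` are believed to be `GRPOConfig` / `TrainingArguments` fields in recent TRL releases, but that should be verified against the pinned version, which is what the guard function is for. No config knob for selecting the k1 KL estimator is assumed to exist, so that part of the decision is not represented here and would likely need trainer code.

```python
from dataclasses import fields, is_dataclass
from typing import Any


def make_dr_grpo_kwargs(learning_rate: float, beta: float = 0.0) -> dict[str, Any]:
    """Keyword arguments intended for a GRPOConfig-like dataclass (assumed names)."""
    return {
        "loss_type": "dr_grpo",     # assumed: drops per-sequence length normalization
        "scale_rewards": False,     # assumed: disables std-dev advantage scaling
        "num_iterations": 1,        # one optimization pass per generated batch
        "num_train_epochs": 1,      # assumption: also needed so no prompt repeats
        "beta": beta,               # KL coefficient; the estimator is not set here
        "optim": "adamw_torch",     # behaves as Adam when weight decay is zero
        "weight_decay": 0.0,
        "learning_rate": learning_rate,
    }


def assert_knobs_exist(config_cls: type, kwargs: dict[str, Any]) -> None:
    """Fail loudly if any expected knob is not a field of the config dataclass."""
    if not is_dataclass(config_cls):
        raise TypeError(f"{config_cls.__name__} is not a dataclass")
    known = {f.name for f in fields(config_cls)}
    missing = sorted(set(kwargs) - known)
    if missing:
        raise RuntimeError(
            f"{config_cls.__name__} has no fields {missing}; "
            "the pinned TRL version may have renamed its Dr. GRPO knobs"
        )
```

With TRL installed, the intended use would be along the lines of `kw = make_dr_grpo_kwargs(learning_rate=...)`, then `assert_knobs_exist(GRPOConfig, kw)`, then `GRPOConfig(output_dir=..., **kw)`. Checking field names before construction turns a silent knob rename into an immediate, descriptive failure.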
80
+ ### Consequences
81
+
82
+ - **Positive**: Matches the *actual* Composer 2 RL algorithm (primary-sourced), not a guess.
83
+ - **Positive**: Reuses the already-written trainer; small, validatable delta.
84
+ - **Positive**: SDPO channel is exercised live, closing the gap between "loss math proven" and "loss runs inside RL."
85
+ - **Negative**: SDPO is TRL-only for now; the PRIME-RL decentralized path (ADR-006's headline for DiLoCo-shape runs) cannot use the SDPO channel until upstream exposes logits. This narrows the decentralized story to channels 1+3.
86
+ - **Negative**: A second teacher forward per error-site minibatch ~doubles the forward cost on error-bearing batches (acceptable; error batches are a minority and the teacher forward is `no_grad`).
87
+ - **Neutral**: TRL's `GRPOTrainer` config knobs for Dr. GRPO (e.g. `scale_rewards`) are version-sensitive, so the TRL version is pinned and the knob names are guarded by a version check.
88
+
89
+ ## Pros and Cons of the Options
90
+
91
+ ### A. Dr. GRPO + TRL host
92
+ - Good: primary-sourced algorithm match; reuses existing trainer; full logits available for SDPO.
93
+ - Good: CPU-smoke-testable end-to-end.
94
+ - Bad: SDPO confined to TRL until PRIME-RL exposes logits.
95
+
96
+ ### B. Unspecified base + PRIME-RL host
97
+ - Good: PRIME-RL is the decentralized/DiLoCo-shape host (ADR-006).
98
+ - Bad: **PRIME-RL cannot host SDPO** (log-probs only) — `NotImplementedError` today. Fatal for the SDPO channel.
99
+ - Bad: leaves the RL algorithm unspecified despite the tech report resolving it.
100
+
101
+ ### C. Vanilla GRPO + TRL host
102
+ - Good: TRL default; least config work.
103
+ - Bad: std-dev advantage normalization "massively upweights small behavioral differences within equal-correctness groups" (tech report); length-standardization injects length bias. Diverges from the known-good recipe for no benefit.
104
+
105
+ ## Acceptance gate (must be green before status flips to accepted)
106
+
107
+ - [ ] `make_dr_grpo_config(...)` helper exists and sets: no std-norm advantage scaling, no length-standardization, Adam, `num_iterations=1` (single-epoch), k1 KL — each value asserted in a unit test against the resulting `GRPOConfig`.
108
+ - [ ] `ComposerReplicationTrainer._compute_sdpo_loss` lines 158-160 trust-gap closed: an explicit shape-alignment assertion (not just a warning-and-skip) with a unit test that a misaligned batch raises rather than silently zeroes.
109
+ - [ ] CPU smoke test instantiates `ComposerReplicationTrainer` with a real `trl.GRPOTrainer` parent on Qwen2.5-0.5B, runs ≥1 Dr. GRPO update step with `alpha_sdpo>0` on an error-bearing batch, asserts: total loss finite, `loss/sdpo_kl > 0` logged on ≥1 step, a param moved. (Mirrors `examples/sdpo_real_trace_train_smoke`.)
110
+ - [ ] TRL version pinned in `[train]` extra; the Dr-GRPO config knobs guarded with a version check that fails loudly if the knob names drift.
111
+ - [ ] `recipes/prime_rl/composer_loss.py` `NotImplementedError(alpha_sdpo>0)` path has a test asserting it raises with a message pointing to the TRL host (documents the ADR-006 amendment in code).
112
+
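The second gate above asks for a hard failure on misaligned student/teacher batches instead of a warning-and-skip. A minimal sketch of such a check follows. The function name is hypothetical, and it takes plain shape tuples so it can be tested without a tensor library; in the trainer it would presumably be called with `tuple(student_logits.shape)` and `tuple(teacher_logits.shape)`.

```python
def assert_sdpo_aligned(
    student_shape: tuple[int, ...],
    teacher_shape: tuple[int, ...],
) -> None:
    """Raise if student and teacher logits do not share (batch, seq_len, vocab)."""
    for name, shape in (("student", student_shape), ("teacher", teacher_shape)):
        if len(shape) != 3:
            raise ValueError(
                f"{name} logits must be (batch, seq_len, vocab), got shape {shape}"
            )
    if student_shape != teacher_shape:
        raise ValueError(
            "SDPO student/teacher logits are misaligned: "
            f"student {student_shape} vs teacher {teacher_shape}. "
            "The collator is expected to align lengths before the loss is computed."
        )
```

Raising here means a collator bug surfaces as a failed step; a skipped batch would instead show up only as an SDPO term that quietly contributes nothing.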
113
+ ## More Information
114
+
115
+ - `research/08-sdpo-grpo-integration.md` — full integration design + `ComposerGRPOTrainer` sketch.
116
+ - `research/10-composer2-techreport-mining.md` — Dr. GRPO resolution (arXiv:2603.24477).
117
+ - Dr. GRPO: Liu et al., *Understanding R1-Zero-like training*, arXiv:2503.20783.
118
+ - SDPO: Hübotter et al., arXiv:2601.20802; OPSD: Zhao et al., arXiv:2601.18734.
docs/adrs/ADR-009-layered-hint-generator.md ADDED
@@ -0,0 +1,106 @@
1
+ ---
2
+ status: proposed
3
+ date: 2026-05-29
4
+ deciders: [Codeseys, ARIA]
5
+ ---
6
+
7
+ # ADR-009: Adopt a layered HintGenerator architecture for SDPO textual feedback
8
+
9
+ ## Context and Problem Statement
10
+
11
+ Composer 2.5's targeted-textual-feedback method inserts a short natural-language
12
+ hint at each error turn; the hint-conditioned forward pass becomes the SDPO
13
+ teacher. **How Cursor generates that hint is unstated in every Cursor artifact**:
14
+ neither of the two blogs nor the full Composer 2 technical report describes it
15
+ (`research/10-composer2-techreport-mining.md` confirms it is absent). The cited papers
16
+ bracket the answer: OPSD (arXiv:2601.18734) conditions the teacher on
17
+ ground-truth/reference; SDPO (arXiv:2601.20802) generalizes to environment
18
+ feedback and ablates three feedback sources, including the "successful sibling
19
+ rollout as implicit feedback" trick. So hint generation is genuinely *our*
20
+ design problem, and the project's own audit calls it "the single most important
21
+ open question for replication."
22
+
23
+ The framework today has `composer_replication/hint_generator.py`: a flat
24
+ registry of 5 templates (`tool_not_found`, `json_decode`, `type_error`,
25
+ `runtime_error`, `repeated_failure`) behind a `dispatch(error_kind, ctx)`
26
+ function. This covers the easy tool-error case but: (a) is not a typed
27
+ interface the collator can compose against (the collator's
28
+ `CollatorConfig.hint_generator` hook takes a callable, but there's no
29
+ `Protocol`); (b) has no path for style/communication errors (which need
30
+ natural language, not templates); (c) has no fallback when no template matches;
31
+ (d) cannot use the SDPO sibling-bootstrap lever. Research doc
32
+ `research/07-sdpo-hint-generator.md` designs the upgrade.
33
+
34
+ ### Relationship to existing code — `preserves` + extends
35
+
36
+ This ADR **preserves** the existing 5 templates and `dispatch`/`register`
37
+ functions (they become the `TemplateHintGenerator` layer) and the
38
+ `CollatorConfig.hint_generator` hook signature. It adds a typed
39
+ `HintGenerator` Protocol and composite layering on top. No prior ADR governs
40
+ hint generation, so there is nothing to supersede.
41
+
42
+ ## Decision Drivers
43
+
44
+ - Hint generation is the #1 replication gap; the design must be honest about
45
+ which behavior classes each layer can actually cover.
46
+ - Templates are free and cover tool errors; style/communication need an LLM;
47
+ some sites have no external hint source at all.
48
+ - Must slot into the existing collator hook with zero collator changes.
49
+ - Cost-awareness: most error sites should hit the free template path.
50
+
51
+ ## Considered Options
52
+
53
+ - **A. Layered HintGenerator: template → raw-error → LLM-judge → SDPO sibling-bootstrap, behind a typed Protocol** (chosen)
54
+ - **B. Keep flat template `dispatch` only** (status quo)
55
+ - **C. LLM-judge only (one strong model generates every hint)**
56
+
57
+ ## Decision Outcome
58
+
59
+ Chosen: **Option A** — a `HintGenerator` Protocol with `.generate(error_context) -> str | None`,
60
+ implemented as a `CompositeHintGenerator` that tries layers cheapest-first:
61
+ (1) `TemplateHintGenerator` (the existing 5 templates + more), (2) raw
62
+ tool-error text as the hint, (3) `LLMJudgeHintGenerator` (OpenRouter, ~$0.0005/site,
63
+ for style/communication/effort sites templates can't cover), (4) SDPO
64
+ sibling-bootstrap (use a successful sibling rollout as implicit feedback when
65
+ no external hint source exists). The composite exposes `as_collator_hook()`
66
+ returning a callable matching the existing `CollatorConfig.hint_generator`
67
+ signature.
68
+
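The layering described above can be sketched as follows. This is an assumption-laden illustration, not the project's code: the `HintContext` fields, the template string, the stub judge, and the hook signature (a callable taking one context) are all hypothetical, since the ADR does not spell them out. The sibling-bootstrap layer is omitted because it depends on rollout data.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Protocol, Sequence


@dataclass(frozen=True)
class HintContext:
    """Hypothetical error context; the real fields are not given in the ADR."""
    error_kind: str
    raw_error: str = ""


class HintGenerator(Protocol):
    def generate(self, error_context: HintContext) -> Optional[str]: ...


class TemplateHintGenerator:
    """Layer 1: free lookup of a fixed template by error kind."""
    def __init__(self, templates: dict[str, str]) -> None:
        self._templates = templates

    def generate(self, error_context: HintContext) -> Optional[str]:
        return self._templates.get(error_context.error_kind)


class RawErrorHintGenerator:
    """Layer 2: use the raw tool-error text itself as the hint, if there is any."""
    def generate(self, error_context: HintContext) -> Optional[str]:
        return error_context.raw_error or None


class StubJudgeHintGenerator:
    """Stand-in for layer 3; counts calls instead of contacting a model."""
    def __init__(self) -> None:
        self.calls = 0

    def generate(self, error_context: HintContext) -> Optional[str]:
        self.calls += 1
        return f"(judge hint for {error_context.error_kind})"


class CompositeHintGenerator:
    """Try layers in the given (cheapest-first) order; return the first hint."""
    def __init__(self, layers: Sequence[HintGenerator]) -> None:
        self._layers = list(layers)

    def generate(self, error_context: HintContext) -> Optional[str]:
        for layer in self._layers:
            hint = layer.generate(error_context)
            if hint is not None:
                return hint
        return None

    def as_collator_hook(self) -> Callable[[HintContext], Optional[str]]:
        return self.generate


judge = StubJudgeHintGenerator()
composite = CompositeHintGenerator([
    TemplateHintGenerator({"tool_not_found": "The tool name does not exist; check the tool list."}),
    RawErrorHintGenerator(),
    judge,
])
hook = composite.as_collator_hook()
tool_hint = hook(HintContext("tool_not_found", raw_error="no such tool: grepp"))
style_hint = hook(HintContext("style"))  # no template, no raw error: reaches the judge
```

Because each layer returns `None` to mean "not mine", adding or reordering layers does not change the composite, and the expensive layer is reached only when every cheaper one declines.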
69
+ ### Consequences
70
+
71
+ - **Positive**: Covers all four Composer behavior classes (tool use / style / communication / effort), not just tool errors.
72
+ - **Positive**: Cost-bounded — free template path handles the majority; LLM-judge only on the residual.
73
+ - **Positive**: Zero collator change (drops into the existing hook).
74
+ - **Negative**: The LLM-judge layer introduces a network dependency + per-site cost + nondeterminism into data prep; must be optional and cached. Mitigated by ordering it after the free layers and adding a disk cache keyed on the error context hash.
75
+ - **Negative**: Sibling-bootstrap requires multiple rollouts per prompt to have a successful sibling — only available in the RL-rollout path, not in offline-trace ingestion. Documented as RL-path-only.
76
+ - **Neutral**: More moving parts than a flat dict; mitigated by the Protocol + per-layer unit tests.
77
+
78
+ ## Pros and Cons of the Options
79
+
80
+ ### A. Layered Protocol
81
+ - Good: honest behavior-class coverage; cost-bounded; composable; testable per layer.
82
+ - Good: preserves existing templates as the first layer.
83
+ - Bad: more surface area; LLM layer adds nondeterminism (cached + optional).
84
+
85
+ ### B. Flat templates only (status quo)
86
+ - Good: free, deterministic, already shipping.
87
+ - Bad: cannot cover style/communication/effort sites at all — exactly the behaviors Composer's blog says the method targets.
88
+
89
+ ### C. LLM-judge only
90
+ - Good: maximal coverage, natural language for every site.
91
+ - Bad: cost on every site (templates would have been free); nondeterministic data prep; network dependency on the hot path. Wasteful for the common tool-error case a template nails.
92
+
93
+ ## Acceptance gate (must be green before status flips to accepted)
94
+
95
+ - [ ] `HintGenerator` Protocol defined with `.generate(error_context: HintContext) -> str | None`; `mypy`/pyright clean.
96
+ - [ ] `TemplateHintGenerator` wraps the existing 5 templates; a test asserts byte-identical output to the current `dispatch()` for all 5 kinds (no regression).
97
+ - [ ] `CompositeHintGenerator` tries layers in cost order; a test asserts a tool_not_found site is served by the template layer (no LLM call) and a style site falls through to the judge layer (mocked).
98
+ - [ ] `LLMJudgeHintGenerator` has a disk cache keyed on error-context hash; a test asserts a second identical call hits the cache (zero network).
99
+ - [ ] `as_collator_hook()` returns a callable accepted by `CollatorConfig.hint_generator` without collator changes; an end-to-end test runs ingestion → collator-with-composite-hook → a non-empty `sdpo_loss_mask` on a real error trace.
100
+ - [ ] Sibling-bootstrap layer is gated behind an explicit `enable_sibling_bootstrap` flag and documented as RL-rollout-path-only; a unit test asserts it returns None in the offline-trace path.
101
+
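The cache gate for the judge layer could look roughly like the sketch below. The class and function names are hypothetical, the error context is simplified to a JSON-serializable dict, and the judge is any callable, so a test can pass a counting stub where production code would pass a network client.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def context_key(error_context: dict) -> str:
    """Stable hash of the error context (key order does not affect the result)."""
    payload = json.dumps(error_context, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedJudge:
    """Wrap a judge callable with a one-file-per-context disk cache."""

    def __init__(self, judge: Callable[[dict], str], cache_dir: Path) -> None:
        self._judge = judge
        self._dir = Path(cache_dir)
        self._dir.mkdir(parents=True, exist_ok=True)

    def generate(self, error_context: dict) -> str:
        path = self._dir / f"{context_key(error_context)}.json"
        if path.exists():
            # Cache hit: no call to the underlying judge.
            return json.loads(path.read_text(encoding="utf-8"))["hint"]
        hint = self._judge(error_context)
        path.write_text(json.dumps({"hint": hint}), encoding="utf-8")
        return hint
```

Keying on a hash of the serialized context also makes data prep repeatable: re-running ingestion over the same traces yields the same hints without contacting the judge again.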
102
+ ## More Information
103
+
104
+ - `research/07-sdpo-hint-generator.md` — full taxonomy, template strings, judge prompt, cost analysis.
105
+ - `research/09-composer-blog-delta-2026.md` — the SDPO sibling-bootstrap lever.
106
+ - Existing: `composer_replication/hint_generator.py`, `composer_replication/trainer/data_collator.py` (`CollatorConfig.hint_generator`).
docs/adrs/ADR-010-feature-deletion-datagen.md ADDED
@@ -0,0 +1,108 @@
1
+ ---
2
+ status: proposed
3
+ date: 2026-05-29
4
+ deciders: [Codeseys, ARIA]
5
+ ---
6
+
7
+ # ADR-010: Build a FeatureDeletionEnv synthetic-data subsystem over OSS SWE substrates
8
+
9
+ ## Context and Problem Statement
10
+
11
+ Composer 2.5 trained on "25× more synthetic tasks than Composer 2," with
12
+ **Feature Deletion** as a named generator: take a repo with passing tests,
13
+ delete code or features, and have the agent reimplement them until the tests pass;
14
+ the tests are the verifiable reward. The Cursor blog also notes the data is a
15
+ **dynamic difficulty curriculum** ("we both select for and create harder
16
+ tasks dynamically throughout the run"); the Composer 2 tech report adds that the
17
+ curriculum is keyed on the number of rollout turns and the thinking-token count
18
+ (`research/10-composer2-techreport-mining.md`). The blog reports real
19
+ reward-hacking (decompiling Java bytecode, reverse-engineering Python
20
+ type-check caches to recover deleted signatures) mitigated by "agentic
21
+ monitoring tools."
22
+
23
+ The framework has **no synthetic-data-generation subsystem at all** — this is
24
+ genuinely greenfield. Codeseys' ask is to bring Composer's dataset-generation
25
+ approach into the framework as a real, reusable subsystem (useful beyond this
26
+ project). Research doc `research/06-feature-deletion-datagen.md` designs it
27
+ and identifies ready OSS substrates: SWE-bench-Lite, SWE-Gym (2.4k), R2E-Gym
28
+ (8.1k), SWE-rebench (21.3k), Nemotron/OpenHands (59k) — each ships
29
+ `(repo, gold_patch, FAIL_TO_PASS, PASS_TO_PASS, test_cmd)` tuples that invert
30
+ directly into Feature-Deletion tasks (revert the gold patch → broken repo;
31
+ `FAIL_TO_PASS` tests = the reward signal; `PASS_TO_PASS` = the don't-break-
32
+ existing-behavior guard).
33
+
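The inversion described above is mostly a relabelling of fields, which the following sketch makes concrete. The dataclass layouts and the `setup_cmd` string are assumptions for illustration; the real substrate schemas differ per dataset, and the command assumes the checkout starts at the solved (post-patch) state with the gold patch saved as `gold.patch`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubstrateRecord:
    """One row of an OSS SWE dataset, reduced to the tuple named in the text."""
    repo: str
    gold_patch: str
    fail_to_pass: tuple[str, ...]
    pass_to_pass: tuple[str, ...]
    test_cmd: str


@dataclass(frozen=True)
class FeatureDeletionTask:
    """Hypothetical task layout produced by inverting a substrate record."""
    repo: str
    revert_patch: str              # the gold patch, to be applied in reverse
    reward_tests: tuple[str, ...]  # must go from failing to passing
    guard_tests: tuple[str, ...]   # must keep passing throughout
    test_cmd: str
    setup_cmd: str


def invert(record: SubstrateRecord) -> FeatureDeletionTask:
    """Turn a solved-issue record into a reimplement-to-pass task."""
    return FeatureDeletionTask(
        repo=record.repo,
        revert_patch=record.gold_patch,
        reward_tests=record.fail_to_pass,
        guard_tests=record.pass_to_pass,
        test_cmd=record.test_cmd,
        # Assumption: gold.patch holds revert_patch; -R applies it in reverse.
        setup_cmd="git apply -R gold.patch",
    )
```

Nothing new is generated in this step: the tests, the patch and the test command all come from the substrate, which is why the reward signal can be trusted without writing new tests.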
34
+ ### Relationship to prior ADRs — `preserves`
35
+
36
+ No prior ADR governs synthetic data generation. ADR-002 (trace source) chose
37
+ Claude Code JSONL for *ingestion of real traces* — orthogonal to *generating*
38
+ tasks. This ADR preserves everything; it adds a new `composer_replication.datagen`
39
+ package.
40
+
41
+ ## Decision Drivers
42
+
43
+ - Reuse existing verified OSS substrates rather than scraping repos from scratch — they already guarantee test-exercises-the-code via `FAIL_TO_PASS`.
44
+ - The env must fit the framework's RL side (TRL GRPOTrainer + verifiers / OpenEnv-compatible) so generated tasks feed the same training loop.
45
+ - Reward-hacking is a *real* reported failure, not theoretical — safeguards must be in the design, not deferred.
46
+ - The online difficulty curriculum is part of the recipe, not optional.
47
+ - Subsystem should be reusable for the owner's other project (general "verifiable-reward task generation over a code repo").
48
+
49
+ ## Considered Options
50
+
51
+ - **A. `FeatureDeletionEnv` that inverts OSS SWE substrates (revert gold patch) + online pass-rate difficulty gate + sandbox/AST reward-hacking safeguards** (chosen)
52
+ - **B. Greenfield repo-scraping generator (clone arbitrary GitHub repos, delete AST nodes, hope tests cover them)**
53
+ - **C. Skip generation; reuse SWE-bench-lite tasks as-is without a deletion/inversion layer**
54
+
55
+ ## Decision Outcome
56
+
57
+ Chosen: **Option A** — a `composer_replication.datagen` package with a
58
+ `FeatureDeletionTask` dataclass, a `FeatureDeletionEnv` (Gym/OpenEnv-style
59
+ `reset`/`step`/`reward` where reward = masked test-pass fraction), substrate
60
+ adapters that invert the 5 OSS datasets by reverting their gold patch, a
61
+ 4-gate solvability validator (broken repo fails `FAIL_TO_PASS`, passes
62
+ `PASS_TO_PASS`, gold patch restores green, deletion is reachable from tests),
63
+ an online pass-rate difficulty gate, and reward-hacking safeguards
64
+ (pre-task scrub of `__pycache__`/`.mypy_cache`/`.class`/`.git`; allowlisted
65
+ sandbox without `find`/`strings`/`unzip`/decompilers; AST provenance monitor
66
+ that masks reward when deleted symbols reappear via non-implementation paths).
67
+ A TRL `reward_fn(prompts, completions, **kwargs) -> list[float]` adapter wires
68
+ it to the RL loop.
69
+
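A sketch of the reward and the four solvability gates, under stated assumptions. Test results are modelled as a mapping from test id to pass/fail. The policy of zeroing the reward when any guard test breaks is an assumption (the ADR only calls `PASS_TO_PASS` a guard), and `deletion_reachable` is taken as an input because the ADR does not specify how reachability is determined. All names are hypothetical.

```python
from typing import Mapping, Sequence


def masked_reward(
    results: Mapping[str, bool],
    reward_tests: Sequence[str],
    guard_tests: Sequence[str],
    hack_suspected: bool = False,
) -> float:
    """Fraction of reward tests passing, masked to 0.0 on a suspected hack."""
    if hack_suspected or not reward_tests:
        return 0.0
    # Assumption: breaking existing behaviour forfeits the reward entirely.
    if any(not results.get(t, False) for t in guard_tests):
        return 0.0
    passed = sum(1 for t in reward_tests if results.get(t, False))
    return passed / len(reward_tests)


def failed_gates(
    broken: Mapping[str, bool],
    restored: Mapping[str, bool],
    reward_tests: Sequence[str],
    guard_tests: Sequence[str],
    deletion_reachable: bool,
) -> list[str]:
    """Return the names of the solvability gates a candidate task fails."""
    failures = []
    if any(broken.get(t, False) for t in reward_tests):
        failures.append("broken_fails_fail_to_pass")
    if not all(broken.get(t, False) for t in guard_tests):
        failures.append("broken_passes_pass_to_pass")
    if not all(restored.get(t, False) for t in list(reward_tests) + list(guard_tests)):
        failures.append("gold_patch_restores_green")
    if not deletion_reachable:
        failures.append("deletion_reachable_from_tests")
    return failures
```

A task is kept only when `failed_gates(...)` returns an empty list. Returning the gate names instead of a bare boolean makes rejected tasks easier to triage.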
70
+ ### Consequences
71
+
72
+ - **Positive**: Verifiable reward for free (tests already exist + are known to exercise the code via `FAIL_TO_PASS`); no need to generate or trust new tests.
73
+ - **Positive**: Reusable general subsystem — "invert a solved-repo dataset into a reimplement-to-pass task" works for the owner's other project.
74
+ - **Positive**: Online difficulty gate matches the actual recipe.
75
+ - **Negative**: Bounded to what the OSS substrates cover (Python-dominant; SWE-bench is Python/JS-heavy). Other languages need new substrates. Documented as a known coverage limit.
76
+ - **Negative**: Running tests in a sandbox requires Docker images per substrate; CPU-pool generation has real wall-clock cost (~15 node-days to invert all 21k SWE-rebench tasks per research/06). Mitigated by reusing the substrates' published Docker images and generating lazily.
77
+ - **Negative**: Reward-hacking safeguards are a moving target; the AST provenance monitor is heuristic and will have false negatives. Mitigated by treating it as defense-in-depth (sandbox lockdown is the primary control) and logging suspected hacks for review.
78
+ - **Neutral**: Adds a `[datagen]` optional extra (datasets, docker SDK).
79
+
80
+ ## Pros and Cons of the Options
81
+
82
+ ### A. Invert OSS substrates
83
+ - Good: tests guaranteed to exercise the deleted code; verified datasets; reusable; matches recipe.
84
+ - Bad: language/domain coverage bounded by the substrates; sandbox/Docker cost.
85
+
86
+ ### B. Greenfield repo scraping
87
+ - Good: unbounded repo universe.
88
+ - Bad: no guarantee remaining tests exercise the deleted code (the hard part SWE-bench already solved); huge validation burden; reinvents SWE-Gym. High effort, low marginal value.
89
+
90
+ ### C. Reuse SWE-bench-lite as-is
91
+ - Good: zero build.
92
+ - Bad: not *Feature Deletion* — it's issue-fixing; no controlled difficulty knob; no deletion mechanic; doesn't bring Composer's data-gen method in, just consumes an existing benchmark. Fails the actual ask.
93
+
94
+ ## Acceptance gate (must be green before status flips to accepted)
95
+
96
+ - [ ] `FeatureDeletionTask` dataclass + `FeatureDeletionEnv` (`reset`/`step`/`reward`) implemented; reward = masked test-pass fraction with a unit test on a synthetic mini-repo.
97
+ - [ ] One substrate adapter (SWE-bench-Lite, smallest) inverts ≥1 real task: revert gold patch → broken repo, a test asserts the broken repo FAILS `FAIL_TO_PASS` and PASSES `PASS_TO_PASS`, and applying the gold patch restores green. (Runs in the substrate's Docker image; gated `skipif` on docker availability for CI.)
98
+ - [ ] 4-gate solvability validator implemented; a test asserts a task with an unreachable deletion (no test exercises it) is rejected.
99
+ - [ ] Reward-hacking safeguard: a test asserts the sandbox lacks `find`/`strings`/`unzip` and that `__pycache__`/`.mypy_cache` are scrubbed pre-task; the AST provenance monitor masks reward on a crafted "symbol reappears via import of a sibling cache" hack.
100
+ - [ ] Online difficulty gate: a unit test asserts tasks are rankable by a difficulty signal (turns/thinking-token proxy) and the gate up-weights the hard tail.
101
+ - [ ] TRL `reward_fn(prompts, completions, **kwargs) -> list[float]` adapter exists; a test asserts it returns one float in [0,1] per completion = test-pass fraction.
102
+ - [ ] `[datagen]` optional extra added to `pyproject.toml`; `pip install -e .[datagen]` resolves.
103
+
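Two of the gates above lend themselves to a short sketch: up-weighting the hard tail and adapting a scorer to the `reward_fn(prompts, completions, **kwargs) -> list[float]` shape. Both functions are hypothetical. The weighting rule (proportional to one minus the observed pass rate, with a floor) is one simple choice and is not taken from the source, and completions are assumed to be plain strings.

```python
from typing import Callable


def difficulty_weights(
    pass_rates: dict[str, float],
    floor: float = 0.05,
) -> dict[str, float]:
    """Sampling weights that favour tasks with a low observed pass rate."""
    raw = {task: max(floor, 1.0 - rate) for task, rate in pass_rates.items()}
    total = sum(raw.values())
    return {task: weight / total for task, weight in raw.items()}


def make_reward_fn(score: Callable[[str], float]) -> Callable[..., list[float]]:
    """Wrap a per-completion scorer in the reward_fn call shape named above."""

    def reward_fn(prompts: list, completions: list, **kwargs) -> list[float]:
        # One reward per completion, clamped to [0, 1].
        return [min(1.0, max(0.0, float(score(c)))) for c in completions]

    return reward_fn
```

In a real run `score` would execute the task's tests in the sandbox and return the masked test-pass fraction; here it is left as a parameter so the adapter can be unit-tested with a stub.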
104
+ ## More Information
105
+
106
+ - `research/06-feature-deletion-datagen.md` — full design: substrates w/ HF ids + licenses, deletion mechanics, safeguards, env + reward_fn sketches, cost verdict.
107
+ - Substrates: SWE-bench (arXiv:2310.06770), SWE-Gym (arXiv:2412.21139), R2E-Gym, SWE-rebench.
108
+ - Composer data-gen: `docs/COMPOSER_RECIPE_MAPPING.md` §2, `research/10-composer2-techreport-mining.md` §1.2 (curriculum).
docs/adrs/README.md ADDED
@@ -0,0 +1,16 @@
1
+ # Architecture Decision Records
2
+
3
+ | # | Title | Status | Date |
4
+ |---|-------|--------|------|
5
+ | [ADR-001](ADR-001-gpu-venue.md) | GPU venue | accepted | — |
6
+ | [ADR-002](ADR-002-trace-source.md) | Trace source | accepted | — |
7
+ | [ADR-003](ADR-003-diloco-impl.md) | DiLoCo implementation | accepted | — |
8
+ | [ADR-004](ADR-004-replaysim-normalization.md) | ReplaySim normalization | accepted | — |
9
+ | [ADR-005](ADR-005-serverless-diloco.md) | Serverless DiLoCo | accepted | — |
10
+ | [ADR-006](ADR-006-rl-frameworks.md) | RL framework strategy: TRL + VeRL + PRIME-RL | accepted (amended-by ADR-008) | 2026-05-26 |
11
+ | [ADR-007](ADR-007-self-distillation-losses.md) | Self-distillation losses landscape | accepted | 2026-05-26 |
12
+ | [ADR-008](ADR-008-drgrpo-sdpo-live-channel.md) | Target Dr. GRPO + host live SDPO channel in TRL trainer | proposed | 2026-05-29 |
13
+ | [ADR-009](ADR-009-layered-hint-generator.md) | Layered HintGenerator for SDPO textual feedback | proposed | 2026-05-29 |
14
+ | [ADR-010](ADR-010-feature-deletion-datagen.md) | FeatureDeletionEnv synthetic-data subsystem over OSS SWE substrates | proposed | 2026-05-29 |
15
+
16
+ Sorted by number ascending. ADRs are immutable after `accepted`; supersede or amend rather than edit.