docs(adr): add ADR-008/009/010 (Dr.GRPO+SDPO, layered hints, FeatureDeletionEnv)
Phase-4 of the deep-work-loop. Three proposed ADRs translating the Composer 2.5
data-gen + targeted-textual-feedback research into durable decisions, each with
mechanized acceptance gates (status stays `proposed` until gates green):
- ADR-008: target Dr. GRPO (length-std off, no std-norm, k1 KL, Adam,
single-epoch — per arXiv:2603.24477) and host the live SDPO channel in the
TRL ComposerReplicationTrainer. amends ADR-006 (SDPO needs full logits ->
TRL-only; PRIME-RL hosts channels 1+3). preserves ADR-007.
- ADR-009: layered HintGenerator (template -> raw-error -> LLM-judge ->
SDPO sibling-bootstrap) behind a typed Protocol; preserves the existing
templates + CollatorConfig.hint_generator hook.
- ADR-010: FeatureDeletionEnv synthetic-data subsystem inverting OSS SWE
substrates (SWE-bench-Lite/SWE-Gym/R2E-Gym/SWE-rebench) + online difficulty
gate + reward-hacking safeguards.
ADR-006 patched: Amended-by header + stale matrix row qualified. Index added.
Prior-ADR audit (adr-methodology 3a): 008 amends 006 / preserves 007;
009+010 preserve all. Greenfield-check finding: ComposerReplicationTrainer
already EXISTS+complete (008 = validate+config, not author); hint_generator.py
exists as flat dispatch (009 = extend to layered); FeatureDeletionEnv absent
(010 = build).
Note: cross-family adversarial ADR review was blocked by OpenRouter 402
(out of credits); substituted a focused self-administered pre-mortem
(found + fixed ADR-006 stale-matrix-row). Re-run cross-family review when
credits restored.
@@ -3,6 +3,7 @@
 **Status**: Accepted
 **Date**: 2026-05-26
 **Wave**: 13
+**Amended-by**: ADR-008 — the implication that any of the three recipes can host the full 3-channel loss is amended: the **SDPO channel requires full vocabulary logits and is TRL-hosted only**. PRIME-RL hosts channels 1+3 (PG + trace-replay-DPO); its `LossInputs` exposes log-probs, not logits, so `recipes/prime_rl/composer_loss.py` raises `NotImplementedError` for `alpha_sdpo>0` until upstream PRIME-RL exposes logits. The framework selection, Monarch decision, and three-recipe matrix below are preserved.
 
 ## Context
 
@@ -100,7 +101,7 @@ rewarder) maps naturally onto Monarch primitives.
 |---|---|
 | Quick start, single-cluster, ≤7B | TRL Recipe A |
 | Production multi-node, ≤32B | VeRL Recipe B |
-| Decentralized / DiLoCo-shape, any size | PRIME-RL recipe (NEW) |
+| Decentralized / DiLoCo-shape, any size | PRIME-RL recipe (NEW) — channels 1+3 only; **SDPO channel TRL-hosted, see ADR-008** |
 | Coordination-heavy multi-actor RL | Monarch + any of the above |
 
 ### Trade-offs explicitly accepted

@@ -0,0 +1,118 @@
---
status: proposed
date: 2026-05-29
amends: ADR-006
deciders: [Codeseys, ARIA]
---

# ADR-008: Target Dr. GRPO as the RL base and host the live SDPO channel in the TRL trainer

## Context and Problem Statement

Composer 2.5's "Targeted RL with Textual Feedback" (= SDPO / on-policy
self-distillation) is the recipe's secret sauce. The framework already
*proved the SDPO loss math end-to-end* (`compose_loss` + a real
forward/backward/optimizer step on Qwen2.5-0.5B, `examples/sdpo_real_trace_train_smoke`),
and `composer_replication/trainer/composer_trainer.py::ComposerReplicationTrainer`
already overrides `_compute_loss` to add `α·sdpo + β·replay` on top of the
parent GRPO loss. What is **not** done: (a) the trainer has never been
instantiated against a real `trl.GRPOTrainer` and smoke-tested; (b) the
parent RL objective is unspecified ("GRPO + DAPO patches" in the docstring);
(c) the SDPO channel carries a documented trust-gap (composer_trainer.py
lines 158-160 assume the collator aligns student/teacher lengths).

Research doc `research/10-composer2-techreport-mining.md` (mining the newly
available Composer 2 technical report, arXiv:2603.24477) **resolves the RL
algorithm**: it is **Dr. GRPO** — GRPO with the length-standardization term
removed, **no std-dev advantage normalization**, Adam, single-epoch (a
prompt is never trained twice), KL-to-reference via the **k1 estimator
(`−log r`)** not k3, with DAPO overlong-masking explicitly tried and
rejected. Research doc `research/08-sdpo-grpo-integration.md` establishes
that the SDPO channel needs **full vocabulary logits** and therefore must
live in the TRL `GRPOTrainer` subclass — the PRIME-RL recipe's `LossInputs`
exposes per-token log-probs only and `recipes/prime_rl/composer_loss.py`
correctly raises `NotImplementedError` for `alpha_sdpo>0`.

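
To make the Dr. GRPO ingredients named above concrete, here is a minimal sketch in plain Python. It contrasts std-normalized GRPO advantages with mean-centered-only Dr. GRPO advantages, and the k1 and k3 KL estimators. The function names are illustrative, not the framework's API, and the ratio convention (`r = p_ref / p_policy`, evaluated on tokens sampled from the policy) is an assumption borrowed from the common description of these estimators; the tech report's exact convention is not restated here.

```python
import math
from statistics import mean, pstdev
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Vanilla GRPO: center by the group mean, then divide by the group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards: List[float]) -> List[float]:
    """Dr. GRPO: center by the group mean only (no std-dev normalization)."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def kl_k1(logp_policy: float, logp_ref: float) -> float:
    """k1 estimator, -log r with r = p_ref / p_policy (assumed convention)."""
    return logp_policy - logp_ref


def kl_k3(logp_policy: float, logp_ref: float) -> float:
    """k3 estimator, (r - 1) - log r; non-negative for every sample."""
    log_r = logp_ref - logp_policy
    return math.expm1(log_r) - log_r


# A group whose rollouts are almost equally good: std-normalization inflates
# the tiny reward gap, while mean-centering keeps it small.
rewards = [1.0, 0.98, 0.98, 0.98]
print(dr_grpo_advantages(rewards))
print(grpo_advantages(rewards))
```

Length-standardization (dividing each sequence's loss by its own length) is the other term Dr. GRPO drops; it lives in the loss aggregation rather than in the advantage, so it is not shown here.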

### Relationship to ADR-006 (RL framework strategy) — `amends`

ADR-006 chose TRL + VeRL + PRIME-RL as three co-equal RL recipe hosts and
its three-recipe production matrix **remains in force**. This ADR amends one
specific point: ADR-006 implied any of the three recipes could host the full
3-channel loss. That is **not true for the SDPO channel** — it requires full
logits, which only the TRL host exposes. This ADR records that the **SDPO
channel is TRL-hosted**; PRIME-RL hosts channels 1+3 (PG + trace-replay-DPO)
and is blocked on SDPO until upstream PRIME-RL exposes logits in `LossInputs`.
ADR-006's framework selection, Monarch decision, and matrix are preserved.

### Relationship to ADR-007 (self-distillation losses) — `preserves`

ADR-007 decided *which* distillation losses exist (`generalized_jsd_loss`,
SimPO, TAID, Entropy-Aware OPD) and their `compose_loss` kwarg surface. This
ADR is about *wiring the SDPO JSD live into a Dr. GRPO rollout→update loop*
— orthogonal. ADR-007 is fully preserved; the SDPO channel here calls the
exact `generalized_jsd_loss` ADR-007 governs.

## Decision Drivers

- The actual Composer RL algorithm is now known (Dr. GRPO); we should match
  it rather than ship an unspecified "GRPO + DAPO patches" base.
- SDPO needs full logits → host constraint is real, not preference.
- The trainer class already exists; the cost is validation + Dr-GRPO config
  + closing the alignment trust-gap, not authoring from scratch.
- Must be CPU-smoke-testable so we validate without GPU spend.

## Considered Options

- **A. Target Dr. GRPO, host SDPO in the TRL `ComposerReplicationTrainer`** (chosen)
- **B. Keep the unspecified "GRPO + DAPO" base, host SDPO in PRIME-RL**
- **C. Vanilla GRPO base (with std-norm + length-standardization), TRL host**

## Decision Outcome

Chosen: **Option A** — configure the parent TRL `GRPOTrainer` to the Dr. GRPO
variant (scale_rewards off / no std-norm, no length-standardization, k1 KL,
Adam, single-epoch via `num_iterations=1`) via a `make_dr_grpo_config` helper,
keep the SDPO + trace-replay channels in `ComposerReplicationTrainer`, close
the student/teacher alignment trust-gap with an explicit assertion + collator
guarantee, and add a CPU smoke test that instantiates the trainer and runs a
1–2 rollout Dr. GRPO step with the SDPO channel on.

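
One way the `make_dr_grpo_config` helper could be shaped is sketched below. It is deliberately written against a plain set of field names rather than importing TRL, so it runs anywhere. The knob names `scale_rewards`, `loss_type="dr_grpo"`, and `num_iterations` are assumptions about the pinned TRL version and must be verified against it; the k1 KL estimator and the optimizer choice are not represented because no corresponding config knob is assumed here. `make_dr_grpo_kwargs` and `DR_GRPO_KNOBS` are hypothetical names.

```python
from typing import Any, Dict, Iterable

# Assumed TRL GRPOConfig knob names; verify against the pinned TRL version.
DR_GRPO_KNOBS: Dict[str, Any] = {
    "scale_rewards": False,   # no std-dev advantage normalization
    "loss_type": "dr_grpo",   # assumed name of the length-bias-free loss
    "num_iterations": 1,      # single epoch: a prompt is never trained twice
}


def make_dr_grpo_kwargs(known_fields: Iterable[str], **overrides: Any) -> Dict[str, Any]:
    """Return config kwargs pinned to Dr. GRPO, failing loudly on knob drift.

    known_fields: the field names the installed config class actually accepts.
    overrides: unrelated settings (output_dir, learning rate, ...) passed through.
    """
    missing = sorted(set(DR_GRPO_KNOBS) - set(known_fields))
    if missing:
        raise RuntimeError(
            f"Dr. GRPO knobs not found on the config class: {missing}; "
            "re-check the pinned TRL version"
        )
    clashes = sorted(
        key for key, value in overrides.items()
        if key in DR_GRPO_KNOBS and value != DR_GRPO_KNOBS[key]
    )
    if clashes:
        raise ValueError(f"overrides would break the Dr. GRPO recipe: {clashes}")
    return {**overrides, **DR_GRPO_KNOBS}


if __name__ == "__main__":
    fields = {"scale_rewards", "loss_type", "num_iterations", "output_dir"}
    print(make_dr_grpo_kwargs(fields, output_dir="out"))
```

With TRL installed, `known_fields` could plausibly be derived from the dataclass fields of `GRPOConfig` and the returned dict passed to its constructor; that wiring is left out so the sketch stays independent of any particular TRL release.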

### Consequences

- **Positive**: Matches the *actual* Composer 2 RL algorithm (primary-sourced), not a guess.
- **Positive**: Reuses the already-written trainer; small, validatable delta.
- **Positive**: SDPO channel is exercised live, closing the gap between "loss math proven" and "loss runs inside RL."
- **Negative**: SDPO is TRL-only for now; the PRIME-RL decentralized path (ADR-006's headline for DiLoCo-shape runs) cannot use the SDPO channel until upstream exposes logits. This narrows the decentralized story to channels 1+3.
- **Negative**: A second teacher forward per error-site minibatch ~doubles the forward cost on error-bearing batches (acceptable; error batches are a minority and the teacher forward is `no_grad`).
- **Neutral**: TRL's GRPOTrainer config knobs for Dr. GRPO (e.g. `scale_rewards`) are version-sensitive; pinned + guarded.

## Pros and Cons of the Options

### A. Dr. GRPO + TRL host
- Good: primary-sourced algorithm match; reuses existing trainer; full logits available for SDPO.
- Good: CPU-smoke-testable end-to-end.
- Bad: SDPO confined to TRL until PRIME-RL exposes logits.

### B. Unspecified base + PRIME-RL host
- Good: PRIME-RL is the decentralized/DiLoCo-shape host (ADR-006).
- Bad: **PRIME-RL cannot host SDPO** (log-probs only) — `NotImplementedError` today. Fatal for the SDPO channel.
- Bad: leaves the RL algorithm unspecified despite the tech report resolving it.

### C. Vanilla GRPO + TRL host
- Good: TRL default; least config work.
- Bad: std-dev advantage normalization "massively upweights small behavioral differences within equal-correctness groups" (tech report); length-standardization injects length bias. Diverges from the known-good recipe for no benefit.

## Acceptance gate (must be green before status flips to accepted)

- [ ] `make_dr_grpo_config(...)` helper exists and sets: no std-norm advantage scaling, no length-standardization, Adam, `num_iterations=1` (single-epoch), k1 KL — each value asserted in a unit test against the resulting `GRPOConfig`.
- [ ] `ComposerReplicationTrainer._compute_sdpo_loss` lines 158-160 trust-gap closed: an explicit shape-alignment assertion (not just a warning-and-skip) with a unit test that a misaligned batch raises rather than silently zeroes.
- [ ] CPU smoke test instantiates `ComposerReplicationTrainer` with a real `trl.GRPOTrainer` parent on Qwen2.5-0.5B, runs ≥1 Dr. GRPO update step with `alpha_sdpo>0` on an error-bearing batch, asserts: total loss finite, `loss/sdpo_kl > 0` logged on ≥1 step, a param moved. (Mirrors `examples/sdpo_real_trace_train_smoke`.)
- [ ] TRL version pinned in `[train]` extra; the Dr-GRPO config knobs guarded with a version check that fails loudly if the knob names drift.
- [ ] `recipes/prime_rl/composer_loss.py` `NotImplementedError(alpha_sdpo>0)` path has a test asserting it raises with a message pointing to the TRL host (documents the ADR-006 amendment in code).

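
The "raise rather than silently zero" gate above can be illustrated with a small shape check. This is a sketch only: it works on plain shape tuples so it runs without a tensor library, the `(batch, seq_len, vocab)` layout is an assumption, and `assert_sdpo_aligned` is a hypothetical name rather than the trainer's actual method.

```python
from typing import Sequence


def assert_sdpo_aligned(
    student_shape: Sequence[int],
    teacher_shape: Sequence[int],
    mask_shape: Sequence[int],
) -> None:
    """Raise if student/teacher logits or the SDPO loss mask are misaligned.

    Shapes are assumed to be (batch, seq_len, vocab) for logits and
    (batch, seq_len) for the mask.
    """
    student, teacher, mask = tuple(student_shape), tuple(teacher_shape), tuple(mask_shape)
    if len(student) != 3 or len(teacher) != 3:
        raise ValueError(f"expected 3-D logits, got {student} and {teacher}")
    if student != teacher:
        raise ValueError(
            f"student logits {student} and teacher logits {teacher} differ; "
            "the collator must align both sequences before the SDPO loss"
        )
    if mask != student[:2]:
        raise ValueError(f"sdpo_loss_mask {mask} does not match logits {student[:2]}")


if __name__ == "__main__":
    assert_sdpo_aligned((2, 5, 32), (2, 5, 32), (2, 5))  # aligned: no error
    try:
        assert_sdpo_aligned((2, 5, 32), (2, 7, 32), (2, 5))
    except ValueError as err:
        print("rejected:", err)
```

Raising here trades a crashed step for a visible failure, which is the behavior the gate asks the unit test to pin down.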

## More Information

- `research/08-sdpo-grpo-integration.md` — full integration design + `ComposerGRPOTrainer` sketch.
- `research/10-composer2-techreport-mining.md` — Dr. GRPO resolution (arXiv:2603.24477).
- Dr. GRPO: Liu et al., *Understanding R1-Zero-like training*, arXiv:2503.20783.
- SDPO: Hübotter et al., arXiv:2601.20802; OPSD: Zhao et al., arXiv:2601.18734.

@@ -0,0 +1,106 @@
---
status: proposed
date: 2026-05-29
deciders: [Codeseys, ARIA]
---

# ADR-009: Adopt a layered HintGenerator architecture for SDPO textual feedback

## Context and Problem Statement

Composer 2.5's targeted-textual-feedback method inserts a short natural-language
hint at each error turn; the hint-conditioned forward pass becomes the SDPO
teacher. **How Cursor generates that hint is unstated in every Cursor artifact**
— both blogs and the full Composer 2 technical report
(`research/10-composer2-techreport-mining.md` confirms ABSENT). The cited papers
bracket the answer: OPSD (arXiv:2601.18734) conditions the teacher on
ground-truth/reference; SDPO (arXiv:2601.20802) generalizes to environment
feedback and ablates three feedback sources, including the "successful sibling
rollout as implicit feedback" trick. So hint generation is genuinely *our*
design problem, and the project's own audit calls it "the single most important
open question for replication."

The framework today has `composer_replication/hint_generator.py`: a flat
registry of 5 templates (`tool_not_found`, `json_decode`, `type_error`,
`runtime_error`, `repeated_failure`) behind a `dispatch(error_kind, ctx)`
function. This covers the easy tool-error case but: (a) is not a typed
interface the collator can compose against (the collator's
`CollatorConfig.hint_generator` hook takes a callable, but there's no
`Protocol`); (b) has no path for style/communication errors (which need
natural language, not templates); (c) has no fallback when no template matches;
(d) cannot use the SDPO sibling-bootstrap lever. Research doc
`research/07-sdpo-hint-generator.md` designs the upgrade.

### Relationship to existing code — `preserves` + extends

This ADR **preserves** the existing 5 templates and `dispatch`/`register`
functions (they become the `TemplateHintGenerator` layer) and the
`CollatorConfig.hint_generator` hook signature. It adds a typed
`HintGenerator` Protocol and composite layering on top. No prior ADR governs
hint generation, so there is nothing to supersede.

## Decision Drivers

- Hint generation is the #1 replication gap; the design must be honest about
  which behavior classes each layer can actually cover.
- Templates are free and cover tool errors; style/communication need an LLM;
  some sites have no external hint source at all.
- Must slot into the existing collator hook with zero collator changes.
- Cost-awareness: most error sites should hit the free template path.

## Considered Options

- **A. Layered HintGenerator: template → raw-error → LLM-judge → SDPO sibling-bootstrap, behind a typed Protocol** (chosen)
- **B. Keep flat template `dispatch` only** (status quo)
- **C. LLM-judge only (one strong model generates every hint)**

## Decision Outcome

Chosen: **Option A** — a `HintGenerator` Protocol with `.generate(error_context) -> str | None`,
implemented as a `CompositeHintGenerator` that tries layers cheapest-first:
(1) `TemplateHintGenerator` (the existing 5 templates + more), (2) raw
tool-error text as the hint, (3) `LLMJudgeHintGenerator` (OpenRouter, ~$0.0005/site,
for style/communication/effort sites templates can't cover), (4) SDPO
sibling-bootstrap (use a successful sibling rollout as implicit feedback when
no external hint source exists). The composite exposes `as_collator_hook()`
returning a callable matching the existing `CollatorConfig.hint_generator`
signature.

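
The layering described above can be sketched with `typing.Protocol`. Only the first two layers are shown; the judge and sibling-bootstrap layers would plug into the same list. The fields of `HintContext`, the template string, and the shape of the callable returned by `as_collator_hook()` are assumptions made for illustration, since the existing hook signature is not reproduced in this ADR.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Protocol, Sequence


@dataclass(frozen=True)
class HintContext:
    """Hypothetical error context; the real fields may differ."""
    error_kind: str
    error_text: str = ""


class HintGenerator(Protocol):
    def generate(self, error_context: HintContext) -> Optional[str]: ...


class TemplateHintGenerator:
    """Layer 1: free, deterministic lookup by error kind."""

    def __init__(self, templates: Dict[str, str]) -> None:
        self._templates = dict(templates)

    def generate(self, error_context: HintContext) -> Optional[str]:
        return self._templates.get(error_context.error_kind)


class RawErrorHintGenerator:
    """Layer 2: fall back to the raw tool-error text, if there is any."""

    def generate(self, error_context: HintContext) -> Optional[str]:
        return error_context.error_text or None


class CompositeHintGenerator:
    """Try layers in the order given (cheapest first); first non-None wins."""

    def __init__(self, layers: Sequence[HintGenerator]) -> None:
        self._layers: List[HintGenerator] = list(layers)

    def generate(self, error_context: HintContext) -> Optional[str]:
        for layer in self._layers:
            hint = layer.generate(error_context)
            if hint is not None:
                return hint
        return None

    def as_collator_hook(self) -> Callable[[HintContext], Optional[str]]:
        # Assumed hook shape: a plain callable from context to optional hint.
        return self.generate


if __name__ == "__main__":
    composite = CompositeHintGenerator([
        TemplateHintGenerator({"tool_not_found": "Check the tool name against the available tools."}),
        RawErrorHintGenerator(),
    ])
    hook = composite.as_collator_hook()
    print(hook(HintContext("tool_not_found")))
    print(hook(HintContext("runtime_error", "ZeroDivisionError: division by zero")))
```

Because the Protocol is structural, the layer classes need no common base class; any object with a matching `generate` method can join the list.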

### Consequences

- **Positive**: Covers all four Composer behavior classes (tool use / style / communication / effort), not just tool errors.
- **Positive**: Cost-bounded — free template path handles the majority; LLM-judge only on the residual.
- **Positive**: Zero collator change (drops into the existing hook).
- **Negative**: The LLM-judge layer introduces a network dependency + per-site cost + nondeterminism into data prep; must be optional and cached. Mitigated by ordering it after the free layers and adding a disk cache keyed on the error context hash.
- **Negative**: Sibling-bootstrap requires multiple rollouts per prompt to have a successful sibling — only available in the RL-rollout path, not in offline-trace ingestion. Documented as RL-path-only.
- **Neutral**: More moving parts than a flat dict; mitigated by the Protocol + per-layer unit tests.

## Pros and Cons of the Options

### A. Layered Protocol
- Good: honest behavior-class coverage; cost-bounded; composable; testable per layer.
- Good: preserves existing templates as the first layer.
- Bad: more surface area; LLM layer adds nondeterminism (cached + optional).

### B. Flat templates only (status quo)
- Good: free, deterministic, already shipping.
- Bad: cannot cover style/communication/effort sites at all — exactly the behaviors Composer's blog says the method targets.

### C. LLM-judge only
- Good: maximal coverage, natural language for every site.
- Bad: cost on every site (templates would have been free); nondeterministic data prep; network dependency on the hot path. Wasteful for the common tool-error case a template nails.

## Acceptance gate (must be green before status flips to accepted)

- [ ] `HintGenerator` Protocol defined with `.generate(error_context: HintContext) -> str | None`; `mypy`/pyright clean.
- [ ] `TemplateHintGenerator` wraps the existing 5 templates; a test asserts byte-identical output to the current `dispatch()` for all 5 kinds (no regression).
- [ ] `CompositeHintGenerator` tries layers in cost order; a test asserts a tool_not_found site is served by the template layer (no LLM call) and a style site falls through to the judge layer (mocked).
- [ ] `LLMJudgeHintGenerator` has a disk cache keyed on error-context hash; a test asserts a second identical call hits the cache (zero network).
- [ ] `as_collator_hook()` returns a callable accepted by `CollatorConfig.hint_generator` without collator changes; an end-to-end test runs ingestion → collator-with-composite-hook → a non-empty `sdpo_loss_mask` on a real error trace.
- [ ] Sibling-bootstrap layer is gated behind an explicit `enable_sibling_bootstrap` flag and documented as RL-rollout-path-only; a unit test asserts it returns None in the offline-trace path.

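
The disk-cache gate in the list above could look roughly like the following. It is a sketch under stated assumptions: the error context is taken to be a JSON-serializable dict, the judge is an injected callable (so no network client is assumed), and `CachedHintGenerator` is a hypothetical name.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Callable, Dict, Optional


class CachedHintGenerator:
    """Wrap a costly judge callable with a disk cache keyed on the context hash."""

    def __init__(self, judge: Callable[[Dict[str, Any]], Optional[str]], cache_dir: Path) -> None:
        self._judge = judge
        self._dir = cache_dir
        self._dir.mkdir(parents=True, exist_ok=True)

    @staticmethod
    def _key(error_context: Dict[str, Any]) -> str:
        # Canonical JSON so that key order does not change the hash.
        canonical = json.dumps(error_context, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def generate(self, error_context: Dict[str, Any]) -> Optional[str]:
        path = self._dir / f"{self._key(error_context)}.json"
        if path.exists():
            # Cache hit: the judge (and therefore the network) is never called.
            return json.loads(path.read_text(encoding="utf-8"))["hint"]
        hint = self._judge(error_context)
        path.write_text(json.dumps({"hint": hint}), encoding="utf-8")
        return hint
```

Hashing a canonical serialization keeps the cache stable across runs and processes, which also makes the otherwise nondeterministic judge layer reproducible once a hint has been written.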

## More Information

- `research/07-sdpo-hint-generator.md` — full taxonomy, template strings, judge prompt, cost analysis.
- `research/09-composer-blog-delta-2026.md` — the SDPO sibling-bootstrap lever.
- Existing: `composer_replication/hint_generator.py`, `composer_replication/trainer/data_collator.py` (`CollatorConfig.hint_generator`).

@@ -0,0 +1,108 @@
---
status: proposed
date: 2026-05-29
deciders: [Codeseys, ARIA]
---

# ADR-010: Build a FeatureDeletionEnv synthetic-data subsystem over OSS SWE substrates

## Context and Problem Statement

Composer 2.5 trained on "25× more synthetic tasks than Composer 2," with
**Feature Deletion** as a named generator: take a repo with passing tests,
delete code/features, the agent must reimplement to make the tests pass —
tests are the verifiable reward. The Cursor blog also notes the data is a
**dynamic difficulty curriculum** ("we both select for and create harder
tasks dynamically throughout the run"); the Composer 2 tech report adds the
curriculum is keyed on rollout #turns + thinking-token count
(`research/10-composer2-techreport-mining.md`). The blog reports real
reward-hacking (decompiling Java bytecode, reverse-engineering Python
type-check caches to recover deleted signatures) mitigated by "agentic
monitoring tools."

The framework has **no synthetic-data-generation subsystem at all** — this is
genuinely greenfield. Codeseys' ask is to bring Composer's dataset-generation
approach into the framework as a real, reusable subsystem (useful beyond this
project). Research doc `research/06-feature-deletion-datagen.md` designs it
and identifies ready OSS substrates: SWE-bench-Lite, SWE-Gym (2.4k), R2E-Gym
(8.1k), SWE-rebench (21.3k), Nemotron/OpenHands (59k) — each ships
`(repo, gold_patch, FAIL_TO_PASS, PASS_TO_PASS, test_cmd)` tuples that invert
directly into Feature-Deletion tasks (revert the gold patch → broken repo;
`FAIL_TO_PASS` tests = the reward signal;
`PASS_TO_PASS` = the don't-break-existing-behavior guard).

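
As a rough illustration of the inversion described above, the sketch below maps one substrate record onto a task object. The record keys mirror the tuple named in the text and are assumed to hold already-parsed values (lists of test ids, not JSON strings); real substrates may name or encode these fields differently. `FeatureDeletionTask` here is a guess at the dataclass's shape, and `invert_record` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Tuple


@dataclass(frozen=True)
class FeatureDeletionTask:
    repo: str
    gold_patch: str                 # applied in reverse to produce the broken repo
    fail_to_pass: Tuple[str, ...]   # tests the agent must turn green (reward signal)
    pass_to_pass: Tuple[str, ...]   # tests that must stay green (guard)
    test_cmd: str


def invert_record(record: Mapping[str, Any]) -> FeatureDeletionTask:
    """Turn a solved-issue record into a reimplement-to-pass task."""
    fail_to_pass = tuple(record["FAIL_TO_PASS"])
    if not fail_to_pass:
        # Without FAIL_TO_PASS tests nothing verifies the reimplementation.
        raise ValueError("record has no FAIL_TO_PASS tests; cannot build a rewardable task")
    return FeatureDeletionTask(
        repo=str(record["repo"]),
        gold_patch=str(record["gold_patch"]),
        fail_to_pass=fail_to_pass,
        pass_to_pass=tuple(record["PASS_TO_PASS"]),
        test_cmd=str(record["test_cmd"]),
    )


if __name__ == "__main__":
    task = invert_record({
        "repo": "example-org/example-repo",      # placeholder values
        "gold_patch": "diff --git a/pkg/mod.py b/pkg/mod.py",
        "FAIL_TO_PASS": ["tests/test_mod.py::test_feature"],
        "PASS_TO_PASS": ["tests/test_mod.py::test_existing"],
        "test_cmd": "pytest -q",
    })
    print(task.fail_to_pass, task.pass_to_pass)
```

The task keeps the gold patch itself; reverting it inside the sandbox is left to the environment, so the same object can also be used to check that re-applying the patch restores green.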

### Relationship to prior ADRs — `preserves`

No prior ADR governs synthetic data generation. ADR-002 (trace source) chose
Claude Code JSONL for *ingestion of real traces* — orthogonal to *generating*
tasks. This ADR preserves everything; it adds a new `composer_replication.datagen`
package.

## Decision Drivers

- Reuse existing verified OSS substrates rather than scraping repos from scratch — they already guarantee test-exercises-the-code via `FAIL_TO_PASS`.
- The env must fit the framework's RL side (TRL GRPOTrainer + verifiers / OpenEnv-compatible) so generated tasks feed the same training loop.
- Reward-hacking is a *real* reported failure, not theoretical — safeguards must be in the design, not deferred.
- The online difficulty curriculum is part of the recipe, not optional.
- Subsystem should be reusable for the owner's other project (general "verifiable-reward task generation over a code repo").

## Considered Options

- **A. `FeatureDeletionEnv` that inverts OSS SWE substrates (revert gold patch) + online pass-rate difficulty gate + sandbox/AST reward-hacking safeguards** (chosen)
- **B. Greenfield repo-scraping generator (clone arbitrary GitHub repos, delete AST nodes, hope tests cover them)**
- **C. Skip generation; reuse SWE-bench-lite tasks as-is without a deletion/inversion layer**

## Decision Outcome

Chosen: **Option A** — a `composer_replication.datagen` package with a
`FeatureDeletionTask` dataclass, a `FeatureDeletionEnv` (Gym/OpenEnv-style
`reset`/`step`/`reward` where reward = masked test-pass fraction), substrate
adapters that invert the 5 OSS datasets by reverting their gold patch, a
4-gate solvability validator (broken repo fails `FAIL_TO_PASS`, passes
`PASS_TO_PASS`, gold patch restores green, deletion is reachable from tests),
an online pass-rate difficulty gate, and reward-hacking safeguards
(pre-task scrub of `__pycache__`/`.mypy_cache`/`.class`/`.git`; allowlisted
sandbox without `find`/`strings`/`unzip`/decompilers; AST provenance monitor
that masks reward when deleted symbols reappear via non-implementation paths).
A TRL `reward_fn(prompts, completions, **kwargs) -> list[float]` adapter wires
it to the RL loop.

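
A minimal sketch of the reward path described above follows. The masking policy is an assumption made for illustration: the reward is the fraction of `FAIL_TO_PASS` tests that pass, forced to zero when any `PASS_TO_PASS` test breaks or when the provenance monitor flags a suspected hack. The ADR does not spell out the exact masking rule, and `masked_pass_fraction`, `make_reward_fn`, and the injected `evaluate` callable are hypothetical names.

```python
from typing import Any, Callable, List, Mapping, Sequence


def masked_pass_fraction(
    results: Mapping[str, bool],
    fail_to_pass: Sequence[str],
    pass_to_pass: Sequence[str],
    hack_suspected: bool = False,
) -> float:
    """Fraction of FAIL_TO_PASS tests passing, masked to 0.0 on guard failures."""
    if hack_suspected:
        return 0.0  # assumed policy: a flagged rollout earns nothing
    if any(not results.get(test, False) for test in pass_to_pass):
        return 0.0  # assumed policy: breaking existing behavior earns nothing
    if not fail_to_pass:
        return 0.0
    passed = sum(1 for test in fail_to_pass if results.get(test, False))
    return passed / len(fail_to_pass)


def make_reward_fn(
    evaluate: Callable[[str, str], float],
) -> Callable[..., List[float]]:
    """Adapt a per-sample evaluator to the reward_fn signature named in this ADR."""

    def reward_fn(prompts: Sequence[str], completions: Sequence[str], **kwargs: Any) -> List[float]:
        return [float(evaluate(p, c)) for p, c in zip(prompts, completions)]

    return reward_fn


if __name__ == "__main__":
    results = {"test_feature_a": True, "test_feature_b": False, "test_existing": True}
    print(masked_pass_fraction(results, ["test_feature_a", "test_feature_b"], ["test_existing"]))
```

In a real environment `evaluate` would apply the completion in the sandbox, run the task's tests, and call `masked_pass_fraction` on the outcome; it is injected here so the adapter stays testable without Docker.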

### Consequences

- **Positive**: Verifiable reward for free (tests already exist + are known to exercise the code via `FAIL_TO_PASS`); no need to generate or trust new tests.
- **Positive**: Reusable general subsystem — "invert a solved-repo dataset into a reimplement-to-pass task" works for the owner's other project.
- **Positive**: Online difficulty gate matches the actual recipe.
- **Negative**: Bounded to what the OSS substrates cover (Python-dominant; SWE-bench is Python/JS-heavy). Other languages need new substrates. Documented as a known coverage limit.
- **Negative**: Running tests in a sandbox requires Docker images per substrate; CPU-pool generation has real wall-clock cost (~15 node-days to invert all 21k SWE-rebench tasks per research/06). Mitigated by reusing the substrates' published Docker images and generating lazily.
- **Negative**: Reward-hacking safeguards are a moving target; the AST provenance monitor is heuristic and will have false negatives. Mitigated by treating it as defense-in-depth (sandbox lockdown is the primary control) and logging suspected hacks for review.
- **Neutral**: Adds a `[datagen]` optional extra (datasets, docker SDK).

## Pros and Cons of the Options

### A. Invert OSS substrates
- Good: tests guaranteed to exercise the deleted code; verified datasets; reusable; matches recipe.
- Bad: language/domain coverage bounded by the substrates; sandbox/Docker cost.

### B. Greenfield repo scraping
- Good: unbounded repo universe.
- Bad: no guarantee remaining tests exercise the deleted code (the hard part SWE-bench already solved); huge validation burden; reinvents SWE-Gym. High effort, low marginal value.

### C. Reuse SWE-bench-lite as-is
- Good: zero build.
- Bad: not *Feature Deletion* — it's issue-fixing; no controlled difficulty knob; no deletion mechanic; doesn't bring Composer's data-gen method in, just consumes an existing benchmark. Fails the actual ask.

## Acceptance gate (must be green before status flips to accepted)

- [ ] `FeatureDeletionTask` dataclass + `FeatureDeletionEnv` (`reset`/`step`/`reward`) implemented; reward = masked test-pass fraction with a unit test on a synthetic mini-repo.
- [ ] One substrate adapter (SWE-bench-Lite, smallest) inverts ≥1 real task: revert gold patch → broken repo, a test asserts the broken repo FAILS `FAIL_TO_PASS` and PASSES `PASS_TO_PASS`, and applying the gold patch restores green. (Runs in the substrate's Docker image; gated `skipif` on docker availability for CI.)
- [ ] 4-gate solvability validator implemented; a test asserts a task with an unreachable deletion (no test exercises it) is rejected.
- [ ] Reward-hacking safeguard: a test asserts the sandbox lacks `find`/`strings`/`unzip` and that `__pycache__`/`.mypy_cache` are scrubbed pre-task; the AST provenance monitor masks reward on a crafted "symbol reappears via import of a sibling cache" hack.
- [ ] Online difficulty gate: a unit test asserts tasks are rankable by a difficulty signal (turns/thinking-token proxy) and the gate up-weights the hard tail.
- [ ] TRL `reward_fn(prompts, completions, **kwargs) -> list[float]` adapter exists; a test asserts it returns one float in [0,1] per completion = test-pass fraction.
- [ ] `[datagen]` optional extra added to `pyproject.toml`; `pip install -e .[datagen]` resolves.

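
For the difficulty-gate item in the list above, one possible weighting is sketched here. It is an assumption, not the recipe's formula: each task's sampling weight grows with `1 - pass_rate`, plus a small floor so that easy tasks are never starved entirely. `difficulty_weights` is a hypothetical helper, and a turns or thinking-token proxy could replace the pass rate as the difficulty signal.

```python
import random
from typing import Dict


def difficulty_weights(pass_rates: Dict[str, float], floor: float = 0.05) -> Dict[str, float]:
    """Map observed per-task pass rates to normalized sampling weights.

    Lower pass rate means a harder task and a larger weight.
    """
    if floor <= 0.0:
        raise ValueError("floor must be positive so the weights can be normalized")
    for task_id, rate in pass_rates.items():
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"pass rate for {task_id!r} must lie in [0, 1], got {rate}")
    if not pass_rates:
        return {}
    raw = {task_id: (1.0 - rate) + floor for task_id, rate in pass_rates.items()}
    total = sum(raw.values())
    return {task_id: weight / total for task_id, weight in raw.items()}


if __name__ == "__main__":
    weights = difficulty_weights({"easy": 0.9, "medium": 0.5, "hard": 0.1})
    task_ids = list(weights)
    # Draw the next batch of tasks, biased toward the hard tail.
    batch = random.choices(task_ids, weights=[weights[t] for t in task_ids], k=4)
    print(weights, batch)
```

Recomputing the weights from fresh rollout statistics after each update is what would make the gate "online"; how often to refresh is left open here.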

## More Information

- `research/06-feature-deletion-datagen.md` — full design: substrates w/ HF ids + licenses, deletion mechanics, safeguards, env + reward_fn sketches, cost verdict.
- Substrates: SWE-bench (arXiv:2310.06770), SWE-Gym (arXiv:2412.21139), R2E-Gym, SWE-rebench.
- Composer data-gen: `docs/COMPOSER_RECIPE_MAPPING.md` §2, `research/10-composer2-techreport-mining.md` §1.2 (curriculum).

@@ -0,0 +1,16 @@
# Architecture Decision Records

| # | Title | Status | Date |
|---|-------|--------|------|
| [ADR-001](ADR-001-gpu-venue.md) | GPU venue | accepted | — |
| [ADR-002](ADR-002-trace-source.md) | Trace source | accepted | — |
| [ADR-003](ADR-003-diloco-impl.md) | DiLoCo implementation | accepted | — |
| [ADR-004](ADR-004-replaysim-normalization.md) | ReplaySim normalization | accepted | — |
| [ADR-005](ADR-005-serverless-diloco.md) | Serverless DiLoCo | accepted | — |
| [ADR-006](ADR-006-rl-frameworks.md) | RL framework strategy: TRL + VeRL + PRIME-RL | accepted (amended-by ADR-008) | 2026-05-26 |
| [ADR-007](ADR-007-self-distillation-losses.md) | Self-distillation losses landscape | accepted | 2026-05-26 |
| [ADR-008](ADR-008-drgrpo-sdpo-live-channel.md) | Target Dr. GRPO + host live SDPO channel in TRL trainer | proposed | 2026-05-29 |
| [ADR-009](ADR-009-layered-hint-generator.md) | Layered HintGenerator for SDPO textual feedback | proposed | 2026-05-29 |
| [ADR-010](ADR-010-feature-deletion-datagen.md) | FeatureDeletionEnv synthetic-data subsystem over OSS SWE substrates | proposed | 2026-05-29 |

Sorted by number ascending. ADRs are immutable after `accepted`; supersede or amend rather than edit.