+# Wave: Composer 2.5 data-gen + targeted-RL textual-feedback (2026-05-29)
+> Deep-work-loop wave bringing Composer 2.5's **dataset generation** and
+> **targeted RL with textual feedback** into the framework, per Codeseys'
+> directive. This doc summarizes what shipped and maps it back to the ask.
+## The ask (verbatim intent)
+> "make sure that our framework for data-generation and massive deep RL similar
+> to composer 2.5 is able to … train a model like nanochat on modal … also …
+> diloco … bring the composer 2.5 blog's discussion on dataset generation into
+> the fray … the composer 2.5 RL method of targeted RL with textual feedback
+> should also be included in the RL framework as well as the dataset generation
+> framework."
+## What shipped (all on `hf/master`)
+| Capability | Status | Artifact |
+|---|---|---|
+| **Targeted RL w/ textual feedback** (SDPO) live in a Dr. GRPO loop | ✅ ADR-008 accepted | `trainer/composer_trainer.py` (`make_dr_grpo_config`, strict SDPO alignment), `examples/composer_grpo_sdpo_smoke/` |
+| **Hint generation** (the unstated #1 gap) — layered generator | ✅ ADR-009 accepted | `hint_generator.py` (Template→RawError→LLMJudge→sibling-bootstrap, `as_collator_hook`) |
+| **Dataset generation** (Feature Deletion) | ✅ ADR-010 accepted | `composer_replication/datagen/` (env, sandbox+safeguards, monitor, curriculum, validator, SWE-substrate adapter) |
+| nanochat-on-Modal + DiLoCo A/B | ✅ (prior wave, same session) | DiLoCo tied vanilla within noise; SFT ChatCORE 0.3076; RL Pass@1 doubled |
+| SDPO loss math end-to-end on real traces | ✅ (prior wave, same session) | `examples/sdpo_real_trace_train_smoke/` |
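
The layered hint generator in the table (Template→RawError→LLMJudge→sibling-bootstrap) reads as a first-match fallback chain. The sketch below illustrates that idea only; it is not the code in `hint_generator.py`. The rollout dictionary shape, the `TEMPLATES` table, and every function name here are hypothetical, and the LLM-judge layer is left out because it would need a model call.

```python
from typing import Callable, Optional, Sequence

# Hypothetical rollout shape: {"passed": bool, "stderr": str, "text": str}
Rollout = dict
HintLayer = Callable[[Rollout, Sequence[Rollout]], Optional[str]]

# Hypothetical template table keyed by a substring of the error output.
TEMPLATES = {
    "ImportError": "A module import failed; check the package name and install step.",
}

def template_layer(rollout: Rollout, siblings: Sequence[Rollout]) -> Optional[str]:
    """Cheapest layer: map a known error signature to a canned hint."""
    for key, hint in TEMPLATES.items():
        if key in rollout["stderr"]:
            return hint
    return None

def raw_error_layer(rollout: Rollout, siblings: Sequence[Rollout]) -> Optional[str]:
    """Fall back to the last line of the raw error output, if there is one."""
    lines = rollout["stderr"].strip().splitlines()
    return f"The run failed with: {lines[-1]}" if lines else None

def sibling_bootstrap_layer(rollout: Rollout, siblings: Sequence[Rollout]) -> Optional[str]:
    """Last resort: reuse a successful rollout from the same group as feedback."""
    for sib in siblings:
        if sib["passed"]:
            return f"A sibling attempt succeeded with this approach:\n{sib['text']}"
    return None

def generate_hint(rollout: Rollout, siblings: Sequence[Rollout],
                  layers: Sequence[HintLayer]) -> Optional[str]:
    """Return the first non-empty hint; successful rollouts get no hint."""
    if rollout["passed"]:
        return None
    for layer in layers:
        hint = layer(rollout, siblings)
        if hint is not None:
            return hint
    return None

LAYERS = [template_layer, raw_error_layer, sibling_bootstrap_layer]
```

Ordering the layers from cheapest to most expensive keeps the common failures on the fast path, and the sibling layer only fires when no error text is available at all.
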
+## Key research resolutions (5 docs, `research/06-10`)
+- **RL algorithm = Dr. GRPO** (Composer 2 tech report arXiv:2603.24477): length standardization removed, no std-dev advantage normalization, Adam, single-epoch, k1 (`−log r`) KL estimator. TRL 1.5.0 supports this natively (`loss_type="dr_grpo"`, `scale_rewards="none"`).
+- **Hint generation is unstated in every Cursor artifact** — so the SDPO/OPSD reconstruction is the legitimate path, not a missed detail. SDPO's "successful-rollout-as-implicit-feedback" is the bootstrap fallback.
+- **Composer 2 hint-free behavior-shaping alternative** documented (aux scalar rewards + nonlinear length/effort penalty) for when hints aren't available.
+- Corrections to the mapping doc: optimizer Adam (not Muon — 2.5-blog-only); sharding FSDP+CP+decoupled-EP (not HSDP).
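
To make the Dr. GRPO bullet concrete, here is a minimal pure-Python sketch of its three ingredients: mean-centered advantages with no std-dev scaling, a loss divided by a constant rather than by each sequence's length, and the k1 KL estimate. It does not call TRL and is not the framework's trainer. The function names are hypothetical, and the loss is a simplified on-policy surrogate without ratio clipping, which is an assumption that fits the single-epoch setting.

```python
def dr_grpo_advantages(rewards):
    """Group-relative advantages: subtract the group mean, do not divide by std."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def k1_kl(logp_policy, logp_ref):
    """Per-token k1 estimate, -log r with r = pi_ref / pi_theta.

    Tokens are assumed to be sampled from the current policy, so
    -log r = log pi_theta - log pi_ref.
    """
    return [lp - lr for lp, lr in zip(logp_policy, logp_ref)]

def dr_grpo_loss(token_logps, advantages, max_len):
    """Simplified on-policy surrogate (no clipping).

    Token terms are summed over the whole group and divided by the constant
    group_size * max_len, so long and short completions are weighted per
    token instead of per sequence.
    """
    total = 0.0
    for logps, adv in zip(token_logps, advantages):
        total += sum(-adv * lp for lp in logps)
    return total / (len(token_logps) * max_len)

# Two completions in one group: the first earned reward 1, the second 0.
advs = dr_grpo_advantages([1.0, 0.0])
loss = dr_grpo_loss([[-1.0], [-2.0, -2.0]], advs, max_len=4)
```

Because the divisor is a constant, a failed completion cannot shrink its per-token penalty by running long, which is the length bias the removed term introduced.
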
+## Test posture
+- 187 passed / 16 skipped across the full package (no regressions).
+- New: 7 (Dr.GRPO config + SDPO alignment), 12 (layered hints), 19 (datagen).
+- The SDPO channel is verified two ways: as loss math on real traces, and wired into a live `trl.GRPOTrainer` Dr. GRPO step.
+## Reusable for the adjacent project
+`composer_replication.datagen` is the data-gen primitive Codeseys wanted for
+another project: **"invert a solved-repo dataset into a reimplement-to-pass
+verifiable-reward task,"** with reward-hacking safeguards and an online
+difficulty curriculum. `SweBenchAdapter` makes any SWE-bench-shaped dataset
+(SWE-Gym, R2E-Gym, SWE-rebench — tens of thousands of tasks) a drop-in source.
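
The inversion described above can be sketched as follows, assuming records that carry the public SWE-bench field names (`instance_id`, `repo`, `base_commit`, `problem_statement`, `FAIL_TO_PASS`, `PASS_TO_PASS`). This is not the `SweBenchAdapter` API, whose interface is not shown here. `ReimplementTask`, `invert_record`, and `verifiable_reward` are hypothetical names, and the all-or-nothing reward is an assumption.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class ReimplementTask:
    instance_id: str
    repo: str
    start_commit: str   # repo state before the solution was applied
    prompt: str         # what the policy is asked to implement
    must_pass: tuple    # tests that define success

def _as_list(value):
    """SWE-bench stores test lists as JSON strings; accept real lists too."""
    return json.loads(value) if isinstance(value, str) else list(value)

def invert_record(record):
    """Turn a solved instance into a reimplement-to-pass task.

    The gold patch is deliberately not copied into the task, so the policy
    never sees the reference solution.
    """
    must_pass = tuple(_as_list(record["FAIL_TO_PASS"]) + _as_list(record["PASS_TO_PASS"]))
    return ReimplementTask(
        instance_id=record["instance_id"],
        repo=record["repo"],
        start_commit=record["base_commit"],
        prompt=record["problem_statement"],
        must_pass=must_pass,
    )

def verifiable_reward(task, passed_tests):
    """Binary reward: 1.0 only if every required test passed."""
    return 1.0 if set(task.must_pass) <= set(passed_tests) else 0.0
```

Requiring the previously passing tests as well as the target tests is a basic guard against a rollout that satisfies the new tests by breaking unrelated behavior; the sandbox safeguards and curriculum mentioned above sit on top of a check like this one.
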
+## Owed / unblocked-by (constraints this wave hit)
+- **Cross-family adversarial ADR review** (GPT-5.5 / Gemini 3.1 Pro / DeepSeek
+  V4 Pro per model-roster): deferred because OpenRouter ran out of credits
+  mid-wave, blocking `delegate_task`. A focused single-author pre-mortem was
+  substituted (it caught and fixed the ADR-006 stale matrix row). Re-run when
+  credits are restored.
+- **Live Docker substrate-inversion e2e** for ADR-010 (pull one SWE-bench-Lite
+  image, run the 4 gates against it) — wired but deferred (no Docker in the CPU
+  dev env). Marked `[~]` in the ADR.
+- Gateway restarted 6× during the wave; all work kept durable via
+  phase-boundary commits + detached `systemd-run` scopes.
+## Commit trail
+`7090729` (SDPO smoke) → `6049d00` (research) → `36ab61e` (ADRs) →
+`bde5c5e` (ADR-008 code) → `2a34df4` (ADR-008 smoke+accept) →
+`84740d4` (ADR-009) → `9336af3` (ADR-010).