Codeseys committed on
Commit 8498e8f · 1 Parent(s): 9336af3

docs: wave summary — Composer 2.5 data-gen + targeted-RL textual-feedback

docs/WAVE_COMPOSER_DATAGEN_RL_2026-05-29.md ADDED
@@ -0,0 +1,63 @@
+ # Wave: Composer 2.5 data-gen + targeted-RL textual-feedback (2026-05-29)
+
+ > Deep-work-loop wave bringing Composer 2.5's **dataset generation** and
+ > **targeted RL with textual feedback** into the framework, per Codeseys'
+ > directive. This doc summarizes what shipped and ties it back to the ask.
+
+ ## The ask (verbatim intent)
+
+ > "make sure that our framework for data-generation and massive deep RL similar
+ > to composer 2.5 is able to … train a model like nanochat on modal … also …
+ > diloco … bring the composer 2.5 blog's discussion on dataset generation into
+ > the fray … the composer 2.5 RL method of targeted RL with textual feedback
+ > should also be included in the RL framework as well as the dataset generation
+ > framework."
+
+ ## What shipped (all on `hf/master`)
+
+ | Capability | Status | Artifact |
+ |---|---|---|
+ | **Targeted RL w/ textual feedback** (SDPO) live in a Dr. GRPO loop | ✅ ADR-008 accepted | `trainer/composer_trainer.py` (`make_dr_grpo_config`, strict SDPO alignment), `examples/composer_grpo_sdpo_smoke/` |
+ | **Hint generation** (the unstated #1 gap): layered generator | ✅ ADR-009 accepted | `hint_generator.py` (Template→RawError→LLMJudge→sibling-bootstrap, `as_collator_hook`) |
+ | **Dataset generation** (Feature Deletion) | ✅ ADR-010 accepted | `composer_replication/datagen/` (env, sandbox+safeguards, monitor, curriculum, validator, SWE-substrate adapter) |
+ | nanochat-on-Modal + DiLoCo A/B | ✅ (prior wave, same session) | DiLoCo tied vanilla within noise; SFT ChatCORE 0.3076; RL Pass@1 doubled |
+ | SDPO loss math end-to-end on real traces | ✅ (prior wave, same session) | `examples/sdpo_real_trace_train_smoke/` |
+
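The layered hint generator in the table above (Template→RawError→LLMJudge→sibling-bootstrap) can be pictured as a fallback chain: each layer either returns a hint or defers to the next one. The sketch below is illustrative only and is not the code in `hint_generator.py`. The `Rollout` class, the layer functions, the `TEMPLATES` table, and `generate_hint` are hypothetical names, and the judge layer is a stub that makes no model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Rollout:
    """Hypothetical record of one sampled attempt at a task."""
    completion: str
    passed: bool
    error: Optional[str] = None  # raw error text from the environment, if any


# A layer returns a hint string, or None to defer to the next layer.
HintLayer = Callable[[Rollout, Sequence[Rollout]], Optional[str]]

# Hypothetical canned hints keyed by a substring of the error text.
TEMPLATES = {
    "AssertionError": "A test assertion failed; re-check the expected output.",
    "ModuleNotFoundError": "An import is missing; use only installed packages.",
}


def template_layer(rollout, siblings):
    if rollout.error is None:
        return None
    for needle, hint in TEMPLATES.items():
        if needle in rollout.error:
            return hint
    return None


def raw_error_layer(rollout, siblings):
    if rollout.error is None:
        return None
    return f"The previous attempt failed with: {rollout.error}"


def llm_judge_layer(rollout, siblings):
    # Placeholder: a real layer would ask a judge model for feedback here.
    return None


def sibling_bootstrap_layer(rollout, siblings):
    # Use a successful rollout from the same group as implicit feedback.
    for sibling in siblings:
        if sibling.passed:
            return f"A sibling attempt succeeded with: {sibling.completion}"
    return None


LAYERS: List[HintLayer] = [
    template_layer, raw_error_layer, llm_judge_layer, sibling_bootstrap_layer,
]


def generate_hint(rollout, siblings, layers=LAYERS):
    """Return the first hint any layer produces; passing rollouts need none."""
    if rollout.passed:
        return None
    for layer in layers:
        hint = layer(rollout, siblings)
        if hint is not None:
            return hint
    return None
```

Ordering the cheap, deterministic layers first means the judge and the sibling bootstrap are only reached when the environment gives no usable error text, which is one plausible reading of "layered" here.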
+ ## Key research resolutions (5 docs, `research/06-10`)
+
+ - **RL algorithm = Dr. GRPO** (Composer 2 tech report, arXiv:2603.24477): length standardization removed, no std-dev normalization of advantages, Adam optimizer, single epoch, k1 (`-log r`) KL estimator. TRL 1.5.0 supports this natively (`loss_type="dr_grpo"`, `scale_rewards="none"`).
+ - **Hint generation is unstated in every Cursor artifact**, so the SDPO/OPSD reconstruction is the legitimate path, not a missed detail. SDPO's "successful-rollout-as-implicit-feedback" is the bootstrap fallback.
+ - **Composer 2 hint-free behavior-shaping alternative** is documented (auxiliary scalar rewards plus a nonlinear length/effort penalty) for cases where hints aren't available.
+ - Corrections to the mapping doc: the optimizer is Adam (not Muon, which appears only in the 2.5 blog); sharding is FSDP+CP+decoupled-EP (not HSDP).
+
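To make the Dr. GRPO bullet concrete, here is a minimal plain-Python sketch of the three ingredients it names: group advantages without std-dev scaling, a loss normalized by a constant instead of by each sequence's length, and the k1 KL estimator. It is not the code in `trainer/composer_trainer.py`. The function names are hypothetical, and the convention `r = pi_ref / pi_theta` (with samples drawn from the policy) is an assumption about how `-log r` is meant.

```python
import math
from statistics import mean


def dr_grpo_advantages(rewards):
    """Group-relative advantages: subtract the group mean, do NOT divide by std."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def dr_grpo_loss(per_token_losses, max_completion_length):
    """Aggregate per-token losses with a constant normalizer.

    per_token_losses holds one list of token losses per rollout. Dividing by
    (num_rollouts * max_completion_length) instead of each rollout's own
    length removes the per-sequence length standardization.
    """
    total = sum(sum(tokens) for tokens in per_token_losses)
    return total / (len(per_token_losses) * max_completion_length)


def k1_kl(policy_logprob, ref_logprob):
    """k1 estimator -log r, assuming r = pi_ref / pi_theta for a sampled token."""
    log_r = ref_logprob - policy_logprob
    return -log_r


# Four rollouts for one prompt with binary rewards.
advantages = dr_grpo_advantages([1.0, 0.0, 0.0, 1.0])
# Two rollouts of different lengths, normalized by the same constant.
loss = dr_grpo_loss([[1.0, 1.0], [2.0]], max_completion_length=4)
# Policy assigns 0.5 to the sampled token, reference assigns 0.25.
kl = k1_kl(math.log(0.5), math.log(0.25))
```

Read against the TRL settings quoted above, `scale_rewards="none"` plays the role of `dr_grpo_advantages` and `loss_type="dr_grpo"` the role of `dr_grpo_loss`; that mapping is an informal reading and should be checked against the TRL version in use.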
+ ## Test posture
+
+ - 187 passed / 16 skipped across the full package (no regressions).
+ - New tests: 7 (Dr. GRPO config + SDPO alignment), 12 (layered hints), 19 (datagen).
+ - The SDPO channel is verified two ways: as loss math on real traces, and wired into a live `trl.GRPOTrainer` Dr. GRPO step.
+
+ ## Reusable for the adjacent project
+
+ `composer_replication.datagen` is the data-gen primitive Codeseys wanted for
+ another project: **"invert a solved-repo dataset into a reimplement-to-pass
+ verifiable-reward task,"** with reward-hacking safeguards and an online
+ difficulty curriculum. `SweBenchAdapter` makes any SWE-bench-shaped dataset
+ (SWE-Gym, R2E-Gym, SWE-rebench; tens of thousands of tasks) a drop-in source.
+
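As a rough illustration of the "invert a solved-repo dataset" idea, the sketch below turns one SWE-bench-shaped row into a reimplement-to-pass task with a binary verifiable reward. It is not the `SweBenchAdapter` implementation. The field names (`base_commit`, `patch`, `test_patch`, `FAIL_TO_PASS`, `PASS_TO_PASS`) follow the public SWE-bench schema, while `ReimplementTask`, `invert_instance`, `reward`, and the single "do not edit test files" safeguard are hypothetical stand-ins for whatever safeguards the package actually applies. The test lists are assumed to be already decoded into Python lists.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ReimplementTask:
    """Hypothetical task record: rebuild the removed feature until tests pass."""
    instance_id: str
    base_commit: str            # repo state without the feature
    reference_patch: str        # known-good solution, hidden from the policy
    must_pass: List[str]        # tests that fail without the feature
    must_stay_green: List[str]  # tests that must keep passing
    protected_paths: List[str]  # files the policy may not modify


def paths_in_patch(patch: str) -> List[str]:
    """Collect target file paths from the '+++ b/<path>' headers of a diff."""
    prefix = "+++ b/"
    return [line[len(prefix):] for line in patch.splitlines()
            if line.startswith(prefix)]


def invert_instance(row: Dict) -> ReimplementTask:
    return ReimplementTask(
        instance_id=row["instance_id"],
        base_commit=row["base_commit"],
        reference_patch=row["patch"],
        must_pass=list(row["FAIL_TO_PASS"]),
        must_stay_green=list(row["PASS_TO_PASS"]),
        protected_paths=paths_in_patch(row["test_patch"]),
    )


def reward(task: ReimplementTask, test_results: Dict[str, bool],
           touched_paths: List[str]) -> float:
    """Binary reward: 1.0 only if every required test passes honestly."""
    # Illustrative safeguard: editing the test files forfeits the reward.
    if any(path in task.protected_paths for path in touched_paths):
        return 0.0
    required = task.must_pass + task.must_stay_green
    # A test missing from the results counts as a failure.
    return 1.0 if all(test_results.get(t, False) for t in required) else 0.0
```

Treating a missing test result as a failure keeps the reward conservative: a rollout that crashes the test runner cannot score by omission.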
+ ## Owed / unblocked-by (constraints this wave hit)
+
+ - **Cross-family adversarial ADR review** (GPT-5.5 / Gemini 3.1 Pro / DeepSeek
+ V4 Pro per model-roster): deferred. OpenRouter ran out of credits mid-wave,
+ blocking `delegate_task`. A focused single-author pre-mortem was substituted
+ (it caught and fixed the stale ADR-006 matrix row). Re-run once credits are restored.
+ - **Live Docker substrate-inversion e2e** for ADR-010 (pull one SWE-bench-Lite
+ image, run the 4 gates against it): wired but deferred (no Docker in the CPU
+ dev env). Marked `[~]` in the ADR.
+ - Gateway restarted 6× during the wave; all work stayed durable via
+ phase-boundary commits + detached `systemd-run` scopes.
+
+ ## Commit trail
+
+ `7090729` (SDPO smoke) → `6049d00` (research) → `36ab61e` (ADRs) →
+ `bde5c5e` (ADR-008 code) → `2a34df4` (ADR-008 smoke+accept) →
+ `84740d4` (ADR-009) → `9336af3` (ADR-010).