docs: wave summary — Composer 2.5 data-gen + targeted-RL textual-feedback
docs/WAVE_COMPOSER_DATAGEN_RL_2026-05-29.md
ADDED
@@ -0,0 +1,63 @@
# Wave: Composer 2.5 data-gen + targeted-RL textual-feedback (2026-05-29)

> Deep-work-loop wave bringing Composer 2.5's **dataset generation** and
> **targeted RL with textual feedback** into the framework, per Codeseys'
> directive. This doc summarizes what shipped and ties it to the ask.

## The ask (verbatim intent)

> "make sure that our framework for data-generation and massive deep RL similar
> to composer 2.5 is able to … train a model like nanochat on modal … also …
> diloco … bring the composer 2.5 blog's discussion on dataset generation into
> the fray … the composer 2.5 RL method of targeted RL with textual feedback
> should also be included in the RL framework as well as the dataset generation
> framework."

## What shipped (all on `hf/master`)

| Capability | Status | Artifact / result |
|---|---|---|
| **Targeted RL w/ textual feedback** (SDPO) live in a Dr. GRPO loop | ✅ ADR-008 accepted | `trainer/composer_trainer.py` (`make_dr_grpo_config`, strict SDPO alignment), `examples/composer_grpo_sdpo_smoke/` |
| **Hint generation** (the unstated #1 gap): layered generator | ✅ ADR-009 accepted | `hint_generator.py` (Template→RawError→LLMJudge→sibling-bootstrap, `as_collator_hook`) |
| **Dataset generation** (Feature Deletion) | ✅ ADR-010 accepted | `composer_replication/datagen/` (env, sandbox+safeguards, monitor, curriculum, validator, SWE-substrate adapter) |
| nanochat-on-Modal + DiLoCo A/B | ✅ (prior wave, same session) | DiLoCo tied vanilla within noise; SFT ChatCORE 0.3076; RL Pass@1 doubled |
| SDPO loss math end-to-end on real traces | ✅ (prior wave, same session) | `examples/sdpo_real_trace_train_smoke/` |

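The layered hint generator in the table falls back through Template→RawError→LLMJudge→sibling-bootstrap. A minimal sketch of that fallback chain follows. It is not the code in `hint_generator.py`: every function name, the rollout dict keys (`failure_kind`, `stderr`, `successful_sibling`), and the hint wording are hypothetical, chosen only to illustrate the layering.

```python
from typing import Callable, List, Optional

# A layer maps a failed-rollout record to a hint string, or None to defer
# to the next layer. The dict keys used below are illustrative assumptions.
HintLayer = Callable[[dict], Optional[str]]


def template_hint(rollout: dict) -> Optional[str]:
    """Cheapest layer: a canned hint keyed on a known failure category."""
    templates = {"timeout": "Reduce work per step; the command timed out."}
    return templates.get(rollout.get("failure_kind", ""))


def raw_error_hint(rollout: dict) -> Optional[str]:
    """Second layer: surface the last line of the raw error output."""
    stderr = rollout.get("stderr", "").strip()
    if not stderr:
        return None
    return f"The run failed with: {stderr.splitlines()[-1]}"


def llm_judge_layer(judge: Callable[[str], str]) -> HintLayer:
    """Third layer: wrap any text-in/text-out judge (hypothetical interface)."""
    def layer(rollout: dict) -> Optional[str]:
        transcript = rollout.get("transcript")
        return judge(transcript) if transcript else None
    return layer


def sibling_bootstrap_hint(rollout: dict) -> Optional[str]:
    """Last resort: reuse a successful sibling rollout as implicit feedback."""
    sibling = rollout.get("successful_sibling")
    return f"A sibling attempt that passed did this: {sibling}" if sibling else None


def generate_hint(rollout: dict, layers: List[HintLayer]) -> Optional[str]:
    """Return the first non-empty hint, trying layers in order."""
    for layer in layers:
        hint = layer(rollout)
        if hint:
            return hint
    return None


# Default chain without a judge; a judge layer would slot in before the sibling.
LAYERS: List[HintLayer] = [template_hint, raw_error_hint, sibling_bootstrap_hint]
```

Ordering the layers from cheapest to most expensive means an LLM judge is only consulted when neither a template nor the raw error yields a hint.
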
## Key research resolutions (5 docs, `research/06-10`)

- **RL algorithm = Dr. GRPO** (Composer 2 tech report arXiv:2603.24477): length-standardization removed, no std-dev advantage normalization, Adam, single-epoch, k1 (`−log r`) KL estimator. TRL 1.5.0 supports this natively (`loss_type="dr_grpo"`, `scale_rewards="none"`).
- **Hint generation is unstated in every Cursor artifact**, so the SDPO/OPSD reconstruction is the legitimate path, not a missed detail. SDPO's "successful-rollout-as-implicit-feedback" is the bootstrap fallback.
- **Composer 2 hint-free behavior-shaping alternative** documented (aux scalar rewards + nonlinear length/effort penalty) for when hints aren't available.
- Corrections to the mapping doc: the optimizer is Adam (not Muon, which appears only in the 2.5 blog); sharding is FSDP+CP+decoupled-EP (not HSDP).

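The Dr. GRPO choices listed above (mean-centered advantages with no std-dev division, a constant length normalizer, and the k1 KL estimator) can be sketched in plain Python. This is an illustration under stated assumptions, not the framework's trainer or TRL's implementation: the function names are hypothetical, and the surrogate assumes a strictly on-policy single-epoch step, where the importance ratio is 1 and the objective reduces to advantage-weighted log-probabilities.

```python
from typing import List


def dr_grpo_advantages(rewards: List[float]) -> List[float]:
    """Center rewards on the group mean; do NOT divide by the group std-dev."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def dr_grpo_surrogate(
    token_logps: List[List[float]], rewards: List[float], max_len: int
) -> float:
    """Advantage-weighted log-prob loss with a constant normalizer.

    Dividing by (group_size * max_len) instead of each completion's own
    length is what removes the length-standardization term.
    """
    advantages = dr_grpo_advantages(rewards)
    total = 0.0
    for adv, logps in zip(advantages, token_logps):
        total += adv * sum(logps)
    return -total / (len(token_logps) * max_len)


def k1_kl(policy_logps: List[float], ref_logps: List[float]) -> float:
    """k1 estimator: mean of -log r with r = pi_ref / pi on policy samples."""
    diffs = [p - q for p, q in zip(policy_logps, ref_logps)]
    return sum(diffs) / len(diffs)
```

For example, rewards `[1.0, 0.0, 0.0, 1.0]` give advantages `[0.5, -0.5, -0.5, 0.5]`; a std-normalized variant would rescale these to unit variance, which Dr. GRPO deliberately avoids.
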
## Test posture

- 187 passed / 16 skipped across the full package (no regressions).
- New tests: 7 (Dr. GRPO config + SDPO alignment), 12 (layered hints), 19 (datagen).
- The SDPO channel is proven both as loss math on real traces and as wired into a live `trl.GRPOTrainer` Dr. GRPO step.

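As a rough illustration of what an SDPO-style channel computes, the sketch below assumes SDPO here means a self-distillation objective: the same model conditioned on textual feedback acts as teacher, and the hint-free student is pulled toward it token by token. The divergence direction (student to teacher), the length check used to stand in for "strict alignment", and all names are assumptions, not the code in `trainer/composer_trainer.py`.

```python
import math
from typing import List

Distribution = List[float]  # probabilities over a toy vocabulary


def token_kl(p: Distribution, q: Distribution) -> float:
    """KL(p || q) for one token position; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def self_distillation_loss(
    student: List[Distribution], teacher: List[Distribution]
) -> float:
    """Mean per-token KL(student || teacher) over aligned positions.

    Raises if the two sequences do not line up position for position,
    which is one plausible reading of a strict alignment requirement.
    """
    if len(student) != len(teacher):
        raise ValueError("student and teacher token positions are misaligned")
    if not student:
        return 0.0
    return sum(token_kl(s, t) for s, t in zip(student, teacher)) / len(student)
```

When the hint changes nothing (teacher equals student at every position), the loss is zero, so the channel only contributes gradient where the feedback shifts the model's own predictions.
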
## Reusable for the adjacent project

`composer_replication.datagen` is the data-gen primitive Codeseys wanted for
another project: **"invert a solved-repo dataset into a reimplement-to-pass
verifiable-reward task,"** with reward-hacking safeguards and an online
difficulty curriculum. `SweBenchAdapter` makes any SWE-bench-shaped dataset
(SWE-Gym, R2E-Gym, SWE-rebench, tens of thousands of tasks in total) a drop-in source.

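The inversion described above can be sketched as follows. This is not `SweBenchAdapter`: the class and function names are hypothetical, and the sketch assumes records carrying the usual SWE-bench fields (`instance_id`, `problem_statement`, `test_patch`, `FAIL_TO_PASS`, `PASS_TO_PASS`). The single safeguard shown, zero reward if the policy edits a held-out test file, is one illustrative example of a reward-hacking check.

```python
import json
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set


@dataclass(frozen=True)
class ReimplementTask:
    task_id: str
    prompt: str
    must_pass: List[str]          # tests the reimplementation has to fix
    must_stay_passing: List[str]  # tests that must not regress
    protected_paths: FrozenSet[str]  # test files the policy may not edit


def _as_list(value) -> List[str]:
    """SWE-bench stores test lists as JSON strings; accept lists too."""
    return json.loads(value) if isinstance(value, str) else list(value)


def paths_touched_by(patch: str) -> FrozenSet[str]:
    """Collect file paths from the '+++ b/<path>' headers of a unified diff."""
    prefix = "+++ b/"
    return frozenset(
        line[len(prefix):] for line in patch.splitlines() if line.startswith(prefix)
    )


def invert_record(record: dict) -> ReimplementTask:
    """Turn a solved instance into a reimplement-to-pass task."""
    return ReimplementTask(
        task_id=record["instance_id"],
        prompt=record["problem_statement"],
        must_pass=_as_list(record["FAIL_TO_PASS"]),
        must_stay_passing=_as_list(record["PASS_TO_PASS"]),
        protected_paths=paths_touched_by(record["test_patch"]),
    )


def reward(task: ReimplementTask, results: Dict[str, bool], modified: Set[str]) -> float:
    """Binary verifiable reward with a tamper check on protected test files."""
    if modified & task.protected_paths:
        return 0.0  # editing the held-out tests is treated as reward hacking
    required = task.must_pass + task.must_stay_passing
    return 1.0 if all(results.get(name, False) for name in required) else 0.0
```

Because the reward depends only on test outcomes and the set of modified paths, any dataset that exposes those fields can feed the same loop without per-dataset reward code.
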
## Owed / blocked-by (constraints this wave hit)

- **Cross-family adversarial ADR review** (GPT-5.5 / Gemini 3.1 Pro / DeepSeek
  V4 Pro per model-roster): deferred because OpenRouter ran out of credits
  mid-wave, blocking `delegate_task`. A focused single-author pre-mortem was
  substituted, which caught and fixed the stale matrix row in ADR-006. Re-run
  when credits are restored.
- **Live Docker substrate-inversion e2e** for ADR-010 (pull one SWE-bench-Lite
  image, run the 4 gates against it): wired but deferred (no Docker in the CPU
  dev env). Marked `[~]` in the ADR.
- Gateway restarted 6× during the wave; all work was kept durable via
  phase-boundary commits + detached `systemd-run` scopes.

## Commit trail

`7090729` (SDPO smoke) → `6049d00` (research) → `36ab61e` (ADRs) →
`bde5c5e` (ADR-008 code) → `2a34df4` (ADR-008 smoke+accept) →
`84740d4` (ADR-009) → `9336af3` (ADR-010).