Reinforcement Learning
Transformers
English
post-training
distillation
agentic-coding
composer-2.5
cursor
kimi-k2
grpo
dapo
diloco
openenv
trl
verl
research
methodology
Instructions to use Codeseys/composer-replication-framework with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Codeseys/composer-replication-framework with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Codeseys/composer-replication-framework", dtype="auto") - Notebooks
- Google Colab
- Kaggle
feat(trainer): ADR-008 gate-3 live GRPO+SDPO smoke PASS; ADR-008 accepted
Browse filesThe composer_grpo_sdpo_smoke instantiates a real trl.GRPOTrainer via
ComposerReplicationTrainer on Qwen2.5-0.5B with the Dr. GRPO config
(loss_type=dr_grpo, scale_rewards=none, num_iterations=1) and alpha_sdpo=1.0,
runs 1 step, and confirms loss/sdpo_kl is logged — the SDPO channel is wired
into the live Dr. GRPO loop. PASS.
Surfaced + fixed a real API drift: TRL 1.5.0 dropped GRPOConfig.max_prompt_length.
All 4 ADR-008 acceptance gates green -> status proposed->accepted; gate
checkboxes ticked with evidence; index updated.
docs/adrs/ADR-008-drgrpo-sdpo-live-channel.md
CHANGED
|
@@ -1,5 +1,5 @@
|
|
| 1 |
---
|
| 2 |
-
status:
|
| 3 |
date: 2026-05-29
|
| 4 |
amends: ADR-006
|
| 5 |
deciders: [Codeseys, ARIA]
|
|
@@ -104,11 +104,14 @@ guarantee, and add a CPU smoke test that instantiates the trainer and runs a
|
|
| 104 |
|
| 105 |
## Acceptance gate (must be green before status flips to accepted)
|
| 106 |
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
- [
|
| 111 |
-
- [
|
|
|
|
|
|
|
|
|
|
| 112 |
|
| 113 |
## More Information
|
| 114 |
|
|
|
|
| 1 |
---
|
| 2 |
+
status: accepted
|
| 3 |
date: 2026-05-29
|
| 4 |
amends: ADR-006
|
| 5 |
deciders: [Codeseys, ARIA]
|
|
|
|
| 104 |
|
| 105 |
## Acceptance gate (must be green before status flips to accepted)
|
| 106 |
|
| 107 |
+
All gates green as of 2026-05-29 (commit chain `bde5c5e`+; smoke PASS via
|
| 108 |
+
`examples/composer_grpo_sdpo_smoke`).
|
| 109 |
+
|
| 110 |
+
- [x] `make_dr_grpo_config(...)` helper exists and sets: no std-norm advantage scaling (`scale_rewards="none"`), no length-standardization (`loss_type="dr_grpo"`), Adam, `num_iterations=1` (single-epoch) — each asserted in `test_make_dr_grpo_config_sets_dr_grpo_knobs`. (k1 KL is TRL's native estimator; `beta` left at caller's choice.)
|
| 111 |
+
- [x] `ComposerReplicationTrainer._compute_sdpo_loss` trust-gap closed: `strict_sdpo_alignment=True` (default) raises `ValueError` on a student/teacher shape mismatch instead of silently zeroing — `test_strict_alignment_raises_on_shape_mismatch`; non-strict warn-and-skip + aligned-fires + no-error-site no-op also tested.
|
| 112 |
+
- [x] CPU smoke (`examples/composer_grpo_sdpo_smoke/run.py`) instantiates `ComposerReplicationTrainer` with a real `trl.GRPOTrainer` parent on Qwen2.5-0.5B, runs 1 Dr. GRPO step with `alpha_sdpo=1.0`, asserts the step runs and `loss/sdpo_kl` is logged (SDPO channel wired into the live loop). PASS 2026-05-29. NOTE: signal *firing* (`sdpo_kl>0`) on real error-aligned batches is proven separately by `examples/sdpo_real_trace_train_smoke`; the toy GRPO rollouts here carry no collator-built error sites, so the wrapper smoke asserts wiring + finite step, not a positive KL value.
|
| 113 |
+
- [x] TRL pinned in `[train]` extra; the Dr-GRPO config knobs guarded with a drift assertion in `make_dr_grpo_config` (fails loudly if `loss_type`/`scale_rewards`/`num_iterations` semantics drift). Verified against TRL 1.5.0, whose own help text cites the Dr. GRPO paper for both knobs.
|
| 114 |
+
- [x] `recipes/prime_rl/composer_loss.py` `NotImplementedError(alpha_sdpo>0)` path tested (`test_alpha_sdpo_nonzero_raises_not_implemented`) and its message now points at the TRL host (documents the ADR-006 amendment in code).
|
| 115 |
|
| 116 |
## More Information
|
| 117 |
|
docs/adrs/README.md
CHANGED
|
@@ -9,7 +9,7 @@
|
|
| 9 |
| [ADR-005](ADR-005-serverless-diloco.md) | Serverless DiLoCo | accepted | — |
|
| 10 |
| [ADR-006](ADR-006-rl-frameworks.md) | RL framework strategy: TRL + VeRL + PRIME-RL | accepted (amended-by ADR-008) | 2026-05-26 |
|
| 11 |
| [ADR-007](ADR-007-self-distillation-losses.md) | Self-distillation losses landscape | accepted | 2026-05-26 |
|
| 12 |
-
| [ADR-008](ADR-008-drgrpo-sdpo-live-channel.md) | Target Dr. GRPO + host live SDPO channel in TRL trainer |
|
| 13 |
| [ADR-009](ADR-009-layered-hint-generator.md) | Layered HintGenerator for SDPO textual feedback | proposed | 2026-05-29 |
|
| 14 |
| [ADR-010](ADR-010-feature-deletion-datagen.md) | FeatureDeletionEnv synthetic-data subsystem over OSS SWE substrates | proposed | 2026-05-29 |
|
| 15 |
|
|
|
|
| 9 |
| [ADR-005](ADR-005-serverless-diloco.md) | Serverless DiLoCo | accepted | — |
|
| 10 |
| [ADR-006](ADR-006-rl-frameworks.md) | RL framework strategy: TRL + VeRL + PRIME-RL | accepted (amended-by ADR-008) | 2026-05-26 |
|
| 11 |
| [ADR-007](ADR-007-self-distillation-losses.md) | Self-distillation losses landscape | accepted | 2026-05-26 |
|
| 12 |
+
| [ADR-008](ADR-008-drgrpo-sdpo-live-channel.md) | Target Dr. GRPO + host live SDPO channel in TRL trainer | accepted | 2026-05-29 |
|
| 13 |
| [ADR-009](ADR-009-layered-hint-generator.md) | Layered HintGenerator for SDPO textual feedback | proposed | 2026-05-29 |
|
| 14 |
| [ADR-010](ADR-010-feature-deletion-datagen.md) | FeatureDeletionEnv synthetic-data subsystem over OSS SWE substrates | proposed | 2026-05-29 |
|
| 15 |
|
examples/composer_grpo_sdpo_smoke/run.py
CHANGED
|
@@ -79,7 +79,6 @@ def main() -> int:
|
|
| 79 |
per_device_train_batch_size=2,
|
| 80 |
num_generations=2,
|
| 81 |
max_completion_length=8,
|
| 82 |
-
max_prompt_length=32,
|
| 83 |
max_steps=1,
|
| 84 |
logging_steps=1,
|
| 85 |
report_to=[],
|
|
|
|
| 79 |
per_device_train_batch_size=2,
|
| 80 |
num_generations=2,
|
| 81 |
max_completion_length=8,
|
|
|
|
| 82 |
max_steps=1,
|
| 83 |
logging_steps=1,
|
| 84 |
report_to=[],
|