# Examples Index

Five CPU-runnable examples demonstrate the framework end-to-end on
real HF causal LMs. They progress from the simplest to the most
methodologically complete:

| # | Example | Trace source | Channels | Wall-clock | Closes |
|---|---|---|---|---|---|
| 1 | [`qwen_05b_quickstart/`](qwen_05b_quickstart/) | minimal toy | LM-CE only | ~30s | "does the package import + run at all" |
| 2 | [`gsm8k_grpo/`](gsm8k_grpo/) | hand-written GSM8K (100 rows) | GRPO with `alpha=beta=0` | ~60s | Plain-GRPO baseline reference |
| 3 | [`gsm8k_grpo_with_sdpo/`](gsm8k_grpo_with_sdpo/) | hand-written GSM8K (B=2) | GRPO + SDPO column | ~25s | SDPO column wiring on synthetic prompts |
| 4 | [`sdpo_with_real_traces/`](sdpo_with_real_traces/) | `ClaudeCodeIngester` reading a hand-authored session JSONL | GRPO + SDPO column | ~30s | **Partial V5**: ingestion path validated; wiring smoke (misaligned) |
| **5** | **[`sdpo_with_real_traces_production/`](sdpo_with_real_traces_production/)** | **`ClaudeCodeIngester` → adapter → `ComposerDataCollator`** (with-error fixture) | **GRPO + SDPO (production-aligned)** | **~2min** | **V5 closure**: full production pipeline with error-site detection + properly-aligned SDPO mask |

**Recommended walk-through order**: 1 → 2 → 3 → 4 → 5. Each example
extends the scope of the one before it.

## Why five?

- **#1** verifies that the package is installable and that the loss
  composition works at all (no SDPO, no DPO; pure LM-CE on a toy model).
- **#2** uses the production `ComposerReplicationTrainer` (TRL `GRPOTrainer`
  subclass) on a real GSM8K dataset with a regex-extract reward. This
  is the recipe a new user copy-pastes to start.
- **#3** drops the TRL trainer wrapper and calls `compose_loss` directly
  on hand-crafted hint contexts. It is the simplest place to see that
  `alpha_sdpo=0.5` changes the loss, with all the wiring visible.
- **#4** uses real ingested Claude Code session JSONL (via
  `ClaudeCodeIngester`) but builds the SDPO batch by hand. It
  demonstrates that the ingester works, but the SDPO mask covers
  misaligned content, so it is a wiring smoke test, not production-grade.
- **#5** is the production-grade sibling to #4: it adds the
  `claude_states_to_trace_examples` adapter and uses
  `ComposerDataCollator` to build properly-aligned SDPO batches with
  hint injection at actual error sites. **This is what you should copy
  for real training.**
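
To make the "regex-extract reward" mentioned for example #2 concrete, here is a minimal sketch of one way such a reward could look. It is not taken from the framework's source: the names `extract_final_number` and `regex_extract_reward`, the `answer` column name, and the exact-match 0/1 scoring are all assumptions made for illustration. The signature is shaped loosely like the reward callables TRL's `GRPOTrainer` accepts (completions plus dataset columns as keyword arguments, returning one float per completion).

```python
import re
from typing import List, Optional

# Hypothetical helpers for illustration; not the framework's actual reward.
# Matches integers and decimals, allowing thousands separators (e.g. "1,234").
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text: str) -> Optional[str]:
    """Return the last number found in `text`, with commas stripped."""
    # GSM8K gold answers end with "#### <number>"; if that marker is
    # present, only look at what follows it.
    marker = text.rfind("####")
    if marker != -1:
        text = text[marker + 4:]
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def regex_extract_reward(completions: List[str], answer: List[str], **kwargs) -> List[float]:
    """Score 1.0 when the extracted number equals the gold number, else 0.0."""
    rewards = []
    for completion, gold in zip(completions, answer):
        pred = extract_final_number(completion)
        target = extract_final_number(gold)
        correct = (
            pred is not None
            and target is not None
            and float(pred) == float(target)
        )
        rewards.append(1.0 if correct else 0.0)
    return rewards
```

A binary exact-match reward like this is easy to audit, which suits a baseline run; a real recipe might add partial credit for format compliance.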

## What every example asserts

Each `run.py` ends with a verification block that asserts:

- The targeted channel(s) actually fired (`sdpo_jsd > 0` when `alpha_sdpo > 0`)
- The composed loss isn't trivially equal to `lm_ce` alone
- Gradient norms are finite and non-zero at every step

If any assertion fails, the script prints which channel didn't fire and
exits with a non-zero status. Each example is therefore the user's smoke
test, not just a demo.
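
The three checks above can be sketched as a small standalone function. This is an illustrative approximation, not the verification block from any `run.py`: the function names, the per-step metric dict layout (`lm_ce`, `sdpo_jsd`, `loss`, `grad_norm`), and the choice to skip the loss comparison when `alpha_sdpo` is zero are assumptions.

```python
import math


def verify_run(history, alpha_sdpo):
    """Return failure messages for a list of per-step metric dicts.

    Assumed (hypothetical) keys per step: lm_ce, sdpo_jsd, loss, grad_norm.
    """
    failures = []
    for step, metrics in enumerate(history):
        if alpha_sdpo > 0:
            # Channel check. `not (x > 0)` also catches NaN values.
            if not (metrics["sdpo_jsd"] > 0):
                failures.append(f"step {step}: SDPO channel did not fire")
            # The composed loss should differ from the LM-CE term alone.
            if math.isclose(metrics["loss"], metrics["lm_ce"], rel_tol=1e-9):
                failures.append(f"step {step}: composed loss equals lm_ce")
        # Gradient norm must be a finite, strictly positive number.
        grad_norm = metrics["grad_norm"]
        if not (math.isfinite(grad_norm) and grad_norm > 0):
            failures.append(f"step {step}: bad grad_norm {grad_norm}")
    return failures


def exit_code(failures):
    """Print each failure and return a process exit status (0 means pass)."""
    for message in failures:
        print(message)
    return 1 if failures else 0
```

A script could end with `sys.exit(exit_code(verify_run(history, alpha_sdpo)))` so that CI treats a silent channel as a failed run.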

## Production training

For real training (GPU, larger models, longer rollouts), use
`ComposerReplicationTrainer` directly with a `ComposerDataCollator`
that emits SDPO + DPO columns, which is exactly the path example #5
demonstrates. See `docs/INTEGRATION_RECIPES.md` for the production
wiring patterns.