# Examples Index
Five CPU-runnable examples demonstrating the framework end-to-end on
real HF causal LMs. They form a progression from simplest to most
methodologically complete:
| # | Example | Trace source | Channels | Wall-clock | Closes |
|---|---|---|---|---|---|
| 1 | [`qwen_05b_quickstart/`](qwen_05b_quickstart/) | minimal toy | LM-CE only | ~30s | "does the package import + run at all" |
| 2 | [`gsm8k_grpo/`](gsm8k_grpo/) | hand-written GSM8K (100 rows) | GRPO with `alpha=beta=0` | ~60s | Plain-GRPO baseline reference |
| 3 | [`gsm8k_grpo_with_sdpo/`](gsm8k_grpo_with_sdpo/) | hand-written GSM8K (B=2) | GRPO + SDPO column | ~25s | SDPO column wiring on synthetic prompts |
| 4 | [`sdpo_with_real_traces/`](sdpo_with_real_traces/) | `ClaudeCodeIngester` reading a hand-authored session JSONL | GRPO + SDPO column | ~30s | **Partial V5**: ingestion path validated; wiring smoke test only (SDPO mask misaligned) |
| **5** | **[`sdpo_with_real_traces_production/`](sdpo_with_real_traces_production/)** | **`ClaudeCodeIngester` → adapter → `ComposerDataCollator`** (with-error fixture) | **GRPO + SDPO (production-aligned)** | **~2min** | **V5 closure**: full production pipeline with error-site detection and a properly aligned SDPO mask |

**Recommended walk-through order**: 1 → 2 → 3 → 4 → 5. Each example
widens the scope of the one before it.
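
The walk-through order can be scripted. The following is a minimal sketch, not part of the package: it assumes each example directory from the table contains a `run.py` that takes no arguments, and the helper names `build_commands` and `run_all` are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Directory names are copied from the table above, in walk-through order.
EXAMPLES = [
    "qwen_05b_quickstart",
    "gsm8k_grpo",
    "gsm8k_grpo_with_sdpo",
    "sdpo_with_real_traces",
    "sdpo_with_real_traces_production",
]


def build_commands(examples_root: Path) -> list[list[str]]:
    """Return one `python <example>/run.py` command per example, in order."""
    # Assumption: run.py needs no command-line arguments.
    return [[sys.executable, str(examples_root / name / "run.py")] for name in EXAMPLES]


def run_all(examples_root: Path) -> int:
    """Run the examples in order and stop at the first non-zero exit code."""
    for cmd in build_commands(examples_root):
        print("running:", " ".join(cmd))
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return result.returncode
    return 0
```

Calling `run_all(Path("."))` from the examples directory would run all five and return the first failing exit code, or `0` if every example passes.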
## Why five?
- **#1** verifies the package is installable and the loss composition
works at all (no SDPO, no DPO — pure LM-CE on a toy model).
- **#2** uses the production `ComposerReplicationTrainer` (TRL `GRPOTrainer`
subclass) on a real GSM8K dataset with a regex-extract reward. This
is the recipe a new user copy-pastes to start.
- **#3** drops the TRL trainer wrapper and calls `compose_loss` directly
on hand-crafted hint contexts. It is the simplest place to see that
setting `alpha_sdpo=0.5` changes the loss, with all the wiring visible.
- **#4** uses real ingested Claude Code session JSONL (via
`ClaudeCodeIngester`) but builds the SDPO batch by hand. It
demonstrates that the ingester works, but the hand-built SDPO mask
covers misaligned content, so treat it as a wiring smoke test, not
production-grade code.
- **#5** is the production-grade sibling to #4: adds the
`claude_states_to_trace_examples` adapter and uses
`ComposerDataCollator` to build properly-aligned SDPO batches with
hint injection at actual error sites. **This is what you should copy
for real training.**
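
To make the effect of `alpha_sdpo` in #3 concrete, here is a toy sketch in plain Python. It is not the package's `compose_loss`, whose signature and formula are not shown in this index. It assumes the SDPO channel is a Jensen-Shannon divergence (suggested by the `sdpo_jsd` metric name) between a student distribution and a hint-conditioned teacher distribution, added to LM cross-entropy with weight `alpha_sdpo`. The function `compose_loss_sketch` and the example distributions are hypothetical.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: symmetric, and zero when p == q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def compose_loss_sketch(lm_ce, student, teacher, alpha_sdpo=0.0):
    """Hypothetical stand-in: loss = lm_ce + alpha_sdpo * JSD(student, teacher)."""
    sdpo_jsd = jsd(student, teacher)
    return {"lm_ce": lm_ce, "sdpo_jsd": sdpo_jsd, "loss": lm_ce + alpha_sdpo * sdpo_jsd}


# Assumed toy next-token distributions over a 3-token vocabulary.
student = [0.7, 0.2, 0.1]  # without the hint in context
teacher = [0.2, 0.7, 0.1]  # same position, with the hint in context

off = compose_loss_sketch(2.0, student, teacher, alpha_sdpo=0.0)
on = compose_loss_sketch(2.0, student, teacher, alpha_sdpo=0.5)
print("alpha_sdpo=0.0:", off["loss"])
print("alpha_sdpo=0.5:", on["loss"])
```

With `alpha_sdpo=0.0` the composed loss equals `lm_ce`; with `alpha_sdpo=0.5` it is strictly larger whenever the two distributions differ.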
## What every example asserts
Each `run.py` ends with a verification block that asserts:
- The targeted channel(s) actually fired (`sdpo_jsd > 0` when `alpha_sdpo > 0`)
- The composed loss isn't trivially equal to `lm_ce` alone
- Gradient norms are finite and non-zero at every step

If any assertion fails, the script prints which channel did not fire
and exits with a non-zero status. Each example therefore doubles as a
smoke test, not just a demo.
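
A verification block of this kind might look like the sketch below. It is an illustration, not the code in any `run.py`. The per-step metric dicts and their keys (`loss`, `lm_ce`, `sdpo_jsd`, `grad_norm`), the helper names, and the choice to skip the loss comparison when `alpha_sdpo` is zero are all assumptions.

```python
import math


def verify(history, alpha_sdpo):
    """Return failure messages; an empty list means every check passed.

    `history` is a hypothetical list of per-step metric dicts with the keys
    "loss", "lm_ce", "sdpo_jsd", and "grad_norm".
    """
    failures = []
    for step, m in enumerate(history):
        # The targeted channel must fire when it is enabled.
        if alpha_sdpo > 0 and not m["sdpo_jsd"] > 0:
            failures.append(f"step {step}: SDPO channel did not fire")
        # The composed loss must differ from plain LM cross-entropy.
        if alpha_sdpo > 0 and math.isclose(m["loss"], m["lm_ce"]):
            failures.append(f"step {step}: composed loss equals lm_ce")
        # Gradient norms must be finite and non-zero.
        if not (math.isfinite(m["grad_norm"]) and m["grad_norm"] > 0):
            failures.append(f"step {step}: bad grad norm {m['grad_norm']}")
    return failures


def exit_code(history, alpha_sdpo):
    """Print every failure and return the process exit status (0 = pass)."""
    failures = verify(history, alpha_sdpo)
    for message in failures:
        print("FAILED:", message)
    return 1 if failures else 0
```

A script would end with `sys.exit(exit_code(history, alpha_sdpo))`, so a CI job or shell loop can detect a failed example from the exit status alone.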
## Production training
For real training (GPU, larger models, longer rollouts), use
`ComposerReplicationTrainer` directly with a `ComposerDataCollator`
that emits SDPO + DPO columns — exactly the path example #5
demonstrates. See `docs/INTEGRATION_RECIPES.md` for the production
wiring patterns.