# Quickstart: Qwen2.5-0.5B-Instruct on CPU

Run the Composer Replication Framework's 3-channel loss composition end-to-end
on a small open model in under 5 minutes on CPU.

## Setup

```bash
cd /path/to/composer-replication-framework
pip install -e .
```

(`-e` performs an editable install, so local code changes take effect without reinstalling.)

## Run

```bash
python examples/qwen_05b_quickstart/run.py
```

## Expected output

```
[quickstart] loading Qwen/Qwen2.5-0.5B-Instruct (CPU, fp32) ...
[quickstart] loaded — 0.494B params
[quickstart] building real chat-template batch ...
[quickstart] running 5 backward steps ...
step 0: total=0.7390 lm_ce=0.7385 sdpo=0.0000 dpo=0.0114 finite=True
step 1: total=0.2090 lm_ce=0.2086 sdpo=0.0000 dpo=0.0084 finite=True
step 2: total=0.0501 lm_ce=0.0496 sdpo=0.0000 dpo=0.0093 finite=True
step 3: total=0.0094 lm_ce=0.0089 sdpo=0.0000 dpo=0.0094 finite=True
step 4: total=0.0031 lm_ce=0.0029 sdpo=0.0000 dpo=0.0044 finite=True
========================================================
Initial loss: 0.7390
Final loss: 0.0031
Reduction: 99.6%
Verdict: PASS
========================================================
```
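
The summary figures in the log can be checked by hand. The short sketch below recomputes the reduction from the first and last `total` values, then asks what weight on the `dpo` column would explain the gap between `total` and `lm_ce`. The weighted-sum form `total = lm_ce + beta * dpo` (with `sdpo` at zero) is an assumption made for this check, not something the quickstart states. The implied weight is only as precise as the four-decimal rounding in the log allows.

```python
# Values copied from the quickstart log (already rounded to four decimals).
steps = [
    # (total, lm_ce, sdpo, dpo)
    (0.7390, 0.7385, 0.0000, 0.0114),
    (0.2090, 0.2086, 0.0000, 0.0084),
    (0.0501, 0.0496, 0.0000, 0.0093),
    (0.0094, 0.0089, 0.0000, 0.0094),
    (0.0031, 0.0029, 0.0000, 0.0044),
]

# Relative reduction between the first and last total loss.
initial, final = steps[0][0], steps[-1][0]
reduction_pct = 100.0 * (initial - final) / initial
print(f"Reduction: {reduction_pct:.1f}%")

# ASSUMPTION: total = lm_ce + beta * dpo when sdpo is zero.
# Solve for the beta that each logged step would imply.
implied_betas = [(total - lm_ce) / dpo for total, lm_ce, _sdpo, dpo in steps]
for i, b in enumerate(implied_betas):
    print(f"step {i}: implied DPO weight ~ {b:.3f}")
```

Under that assumption every step implies a small DPO weight of the same order, which is why `total` tracks `lm_ce` so closely in this run.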

## What this demonstrates

- `build_batch(tokenizer)` produces a real chat-template-formatted batch
  with all the keys the 3-channel loss composer needs.
- `compose_loss(model, batch, alpha_sdpo, beta_replay)` returns a
  `LossComponents` object with a per-channel breakdown.
- The backward pass through `components.total` flows into all three channels:
  - `lm_ce`: the GRPO stub (cross-entropy on response tokens, the limit
    GRPO converges to under deterministic rewards).
  - `sdpo_jsd` (logged as `sdpo`): hint distillation between student logits
    and hint-conditioned teacher logits.
  - `trace_replay_dpo` (logged as `dpo`): DPO loss over (chosen, rejected)
    pairs from multi-teacher disagreement.
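
To make the three channels concrete, here is a minimal pure-Python sketch on toy numbers. It uses the textbook forms of each term: token cross-entropy, Jensen-Shannon divergence between two softmax distributions, and the standard DPO loss. The helper names (`cross_entropy`, `jsd`, `dpo_loss`, `compose`), the `dpo_temperature` parameter, and the weighted-sum composition are all hypothetical and chosen for illustration. The framework's actual `compose_loss` may normalize, mask, or weight things differently.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(logits, target):
    """Channel 1 (lm_ce): negative log-probability of the target token."""
    return -math.log(softmax(logits)[target])

def kl(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(student_logits, teacher_logits):
    """Channel 2 (sdpo_jsd): Jensen-Shannon divergence, bounded by ln 2."""
    p, q = softmax(student_logits), softmax(teacher_logits)
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected,
             dpo_temperature=0.1):
    """Channel 3 (trace_replay_dpo): -log sigmoid(temperature * margin).

    Inputs are sequence log-probabilities under the policy and a reference.
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return math.log1p(math.exp(-dpo_temperature * margin))

def compose(lm_ce, sdpo_jsd, trace_replay_dpo, alpha_sdpo, beta_replay):
    """ASSUMED composition: a plain weighted sum of the three channels."""
    return lm_ce + alpha_sdpo * sdpo_jsd + beta_replay * trace_replay_dpo

# Toy inputs: a 3-token vocabulary and made-up sequence log-probabilities.
student = [2.0, 0.5, -1.0]
teacher = [1.5, 1.0, -0.5]
ce = cross_entropy(student, target=0)
js = jsd(student, teacher)
dp = dpo_loss(pol_chosen=-4.0, pol_rejected=-7.0,
              ref_chosen=-5.0, ref_rejected=-6.0)
total = compose(ce, js, dp, alpha_sdpo=0.5, beta_replay=0.05)
print(f"lm_ce={ce:.4f} sdpo_jsd={js:.4f} dpo={dp:.4f} total={total:.4f}")
```

Two sanity properties are worth noting: the JSD term is exactly zero when student and teacher logits match, and the DPO term equals ln 2 when the policy and the reference assign identical log-probabilities.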

## What this does NOT demonstrate

- Real GRPO rollouts and reward calculation. Use `ComposerReplicationTrainer`
  for that: it is a TRL `GRPOTrainer` subclass that wraps the same 3-channel
  loss.
- Real teacher calls. Those go through `composer_replication.replay_trace`
  and OpenRouter, at ~$0.98 per 50-step trace at last measurement.
- The DiLoCo outer loop. It is separate and needs `torchft-nightly`; once that
  is installed, it is one `make_diloco_outer_loop()` call away.

## Cost

- $0
- ~3-5 minutes wall-clock on CPU
- ~1 GB disk for Qwen2.5-0.5B weights (downloaded once into `~/.cache/huggingface`)