Quickstart: Qwen2.5-0.5B-Instruct on CPU
Run the Composer Replication Framework's 3-channel loss composition end-to-end on a small open model in under 5 minutes on CPU.
Setup
cd /path/to/composer-replication-framework
pip install -e .
(`-e` performs an editable install, so local code changes are picked up without reinstalling.)
Run
python examples/qwen_05b_quickstart/run.py
Expected output
[quickstart] loading Qwen/Qwen2.5-0.5B-Instruct (CPU, fp32) ...
[quickstart] loaded — 0.494B params
[quickstart] building real chat-template batch ...
[quickstart] running 5 backward steps ...
step 0: total=0.7390 lm_ce=0.7385 sdpo=0.0000 dpo=0.0114 finite=True
step 1: total=0.2090 lm_ce=0.2086 sdpo=0.0000 dpo=0.0084 finite=True
step 2: total=0.0501 lm_ce=0.0496 sdpo=0.0000 dpo=0.0093 finite=True
step 3: total=0.0094 lm_ce=0.0089 sdpo=0.0000 dpo=0.0094 finite=True
step 4: total=0.0031 lm_ce=0.0029 sdpo=0.0000 dpo=0.0044 finite=True
========================================================
Initial loss: 0.7390
Final loss: 0.0031
Reduction: 99.6%
Verdict: PASS
========================================================
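To check a run against the expected output programmatically, the per-step lines can be parsed and the summary recomputed. The sketch below is not part of the framework: `parse_steps`, `summarize`, and the `min_reduction` threshold are hypothetical names and values chosen for illustration, and the actual PASS criterion used by `run.py` may differ.

```python
import re

# Matches lines such as:
# step 0: total=0.7390 lm_ce=0.7385 sdpo=0.0000 dpo=0.0114 finite=True
STEP_RE = re.compile(
    r"step (?P<step>\d+): total=(?P<total>[\d.]+) lm_ce=(?P<lm_ce>[\d.]+) "
    r"sdpo=(?P<sdpo>[\d.]+) dpo=(?P<dpo>[\d.]+) finite=(?P<finite>True|False)"
)


def parse_steps(log_text):
    """Extract one dict per 'step N: ...' line found in the log text."""
    steps = []
    for m in STEP_RE.finditer(log_text):
        row = {k: float(m[k]) for k in ("total", "lm_ce", "sdpo", "dpo")}
        row["step"] = int(m["step"])
        row["finite"] = m["finite"] == "True"
        steps.append(row)
    return steps


def summarize(steps, min_reduction=0.5):
    """Recompute the summary block. min_reduction is an assumed threshold."""
    initial, final = steps[0]["total"], steps[-1]["total"]
    reduction = (initial - final) / initial
    ok = all(s["finite"] for s in steps) and reduction >= min_reduction
    return {
        "initial": initial,
        "final": final,
        "reduction_pct": 100.0 * reduction,
        "verdict": "PASS" if ok else "FAIL",
    }


LOG = """
step 0: total=0.7390 lm_ce=0.7385 sdpo=0.0000 dpo=0.0114 finite=True
step 1: total=0.2090 lm_ce=0.2086 sdpo=0.0000 dpo=0.0084 finite=True
step 2: total=0.0501 lm_ce=0.0496 sdpo=0.0000 dpo=0.0093 finite=True
step 3: total=0.0094 lm_ce=0.0089 sdpo=0.0000 dpo=0.0094 finite=True
step 4: total=0.0031 lm_ce=0.0029 sdpo=0.0000 dpo=0.0044 finite=True
"""

steps = parse_steps(LOG)
summary = summarize(steps)
print(f"Reduction: {summary['reduction_pct']:.1f}%  Verdict: {summary['verdict']}")
# prints: Reduction: 99.6%  Verdict: PASS
```

Exact loss values will vary with hardware and library versions, so comparing the reduction and the `finite` flags is likely more robust than comparing individual numbers.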
What this demonstrates
- `build_batch(tokenizer)` produces a real chat-template-formatted batch with all keys the 3-channel loss composer needs.
- `compose_loss(model, batch, alpha_sdpo, beta_replay)` returns `LossComponents` with per-channel breakdown.
- Backward pass through `components.total` flows into all three channels:
  - `lm_ce`: the GRPO stub (cross-entropy on response tokens, the limit GRPO converges to under deterministic rewards).
  - `sdpo_jsd`: hint-distillation between student logits and hint-conditioned-teacher logits.
  - `trace_replay_dpo`: DPO loss over (chosen, rejected) pairs from multi-teacher disagreement.
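The three channels listed above can be illustrated with a dependency-free sketch. This is not the framework's `compose_loss`: the weighted-sum form `total = lm_ce + alpha_sdpo * sdpo_jsd + beta_replay * trace_replay_dpo`, the `dpo_beta` temperature, and every function and class name below are assumptions made for illustration, using the standard textbook definitions of cross-entropy, Jensen-Shannon divergence, and the DPO loss.

```python
import math
from dataclasses import dataclass


@dataclass
class LossComponentsSketch:
    """Hypothetical stand-in for the per-channel breakdown described above."""
    lm_ce: float
    sdpo_jsd: float
    trace_replay_dpo: float
    total: float


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the target token."""
    return -math.log(softmax(logits)[target])


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(student_logits, teacher_logits):
    """Jensen-Shannon divergence between student and teacher distributions."""
    p, q = softmax(student_logits), softmax(teacher_logits)
    mix = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, dpo_beta=0.1):
    """Standard DPO loss on sequence log-probs: -log sigmoid(dpo_beta * margin)."""
    margin = dpo_beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


def compose_loss_sketch(lm_ce, sdpo_jsd, trace_replay_dpo, alpha_sdpo, beta_replay):
    """Assumed composition: a weighted sum of the three channels."""
    total = lm_ce + alpha_sdpo * sdpo_jsd + beta_replay * trace_replay_dpo
    return LossComponentsSketch(lm_ce, sdpo_jsd, trace_replay_dpo, total)


# Toy inputs over a 3-token vocabulary (made-up numbers).
ce = cross_entropy([2.0, 0.5, -1.0], target=0)
js = jsd([2.0, 0.5, -1.0], [1.5, 1.0, -0.5])
dpo = dpo_loss(pol_chosen=-4.0, pol_rejected=-6.0, ref_chosen=-5.0, ref_rejected=-5.5)
components = compose_loss_sketch(ce, js, dpo, alpha_sdpo=0.1, beta_replay=0.05)
print(components)
```

In a real training step the same structure would operate on tensors so that calling backward on the total propagates gradients into every channel with a nonzero weight.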
What this does NOT demonstrate
- Real GRPO rollouts + reward calculation (use `ComposerReplicationTrainer` for that; it is a TRL `GRPOTrainer` subclass that wraps the same 3-channel loss).
- Real teacher calls (those go through `composer_replication.replay_trace` to OpenRouter; ~$0.98 per 50-step trace at last measurement).
- DiLoCo outer loop (separate; needs `torchft-nightly` and is a `make_diloco_outer_loop()` call away once installed).
Cost
- $0
- ~3-5 minutes wall-clock on CPU
- 1 GB disk for Qwen2.5-0.5B weights (downloaded once into `~/.cache/huggingface`)
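The disk figure can be sanity-checked from the parameter count in the expected output. This is a rough estimate only: it assumes the checkpoint stores 2 bytes per parameter (bf16) and that the fp32 load uses 4 bytes per parameter in memory, and it ignores tokenizer files and runtime overhead. `size_gb` is a hypothetical helper.

```python
# Parameter count reported by the quickstart log ("0.494B params").
PARAMS = 0.494e9

# Assumed storage widths per parameter, in bytes.
BYTES_PER_PARAM = {"bf16": 2, "fp32": 4}


def size_gb(n_params, dtype):
    """Approximate size in decimal gigabytes for a given dtype."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


disk_gb = size_gb(PARAMS, "bf16")   # weights as downloaded (assumed bf16)
ram_gb = size_gb(PARAMS, "fp32")    # weights after loading in fp32 on CPU
print(f"disk ~{disk_gb:.2f} GB, fp32 weights in RAM ~{ram_gb:.2f} GB")
```

Under these assumptions the download is just under 1 GB, while the fp32 copy in memory is roughly twice that, before counting gradients and activations for the backward steps.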