# Quickstart: Qwen2.5-0.5B-Instruct on CPU
Run the Composer Replication Framework's 3-channel loss composition end-to-end
on a small open model in under 5 minutes on CPU.
## Setup
```bash
cd /path/to/composer-replication-framework
pip install -e .
```
(`-e` performs an editable install, so local code changes take effect without reinstalling.)
## Run
```bash
python examples/qwen_05b_quickstart/run.py
```
## Expected output
```
[quickstart] loading Qwen/Qwen2.5-0.5B-Instruct (CPU, fp32) ...
[quickstart] loaded — 0.494B params
[quickstart] building real chat-template batch ...
[quickstart] running 5 backward steps ...
step 0: total=0.7390 lm_ce=0.7385 sdpo=0.0000 dpo=0.0114 finite=True
step 1: total=0.2090 lm_ce=0.2086 sdpo=0.0000 dpo=0.0084 finite=True
step 2: total=0.0501 lm_ce=0.0496 sdpo=0.0000 dpo=0.0093 finite=True
step 3: total=0.0094 lm_ce=0.0089 sdpo=0.0000 dpo=0.0094 finite=True
step 4: total=0.0031 lm_ce=0.0029 sdpo=0.0000 dpo=0.0044 finite=True
========================================================
Initial loss: 0.7390
Final loss: 0.0031
Reduction: 99.6%
Verdict: PASS
========================================================
```
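To check a run against the expected output, the summary numbers can be recomputed from the per-step lines. The following is a minimal sketch that parses log text in the format shown above; the `min_reduction` threshold and the pass rule are hypothetical stand-ins, since the actual criterion used by `run.py` is not stated here.

```python
import re

# Two lines copied from the expected output above (first and last step).
LOG = """\
step 0: total=0.7390 lm_ce=0.7385 sdpo=0.0000 dpo=0.0114 finite=True
step 4: total=0.0031 lm_ce=0.0029 sdpo=0.0000 dpo=0.0044 finite=True
"""

STEP_RE = re.compile(r"step (\d+): total=([\d.]+) .* finite=(True|False)")


def parse_steps(log):
    """Return (step, total_loss, finite) tuples found in quickstart log text."""
    steps = []
    for line in log.splitlines():
        match = STEP_RE.search(line)
        if match:
            steps.append(
                (int(match.group(1)), float(match.group(2)), match.group(3) == "True")
            )
    return steps


def summarize(steps, min_reduction=0.5):
    """Hypothetical pass rule: every step finite and loss reduced by min_reduction."""
    initial, final = steps[0][1], steps[-1][1]
    reduction = (initial - final) / initial
    ok = all(finite for _, _, finite in steps) and reduction >= min_reduction
    return reduction, "PASS" if ok else "FAIL"


reduction, verdict = summarize(parse_steps(LOG))
print(f"Reduction: {reduction:.1%}  Verdict: {verdict}")
```

With the first and last totals from the expected output, the relative reduction is (0.7390 - 0.0031) / 0.7390, which matches the 99.6% figure reported in the summary.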
## What this demonstrates
- `build_batch(tokenizer)` produces a real chat-template-formatted batch
  with all keys the 3-channel loss composer needs.
- `compose_loss(model, batch, alpha_sdpo, beta_replay)` returns
  `LossComponents` with per-channel breakdown.
- Backward pass through `components.total` flows into all three channels:
  - `lm_ce`: the GRPO stub (cross-entropy on response tokens, the limit
    GRPO converges to under deterministic rewards).
  - `sdpo_jsd`: hint-distillation between student logits and
    hint-conditioned-teacher logits.
  - `trace_replay_dpo`: DPO loss over (chosen, rejected) pairs from
    multi-teacher disagreement.
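To make the three channels concrete, here is a toy sketch on plain Python numbers rather than tensors. It assumes the total is a weighted sum `lm_ce + alpha_sdpo * sdpo + beta_replay * dpo`, which is a guess at what `compose_loss` does and not taken from the framework's code. The helper names (`cross_entropy`, `jsd`, `dpo_loss`, `compose`) and all numeric values are hypothetical.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target token under a probability vector."""
    return -math.log(probs[target])


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective on sequence log-probs: -log(sigmoid(beta * margin))."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * margin))


def compose(lm_ce, sdpo, dpo, alpha_sdpo, beta_replay):
    """Assumed composition: a weighted sum of the three channels."""
    return lm_ce + alpha_sdpo * sdpo + beta_replay * dpo


# Toy next-token distributions over a 3-token vocabulary (illustrative values).
student = [0.7, 0.2, 0.1]
teacher = [0.6, 0.3, 0.1]

lm_ce = cross_entropy(student, target=0)
sdpo = jsd(student, teacher)
dpo = dpo_loss(-4.0, -6.0, -5.0, -5.0)
total = compose(lm_ce, sdpo, dpo, alpha_sdpo=0.5, beta_replay=0.05)
print(f"total={total:.4f} lm_ce={lm_ce:.4f} sdpo={sdpo:.4f} dpo={dpo:.4f}")
```

Because the total is a sum, a single backward pass through it would send gradient into each channel whose weight is nonzero. The JSD term is exactly zero whenever the student and teacher distributions coincide.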
## What this does NOT demonstrate
- Real GRPO rollouts and reward calculation. Use `ComposerReplicationTrainer`
  for that: a TRL `GRPOTrainer` subclass that wraps the same 3-channel
  loss.
- Real teacher calls. Those go through `composer_replication.replay_trace`
  and OpenRouter (~$0.98 per 50-step trace at last measurement).
- The DiLoCo outer loop. It is separate, needs `torchft-nightly`, and is one
  `make_diloco_outer_loop()` call away once that is installed.
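For readers unfamiliar with the DiLoCo outer loop mentioned in the last item, the general idea can be sketched on plain Python lists. Each worker trains locally from shared parameters, and an outer optimizer then applies Nesterov-momentum SGD to the averaged parameter displacement (the "pseudo-gradient"). This illustrates the technique only and does not show what `make_diloco_outer_loop()` does. The function name `outer_step` and the default hyperparameters are illustrative assumptions.

```python
def outer_step(global_params, worker_params, velocity, lr=0.7, momentum=0.9):
    """One DiLoCo-style outer update: Nesterov-momentum SGD on the pseudo-gradient."""
    n = len(worker_params)
    new_params, new_velocity = [], []
    for j, theta in enumerate(global_params):
        # Pseudo-gradient: average displacement of workers from the shared start.
        delta = sum(theta - worker[j] for worker in worker_params) / n
        v = momentum * velocity[j] + delta
        # Nesterov look-ahead (PyTorch SGD convention): step along delta + momentum * v.
        new_params.append(theta - lr * (delta + momentum * v))
        new_velocity.append(v)
    return new_params, new_velocity


# Two workers that each ran some local steps from the same global parameters.
global_params = [1.0, 2.0]
workers = [[0.8, 2.0], [0.6, 2.2]]

params, velocity = outer_step(global_params, workers, velocity=[0.0, 0.0])
print(params, velocity)
```

A useful sanity check on the design: with `lr=1.0` and `momentum=0.0`, the outer step reduces to plain averaging of the worker parameters.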
## Cost
- $0
- ~3-5 minutes wall-clock on CPU
- ~1 GB disk for Qwen2.5-0.5B weights (downloaded once into `~/.cache/huggingface`)
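The disk figure can be sanity-checked from the parameter count printed in the expected output. The sketch below assumes the published checkpoint stores weights in a 16-bit format (2 bytes per parameter), which is an assumption rather than something stated in this quickstart. The fp32 line reflects the `CPU, fp32` load mode shown in the log and covers weights only, not gradients or activations.

```python
PARAMS = 0.494e9  # parameter count from the "[quickstart] loaded" log line


def weights_gb(num_params, bytes_per_param):
    """Size of the raw weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


print(f"16-bit checkpoint on disk: ~{weights_gb(PARAMS, 2):.2f} GB")
print(f"fp32 weights in memory:    ~{weights_gb(PARAMS, 4):.2f} GB")
```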