# Quickstart: Qwen2.5-0.5B-Instruct on CPU
Run the Composer Replication Framework's 3-channel loss composition end-to-end
on a small open model in under 5 minutes on CPU.
## Setup
```bash
cd /path/to/composer-replication-framework
pip install -e .
```
The `-e` flag makes this an editable install, so local code changes take effect without reinstalling.
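To confirm the install worked before running the example, a quick importability check can help. This is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, and the package name `composer_replication` is taken from the module path mentioned later in this README.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if the import system can locate `package`."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    name = "composer_replication"
    print(f"{name} importable: {is_installed(name)}")
```

If this reports `False`, re-check that `pip install -e .` ran from the repository root and in the same Python environment.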
## Run
```bash
python examples/qwen_05b_quickstart/run.py
```
## Expected output
```
[quickstart] loading Qwen/Qwen2.5-0.5B-Instruct (CPU, fp32) ...
[quickstart] loaded — 0.494B params
[quickstart] building real chat-template batch ...
[quickstart] running 5 backward steps ...
step 0: total=0.7390 lm_ce=0.7385 sdpo=0.0000 dpo=0.0114 finite=True
step 1: total=0.2090 lm_ce=0.2086 sdpo=0.0000 dpo=0.0084 finite=True
step 2: total=0.0501 lm_ce=0.0496 sdpo=0.0000 dpo=0.0093 finite=True
step 3: total=0.0094 lm_ce=0.0089 sdpo=0.0000 dpo=0.0094 finite=True
step 4: total=0.0031 lm_ce=0.0029 sdpo=0.0000 dpo=0.0044 finite=True
========================================================
Initial loss: 0.7390
Final loss: 0.0031
Reduction: 99.6%
Verdict: PASS
========================================================
```
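The `Reduction` figure can be reproduced from the initial and final totals as a relative decrease. A minimal sketch, assuming that is how the script computes it (the threshold behind `Verdict: PASS` is not stated here, so it is not modelled):

```python
def loss_reduction_pct(initial: float, final: float) -> float:
    """Relative decrease from `initial` to `final`, in percent."""
    return 100.0 * (initial - final) / initial


# Values copied from the expected output above.
print(f"Reduction: {loss_reduction_pct(0.7390, 0.0031):.1f}%")  # Reduction: 99.6%
```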
## What this demonstrates
- `build_batch(tokenizer)` produces a real chat-template-formatted batch
  with all the keys the 3-channel loss composer needs.
- `compose_loss(model, batch, alpha_sdpo, beta_replay)` returns a
  `LossComponents` with a per-channel breakdown.
- The backward pass through `components.total` covers all three channels:
  - `lm_ce`: the GRPO stub (cross-entropy on response tokens, the limit
    GRPO converges to under deterministic rewards).
  - `sdpo_jsd` (logged as `sdpo`): hint distillation between student logits
    and hint-conditioned teacher logits. In the expected output above this
    channel prints as 0.0000 at four decimal places.
  - `trace_replay_dpo` (logged as `dpo`): DPO loss over (chosen, rejected)
    pairs from multi-teacher disagreement.
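The three channels above can be illustrated on plain numbers, without a model. The sketch below is not the framework's implementation. It assumes that `total` is a weighted sum with `alpha_sdpo` and `beta_replay` as the weights, that `sdpo_jsd` is a Jensen-Shannon divergence (suggested by the name), and that the DPO channel uses the standard DPO formula. `LossParts`, `jsd`, `dpo_loss`, and `compose` are hypothetical names, and `beta` inside `dpo_loss` is the DPO temperature, not `beta_replay`.

```python
import math
from dataclasses import dataclass


@dataclass
class LossParts:
    """Hypothetical stand-in for LossComponents; fields mirror the list above."""
    lm_ce: float
    sdpo_jsd: float
    trace_replay_dpo: float
    total: float


def kl(p, q):
    """KL(p || q) for discrete distributions, skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jsd(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss, -log(sigmoid(margin)), from sequence log-probabilities."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(x)) == softplus(-x), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def compose(lm_ce, sdpo, dpo, alpha_sdpo, beta_replay):
    """ASSUMPTION: total is a weighted sum of the three channels."""
    total = lm_ce + alpha_sdpo * sdpo + beta_replay * dpo
    return LossParts(lm_ce, sdpo, dpo, total)


# Illustrative numbers only; the weights are not the quickstart's settings.
parts = compose(
    lm_ce=1.0,
    sdpo=jsd([0.7, 0.3], [0.5, 0.5]),      # student vs. teacher distribution
    dpo=dpo_loss(-4.0, -6.0, -5.0, -5.0),  # policy prefers the chosen response
    alpha_sdpo=0.5,
    beta_replay=0.25,
)
print(parts)
```

Under this assumed form, a channel whose value is zero contributes nothing to `total` regardless of its weight, and a DPO margin of zero gives a loss of `log(2)`.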
## What this does NOT demonstrate
- Real GRPO rollouts and reward calculation. Use `ComposerReplicationTrainer`
  for that: it is a TRL `GRPOTrainer` subclass that wraps the same 3-channel
  loss.
- Real teacher calls. Those go through `composer_replication.replay_trace`
  and OpenRouter (~$0.98 per 50-step trace at last measurement).
- The DiLoCo outer loop. It is separate and needs `torchft-nightly`; once
  that is installed, it is a `make_diloco_outer_loop()` call away.
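For budgeting real teacher calls, the quoted figure can be turned into a rough estimate. This sketch assumes cost scales linearly with the number of steps, which the measurement above does not guarantee; `estimate_trace_cost_usd` is a hypothetical helper, not part of the package.

```python
COST_PER_TRACE_USD = 0.98  # figure quoted above, "at last measurement"
STEPS_PER_TRACE = 50


def estimate_trace_cost_usd(steps: int) -> float:
    """Rough teacher-call cost, ASSUMING cost is linear in the step count."""
    return COST_PER_TRACE_USD * steps / STEPS_PER_TRACE


for steps in (50, 200, 1000):
    print(f"{steps:>5} steps: ~${estimate_trace_cost_usd(steps):.2f}")
```

Actual spend depends on the teacher models and token counts, so treat the result as an order-of-magnitude guide.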
## Cost
- $0
- ~3-5 minutes wall-clock on CPU
- ~1 GB disk for Qwen2.5-0.5B weights (downloaded once into `~/.cache/huggingface`)