# Quickstart: Qwen2.5-0.5B-Instruct on CPU

Run the Composer Replication Framework's 3-channel loss composition end-to-end on a small open model in under 5 minutes on CPU.

## Setup

```bash
cd /path/to/composer-replication-framework
pip install -e .
```

(`-e` for editable install: picks up local code changes without re-installing.)

## Run

```bash
python examples/qwen_05b_quickstart/run.py
```

## Expected output

```
[quickstart] loading Qwen/Qwen2.5-0.5B-Instruct (CPU, fp32) ...
[quickstart] loaded — 0.494B params
[quickstart] building real chat-template batch ...
[quickstart] running 5 backward steps ...
step 0: total=0.7390 lm_ce=0.7385 sdpo=0.0000 dpo=0.0114 finite=True
step 1: total=0.2090 lm_ce=0.2086 sdpo=0.0000 dpo=0.0084 finite=True
step 2: total=0.0501 lm_ce=0.0496 sdpo=0.0000 dpo=0.0093 finite=True
step 3: total=0.0094 lm_ce=0.0089 sdpo=0.0000 dpo=0.0094 finite=True
step 4: total=0.0031 lm_ce=0.0029 sdpo=0.0000 dpo=0.0044 finite=True
========================================================
Initial loss: 0.7390
Final loss: 0.0031
Reduction: 99.6%
Verdict: PASS
========================================================
```

## What this demonstrates

- `build_batch(tokenizer)` produces a real chat-template-formatted batch with all keys the 3-channel loss composer needs.
- `compose_loss(model, batch, alpha_sdpo, beta_replay)` returns `LossComponents` with per-channel breakdown.
- Backward pass through `components.total` flows into all three channels:
  - `lm_ce`: the GRPO stub (cross-entropy on response tokens, the limit GRPO converges to under deterministic rewards).
  - `sdpo_jsd`: hint-distillation between student logits and hint-conditioned-teacher logits.
  - `trace_replay_dpo`: DPO loss over (chosen, rejected) pairs from multi-teacher disagreement.

## What this does NOT demonstrate

- Real GRPO rollouts + reward calculation (use `ComposerReplicationTrainer` for that; it is a TRL `GRPOTrainer` subclass that wraps the same 3-channel loss).
- Real teacher calls (those go through `composer_replication.replay_trace` + OpenRouter; ~$0.98 per 50-step trace at last measurement).
- DiLoCo outer loop (separate; needs `torchft-nightly` and is a `make_diloco_outer_loop()` away once installed).

## Cost

- $0
- ~3-5 minutes wall-clock on CPU
- ~1 GB disk for Qwen2.5-0.5B weights (downloaded once into `~/.cache/huggingface`)
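
## Sketch: combining the three channels

The per-channel breakdown described under "What this demonstrates" can be illustrated with a small, framework-free sketch. It assumes that the total is a weighted sum of the three channels, with `alpha_sdpo` and `beta_replay` as the weights; the quickstart does not state the exact formula, so treat that as an assumption. `LossComponentsSketch`, `compose_sketch`, and `loss_reduction_pct` are hypothetical names used only here, not part of the framework's API, and the weight values are purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossComponentsSketch:
    """Hypothetical stand-in for the framework's `LossComponents`."""

    lm_ce: float
    sdpo_jsd: float
    trace_replay_dpo: float
    total: float


def compose_sketch(lm_ce, sdpo_jsd, trace_replay_dpo, alpha_sdpo, beta_replay):
    # ASSUMPTION: the total is a plain weighted sum of the channels.
    # The real compose_loss may combine them differently.
    total = lm_ce + alpha_sdpo * sdpo_jsd + beta_replay * trace_replay_dpo
    return LossComponentsSketch(lm_ce, sdpo_jsd, trace_replay_dpo, total)


def loss_reduction_pct(initial, final):
    # Relative drop from the initial to the final total loss, in percent.
    return 100.0 * (initial - final) / initial


# Channel values taken from step 0 of the expected output above;
# the weights (0.1 and 0.05) are illustrative, not the script's defaults.
step0 = compose_sketch(0.7385, 0.0, 0.0114, alpha_sdpo=0.1, beta_replay=0.05)
print(f"total={step0.total:.4f}")

# Initial and final totals from the summary block of the expected output.
print(f"Reduction: {loss_reduction_pct(0.7390, 0.0031):.1f}%")  # Reduction: 99.6%
```

The second print reproduces the `Reduction: 99.6%` line from the summary, which is just the relative drop from 0.7390 to 0.0031. In the real script the channel values are tensors, so calling `backward()` on the total propagates gradients into every channel that contributed to it.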