Quickstart: Qwen2.5-0.5B-Instruct on CPU

Run the Composer Replication Framework's 3-channel loss composition end-to-end on a small open model in under 5 minutes on CPU.

Setup

cd /path/to/composer-replication-framework
pip install -e .

(-e installs in editable mode, so local code changes take effect without reinstalling.)
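
One way to confirm the install worked before running the example is to ask Python whether it can locate the package. This is a minimal sketch: it assumes the import name is composer_replication (as in composer_replication.replay_trace), and is_importable is a hypothetical helper introduced here, not part of the framework.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate a top-level module, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumed import name, taken from the package name used in this README.
    name = "composer_replication"
    status = "found" if is_importable(name) else "NOT found (re-run pip install -e .)"
    print(f"{name}: {status}")
```

If the package is reported as not found, the usual cause is running pip from a different Python environment than the one used to run the script.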

Run

python examples/qwen_05b_quickstart/run.py

Expected output

[quickstart] loading Qwen/Qwen2.5-0.5B-Instruct (CPU, fp32) ...
[quickstart] loaded — 0.494B params
[quickstart] building real chat-template batch ...
[quickstart] running 5 backward steps ...
  step 0: total=0.7390  lm_ce=0.7385  sdpo=0.0000  dpo=0.0114  finite=True
  step 1: total=0.2090  lm_ce=0.2086  sdpo=0.0000  dpo=0.0084  finite=True
  step 2: total=0.0501  lm_ce=0.0496  sdpo=0.0000  dpo=0.0093  finite=True
  step 3: total=0.0094  lm_ce=0.0089  sdpo=0.0000  dpo=0.0094  finite=True
  step 4: total=0.0031  lm_ce=0.0029  sdpo=0.0000  dpo=0.0044  finite=True

========================================================
  Initial loss: 0.7390
  Final loss:   0.0031
  Reduction:    99.6%
  Verdict:      PASS
========================================================
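
The summary figures can be re-derived from the per-step lines. The sketch below parses the log format shown above and recomputes the reduction as (initial - final) / initial. The regex and the helper names (parse_steps, reduction_pct) are assumptions made for illustration; the script's actual PASS criterion is not stated here, so this only checks the reported numbers.

```python
import re

# Per-step lines copied from the expected output above.
LOG = """\
  step 0: total=0.7390  lm_ce=0.7385  sdpo=0.0000  dpo=0.0114  finite=True
  step 1: total=0.2090  lm_ce=0.2086  sdpo=0.0000  dpo=0.0084  finite=True
  step 2: total=0.0501  lm_ce=0.0496  sdpo=0.0000  dpo=0.0093  finite=True
  step 3: total=0.0094  lm_ce=0.0089  sdpo=0.0000  dpo=0.0094  finite=True
  step 4: total=0.0031  lm_ce=0.0029  sdpo=0.0000  dpo=0.0044  finite=True
"""

# Hypothetical pattern matching the line layout shown in this README.
STEP_RE = re.compile(
    r"step (\d+): total=([\d.]+)\s+lm_ce=([\d.]+)\s+sdpo=([\d.]+)"
    r"\s+dpo=([\d.]+)\s+finite=(True|False)"
)


def parse_steps(log):
    """Turn each 'step N: ...' line into a dict of numeric fields."""
    rows = []
    for m in STEP_RE.finditer(log):
        step, total, lm_ce, sdpo, dpo, finite = m.groups()
        rows.append({
            "step": int(step),
            "total": float(total),
            "lm_ce": float(lm_ce),
            "sdpo": float(sdpo),
            "dpo": float(dpo),
            "finite": finite == "True",
        })
    return rows


def reduction_pct(initial, final):
    """Relative loss reduction, in percent."""
    return 100.0 * (initial - final) / initial


rows = parse_steps(LOG)
totals = [r["total"] for r in rows]
pct = reduction_pct(totals[0], totals[-1])
print(f"Initial loss: {totals[0]:.4f}")
print(f"Final loss:   {totals[-1]:.4f}")
print(f"Reduction:    {pct:.1f}%")  # prints "Reduction:    99.6%"
```

Exact loss values will vary between machines and library versions, so compare the overall trend (finite values, steadily decreasing total) rather than the individual digits.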

What this demonstrates

  • build_batch(tokenizer) produces a real chat-template-formatted batch with all keys the 3-channel loss composer needs.
  • compose_loss(model, batch, alpha_sdpo, beta_replay) returns LossComponents with per-channel breakdown.
  • Backward pass through components.total flows into all three channels:
    • lm_ce: the GRPO stub (cross-entropy on response tokens, the limit GRPO converges to under deterministic rewards).
    • sdpo_jsd: hint-distillation between student logits and hint-conditioned-teacher logits.
    • trace_replay_dpo: DPO loss over (chosen, rejected) pairs from multi-teacher disagreement.

What this does NOT demonstrate

  • Real GRPO rollouts + reward calculation (use ComposerReplicationTrainer for that — a TRL GRPOTrainer subclass that wraps the same 3-channel loss).
  • Real teacher calls (those go through composer_replication.replay_trace and OpenRouter; ~$0.98 per 50-step trace at last measurement).
  • DiLoCo outer loop (separate; needs torchft-nightly, and once that is installed it is one make_diloco_outer_loop() call away).

Cost

  • $0
  • ~3-5 minutes wall-clock on CPU
  • ~1 GB of disk for the Qwen2.5-0.5B weights (downloaded once into `~/.cache/huggingface`)
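
The disk figure follows from the parameter count in the log. As a rough worked estimate, assuming the published checkpoint stores weights at 2 bytes per parameter (an assumption, though consistent with the ~1 GB figure) and that the quickstart loads them in fp32 as the log states:

```python
# Parameter count reported by the quickstart log ("0.494B params").
PARAMS = 0.494e9

# Bytes per parameter for each storage format.
BYTES_PER_PARAM = {"bf16": 2, "fp32": 4}


def footprint_gb(n_params, dtype):
    """Size of the raw weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


disk_gb = footprint_gb(PARAMS, "bf16")  # assumed on-disk format
ram_gb = footprint_gb(PARAMS, "fp32")   # load format stated in the log

print(f"checkpoint on disk (16-bit, assumed): ~{disk_gb:.2f} GB")  # ~0.99 GB
print(f"weights in RAM (fp32):                ~{ram_gb:.2f} GB")   # ~1.98 GB
```

The RAM figure covers the weights only; gradients, optimizer state, and activations for the backward steps come on top of it.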