# Quickstart: Qwen2.5-0.5B-Instruct on CPU

Run the Composer Replication Framework's 3-channel loss composition end-to-end
on a small open model in under 5 minutes on CPU.

## Setup

```bash
cd /path/to/composer-replication-framework
pip install -e .
```

The `-e` flag performs an editable install, so local code changes are picked up without reinstalling.

## Run

```bash
python examples/qwen_05b_quickstart/run.py
```

## Expected output

```
[quickstart] loading Qwen/Qwen2.5-0.5B-Instruct (CPU, fp32) ...
[quickstart] loaded — 0.494B params
[quickstart] building real chat-template batch ...
[quickstart] running 5 backward steps ...
  step 0: total=0.7390  lm_ce=0.7385  sdpo=0.0000  dpo=0.0114  finite=True
  step 1: total=0.2090  lm_ce=0.2086  sdpo=0.0000  dpo=0.0084  finite=True
  step 2: total=0.0501  lm_ce=0.0496  sdpo=0.0000  dpo=0.0093  finite=True
  step 3: total=0.0094  lm_ce=0.0089  sdpo=0.0000  dpo=0.0094  finite=True
  step 4: total=0.0031  lm_ce=0.0029  sdpo=0.0000  dpo=0.0044  finite=True

========================================================
  Initial loss: 0.7390
  Final loss:   0.0031
  Reduction:    99.6%
  Verdict:      PASS
========================================================
```
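
The `Reduction` figure in the summary can be checked by hand from the initial and final totals. The snippet below is a minimal sketch of that arithmetic. `loss_reduction_pct` is a hypothetical helper written for this note, not a function from the framework, and the threshold behind the `PASS` verdict is not shown in the log, so it is left out.

```python
def loss_reduction_pct(initial: float, final: float) -> float:
    """Relative drop from the initial to the final loss, in percent."""
    if initial <= 0:
        raise ValueError("initial loss must be positive")
    return 100.0 * (initial - final) / initial


# Totals copied from the log above (step 0 and step 4).
reduction = loss_reduction_pct(0.7390, 0.0031)
print(f"Reduction: {reduction:.1f}%")  # prints "Reduction: 99.6%"
```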

## What this demonstrates

- `build_batch(tokenizer)` produces a real chat-template-formatted batch
  with all the keys the 3-channel loss composer needs.
- `compose_loss(model, batch, alpha_sdpo, beta_replay)` returns a
  `LossComponents` object with a per-channel breakdown.
- The backward pass through `components.total` flows into all three channels:
  - `lm_ce`: the GRPO stub (cross-entropy on response tokens, the limit
    GRPO converges to under deterministic rewards).
  - `sdpo_jsd`: hint-distillation between student logits and
    hint-conditioned-teacher logits.
  - `trace_replay_dpo`: DPO loss over (chosen, rejected) pairs from
    multi-teacher disagreement.
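
As a rough mental model of the three channels listed above, the total can be pictured as a weighted sum of the per-channel terms. The sketch below is an assumption, not the framework's code: the weighted-sum formula, the `SketchComponents` class, and the `compose_sketch`, `jsd`, and `dpo_loss` helpers are hypothetical names for illustration, the weights are arbitrary, and plain floats stand in for tensors. The JSD and DPO formulas are the standard textbook definitions; the framework's exact variants may differ.

```python
import math
from dataclasses import dataclass


@dataclass
class SketchComponents:
    """Hypothetical stand-in for a per-channel loss breakdown."""
    lm_ce: float
    sdpo_jsd: float
    trace_replay_dpo: float
    total: float


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability mass contribute nothing to the sum.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def dpo_loss(policy_margin, ref_margin, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy_margin - ref_margin)).

    Each margin is log p(chosen) - log p(rejected) under that model.
    """
    z = beta * (policy_margin - ref_margin)
    return math.log1p(math.exp(-z))  # identical to -log(sigmoid(z))


def compose_sketch(lm_ce, sdpo_jsd, trace_replay_dpo, alpha_sdpo, beta_replay):
    """Assumed weighted-sum composition; the real weighting may differ."""
    total = lm_ce + alpha_sdpo * sdpo_jsd + beta_replay * trace_replay_dpo
    return SketchComponents(lm_ce, sdpo_jsd, trace_replay_dpo, total)


# Toy inputs: identical student and teacher distributions give a JSD of zero.
student = [0.7, 0.2, 0.1]
teacher = [0.7, 0.2, 0.1]
parts = compose_sketch(
    lm_ce=0.7385,
    sdpo_jsd=jsd(student, teacher),
    trace_replay_dpo=dpo_loss(policy_margin=0.0, ref_margin=0.0),
    alpha_sdpo=0.1,    # illustrative weight, not a framework default
    beta_replay=0.05,  # illustrative weight, not a framework default
)
print(parts)
```

Keeping the per-channel values next to the total, as the dataclass does here, makes it easy to see which channel dominates a given step.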

## What this does NOT demonstrate

- Real GRPO rollouts + reward calculation (use `ComposerReplicationTrainer`
  for that — a TRL `GRPOTrainer` subclass that wraps the same 3-channel
  loss).
- Real teacher calls (those go through `composer_replication.replay_trace`
  + OpenRouter; ~$0.98 per 50-step trace at last measurement).
- The DiLoCo outer loop (separate; it needs `torchft-nightly`, and once that
  is installed it is a single `make_diloco_outer_loop()` call away).

## Cost

- $0
- ~3-5 minutes wall-clock on CPU
- ~1 GB disk for Qwen2.5-0.5B weights (downloaded once into `~/.cache/huggingface`)
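
The disk figure lines up with the parameter count printed at load time. Here is a back-of-the-envelope check, assuming the downloaded checkpoint stores 2 bytes per parameter (a 16-bit format; this is an assumption about the checkpoint, not something stated above) and that fp32 loading uses 4 bytes per parameter. `weights_gb` is a hypothetical helper.

```python
def weights_gb(n_params: float, bytes_per_param: int) -> float:
    """Approximate size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 0.494e9  # from the "loaded" line in the expected output

disk_gb = weights_gb(N_PARAMS, 2)  # assumed 16-bit checkpoint on disk
ram_gb = weights_gb(N_PARAMS, 4)   # fp32 weights in memory, as in the log
print(f"disk ~{disk_gb:.2f} GB, fp32 weights in RAM ~{ram_gb:.2f} GB")
```

Gradients and optimizer state during the backward steps add to the RAM figure, so treat it as a lower bound.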