	# Quickstart: Qwen2.5-0.5B-Instruct on CPU

	Run the Composer Replication Framework's 3-channel loss composition end-to-end
	on a small open model in under 5 minutes on CPU.

	## Setup

	```bash
	cd /path/to/composer-replication-framework
	pip install -e .
	```

	(`-e` performs an editable install, so local code changes take effect without reinstalling.)
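Before running the example, it can be useful to confirm that the install is visible to the interpreter. The following is a minimal sketch, assuming the importable module is named `composer_replication` (the name used elsewhere in this README); `is_importable` is a hypothetical helper, not part of the framework.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumed module name; adjust if the distribution exposes a different one.
    name = "composer_replication"
    if is_importable(name):
        print(f"{name}: found")
    else:
        print(f"{name}: NOT found (try re-running `pip install -e .`)")
```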

	## Run

	```bash
	python examples/qwen_05b_quickstart/run.py
	```

	## Expected output

	```
	[quickstart] loading Qwen/Qwen2.5-0.5B-Instruct (CPU, fp32) ...
	[quickstart] loaded — 0.494B params
	[quickstart] building real chat-template batch ...
	[quickstart] running 5 backward steps ...
	step 0: total=0.7390 lm_ce=0.7385 sdpo=0.0000 dpo=0.0114 finite=True
	step 1: total=0.2090 lm_ce=0.2086 sdpo=0.0000 dpo=0.0084 finite=True
	step 2: total=0.0501 lm_ce=0.0496 sdpo=0.0000 dpo=0.0093 finite=True
	step 3: total=0.0094 lm_ce=0.0089 sdpo=0.0000 dpo=0.0094 finite=True
	step 4: total=0.0031 lm_ce=0.0029 sdpo=0.0000 dpo=0.0044 finite=True

	========================================================
	Initial loss: 0.7390
	Final loss: 0.0031
	Reduction: 99.6%
	Verdict: PASS
	========================================================
	```
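The summary block can be derived from the per-step totals. The sketch below assumes the reduction is the relative drop from the first to the last `total`, which matches the 99.6% figure above; the 50% pass threshold and the `summarize` helper are placeholders for illustration, not necessarily what `run.py` uses.

```python
import math


def summarize(totals: list[float], min_reduction_pct: float = 50.0) -> dict:
    """Summarize a loss curve: relative reduction, finiteness, and a verdict."""
    initial, final = totals[0], totals[-1]
    reduction_pct = (initial - final) / initial * 100.0
    all_finite = all(math.isfinite(t) for t in totals)
    # Hypothetical pass rule: every loss is finite and the drop is large enough.
    passed = all_finite and reduction_pct >= min_reduction_pct
    return {
        "initial": initial,
        "final": final,
        "reduction_pct": reduction_pct,
        "verdict": "PASS" if passed else "FAIL",
    }


# Per-step totals copied from the expected output above.
totals = [0.7390, 0.2090, 0.0501, 0.0094, 0.0031]
summary = summarize(totals)
print(f"Initial loss: {summary['initial']:.4f}")
print(f"Final loss: {summary['final']:.4f}")
print(f"Reduction: {summary['reduction_pct']:.1f}%")
print(f"Verdict: {summary['verdict']}")
```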

	## What this demonstrates

	- `build_batch(tokenizer)` produces a real chat-template-formatted batch
	  with all keys the 3-channel loss composer needs.
	- `compose_loss(model, batch, alpha_sdpo, beta_replay)` returns
	  `LossComponents` with a per-channel breakdown.
	- The backward pass through `components.total` flows into all three channels:
	  - `lm_ce`: the GRPO stub (cross-entropy on response tokens, the limit
	    GRPO converges to under deterministic rewards).
	  - `sdpo_jsd`: hint-distillation between student logits and
	    hint-conditioned-teacher logits.
	  - `trace_replay_dpo`: DPO loss over (chosen, rejected) pairs from
	    multi-teacher disagreement.
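To make the channel structure concrete, here is a scalar-only sketch. It assumes the total is a weighted sum, `lm_ce + alpha_sdpo * sdpo_jsd + beta_replay * trace_replay_dpo`; the real `compose_loss` operates on a model and a batch and may combine the channels differently. `LossComponentsSketch`, `compose_from_scalars`, and `dpo_pair_loss` are hypothetical names. The DPO term uses the standard pairwise form, `-log(sigmoid(dpo_beta * margin))`, where the margin is the policy's chosen-vs-rejected log-probability gap minus the reference model's gap.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LossComponentsSketch:
    """Hypothetical stand-in for the per-channel breakdown."""
    lm_ce: float
    sdpo_jsd: float
    trace_replay_dpo: float
    total: float


def dpo_pair_loss(policy_chosen: float, policy_rejected: float,
                  ref_chosen: float, ref_rejected: float,
                  dpo_beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    margin = (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)
    x = dpo_beta * margin
    # Numerically stable -log(sigmoid(x)).
    return math.log1p(math.exp(-x)) if x >= 0 else -x + math.log1p(math.exp(x))


def compose_from_scalars(lm_ce: float, sdpo_jsd: float, trace_replay_dpo: float,
                         alpha_sdpo: float, beta_replay: float) -> LossComponentsSketch:
    """Assumed composition rule: a weighted sum of the three channels."""
    total = lm_ce + alpha_sdpo * sdpo_jsd + beta_replay * trace_replay_dpo
    return LossComponentsSketch(lm_ce, sdpo_jsd, trace_replay_dpo, total)


# Illustrative values only; they are not taken from a real run.
dpo = dpo_pair_loss(policy_chosen=-10.0, policy_rejected=-12.0,
                    ref_chosen=-11.0, ref_rejected=-11.5)
components = compose_from_scalars(lm_ce=0.5, sdpo_jsd=0.2, trace_replay_dpo=dpo,
                                  alpha_sdpo=0.5, beta_replay=0.1)
print(components)
```

Because the total is a plain weighted sum under this assumption, setting `alpha_sdpo` or `beta_replay` to zero removes that channel's contribution to the total and therefore to the gradient.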

	## What this does NOT demonstrate

	- Real GRPO rollouts and reward calculation. Use `ComposerReplicationTrainer`
	  for that: it is a TRL `GRPOTrainer` subclass that wraps the same 3-channel
	  loss.
	- Real teacher calls. Those go through `composer_replication.replay_trace`
	  and OpenRouter (~$0.98 per 50-step trace at last measurement).
	- The DiLoCo outer loop. It is separate, needs `torchft-nightly`, and is one
	  `make_diloco_outer_loop()` call away once that is installed.

	## Cost

	- $0 in API spend (no teacher calls are made)
	- ~3-5 minutes wall-clock on CPU
	- ~1 GB disk for Qwen2.5-0.5B weights (downloaded once into `~/.cache/huggingface`)
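The disk figure and the fp32 load can be related with simple arithmetic. This sketch assumes the downloaded checkpoint stores weights in a 16-bit format (consistent with the ~1 GB figure above) and that they are held in fp32 once loaded; it counts weights only and ignores gradients, optimizer state, and activations.

```python
# Parameter count as reported by the quickstart log ("0.494B params").
NUM_PARAMS = 0.494e9

BYTES_PER_PARAM = {"fp32": 4, "bf16": 2}


def weights_gb(num_params: float, dtype: str) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


print(f"16-bit checkpoint on disk: ~{weights_gb(NUM_PARAMS, 'bf16'):.2f} GB")
print(f"fp32 weights in memory:    ~{weights_gb(NUM_PARAMS, 'fp32'):.2f} GB")
```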