	# Examples Index

	These five CPU-runnable examples demonstrate the framework end-to-end
	on real HF causal LMs. They progress from the simplest to the most
	methodologically complete:

	| # | Example | Trace source | Channels | Wall-clock | Closes |
	|---|---|---|---|---|---|
	| 1 | [`qwen_05b_quickstart/`](qwen_05b_quickstart/) | minimal toy | LM-CE only | ~30s | "does the package import + run at all" |
	| 2 | [`gsm8k_grpo/`](gsm8k_grpo/) | hand-written GSM8K (100 rows) | GRPO with `alpha=beta=0` | ~60s | Plain-GRPO baseline reference |
	| 3 | [`gsm8k_grpo_with_sdpo/`](gsm8k_grpo_with_sdpo/) | hand-written GSM8K (B=2) | GRPO + SDPO column | ~25s | SDPO column wiring on synthetic prompts |
	| 4 | [`sdpo_with_real_traces/`](sdpo_with_real_traces/) | `ClaudeCodeIngester` reading a hand-authored session JSONL | GRPO + SDPO column | ~30s | Partial V5: ingestion path validated; wiring smoke (misaligned) |
	| 5 | [`sdpo_with_real_traces_production/`](sdpo_with_real_traces_production/) | `ClaudeCodeIngester` → adapter → `ComposerDataCollator` (with-error fixture) | GRPO + SDPO (production-aligned) | ~2min | V5 closure: full production pipeline with error-site detection + properly-aligned SDPO mask |

	Recommended walk-through order: 1 → 2 → 3 → 4 → 5. Each example
	extends the scope of the one before it.
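
To walk through the examples in the recommended order, a small driver along these lines may help. It is a minimal sketch, not part of the package: it assumes it is run from the examples directory and that each directory's `run.py` exits non-zero on failure. The `run_in_order` helper and its `runner` parameter are hypothetical names introduced for illustration.

```python
import subprocess
import sys
from pathlib import Path

# Directory names come from the table above, in walk-through order.
EXAMPLES = [
    "qwen_05b_quickstart",
    "gsm8k_grpo",
    "gsm8k_grpo_with_sdpo",
    "sdpo_with_real_traces",
    "sdpo_with_real_traces_production",
]


def run_in_order(root=".", examples=EXAMPLES, runner=subprocess.run):
    """Run each example's run.py in sequence and stop at the first failure.

    Returns the name of the failing example, or None if every script
    exited with status 0. `runner` is injectable so the loop can be
    exercised without launching real subprocesses.
    """
    for name in examples:
        script = Path(root) / name / "run.py"
        result = runner([sys.executable, str(script)])
        if result.returncode != 0:
            return name
    return None
```

Calling `run_in_order()` returns `None` when all five scripts pass; otherwise it returns the first directory whose script failed, which is the natural place to start debugging because later examples build on earlier ones.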

	## Why five?

	- #1 verifies the package is installable and the loss composition
	works at all (no SDPO, no DPO — pure LM-CE on a toy model).
	- #2 uses the production `ComposerReplicationTrainer` (a TRL `GRPOTrainer`
	subclass) on a 100-row hand-written GSM8K set with a regex-extract
	reward. This is the recipe a new user copy-pastes to start.
	- #3 drops the TRL trainer wrapper and calls `compose_loss` directly
	on hand-crafted hint contexts. It is the simplest place to see that
	`alpha_sdpo=0.5` changes the loss, with all the wiring visible.
	- #4 uses a real ingested Claude Code session JSONL (via
	`ClaudeCodeIngester`) but builds the SDPO batch by hand. It
	demonstrates that the ingester works, but the SDPO mask covers
	misaligned content: a wiring smoke test, not production-grade.
	- #5 is the production-grade sibling to #4: adds the
	`claude_states_to_trace_examples` adapter and uses
	`ComposerDataCollator` to build properly-aligned SDPO batches with
	hint injection at actual error sites. **This is what you should copy
	for real training.**
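
The claim in #3 that a non-zero `alpha_sdpo` changes the loss can be illustrated with a toy calculation. The sketch below is not the package's `compose_loss`, whose signature is not shown here. It assumes the SDPO channel adds an alpha-weighted Jensen-Shannon term (as the `sdpo_jsd` metric name suggests) between the model's distribution and the same model's distribution when conditioned on a hint. `toy_compose_loss` and the example distributions are hypothetical.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target token under `probs`."""
    return -math.log(probs[target])


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: symmetric, and zero only when p == q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def toy_compose_loss(student, teacher, target, alpha_sdpo):
    """Hypothetical stand-in: LM-CE plus an alpha-weighted JSD term."""
    lm_ce = cross_entropy(student, target)
    sdpo_jsd = jsd(student, teacher)
    return {
        "lm_ce": lm_ce,
        "sdpo_jsd": sdpo_jsd,
        "loss": lm_ce + alpha_sdpo * sdpo_jsd,
    }


student = [0.7, 0.2, 0.1]  # next-token distribution without the hint
teacher = [0.4, 0.5, 0.1]  # assumed: same model, conditioned on a hint

off = toy_compose_loss(student, teacher, target=1, alpha_sdpo=0.0)
on = toy_compose_loss(student, teacher, target=1, alpha_sdpo=0.5)
print(off["loss"] == off["lm_ce"])  # True: channel disabled
print(on["loss"] > on["lm_ce"])     # True: the JSD term contributes
```

With `alpha_sdpo=0.0` the composed loss collapses to `lm_ce`; with `alpha_sdpo=0.5` the divergence term is added on top, which is the behaviour the examples check for.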

	## What every example asserts

	Each `run.py` ends with a verification block that asserts:

	- The targeted channel(s) actually fired (`sdpo_jsd > 0` when `alpha_sdpo > 0`)
	- The composed loss isn't trivially equal to `lm_ce` alone
	- Gradient norms are finite and non-zero at every step

	If any assertion fails, the script prints which channel didn't fire
	and exits non-zero. Treat this as your smoke test, not just a demo.
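
A verification block of this kind could look roughly like the following. This is a sketch under assumptions, not the code in any `run.py`: the per-step `history` dicts and their keys (`loss`, `lm_ce`, `sdpo_jsd`, `grad_norm`) are hypothetical, as are the `verify` and `exit_code` helpers, and the channel checks are only applied when `alpha_sdpo > 0`.

```python
import math
import sys


def verify(history, alpha_sdpo):
    """Return failure messages for the per-step metrics in `history`.

    `history` is a hypothetical list of dicts with the keys
    "loss", "lm_ce", "sdpo_jsd" and "grad_norm".
    """
    failures = []
    for step, m in enumerate(history):
        if alpha_sdpo > 0:
            # Channel fired: `not x > 0` also catches NaN.
            if not m["sdpo_jsd"] > 0:
                failures.append(f"step {step}: sdpo channel did not fire")
            # Composed loss must differ from plain LM-CE.
            if math.isclose(m["loss"], m["lm_ce"], rel_tol=1e-9):
                failures.append(f"step {step}: loss equals lm_ce alone")
        # Gradient norm must be finite and non-zero.
        g = m["grad_norm"]
        if not (math.isfinite(g) and g > 0):
            failures.append(f"step {step}: bad grad norm {g!r}")
    return failures


def exit_code(history, alpha_sdpo):
    """Print each failure to stderr and return a process exit status."""
    failures = verify(history, alpha_sdpo)
    for msg in failures:
        print(msg, file=sys.stderr)
    return 1 if failures else 0
```

A script would end with something like `sys.exit(exit_code(history, alpha_sdpo))`, so that a CI job or shell loop can detect a channel that silently failed to fire.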

	## Production training

	For real training (GPU, larger models, longer rollouts), use
	`ComposerReplicationTrainer` directly with a `ComposerDataCollator`
	that emits SDPO + DPO columns — exactly the path example #5
	demonstrates. See `docs/INTEGRATION_RECIPES.md` for the production
	wiring patterns.