# Examples Index

Five CPU-runnable examples demonstrate the framework end-to-end on
real HF causal LMs. They progress from the simplest to the most
methodologically complete:

| # | Example | Trace source | Channels | Wall-clock | Closes |
|---|---|---|---|---|---|
| 1 | [`qwen_05b_quickstart/`](qwen_05b_quickstart/) | minimal toy | LM-CE only | ~30s | "does the package import + run at all" |
| 2 | [`gsm8k_grpo/`](gsm8k_grpo/) | hand-written GSM8K (100 rows) | GRPO with `alpha=beta=0` | ~60s | Plain-GRPO baseline reference |
| 3 | [`gsm8k_grpo_with_sdpo/`](gsm8k_grpo_with_sdpo/) | hand-written GSM8K (B=2) | GRPO + SDPO column | ~25s | SDPO column wiring on synthetic prompts |
| 4 | [`sdpo_with_real_traces/`](sdpo_with_real_traces/) | `ClaudeCodeIngester` reading a hand-authored session JSONL | GRPO + SDPO column | ~30s | **Partial V5** — ingestion path validated; wiring smoke (misaligned) |
| **5** | **[`sdpo_with_real_traces_production/`](sdpo_with_real_traces_production/)** | **`ClaudeCodeIngester` → adapter → `ComposerDataCollator`** (with-error fixture) | **GRPO + SDPO (production-aligned)** | **~2min** | **V5 closure** — full production pipeline with error-site detection + properly-aligned SDPO mask |

**Recommended walk-through order**: 1 → 2 → 3 → 4 → 5. Each example
widens the scope of the one before it.
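
The walk-through order can be scripted. The sketch below is not part of the package: `run_in_order` is a hypothetical helper, and it assumes each example directory holds a `run.py` that needs no command-line arguments. It runs the examples in the recommended order and stops at the first one that exits non-zero.

```python
import subprocess
import sys
from pathlib import Path

# Directory names taken from the table above, in the recommended order.
EXAMPLES = [
    "qwen_05b_quickstart",
    "gsm8k_grpo",
    "gsm8k_grpo_with_sdpo",
    "sdpo_with_real_traces",
    "sdpo_with_real_traces_production",
]


def run_in_order(root, names=EXAMPLES):
    """Run each example's run.py in order (hypothetical helper).

    Returns the name of the first example that exits non-zero,
    or None if every example passes.
    """
    for name in names:
        script = Path(root) / name / "run.py"
        # Assumption: run.py takes no required arguments.
        result = subprocess.run([sys.executable, str(script)])
        if result.returncode != 0:
            return name
    return None
```

Calling `run_in_order(".")` from the examples directory returns `None` when all five pass, or the directory name of the first failing example.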

## Why five?

- **#1** verifies the package is installable and the loss composition
  works at all (no SDPO, no DPO — pure LM-CE on a toy model).
- **#2** uses the production `ComposerReplicationTrainer` (TRL `GRPOTrainer`
  subclass) on a real GSM8K dataset with a regex-extract reward. This
  is the recipe a new user copy-pastes to start.
- **#3** drops the TRL trainer wrapper and calls `compose_loss` directly
  on hand-crafted hint contexts. The simplest place to see "alpha_sdpo=0.5
  changes the loss" with all the wiring visible.
- **#4** uses real ingested Claude Code session JSONL (via
  `ClaudeCodeIngester`) but builds the SDPO batch by hand. It shows
  that the ingester works, but the SDPO mask covers misaligned
  content, so treat it as a wiring smoke test, not production-grade.
- **#5** is the production-grade sibling to #4: adds the
  `claude_states_to_trace_examples` adapter and uses
  `ComposerDataCollator` to build properly-aligned SDPO batches with
  hint injection at actual error sites. **This is what you should copy
  for real training.**
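
To make the "alpha_sdpo=0.5 changes the loss" point from #3 concrete, here is a toy sketch in plain Python. It is not the package's `compose_loss`, whose signature is not shown here. It assumes that `sdpo_jsd` denotes a Jensen-Shannon divergence and that the channels combine as a weighted sum. Both are assumptions made for illustration, and `jsd` and `compose` are hypothetical names.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 wherever x_i > 0.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def compose(base_loss, sdpo_jsd, alpha_sdpo):
    # ASSUMPTION: a weighted sum stands in for the real composition rule.
    return base_loss + alpha_sdpo * sdpo_jsd


student = [0.7, 0.2, 0.1]   # made-up next-token distribution
teacher = [0.5, 0.3, 0.2]   # made-up hint-conditioned distribution
divergence = jsd(student, teacher)

baseline = compose(2.0, divergence, alpha_sdpo=0.0)   # channel off
with_sdpo = compose(2.0, divergence, alpha_sdpo=0.5)  # channel on
```

With the weight at zero the composed value equals the base loss. Any positive weight on a non-zero divergence moves it, which is the effect example #3 makes visible.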

## What every example asserts

Each `run.py` ends with a verification block that asserts:

- The targeted channel(s) actually fired (for example, `sdpo_jsd > 0` when `alpha_sdpo > 0`)
- The composed loss isn't trivially equal to `lm_ce` alone
- Gradient norms are finite and non-zero at every step

If any assertion fails, the script prints which channel didn't fire
and exits non-zero. This makes each example a smoke test for the
user's installation, not just a demo.
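
A verification block of this kind might look like the sketch below. It is not copied from any `run.py`: the `metrics` dictionary, its keys (`loss`, `lm_ce`, `sdpo_jsd`, `grad_norms`), and the helper names are assumptions chosen to mirror the three checks listed above.

```python
import math
import sys


def verify(metrics, alpha_sdpo):
    """Return a list of failure messages; an empty list means all checks passed."""
    failures = []
    # Channel fired: the SDPO term must be positive whenever it is enabled.
    # `not x > 0` also catches NaN.
    if alpha_sdpo > 0 and not metrics["sdpo_jsd"] > 0:
        failures.append("sdpo channel did not fire (sdpo_jsd <= 0)")
    # The composed loss must differ from LM-CE alone.
    if math.isclose(metrics["loss"], metrics["lm_ce"], rel_tol=1e-9, abs_tol=1e-12):
        failures.append("composed loss equals lm_ce alone")
    # Gradient norms must be finite and non-zero at every step.
    for step, g in enumerate(metrics["grad_norms"]):
        if not math.isfinite(g) or g == 0.0:
            failures.append(f"bad gradient norm at step {step}: {g}")
    return failures


def report_and_exit(failures):
    """Print each failure to stderr and exit non-zero if there were any."""
    for msg in failures:
        print(f"FAIL: {msg}", file=sys.stderr)
    sys.exit(1 if failures else 0)
```

Collecting every failure before exiting, instead of stopping at the first `assert`, lets the script report all channels that failed to fire in one run.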

## Production training

For real training (GPU, larger models, longer rollouts), use
`ComposerReplicationTrainer` directly with a `ComposerDataCollator`
that emits SDPO + DPO columns — exactly the path example #5
demonstrates. See `docs/INTEGRATION_RECIPES.md` for the production
wiring patterns.