# v0.0 Spike — Composer Replication Framework
> Decomposed from the framework synthesis (`framework/composer-replication-framework.md`).
> Goal of v0.0: **prove the trace-replay multi-teacher distillation channel adds signal on top of plain GRPO**, on the smallest viable model.
> If the spike validates, we move to v0.1 (full Composer recipe). If it invalidates, the framework still has value (Composer recipe alone) but the novel claim is dead and we reorient.
## Risk-ordered decomposition
| # | Spike | Validates (Given / When / Then) | Why this risk first | Status |
|---|-------|----------------------------------|---------------------|--------|
| **001** | `001-teacher-replay-cost` | **Given** a frozen 100-step agentic-coding trace and a state at step `t`, **when** N=3 frozen teachers (Opus 4.7 / GPT-5 / DeepSeek V4 Pro) are queried via OpenRouter for next-action distributions, **then** total per-trace teacher cost is < $5 and wallclock per step is < 30 s. | If teachers cost $50+/trace or take 5 min/step, the channel is unviable regardless of whether it improves training. **Kill-switch first.** | 🟢 **VALIDATED** (2026-05-25): $0.98/trace, p95 latency 20.5 s, 0 errors |
| **005** | `005-integrated-trainer-skeleton` | **Given** the SDPO loss math (lifted from `siyan-zhao/OPSD`) and the teacher-disagreement DPO-pair extractor, **when** we wire them into a `GRPOTrainer` subclass with α/β channel weights, **then** unit tests cover loss differentiability + correctness, and ablating any channel via α=0/β=0 reduces to GRPO. | Proves the integration architecture compiles before paying GPU costs. Cheap (no GPU, no API). | 🟢 **SKELETON-VALIDATED + COMPOSITION-VERIFIED**: 38/38 unit tests pass; 5-step gradient run on tiny model decreases loss with all 3 channels active |
| **002a** | `002a-trace-collection-trl` | **Given** Qwen3-7B base + TRL `GRPOTrainer` + a SWE-bench-lite OpenEnv, **when** we run 100 rollouts, **then** all rollouts emit complete `(state_t, action_t, reward_t)` tuples to JSONL with no truncation or schema drift. | Without a clean trace stream, no signal to replay. Validates TRL+OpenEnv plumbing. | 📋 planned |
| **002b** | `002b-trace-collection-prime-rl` | Same as 002a but with PRIME-RL substrate. | Comparison: which framework's trace export is cleaner? | 📋 planned |
| **003** | `003-dpo-pairs-from-disagreement` | **Given** N=3 teacher action distributions per trace step and the student's own action, **when** we extract preference pairs by "majority of teachers > student" + "student > minority", **then** the resulting DPO dataset has ≥ 5 pairs/trace and a non-trivial KL distance from random pairs. | The reward shape needs to actually carry signal, not just exist. Spike 005 already verified the *extraction logic*; spike 003 measures *signal density on real traces*. | 📋 planned |
| **004** | `004-ab-train-grpo-vs-trace-replay-dpo` | **Given** the trace dataset from 002a/002b, **when** we train two Qwen3-7B variants, (A) a plain GRPO baseline and (B) GRPO + trace-replay-DPO, and evaluate both on SWE-bench-lite, **then** variant (B) outperforms (A) by ≥ 2 pt pass@1 with statistical significance. | The terminal experiment that validates or invalidates the v0.0 claim. | 📋 planned |
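
The ablation property that spike 005 tests (α=0/β=0 reduces to GRPO) can be illustrated with a minimal sketch. It assumes the three channel losses are combined as a plain weighted sum, that α scales the SDPO term, and that β scales the trace-replay DPO term; the table does not state which weight belongs to which channel, so `ChannelWeights` and `compose_loss` are hypothetical names rather than the trainer's actual API. Scalars stand in for loss tensors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelWeights:
    """Hypothetical container for the two auxiliary-channel weights."""

    alpha: float  # assumed to scale the SDPO (hint-distillation) term
    beta: float   # assumed to scale the trace-replay DPO term


def compose_loss(grpo: float, sdpo: float, dpo: float, w: ChannelWeights) -> float:
    """Combine the three channel losses as a weighted sum (assumed form)."""
    return grpo + w.alpha * sdpo + w.beta * dpo


if __name__ == "__main__":
    # With both weights at zero, only the GRPO term remains.
    print(compose_loss(1.0, 2.0, 3.0, ChannelWeights(alpha=0.0, beta=0.0)))
    # With both channels active, each auxiliary term contributes its weighted share.
    print(compose_loss(1.0, 2.0, 3.0, ChannelWeights(alpha=0.5, beta=0.1)))
```

Under this form, a unit test for the ablation claim only needs to check that the composed loss equals the GRPO loss whenever both weights are zero.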
## Spike order rationale
1. **001 (teacher cost) first** — single most likely thing to kill the framework. Cheap to run (~$5–20), takes ~1 hour, no GPU.
2. **002a / 002b in parallel** — independent feasibility checks for the two competing trace-collection substrates. ~half a day each. Compare verdicts head-to-head.
3. **003 reward-shape check** — once we have *any* trace + teacher data, validate the DPO-pair extraction works as a reward signal before paying for the full A/B training run.
4. **004 the actual experiment** — only run after 001/002/003 all green. Costs the GPU budget; should not be wasted on a framework that already failed an earlier feasibility gate.
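
As an illustration of the pair extraction that step 3 checks, here is a minimal sketch of one possible reading of the "majority of teachers > student" and "student > minority" rules. It simplifies each teacher's action distribution to a single top action string, and it assumes the student only beats a minority teacher when the student sides with the majority. `PreferencePair` and `extract_pairs` are hypothetical names, not the extractor from spike 005.

```python
from collections import Counter
from typing import NamedTuple


class PreferencePair(NamedTuple):
    chosen: str
    rejected: str


def extract_pairs(teacher_actions: list[str], student_action: str) -> list[PreferencePair]:
    """Derive DPO preference pairs from teacher disagreement at one trace step."""
    if not teacher_actions:
        return []
    counts = Counter(teacher_actions)
    majority_action, votes = counts.most_common(1)[0]
    if votes * 2 <= len(teacher_actions):
        return []  # no strict majority among teachers, so no pair is emitted
    if student_action != majority_action:
        # Rule 1: the teachers' majority action is preferred over the student's.
        return [PreferencePair(majority_action, student_action)]
    # Rule 2: the student agrees with the majority, so its action is
    # preferred over each distinct minority teacher action.
    return [
        PreferencePair(student_action, action)
        for action in counts
        if action != majority_action
    ]


if __name__ == "__main__":
    print(extract_pairs(["edit", "edit", "run"], student_action="run"))
    print(extract_pairs(["edit", "edit", "run"], student_action="edit"))
```

With N=3 teachers, a step yields no pair when all three teachers disagree or when the student and every teacher pick the same action, so pairs/trace depends directly on how often teachers split.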
## Out of scope for v0.0 (deferred to v0.1)
- **Composer hint-distill = SDPO/OPSD** (per-turn KL from a hint-conditioned forward pass). Cursor's secret sauce. **Code is published** at [github.com/siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD); paper [arXiv:2601.20802](https://arxiv.org/abs/2601.20802). Lift the loss for v0.1 — see `docs/COMPOSER_RECIPE_MAPPING.md` § "Implementation handles for v0.1" for the concrete plan.
- The Feature Deletion environment (use SWE-bench-lite as the env in v0.0)
- DiLoCo / decentralized training (single-node FSDP2 is fine at 7B)
- Monarch / Forge (use Ray + verifiers, the PRIME-RL stack)
- MoE base (use dense Qwen3-7B; saner v0.0 target)
- VOI gating, tiered teachers (do the full N=3 query at every step in v0.0; cost mitigation is a v0.1 optimization)
**Why deferring SDPO/hint-distill to v0.1 is the right call:**
1. The novel claim is trace-replay (channel 3). The Composer recipe is already published; SDPO is already published with code. Validating SDPO at 7B is engineering, not novel research.
2. The hint-generator open question (templates vs. LLM-driven hints) is unresolved. v0.0 with hardcoded tool-call templates only validates the easy case.
3. Spike 001's economic verdict gates only the trace-replay channel. SDPO has no per-step API cost.
4. A 4-arm A/B at 7B costs ~$600 vs. ~$300 for the 2-arm. Not worth it for v0.0.
v0.1 will run a 3-arm A/B: **RLVR** vs. **RLVR + SDPO** vs. **RLVR + SDPO + trace-replay-DPO** at 32B once we know v0.0's trace-replay verdict.
## Budget
| Item | Estimate | Source |
|---|---|---|
| Teacher API calls (OpenRouter) | ~$50–150 | 100 traces × ~50 step replays × 3 teachers × ~$0.005/call |
| GPU compute (Qwen3-7B fine-tune × 2 variants) | ~$60–120 | Modal A100-80GB, ~8 hr each variant |
| Dev wallclock | ~5–7 days | Single operator |
| **Total** | **~$110–270 (roughly $200) + dev time** | Sum of the two cost rows above; cheapest viable falsification of the novel claim |
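
The teacher-API row can be checked with a few lines of arithmetic. The sketch below uses only the figures from the table and the measured $0.98/trace from spike 001; the variable and function names are illustrative. Note that the spike 001 figure was measured on a 100-step trace, while the estimate assumes ~50 step replays per trace, so the two numbers are not strictly like-for-like.

```python
# Inputs copied from the budget table; names are illustrative.
N_TRACES = 100
STEP_REPLAYS_PER_TRACE = 50
N_TEACHERS = 3
USD_PER_CALL = 0.005

# Measured per-trace teacher cost reported by spike 001.
MEASURED_USD_PER_TRACE = 0.98

# Budgeted range for teacher API calls.
BUDGET_RANGE_USD = (50.0, 150.0)


def estimated_teacher_cost() -> float:
    """Bottom-up estimate: traces x step replays x teachers x price per call."""
    return N_TRACES * STEP_REPLAYS_PER_TRACE * N_TEACHERS * USD_PER_CALL


def measured_teacher_cost() -> float:
    """Extrapolation from the per-trace cost measured in spike 001."""
    return N_TRACES * MEASURED_USD_PER_TRACE


def within_budget(cost: float) -> bool:
    low, high = BUDGET_RANGE_USD
    return low <= cost <= high


if __name__ == "__main__":
    for label, cost in [
        ("estimate", estimated_teacher_cost()),
        ("spike 001 extrapolation", measured_teacher_cost()),
    ]:
        print(f"{label}: ${cost:.2f}, within budget: {within_budget(cost)}")
```

Both routes land inside the budgeted range: about $75 from the bottom-up estimate and about $98 from the spike 001 extrapolation.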
## Success criteria for v0.0
- 001: $/trace + s/step verdict in `001-teacher-replay-cost/README.md`
- 002a, 002b: clean JSONL + verdict on which substrate to use for v0.1
- 003: DPO-pair stats verdict
- 004: A/B pass@1 with confidence interval, reported as plain text and as a chart
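
For the "clean JSONL" criterion of 002a/002b, a check along the following lines could flag truncated lines and schema drift. This is a sketch under stated assumptions: each line is taken to be one JSON object whose keys are exactly `state_t`, `action_t`, and `reward_t`. Those key names mirror the tuple notation used in the spike table and are not confirmed field names, and `find_schema_violations` is a hypothetical helper.

```python
import json
from typing import Iterable

# Assumed field names, taken from the (state_t, action_t, reward_t) notation.
REQUIRED_KEYS = frozenset({"state_t", "action_t", "reward_t"})


def find_schema_violations(lines: Iterable[str]) -> list[tuple[int, str]]:
    """Return (line_number, problem) for every malformed trace record."""
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            problems.append((lineno, "not valid JSON (possibly truncated)"))
            continue
        if not isinstance(record, dict):
            problems.append((lineno, "record is not a JSON object"))
            continue
        keys = set(record)
        if keys != REQUIRED_KEYS:
            missing = sorted(REQUIRED_KEYS - keys)
            extra = sorted(keys - REQUIRED_KEYS)
            problems.append((lineno, f"missing={missing} extra={extra}"))
    return problems


if __name__ == "__main__":
    sample = [
        '{"state_t": "s0", "action_t": "edit", "reward_t": 0.0}',
        '{"state_t": "s1", "action_t": "run"}',
        '{"state_t": "s2", "action_t": "ru',
    ]
    for lineno, problem in find_schema_violations(sample):
        print(lineno, problem)
```

An empty result over all 100 rollouts would correspond to the "no truncation or schema drift" condition.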
If 004 is **VALIDATED** → publish the result, write v0.1 plan.
If **PARTIAL** (e.g., only some teacher mixes work) → narrow the claim, re-spike with the working subset.
If **INVALIDATED** → close the trace-replay channel as a research direction; v0.1 framework still ships with Composer-only recipe.
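
One way to produce the pass@1 difference with a confidence interval is a paired bootstrap over evaluation tasks, sketched below. The choice of a paired bootstrap, the 95% level, and the function name `paired_bootstrap_diff` are assumptions; the plan only requires a confidence interval and statistical significance, not a specific method.

```python
import random


def paired_bootstrap_diff(
    passed_a: list[bool],
    passed_b: list[bool],
    n_resamples: int = 2000,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (point estimate, CI low, CI high) of B minus A pass@1, in points.

    passed_a[i] and passed_b[i] are the pass/fail outcomes of variants A and B
    on the same task i, so resampling task indices keeps the pairing intact.
    """
    if len(passed_a) != len(passed_b) or not passed_a:
        raise ValueError("need two non-empty result lists of equal length")
    n = len(passed_a)
    deltas = [int(b) - int(a) for a, b in zip(passed_a, passed_b)]
    point = 100.0 * sum(deltas) / n
    rng = random.Random(seed)
    samples = sorted(
        100.0 * sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples)
    )
    low = samples[int(0.025 * n_resamples)]
    high = samples[int(0.975 * n_resamples) - 1]
    return point, low, high


if __name__ == "__main__":
    # Toy outcomes for illustration only; not real evaluation results.
    a = [True, False, False, True, False, False, True, False]
    b = [True, True, False, True, False, True, True, False]
    point, low, high = paired_bootstrap_diff(a, b)
    print(f"B - A = {point:.1f} pt, 95% CI [{low:.1f}, {high:.1f}]")
```

Under this sketch, a reasonable reading of the 004 criterion is a point estimate of at least 2 pt with a lower CI bound above zero, though the exact decision rule is left to the spike's own README.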
## Citations
All five primary research notes (`research/01..05*.md`) cite the source papers and code repos that informed each design choice. The sources most relevant at spike time:
- Cursor (2026): Composer 2.5 blog post — recipe shape and the targeted-RL hint-distillation idea
- Microsoft (2024): rStar / rStar-Math — closest precedent to trace-replay (single-teacher MCTS)
- Hugging Face (2025): TRL `GRPOTrainer` + OpenEnv integration — algorithm reference
- Prime Intellect (2026): PRIME-RL + INTELLECT-2 — production decentralized substrate