	# v0.0 Spike — Composer Replication Framework

	> Decomposed from the framework synthesis (`framework/composer-replication-framework.md`).
	> Goal of v0.0: prove the trace-replay multi-teacher distillation channel adds signal on top of plain GRPO, on the smallest viable model.
	> If the spike validates the claim, we move to v0.1 (full Composer recipe). If it invalidates the claim, the framework still has value (the Composer recipe alone), but the novel claim is dead and we reorient.

	## Risk-ordered decomposition

	| # | Spike | Validates (Given / When / Then) | Why this risk first | Status |
	|---|-------|----------------------------------|---------------------|--------|
	| 001 | `001-teacher-replay-cost` | Given a frozen 100-step agentic-coding trace and a state at step `t`, when N=3 frozen teachers (Opus 4.7 / GPT-5 / DeepSeek V4 Pro) are queried via OpenRouter for next-action distributions, then total per-trace teacher cost is < $5 and wallclock per step is < 30 s. | If teachers cost $50+/trace or take 5 min/step, the channel is unviable regardless of whether it improves training. Kill-switch first. | 🟢 VALIDATED (2026-05-25): $0.98/trace, p95 lat 20.5s, 0 errors |
	| 005 | `005-integrated-trainer-skeleton` | Given the SDPO loss math (lifted from `siyan-zhao/OPSD`) and the teacher-disagreement DPO-pair extractor, when we wire them into a `GRPOTrainer` subclass with α/β channel weights, then unit tests cover loss differentiability + correctness, and ablating any channel via α=0/β=0 reduces to GRPO. | Proves the integration architecture compiles before paying GPU costs. Cheap (no GPU, no API). | 🟢 SKELETON-VALIDATED + COMPOSITION-VERIFIED: 38/38 unit tests pass; 5-step gradient run on tiny model decreases loss with all 3 channels active |
	| 002a | `002a-trace-collection-trl` | Given Qwen3-7B base + TRL `GRPOTrainer` + a SWE-bench-lite OpenEnv, when we run 100 rollouts, then all rollouts emit complete `(state_t, action_t, reward_t)` tuples to JSONL with no truncation or schema drift. | Without a clean trace stream, no signal to replay. Validates TRL+OpenEnv plumbing. | 📋 planned |
	| 002b | `002b-trace-collection-prime-rl` | Same as 002a but with PRIME-RL substrate. | Comparison: which framework's trace export is cleaner? | 📋 planned |
	| 003 | `003-dpo-pairs-from-disagreement` | Given N=3 teacher action distributions per trace step and the student's own action, when we extract preference pairs by "majority of teachers > student" + "student > minority", then the resulting DPO dataset has ≥ 5 pairs/trace and a non-trivial KL distance from random pairs. | The reward shape needs to actually carry signal, not just exist. Spike 005 already verified the extraction logic; spike 003 measures signal density on real traces. | 📋 planned |
	| 004 | `004-ab-train-grpo-vs-trace-replay-dpo` | Given the trace dataset from 002, when we train two Qwen3-7B variants, (A) a plain GRPO baseline and (B) GRPO + trace-replay-DPO, and evaluate on SWE-bench-lite, then variant (B) outperforms (A) by ≥ 2 pt pass@1 with statistical significance. | The terminal experiment that validates or invalidates the v0.0 claim. | 📋 planned |
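
The extraction rule quoted for spike 003 ("majority of teachers > student" + "student > minority") can be sketched as below. This is a minimal illustration under stated assumptions, not the extractor that spike 005 tested: each teacher's next-action distribution is reduced to a single top action string, a strict majority is required, and the names `PreferencePair` and `extract_pairs` are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class PreferencePair(NamedTuple):
    chosen: str
    rejected: str
    rule: str  # which of the two rules produced this pair


def extract_pairs(teacher_actions: list[str], student_action: str) -> list[PreferencePair]:
    """Turn one trace step into zero or more DPO preference pairs."""
    if not teacher_actions:
        return []
    counts = Counter(teacher_actions)
    top_action, top_votes = counts.most_common(1)[0]
    # Assumption: a strict majority is required, so a three-way split yields nothing.
    if top_votes * 2 <= len(teacher_actions):
        return []

    if student_action != top_action:
        # Rule 1: the teachers' majority action is preferred over the student's.
        return [PreferencePair(top_action, student_action, "majority>student")]

    # Rule 2: the student sided with the majority, so its action is preferred
    # over each distinct action proposed by a dissenting teacher.
    return [
        PreferencePair(student_action, minority, "student>minority")
        for minority in counts
        if minority != top_action
    ]


print(extract_pairs(["edit", "edit", "run_tests"], "grep"))
print(extract_pairs(["edit", "edit", "run_tests"], "edit"))
print(extract_pairs(["edit", "grep", "run_tests"], "edit"))
```

The first call yields one "majority>student" pair, the second one "student>minority" pair, and the third an empty list because the teachers split three ways. Unanimous teachers that agree with the student also yield nothing, which is one reason the ≥ 5 pairs/trace threshold has to be measured on real traces.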

	## Spike order rationale

	1. 001 (teacher cost) first — single most likely thing to kill the framework. Cheap to run (~$5–20), takes ~1 hour, no GPU.
	2. 002a / 002b in parallel — independent feasibility checks for the two competing trace-collection substrates. ~half a day each. Compare verdicts head-to-head.
	3. 003 reward-shape check — once we have any trace + teacher data, validate the DPO-pair extraction works as a reward signal before paying for the full A/B training run.
	4. 004 the actual experiment — only run after 001/002/003 all green. Costs the GPU budget; should not be wasted on a framework that already failed an earlier feasibility gate.
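
The gating described in this ordering can be written down as two small checks. The sketch below is illustrative only: the thresholds ($5 per trace, 30 s per step) come from the spike 001 row, while the use of p95 as the latency statistic, the collapsing of 002a/002b into a single "002" verdict, and the function names are assumptions.

```python
def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def teacher_cost_gate(
    step_costs_usd: list[float],
    step_latencies_s: list[float],
    max_cost_usd: float = 5.0,
    max_latency_s: float = 30.0,
) -> bool:
    """Spike 001 kill-switch: per-trace cost and per-step latency must both pass."""
    return sum(step_costs_usd) < max_cost_usd and p95(step_latencies_s) < max_latency_s


def ready_for_ab(verdicts: dict[str, bool]) -> bool:
    """Spike 004 runs only when 001, 002 and 003 are all green."""
    return all(verdicts.get(spike, False) for spike in ("001", "002", "003"))


# Hypothetical per-step numbers shaped like the reported 001 verdict.
print(teacher_cost_gate([0.0098] * 100, [20.5] * 100))
print(ready_for_ab({"001": True, "002": False, "003": False}))
```

A missing verdict counts as not green, so 004 cannot start by accident while an earlier spike is still planned.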

	## Out of scope for v0.0 (deferred to v0.1)

	- Composer hint-distill = SDPO/OPSD (per-turn KL from a hint-conditioned forward pass). Cursor's secret sauce. Code is published at [github.com/siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD); paper [arXiv:2601.20802](https://arxiv.org/abs/2601.20802). Lift the loss for v0.1 — see `docs/COMPOSER_RECIPE_MAPPING.md` § "Implementation handles for v0.1" for the concrete plan.
	- The Feature Deletion environment (use SWE-bench-lite as the env in v0.0)
	- DiLoCo / decentralized training (single-node FSDP2 is fine at 7B)
	- Monarch / Forge (use Ray + verifiers, the PRIME-RL stack)
	- MoE base (use dense Qwen3-7B; saner v0.0 target)
	- VOI gating, tiered teachers (do the full N=3 query at every step in v0.0; cost mitigation is a v0.1 optimization)

	Why deferring SDPO/hint-distill to v0.1 is the right call:

	1. The novel claim is trace-replay (channel 3). The Composer recipe is already published; SDPO is already published with code. Validating SDPO at 7B is engineering, not novel research.
	2. The hint-generator open question (templates vs. LLM-driven hints) is unresolved. Running SDPO in v0.0 with hardcoded tool-call templates would validate only the easy case.
	3. Spike 001's economic verdict gates only the trace-replay channel. SDPO has no per-step API cost.
	4. A 4-arm A/B at 7B costs ~$600 vs. ~$300 for the 2-arm. Not worth it for v0.0.

	v0.1 will run a 3-arm A/B: RLVR vs. RLVR + SDPO vs. RLVR + SDPO + trace-replay-DPO at 32B once we know v0.0's trace-replay verdict.
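
One way to picture these arms is as settings of the α/β channel weights from spike 005, where zeroing a weight removes its channel. The sketch below assumes a plain weighted sum, assumes α weights the SDPO channel and β the trace-replay-DPO channel, and uses placeholder weight values; none of this is confirmed by the trainer code, and `ChannelWeights`, `composed_loss` and `V01_ARMS` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelWeights:
    alpha: float  # assumed: weight on the SDPO hint-distillation loss
    beta: float   # assumed: weight on the trace-replay DPO loss


def composed_loss(grpo_loss, sdpo_loss, dpo_loss, weights: ChannelWeights):
    """Assumed composition: base RL loss plus weighted auxiliary channels.

    Uses only + and *, so it works on floats here and on tensors in a trainer.
    """
    return grpo_loss + weights.alpha * sdpo_loss + weights.beta * dpo_loss


# Placeholder weights; real values would be tuned, not copied from here.
V01_ARMS = {
    "RLVR": ChannelWeights(alpha=0.0, beta=0.0),
    "RLVR + SDPO": ChannelWeights(alpha=0.1, beta=0.0),
    "RLVR + SDPO + trace-replay-DPO": ChannelWeights(alpha=0.1, beta=0.1),
}

for name, weights in V01_ARMS.items():
    print(name, composed_loss(1.0, 2.0, 3.0, weights))
```

With both weights at zero the composed loss equals the base loss exactly, which is the ablation property the spike 005 unit tests are described as checking.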

	## Budget

	| Item | Estimate | Source |
	|---|---|---|
	| Teacher API calls (OpenRouter) | ~$50–150 | 100 traces × ~50 step replays × 3 teachers × ~$0.005/call |
	| GPU compute (Qwen3-7B fine-tune × 2 variants) | ~$60–120 | Modal A100-80GB, ~8 hr each variant |
	| Dev wallclock | ~5–7 days | Single operator |
	| Total | ~$200 (line items sum to ~$110–270) + dev time | Cheapest viable falsification of the novel claim |
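
The arithmetic behind the teacher API row and the total can be checked in a few lines. This is only a sanity check on the table's own numbers; the cross-check against spike 001's measured $0.98/trace is an added assumption, since that figure was measured on a 100-step trace rather than ~50 step replays.

```python
def teacher_api_cost(traces: int, replays_per_trace: int, teachers: int, usd_per_call: float) -> float:
    """Total teacher spend as the product of the four factors in the Source column."""
    return traces * replays_per_trace * teachers * usd_per_call


# Point estimate from the table's own factors.
api_point_usd = teacher_api_cost(100, 50, 3, 0.005)

# Assumed cross-check: 100 traces at the per-trace cost measured in spike 001.
api_measured_usd = 100 * 0.98

# Sum of the two dollar-denominated rows, low end and high end.
total_low_usd = 50 + 60
total_high_usd = 150 + 120

print(f"API point estimate: ${api_point_usd:.2f}")
print(f"API at measured rate: ${api_measured_usd:.2f}")
print(f"Total range: ${total_low_usd}-${total_high_usd}")
```

The point estimate comes to $75 and the measured-rate figure to $98, both inside the ~$50–150 row, and the line items sum to $110–270.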

	## Success criteria for v0.0

	- 001: $/trace + s/step verdict in `001-teacher-replay-cost/README.md`
	- 002a, 002b: clean JSONL + verdict on which substrate to use for v0.1
	- 003: DPO-pair stats verdict
	- 004: A/B pass@1 with a confidence interval, reported as plain text and as a chart
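
For the "clean JSONL" criterion on 002a/002b, a check along the following lines could flag truncation and schema drift. It is a sketch under assumptions: the field names are taken from the `(state_t, action_t, reward_t)` tuple in the spike table, one JSON object per line is assumed, and requiring exactly those three keys is a deliberately strict choice that a real trace schema may need to relax.

```python
import json

REQUIRED_KEYS = {"state_t", "action_t", "reward_t"}


def validate_trace_lines(lines: list[str]) -> list[tuple[int, str]]:
    """Return (line_number, problem) for every record that fails a check."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # A record cut off mid-write typically fails to parse.
            problems.append((number, "not valid JSON (possibly truncated)"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
        elif set(record) != REQUIRED_KEYS:
            problems.append((number, f"schema drift: keys {sorted(record)}"))
        elif isinstance(record["reward_t"], bool) or not isinstance(record["reward_t"], (int, float)):
            problems.append((number, "reward_t is not numeric"))
    return problems


sample = [
    '{"state_t": "repo@abc", "action_t": "edit foo.py", "reward_t": 0.0}',
    '{"state_t": "repo@abc", "action_t": "run_tests"}',
    '{"state_t": "repo@abc", "action_t": "edit',
]
for number, problem in validate_trace_lines(sample):
    print(number, problem)
```

Running the same validator over the exports from both substrates gives a like-for-like problem count to feed into the 002a vs. 002b verdict.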

	- If 004 is VALIDATED → publish the result, write v0.1 plan.
	- If PARTIAL (e.g., only some teacher mixes work) → narrow the claim, re-spike with the working subset.
	- If INVALIDATED → close the trace-replay channel as a research direction; v0.1 framework still ships with Composer-only recipe.
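
The confidence interval asked for in criterion 004 could be computed with a paired bootstrap over evaluation tasks, as sketched below. The choice of a percentile bootstrap, the resample count, and the reading of "≥ 2 pt with statistical significance" as "point estimate ≥ 2 and interval lower bound above 0" are all assumptions; the function names are hypothetical.

```python
import random


def paired_bootstrap_ci(
    passed_a: list[bool],
    passed_b: list[bool],
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (point, lower, upper) for pass@1(B) - pass@1(A), in percentage points."""
    if len(passed_a) != len(passed_b) or not passed_a:
        raise ValueError("need two non-empty result lists over the same tasks")
    n = len(passed_a)
    # Per-task difference keeps the pairing: +1 if only B passed, -1 if only A passed.
    diffs = [int(b) - int(a) for a, b in zip(passed_a, passed_b)]
    point = 100.0 * sum(diffs) / n

    rng = random.Random(seed)
    resampled = sorted(
        100.0 * sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = resampled[int(tail * n_resamples)]
    upper = resampled[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return point, lower, upper


def is_validated(point: float, lower: float, min_gain_pts: float = 2.0) -> bool:
    """Assumed reading of the 004 criterion: large enough gain and interval excludes 0."""
    return point >= min_gain_pts and lower > 0.0


# Hypothetical results on 10 tasks: B fixes two tasks that A fails.
a = [True, False, False, True, False, False, True, False, False, False]
b = [True, True, False, True, False, True, True, False, False, False]
point, lower, upper = paired_bootstrap_ci(a, b)
print(f"gain = {point:.1f} pt, 95% CI [{lower:.1f}, {upper:.1f}]")
```

Resampling tasks rather than individual runs keeps both variants evaluated on the same task draw in every resample, which matters because task difficulty dominates the variance on a small benchmark.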

	## Citations

	All five primary research notes (`research/01..05*.md`) cite the source papers and code repos that informed each design choice. The sources most relevant at spike time:

	- Cursor (2026): Composer 2.5 blog post — recipe shape and the targeted-RL hint-distillation idea
	- Microsoft (2024): rStar / rStar-Math — closest precedent to trace-replay (single-teacher MCTS)
	- Hugging Face (2025): TRL `GRPOTrainer` + OpenEnv integration — algorithm reference
	- Prime Intellect (2026): PRIME-RL + INTELLECT-2 — production decentralized substrate