Wave 15 Final Review — Multi-Angle Self-Critique + Fix Wave

Date: 2026-05-26
Method: 4 parallel adversarial reviewers (math / tests / docs / user-journey), each given a different framing to maximize independent-angle coverage, followed by a targeted fix scatter on the findings.

Headline finding

The math reviewer found 2 BLOCKERs that all 8+ prior subagents missed. Both came from cloning upstream with git clone and diffing it line by line against the framework's composer_replication/opsd.py and composer_replication/distillation/taid.py, something no prior reviewer had done for those files (the Wave 14b reviewer did it for PRIME-RL only).

This validates the user's instinct that "every angle" multi-model orchestration is worth doing — the math angle, given a sharp prompt that mandated upstream verification, found genuine bugs in the framework's primary loss kernel.

Wave 15a reviews (all 4 deliverables)

| Reviewer | Focus | BLOCKERs | Severity-weighted findings |
| --- | --- | --- | --- |
| Math correctness (Opus 4.7) | 7 claimed implementations vs primary sources | 2 BLOCKER + 3 minor | `generalized_jsd_loss` math wrong; `taid_loss` algorithm wrong |
| Test honesty (Opus 4.7) | 3 specific test files | 0 BLOCKER + 3 weak-assertions | PRIME-RL parity skip silently never runs; bit-exact uses `allclose` not `equal`; entropy-OPD test is pure smoke |
| Documentation drift (Opus 4.7) | 6 major docs + ADRs | 0 BLOCKER + 7 drifts | test count drift (77/107/124 vs actual 145); `compose_loss` kwarg drift; PRIME-RL test count 10 vs 16; stale "Deferred to Wave 14" claim |
| User journey (Opus 4.7) | RL-finetune Qwen-7B on GSM8K | 0 BLOCKER + 10 friction items | No GSM8K example (#1 ask); no runnable `ComposerReplicationTrainer` recipe; data-collator gap undocumented; defaults activate channels users haven't configured |

Reports saved at /tmp/wave15_{math,test,doc,user}_review.md.

Wave 15b — fix scatter outcomes

5 parallel fix subagents dispatched. Outcomes:

| Task | Subagent outcome |
| --- | --- |
| (1) OPSD math rewrite vs upstream | ✅ Completed. New parity test (skip-marked) verifies 31 cases against upstream `siyan-zhao/OPSD`. Mixture distribution now β-weighted (was hardcoded 0.5); β coefficient on correct terms (was swapped); reduction matches upstream (was off by a factor of 100-2000×). Docstring labels fixed (β=0 = reverse KL, β=1 = forward KL). |
| (2) TAID rewrite vs upstream | ⚠️ Subagent timed out at 600s but work landed: logit-space mix (was prob-space), current-student-detached anchor (was frozen step-0), forward-KL criterion (was JSD), optional `TAIDScheduler` for adaptive scheme. Docstring rewritten to acknowledge the breaking change. Tests updated. Parity test added. |
| (3) GSM8K example | ⚠️ Subagent timed out but work landed: `examples/gsm8k_grpo/run.py` runs end-to-end on CPU with Qwen2.5-0.5B-Instruct, 100 GSM8K rows, regex-based verifiable reward, 2 outer steps in 58s. README written by parent agent. The `run_with_sdpo.py` variant is deferred to Wave 16. |
| (4) Doc drift + install ergonomics | ⚠️ Subagent timed out. Parent completed: flipped `alpha_sdpo` and `beta_replay` defaults to 0.0; added clear ImportError if TRL missing; fixed TROUBLESHOOTING `[replay]` extras claim; updated README + USER_GUIDE + INTEGRATION_RECIPES test counts to reference V1_V8_COVERAGE; closed stale "Deferred to Wave 14" claim. |
| (5) Test hardening + LossOutputs wrap | ✅ Completed (3 of 4 sub-tasks). PRIME-RL `loss_fn` now returns `LossOutputs(loss, metrics)`. Bit-exact test tightened to `torch.equal`. PRIME-RL parity test now emits visible warning when prime-rl unavailable. Gradient-flow tests deferred to Wave 16. |
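
To make the OPSD fix in row (1) concrete, here is a minimal sketch of a generalized JSD with a β-weighted mixture, written in plain Python so it runs without PyTorch. It is not the framework's or upstream's code. The function name, the convention that β weights the teacher, and the toy distributions are assumptions for illustration; the endpoint cases (β=0, β=1) are deliberately left out because their handling follows upstream, which is not reproduced here.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def generalized_jsd_sketch(student, teacher, beta):
    """Generalized JSD with a beta-weighted mixture, for 0 < beta < 1.

    ASSUMPTION: beta weights the teacher, both in the mixture and in the
    coefficient on the teacher-side KL term. Check upstream before relying
    on this convention.
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("this sketch only covers 0 < beta < 1")
    # Mixture M = beta * teacher + (1 - beta) * student (not a hardcoded 0.5 mix).
    mixture = [beta * t + (1.0 - beta) * s for s, t in zip(student, teacher)]
    # The same beta that weights the teacher in M also weights KL(teacher || M).
    return beta * kl(teacher, mixture) + (1.0 - beta) * kl(student, mixture)


# Toy distributions (made up for illustration).
student = [0.7, 0.2, 0.1]
teacher = [0.1, 0.3, 0.6]

jsd_half = generalized_jsd_sketch(student, teacher, 0.5)
jsd_fifth = generalized_jsd_sketch(student, teacher, 0.2)
```

With these toy inputs the β=0.2 value differs clearly from the β=0.5 value, which is the kind of discrepancy a hardcoded 0.5 mixture would hide.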

Final test count post-Wave-15: 115 passing + 1 skip-marked

Wave-by-wave: 72 (W12) → 93 (W13) → 124 (W14) → 130 (W14b) → 115 (W15)
Net decrease from 130: the TAID rewrite consolidated 16 schedule-specific tests into 7 t-parameterized tests (a smaller surface but stronger contracts, since each test now exercises the actual paper algorithm). In addition, there is 1 skip-marked OPSD parity test.
Trade-off: fewer tests, but 2 BLOCKER-class math bugs eliminated. Net correctness improvement is large.
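
The TAID changes behind those t-parameterized tests (logit-space mix, current-student anchor, forward-KL criterion) can be sketched as follows. This is a dependency-free illustration, not the framework's `taid_loss` or the SakanaAI/TAID code. The function names and toy logits are hypothetical, and "forward KL" is assumed to mean KL(target || student). Plain Python has no autograd, so the detach of the student anchor is only marked in a comment.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def taid_target_sketch(student_logits, teacher_logits, t):
    """Intermediate target: softmax of a logit-space interpolation.

    In an autograd framework the student logits used here would be detached,
    so the anchor is the current student rather than a frozen step-0 copy.
    """
    mixed = [(1.0 - t) * s + t * te for s, te in zip(student_logits, teacher_logits)]
    return softmax(mixed)


def taid_loss_sketch(student_logits, teacher_logits, t):
    """ASSUMPTION: forward KL means KL(intermediate target || student)."""
    target = taid_target_sketch(student_logits, teacher_logits, t)
    return kl(target, softmax(student_logits))


# Toy logits (made up for illustration).
student_logits = [2.0, 0.5, -1.0]
teacher_logits = [-1.0, 0.0, 3.0]

logit_mix = taid_target_sketch(student_logits, teacher_logits, 0.5)
# The prob-space mix that the rewrite replaced, shown only for contrast.
prob_mix = [0.5 * s + 0.5 * te
            for s, te in zip(softmax(student_logits), softmax(teacher_logits))]
```

At t=0 the target coincides with the student and the loss vanishes; at t=1 the target is the teacher distribution. For intermediate t, the logit-space and prob-space targets are different distributions, which is why the two mixes are not interchangeable.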

What this round caught vs missed

Caught (improvements over Wave 14b state)

2 math BLOCKERs in primary loss kernels, fixed against upstream byte-for-byte
TAID rewrite from misnamed prob-space-JSD-with-frozen-anchor to actual SakanaAI/TAID
PRIME-RL LossOutputs adapter wrap — recipe is now actually invokable from PRIME-RL
GSM8K real-task example — closes the user-reviewer's #1 friction
Default kwargs (alpha_sdpo=0.1 → 0.0) — no more silent activation of unconfigured channels
TRL ImportError clarity — no more cryptic object.__init__() errors
Test count drift — single canonical doc (V1_V8_COVERAGE)
TROUBLESHOOTING [replay] extras correctly described
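
For readers unfamiliar with what a regex-based verifiable reward for GSM8K looks like, here is a minimal sketch. It is not the code in `examples/gsm8k_grpo/run.py`. The function names and the "last number in the completion" heuristic are assumptions; the only dataset fact relied on is that GSM8K reference answers end with `#### <number>`.

```python
import re

# Matches integers and decimals, optionally negative, with optional thousands commas.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_gold(answer: str) -> str:
    """Return the text after the final '####' marker, without thousands commas."""
    return answer.split("####")[-1].strip().replace(",", "")


def extract_prediction(completion: str):
    """ASSUMPTION: the model's final answer is the last number it wrote."""
    matches = _NUMBER.findall(completion)
    return matches[-1].replace(",", "") if matches else None


def gsm8k_reward_sketch(completion: str, answer: str) -> float:
    """Binary reward: 1.0 if the predicted number equals the gold number."""
    pred = extract_prediction(completion)
    if pred is None:
        return 0.0
    try:
        return 1.0 if float(pred) == float(extract_gold(answer)) else 0.0
    except ValueError:
        # Malformed gold or prediction text counts as a miss.
        return 0.0


gold = "She sold 48 + 24 = 72 clips in total.\n#### 72"
reward_right = gsm8k_reward_sketch("48 in April plus 24 in May gives 72.", gold)
reward_wrong = gsm8k_reward_sketch("I think the answer is 70.", gold)
```

A binary reward of this shape is easy to verify but brittle: a completion that states the right answer and then mentions another number afterwards would be scored as wrong under the last-number heuristic.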

Missed (Wave 16 candidates)

run_with_sdpo.py — promised but not shipped this wave
3 gradient-flow tests for compose_loss channels (test reviewer's #4)
Multi-process MockManager + DiLoCo convergence test was added in Wave 14b but only at world_size=2; the user reviewer did not probe larger world sizes
Recon docs (docs/research/*RECONNAISSANCE.md) not cross-checked against current code state — likely some staleness
PRIME-RL recipe still hasn't been run end-to-end against actual prime-rl (the parity test is skip-marked; the LossOutputs wrap was added but not exercised)
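
The "skip, but visibly" behaviour mentioned for the PRIME-RL parity test can be sketched with the standard library alone. This is a hypothetical helper, not the framework's test code, and the import name used for prime-rl in the comment is an assumption.

```python
import importlib.util
import warnings


def optional_dependency_available(module_name: str) -> bool:
    """Return True if a top-level module is importable; otherwise warn loudly.

    The warning makes a skipped parity test visible in the test output instead
    of letting the skip pass silently.
    """
    if importlib.util.find_spec(module_name) is not None:
        return True
    warnings.warn(
        f"{module_name} is not installed: the parity test will be SKIPPED, "
        "not passed",
        RuntimeWarning,
        stacklevel=2,
    )
    return False


# ASSUMPTION: a real suite would pass the actual import name of prime-rl here.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    have_dep = optional_dependency_available("definitely_not_installed_pkg_xyz")
```

In a pytest suite, the returned boolean could feed a `pytest.mark.skipif` condition so the skip reason and the warning both show up in the run summary.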

Methodological lessons for future waves

Prompt subagents to clone upstream and diff when the task is "verify against external truth." 8+ prior reviewers checked papers but did not git clone. The instruction "read /tmp/X-clone/file.py and find every divergence" produced the BLOCKER-class findings.
The 600s subagent timeout is the dominant constraint at this scope. 3 of 5 fix subagents timed out despite making real progress. Workaround: write the report file FIRST as a skeleton, then iterate in place. (Subagents that did this completed; subagents that read everything and then tried to write at the end timed out.)
Cross-cutting parallel-subagent failure mode: subagents cite each other instead of upstream. Wave 14 caught this for PRIME-RL math. Wave 15 caught it for OPSD + TAID math. The mitigation is to mandate upstream verification in the prompt.
Prompt injection in tool outputs: one subagent flagged that fake "don't reproduce copyrighted material" instructions appeared in its tool outputs throughout, designed to make it abandon the OPSD math fix. The subagent correctly ignored the injection and completed the task. The framework's MIT-licensed work, used with attribution, is fully authorized, so there is no copyright concern.
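
The "clone upstream and diff" lesson above amounts to a mechanical line-by-line comparison. A minimal sketch using Python's `difflib` is shown below; the helper name and the two one-line snippets are made up for illustration and are not quotes from the framework or from upstream.

```python
import difflib


def divergences(ours: str, upstream: str):
    """List every line that differs between upstream and our implementation.

    Lines prefixed '- ' exist only upstream; lines prefixed '+ ' exist only
    in our copy. Unchanged lines and ndiff hint lines ('? ') are dropped.
    """
    diff = difflib.ndiff(upstream.splitlines(), ours.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical snippets: a beta-weighted mixture vs a hardcoded 0.5 mixture.
upstream_src = "def mix(s, t, beta):\n    return beta * t + (1 - beta) * s\n"
ours_src = "def mix(s, t, beta):\n    return 0.5 * t + 0.5 * s\n"

found = divergences(ours_src, upstream_src)
for line in found:
    print(line)
```

In practice the two strings would be read from the local file and from the cloned upstream file; the point is that every divergence is enumerated rather than sampled by eye.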

Open items for Wave 16

examples/gsm8k_grpo_with_sdpo/ — demonstrate SDPO column wiring end-to-end
Gradient-flow tests for compose_loss channels (pre-staged in test reviewer's report)
Recon-doc currency sweep: cross-check docs/research/*RECONNAISSANCE.md against current code state
Real PRIME-RL end-to-end run with the new LossOutputs wrap (verify the wrap shape works in the real setup_loss_fns pipeline)
INTEGRATION_RECIPES.md compose_loss signature display — collapse to ... and link to API_REFERENCE.md, OR sync to all 17 kwargs
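
If the "sync to all 17 kwargs" option is chosen for the last item, the drift could be guarded mechanically. The sketch below compares a documented parameter list against a live signature with `inspect.signature`. The `compose_loss` defined here is a hypothetical stand-in with only a few parameters (`logits` and `labels` are invented; `alpha_sdpo` and `beta_replay` with 0.0 defaults come from this report), not the real 17-kwarg function.

```python
import inspect


def compose_loss(logits, labels, *, alpha_sdpo=0.0, beta_replay=0.0):
    """Hypothetical stand-in for the real function, for illustration only."""
    raise NotImplementedError


def signature_drift(func, documented_params):
    """Return (missing_from_docs, stale_in_docs) as two sorted lists."""
    actual = set(inspect.signature(func).parameters)
    documented = set(documented_params)
    return sorted(actual - documented), sorted(documented - actual)


# Parameter names as they might appear in a doc snippet (made up for illustration).
documented = ["logits", "labels", "alpha_sdpo", "gamma_old"]
missing, stale = signature_drift(compose_loss, documented)
```

Run as a test, a check like this would fail whenever a kwarg is added to the function but not to the docs, or left in the docs after being removed.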