Wave 15: 4-angle multi-model self-critique caught 2 math BLOCKERs in primary loss kernels; fixed against upstream byte-for-byte + GSM8K example + ergonomics

e5add15 12 days ago

preview code

raw

history blame contribute delete

12.7 kB

Wave 14 Adversarial Cross-Model Review

Reviewer: Claude Opus 4.7 (sub-agent via delegate_task) Date: 2026-05-26 Method: Read every Wave 13 finding, every Wave 14 closure, all 4 doc files, cloned PRIME-RL upstream to verify T4 claims, ran 61 wave-relevant tests.

Top-line verdict

CONDITIONAL PASS with 1 BLOCKER + 4 SUGGESTIONs + 2 NITs. Wave 14 closes Wave 13 BLOCKER 2 (T1 — compose_loss kwargs) and Suggestion 3 (T2 — replaysim) cleanly. T3 (MockManager surface audit) is solid but only tests world_size=1. T4 (PRIME-RL "real GRPO + DPPO") does not match PRIME-RL's actual default_loss_fn despite claiming to mirror it; that error has been pasted into USER_GUIDE.md, INTEGRATION_RECIPES.md, and API_REFERENCE.md.

Same signal-to-noise as Wave 11 + Wave 13 reviewers: 1 genuine BLOCKER.

Finding 1 — BLOCKER: T4 PRIME-RL "DPPO importance-sampling-ratio clip" is neither importance sampling nor matches PRIME-RL.

Severity: BLOCKER Evidence: composer_replication/recipes/prime_rl/composer_loss.py:118-131. The implementation computes

grpo_loss = -(advantages * trainer_lp * keep_mask).sum() / keep_mask.sum()

That's pure REINFORCE-with-advantage — the masking gate is the only nod toward DPPO; there is no importance-sampling ratio multiplication anywhere.

Real PRIME-RL (/tmp/prime-rl-clone/src/prime_rl/trainer/rl/loss.py:128-153, the default_loss_fn on main as of 2026-05-26):

log_importance_ratio = trainer_logprobs - inference_logprobs
importance_ratio     = torch.exp(log_importance_ratio)
probs_diff           = torch.exp(trainer_logprobs) - torch.exp(inference_logprobs)
positive_advantages  = advantages > 0
dppo_invalid_mask_high = probs_diff > loss_config.dppo_mask_high
dppo_invalid_mask_low  = probs_diff < -loss_config.dppo_mask_low
dppo_invalid_mask = torch.where(positive_advantages, dppo_invalid_mask_high, dppo_invalid_mask_low)
keep_mask = loss_mask & ~dppo_invalid_mask
pg_loss = -(keep_mask * advantages * importance_ratio).sum()  # NO division
kl_loss = adv_tau * (log_importance_ratio**2 * keep_mask).sum()  # KL term

Three concrete divergences from Wave 14's implementation:

Mask gate is on probs_diff (a probability-space quantity), NOT log_ratio (a log-space quantity). These have different magnitudes: probs_diff=0.2 corresponds to log_ratio≈log(1.2)≈0.18 for a trainer prob of 1.0 vs inference prob of 0.8. With our log_ratio>4.0 gate, the mask never fires for normal training distributions; PRIME-RL's probs_diff>0.2 gate fires routinely.
PRIME-RL multiplies by importance_ratio = exp(log_ratio); Wave 14 multiplies by trainer_lp directly. This is the difference between actual policy-gradient correction (PRIME-RL) and naive REINFORCE.
PRIME-RL's mask is sign-conditioned on advantage (positive advantages clipped against dppo_mask_high, negative against -dppo_mask_low); Wave 14 ORs them together unconditionally.

Plus: the KL term is missing entirely.

Plus: the defaults claimed as "PRIME-RL's defaults" — dppo_mask_high=4.0, dppo_mask_low=-4.0 — are wrong. PRIME-RL's DefaultLossConfig (configs/trainer.py:412-424) sets dppo_mask_high=0.2, dppo_mask_low=0.2 with Field(..., ge=0) validation that would reject a negative value. PRIME-RL's code negates at use site: probs_diff < -loss_config.dppo_mask_low.

Plus: the docstring (composer_loss.py:32-49), USER_GUIDE.md:599-608, INTEGRATION_RECIPES.md:426-429 + 482-487, and API_REFERENCE.md:1364 all repeat the wrong formula and the wrong "matches PRIME-RL" claim.

Fix direction: Either (a) actually mirror default_loss_fn (mask on probs_diff, multiply by importance_ratio, add KL term, advantage- conditioned mask, .sum() reduction with token-count returned for caller-side scaling), or (b) drop the "matches PRIME-RL" framing and rename to "REINFORCE-with-advantage stub + log-ratio mask" everywhere.

Wave 13 Finding 6 is not actually closed by Wave 14.

Finding 2 — SUGGESTION: ADR-007 still says Wave 14 hasn't done the integration.

Severity: SUGGESTION Evidence: docs/adrs/ADR-007-self-distillation-losses.md:104-122 reads:

Wave 14+ work — compose_loss integration is NOT in this wave ... Wave 14 plan: add the four kwargs ...

But Wave 14 did add them (verified — loss.py:80-93). The ADR was written defensively after Wave 13 review and never updated when T1 landed.

Net effect: a user reading ADR-007 is told the SimPO/TAID kwargs don't work; a user reading USER_GUIDE/API_REFERENCE is told they do.

Fix direction: flip ADR-007 status section to "Closed in Wave 14 — see test_compose_loss_integration.py".

Finding 3 — SUGGESTION: ModalExecutor instantiation example in INTEGRATION_RECIPES is dead code.

Severity: SUGGESTION Evidence: docs/INTEGRATION_RECIPES.md:519-533 shows

executor = ModalExecutor(app="composer-prime-rl")
executor.launch_replicas(...)

But composer_replication/diloco/serverless/modal.py:64-66 raises NotImplementedError from __init__. Same pattern in HFJobsExecutor. The recipe doc warns about skeleton-status much further down (line 731), but the inline code example at line 519 will break the moment a reader copy-pastes it.

Wave 13 Finding 7 noted this softness; Wave 14 made it worse by writing example code that calls a constructor that always raises.

Fix direction: in every code block that calls ModalExecutor(...), prepend a comment # Wave 14: skeleton — raises NotImplementedError or flip examples to LocalProcessExecutor.

Finding 4 — SUGGESTION: MockManager + DiLoCo integration test only exercises `world_size=1`.

Severity: SUGGESTION Evidence: composer_replication/diloco/serverless/tests/test_serverless_diloco_integration.py:44-51, :108-109, :161. Both test_mockmanager_diloco_outer_round_completes and test_mockmanager_diloco_two_outer_rounds_step_counter use world_size=1.

With one replica, ObjectStoreAllReduce.allreduce returns the tensor unchanged (its own mean), so an averaging bug in the multi-replica path could not be caught by this test. The pseudo-gradient sign convention is pinned by the unrelated spike-008 test, but no test combines MockManager + DiLoCo + multi-process — i.e. the actual deployment scenario is unverified end-to-end.

Wave 13 Finding 4 is closed in spirit (call surface is now exhaustive) but not in the deepest sense.

Fix direction: add one multi-process test that spawns n_replicas subprocesses, each constructing MockManager(store) → make_diloco_outer_loop, and asserts that after one outer round all replicas converge to the same parameter values (i.e. averaging actually happened).

Finding 5 — SUGGESTION: T4 unit tests pin the wrong implementation as ground truth.

Severity: SUGGESTION Evidence: composer_replication/recipes/prime_rl/tests/test_composer_loss.py:90-128 (test_dppo_mask_clips_extreme_ratios). The expected value 1.5/3 is computed against the buggy formula (Finding 1).

The 10 PRIME-RL tests all pass — but they're testing self-consistency, not parity with PRIME-RL. A reader looking at "10 unit tests, all green" infers correctness; correctness is not what they verify. This is the kind of test honesty failure that Wave 11 + Wave 13 reviewers found in different forms.

Fix direction: add at least one test whose expected value is hand-computed from default_loss_fn in PRIME-RL (or import + invoke default_loss_fn if the dependency is available, mark the test @pytest.mark.skipif(not _HAS_PRIME_RL)).

Finding 6 — NIT: README/test-count drift.

Wave 14 task description claims "124 tests passing as of Wave 14"; actual pytest --collect-only reports 134 collected. Of those, the 61-test wave-relevant subset all pass. Not a defect, but the headline number is now off in the same way Wave 13's "9 multi-process tests" was off.

Finding 7 — NIT: `loss_fn` docstring claims "DPPO importance-sampling-ratio clipping — implemented" (`composer_loss.py:9`).

Implementation contains no importance-ratio multiplication anywhere. Even if Finding 1 is rejected and the team decides "PRIME-RL match isn't a goal", the docstring is internally false: it announces ISR clipping in a function that does not multiply by exp(log_ratio).

Cross-cutting

The four doc subagents wrote internally consistent text but inherited T4's mathematical error. Three of the four doc files repeat the same wrong formula verbatim. This is exactly the failure mode Wave 11/13 reviewers flagged: parallel subagents cross-citing each other rather than the upstream source of truth.

The 61 tests in the Wave-14-touched dirs pass cleanly. T1, T2, and T3 are real closures with real coverage. The framework is in a better state than end-of-Wave-13 — but it has not actually closed Wave 13 Finding 6, and it has propagated a subtler version of the same mathematical-mismatch bug into the user-facing documentation.

Summary scorecard

Wave 13 Finding	Wave 14 status	Verdict
BLOCKER 1 (PRIME-RL SDPO degenerate)	Fixed parent-side; channel 2 raises NotImplementedError	✅ closed
BLOCKER 2 (compose_loss kwargs not added)	T1 added them + 11 integration tests	✅ closed
Suggestion 3 (replaysim YAML field types)	T2 dual-shape reshape + real DJ e2e + caught related bug	✅ closed
Suggestion 4 (MockManager → DiLoCo gap)	T3 surface audit + integration test	🟡 closed for `world_size=1`; multi-process unverified
Suggestion 5 ("9 multi-process tests" inflated count)	Not addressed	🟡 carried over
Suggestion 6 (PRIME-RL channel 1 REINFORCE not GRPO)	T4 thought it closed this	❌ NOT closed — mathematically wrong
Suggestion 7 (Modal/HFJobs skeleton clarity)	Made worse by INTEGRATION_RECIPES dead code	🟡 regression
NIT 8 (SimPO test positive log-probs)	Not addressed	🟡 carried over

Wave 14b follow-up (2026-05-26)

After Wave 14b closed Finding 1 by re-reading PRIME-RL upstream and matching default_loss_fn byte-for-byte, the Wave 14b subagent flagged a new structural issue not in the Wave 14 review:

PRIME-RL's setup_loss_fns (upstream loss.py:320-327) expects the custom loss function to return LossOutputs(loss, metrics={...}), not a bare scalar tensor. Our recipe still returns a bare scalar. This predates Wave 14 (it's been wrong since the recipe was first written in Wave 13) but was never caught because no test runs against actual PRIME-RL.

Status: documented; deferred to Wave 15. Not blocking for Wave 14b's closure of Finding 1, because the formula now matches upstream — the return-shape mismatch is a separate adapter-level issue. Tests still pass because they invoke our loss_fn directly without going through PRIME-RL's compute_loss pipeline.

Fix direction (Wave 15): wrap the return value in a duck-typed LossOutputs (provided by PRIME-RL when installed; substituted with a NamedTuple shim when not). Add an integration smoke test against PRIME-RL to catch this and similar adapter-shape regressions.

Final Wave 14 + 14b status

Wave 13 / 14 finding	Wave 14b status
W13 BLOCKER 1: PRIME-RL SDPO degenerate	✅ closed (parent, channel 2 deferred)
W13 BLOCKER 2: compose_loss kwargs not added	✅ closed (Wave 14 T1)
W13 Suggestion 3: replaysim YAML field types	✅ closed (Wave 14 T2)
W13 Suggestion 4: MockManager → DiLoCo gap	✅ closed (Wave 14 T3 + Wave 14b multi-process test)
W13 Suggestion 6: PRIME-RL channel 1 REINFORCE not GRPO	✅ closed in Wave 14b (matches upstream `default_loss_fn`)
W14 Finding 1: PRIME-RL impl wrong	✅ closed in Wave 14b
W14 Finding 2: ADR-007 stale	✅ closed in Wave 14b
W14 Finding 3: ModalExecutor dead code	✅ closed in Wave 14b
W14 Finding 4: world_size=1 only	✅ closed in Wave 14b (multi-process convergence test)
W14 Finding 5: tests pin wrong impl as ground truth	✅ closed in Wave 14b (parity test added)
W14 NIT 6: test count drift	🟡 carried
W14 NIT 7: docstring claims ISR clipping	✅ closed in Wave 14b (real ISR now implemented)
NEW (Wave 14b): PRIME-RL `LossOutputs` return shape	🟡 deferred to Wave 15

Tests as of Wave 14b: 115 passing + 1 skip-marked (OPSD parity test, runs when upstream cloned). (Wave 12: 72; Wave 13: 93; Wave 14: 124; Wave 14b: 130; Wave 15: 115 after TAID rewrite consolidation + OPSD parity.)