# Wave 14 Adversarial Cross-Model Review **Reviewer:** Claude Opus 4.7 (sub-agent via delegate_task) **Date:** 2026-05-26 **Method:** Read every Wave 13 finding, every Wave 14 closure, all 4 doc files, **cloned PRIME-RL upstream to verify T4 claims**, ran 61 wave-relevant tests. ## Top-line verdict **CONDITIONAL PASS with 1 BLOCKER + 4 SUGGESTIONs + 2 NITs.** Wave 14 closes Wave 13 BLOCKER 2 (T1 — compose_loss kwargs) and Suggestion 3 (T2 — replaysim) cleanly. T3 (MockManager surface audit) is solid but only tests `world_size=1`. **T4 (PRIME-RL "real GRPO + DPPO") does not match PRIME-RL's actual `default_loss_fn`** despite claiming to mirror it; that error has been pasted into USER_GUIDE.md, INTEGRATION_RECIPES.md, and API_REFERENCE.md. Same signal-to-noise as Wave 11 + Wave 13 reviewers: 1 genuine BLOCKER. --- ## Finding 1 — BLOCKER: T4 PRIME-RL "DPPO importance-sampling-ratio clip" is neither importance sampling nor matches PRIME-RL. **Severity:** BLOCKER **Evidence:** `composer_replication/recipes/prime_rl/composer_loss.py:118-131`. The implementation computes ```python grpo_loss = -(advantages * trainer_lp * keep_mask).sum() / keep_mask.sum() ``` That's **pure REINFORCE-with-advantage** — the masking gate is the only nod toward DPPO; there is no importance-sampling ratio multiplication anywhere. **Real PRIME-RL** (`/tmp/prime-rl-clone/src/prime_rl/trainer/rl/loss.py:128-153`, the `default_loss_fn` on `main` as of 2026-05-26): ```python log_importance_ratio = trainer_logprobs - inference_logprobs importance_ratio = torch.exp(log_importance_ratio) probs_diff = torch.exp(trainer_logprobs) - torch.exp(inference_logprobs) positive_advantages = advantages > 0 dppo_invalid_mask_high = probs_diff > loss_config.dppo_mask_high dppo_invalid_mask_low = probs_diff < -loss_config.dppo_mask_low dppo_invalid_mask = torch.where(positive_advantages, dppo_invalid_mask_high, dppo_invalid_mask_low) keep_mask = loss_mask & ~dppo_invalid_mask pg_loss = -(keep_mask * advantages * importance_ratio).sum() # NO division kl_loss = adv_tau * (log_importance_ratio**2 * keep_mask).sum() # KL term ``` **Three concrete divergences from Wave 14's implementation:** 1. **Mask gate is on `probs_diff`** (a probability-space quantity), NOT `log_ratio` (a log-space quantity). These have different magnitudes: `probs_diff=0.2` corresponds to `log_ratio≈log(1.2)≈0.18` for a trainer prob of 1.0 vs inference prob of 0.8. With our `log_ratio>4.0` gate, the mask never fires for normal training distributions; PRIME-RL's `probs_diff>0.2` gate fires routinely. 2. **PRIME-RL multiplies by `importance_ratio = exp(log_ratio)`**; Wave 14 multiplies by `trainer_lp` directly. This is the difference between actual policy-gradient correction (PRIME-RL) and naive REINFORCE. 3. **PRIME-RL's mask is sign-conditioned on advantage** (positive advantages clipped against `dppo_mask_high`, negative against `-dppo_mask_low`); Wave 14 ORs them together unconditionally. **Plus:** the KL term is missing entirely. **Plus:** the defaults claimed as "PRIME-RL's defaults" — `dppo_mask_high=4.0, dppo_mask_low=-4.0` — are wrong. PRIME-RL's `DefaultLossConfig` (`configs/trainer.py:412-424`) sets `dppo_mask_high=0.2, dppo_mask_low=0.2` with `Field(..., ge=0)` validation that would *reject* a negative value. PRIME-RL's code negates at use site: `probs_diff < -loss_config.dppo_mask_low`. **Plus:** the docstring (`composer_loss.py:32-49`), USER_GUIDE.md:599-608, INTEGRATION_RECIPES.md:426-429 + 482-487, and API_REFERENCE.md:1364 all repeat the wrong formula and the wrong "matches PRIME-RL" claim. **Fix direction:** Either (a) actually mirror `default_loss_fn` (mask on `probs_diff`, multiply by `importance_ratio`, add KL term, advantage- conditioned mask, `.sum()` reduction with token-count returned for caller-side scaling), or (b) drop the "matches PRIME-RL" framing and rename to "REINFORCE-with-advantage stub + log-ratio mask" everywhere. Wave 13 Finding 6 is **not actually closed** by Wave 14. --- ## Finding 2 — SUGGESTION: ADR-007 still says Wave 14 hasn't done the integration. **Severity:** SUGGESTION **Evidence:** `docs/adrs/ADR-007-self-distillation-losses.md:104-122` reads: > **Wave 14+ work — `compose_loss` integration is NOT in this wave** > ... Wave 14 plan: add the four kwargs ... But Wave 14 *did* add them (verified — `loss.py:80-93`). The ADR was written defensively after Wave 13 review and never updated when T1 landed. **Net effect:** a user reading ADR-007 is told the SimPO/TAID kwargs don't work; a user reading USER_GUIDE/API_REFERENCE is told they do. **Fix direction:** flip ADR-007 status section to "Closed in Wave 14 — see test_compose_loss_integration.py". --- ## Finding 3 — SUGGESTION: ModalExecutor instantiation example in INTEGRATION_RECIPES is dead code. **Severity:** SUGGESTION **Evidence:** `docs/INTEGRATION_RECIPES.md:519-533` shows ```python executor = ModalExecutor(app="composer-prime-rl") executor.launch_replicas(...) ``` But `composer_replication/diloco/serverless/modal.py:64-66` raises `NotImplementedError` from `__init__`. Same pattern in `HFJobsExecutor`. The recipe doc warns about skeleton-status much further down (line 731), but the inline code example at line 519 will break the moment a reader copy-pastes it. Wave 13 Finding 7 noted this softness; Wave 14 made it worse by writing example code that calls a constructor that always raises. **Fix direction:** in every code block that calls `ModalExecutor(...)`, prepend a comment `# Wave 14: skeleton — raises NotImplementedError` or flip examples to `LocalProcessExecutor`. --- ## Finding 4 — SUGGESTION: MockManager + DiLoCo integration test only exercises `world_size=1`. **Severity:** SUGGESTION **Evidence:** `composer_replication/diloco/serverless/tests/test_serverless_diloco_integration.py:44-51`, `:108-109`, `:161`. Both `test_mockmanager_diloco_outer_round_completes` and `test_mockmanager_diloco_two_outer_rounds_step_counter` use `world_size=1`. With one replica, `ObjectStoreAllReduce.allreduce` returns the tensor unchanged (its own mean), so an averaging bug in the multi-replica path could not be caught by this test. The pseudo-gradient sign convention is pinned by the unrelated spike-008 test, but **no test combines MockManager + DiLoCo + multi-process** — i.e. the actual deployment scenario is unverified end-to-end. Wave 13 Finding 4 is closed in spirit (call surface is now exhaustive) but not in the deepest sense. **Fix direction:** add one multi-process test that spawns `n_replicas` subprocesses, each constructing `MockManager(store) → make_diloco_outer_loop`, and asserts that after one outer round all replicas converge to the same parameter values (i.e. averaging actually happened). --- ## Finding 5 — SUGGESTION: T4 unit tests pin the wrong implementation as ground truth. **Severity:** SUGGESTION **Evidence:** `composer_replication/recipes/prime_rl/tests/test_composer_loss.py:90-128` (`test_dppo_mask_clips_extreme_ratios`). The expected value `1.5/3` is computed against the buggy formula (Finding 1). The 10 PRIME-RL tests all pass — but they're testing self-consistency, not parity with PRIME-RL. A reader looking at "10 unit tests, all green" infers correctness; correctness is not what they verify. This is the kind of test honesty failure that Wave 11 + Wave 13 reviewers found in different forms. **Fix direction:** add at least one test whose expected value is hand-computed from `default_loss_fn` in PRIME-RL (or import + invoke `default_loss_fn` if the dependency is available, mark the test `@pytest.mark.skipif(not _HAS_PRIME_RL)`). --- ## Finding 6 — NIT: README/test-count drift. Wave 14 task description claims "124 tests passing as of Wave 14"; actual `pytest --collect-only` reports **134 collected**. Of those, the 61-test wave-relevant subset all pass. Not a defect, but the headline number is now off in the same way Wave 13's "9 multi-process tests" was off. --- ## Finding 7 — NIT: `loss_fn` docstring claims "DPPO importance-sampling-ratio clipping — implemented" (`composer_loss.py:9`). Implementation contains no importance-ratio multiplication anywhere. Even if Finding 1 is rejected and the team decides "PRIME-RL match isn't a goal", the docstring is internally false: it announces ISR clipping in a function that does not multiply by `exp(log_ratio)`. --- ## Cross-cutting The four doc subagents wrote internally consistent text but inherited T4's mathematical error. **Three of the four doc files repeat the same wrong formula verbatim.** This is exactly the failure mode Wave 11/13 reviewers flagged: parallel subagents cross-citing each other rather than the upstream source of truth. The 61 tests in the Wave-14-touched dirs pass cleanly. T1, T2, and T3 are real closures with real coverage. The framework is in a **better** state than end-of-Wave-13 — but it has not actually closed Wave 13 Finding 6, and it has propagated a subtler version of the same mathematical-mismatch bug into the user-facing documentation. --- ## Summary scorecard | Wave 13 Finding | Wave 14 status | Verdict | |---|---|---| | BLOCKER 1 (PRIME-RL SDPO degenerate) | Fixed parent-side; channel 2 raises NotImplementedError | ✅ closed | | BLOCKER 2 (compose_loss kwargs not added) | T1 added them + 11 integration tests | ✅ closed | | Suggestion 3 (replaysim YAML field types) | T2 dual-shape reshape + real DJ e2e + caught related bug | ✅ closed | | Suggestion 4 (MockManager → DiLoCo gap) | T3 surface audit + integration test | 🟡 closed for `world_size=1`; multi-process unverified | | Suggestion 5 ("9 multi-process tests" inflated count) | Not addressed | 🟡 carried over | | Suggestion 6 (PRIME-RL channel 1 REINFORCE not GRPO) | T4 thought it closed this | ❌ **NOT closed — mathematically wrong** | | Suggestion 7 (Modal/HFJobs skeleton clarity) | Made worse by INTEGRATION_RECIPES dead code | 🟡 regression | | NIT 8 (SimPO test positive log-probs) | Not addressed | 🟡 carried over | ## Wave 14b follow-up (2026-05-26) After Wave 14b closed Finding 1 by re-reading PRIME-RL upstream and matching `default_loss_fn` byte-for-byte, the Wave 14b subagent flagged a **new** structural issue not in the Wave 14 review: **PRIME-RL's `setup_loss_fns` (upstream `loss.py:320-327`) expects the custom loss function to return `LossOutputs(loss, metrics={...})`, not a bare scalar tensor.** Our recipe still returns a bare scalar. This predates Wave 14 (it's been wrong since the recipe was first written in Wave 13) but was never caught because no test runs against actual PRIME-RL. **Status:** documented; deferred to Wave 15. Not blocking for Wave 14b's closure of Finding 1, because the formula now matches upstream — the return-shape mismatch is a separate adapter-level issue. Tests still pass because they invoke our `loss_fn` directly without going through PRIME-RL's `compute_loss` pipeline. **Fix direction (Wave 15):** wrap the return value in a duck-typed `LossOutputs` (provided by PRIME-RL when installed; substituted with a NamedTuple shim when not). Add an integration smoke test against PRIME-RL to catch this and similar adapter-shape regressions. ## Final Wave 14 + 14b status | Wave 13 / 14 finding | Wave 14b status | |---|---| | W13 BLOCKER 1: PRIME-RL SDPO degenerate | ✅ closed (parent, channel 2 deferred) | | W13 BLOCKER 2: compose_loss kwargs not added | ✅ closed (Wave 14 T1) | | W13 Suggestion 3: replaysim YAML field types | ✅ closed (Wave 14 T2) | | W13 Suggestion 4: MockManager → DiLoCo gap | ✅ closed (Wave 14 T3 + Wave 14b multi-process test) | | W13 Suggestion 6: PRIME-RL channel 1 REINFORCE not GRPO | ✅ **closed in Wave 14b** (matches upstream `default_loss_fn`) | | W14 Finding 1: PRIME-RL impl wrong | ✅ closed in Wave 14b | | W14 Finding 2: ADR-007 stale | ✅ closed in Wave 14b | | W14 Finding 3: ModalExecutor dead code | ✅ closed in Wave 14b | | W14 Finding 4: world_size=1 only | ✅ closed in Wave 14b (multi-process convergence test) | | W14 Finding 5: tests pin wrong impl as ground truth | ✅ closed in Wave 14b (parity test added) | | W14 NIT 6: test count drift | 🟡 carried | | W14 NIT 7: docstring claims ISR clipping | ✅ closed in Wave 14b (real ISR now implemented) | | **NEW (Wave 14b)**: PRIME-RL `LossOutputs` return shape | 🟡 deferred to Wave 15 | **Tests as of Wave 14b: 115 passing + 1 skip-marked (OPSD parity test, runs when upstream cloned).** (Wave 12: 72; Wave 13: 93; Wave 14: 124; Wave 14b: 130; Wave 15: 115 after TAID rewrite consolidation + OPSD parity.)