# Wave 13 Adversarial Cross-Model Review

**Reviewer:** Claude Opus 4.7 (sub-agent via delegate_task)
**Date:** 2026-05-26
**Scope:** Wave 13 additions only (35 new tests, 4 ADRs, 6 new modules)
**Method:** Read-and-grep audit + targeted test runs (CPU)

## Top-line verdict

**CONDITIONAL PASS with two BLOCKERs.** Wave 13 substantially advances
the brief expansion (serverless DiLoCo abstraction, replaysim
normalization, three distillation losses, PRIME-RL recipe, Monarch
tie-in). The **distillation losses are the strongest deliverable** —
real, well-tested, mathematically faithful to the cited papers. The
serverless-DiLoCo local executor + ObjectStoreAllReduce barrier are
also genuine and exercised by 3 real multi-process tests.

**However, two material claims are not test-validated, and one new
module silently produces a degenerate loss in its primary code path.**
ADR claims that say "X is added to compose_loss" describe code that
wasn't actually written. The MockManager → DiLoCo "drop-in" is
unverified end-to-end.

Wave 11's reviewer found 2 genuine BLOCKERs. This review finds **2
BLOCKERs + 4 SUGGESTIONs + 2 NITs**.

---

## Finding 1 — BLOCKER: PRIME-RL `composer_loss.loss_fn` SDPO term is mathematically degenerate (always 0)

**Severity:** BLOCKER
**Evidence:** `composer_replication/recipes/prime_rl/composer_loss.py:79-86`

The PRIME-RL composer-loss adapter applies `unsqueeze(-1)` to `(B, T)`
log-prob tensors before passing them to `generalized_jsd_loss`, which
calls `F.log_softmax(..., dim=-1)`. Softmax of a single-element vector
is exactly 1.0; its log is 0. Therefore both `student_log_probs` and
`teacher_log_probs` are identically zero, the JSD between them is 0,
and the SDPO contribution **is always 0 regardless of `alpha_sdpo` or
the actual log-prob values.**

```python
>>> import torch.nn.functional as F
>>> F.log_softmax(torch.randn(2, 3, 1), dim=-1)
tensor([[[0.],[0.],[0.]],[[0.],[0.],[0.]]])
```

The docstring calls this "a deliberate approximation," but it is not
an approximation — it's a mathematically degenerate operation that
silently disables channel 2.

**Fix direction:**
- Gate the SDPO branch behind `len(trainer_lp.shape) >= 3`, raising
  `NotImplementedError` until PRIME-RL surfaces full logits.
- Update `prime_rl_recipe.md` and ADR-006 to stop claiming PRIME-RL
  has working SDPO; mark it deferred.

---

## Finding 2 — BLOCKER: ADR-007 declares `compose_loss` kwargs that were never added

**Severity:** BLOCKER
**Evidence:**
- `docs/adrs/ADR-007-self-distillation-losses.md:103-108` claims:
  > `composer_replication.compose_loss` gets new optional kwargs:
  >   - `dpo_variant: Literal["dpo", "simpo"] = "dpo"` — switches channel 3
  >   - `sdpo_wrapper: Literal["none", "taid", "entropy_opd"] = "none"` — wraps channel 2
  >   - `taid_schedule_step: int | None = None`
  >   - `taid_total_steps: int | None = None`
- `composer_replication/loss.py:54-65` actual signature has **none**
  of these. `grep -n "dpo_variant\|sdpo_wrapper\|taid"
  composer_replication/loss.py` returns empty.

The new losses live in `composer_replication.distillation` as
standalone functions but **are not wired into the framework's actual
loss composition.** A user reading ADR-007 + the README would believe
`compose_loss(model, inputs, dpo_variant="simpo", sdpo_wrapper="taid", ...)`
works; it would raise `TypeError`. The 17 distillation tests verify
the standalone losses but never exercise integration.

**Fix direction:**
- Either (a) add the kwargs to `compose_loss` and write at least one
  integration test combining e.g. SDPO+TAID (~30 LOC change), or
- (b) downgrade ADR-007 status to "Standalone losses landed;
  integration deferred to Wave 14."

---

## Finding 3 — SUGGESTION: `default.yaml` replaysim recipe uses string ops on list-of-dict fields

**Severity:** SUGGESTION (would be BLOCKER if a test exercised the real path)
**Evidence:**
- `composer_replication/recipes/replaysim/default.yaml` configures
  `text_length_filter`, `words_num_filter`, `special_characters_filter`,
  `document_deduplicator` with `text_keys: ["chosen", "rejected"]`.
- In the record produced by `_dpo_pair_to_dj_record`, `chosen` and
  `rejected` are **lists of dicts**
  (`[{"role": "assistant", "content": "..."}]`) — not strings.
- data-juicer's `text_length_filter` expects string-typed fields;
  running it on a list will either crash or no-op silently.

The reason no test catches this: tests only validate the real path *if
data-juicer is installed*, and even then only check `__init__` succeeds.
There is no test that calls `normalize()` against a real data-juicer
executor with the default recipe.

**Fix direction:**
- Reshape `_dpo_pair_to_dj_record` to extract `content` strings
  alongside the messages-format list.
- Add one test (skip-marked unless `data_juicer` is importable) that
  runs the real op-graph on 3 hand-crafted records.

---

## Finding 4 — SUGGESTION: MockManager → torchft.DiLoCo "drop-in" claim is unverified end-to-end

**Severity:** SUGGESTION
**Evidence:**
- `composer_replication/diloco/serverless/allreduce.py:188-191` claims
  MockManager "drops into" `make_diloco_outer_loop`.
- The only test covering MockManager (`test_mock_manager_shape_compat`)
  is a `hasattr` smoke that calls `.allreduce` on a `world_size=1`
  store (passthrough).
- torchft.Manager has additional surface area
  (`current_step`, `is_leader`, `_pg`, `report_error`,
  internal step accounting) that DiLoCo's `_apply_pseudogradient`
  may consult depending on version.

**Fix direction:**
- Add a single integration test that constructs
  `make_diloco_outer_loop(manager=MockManager(store), ...)` against a
  tiny `nn.Linear` and runs one outer round — even single-process.
- Audit `torchft/local_sgd.py` for the `Manager`-rooted call sites and
  add stubs for any methods DiLoCo actually consults beyond `allreduce`.

---

## Finding 5 — SUGGESTION: README claim "9 multi-process tests" is mildly inflated

**Severity:** SUGGESTION (NIT bordering)
**Evidence:**
- README.md and V1_V8_COVERAGE both state: *"9 multi-process tests
  pinning the allreduce barrier."*
- Actual breakdown:
  - 4 single-process unit tests + `test_mock_manager_shape_compat` (5)
  - 4 multi-process tests spawning subprocesses (parametrized [2,3] of
    `_runs_allreduce_across_replicas`, `_handles_multiple_rounds`,
    `_reports_failed_replicas`)
- Of the 4 multi-process tests, only **3 actually exercise the
  allreduce barrier**; `_reports_failed_replicas` deliberately raises
  before any allreduce call.

**Wave 13 clearly does NOT fake-pass via world_size=1** — the multi-
process barrier is real. But the count is rounded up.

**Fix direction:** Replace "9 multi-process tests" with "9 tests
covering the serverless DiLoCo layer, of which 4 spawn real
subprocesses and 3 exercise the allreduce barrier across replicas."

---

## Finding 6 — SUGGESTION: PRIME-RL channel 1 is REINFORCE not GRPO; ignores `inference_logprobs`

**Severity:** SUGGESTION
**Evidence:** `composer_replication/recipes/prime_rl/composer_loss.py:62-68`
computes:
```python
grpo_loss = -(advantages * trainer_lp * mask).sum() / mask.sum().clamp_min(epsilon)
```

This is plain REINFORCE with advantage. PRIME-RL's `LossInputs`
exposes `inference_logprobs` precisely because GRPO-with-replay-buffer
requires the importance-sampling ratio
`exp(trainer_lp - inference_lp)` (PPO-style clipped objective).

The file says "SKELETON" so this isn't a hidden bug per se, but the
loss is **labeled GRPO and is not GRPO**.

**Fix direction:** Either implement the ratio + clipping (~20 LOC) or
rename channel-1 comment to "REINFORCE-with-advantage stub" with a TODO.

---

## Finding 7 — NIT: ModalExecutor / HFJobsExecutor are skeleton-only with `NotImplementedError` in `__init__`

**Severity:** NIT (this is documented, but README phrasing is slightly soft)
**Evidence:** Honestly documented as skeletons in the code, ADR-005,
and README. NIT: a user trying `ModalExecutor()` gets a runtime error
rather than an import-time clue.

**Fix direction:** Low priority. Update README phrase to "skeleton-only
— raises NotImplementedError until v0.x." Or use a `__getattr__` on
the package that raises a clearer message.

---

## Finding 8 — NIT: SimPO test uses positive log-probs (impossible values)

**Severity:** NIT
**Evidence:** `test_distillation_losses.py:27-46` calls `simpo_loss`
with `chosen=tensor([0.5, 0.4, 0.3])`. Log-probabilities are bounded
above by 0; positive values aren't possible from any softmax. The tests
still verify the formula correctly, but the test inputs aren't legal.

**Fix direction:** Use negative values — purely cosmetic.

---

## Cross-cutting risk check

73 tests passed in 29.29s on the CPU-fast subset. Spike 008 5/5 still
pass. The new `composer_replication.diloco.serverless` package is
purely additive; the existing `make_diloco_outer_loop` is untouched.
**No cross-wave regressions detected on CPU.** GPU tests + slow CPU
e2e tests not re-run; regression risk low since Wave 13 doesn't touch
their dependencies.

---

## Summary scorecard

| Item | Verdict |
|---|---|
| Distillation module (SimPO/TAID/Entropy-Aware OPD) standalone | ✅ Real, well-tested, paper-faithful |
| Distillation integrated into `compose_loss` | ❌ **Not implemented** despite ADR-007 (Finding 2) |
| ObjectStoreAllReduce + LocalProcessExecutor | ✅ Real multi-process barrier validated |
| MockManager → DiLoCo drop-in | 🟡 Shape-checked only; integration unverified (Finding 4) |
| Modal/HFJobs adapters | 🟡 Honestly documented as skeletons (Finding 7) |
| Replaysim DJNormalizer passthrough | ✅ Works |
| Replaysim default.yaml against real data-juicer | ❌ **Recipe field types don't match record shape** (Finding 3) |
| PRIME-RL composer_loss.loss_fn | ❌ **SDPO term silently 0** (Finding 1); channel 1 is REINFORCE not GRPO (Finding 6) |
| Monarch actors | ✅ Honest skeleton; raises NotImplementedError |
| Altered-minds tie-in doc | ✅ Design-only, scoped honestly |
| 35 new tests | All pass; 3 of 4 multi-process tests are genuine (Finding 5) |

**Recommendation:** Address Findings 1 and 2 before publishing the
Wave 13 expansion as "closed." Findings 3 and 4 should be addressed
before any user attempts the real data-juicer or real torchft DiLoCo
path. Findings 5–8 are cleanup.