composer-replication-framework / docs /research /WAVE_13_FINAL_REVIEW.md
Codeseys's picture
Wave 13: serverless DiLoCo + replaysim normalization + 3 distillation losses + PRIME-RL + Monarch
b266c31
# Wave 13 Adversarial Cross-Model Review
**Reviewer:** Claude Opus 4.7 (sub-agent via delegate_task)
**Date:** 2026-05-26
**Scope:** Wave 13 additions only (35 new tests, 4 ADRs, 6 new modules)
**Method:** Read-and-grep audit + targeted test runs (CPU)
## Top-line verdict
**CONDITIONAL PASS with two BLOCKERs.** Wave 13 substantially advances
the brief expansion (serverless DiLoCo abstraction, replaysim
normalization, three distillation losses, PRIME-RL recipe, Monarch
tie-in). The **distillation losses are the strongest deliverable**
real, well-tested, mathematically faithful to the cited papers. The
serverless-DiLoCo local executor + ObjectStoreAllReduce barrier are
also genuine and exercised by 3 real multi-process tests.
**However, two material claims are not test-validated, and one new
module silently produces a degenerate loss in its primary code path.**
ADR claims that say "X is added to compose_loss" describe code that
wasn't actually written. The MockManager → DiLoCo "drop-in" is
unverified end-to-end.
Wave 11's reviewer found 2 genuine BLOCKERs. This review finds **2
BLOCKERs + 4 SUGGESTIONs + 2 NITs**.
---
## Finding 1 — BLOCKER: PRIME-RL `composer_loss.loss_fn` SDPO term is mathematically degenerate (always 0)
**Severity:** BLOCKER
**Evidence:** `composer_replication/recipes/prime_rl/composer_loss.py:79-86`
The PRIME-RL composer-loss adapter applies `unsqueeze(-1)` to `(B, T)`
log-prob tensors before passing them to `generalized_jsd_loss`, which
calls `F.log_softmax(..., dim=-1)`. Softmax of a single-element vector
is exactly 1.0; its log is 0. Therefore both `student_log_probs` and
`teacher_log_probs` are identically zero, the JSD between them is 0,
and the SDPO contribution **is always 0 regardless of `alpha_sdpo` or
the actual log-prob values.**
```python
>>> import torch.nn.functional as F
>>> F.log_softmax(torch.randn(2, 3, 1), dim=-1)
tensor([[[0.],[0.],[0.]],[[0.],[0.],[0.]]])
```
The docstring calls this "a deliberate approximation," but it is not
an approximation — it's a mathematically degenerate operation that
silently disables channel 2.
**Fix direction:**
- Gate the SDPO branch behind `len(trainer_lp.shape) >= 3`, raising
`NotImplementedError` until PRIME-RL surfaces full logits.
- Update `prime_rl_recipe.md` and ADR-006 to stop claiming PRIME-RL
has working SDPO; mark it deferred.
---
## Finding 2 — BLOCKER: ADR-007 declares `compose_loss` kwargs that were never added
**Severity:** BLOCKER
**Evidence:**
- `docs/adrs/ADR-007-self-distillation-losses.md:103-108` claims:
> `composer_replication.compose_loss` gets new optional kwargs:
> - `dpo_variant: Literal["dpo", "simpo"] = "dpo"` — switches channel 3
> - `sdpo_wrapper: Literal["none", "taid", "entropy_opd"] = "none"` — wraps channel 2
> - `taid_schedule_step: int | None = None`
> - `taid_total_steps: int | None = None`
- `composer_replication/loss.py:54-65` actual signature has **none**
of these. `grep -n "dpo_variant\|sdpo_wrapper\|taid"
composer_replication/loss.py` returns empty.
The new losses live in `composer_replication.distillation` as
standalone functions but **are not wired into the framework's actual
loss composition.** A user reading ADR-007 + the README would believe
`compose_loss(model, inputs, dpo_variant="simpo", sdpo_wrapper="taid", ...)`
works; it would raise `TypeError`. The 17 distillation tests verify
the standalone losses but never exercise integration.
**Fix direction:**
- Either (a) add the kwargs to `compose_loss` and write at least one
integration test combining e.g. SDPO+TAID (~30 LOC change), or
- (b) downgrade ADR-007 status to "Standalone losses landed;
integration deferred to Wave 14."
---
## Finding 3 — SUGGESTION: `default.yaml` replaysim recipe uses string ops on list-of-dict fields
**Severity:** SUGGESTION (would be BLOCKER if a test exercised the real path)
**Evidence:**
- `composer_replication/recipes/replaysim/default.yaml` configures
`text_length_filter`, `words_num_filter`, `special_characters_filter`,
`document_deduplicator` with `text_keys: ["chosen", "rejected"]`.
- In the record produced by `_dpo_pair_to_dj_record`, `chosen` and
`rejected` are **lists of dicts**
(`[{"role": "assistant", "content": "..."}]`) — not strings.
- data-juicer's `text_length_filter` expects string-typed fields;
running it on a list will either crash or no-op silently.
The reason no test catches this: tests only validate the real path *if
data-juicer is installed*, and even then only check `__init__` succeeds.
There is no test that calls `normalize()` against a real data-juicer
executor with the default recipe.
**Fix direction:**
- Reshape `_dpo_pair_to_dj_record` to extract `content` strings
alongside the messages-format list.
- Add one test (skip-marked unless `data_juicer` is importable) that
runs the real op-graph on 3 hand-crafted records.
---
## Finding 4 — SUGGESTION: MockManager → torchft.DiLoCo "drop-in" claim is unverified end-to-end
**Severity:** SUGGESTION
**Evidence:**
- `composer_replication/diloco/serverless/allreduce.py:188-191` claims
MockManager "drops into" `make_diloco_outer_loop`.
- The only test covering MockManager (`test_mock_manager_shape_compat`)
is a `hasattr` smoke that calls `.allreduce` on a `world_size=1`
store (passthrough).
- torchft.Manager has additional surface area
(`current_step`, `is_leader`, `_pg`, `report_error`,
internal step accounting) that DiLoCo's `_apply_pseudogradient`
may consult depending on version.
**Fix direction:**
- Add a single integration test that constructs
`make_diloco_outer_loop(manager=MockManager(store), ...)` against a
tiny `nn.Linear` and runs one outer round — even single-process.
- Audit `torchft/local_sgd.py` for the `Manager`-rooted call sites and
add stubs for any methods DiLoCo actually consults beyond `allreduce`.
---
## Finding 5 — SUGGESTION: README claim "9 multi-process tests" is mildly inflated
**Severity:** SUGGESTION (NIT bordering)
**Evidence:**
- README.md and V1_V8_COVERAGE both state: *"9 multi-process tests
pinning the allreduce barrier."*
- Actual breakdown:
- 4 single-process unit tests + `test_mock_manager_shape_compat` (5)
- 4 multi-process tests spawning subprocesses (parametrized [2,3] of
`_runs_allreduce_across_replicas`, `_handles_multiple_rounds`,
`_reports_failed_replicas`)
- Of the 4 multi-process tests, only **3 actually exercise the
allreduce barrier**; `_reports_failed_replicas` deliberately raises
before any allreduce call.
**Wave 13 clearly does NOT fake-pass via world_size=1** — the multi-
process barrier is real. But the count is rounded up.
**Fix direction:** Replace "9 multi-process tests" with "9 tests
covering the serverless DiLoCo layer, of which 4 spawn real
subprocesses and 3 exercise the allreduce barrier across replicas."
---
## Finding 6 — SUGGESTION: PRIME-RL channel 1 is REINFORCE not GRPO; ignores `inference_logprobs`
**Severity:** SUGGESTION
**Evidence:** `composer_replication/recipes/prime_rl/composer_loss.py:62-68`
computes:
```python
grpo_loss = -(advantages * trainer_lp * mask).sum() / mask.sum().clamp_min(epsilon)
```
This is plain REINFORCE with advantage. PRIME-RL's `LossInputs`
exposes `inference_logprobs` precisely because GRPO-with-replay-buffer
requires the importance-sampling ratio
`exp(trainer_lp - inference_lp)` (PPO-style clipped objective).
The file says "SKELETON" so this isn't a hidden bug per se, but the
loss is **labeled GRPO and is not GRPO**.
**Fix direction:** Either implement the ratio + clipping (~20 LOC) or
rename channel-1 comment to "REINFORCE-with-advantage stub" with a TODO.
---
## Finding 7 — NIT: ModalExecutor / HFJobsExecutor are skeleton-only with `NotImplementedError` in `__init__`
**Severity:** NIT (this is documented, but README phrasing is slightly soft)
**Evidence:** Honestly documented as skeletons in the code, ADR-005,
and README. NIT: a user trying `ModalExecutor()` gets a runtime error
rather than an import-time clue.
**Fix direction:** Low priority. Update README phrase to "skeleton-only
— raises NotImplementedError until v0.x." Or use a `__getattr__` on
the package that raises a clearer message.
---
## Finding 8 — NIT: SimPO test uses positive log-probs (impossible values)
**Severity:** NIT
**Evidence:** `test_distillation_losses.py:27-46` calls `simpo_loss`
with `chosen=tensor([0.5, 0.4, 0.3])`. Log-probabilities are bounded
above by 0; positive values aren't possible from any softmax. The tests
still verify the formula correctly, but the test inputs aren't legal.
**Fix direction:** Use negative values — purely cosmetic.
---
## Cross-cutting risk check
73 tests passed in 29.29s on the CPU-fast subset. Spike 008 5/5 still
pass. The new `composer_replication.diloco.serverless` package is
purely additive; the existing `make_diloco_outer_loop` is untouched.
**No cross-wave regressions detected on CPU.** GPU tests + slow CPU
e2e tests not re-run; regression risk low since Wave 13 doesn't touch
their dependencies.
---
## Summary scorecard
| Item | Verdict |
|---|---|
| Distillation module (SimPO/TAID/Entropy-Aware OPD) standalone | ✅ Real, well-tested, paper-faithful |
| Distillation integrated into `compose_loss` | ❌ **Not implemented** despite ADR-007 (Finding 2) |
| ObjectStoreAllReduce + LocalProcessExecutor | ✅ Real multi-process barrier validated |
| MockManager → DiLoCo drop-in | 🟡 Shape-checked only; integration unverified (Finding 4) |
| Modal/HFJobs adapters | 🟡 Honestly documented as skeletons (Finding 7) |
| Replaysim DJNormalizer passthrough | ✅ Works |
| Replaysim default.yaml against real data-juicer | ❌ **Recipe field types don't match record shape** (Finding 3) |
| PRIME-RL composer_loss.loss_fn | ❌ **SDPO term silently 0** (Finding 1); channel 1 is REINFORCE not GRPO (Finding 6) |
| Monarch actors | ✅ Honest skeleton; raises NotImplementedError |
| Altered-minds tie-in doc | ✅ Design-only, scoped honestly |
| 35 new tests | All pass; 3 of 4 multi-process tests are genuine (Finding 5) |
**Recommendation:** Address Findings 1 and 2 before publishing the
Wave 13 expansion as "closed." Findings 3 and 4 should be addressed
before any user attempts the real data-juicer or real torchft DiLoCo
path. Findings 5–8 are cleanup.