Wave 13: serverless DiLoCo + replaysim normalization + 3 distillation losses + PRIME-RL + Monarch

b266c31 13 days ago

10.4 kB

Wave 13 Adversarial Cross-Model Review

Reviewer: Claude Opus 4.7 (sub-agent via delegate_task) Date: 2026-05-26 Scope: Wave 13 additions only (35 new tests, 4 ADRs, 6 new modules) Method: Read-and-grep audit + targeted test runs (CPU)

Top-line verdict

CONDITIONAL PASS with two BLOCKERs. Wave 13 substantially advances the brief expansion (serverless DiLoCo abstraction, replaysim normalization, three distillation losses, PRIME-RL recipe, Monarch tie-in). The distillation losses are the strongest deliverable — real, well-tested, mathematically faithful to the cited papers. The serverless-DiLoCo local executor + ObjectStoreAllReduce barrier are also genuine and exercised by 3 real multi-process tests.

However, two material claims are not test-validated, and one new module silently produces a degenerate loss in its primary code path. ADR claims that say "X is added to compose_loss" describe code that wasn't actually written. The MockManager → DiLoCo "drop-in" is unverified end-to-end.

Wave 11's reviewer found 2 genuine BLOCKERs. This review finds 2 BLOCKERs + 4 SUGGESTIONs + 2 NITs.

Finding 1 — BLOCKER: PRIME-RL `composer_loss.loss_fn` SDPO term is mathematically degenerate (always 0)

Severity: BLOCKER Evidence: composer_replication/recipes/prime_rl/composer_loss.py:79-86

The PRIME-RL composer-loss adapter applies unsqueeze(-1) to (B, T) log-prob tensors before passing them to generalized_jsd_loss, which calls F.log_softmax(..., dim=-1). Softmax of a single-element vector is exactly 1.0; its log is 0. Therefore both student_log_probs and teacher_log_probs are identically zero, the JSD between them is 0, and the SDPO contribution is always 0 regardless of alpha_sdpo or the actual log-prob values.

>>> import torch.nn.functional as F
>>> F.log_softmax(torch.randn(2, 3, 1), dim=-1)
tensor([[[0.],[0.],[0.]],[[0.],[0.],[0.]]])

The docstring calls this "a deliberate approximation," but it is not an approximation — it's a mathematically degenerate operation that silently disables channel 2.

Fix direction:

Gate the SDPO branch behind len(trainer_lp.shape) >= 3, raising NotImplementedError until PRIME-RL surfaces full logits.
Update prime_rl_recipe.md and ADR-006 to stop claiming PRIME-RL has working SDPO; mark it deferred.

Finding 2 — BLOCKER: ADR-007 declares `compose_loss` kwargs that were never added

Severity: BLOCKER Evidence:

docs/adrs/ADR-007-self-distillation-losses.md:103-108 claims:
composer_replication.compose_loss gets new optional kwargs:
- dpo_variant: Literal["dpo", "simpo"] = "dpo" — switches channel 3
- sdpo_wrapper: Literal["none", "taid", "entropy_opd"] = "none" — wraps channel 2
- taid_schedule_step: int | None = None
- taid_total_steps: int | None = None
composer_replication/loss.py:54-65 actual signature has none of these. grep -n "dpo_variant\|sdpo_wrapper\|taid" composer_replication/loss.py returns empty.

The new losses live in composer_replication.distillation as standalone functions but are not wired into the framework's actual loss composition. A user reading ADR-007 + the README would believe compose_loss(model, inputs, dpo_variant="simpo", sdpo_wrapper="taid", ...) works; it would raise TypeError. The 17 distillation tests verify the standalone losses but never exercise integration.

Fix direction:

Either (a) add the kwargs to compose_loss and write at least one integration test combining e.g. SDPO+TAID (~30 LOC change), or
(b) downgrade ADR-007 status to "Standalone losses landed; integration deferred to Wave 14."

Finding 3 — SUGGESTION: `default.yaml` replaysim recipe uses string ops on list-of-dict fields

Severity: SUGGESTION (would be BLOCKER if a test exercised the real path) Evidence:

composer_replication/recipes/replaysim/default.yaml configures text_length_filter, words_num_filter, special_characters_filter, document_deduplicator with text_keys: ["chosen", "rejected"].
In the record produced by _dpo_pair_to_dj_record, chosen and rejected are lists of dicts ([{"role": "assistant", "content": "..."}]) — not strings.
data-juicer's text_length_filter expects string-typed fields; running it on a list will either crash or no-op silently.

The reason no test catches this: tests only validate the real path if data-juicer is installed, and even then only check __init__ succeeds. There is no test that calls normalize() against a real data-juicer executor with the default recipe.

Fix direction:

Reshape _dpo_pair_to_dj_record to extract content strings alongside the messages-format list.
Add one test (skip-marked unless data_juicer is importable) that runs the real op-graph on 3 hand-crafted records.

Finding 4 — SUGGESTION: MockManager → torchft.DiLoCo "drop-in" claim is unverified end-to-end

Severity: SUGGESTION Evidence:

composer_replication/diloco/serverless/allreduce.py:188-191 claims MockManager "drops into" make_diloco_outer_loop.
The only test covering MockManager (test_mock_manager_shape_compat) is a hasattr smoke that calls .allreduce on a world_size=1 store (passthrough).
torchft.Manager has additional surface area (current_step, is_leader, _pg, report_error, internal step accounting) that DiLoCo's _apply_pseudogradient may consult depending on version.

Fix direction:

Add a single integration test that constructs make_diloco_outer_loop(manager=MockManager(store), ...) against a tiny nn.Linear and runs one outer round — even single-process.
Audit torchft/local_sgd.py for the Manager-rooted call sites and add stubs for any methods DiLoCo actually consults beyond allreduce.

Finding 5 — SUGGESTION: README claim "9 multi-process tests" is mildly inflated

Severity: SUGGESTION (NIT bordering) Evidence:

README.md and V1_V8_COVERAGE both state: "9 multi-process tests pinning the allreduce barrier."
Actual breakdown:
- 4 single-process unit tests + test_mock_manager_shape_compat (5)
- 4 multi-process tests spawning subprocesses (parametrized [2,3] of _runs_allreduce_across_replicas, _handles_multiple_rounds, _reports_failed_replicas)
Of the 4 multi-process tests, only 3 actually exercise the allreduce barrier; _reports_failed_replicas deliberately raises before any allreduce call.

Wave 13 clearly does NOT fake-pass via world_size=1 — the multi- process barrier is real. But the count is rounded up.

Fix direction: Replace "9 multi-process tests" with "9 tests covering the serverless DiLoCo layer, of which 4 spawn real subprocesses and 3 exercise the allreduce barrier across replicas."

Finding 6 — SUGGESTION: PRIME-RL channel 1 is REINFORCE not GRPO; ignores `inference_logprobs`

Severity: SUGGESTION Evidence: composer_replication/recipes/prime_rl/composer_loss.py:62-68 computes:

grpo_loss = -(advantages * trainer_lp * mask).sum() / mask.sum().clamp_min(epsilon)

This is plain REINFORCE with advantage. PRIME-RL's LossInputs exposes inference_logprobs precisely because GRPO-with-replay-buffer requires the importance-sampling ratio exp(trainer_lp - inference_lp) (PPO-style clipped objective).

The file says "SKELETON" so this isn't a hidden bug per se, but the loss is labeled GRPO and is not GRPO.

Fix direction: Either implement the ratio + clipping (~20 LOC) or rename channel-1 comment to "REINFORCE-with-advantage stub" with a TODO.

Finding 7 — NIT: ModalExecutor / HFJobsExecutor are skeleton-only with `NotImplementedError` in `init`

Severity: NIT (this is documented, but README phrasing is slightly soft) Evidence: Honestly documented as skeletons in the code, ADR-005, and README. NIT: a user trying ModalExecutor() gets a runtime error rather than an import-time clue.

Fix direction: Low priority. Update README phrase to "skeleton-only — raises NotImplementedError until v0.x." Or use a __getattr__ on the package that raises a clearer message.

Finding 8 — NIT: SimPO test uses positive log-probs (impossible values)

Severity: NIT Evidence: test_distillation_losses.py:27-46 calls simpo_loss with chosen=tensor([0.5, 0.4, 0.3]). Log-probabilities are bounded above by 0; positive values aren't possible from any softmax. The tests still verify the formula correctly, but the test inputs aren't legal.

Fix direction: Use negative values — purely cosmetic.

Cross-cutting risk check

73 tests passed in 29.29s on the CPU-fast subset. Spike 008 5/5 still pass. The new composer_replication.diloco.serverless package is purely additive; the existing make_diloco_outer_loop is untouched. No cross-wave regressions detected on CPU. GPU tests + slow CPU e2e tests not re-run; regression risk low since Wave 13 doesn't touch their dependencies.

Summary scorecard

Item	Verdict
Distillation module (SimPO/TAID/Entropy-Aware OPD) standalone	✅ Real, well-tested, paper-faithful
Distillation integrated into `compose_loss`	❌ Not implemented despite ADR-007 (Finding 2)
ObjectStoreAllReduce + LocalProcessExecutor	✅ Real multi-process barrier validated
MockManager → DiLoCo drop-in	🟡 Shape-checked only; integration unverified (Finding 4)
Modal/HFJobs adapters	🟡 Honestly documented as skeletons (Finding 7)
Replaysim DJNormalizer passthrough	✅ Works
Replaysim default.yaml against real data-juicer	❌ Recipe field types don't match record shape (Finding 3)
PRIME-RL composer_loss.loss_fn	❌ SDPO term silently 0 (Finding 1); channel 1 is REINFORCE not GRPO (Finding 6)
Monarch actors	✅ Honest skeleton; raises NotImplementedError
Altered-minds tie-in doc	✅ Design-only, scoped honestly
35 new tests	All pass; 3 of 4 multi-process tests are genuine (Finding 5)

Recommendation: Address Findings 1 and 2 before publishing the Wave 13 expansion as "closed." Findings 3 and 4 should be addressed before any user attempts the real data-juicer or real torchft DiLoCo path. Findings 5–8 are cleanup.

Wave 13 Adversarial Cross-Model Review

Top-line verdict

Finding 1 — BLOCKER: PRIME-RL composer_loss.loss_fn SDPO term is mathematically degenerate (always 0)

Finding 2 — BLOCKER: ADR-007 declares compose_loss kwargs that were never added

Finding 3 — SUGGESTION: default.yaml replaysim recipe uses string ops on list-of-dict fields

Finding 4 — SUGGESTION: MockManager → torchft.DiLoCo "drop-in" claim is unverified end-to-end

Finding 5 — SUGGESTION: README claim "9 multi-process tests" is mildly inflated

Finding 6 — SUGGESTION: PRIME-RL channel 1 is REINFORCE not GRPO; ignores inference_logprobs

Finding 7 — NIT: ModalExecutor / HFJobsExecutor are skeleton-only with NotImplementedError in __init__

Finding 8 — NIT: SimPO test uses positive log-probs (impossible values)

Cross-cutting risk check

Summary scorecard

Finding 1 — BLOCKER: PRIME-RL `composer_loss.loss_fn` SDPO term is mathematically degenerate (always 0)

Finding 2 — BLOCKER: ADR-007 declares `compose_loss` kwargs that were never added

Finding 3 — SUGGESTION: `default.yaml` replaysim recipe uses string ops on list-of-dict fields

Finding 6 — SUGGESTION: PRIME-RL channel 1 is REINFORCE not GRPO; ignores `inference_logprobs`

Finding 7 — NIT: ModalExecutor / HFJobsExecutor are skeleton-only with `NotImplementedError` in `init`