Wave 13: serverless DiLoCo + replaysim normalization + 3 distillation losses + PRIME-RL + Monarch

b266c31 13 days ago

10.4 kB

	# Wave 13 Adversarial Cross-Model Review

	Reviewer: Claude Opus 4.7 (sub-agent via delegate_task)
	Date: 2026-05-26
	Scope: Wave 13 additions only (35 new tests, 4 ADRs, 6 new modules)
	Method: Read-and-grep audit + targeted test runs (CPU)

	## Top-line verdict

	CONDITIONAL PASS with two BLOCKERs. Wave 13 substantially advances
	the brief expansion (serverless DiLoCo abstraction, replaysim
	normalization, three distillation losses, PRIME-RL recipe, Monarch
	tie-in). The distillation losses are the strongest deliverable —
	real, well-tested, mathematically faithful to the cited papers. The
	serverless-DiLoCo local executor + ObjectStoreAllReduce barrier are
	also genuine and exercised by 3 real multi-process tests.

	**However, two material claims are not test-validated, and one new
	module silently produces a degenerate loss in its primary code path.**
	ADR claims that say "X is added to compose_loss" describe code that
	wasn't actually written. The MockManager → DiLoCo "drop-in" is
	unverified end-to-end.

	Wave 11's reviewer found 2 genuine BLOCKERs. This review finds **2
	BLOCKERs + 4 SUGGESTIONs + 2 NITs**.

	---

	## Finding 1 — BLOCKER: PRIME-RL `composer_loss.loss_fn` SDPO term is mathematically degenerate (always 0)

	Severity: BLOCKER
	Evidence: `composer_replication/recipes/prime_rl/composer_loss.py:79-86`

	The PRIME-RL composer-loss adapter applies `unsqueeze(-1)` to `(B, T)`
	log-prob tensors before passing them to `generalized_jsd_loss`, which
	calls `F.log_softmax(..., dim=-1)`. Softmax of a single-element vector
	is exactly 1.0; its log is 0. Therefore both `student_log_probs` and
	`teacher_log_probs` are identically zero, the JSD between them is 0,
	and the SDPO contribution **is always 0 regardless of `alpha_sdpo` or
	the actual log-prob values.**

	```python
	>>> import torch.nn.functional as F
	>>> F.log_softmax(torch.randn(2, 3, 1), dim=-1)
	tensor([[[0.],[0.],[0.]],[[0.],[0.],[0.]]])
	```

	The docstring calls this "a deliberate approximation," but it is not
	an approximation — it's a mathematically degenerate operation that
	silently disables channel 2.

	Fix direction:
	- Gate the SDPO branch behind `len(trainer_lp.shape) >= 3`, raising
	`NotImplementedError` until PRIME-RL surfaces full logits.
	- Update `prime_rl_recipe.md` and ADR-006 to stop claiming PRIME-RL
	has working SDPO; mark it deferred.

	---

	## Finding 2 — BLOCKER: ADR-007 declares `compose_loss` kwargs that were never added

	Severity: BLOCKER
	Evidence:
	- `docs/adrs/ADR-007-self-distillation-losses.md:103-108` claims:
	> `composer_replication.compose_loss` gets new optional kwargs:
	> - `dpo_variant: Literal["dpo", "simpo"] = "dpo"` — switches channel 3
	> - `sdpo_wrapper: Literal["none", "taid", "entropy_opd"] = "none"` — wraps channel 2
	> - `taid_schedule_step: int \| None = None`
	> - `taid_total_steps: int \| None = None`
	- `composer_replication/loss.py:54-65` actual signature has none
	of these. `grep -n "dpo_variant\\|sdpo_wrapper\\|taid"
	composer_replication/loss.py` returns empty.

	The new losses live in `composer_replication.distillation` as
	standalone functions but **are not wired into the framework's actual
	loss composition.** A user reading ADR-007 + the README would believe
	`compose_loss(model, inputs, dpo_variant="simpo", sdpo_wrapper="taid", ...)`
	works; it would raise `TypeError`. The 17 distillation tests verify
	the standalone losses but never exercise integration.

	Fix direction:
	- Either (a) add the kwargs to `compose_loss` and write at least one
	integration test combining e.g. SDPO+TAID (~30 LOC change), or
	- (b) downgrade ADR-007 status to "Standalone losses landed;
	integration deferred to Wave 14."

	---

	## Finding 3 — SUGGESTION: `default.yaml` replaysim recipe uses string ops on list-of-dict fields

	Severity: SUGGESTION (would be BLOCKER if a test exercised the real path)
	Evidence:
	- `composer_replication/recipes/replaysim/default.yaml` configures
	`text_length_filter`, `words_num_filter`, `special_characters_filter`,
	`document_deduplicator` with `text_keys: ["chosen", "rejected"]`.
	- In the record produced by `_dpo_pair_to_dj_record`, `chosen` and
	`rejected` are lists of dicts
	(`[{"role": "assistant", "content": "..."}]`) — not strings.
	- data-juicer's `text_length_filter` expects string-typed fields;
	running it on a list will either crash or no-op silently.

	The reason no test catches this: tests only validate the real path *if
	data-juicer is installed*, and even then only check `__init__` succeeds.
	There is no test that calls `normalize()` against a real data-juicer
	executor with the default recipe.

	Fix direction:
	- Reshape `_dpo_pair_to_dj_record` to extract `content` strings
	alongside the messages-format list.
	- Add one test (skip-marked unless `data_juicer` is importable) that
	runs the real op-graph on 3 hand-crafted records.

	---

	## Finding 4 — SUGGESTION: MockManager → torchft.DiLoCo "drop-in" claim is unverified end-to-end

	Severity: SUGGESTION
	Evidence:
	- `composer_replication/diloco/serverless/allreduce.py:188-191` claims
	MockManager "drops into" `make_diloco_outer_loop`.
	- The only test covering MockManager (`test_mock_manager_shape_compat`)
	is a `hasattr` smoke that calls `.allreduce` on a `world_size=1`
	store (passthrough).
	- torchft.Manager has additional surface area
	(`current_step`, `is_leader`, `_pg`, `report_error`,
	internal step accounting) that DiLoCo's `_apply_pseudogradient`
	may consult depending on version.

	Fix direction:
	- Add a single integration test that constructs
	`make_diloco_outer_loop(manager=MockManager(store), ...)` against a
	tiny `nn.Linear` and runs one outer round — even single-process.
	- Audit `torchft/local_sgd.py` for the `Manager`-rooted call sites and
	add stubs for any methods DiLoCo actually consults beyond `allreduce`.

	---

	## Finding 5 — SUGGESTION: README claim "9 multi-process tests" is mildly inflated

	Severity: SUGGESTION (NIT bordering)
	Evidence:
	- README.md and V1_V8_COVERAGE both state: *"9 multi-process tests
	pinning the allreduce barrier."*
	- Actual breakdown:
	- 4 single-process unit tests + `test_mock_manager_shape_compat` (5)
	- 4 multi-process tests spawning subprocesses (parametrized [2,3] of
	`_runs_allreduce_across_replicas`, `_handles_multiple_rounds`,
	`_reports_failed_replicas`)
	- Of the 4 multi-process tests, only **3 actually exercise the
	allreduce barrier**; `_reports_failed_replicas` deliberately raises
	before any allreduce call.

	Wave 13 clearly does NOT fake-pass via world_size=1 — the multi-
	process barrier is real. But the count is rounded up.

	Fix direction: Replace "9 multi-process tests" with "9 tests
	covering the serverless DiLoCo layer, of which 4 spawn real
	subprocesses and 3 exercise the allreduce barrier across replicas."

	---

	## Finding 6 — SUGGESTION: PRIME-RL channel 1 is REINFORCE not GRPO; ignores `inference_logprobs`

	Severity: SUGGESTION
	Evidence: `composer_replication/recipes/prime_rl/composer_loss.py:62-68`
	computes:
	```python
	grpo_loss = -(advantages * trainer_lp * mask).sum() / mask.sum().clamp_min(epsilon)
	```

	This is plain REINFORCE with advantage. PRIME-RL's `LossInputs`
	exposes `inference_logprobs` precisely because GRPO-with-replay-buffer
	requires the importance-sampling ratio
	`exp(trainer_lp - inference_lp)` (PPO-style clipped objective).

	The file says "SKELETON" so this isn't a hidden bug per se, but the
	loss is labeled GRPO and is not GRPO.

	Fix direction: Either implement the ratio + clipping (~20 LOC) or
	rename channel-1 comment to "REINFORCE-with-advantage stub" with a TODO.

	---

	## Finding 7 — NIT: ModalExecutor / HFJobsExecutor are skeleton-only with `NotImplementedError` in `__init__`

	Severity: NIT (this is documented, but README phrasing is slightly soft)
	Evidence: Honestly documented as skeletons in the code, ADR-005,
	and README. NIT: a user trying `ModalExecutor()` gets a runtime error
	rather than an import-time clue.

	Fix direction: Low priority. Update README phrase to "skeleton-only
	— raises NotImplementedError until v0.x." Or use a `__getattr__` on
	the package that raises a clearer message.

	---

	## Finding 8 — NIT: SimPO test uses positive log-probs (impossible values)

	Severity: NIT
	Evidence: `test_distillation_losses.py:27-46` calls `simpo_loss`
	with `chosen=tensor([0.5, 0.4, 0.3])`. Log-probabilities are bounded
	above by 0; positive values aren't possible from any softmax. The tests
	still verify the formula correctly, but the test inputs aren't legal.

	Fix direction: Use negative values — purely cosmetic.

	---

	## Cross-cutting risk check

	73 tests passed in 29.29s on the CPU-fast subset. Spike 008 5/5 still
	pass. The new `composer_replication.diloco.serverless` package is
	purely additive; the existing `make_diloco_outer_loop` is untouched.
	No cross-wave regressions detected on CPU. GPU tests + slow CPU
	e2e tests not re-run; regression risk low since Wave 13 doesn't touch
	their dependencies.

	---

	## Summary scorecard

	\| Item \| Verdict \|
	\|---\|---\|
	\| Distillation module (SimPO/TAID/Entropy-Aware OPD) standalone \| ✅ Real, well-tested, paper-faithful \|
	\| Distillation integrated into `compose_loss` \| ❌ Not implemented despite ADR-007 (Finding 2) \|
	\| ObjectStoreAllReduce + LocalProcessExecutor \| ✅ Real multi-process barrier validated \|
	\| MockManager → DiLoCo drop-in \| 🟡 Shape-checked only; integration unverified (Finding 4) \|
	\| Modal/HFJobs adapters \| 🟡 Honestly documented as skeletons (Finding 7) \|
	\| Replaysim DJNormalizer passthrough \| ✅ Works \|
	\| Replaysim default.yaml against real data-juicer \| ❌ Recipe field types don't match record shape (Finding 3) \|
	\| PRIME-RL composer_loss.loss_fn \| ❌ SDPO term silently 0 (Finding 1); channel 1 is REINFORCE not GRPO (Finding 6) \|
	\| Monarch actors \| ✅ Honest skeleton; raises NotImplementedError \|
	\| Altered-minds tie-in doc \| ✅ Design-only, scoped honestly \|
	\| 35 new tests \| All pass; 3 of 4 multi-process tests are genuine (Finding 5) \|

	Recommendation: Address Findings 1 and 2 before publishing the
	Wave 13 expansion as "closed." Findings 3 and 4 should be addressed
	before any user attempts the real data-juicer or real torchft DiLoCo
	path. Findings 5–8 are cleanup.