	# ADR-003 — DiLoCo implementation choice for Spike 008

	Status: Accepted
	Date: 2026-05-26
	Wave: Phase 4 (deep work loop)

	## Context

	Spike 008 closes V2 of the vision validation: DiLoCo was tagged "deferred to
	v0.2" in Wave 5, but the original brief said "combine with DiLoCo." We want a
	real working integration, not a hand-rolled toy.

	The integration target: take the pseudo-gradient `δ = θ_local − θ_initial` after N
	inner steps, average it across replicas, and apply a Nesterov-momentum outer
	step. We need a PyTorch-compatible reference implementation that runs in a
	single process for unit tests and scales out on `torch.distributed` when we
	eventually run real multi-replica training.
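
The outer step described above can be sketched without any framework. The following is a minimal, dependency-free illustration, not torchft's implementation: `pseudo_gradient`, `mean_across_replicas`, `NesterovOuter`, and `outer_step` are hypothetical names, and the momentum update is assumed to follow the form used by `torch.optim.SGD(nesterov=True)`. It keeps this document's convention `δ = θ_local − θ_initial` and negates the averaged δ before handing it to the optimizer as a gradient.

```python
from typing import List, Optional

Vector = List[float]


def pseudo_gradient(theta_initial: Vector, theta_local: Vector) -> Vector:
    """This document's convention: delta = theta_local - theta_initial."""
    return [loc - init for loc, init in zip(theta_local, theta_initial)]


def mean_across_replicas(deltas: List[Vector]) -> Vector:
    """Element-wise mean; stands in for an averaging allreduce."""
    n = len(deltas)
    return [sum(col) / n for col in zip(*deltas)]


class NesterovOuter:
    """SGD with Nesterov momentum (assumed to mirror torch.optim.SGD's form)."""

    def __init__(self, lr: float, momentum: float) -> None:
        self.lr = lr
        self.momentum = momentum
        self.buf: Optional[Vector] = None  # momentum buffer, set on first step

    def step(self, theta: Vector, grad: Vector) -> Vector:
        if self.buf is None:
            self.buf = list(grad)
        else:
            self.buf = [self.momentum * b + g for b, g in zip(self.buf, grad)]
        # Nesterov look-ahead: gradient plus momentum-scaled buffer.
        update = [g + self.momentum * b for g, b in zip(grad, self.buf)]
        return [t - self.lr * u for t, u in zip(theta, update)]


def outer_step(theta_initial: Vector, replica_thetas: List[Vector],
               opt: NesterovOuter) -> Vector:
    deltas = [pseudo_gradient(theta_initial, t) for t in replica_thetas]
    avg_delta = mean_across_replicas(deltas)
    # Negate so that a descent step moves toward the replicas' parameters.
    outer_grad = [-d for d in avg_delta]
    return opt.step(theta_initial, outer_grad)
```

With `lr=1.0` and `momentum=0.0` the outer step reduces to plain averaging of the replicas' parameters, which makes a convenient sanity check.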

	## Options considered

	| Repo | License | Last commit | Maturity | Streaming variant? | Single-process testable? |
	|---|---|---|---|---|---|
	| `meta-pytorch/torchft` | BSD-3 | 2026-04-03 | Library (PyPI, prebuilt wheels, real test suite, Meta-maintained) | Yes (DiLoCo class IS the Streaming generalization; vanilla = single fragment) | Yes (verified via `MagicMock(Manager)` + `_DummyWork` pattern in their own tests) |
	| `OpenDiLoCo` (PrimeIntellect) | Apache 2.0 | 2024 | README says "no longer maintained"; replaced by `prime` | Partial | Hivemind dependency complicates testing |
	| `prime` / INTELLECT-1 (PrimeIntellect) | Apache 2.0 | 2025 | Production framework (`ElasticDeviceMesh` etc.) | Yes | Heavy harness; not single-process friendly |
	| `diloco_simple` | No LICENSE file | 2024-05-31 | 8 commits ever; pedagogical | No | NCCL-locked |
	| DeepMind original (Douillard et al. arXiv:2311.08105) | n/a | n/a | No public reference impl | n/a | n/a |

	Source: `docs/research/DILOCO_RECONNAISSANCE.md` (subagent recon, 2026-05-26).

	## Decision

	`meta-pytorch/torchft` — `torchft.local_sgd.DiLoCo` (BSD-3, active,
	single-process testable).

	Rationale:

	1. **Library, not research code.** Properly packaged on PyPI with prebuilt
	wheels (`pip install torchft-nightly`), with a real test suite and version
	history, and maintained by Meta. The other live candidates are research
	codebases, which tend to break on torch version bumps.

	2. **Streaming DiLoCo is the generalization.** The `DiLoCo` class accepts
	`model_fragments`, `fragment_sync_delay`, and `fragment_update_alpha`. Set
	`model_fragments=[model]` (single fragment, full-model sync) for vanilla
	DiLoCo; add fragments and per-fragment delays for Streaming. We don't have
	to choose at the API level, because the two modes are one parameter apart.

	3. **Single-process unit-testable.** torchft's own tests use
	`MagicMock(Manager)` + `_DummyWork` to bypass NCCL. We can do the same with a
	shared-buffer mock allreduce that does real averaging across two
	in-process replicas. The recon doc records this as a verified working pattern.

	4. **Pseudo-gradient computation is in `_save_grads` (line 324) and
	`perform_sync` (line 423).** Direct extension point — we can subclass or
	monkey-patch these to compose with our Composer trainer.
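
The mocking approach in item 3 can be illustrated with the standard library alone. This sketch is a simplified stand-in, not torchft's test code: `SharedBuffer` and `make_mock_manager` are hypothetical helpers, the mocked `allreduce` method taking a plain list of floats is an assumption made for brevity, and torchft's real `Manager` and `_DummyWork` are not imported here.

```python
from typing import List
from unittest.mock import MagicMock


class SharedBuffer:
    """In-process stand-in for the collective: gathers one vector per replica."""

    def __init__(self, world_size: int) -> None:
        self.world_size = world_size
        self.contributions: List[List[float]] = []

    def contribute(self, values: List[float]) -> None:
        self.contributions.append(list(values))

    def average(self) -> List[float]:
        if len(self.contributions) != self.world_size:
            raise RuntimeError("not every replica has contributed yet")
        return [sum(col) / self.world_size for col in zip(*self.contributions)]


def make_mock_manager(buffer: SharedBuffer) -> MagicMock:
    """Return a MagicMock whose allreduce() records into the shared buffer."""
    manager = MagicMock()
    manager.allreduce.side_effect = buffer.contribute
    return manager


# Two in-process "replicas" share one buffer, so the averaging is real
# even though no process group or NCCL backend exists.
buffer = SharedBuffer(world_size=2)
managers = [make_mock_manager(buffer) for _ in range(2)]
managers[0].allreduce([1.0, 3.0])
managers[1].allreduce([3.0, 5.0])
averaged = buffer.average()
```

Because the mock records every call, a test can also assert how many times each replica synchronized, for example with `managers[0].allreduce.assert_called_once()`.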

	### Risks accepted (with mitigations)

	| Risk | Mitigation |
	|---|---|
	| Sign convention mismatch: torchft computes `θ_initial − θ_local` (the negation of our spec) | Explicit unit test: assert outer step direction matches DiLoCo paper sign. Document the convention in our `outer_optimizer.py`. |
	| Wheel brittleness for nightly | Pin a specific dated nightly slug in our pyproject.toml; bump deliberately. |
	| `torch>=2.7` requirement | Our existing eidolon venv already satisfies it (verified). |
	| `fragment_sync_delay > 0` requires CUDA streams | Spike 008 uses `fragment_sync_delay=0` (vanilla DiLoCo) for the smoke test. Streaming with non-zero delay is deferred to v0.2 (post-replication). |
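
The sign-convention mitigation could be guarded by a test along these lines. It is a sketch under stated assumptions: `spec_delta`, `library_delta`, and `descent_step` are hypothetical helpers that model the two conventions named in the table, and nothing here calls torchft.

```python
def spec_delta(theta_initial, theta_local):
    # Convention from this ADR's Context section: local minus initial.
    return [loc - init for loc, init in zip(theta_local, theta_initial)]


def library_delta(theta_initial, theta_local):
    # Convention the risk table attributes to torchft: initial minus local.
    return [init - loc for loc, init in zip(theta_local, theta_initial)]


def descent_step(theta, grad, lr):
    # Plain gradient descent: theta <- theta - lr * grad.
    return [t - lr * g for t, g in zip(theta, grad)]


def test_sign_convention():
    theta_initial = [1.0, -2.0]
    theta_local = [0.5, -1.0]

    spec = spec_delta(theta_initial, theta_local)
    lib = library_delta(theta_initial, theta_local)

    # The two conventions are exact negations of each other.
    assert lib == [-d for d in spec]

    # Treated as a gradient, the initial-minus-local form moves the
    # parameters toward the local replica ...
    assert descent_step(theta_initial, lib, lr=0.5) == [0.75, -1.5]

    # ... while the local-minus-initial form, fed in unnegated, moves away.
    assert descent_step(theta_initial, spec, lr=0.5) == [1.25, -2.5]
```

A wrapper that stores δ in the spec's convention would therefore need to negate it before the outer optimizer step; the last assertion shows what an accidental flip looks like.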

	## Consequences

	### Accepted

	- Spike 008 imports `torchft.local_sgd.DiLoCo` and runs the recon doc's
	  ready-to-paste pytest pattern as the smoke test:
	  - 2 replicas, 4 inner steps, 2 outer rounds on a TinyMLP
	  - shared-buffer mock allreduce (no NCCL)
	  - assertions: replica equality after sync, params actually moved, Nesterov
	    state populated, sync count matches expected

	- `composer_replication.diloco` package wraps `torchft.local_sgd.DiLoCo`
	with our trainer's hooks. We DO NOT fork torchft — we depend on it as a
	versioned wheel.

	- Our integration is "vanilla DiLoCo" (single fragment, full-model sync) for
	v0.1. Streaming DiLoCo is one configuration flag away in v0.2.

	- The sign-convention mismatch is made explicit in our wrapper code with
	a unit test that catches a sign flip if torchft ever inverts it.
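
The shape of the smoke test in the first bullet can be shown in miniature. This is a dependency-free sketch, not the recon doc's pytest pattern: a two-parameter quadratic objective stands in for the TinyMLP, and `Replica`, `run_smoke`, the learning rates, and the per-replica targets are all hypothetical.

```python
INNER_STEPS, OUTER_ROUNDS = 4, 2
INNER_LR, OUTER_LR, MOMENTUM = 0.1, 0.5, 0.5  # illustrative values only


class Replica:
    """One in-process worker minimizing 0.5 * ||theta - target||^2."""

    def __init__(self, theta, target):
        self.theta = list(theta)
        self.target = target

    def inner_step(self):
        grad = [t - y for t, y in zip(self.theta, self.target)]
        self.theta = [t - INNER_LR * g for t, g in zip(self.theta, grad)]


def run_smoke():
    start = [0.0, 0.0]
    replicas = [Replica(start, [1.0, 0.0]), Replica(start, [0.0, 1.0])]
    global_theta = list(start)
    momentum_buf = None
    sync_count = 0
    for _ in range(OUTER_ROUNDS):
        for replica in replicas:
            for _ in range(INNER_STEPS):
                replica.inner_step()
        # Mock allreduce: average (initial - local) across replicas.
        grads = [[g - t for g, t in zip(global_theta, r.theta)] for r in replicas]
        avg = [sum(col) / len(replicas) for col in zip(*grads)]
        # Nesterov-momentum outer step on the averaged pseudo-gradient.
        if momentum_buf is None:
            momentum_buf = list(avg)
        else:
            momentum_buf = [MOMENTUM * b + g for b, g in zip(momentum_buf, avg)]
        update = [g + MOMENTUM * b for g, b in zip(avg, momentum_buf)]
        global_theta = [t - OUTER_LR * u for t, u in zip(global_theta, update)]
        # Sync: every replica restarts the next round from the new global state.
        for replica in replicas:
            replica.theta = list(global_theta)
        sync_count += 1
    return start, replicas, momentum_buf, sync_count


start, replicas, momentum_buf, sync_count = run_smoke()
assert replicas[0].theta == replicas[1].theta  # replica equality after sync
assert replicas[0].theta != start              # params actually moved
assert momentum_buf is not None                # Nesterov state populated
assert sync_count == OUTER_ROUNDS              # sync count matches expected
```

The four assertions mirror the four checks listed in the bullet; the real smoke test would make the same checks against torchft's optimizer state and the mocked manager's call count.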

	### Rejected paths

	- Roll our own DiLoCo. Tempting (the algorithm is short) but the test
	surface for distributed-correctness is large; reusing a Meta-maintained
	library cuts the audit burden.
	- `diloco_simple`. Disqualified by the absence of a license alone.
	- `prime` / INTELLECT-1. Right tool for production multi-node runs,
	wrong tool for a single-process unit test.

	## Source

	`docs/research/DILOCO_RECONNAISSANCE.md` (subagent recon, primary-sourced
	from a local clone of the torchft repo, 2026-05-26).