ADR-003 — DiLoCo implementation choice for Spike 008

Status: Accepted
Date: 2026-05-26
Wave: Phase 4 (deep work loop)

Context

Spike 008 closes V2 of the vision validation. DiLoCo was tagged "deferred to v0.2" in Wave 5, but the original brief said "combine with DiLoCo," so we want a real working integration, not a hand-rolled toy.

The integration target: take the pseudo-gradient δ = θ_local − θ_initial after N inner steps, average it across replicas, then apply a Nesterov-momentum outer step. We need a PyTorch-compatible reference implementation that runs in a single process for unit tests and also scales out on torch.distributed when we eventually run real multi-replica training.
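A minimal sketch of that outer step, written in plain Python so it runs without torch. It is illustrative only and is not torchft's code: `OuterNesterov`, `outer_round`, and the learning-rate and momentum defaults are hypothetical names and values chosen for the example. The momentum update follows the form `torch.optim.SGD` uses with `nesterov=True` and zero dampening, and the sketch assumes replica deltas are combined by a plain mean.

```python
from dataclasses import dataclass, field


@dataclass
class OuterNesterov:
    """Toy outer optimizer: SGD with Nesterov momentum over a flat list of floats."""
    lr: float = 0.7          # illustrative value, not a tuned setting
    momentum: float = 0.9    # illustrative value, not a tuned setting
    buf: list = field(default_factory=list)

    def step(self, theta, grad):
        if not self.buf:
            self.buf = [0.0] * len(theta)
        # Momentum buffer update, then the Nesterov look-ahead gradient.
        self.buf = [self.momentum * b + g for b, g in zip(self.buf, grad)]
        lookahead = [g + self.momentum * b for g, b in zip(grad, self.buf)]
        return [t - self.lr * d for t, d in zip(theta, lookahead)]


def outer_round(theta_initial, replica_params, opt):
    """One outer round: average the replicas' deltas, then take an outer step."""
    n = len(replica_params)
    dim = len(theta_initial)
    # Spec convention: delta = theta_local - theta_initial, averaged over replicas.
    delta = [sum(p[i] for p in replica_params) / n - theta_initial[i] for i in range(dim)]
    # A descent-style optimizer subtracts its input, so feed it the negated delta.
    outer_grad = [-d for d in delta]
    return opt.step(theta_initial, outer_grad)


theta_next = outer_round(
    theta_initial=[1.0, 2.0],
    replica_params=[[0.8, 2.2], [0.6, 2.4]],
    opt=OuterNesterov(),
)
```

The negation before `opt.step` is the one place the sign convention matters: an optimizer that subtracts `lr * grad` only moves toward the replicas' average if it is handed θ_initial − θ_local.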

Options considered

| Repo | License | Last commit | Maturity | Streaming variant? | Single-process testable? |
| --- | --- | --- | --- | --- | --- |
| `meta-pytorch/torchft` | BSD-3 | 2026-04-03 | Library (PyPI, prebuilt wheels, real test suite, Meta-maintained) | Yes (DiLoCo class IS the Streaming generalization; vanilla = single fragment) | Yes (verified via `MagicMock(Manager)` + `_DummyWork` pattern in their own tests) |
| `OpenDiLoCo` (PrimeIntellect) | Apache 2.0 | 2024 | README says "no longer maintained"; replaced by `prime` | Partial | Hivemind dependency complicates testing |
| `prime` / INTELLECT-1 (PrimeIntellect) | Apache 2.0 | 2025 | Production framework (`ElasticDeviceMesh` etc.) | Yes | Heavy harness; not single-process friendly |
| `diloco_simple` | No LICENSE file | 2024-05-31 | 8 commits ever; pedagogical | No | NCCL-locked |
| DeepMind original (Douillard et al. arXiv:2311.08105) | n/a | n/a | No public reference impl | n/a | n/a |

Source: docs/research/DILOCO_RECONNAISSANCE.md (subagent recon, 2026-05-26).

Decision

meta-pytorch/torchft — torchft.local_sgd.DiLoCo (BSD-3, active, single-process testable).

Rationale:

Library, not research code. Proper packaging on PyPI with prebuilt wheels (pip install torchft-nightly), real test suite, version history, maintained by Meta. The other live candidates are research codebases that break on torch version bumps.
Streaming DiLoCo is the generalization. The DiLoCo class accepts model_fragments + fragment_sync_delay + fragment_update_alpha. Set model_fragments=[model] (single fragment, full-model sync) for vanilla DiLoCo. Add fragments + per-fragment delays for Streaming. We don't have to choose at the API level — both modes are one parameter apart.
Single-process unit-testable. torchft's own tests use MagicMock(Manager) + _DummyWork to bypass NCCL. We can do the same with a shared-buffer mock allreduce that does real averaging across two in-process replicas; the recon doc records this as a verified working pattern.
Direct extension point. Pseudo-gradient computation lives in _save_grads (line 324) and perform_sync (line 423); we can subclass or monkey-patch these to compose with our Composer trainer.
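The shared-buffer mock allreduce mentioned in the rationale can be sketched with the standard library alone. This is a hypothetical stand-in, not torchft's `_DummyWork` or its real `Manager` interface: `SharedBuffer`, `make_mock_manager`, and the list-of-floats payload are assumptions made for illustration, and the actual `allreduce` signature and return type in torchft should be checked against the pinned wheel.

```python
from unittest.mock import MagicMock


class SharedBuffer:
    """Collects one contribution per replica, then serves their element-wise mean."""

    def __init__(self, world_size):
        self.world_size = world_size
        self.slots = {}

    def deposit(self, rank, values):
        self.slots[rank] = list(values)

    def wait(self):
        # Mimics waiting on a collective: only valid once every rank has deposited.
        if len(self.slots) != self.world_size:
            raise RuntimeError("not every replica has contributed yet")
        columns = zip(*(self.slots[r] for r in sorted(self.slots)))
        return [sum(col) / self.world_size for col in columns]


def make_mock_manager(rank, shared):
    """Build a MagicMock whose allreduce writes into the shared buffer (no NCCL)."""
    manager = MagicMock(name=f"manager_rank{rank}")

    def fake_allreduce(values):
        shared.deposit(rank, values)
        return shared  # the handle the caller waits on

    manager.allreduce.side_effect = fake_allreduce
    return manager


shared = SharedBuffer(world_size=2)
m0 = make_mock_manager(0, shared)
m1 = make_mock_manager(1, shared)

h0 = m0.allreduce([1.0, 3.0])
h1 = m1.allreduce([3.0, 5.0])
averaged = h0.wait()  # both replicas read the same mean
```

Because the replicas run sequentially in one process, the buffer refuses to return a mean until every rank has deposited, which surfaces ordering bugs in the test instead of silently averaging a partial set.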

Risks accepted (with mitigations)

| Risk | Mitigation |
| --- | --- |
| Sign convention mismatch: torchft computes `θ_initial − θ_local` (negation of our spec) | Explicit unit test: assert outer step direction matches DiLoCo paper sign. Document the convention in our `outer_optimizer.py`. |
| Wheel brittleness for nightly | Pin a specific dated nightly slug in our pyproject.toml; bump deliberately. |
| `torch>=2.7` requirement | Confirm our existing eidolon venv has it. Already does (verified). |
| `fragment_sync_delay > 0` requires CUDA streams | Spike 008 uses `fragment_sync_delay=0` (vanilla DiLoCo) for the smoke. Streaming with non-zero delay deferred to v0.2 (post-replication). |

Consequences

Accepted

Spike 008 imports torchft.local_sgd.DiLoCo and runs the recon doc's ready-to-paste pytest pattern as the smoke:
- 2 replicas, 4 inner steps, 2 outer rounds on a TinyMLP
- shared-buffer mock allreduce (no NCCL)
- assertions: replica equality after sync, params actually moved, Nesterov state populated, sync count matches expected
The composer_replication.diloco package wraps torchft.local_sgd.DiLoCo with our trainer's hooks. We DO NOT fork torchft; we depend on it as a versioned wheel.
Our integration is "vanilla DiLoCo" (single fragment, full-model sync) for v0.1. Streaming DiLoCo is one configuration flag away in v0.2.
The sign-convention mismatch is made explicit in our wrapper code with a unit test that catches a sign flip if torchft ever inverts it.
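One possible shape for that sign-convention guard, as a sketch in plain Python. `library_pseudograd` is a hypothetical stand-in for whatever the wrapper reads back from torchft; the real test would substitute the library's actual output. All function names here are illustrative.

```python
def spec_delta(theta_initial, theta_local):
    """Delta as written in this ADR's spec: theta_local - theta_initial."""
    return [loc - init for init, loc in zip(theta_initial, theta_local)]


def library_pseudograd(theta_initial, theta_local):
    """Stand-in for the quantity the ADR attributes to torchft: theta_initial - theta_local."""
    return [init - loc for init, loc in zip(theta_initial, theta_local)]


def descent_step(theta, grad, lr):
    """Plain descent update, the direction an SGD-style outer optimizer applies."""
    return [t - lr * g for t, g in zip(theta, grad)]


def test_sign_convention():
    theta_initial = [1.0, -2.0, 0.5]
    theta_local = [0.5, -1.0, 0.5]

    delta = spec_delta(theta_initial, theta_local)
    grad = library_pseudograd(theta_initial, theta_local)

    # The library quantity must be the exact negation of the spec's delta.
    assert grad == [-d for d in delta]

    # A descent step on that quantity must move the global params toward the replica.
    stepped = descent_step(theta_initial, grad, lr=0.5)
    before = sum(abs(l - i) for i, l in zip(theta_initial, theta_local))
    after = sum(abs(l - s) for s, l in zip(stepped, theta_local))
    assert after < before
    return stepped


stepped = test_sign_convention()
```

If the library ever flipped to the spec's sign, the first assertion would fail immediately, and the distance check would catch a wrapper that negates twice.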

Rejected paths

Roll our own DiLoCo. Tempting (the algorithm is short) but the test surface for distributed-correctness is large; reusing a Meta-maintained library cuts the audit burden.
diloco_simple. Disqualified by the absence of a license alone.
prime / INTELLECT-1. Right tool for production multi-node runs, wrong tool for a single-process unit test.

Source

docs/research/DILOCO_RECONNAISSANCE.md (subagent recon, primary-sourced from torchft repo cloned + read locally, 2026-05-26).