# ADR-003 — DiLoCo implementation choice for Spike 008

**Status**: Accepted

**Date**: 2026-05-26

**Wave**: Phase 4 (deep work loop)

## Context

Spike 008 closes V2 of the vision validation: DiLoCo was tagged "deferred to v0.2" in Wave 5 but the original brief said "combine with DiLoCo." We want a real working integration, not a hand-rolled toy.

The integration target: take pseudo-gradient `δ = θ_local − θ_initial` after N inner steps, apply Nesterov-momentum outer step across replicas. We need a PyTorch-compatible reference implementation that runs in single-process for unit tests AND scales out on torch.distributed when we eventually run real multi-replica training.

## Options considered

| Repo | License | Last commit | Maturity | Streaming variant? | Single-process testable? |
|---|---|---|---|---|---|
| `meta-pytorch/torchft` | BSD-3 | 2026-04-03 | Library (PyPI, prebuilt wheels, real test suite, Meta-maintained) | Yes (DiLoCo class IS the Streaming generalization; vanilla = single fragment) | Yes (verified via `MagicMock(Manager)` + `_DummyWork` pattern in their own tests) |
| `OpenDiLoCo` (PrimeIntellect) | Apache 2.0 | 2024 | README says "no longer maintained"; replaced by `prime` | Partial | Hivemind dependency complicates testing |
| `prime` / INTELLECT-1 (PrimeIntellect) | Apache 2.0 | 2025 | Production framework (`ElasticDeviceMesh` etc.) | Yes | Heavy harness; not single-process friendly |
| `diloco_simple` | **No LICENSE file** | 2024-05-31 | 8 commits ever; pedagogical | No | NCCL-locked |
| DeepMind original (Douillard et al. arXiv:2311.08105) | — | — | No public reference impl | — | — |

Source: `docs/research/DILOCO_RECONNAISSANCE.md` (subagent recon, 2026-05-26).

## Decision

**`meta-pytorch/torchft` — `torchft.local_sgd.DiLoCo`** (BSD-3, active, single-process testable).

Rationale:

1. **Library, not research code.** Proper packaging on PyPI with prebuilt wheels (`pip install torchft-nightly`), real test suite, version history, maintained by Meta.
   The other live candidates are research codebases that break on torch version bumps.
2. **Streaming DiLoCo is the generalization.** The `DiLoCo` class accepts `model_fragments` + `fragment_sync_delay` + `fragment_update_alpha`. Set `model_fragments=[model]` (single fragment, full-model sync) for vanilla DiLoCo. Add fragments + per-fragment delays for Streaming. We don't have to choose at the API level — both modes are one parameter apart.
3. **Single-process unit-testable.** torchft's own tests use `MagicMock(Manager)` + `_DummyWork` to bypass NCCL. We can do the same: shared-buffer mock allreduce that does real averaging across two in-process replicas. Verified working pattern in the recon doc.
4. **Pseudo-gradient computation is in `_save_grads` (line 324) and `perform_sync` (line 423).** Direct extension point — we can subclass or monkey-patch these to compose with our Composer trainer.

### Risks accepted (with mitigations)

| Risk | Mitigation |
|---|---|
| **Sign convention mismatch** — torchft computes `θ_initial − θ_local` (negation of our spec) | Explicit unit test: assert outer step direction matches DiLoCo paper sign. Document the convention in our `outer_optimizer.py`. |
| **Wheel brittleness for nightly** | Pin a specific dated nightly slug in our pyproject.toml; bump deliberately. |
| **`torch>=2.7` requirement** | Confirm our existing eidolon venv has it. Already does (verified). |
| **`fragment_sync_delay > 0` requires CUDA streams** | Spike 008 uses `fragment_sync_delay=0` (vanilla DiLoCo) for the smoke. Streaming with non-zero delay deferred to v0.2 (post-replication). |

## Consequences

### Accepted

- Spike 008 imports `torchft.local_sgd.DiLoCo` and runs the recon doc's ready-to-paste pytest pattern as the smoke:
  - 2 replicas, 4 inner steps, 2 outer rounds on a TinyMLP
  - shared-buffer mock allreduce (no NCCL)
  - assertions: replica equality after sync, params actually moved, Nesterov state populated, sync count matches expected
- `composer_replication.diloco` package wraps `torchft.local_sgd.DiLoCo` with our trainer's hooks. We DO NOT fork torchft — we depend on it as a versioned wheel.
- Our integration is "vanilla DiLoCo" (single fragment, full-model sync) for v0.1. Streaming DiLoCo is a configuration-flag away in v0.2.
- The sign-convention mismatch is **made explicit** in our wrapper code with a unit test that catches a sign flip if torchft ever inverts it.

### Rejected paths

- **Roll our own DiLoCo.** Tempting (the algorithm is short) but the test surface for distributed-correctness is large; reusing a Meta-maintained library cuts the audit burden.
- **`diloco_simple`.** Disqualified by the license absence alone.
- **`prime` / INTELLECT-1.** Right tool for production multi-node runs, wrong tool for a single-process unit test.

## Source

`docs/research/DILOCO_RECONNAISSANCE.md` (subagent recon, primary-sourced from torchft repo cloned + read locally, 2026-05-26).
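
## Appendix: outer-step sign convention (illustrative)

The sign-convention risk and the shared-buffer mock allreduce described above can be sketched without torch or torchft. This is a minimal, framework-free illustration, not torchft's code: `pseudo_gradient`, `mock_allreduce_mean`, `NesterovOuterSGD`, the parameter values, and the `lr=0.7` / `momentum=0.9` settings are all hypothetical names and numbers chosen for the example. It assumes the outer optimizer is a descent-style SGD with Nesterov momentum, so a pseudo-gradient of `θ_initial − θ_local` moves parameters toward the averaged replicas, while the negated form moves them away.

```python
from typing import List

Vector = List[float]


def mock_allreduce_mean(per_replica: List[Vector]) -> Vector:
    """Average equally sized vectors in-process (stand-in for an NCCL allreduce)."""
    n = len(per_replica)
    return [sum(vals) / n for vals in zip(*per_replica)]


def pseudo_gradient(theta_initial: Vector, theta_local: Vector) -> Vector:
    """theta_initial - theta_local: the convention this ADR attributes to torchft."""
    return [g - l for g, l in zip(theta_initial, theta_local)]


class NesterovOuterSGD:
    """Hypothetical outer optimizer: SGD with Nesterov momentum, no dampening."""

    def __init__(self, lr: float = 0.7, momentum: float = 0.9) -> None:
        self.lr = lr
        self.momentum = momentum
        self.buf: Vector = []  # momentum buffer, empty until the first step

    def step(self, theta: Vector, grad: Vector) -> Vector:
        if not self.buf:
            self.buf = list(grad)  # first step: buffer starts as the gradient
        else:
            self.buf = [self.momentum * b + g for b, g in zip(self.buf, grad)]
        # Nesterov look-ahead: raw gradient plus momentum times the updated buffer.
        update = [g + self.momentum * b for g, b in zip(grad, self.buf)]
        return [t - self.lr * u for t, u in zip(theta, update)]


def moves_toward(new: Vector, old: Vector, target: Vector) -> bool:
    """True if every coordinate moved from `old` in the direction of `target`."""
    return all((n - o) * (t - o) > 0 for n, o, t in zip(new, old, target))


theta_initial = [1.0, -2.0]
# Illustrative values: pretend two replicas finished their inner steps here.
replica_locals = [[0.5, -1.0], [0.7, -1.4]]
avg_local = mock_allreduce_mean(replica_locals)

grads = [pseudo_gradient(theta_initial, local) for local in replica_locals]
avg_grad = mock_allreduce_mean(grads)

outer = NesterovOuterSGD()
theta_new = outer.step(theta_initial, avg_grad)
moved_toward = moves_toward(theta_new, theta_initial, avg_local)

# Feeding the negated pseudo-gradient to the same optimizer flips the direction.
flipped = [-g for g in avg_grad]
theta_flipped = NesterovOuterSGD().step(theta_initial, flipped)
moved_toward_flipped = moves_toward(theta_flipped, theta_initial, avg_local)
```

A wrapper-level unit test along these lines would assert `moved_toward` and `not moved_toward_flipped`. The real check would run against `torchft.local_sgd.DiLoCo` with the mocked manager rather than this stand-in optimizer.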