# ADR-003 — DiLoCo implementation choice for Spike 008
**Status**: Accepted

**Date**: 2026-05-26

**Wave**: Phase 4 (deep work loop)
## Context
Spike 008 closes V2 of the vision validation: DiLoCo was tagged "deferred to
v0.2" in Wave 5, but the original brief said "combine with DiLoCo." We want a
real working integration, not a hand-rolled toy.

The integration target: take the pseudo-gradient `δ = θ_local − θ_initial` after N
inner steps, then apply a Nesterov-momentum outer step across replicas. We need a
PyTorch-compatible reference implementation that runs in a single process for
unit tests and scales out on `torch.distributed` when we eventually run real
multi-replica training.
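The outer step described above can be sketched in plain Python, without torch, to make the arithmetic concrete. This is a minimal illustration under stated assumptions, not torchft's implementation: `OuterNesterov`, `outer_round`, and the `lr`/`momentum` values are hypothetical names and numbers chosen for the example, and the momentum formula follows the usual SGD-with-Nesterov form (buffer update, then look-ahead correction). Because a descent-style optimizer subtracts whatever it is given, the sketch feeds it `−δ`.

```python
from dataclasses import dataclass, field


@dataclass
class OuterNesterov:
    """Hypothetical outer optimizer: SGD with Nesterov momentum over flat lists."""

    lr: float = 0.7        # illustrative value, not taken from this ADR
    momentum: float = 0.9  # illustrative value, not taken from this ADR
    buf: list = field(default_factory=list)

    def step(self, theta, grad):
        if not self.buf:
            self.buf = [0.0] * len(theta)
        out = []
        for i, (p, g) in enumerate(zip(theta, grad)):
            self.buf[i] = self.momentum * self.buf[i] + g  # momentum buffer
            lookahead = g + self.momentum * self.buf[i]    # Nesterov correction
            out.append(p - self.lr * lookahead)            # descent step
        return out


def outer_round(theta_initial, replica_thetas, opt):
    """One outer round: average delta across replicas, then take one outer step."""
    n = len(replica_thetas)
    # delta = theta_local - theta_initial, averaged across replicas
    avg_delta = [
        sum(r[i] for r in replica_thetas) / n - theta_initial[i]
        for i in range(len(theta_initial))
    ]
    # The optimizer subtracts its input, so pass -delta to move toward the replicas.
    outer_grad = [-d for d in avg_delta]
    return opt.step(theta_initial, outer_grad)


theta0 = [1.0, -2.0]
replicas = [[0.8, -1.6], [0.6, -1.8]]  # parameters after N inner steps
theta1 = outer_round(theta0, replicas, OuterNesterov())
```

With an empty momentum buffer, the first outer step scales the averaged delta by `lr * (1 + momentum)`, so `theta1` lands on the replica side of `theta0` in every coordinate.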
## Options considered
| Repo | License | Last commit | Maturity | Streaming variant? | Single-process testable? |
|---|---|---|---|---|---|
| `meta-pytorch/torchft` | BSD-3 | 2026-04-03 | Library (PyPI, prebuilt wheels, real test suite, Meta-maintained) | Yes (DiLoCo class IS the Streaming generalization; vanilla = single fragment) | Yes (verified via `MagicMock(Manager)` + `_DummyWork` pattern in their own tests) |
| `OpenDiLoCo` (PrimeIntellect) | Apache 2.0 | 2024 | README says "no longer maintained"; replaced by `prime` | Partial | Hivemind dependency complicates testing |
| `prime` / INTELLECT-1 (PrimeIntellect) | Apache 2.0 | 2025 | Production framework (`ElasticDeviceMesh` etc.) | Yes | Heavy harness; not single-process friendly |
| `diloco_simple` | **No LICENSE file** | 2024-05-31 | 8 commits ever; pedagogical | No | NCCL-locked |
| DeepMind original (Douillard et al., arXiv:2311.08105) | n/a | n/a | No public reference impl | n/a | n/a |

Source: `docs/research/DILOCO_RECONNAISSANCE.md` (subagent recon, 2026-05-26).
## Decision
**`meta-pytorch/torchft`, specifically `torchft.local_sgd.DiLoCo`** (BSD-3, active,
single-process testable).

Rationale:
1. **Library, not research code.** Proper packaging on PyPI with prebuilt
wheels (`pip install torchft-nightly`), real test suite, version history,
maintained by Meta. The other live candidates are research codebases that
break on torch version bumps.
2. **Streaming DiLoCo is the generalization.** The `DiLoCo` class accepts
`model_fragments`, `fragment_sync_delay`, and `fragment_update_alpha`. Set
`model_fragments=[model]` (single fragment, full-model sync) for vanilla
DiLoCo; add fragments and per-fragment delays for Streaming. We don't have
to choose at the API level: both modes are one parameter apart.
3. **Single-process unit-testable.** torchft's own tests use
`MagicMock(Manager)` + `_DummyWork` to bypass NCCL. We can do the same with a
shared-buffer mock allreduce that does real averaging across two
in-process replicas. The recon doc contains a verified working version of this pattern.
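4. **Pseudo-gradient computation is in `_save_grads` (line 324) and
`perform_sync` (line 423).** Line numbers refer to the torchft checkout read
for the recon on 2026-05-26 and will drift. These are direct extension points:
we can subclass or monkey-patch them to compose with our Composer trainer.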
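The shared-buffer mock allreduce from point 3 can be sketched as follows. This is an illustration of the idea only, not torchft's test code: `SharedBuffer`, `_LazyAverage`, and `make_mock_manager` are hypothetical helpers, and the `allreduce` attribute set on the `MagicMock` is an assumed method name rather than a verified torchft `Manager` signature. Each replica registers its values, and the handle's `wait()` overwrites them in place with the true average once every replica has contributed.

```python
from unittest.mock import MagicMock


class _LazyAverage:
    """Hypothetical stand-in for a work handle: averaging happens on wait()."""

    def __init__(self, shared, target):
        self.shared = shared
        self.target = target

    def wait(self):
        rows = self.shared.contributions
        if len(rows) != self.shared.world_size:
            raise RuntimeError("not every replica has contributed yet")
        # Overwrite the caller's list in place, like a real allreduce would.
        self.target[:] = [sum(col) / len(rows) for col in zip(*rows)]
        return True


class SharedBuffer:
    """In-process buffer shared by all mock managers (no NCCL involved)."""

    def __init__(self, world_size):
        self.world_size = world_size
        self.contributions = []

    def allreduce(self, values):
        self.contributions.append(list(values))  # snapshot before averaging
        return _LazyAverage(self, values)


def make_mock_manager(shared):
    manager = MagicMock()  # assumption: no spec, `allreduce` is an illustrative name
    manager.allreduce.side_effect = shared.allreduce
    return manager


shared = SharedBuffer(world_size=2)
manager_a, manager_b = make_mock_manager(shared), make_mock_manager(shared)

grads_a, grads_b = [1.0, 3.0], [3.0, 5.0]
work_a = manager_a.allreduce(grads_a)
work_b = manager_b.allreduce(grads_b)
work_a.wait()
work_b.wait()
```

Deferring the average to `wait()` is what lets two replicas run sequentially in one process: neither call blocks, and both lists end up holding the same averaged values.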
### Risks accepted (with mitigations)
| Risk | Mitigation |
|---|---|
| **Sign convention mismatch** — torchft computes `θ_initial − θ_local` (negation of our spec) | Explicit unit test: assert outer step direction matches DiLoCo paper sign. Document the convention in our `outer_optimizer.py`. |
| **Wheel brittleness for nightly** | Pin a specific dated nightly slug in our pyproject.toml; bump deliberately. |
| **`torch>=2.7` requirement** | Our existing eidolon venv already satisfies this (verified). |
| **`fragment_sync_delay > 0` requires CUDA streams** | Spike 008 uses `fragment_sync_delay=0` (vanilla DiLoCo) for the smoke. Streaming with non-zero delay deferred to v0.2 (post-replication). |
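The sign-convention check in the first row can be sketched as a direction test. This is a hedged illustration in plain Python, not the actual contents of `outer_optimizer.py`: all function names here are hypothetical, and the torchft convention is taken as reported in the table above. The idea is that with a descent-style optimizer, feeding `θ_initial − θ_local` moves parameters toward the replica, while feeding `θ_local − θ_initial` into the same optimizer moves them away, which is the flip the test should catch.

```python
def grad_descent_convention(theta_initial, theta_local):
    """theta_initial - theta_local: the form a descent optimizer expects."""
    return [a - b for a, b in zip(theta_initial, theta_local)]


def grad_spec_convention(theta_initial, theta_local):
    """theta_local - theta_initial: the delta as written in this ADR's Context."""
    return [b - a for a, b in zip(theta_initial, theta_local)]


def sgd_step(theta, grad, lr):
    """Plain descent step, standing in for any optimizer that subtracts lr * grad."""
    return [p - lr * g for p, g in zip(theta, grad)]


def moves_toward_replica(theta_initial, theta_local, theta_after):
    """True if the applied step has positive dot product with (local - initial)."""
    step = [a - i for a, i in zip(theta_after, theta_initial)]
    target = [l - i for l, i in zip(theta_local, theta_initial)]
    return sum(s * t for s, t in zip(step, target)) > 0


theta_initial = [0.0, 1.0]
theta_local = [1.0, 0.0]

good = sgd_step(theta_initial, grad_descent_convention(theta_initial, theta_local), lr=0.5)
bad = sgd_step(theta_initial, grad_spec_convention(theta_initial, theta_local), lr=0.5)

good_ok = moves_toward_replica(theta_initial, theta_local, good)
bad_ok = moves_toward_replica(theta_initial, theta_local, bad)
```

Checking the direction of the applied step, rather than the sign of an intermediate tensor, keeps the test valid whichever convention the wrapped library uses internally.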
## Consequences
### Accepted
- Spike 008 imports `torchft.local_sgd.DiLoCo` and runs the recon doc's
  ready-to-paste pytest pattern as the smoke:
  - 2 replicas, 4 inner steps, 2 outer rounds on a TinyMLP
  - shared-buffer mock allreduce (no NCCL)
  - assertions: replica equality after sync, params actually moved, Nesterov
    state populated, sync count matches expected
- `composer_replication.diloco` package wraps `torchft.local_sgd.DiLoCo`
with our trainer's hooks. We DO NOT fork torchft — we depend on it as a
versioned wheel.
- Our integration is "vanilla DiLoCo" (single fragment, full-model sync) for
v0.1. Streaming DiLoCo is a configuration flag away in v0.2.
- The sign-convention mismatch is **made explicit** in our wrapper code with
a unit test that catches a sign flip if torchft ever inverts it.
### Rejected paths
- **Roll our own DiLoCo.** Tempting (the algorithm is short) but the test
surface for distributed-correctness is large; reusing a Meta-maintained
library cuts the audit burden.
- **`diloco_simple`.** Disqualified by the missing license alone.
- **`prime` / INTELLECT-1.** Right tool for production multi-node runs,
wrong tool for a single-process unit test.
## Source
`docs/research/DILOCO_RECONNAISSANCE.md` (subagent recon, primary-sourced
from a local clone of the torchft repo, read 2026-05-26).