# ADR-003 — DiLoCo implementation choice for Spike 008

**Status**: Accepted
**Date**: 2026-05-26
**Wave**: Phase 4 (deep work loop)

## Context

Spike 008 closes V2 of the vision validation: DiLoCo was tagged "deferred to
v0.2" in Wave 5 but the original brief said "combine with DiLoCo." We want a
real working integration, not a hand-rolled toy.

The integration target: after N inner steps, compute the pseudo-gradient
`δ = θ_local − θ_initial` on each replica, then apply a Nesterov-momentum
outer step to the average across replicas. We need a PyTorch-compatible
reference implementation that runs in a single process for unit tests and
scales out on `torch.distributed` when we eventually run real multi-replica
training.
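
The outer step can be sketched in plain Python with no framework. This is a minimal illustration, not torchft's code: the function names are hypothetical, parameters are flat lists of floats, and the Nesterov form is assumed to follow the `torch.optim.SGD(nesterov=True)` formulation. Because δ is defined here as `θ_local − θ_initial`, the sketch negates the averaged δ before treating it as a gradient.

```python
from typing import List, Tuple

Vector = List[float]


def pseudo_gradient(theta_initial: Vector, theta_local: Vector) -> Vector:
    """delta = theta_local - theta_initial (the convention used in this ADR)."""
    return [loc - init for init, loc in zip(theta_initial, theta_local)]


def average(vectors: List[Vector]) -> Vector:
    """Element-wise mean across replicas (stands in for an allreduce-mean)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def nesterov_outer_step(
    theta: Vector, grad: Vector, buf: Vector, lr: float, momentum: float
) -> Tuple[Vector, Vector]:
    """One Nesterov-momentum SGD step; returns (new_theta, new_momentum_buffer)."""
    new_buf = [momentum * b + g for b, g in zip(buf, grad)]
    new_theta = [
        t - lr * (g + momentum * b) for t, g, b in zip(theta, grad, new_buf)
    ]
    return new_theta, new_buf


# Two replicas that started from the same theta_initial and drifted apart.
theta_initial = [1.0, 2.0]
replica_params = [[0.5, 2.5], [0.7, 2.1]]

deltas = [pseudo_gradient(theta_initial, p) for p in replica_params]
mean_delta = average(deltas)

# An optimizer descends along its gradient, so feed it the negated delta.
outer_grad = [-d for d in mean_delta]
theta_next, momentum_buf = nesterov_outer_step(
    theta_initial, outer_grad, [0.0, 0.0], lr=1.0, momentum=0.0
)
```

With `lr=1.0` and `momentum=0.0` the outer step reduces to plain parameter averaging across replicas, which makes a convenient sanity check before turning momentum on.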

## Options considered

| Repo | License | Last commit | Maturity | Streaming variant? | Single-process testable? |
|---|---|---|---|---|---|
| `meta-pytorch/torchft` | BSD-3 | 2026-04-03 | Library (PyPI, prebuilt wheels, real test suite, Meta-maintained) | Yes (DiLoCo class IS the Streaming generalization; vanilla = single fragment) | Yes (verified via `MagicMock(Manager)` + `_DummyWork` pattern in their own tests) |
| `OpenDiLoCo` (PrimeIntellect) | Apache 2.0 | 2024 | README says "no longer maintained"; replaced by `prime` | Partial | Hivemind dependency complicates testing |
| `prime` / INTELLECT-1 (PrimeIntellect) | Apache 2.0 | 2025 | Production framework (`ElasticDeviceMesh` etc.) | Yes | Heavy harness; not single-process friendly |
| `diloco_simple` | **No LICENSE file** | 2024-05-31 | 8 commits ever; pedagogical | No | NCCL-locked |
| DeepMind original (Douillard et al. arXiv:2311.08105) | — | — | No public reference impl | — | — |

Source: `docs/research/DILOCO_RECONNAISSANCE.md` (subagent recon, 2026-05-26).

## Decision

**`meta-pytorch/torchft` — `torchft.local_sgd.DiLoCo`** (BSD-3, active,
single-process testable).

Rationale:

1. **Library, not research code.** Proper packaging on PyPI with prebuilt
   wheels (`pip install torchft-nightly`), real test suite, version history,
   maintained by Meta. The other live candidates are research codebases that
   break on torch version bumps.

2. **Streaming DiLoCo is the generalization.** The `DiLoCo` class accepts
   `model_fragments` + `fragment_sync_delay` + `fragment_update_alpha`. Set
   `model_fragments=[model]` (single fragment, full-model sync) for vanilla
   DiLoCo. Add fragments + per-fragment delays for Streaming. We don't have
   to choose at the API level — both modes are one parameter apart.

3. **Single-process unit-testable.** torchft's own tests use
   `MagicMock(Manager)` + `_DummyWork` to bypass NCCL. We can do the same
   with a shared-buffer mock allreduce that performs real averaging across
   two in-process replicas. The recon doc verifies that this pattern works.

4. **Pseudo-gradient computation is in `_save_grads` (line 324) and
   `perform_sync` (line 423).** Direct extension point — we can subclass or
   monkey-patch these to compose with our Composer trainer.
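
The shared-buffer mock allreduce from point 3 might look roughly like the sketch below. It does not import torchft: `DummyWork`, `SharedBuffer`, and `make_mock_manager` are hypothetical stand-ins, plain lists replace tensors, and the assumption that the mocked manager exposes an `allreduce` method returning a waitable handle should be checked against the recon doc before reuse.

```python
from unittest.mock import MagicMock


class DummyWork:
    """Hypothetical stand-in for an async work handle; wait() returns at once."""

    def wait(self) -> bool:
        return True


class SharedBuffer:
    """Collects one contribution per replica, then averages them in place."""

    def __init__(self, world_size: int) -> None:
        self.world_size = world_size
        self.pending = []
        self.rounds_completed = 0

    def allreduce(self, values: list) -> DummyWork:
        self.pending.append(values)
        if len(self.pending) == self.world_size:
            mean = [sum(col) / self.world_size for col in zip(*self.pending)]
            for contribution in self.pending:
                contribution[:] = mean  # overwrite in place, as an allreduce would
            self.pending = []
            self.rounds_completed += 1
        return DummyWork()


def make_mock_manager(buffer: SharedBuffer) -> MagicMock:
    """Build a mock manager whose allreduce is routed through the shared buffer."""
    manager = MagicMock()
    manager.allreduce.side_effect = buffer.allreduce
    return manager


# Two in-process replicas share one buffer; no NCCL or process group involved.
buffer = SharedBuffer(world_size=2)
managers = [make_mock_manager(buffer) for _ in range(2)]
replica_grads = [[1.0, 4.0], [3.0, 0.0]]

works = [m.allreduce(g) for m, g in zip(managers, replica_grads)]
for work in works:
    work.wait()
```

Because the replicas run sequentially in one process, the averaged values only become visible once the last replica has contributed, so assertions belong after every replica has called `allreduce`.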

### Risks accepted (with mitigations)

| Risk | Mitigation |
|---|---|
| **Sign convention mismatch** — torchft computes `θ_initial − θ_local` (negation of our spec) | Explicit unit test: assert outer step direction matches DiLoCo paper sign. Document the convention in our `outer_optimizer.py`. |
| **Wheel brittleness for nightly** | Pin a specific dated nightly slug in our pyproject.toml; bump deliberately. |
| **`torch>=2.7` requirement** | Our existing eidolon venv already satisfies this (verified). |
| **`fragment_sync_delay > 0` requires CUDA streams** | Spike 008 uses `fragment_sync_delay=0` (vanilla DiLoCo) for the smoke. Streaming with non-zero delay deferred to v0.2 (post-replication). |
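
The sign-convention mitigation can be illustrated with a framework-free test. This is a sketch under stated assumptions: it does not call torchft, the helper names are hypothetical, and a plain SGD outer step stands in for the real outer optimizer. It checks that feeding `θ_initial − θ_local` to a descending optimizer moves the parameters toward the replica, while feeding `θ_local − θ_initial` moves them away.

```python
def outer_sgd_step(theta, grad, lr):
    """Plain SGD step: descend along grad."""
    return [t - lr * g for t, g in zip(theta, grad)]


def distance(a, b):
    """Euclidean distance between two parameter vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


theta_initial = [0.0, 0.0]
theta_local = [1.0, -2.0]

# Convention the ADR attributes to torchft: initial minus local.
grad_initial_minus_local = [i - l for i, l in zip(theta_initial, theta_local)]
# Convention in our spec: local minus initial.
delta_local_minus_initial = [l - i for i, l in zip(theta_initial, theta_local)]

toward = outer_sgd_step(theta_initial, grad_initial_minus_local, lr=0.5)
away = outer_sgd_step(theta_initial, delta_local_minus_initial, lr=0.5)


def test_outer_step_direction():
    before = distance(theta_initial, theta_local)
    # Correct sign: the outer step closes the gap to the replica's parameters.
    assert distance(toward, theta_local) < before
    # Flipped sign: the outer step widens the gap, which the test must catch.
    assert distance(away, theta_local) > before
```

A wrapper that stores δ in our convention would need to negate it before handing it to the outer optimizer; the second assertion is what fails loudly if that negation is ever dropped or doubled.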

## Consequences

### Accepted

- Spike 008 imports `torchft.local_sgd.DiLoCo` and runs the recon doc's
  ready-to-paste pytest pattern as the smoke:
  - 2 replicas, 4 inner steps, 2 outer rounds on a TinyMLP
  - shared-buffer mock allreduce (no NCCL)
  - assertions: replica equality after sync, params actually moved, Nesterov
    state populated, sync count matches expected

- `composer_replication.diloco` package wraps `torchft.local_sgd.DiLoCo`
  with our trainer's hooks. We DO NOT fork torchft — we depend on it as a
  versioned wheel.

- Our integration is "vanilla DiLoCo" (single fragment, full-model sync) for
  v0.1. Streaming DiLoCo is one configuration flag away in v0.2.

- The sign-convention mismatch is **made explicit** in our wrapper code with
  a unit test that catches a sign flip if torchft ever inverts it.
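
One way the wrapper could keep vanilla and Streaming modes a flag apart is a small helper that builds the fragment-related keyword arguments. This is a hypothetical sketch for `composer_replication.diloco`, not torchft's API: only the three parameter names come from this ADR, the helper name and its validation rules are assumptions, and the remaining `DiLoCo` constructor arguments are deliberately left out.

```python
from typing import Any, Dict, Optional, Sequence


def diloco_fragment_kwargs(
    model: Any,
    fragments: Optional[Sequence[Any]] = None,
    fragment_sync_delay: int = 0,
    fragment_update_alpha: Optional[float] = None,
) -> Dict[str, Any]:
    """Return fragment kwargs for vanilla (default) or Streaming DiLoCo."""
    if fragments is None:
        # Vanilla DiLoCo: one fragment covering the whole model, no delay.
        if fragment_sync_delay != 0:
            raise ValueError("vanilla DiLoCo requires fragment_sync_delay=0")
        return {"model_fragments": [model], "fragment_sync_delay": 0}

    # Streaming DiLoCo: caller supplies the fragments and mixing factor.
    if fragment_update_alpha is None:
        raise ValueError("streaming mode needs an explicit fragment_update_alpha")
    return {
        "model_fragments": list(fragments),
        "fragment_sync_delay": fragment_sync_delay,
        "fragment_update_alpha": fragment_update_alpha,
    }


# v0.1 usage: a placeholder object stands in for the real model.
model = object()
vanilla_kwargs = diloco_fragment_kwargs(model)
```

Keeping the vanilla branch free of `fragment_update_alpha` leaves that value at whatever default the library defines, so the helper never has to guess it.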

### Rejected paths

- **Roll our own DiLoCo.** Tempting (the algorithm is short) but the test
  surface for distributed-correctness is large; reusing a Meta-maintained
  library cuts the audit burden.
- **`diloco_simple`.** Disqualified by the missing LICENSE file alone.
- **`prime` / INTELLECT-1.** Right tool for production multi-node runs,
  wrong tool for a single-process unit test.

## Source

`docs/research/DILOCO_RECONNAISSANCE.md` (subagent recon, primary-sourced
from torchft repo cloned + read locally, 2026-05-26).