ADR-003 — DiLoCo implementation choice for Spike 008

Status: Accepted | Date: 2026-05-26 | Wave: Phase 4 (deep work loop)

Context

Spike 008 closes V2 of the vision validation: DiLoCo was tagged "deferred to v0.2" in Wave 5 but the original brief said "combine with DiLoCo." We want a real working integration, not a hand-rolled toy.

The integration target: take the pseudo-gradient δ = θ_local − θ_initial after N inner steps, then apply a Nesterov-momentum outer step across replicas. We need a PyTorch-compatible reference implementation that runs in a single process for unit tests AND scales out on torch.distributed when we eventually run real multi-replica training.
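
The outer step described above can be sketched in plain PyTorch. This is a minimal illustration, not torchft's implementation: the helper name `outer_step`, the outer learning rate, and the momentum value are assumptions chosen for the example. Because `torch.optim.SGD` subtracts `lr * grad`, the sketch feeds it −δ so that the global parameters move toward the replicas' averaged local parameters.

```python
import torch


def outer_step(global_params, replica_params, outer_opt):
    """Apply one DiLoCo-style outer step in place (hypothetical helper).

    global_params:  list of nn.Parameter holding theta_initial
    replica_params: one list of tensors per replica holding theta_local
    outer_opt:      an optimizer constructed over global_params
    """
    with torch.no_grad():
        for g, locals_ in zip(global_params, zip(*replica_params)):
            # delta as written in this ADR: theta_local - theta_initial,
            # averaged across replicas.
            delta = torch.stack([loc - g for loc in locals_]).mean(dim=0)
            # SGD computes theta <- theta - lr * grad, so pass -delta.
            g.grad = -delta
    outer_opt.step()
    outer_opt.zero_grad()


# Illustrative values only: two replicas that drifted to 1.0 and 3.0.
theta = [torch.nn.Parameter(torch.zeros(3))]
opt = torch.optim.SGD(theta, lr=0.7, momentum=0.9, nesterov=True)
replicas = [[torch.full((3,), 1.0)], [torch.full((3,), 3.0)]]
outer_step(theta, replicas, opt)
```

With Nesterov momentum the first step overshoots the plain replica average, which is expected behaviour for the outer optimizer rather than a bug in the sketch.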

Options considered

| Repo | License | Last commit | Maturity | Streaming variant? | Single-process testable? |
|---|---|---|---|---|---|
| meta-pytorch/torchft | BSD-3 | 2026-04-03 | Library (PyPI, prebuilt wheels, real test suite, Meta-maintained) | Yes (DiLoCo class IS the Streaming generalization; vanilla = single fragment) | Yes (verified via MagicMock(Manager) + _DummyWork pattern in their own tests) |
| OpenDiLoCo (PrimeIntellect) | Apache 2.0 | 2024 | README says "no longer maintained"; replaced by prime | Partial | Hivemind dependency complicates testing |
| prime / INTELLECT-1 (PrimeIntellect) | Apache 2.0 | 2025 | Production framework (ElasticDeviceMesh etc.) | Yes | Heavy harness; not single-process friendly |
| diloco_simple | No LICENSE file | 2024-05-31 | 8 commits ever; pedagogical | No | NCCL-locked |
| DeepMind original (Douillard et al., arXiv:2311.08105) | | | No public reference impl | | |

Source: docs/research/DILOCO_RECONNAISSANCE.md (subagent recon, 2026-05-26).

Decision

meta-pytorch/torchft, via torchft.local_sgd.DiLoCo (BSD-3, active, single-process testable).

Rationale:

  1. Library, not research code. Proper packaging on PyPI with prebuilt wheels (pip install torchft-nightly), real test suite, version history, maintained by Meta. The other live candidates are research codebases that break on torch version bumps.

  2. Streaming DiLoCo is the generalization. The DiLoCo class accepts model_fragments + fragment_sync_delay + fragment_update_alpha. Set model_fragments=[model] (single fragment, full-model sync) for vanilla DiLoCo. Add fragments + per-fragment delays for Streaming. We don't have to choose at the API level — both modes are one parameter apart.

  3. Single-process unit-testable. torchft's own tests use MagicMock(Manager) + _DummyWork to bypass NCCL. We can do the same with a shared-buffer mock allreduce that performs real averaging across two in-process replicas. The recon doc records this as a verified working pattern.

  4. Pseudo-gradient computation is in _save_grads (line 324) and perform_sync (line 423). Direct extension point — we can subclass or monkey-patch these to compose with our Composer trainer.
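
The shared-buffer mock allreduce mentioned in point 3 could look roughly like the sketch below. It is an assumption-laden illustration that does not import torchft: `SharedBuffer` and `make_mock_manager` are hypothetical names, the manager is a bare `MagicMock` rather than `MagicMock(Manager)`, and the `allreduce(tensor)` call shape is assumed rather than taken from torchft's API. The point is only to show how two in-process replicas can get a real average without NCCL.

```python
import itertools
from unittest.mock import MagicMock

import torch


class SharedBuffer:
    """Collects one tensor per replica and averages them in place."""

    def __init__(self, world_size):
        self.world_size = world_size
        self._slots = {}  # call index -> tensors contributed so far

    def contribute(self, call_index, tensor):
        group = self._slots.setdefault(call_index, [])
        group.append(tensor)
        if len(group) == self.world_size:
            # Every replica has contributed: write the mean back into
            # each replica's tensor, mimicking an averaging allreduce.
            mean = torch.stack([t.detach() for t in group]).mean(dim=0)
            with torch.no_grad():
                for t in group:
                    t.copy_(mean)
            del self._slots[call_index]


def make_mock_manager(buffer):
    """Build a mock manager whose allreduce goes through the buffer."""
    manager = MagicMock(name="manager")
    calls = itertools.count()  # the n-th call on each replica is paired

    def _allreduce(tensor, *args, **kwargs):
        buffer.contribute(next(calls), tensor)
        return MagicMock(name="work")  # stand-in for a work handle

    manager.allreduce.side_effect = _allreduce
    return manager


buffer = SharedBuffer(world_size=2)
m0, m1 = make_mock_manager(buffer), make_mock_manager(buffer)
a = torch.tensor([1.0, 3.0])
b = torch.tensor([3.0, 5.0])
m0.allreduce(a)  # a is unchanged until the second replica contributes
m1.allreduce(b)  # now both a and b hold the element-wise mean
```

Averaging is deferred until the last replica calls in, so the replicas can be driven sequentially from a single test function.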

Risks accepted (with mitigations)

| Risk | Mitigation |
|---|---|
| Sign convention mismatch: torchft computes θ_initial − θ_local (negation of our spec) | Explicit unit test: assert outer step direction matches DiLoCo paper sign. Document the convention in our outer_optimizer.py. |
| Wheel brittleness for nightly | Pin a specific dated nightly slug in our pyproject.toml; bump deliberately. |
| torch>=2.7 requirement | Confirm our existing eidolon venv has it. Already does (verified). |
| fragment_sync_delay > 0 requires CUDA streams | Spike 008 uses fragment_sync_delay=0 (vanilla DiLoCo) for the smoke. Streaming with non-zero delay deferred to v0.2 (post-replication). |
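
A sign-convention test along the lines of the first mitigation might be sketched as follows. It uses only `torch.optim.SGD` and does not exercise torchft itself; the helper `step_with_pseudo_grad`, the test name, and the hyperparameters are hypothetical. It checks that feeding θ_initial − θ_local as the gradient moves parameters toward the local replica, and that the flipped sign moves them away.

```python
import torch


def step_with_pseudo_grad(theta_initial, theta_local, sign):
    """Run one Nesterov SGD outer step (hypothetical helper).

    sign=+1.0 feeds theta_initial - theta_local as the gradient;
    sign=-1.0 feeds the negation, i.e. theta_local - theta_initial.
    """
    param = torch.nn.Parameter(theta_initial.clone())
    opt = torch.optim.SGD([param], lr=0.7, momentum=0.9, nesterov=True)
    param.grad = sign * (theta_initial - theta_local)
    opt.step()
    return param.detach()


def test_outer_step_moves_toward_local():
    theta_initial = torch.zeros(4)
    theta_local = torch.ones(4)
    before = (theta_local - theta_initial).norm()

    moved = step_with_pseudo_grad(theta_initial, theta_local, sign=+1.0)
    flipped = step_with_pseudo_grad(theta_initial, theta_local, sign=-1.0)

    # Correct sign: the outer step closes the gap to the local replica.
    assert (theta_local - moved).norm() < before
    # Flipped sign: the outer step widens the gap, so a flip is caught.
    assert (theta_local - flipped).norm() > before


test_outer_step_moves_toward_local()
```

In the real wrapper the same two assertions would presumably run against the gradient that the library produces, so an upstream sign change fails the test instead of silently reversing training.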

Consequences

Accepted

  • Spike 008 imports torchft.local_sgd.DiLoCo and runs the recon doc's ready-to-paste pytest pattern as the smoke:

    • 2 replicas, 4 inner steps, 2 outer rounds on a TinyMLP
    • shared-buffer mock allreduce (no NCCL)
    • assertions: replica equality after sync, params actually moved, Nesterov state populated, sync count matches expected
  • composer_replication.diloco package wraps torchft.local_sgd.DiLoCo with our trainer's hooks. We DO NOT fork torchft — we depend on it as a versioned wheel.

  • Our integration is "vanilla DiLoCo" (single fragment, full-model sync) for v0.1. Streaming DiLoCo is a configuration flag away in v0.2.

  • The sign-convention mismatch is made explicit in our wrapper code with a unit test that catches a sign flip if torchft ever inverts it.
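
The shape of the smoke assertions listed above can be illustrated with a self-contained stand-in. This sketch deliberately replaces torchft.local_sgd.DiLoCo with a hand-written outer loop so that it runs with PyTorch alone. It is not the recon doc's pytest pattern, and `TinyMLP`'s layer sizes, the learning rates, and `run_smoke` are assumptions made for the example.

```python
import copy

import torch
from torch import nn


class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

    def forward(self, x):
        return self.net(x)


def run_smoke(num_replicas=2, inner_steps=4, outer_rounds=2):
    torch.manual_seed(0)
    global_model = TinyMLP()
    initial = [p.detach().clone() for p in global_model.parameters()]
    outer_opt = torch.optim.SGD(
        global_model.parameters(), lr=0.7, momentum=0.9, nesterov=True
    )
    replicas = [copy.deepcopy(global_model) for _ in range(num_replicas)]
    inner_opts = [torch.optim.SGD(r.parameters(), lr=0.05) for r in replicas]
    sync_count = 0

    for _ in range(outer_rounds):
        # Inner phase: each replica trains on its own random batches.
        for replica, inner_opt in zip(replicas, inner_opts):
            for _ in range(inner_steps):
                x = torch.randn(16, 4)
                y = x.sum(dim=1, keepdim=True)
                loss = nn.functional.mse_loss(replica(x), y)
                inner_opt.zero_grad()
                loss.backward()
                inner_opt.step()

        # Outer phase: averaged (theta_initial - theta_local) as gradient.
        with torch.no_grad():
            for i, g in enumerate(global_model.parameters()):
                locals_ = [list(r.parameters())[i] for r in replicas]
                g.grad = torch.stack([g - loc for loc in locals_]).mean(dim=0)
        outer_opt.step()
        outer_opt.zero_grad()

        # Sync: every replica restarts from the new global parameters.
        for replica in replicas:
            replica.load_state_dict(global_model.state_dict())
        sync_count += 1

    return global_model, replicas, initial, outer_opt, sync_count


model, replicas, initial, outer_opt, sync_count = run_smoke()
params = list(model.parameters())

# Replica equality after sync.
for replica in replicas:
    assert all(torch.equal(p, q) for p, q in zip(params, replica.parameters()))
# Params actually moved.
assert any(not torch.equal(p.detach(), p0) for p, p0 in zip(params, initial))
# Nesterov state populated.
assert all("momentum_buffer" in outer_opt.state[p] for p in params)
# Sync count matches expected.
assert sync_count == 2
```

Swapping the hand-written outer phase for the library class should leave the four assertions unchanged, which is what makes them useful as a smoke test.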

Rejected paths

  • Roll our own DiLoCo. Tempting (the algorithm is short) but the test surface for distributed-correctness is large; reusing a Meta-maintained library cuts the audit burden.
  • diloco_simple. Disqualified by the license absence alone.
  • prime / INTELLECT-1. Right tool for production multi-node runs, wrong tool for a single-process unit test.

Source

docs/research/DILOCO_RECONNAISSANCE.md (subagent recon, primary-sourced from torchft repo cloned + read locally, 2026-05-26).