ADR-003 — DiLoCo implementation choice for Spike 008
- Status: Accepted
- Date: 2026-05-26
- Wave: Phase 4 (deep work loop)
Context
Spike 008 closes V2 of the vision validation: DiLoCo was tagged "deferred to v0.2" in Wave 5 but the original brief said "combine with DiLoCo." We want a real working integration, not a hand-rolled toy.
The integration target: take the pseudo-gradient δ = θ_local − θ_initial after N
inner steps and apply a Nesterov-momentum outer step across replicas. We need a
PyTorch-compatible reference implementation that runs single-process for
unit tests and scales out on torch.distributed when we eventually run real
multi-replica training.
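
As a rough illustration of the integration target above, the outer step can be sketched in plain Python on lists of floats. This is a minimal sketch under stated assumptions, not torchft's implementation: the function names `pseudo_gradient` and `outer_step` are hypothetical, the learning rate and momentum are caller-supplied, and the Nesterov update is assumed to follow the usual SGD-with-Nesterov form (buffer = μ·buffer + g, update = g + μ·buffer).

```python
from typing import List, Tuple

Vector = List[float]


def pseudo_gradient(theta_local: Vector, theta_initial: Vector) -> Vector:
    """Pseudo-gradient in this ADR's convention: delta = theta_local - theta_initial."""
    return [l - i for l, i in zip(theta_local, theta_initial)]


def outer_step(
    theta_initial: Vector,
    replica_params: List[Vector],
    momentum_buf: Vector,
    lr: float,
    mu: float,
) -> Tuple[Vector, Vector]:
    """One outer round: average deltas across replicas, then a Nesterov step.

    Assumption: the averaged delta is negated before it is handed to the
    descent-style update, so that a positive lr moves the shared parameters
    toward the replicas' average.
    """
    n = len(replica_params)
    deltas = [pseudo_gradient(p, theta_initial) for p in replica_params]
    avg_delta = [sum(col) / n for col in zip(*deltas)]

    # Treat -avg_delta as the "gradient" for a descent-style optimizer.
    grad = [-d for d in avg_delta]

    # Nesterov momentum: update the buffer, then look ahead along it.
    new_buf = [mu * b + g for b, g in zip(momentum_buf, grad)]
    update = [g + mu * b for g, b in zip(grad, new_buf)]

    new_theta = [t - lr * u for t, u in zip(theta_initial, update)]
    return new_theta, new_buf
```

With `lr=1.0` and `mu=0.0` this reduces to plain parameter averaging across replicas, which makes a convenient sanity check before momentum is switched on.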
Options considered
| Repo | License | Last commit | Maturity | Streaming variant? | Single-process testable? |
|---|---|---|---|---|---|
| meta-pytorch/torchft | BSD-3 | 2026-04-03 | Library (PyPI, prebuilt wheels, real test suite, Meta-maintained) | Yes (DiLoCo class IS the Streaming generalization; vanilla = single fragment) | Yes (verified via MagicMock(Manager) + _DummyWork pattern in their own tests) |
| OpenDiLoCo (PrimeIntellect) | Apache 2.0 | 2024 | README says "no longer maintained"; replaced by prime | Partial | Hivemind dependency complicates testing |
| prime / INTELLECT-1 (PrimeIntellect) | Apache 2.0 | 2025 | Production framework (ElasticDeviceMesh etc.) | Yes | Heavy harness; not single-process friendly |
| diloco_simple | No LICENSE file | 2024-05-31 | 8 commits ever; pedagogical | No | NCCL-locked |
| DeepMind original (Douillard et al. arXiv:2311.08105) | — | — | No public reference impl | — | — |
Source: docs/research/DILOCO_RECONNAISSANCE.md (subagent recon, 2026-05-26).
Decision
meta-pytorch/torchft — torchft.local_sgd.DiLoCo (BSD-3, active,
single-process testable).
Rationale:
- Library, not research code. Proper packaging on PyPI with prebuilt wheels (`pip install torchft-nightly`), real test suite, version history, maintained by Meta. The other live candidates are research codebases that break on torch version bumps.
- Streaming DiLoCo is the generalization. The `DiLoCo` class accepts `model_fragments` + `fragment_sync_delay` + `fragment_update_alpha`. Set `model_fragments=[model]` (single fragment, full-model sync) for vanilla DiLoCo. Add fragments + per-fragment delays for Streaming. We don't have to choose at the API level: both modes are one parameter apart.
- Single-process unit-testable. torchft's own tests use `MagicMock(Manager)` + `_DummyWork` to bypass NCCL. We can do the same: a shared-buffer mock allreduce that does real averaging across two in-process replicas. The recon doc records this as a verified working pattern.
- Pseudo-gradient computation is in `_save_grads` (line 324) and `perform_sync` (line 423). This is a direct extension point: we can subclass or monkey-patch these to compose with our Composer trainer.
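
The shared-buffer mock allreduce mentioned above can be sketched with the standard library alone. This is an illustrative sketch, not the recon doc's pytest pattern and not torchft's `Manager` interface: `SharedBuffer`, `make_mock_manager`, and the single-argument `allreduce` call are hypothetical names chosen for the example. The idea is that each in-process replica deposits its values into a shared buffer, and the average is read once every replica has contributed.

```python
from typing import Dict, List
from unittest.mock import MagicMock


class SharedBuffer:
    """In-process stand-in for a collective: replicas deposit, then read the mean."""

    def __init__(self, world_size: int) -> None:
        self.world_size = world_size
        self.slots: Dict[int, List[float]] = {}

    def deposit(self, rank: int, values: List[float]) -> None:
        # Copy so later in-place edits by the caller do not alter the buffer.
        self.slots[rank] = list(values)

    def average(self) -> List[float]:
        # Real averaging, but only once every replica has contributed.
        assert len(self.slots) == self.world_size, "not all replicas deposited"
        return [sum(col) / self.world_size for col in zip(*self.slots.values())]


def make_mock_manager(rank: int, shared: SharedBuffer) -> MagicMock:
    """Hypothetical manager mock whose allreduce writes into the shared buffer."""
    manager = MagicMock(name=f"manager_rank{rank}")
    manager.allreduce.side_effect = lambda values: shared.deposit(rank, values)
    return manager


if __name__ == "__main__":
    shared = SharedBuffer(world_size=2)
    m0, m1 = make_mock_manager(0, shared), make_mock_manager(1, shared)
    m0.allreduce([1.0, 3.0])
    m1.allreduce([3.0, 5.0])
    print(shared.average())
```

Because the mock records its calls, a test can also assert on `manager.allreduce.call_count` to check that the number of syncs matches the expected number of outer rounds.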
Risks accepted (with mitigations)
| Risk | Mitigation |
|---|---|
| Sign convention mismatch: torchft computes θ_initial − θ_local (negation of our spec) | Explicit unit test: assert outer step direction matches DiLoCo paper sign. Document the convention in our `outer_optimizer.py`. |
| Wheel brittleness for nightly | Pin a specific dated nightly slug in our `pyproject.toml`; bump deliberately. |
| `torch>=2.7` requirement | Confirm our existing eidolon venv has it. Already does (verified). |
| `fragment_sync_delay > 0` requires CUDA streams | Spike 008 uses `fragment_sync_delay=0` (vanilla DiLoCo) for the smoke. Streaming with non-zero delay deferred to v0.2 (post-replication). |
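
The sign-convention check in the first row could take roughly the following shape. This is a sketch under assumptions, not the project's actual test: all function names are hypothetical, nothing here calls torchft, and the direction check is only meaningful for the first outer round, when the momentum state is still zero. It compares the two conventions named in this ADR and asserts that a descent-style update moves the parameters toward the locally trained values.

```python
from typing import List

Vector = List[float]


def delta_spec(theta_local: Vector, theta_initial: Vector) -> Vector:
    """This ADR's spec: theta_local - theta_initial."""
    return [l - i for l, i in zip(theta_local, theta_initial)]


def delta_negated(theta_local: Vector, theta_initial: Vector) -> Vector:
    """The negated convention this ADR attributes to torchft: theta_initial - theta_local."""
    return [i - l for l, i in zip(theta_local, theta_initial)]


def descend(theta: Vector, grad: Vector, lr: float) -> Vector:
    """Plain descent step, standing in for the outer optimizer with zero momentum."""
    return [t - lr * g for t, g in zip(theta, grad)]


def moves_toward_local(theta_initial: Vector, theta_local: Vector, theta_after: Vector) -> bool:
    """True if every changed coordinate moved in the direction of theta_local."""
    return all(
        (a - i) * (l - i) > 0
        for i, l, a in zip(theta_initial, theta_local, theta_after)
        if l != i
    )


def test_outer_step_direction() -> None:
    theta_initial = [1.0, -1.0]
    theta_local = [2.0, -3.0]

    # The two conventions must be exact negations of each other.
    spec = delta_spec(theta_local, theta_initial)
    negated = delta_negated(theta_local, theta_initial)
    assert spec == [-g for g in negated]

    # Fed to a descent step, the negated form moves toward the local params...
    after = descend(theta_initial, negated, lr=0.5)
    assert moves_toward_local(theta_initial, theta_local, after)

    # ...while the spec form, used unmodified as a gradient, moves away.
    flipped = descend(theta_initial, spec, lr=0.5)
    assert not moves_toward_local(theta_initial, theta_local, flipped)
```

A test of this shape fails loudly if either side of the wrapper silently changes its sign convention.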
Consequences
Accepted
- Spike 008 imports `torchft.local_sgd.DiLoCo` and runs the recon doc's ready-to-paste pytest pattern as the smoke:
  - 2 replicas, 4 inner steps, 2 outer rounds on a TinyMLP
  - shared-buffer mock allreduce (no NCCL)
  - assertions: replica equality after sync, params actually moved, Nesterov state populated, sync count matches expected
- The `composer_replication.diloco` package wraps `torchft.local_sgd.DiLoCo` with our trainer's hooks. We DO NOT fork torchft; we depend on it as a versioned wheel.
- Our integration is "vanilla DiLoCo" (single fragment, full-model sync) for v0.1. Streaming DiLoCo is a configuration flag away in v0.2.
- The sign-convention mismatch is made explicit in our wrapper code with a unit test that catches a sign flip if torchft ever inverts it.
Rejected paths
- Roll our own DiLoCo. Tempting (the algorithm is short) but the test surface for distributed-correctness is large; reusing a Meta-maintained library cuts the audit burden.
- `diloco_simple`. Disqualified by the license absence alone.
- `prime` / INTELLECT-1. Right tool for production multi-node runs, wrong tool for a single-process unit test.
Source
docs/research/DILOCO_RECONNAISSANCE.md (subagent recon, primary-sourced
from torchft repo cloned + read locally, 2026-05-26).