# ADR-003: DiLoCo implementation choice for Spike 008
**Status**: Accepted
**Date**: 2026-05-26
**Wave**: Phase 4 (deep work loop)
## Context
Spike 008 closes V2 of the vision validation. DiLoCo was tagged "deferred to
v0.2" in Wave 5, but the original brief said "combine with DiLoCo." We want a
real, working integration, not a hand-rolled toy.
The integration target: take the pseudo-gradient `δ = θ_local − θ_initial` after N
inner steps, then apply a Nesterov-momentum outer step across replicas. We need a
PyTorch-compatible reference implementation that runs in a single process for
unit tests and also scales out on `torch.distributed` when we eventually run real
multi-replica training.
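
The outer update described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not torchft's code: `outer_step` is a hypothetical helper, parameters are flat lists of floats, the learning-rate and momentum defaults are illustrative, and the Nesterov form follows the formulation used by PyTorch's SGD optimizer (update the velocity, then step along the gradient plus the momentum-scaled velocity).

```python
from typing import List, Tuple


def outer_step(
    theta_initial: List[float],
    replica_thetas: List[List[float]],
    velocity: List[float],
    lr: float = 0.7,
    momentum: float = 0.9,
) -> Tuple[List[float], List[float]]:
    """One DiLoCo-style outer round on flat parameter lists (illustrative)."""
    n = len(replica_thetas)
    new_theta, new_velocity = [], []
    for i, theta0 in enumerate(theta_initial):
        # Spec convention from this ADR: delta = theta_local - theta_initial,
        # averaged across replicas (the step an allreduce would perform).
        delta = sum(replica[i] for replica in replica_thetas) / n - theta0
        # A descent optimizer expects a gradient, so negate delta.
        grad = -delta
        v = momentum * velocity[i] + grad
        # Nesterov look-ahead: gradient plus momentum-scaled velocity.
        step = grad + momentum * v
        new_theta.append(theta0 - lr * step)
        new_velocity.append(v)
    return new_theta, new_velocity


if __name__ == "__main__":
    theta, vel = outer_step(
        [1.0, 2.0], [[0.0, 2.0], [1.0, 4.0]], [0.0, 0.0], lr=1.0, momentum=0.0
    )
    # With lr=1 and no momentum the result is the plain replica average.
    print(theta, vel)
```

With `lr=1.0` and `momentum=0.0` the outer step reduces to parameter averaging, which makes a convenient sanity check before enabling momentum.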
## Options considered
| Repo | License | Last commit | Maturity | Streaming variant? | Single-process testable? |
|---|---|---|---|---|---|
| `meta-pytorch/torchft` | BSD-3 | 2026-04-03 | Library (PyPI, prebuilt wheels, real test suite, Meta-maintained) | Yes (DiLoCo class IS the Streaming generalization; vanilla = single fragment) | Yes (verified via `MagicMock(Manager)` + `_DummyWork` pattern in their own tests) |
| `OpenDiLoCo` (PrimeIntellect) | Apache 2.0 | 2024 | README says "no longer maintained"; replaced by `prime` | Partial | Hivemind dependency complicates testing |
| `prime` / INTELLECT-1 (PrimeIntellect) | Apache 2.0 | 2025 | Production framework (`ElasticDeviceMesh` etc.) | Yes | Heavy harness; not single-process friendly |
| `diloco_simple` | **No LICENSE file** | 2024-05-31 | 8 commits ever; pedagogical | No | NCCL-locked |
| DeepMind original (Douillard et al. arXiv:2311.08105) | — | — | No public reference impl | — | — |
Source: `docs/research/DILOCO_RECONNAISSANCE.md` (subagent recon, 2026-05-26).
## Decision
**`meta-pytorch/torchft` — `torchft.local_sgd.DiLoCo`** (BSD-3, active,
single-process testable).
Rationale:
1. **Library, not research code.** Proper packaging on PyPI with prebuilt
wheels (`pip install torchft-nightly`), real test suite, version history,
maintained by Meta. The other live candidates are research codebases that
break on torch version bumps.
2. **Streaming DiLoCo is the generalization.** The `DiLoCo` class accepts
`model_fragments` + `fragment_sync_delay` + `fragment_update_alpha`. Set
`model_fragments=[model]` (single fragment, full-model sync) for vanilla
DiLoCo. Add fragments + per-fragment delays for Streaming. We don't have
to choose at the API level — both modes are one parameter apart.
3. **Single-process unit-testable.** torchft's own tests use
`MagicMock(Manager)` + `_DummyWork` to bypass NCCL. We can do the same with a
shared-buffer mock allreduce that does real averaging across two
in-process replicas. The recon doc contains a verified working pattern.
4. **Pseudo-gradient computation is in `_save_grads` (line 324) and
`perform_sync` (line 423).** These are direct extension points: we can subclass or
monkey-patch them to compose with our Composer trainer.
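
The single-process testing idea in point 3 can be illustrated without any distributed backend. The sketch below is not torchft's `MagicMock(Manager)` + `_DummyWork` code. It is a hypothetical stand-in for the same idea: `SharedBufferAllreduce`, `contribute`, `make_mock_comm`, and the mocked `comm.allreduce` attribute are names invented for this example, and plain lists of floats stand in for tensors.

```python
from typing import Dict, List
from unittest.mock import MagicMock


class SharedBufferAllreduce:
    """In-process stand-in for an averaging allreduce (hypothetical helper)."""

    def __init__(self, world_size: int) -> None:
        self.world_size = world_size
        self._buffer: Dict[int, List[float]] = {}

    def contribute(self, replica_id: int, values: List[float]) -> None:
        # Each replica deposits a copy of its values into the shared buffer.
        self._buffer[replica_id] = list(values)

    def average(self) -> List[float]:
        # Real averaging, valid only once every replica has contributed.
        assert len(self._buffer) == self.world_size, "missing contributions"
        columns = zip(*self._buffer.values())
        return [sum(col) / self.world_size for col in columns]


def make_mock_comm(shared: SharedBufferAllreduce, replica_id: int) -> MagicMock:
    """Build a mock communicator whose allreduce writes to the shared buffer."""
    comm = MagicMock()
    comm.allreduce.side_effect = lambda values: shared.contribute(replica_id, values)
    return comm


if __name__ == "__main__":
    shared = SharedBufferAllreduce(world_size=2)
    comms = [make_mock_comm(shared, rank) for rank in range(2)]
    comms[0].allreduce([1.0, 2.0])
    comms[1].allreduce([3.0, 6.0])
    print(shared.average())
    print(comms[0].allreduce.call_count)
```

Because the mock records its calls, a test can check both the averaged values and how many times each replica synchronized.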
### Risks accepted (with mitigations)
| Risk | Mitigation |
|---|---|
| **Sign convention mismatch** — torchft computes `θ_initial − θ_local` (negation of our spec) | Explicit unit test: assert outer step direction matches DiLoCo paper sign. Document the convention in our `outer_optimizer.py`. |
| **Wheel brittleness for nightly** | Pin a specific dated nightly slug in our pyproject.toml; bump deliberately. |
| **`torch>=2.7` requirement** | Our existing eidolon venv already satisfies this (verified). |
| **`fragment_sync_delay > 0` requires CUDA streams** | Spike 008 uses `fragment_sync_delay=0` (vanilla DiLoCo) for the smoke. Streaming with non-zero delay deferred to v0.2 (post-replication). |
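
The sign-convention risk in the first row lends itself to a very small regression test. The following is a sketch, assuming the two conventions exactly as stated in this ADR. `pseudo_grad_spec`, `pseudo_grad_library`, and `apply_outer_sgd` are hypothetical helpers rather than torchft functions, and plain SGD stands in for the Nesterov outer optimizer so that the direction check stays obvious.

```python
def pseudo_grad_spec(theta_initial, theta_local):
    """Spec convention from this ADR: delta = theta_local - theta_initial."""
    return [loc - init for init, loc in zip(theta_initial, theta_local)]


def pseudo_grad_library(theta_initial, theta_local):
    """Library convention as described in this ADR: theta_initial - theta_local."""
    return [init - loc for init, loc in zip(theta_initial, theta_local)]


def apply_outer_sgd(theta_initial, grad, lr):
    """Plain SGD step: theta <- theta - lr * grad."""
    return [t - lr * g for t, g in zip(theta_initial, grad)]


def test_outer_step_moves_toward_local():
    theta_initial = [1.0, -2.0]
    theta_local = [0.0, 0.0]  # where the inner steps ended up
    lib = pseudo_grad_library(theta_initial, theta_local)
    spec = pseudo_grad_spec(theta_initial, theta_local)
    # The two conventions must be exact negations of each other.
    assert lib == [-s for s in spec]
    # Fed to a descent optimizer, the library convention moves toward theta_local.
    moved = apply_outer_sgd(theta_initial, lib, lr=0.5)
    assert moved == [0.5, -1.0]
    # Feeding the spec convention without negating it moves away instead.
    wrong = apply_outer_sgd(theta_initial, spec, lr=0.5)
    assert wrong == [1.5, -3.0]


if __name__ == "__main__":
    test_outer_step_moves_toward_local()
    print("sign convention check passed")
```

A test of this shape fails loudly if either side of the wrapper silently flips its sign, which is the failure mode the mitigation is meant to catch.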
## Consequences
### Accepted
- Spike 008 imports `torchft.local_sgd.DiLoCo` and runs the recon doc's
  ready-to-paste pytest pattern as the smoke:
  - 2 replicas, 4 inner steps, 2 outer rounds on a TinyMLP
  - shared-buffer mock allreduce (no NCCL)
  - assertions: replica equality after sync, params actually moved, Nesterov
    state populated, sync count matches expected
- `composer_replication.diloco` package wraps `torchft.local_sgd.DiLoCo`
with our trainer's hooks. We DO NOT fork torchft — we depend on it as a
versioned wheel.
- Our integration is "vanilla DiLoCo" (single fragment, full-model sync) for
v0.1. Streaming DiLoCo is one configuration flag away in v0.2.
- The sign-convention mismatch is **made explicit** in our wrapper code with
a unit test that catches a sign flip if torchft ever inverts it.
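
The shape of the smoke described in the first bullet above can be sketched without torchft or PyTorch. This is an illustration only, not the recon doc's pytest pattern: a one-weight linear model stands in for the TinyMLP, the allreduce is replaced by an in-process average, and `inner_steps`, `run_smoke`, and all hyperparameter values are assumptions made for this example. The pseudo-gradient here uses the optimizer-gradient sign (`θ_initial − θ_local`).

```python
INNER_STEPS, OUTER_ROUNDS = 4, 2
INNER_LR, OUTER_LR, MOMENTUM = 0.1, 0.7, 0.9


def inner_steps(w: float, data) -> float:
    """Run INNER_STEPS of plain SGD on squared error for the model y = w * x."""
    for _ in range(INNER_STEPS):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= INNER_LR * grad
    return w


def run_smoke():
    # Each replica sees a different data shard drawn from y = 3x.
    shards = [[(1.0, 3.0), (2.0, 6.0)], [(0.5, 1.5), (1.5, 4.5)]]
    w_global, velocity, sync_count = 0.0, 0.0, 0
    replicas = [w_global, w_global]
    for _ in range(OUTER_ROUNDS):
        local = [inner_steps(w, shard) for w, shard in zip(replicas, shards)]
        # Mock allreduce: average the pseudo-gradients in-process (no NCCL).
        pseudo_grad = sum(w_global - w for w in local) / len(local)
        # Nesterov outer step in the PyTorch SGD formulation.
        velocity = MOMENTUM * velocity + pseudo_grad
        w_global -= OUTER_LR * (pseudo_grad + MOMENTUM * velocity)
        replicas = [w_global, w_global]  # every replica adopts the synced weight
        sync_count += 1
    return replicas, velocity, sync_count


if __name__ == "__main__":
    replicas, velocity, sync_count = run_smoke()
    assert replicas[0] == replicas[1]   # replica equality after sync
    assert replicas[0] != 0.0           # params actually moved
    assert velocity != 0.0              # Nesterov state populated
    assert sync_count == OUTER_ROUNDS   # sync count matches expected
    print("smoke passed")
```

The four assertions mirror the four checks listed for the smoke; in the real test they would inspect model parameters and the outer optimizer's momentum buffers instead of scalars.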
### Rejected paths
- **Roll our own DiLoCo.** Tempting (the algorithm is short) but the test
surface for distributed-correctness is large; reusing a Meta-maintained
library cuts the audit burden.
- **`diloco_simple`.** Disqualified by the missing LICENSE file alone.
- **`prime` / INTELLECT-1.** Right tool for production multi-node runs,
wrong tool for a single-process unit test.
## Source
`docs/research/DILOCO_RECONNAISSANCE.md` (subagent recon, primary-sourced
from torchft repo cloned + read locally, 2026-05-26).