# ADR-003: DiLoCo implementation choice for Spike 008

**Status**: Accepted

**Date**: 2026-05-26

**Wave**: Phase 4 (deep work loop)
## Context

Spike 008 closes V2 of the vision validation: DiLoCo was tagged "deferred to
v0.2" in Wave 5, but the original brief said "combine with DiLoCo." We want a
real working integration, not a hand-rolled toy.

The integration target: take the pseudo-gradient `δ = θ_local − θ_initial`
after N inner steps, then apply a Nesterov-momentum outer step across replicas.
We need a PyTorch-compatible reference implementation that runs in a single
process for unit tests and scales out on torch.distributed when we eventually
run real multi-replica training.
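The integration target can be sketched in a few lines of plain Python, with no torch dependency so the arithmetic stays visible. This is a minimal illustration and not torchft's implementation: the function name `outer_step`, the flat-list parameter layout, and the values `lr=0.7` and `mu=0.9` are assumptions made for the example. The momentum update follows the same formulation as `torch.optim.SGD` with `nesterov=True` and zero dampening.

```python
from typing import List, Tuple


def outer_step(
    theta_initial: List[float],
    replica_thetas: List[List[float]],
    momentum_buf: List[float],
    lr: float = 0.7,   # illustrative outer learning rate (assumption)
    mu: float = 0.9,   # illustrative Nesterov momentum (assumption)
) -> Tuple[List[float], List[float]]:
    """Apply one DiLoCo-style outer step to a flat list of parameters."""
    n = len(replica_thetas)
    new_theta, new_buf = [], []
    for i, t0 in enumerate(theta_initial):
        # Pseudo-gradient in this ADR's convention, averaged over replicas.
        delta = sum(rep[i] - t0 for rep in replica_thetas) / n
        # A descent-style optimizer expects the negation of delta.
        g = -delta
        buf = mu * momentum_buf[i] + g   # momentum accumulation
        step = g + mu * buf              # Nesterov look-ahead term
        new_theta.append(t0 - lr * step)
        new_buf.append(buf)
    return new_theta, new_buf
```

As a sanity check on the sign handling, `lr=1.0` with `mu=0.0` reduces the outer step to plain parameter averaging across replicas.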
## Options considered

| Repo | License | Last commit | Maturity | Streaming variant? | Single-process testable? |
|---|---|---|---|---|---|
| `meta-pytorch/torchft` | BSD-3 | 2026-04-03 | Library (PyPI, prebuilt wheels, real test suite, Meta-maintained) | Yes (DiLoCo class IS the Streaming generalization; vanilla = single fragment) | Yes (verified via `MagicMock(Manager)` + `_DummyWork` pattern in their own tests) |
| `OpenDiLoCo` (PrimeIntellect) | Apache 2.0 | 2024 | README says "no longer maintained"; replaced by `prime` | Partial | Hivemind dependency complicates testing |
| `prime` / INTELLECT-1 (PrimeIntellect) | Apache 2.0 | 2025 | Production framework (`ElasticDeviceMesh` etc.) | Yes | Heavy harness; not single-process friendly |
| `diloco_simple` | **No LICENSE file** | 2024-05-31 | 8 commits ever; pedagogical | No | NCCL-locked |
| DeepMind original (Douillard et al. arXiv:2311.08105) | n/a | n/a | No public reference impl | n/a | n/a |

Source: `docs/research/DILOCO_RECONNAISSANCE.md` (subagent recon, 2026-05-26).
## Decision

**`meta-pytorch/torchft`, specifically `torchft.local_sgd.DiLoCo`** (BSD-3,
active, single-process testable).

Rationale:

1. **Library, not research code.** Proper packaging on PyPI with prebuilt
   wheels (`pip install torchft-nightly`), real test suite, version history,
   maintained by Meta. The other live candidates are research codebases that
   break on torch version bumps.
2. **Streaming DiLoCo is the generalization.** The `DiLoCo` class accepts
   `model_fragments` + `fragment_sync_delay` + `fragment_update_alpha`. Set
   `model_fragments=[model]` (single fragment, full-model sync) for vanilla
   DiLoCo. Add fragments + per-fragment delays for Streaming. We don't have
   to choose at the API level; both modes are one parameter apart.
3. **Single-process unit-testable.** torchft's own tests use
   `MagicMock(Manager)` + `_DummyWork` to bypass NCCL. We can do the same with
   a shared-buffer mock allreduce that does real averaging across two
   in-process replicas. The recon doc records this as a verified working
   pattern.
4. **Pseudo-gradient computation is in `_save_grads` (line 324) and
   `perform_sync` (line 423).** These are direct extension points: we can
   subclass or monkey-patch them to compose with our Composer trainer.
### Risks accepted (with mitigations)

| Risk | Mitigation |
|---|---|
| **Sign convention mismatch**: torchft computes `θ_initial − θ_local` (negation of our spec) | Explicit unit test: assert outer step direction matches DiLoCo paper sign. Document the convention in our `outer_optimizer.py`. |
| **Wheel brittleness for nightly** | Pin a specific dated nightly slug in our pyproject.toml; bump deliberately. |
| **`torch>=2.7` requirement** | Our existing eidolon venv already satisfies this (verified). |
| **`fragment_sync_delay > 0` requires CUDA streams** | Spike 008 uses `fragment_sync_delay=0` (vanilla DiLoCo) for the smoke test. Streaming with non-zero delay deferred to v0.2 (post-replication). |
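The sign-convention mitigation could take roughly the following shape. This is a hedged sketch in plain Python and not the test that ships in the wrapper: the helper names `pseudo_grad_spec`, `pseudo_grad_descent`, and `sgd_step` are hypothetical, and a real test would exercise the values torchft actually stores. The idea is that a descent-style outer optimizer moves the global parameters toward the locally trained ones only when it is fed `θ_initial − θ_local`.

```python
def pseudo_grad_spec(theta_initial, theta_local):
    """This ADR's spec convention: theta_local - theta_initial."""
    return [loc - init for init, loc in zip(theta_initial, theta_local)]


def pseudo_grad_descent(theta_initial, theta_local):
    """Negated convention (theta_initial - theta_local), usable as a gradient."""
    return [init - loc for init, loc in zip(theta_initial, theta_local)]


def sgd_step(theta, grad, lr):
    """Plain descent update: theta - lr * grad."""
    return [t - lr * g for t, g in zip(theta, grad)]


def sq_dist(a, b):
    """Squared Euclidean distance between two flat parameter lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def test_outer_step_direction():
    theta_initial = [1.0, -2.0, 0.5]
    theta_local = [0.0, -1.0, 0.5]
    start = sq_dist(theta_initial, theta_local)

    # Correct pairing: descent optimizer + (initial - local) moves toward local.
    good = sgd_step(theta_initial, pseudo_grad_descent(theta_initial, theta_local), 0.5)
    assert sq_dist(good, theta_local) < start

    # Flipped sign: the same optimizer now moves away from local.
    bad = sgd_step(theta_initial, pseudo_grad_spec(theta_initial, theta_local), 0.5)
    assert sq_dist(bad, theta_local) > start
```

Asserting on direction instead of exact values keeps the check valid across outer learning rates, while still failing if the sign is ever inverted.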
## Consequences

### Accepted

- Spike 008 imports `torchft.local_sgd.DiLoCo` and runs the recon doc's
  ready-to-paste pytest pattern as the smoke test:
  - 2 replicas, 4 inner steps, 2 outer rounds on a TinyMLP
  - shared-buffer mock allreduce (no NCCL)
  - assertions: replica equality after sync, params actually moved, Nesterov
    state populated, sync count matches expected
- `composer_replication.diloco` package wraps `torchft.local_sgd.DiLoCo`
  with our trainer's hooks. We DO NOT fork torchft; we depend on it as a
  versioned wheel.
- Our integration is "vanilla DiLoCo" (single fragment, full-model sync) for
  v0.1. Streaming DiLoCo is one configuration flag away in v0.2.
- The sign-convention mismatch is **made explicit** in our wrapper code with
  a unit test that catches a sign flip if torchft ever inverts it.
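The shape of the smoke test can be illustrated without torchft or NCCL. The sketch below is a plain-Python stand-in, not the recon doc's pytest pattern and not torchft's API: `SharedBufferAllReduce`, `Replica`, and `run_smoke` are hypothetical names, a two-parameter quadratic replaces the TinyMLP, and the learning rates are illustrative. It keeps the same four assertions: replica equality after sync, parameters moved, momentum state populated, and sync count as expected.

```python
class SharedBufferAllReduce:
    """In-process stand-in for allreduce: averages one contribution per replica."""

    def __init__(self, world_size):
        self.world_size = world_size
        self.calls = 0

    def average(self, contributions):
        assert len(contributions) == self.world_size
        self.calls += 1
        return [sum(vals) / self.world_size for vals in zip(*contributions)]


class Replica:
    """One worker holding its own parameters and outer momentum state."""

    def __init__(self, theta, target):
        self.theta = list(theta)
        self.target = target          # each replica sees different "data"
        self.momentum = [0.0] * len(theta)

    def local_pseudo_grad(self, steps, lr=0.1):
        # Inner loop: gradient descent on 0.5 * (theta - target)^2.
        local = list(self.theta)
        for _ in range(steps):
            local = [t - lr * (t - tgt) for t, tgt in zip(local, self.target)]
        # Descent-style pseudo-gradient: theta_initial - theta_local.
        return [t0 - loc for t0, loc in zip(self.theta, local)]

    def outer_step(self, avg_grad, lr=0.7, mu=0.9):
        # Nesterov-momentum update applied to the averaged pseudo-gradient.
        self.momentum = [mu * m + g for m, g in zip(self.momentum, avg_grad)]
        self.theta = [
            t - lr * (g + mu * m)
            for t, g, m in zip(self.theta, avg_grad, self.momentum)
        ]


def run_smoke(rounds=2, inner=4):
    start = [0.0, 0.0]
    replicas = [Replica(start, [1.0, 1.0]), Replica(start, [3.0, -2.0])]
    allreduce = SharedBufferAllReduce(world_size=2)
    for _ in range(rounds):
        grads = [r.local_pseudo_grad(inner) for r in replicas]
        avg = allreduce.average(grads)
        for r in replicas:
            r.outer_step(avg)
    return start, replicas, allreduce


start, replicas, allreduce = run_smoke()
assert replicas[0].theta == replicas[1].theta         # replica equality after sync
assert replicas[0].theta != start                     # params actually moved
assert any(m != 0.0 for m in replicas[0].momentum)    # Nesterov state populated
assert allreduce.calls == 2                           # one sync per outer round
```

Each replica applies the outer step to its own copy of the parameters, so the equality assertion checks that both received the same averaged pseudo-gradient.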
### Rejected paths

- **Roll our own DiLoCo.** Tempting (the algorithm is short), but the test
  surface for distributed correctness is large; reusing a Meta-maintained
  library cuts the audit burden.
- **`diloco_simple`.** Disqualified by the missing LICENSE file alone.
- **`prime` / INTELLECT-1.** Right tool for production multi-node runs,
  wrong tool for a single-process unit test.
## Source

`docs/research/DILOCO_RECONNAISSANCE.md` (subagent recon, primary-sourced
from torchft repo cloned + read locally, 2026-05-26).