# ADR-005: Decoupled DiLoCo over serverless training systems
**Status**: Accepted
**Date**: 2026-05-26
**Wave**: 13
## Context
The brief's V2 clause says:

> take that and combine it with diloco (decoupled, open, any variant of diloco)

On 2026-05-26 the user expanded on this: *"Decoupled DiLoCo (so that we can leverage
modal or huggingface-jobs or other serverless training systems). we need
this both on the dataset generation and the RL orchestration side of
things."*
Spike 008 wrote `composer_replication.diloco.make_diloco_outer_loop`
(wraps `torchft.local_sgd.DiLoCo`) but that's a single-process API. To
realize "Decoupled DiLoCo across serverless executors" we need:
1. An abstraction layer that lets the framework launch N replicas on
different serverless backends (Modal, HF Jobs, SageMaker, etc.) without
per-backend code in the trainer.
2. A communication primitive that doesn't require inter-job NCCL/RDMA
(most serverless executors don't expose that, and DiLoCo doesn't need
it — sync happens once per ~500-1000 inner steps).
## Options considered
`docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md` audited 6 executors:
| Executor | Inter-job network | Cold start | $/A100·hr | $/H100·hr |
|---|---|---|---|---|
| Modal | yes (cluster mode) | ~30s | $1.95 | $5.50 |
| HuggingFace Jobs | no | ~60s | $4.18 | $9.50 |
| AWS SageMaker training | yes (warm pools) | ~3-5min | ~$3.06 | ~$8.50 |
| GCP Vertex AI | yes (cluster) | ~5-10min | ~$3.67 | ~$10 |
| Azure ML | yes (cluster) | ~5-10min | ~$3.67 | ~$10 |
| k8s + Volcano/KubeRay | yes (cluster IP) | ~30-90s | (BYO) | (BYO) |
Most expose a "spin up a job, run a script" interface. Inter-job
networking, where it is available at all, requires an explicit cluster mode
or cluster-level setup (extra cost + config); HuggingFace Jobs offers none.
## Decision
**Adopt object-store rendezvous as the default DiLoCo communication
primitive across all serverless executors.** Specifically:
- `composer_replication.diloco.serverless` package
  - `class ServerlessExecutor(Protocol)`: uniform interface with
    `launch_replicas`, `poll`, `stream_logs`, `cancel`, `collect`,
    `backend_name`, and `supports_inter_replica_network`
  - `class ObjectStoreAllReduce`: fsspec-backed pseudo-gradient exchange
    over `s3://`, `gs://`, `az://`, `hf://`, or `file://`; a single code
    path with a swappable bucket
  - v0 concrete adapters: `ModalExecutor` and `HFJobsExecutor`
  - v0.1+ adapters: `RunPodExecutor`, `SageMakerExecutor`, `K8sExecutor`
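A minimal sketch of what the uniform interface could look like. Only the member names come from the list above; every signature, the `ReplicaHandle` type, and the toy `InProcessExecutor` are assumptions made for illustration and are not the package's actual API.

```python
from dataclasses import dataclass
from typing import Iterator, Protocol, Sequence, runtime_checkable


@dataclass(frozen=True)
class ReplicaHandle:
    """Hypothetical handle identifying one launched replica job."""
    replica_id: int
    job_id: str


@runtime_checkable
class ServerlessExecutor(Protocol):
    """Member names follow the ADR; all signatures are assumed."""

    backend_name: str
    supports_inter_replica_network: bool

    def launch_replicas(
        self, n: int, entrypoint: str, env: dict[str, str]
    ) -> Sequence[ReplicaHandle]: ...
    def poll(self, handle: ReplicaHandle) -> str: ...
    def stream_logs(self, handle: ReplicaHandle) -> Iterator[str]: ...
    def cancel(self, handle: ReplicaHandle) -> None: ...
    def collect(self, handle: ReplicaHandle) -> dict[str, str]: ...


class InProcessExecutor:
    """Toy backend: 'launches' replicas by recording them in memory."""

    backend_name = "in-process"
    supports_inter_replica_network = False

    def __init__(self) -> None:
        self._status: dict[str, str] = {}

    def launch_replicas(self, n, entrypoint, env):
        handles = [ReplicaHandle(i, f"local-{i}") for i in range(n)]
        for handle in handles:
            self._status[handle.job_id] = "running"
        return handles

    def poll(self, handle):
        return self._status[handle.job_id]

    def stream_logs(self, handle):
        yield f"[{handle.job_id}] started"

    def cancel(self, handle):
        self._status[handle.job_id] = "cancelled"

    def collect(self, handle):
        return {"job_id": handle.job_id, "status": self._status[handle.job_id]}
```

Because the interface is a structural `Protocol`, a backend adapter does not need to inherit from anything: `isinstance(InProcessExecutor(), ServerlessExecutor)` holds as long as the members exist, which is what keeps per-backend code out of the trainer.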
### Why object-store rendezvous (not NCCL across jobs)
The DiLoCo paper (arXiv:2311.08105) shows that the outer loop syncs **once per
H = 500-1000 inner steps**, roughly 10-30 minutes of wall-clock time at
typical post-training step rates. For a 1B-parameter model in bf16:
- Pseudo-gradient size: ~2 GB per replica per outer round
- Sync frequency: ~once per 30 minutes
- Therefore: ~2 GB × N_replicas, every ~30 min, durably written to object
storage with a single `PutObject` per replica + `GetObject` per other
replica
Even with N=8 replicas, each round writes 16 GB (8 × 2 GB) and reads 112 GB
(each replica fetches the other 7, so 14 GB × 8), for 128 GB of total
transfer spread over 30 minutes = ~70 MB/s aggregate. **That is well within
what a single S3 bucket sustains**, and S3 GET requests are billed at a
fraction of a cent per thousand. Estimated inter-replica communication
cost: ~$0.05 per outer round.
**Negligible compared to GPU spend.**
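The traffic estimate can be reproduced with a few lines of arithmetic. This is a sketch under the assumptions stated in the text (1B parameters, 2 bytes per parameter in bf16, every replica reads every other replica's pseudo-gradient, one round every 30 minutes); the function names are illustrative, and the GPU price is the Modal A100 figure from the executor table.

```python
def outer_round_traffic(
    params: float, bytes_per_param: int, n_replicas: int, round_seconds: float
) -> dict[str, float]:
    """Object-store traffic for one DiLoCo outer round, in decimal GB."""
    payload_gb = params * bytes_per_param / 1e9          # one pseudo-gradient
    write_gb = payload_gb * n_replicas                   # one upload per replica
    read_gb = payload_gb * (n_replicas - 1) * n_replicas # each reads all others
    total_gb = write_gb + read_gb
    return {
        "payload_gb": payload_gb,
        "write_gb": write_gb,
        "read_gb": read_gb,
        "total_gb": total_gb,
        "aggregate_mb_per_s": total_gb * 1000 / round_seconds,
    }


def gpu_spend_usd(n_replicas: int, hours: float, usd_per_gpu_hour: float) -> float:
    """GPU cost for the same window, assuming one GPU per replica."""
    return n_replicas * hours * usd_per_gpu_hour


traffic = outer_round_traffic(1e9, 2, 8, 30 * 60)
gpu_usd = gpu_spend_usd(8, 0.5, 1.95)  # $1.95/A100·hr from the table above
print(traffic, gpu_usd)
```

For N=8 this gives a 2 GB payload, 16 GB written, 112 GB read, 128 GB in total (about 71 MB/s aggregate), against roughly $7.80 of A100 time for the same half hour.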
By contrast, cross-job NCCL would require:
- Inter-job networking (mostly unavailable on serverless)
- Sustained low-latency connections (vs. burst-IO once per 30min)
- Backend-specific cluster modes (each platform's is its own, e.g. Modal's is Modal-only)
Object-store rendezvous decouples the algorithm from the executor and
matches DiLoCo's actual communication profile.
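The rendezvous pattern itself is small. The sketch below is an assumption-laden illustration, not the `ObjectStoreAllReduce` implementation: it uses a local directory as a stand-in for a `file://` bucket, plain lists of floats instead of tensors, and hypothetical helper names (`publish`, `gather_mean`). Each replica writes its pseudo-gradient under a per-round prefix, then polls until all replicas have published and averages the results.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def _round_dir(root: Path, round_idx: int) -> Path:
    path = root / f"round-{round_idx:06d}"
    path.mkdir(parents=True, exist_ok=True)
    return path


def publish(root: Path, round_idx: int, replica_id: int, pseudo_grad: list[float]) -> None:
    """Write this replica's pseudo-gradient, then rename it into place.

    The rename keeps readers from seeing a half-written file on a local
    filesystem; on S3 a single PutObject is already atomic.
    """
    round_dir = _round_dir(root, round_idx)
    tmp = round_dir / f"replica-{replica_id}.json.tmp"
    tmp.write_text(json.dumps(pseudo_grad))
    os.replace(tmp, round_dir / f"replica-{replica_id}.json")


def gather_mean(
    root: Path, round_idx: int, n_replicas: int,
    timeout_s: float = 60.0, poll_s: float = 0.05,
) -> list[float]:
    """Poll until all replicas have published, then return the element-wise mean."""
    round_dir = _round_dir(root, round_idx)
    deadline = time.monotonic() + timeout_s
    while True:
        files = sorted(round_dir.glob("replica-*.json"))  # skips *.json.tmp
        if len(files) >= n_replicas:
            break
        if time.monotonic() > deadline:
            raise TimeoutError(f"round {round_idx}: {len(files)}/{n_replicas} replicas")
        time.sleep(poll_s)
    grads = [json.loads(f.read_text()) for f in files]
    return [sum(column) / len(grads) for column in zip(*grads)]


# Two replicas, simulated sequentially in one process.
with tempfile.TemporaryDirectory() as bucket:
    root = Path(bucket)
    publish(root, 0, 0, [1.0, 2.0])
    publish(root, 0, 1, [3.0, 6.0])
    avg = gather_mean(root, 0, n_replicas=2)
```

Keying objects by round index means a slow replica never mixes data from different outer rounds, and polling at coarse intervals is acceptable because the sync happens only once per outer round.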
### Why Modal + HF Jobs as the v0 executors
- **Modal**: best dev velocity, sub-minute cold start, mature Python SDK,
user already has CLI configured. Gives us a fast iteration loop for the
serverless layer.
- **HuggingFace Jobs**: zero acquisition cost (HF token already wired up),
brand-aligned with the framework's HF-native posture, ~$4.18/A100·hr.
Not the cheapest, but the right "default executor for HF users."
These two cover the spectrum of "fast for development" + "natural HF
integration." Other executors are documented and stubbed but not
implemented in v0.
## Consequences
### Accepted
- New package `composer_replication.diloco.serverless`:
  - `executor.py`: `ServerlessExecutor` Protocol + base class
  - `allreduce.py`: `ObjectStoreAllReduce`, a mock manager that drops into
    `make_diloco_outer_loop` with no changes to the existing wrapper
  - `modal.py`: `ModalExecutor` (~150 LOC)
  - `hf_jobs.py`: `HFJobsExecutor` (~150 LOC)
  - `replica_entrypoint.py`: the script each replica runs (loaded from
    HF Datasets / object store)
- New optional dependency `[serverless]` extra: `pip install -e .[serverless]`
pulls `fsspec`, `s3fs`, `huggingface_hub` (already a transitive dep), and
`modal-client` (only if user opts in to Modal).
- Smoke test in `spikes/009-decoupled-diloco/` (new, deferred; not part
of this wave's commit): a local-only `file://` rendezvous between two
Python processes, in `tests/test_serverless_local.py`. The multi-cloud test
is deferred until after replication.
### Open / deferred
- **Real serverless smoke**: spinning up 2 Modal containers with S3
rendezvous and verifying that both converge. Deferred to a small-budget
post-Wave-13 spike ($2-5 estimated). Not blocking for the v0 packaging.
- **HF Jobs API stability**: HF Jobs is a relatively new product. The
recon flagged "API may evolve through 2026"; we pin to a specific
`huggingface_hub` minor and bump deliberately.
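One way to make the pin visible at runtime, in addition to the version constraint in packaging metadata, is a small guard that the HF Jobs adapter could call before launching anything. This is a sketch only: the helper names are hypothetical, and the ADR does not name the pinned minor version, so it is left as a parameter.

```python
from importlib.metadata import PackageNotFoundError, version


def matches_pinned_minor(installed: str, pinned_minor: str) -> bool:
    """True if `installed` (e.g. '1.2.3') falls within `pinned_minor` (e.g. '1.2')."""
    installed_parts = installed.split(".")
    pinned_parts = pinned_minor.split(".")
    return installed_parts[: len(pinned_parts)] == pinned_parts


def hub_matches_pin(pinned_minor: str) -> bool:
    """Check the installed huggingface_hub against the pinned minor series."""
    try:
        installed = version("huggingface_hub")
    except PackageNotFoundError:
        return False  # the optional dependency is not installed
    return matches_pinned_minor(installed, pinned_minor)
```

Comparing dot-separated components rather than string prefixes avoids treating a `1.20.x` release as part of a `1.2` pin.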
### Trade-offs explicitly accepted
- We do NOT use Modal's cluster/RDMA mode in v0. That gives sub-second
cross-job NCCL but costs more and is Modal-only. Object-store rendezvous
is the right default; users on Modal who want faster sync can override.
- We do NOT support job-internal multi-GPU training in this layer. The
serverless layer is for **inter-replica** sync; intra-replica training
uses the existing `make_diloco_outer_loop` (which itself can wrap
multi-GPU FSDP via torchft).
## Source
`docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md` (2026-05-26 subagent
recon, primary-sourced from each provider's official docs + pricing pages).