# ADR-005: Decoupled DiLoCo over serverless training systems

**Status**: Accepted
**Date**: 2026-05-26
**Wave**: 13

## Context

The brief's V2 clause says:

> take that and combine it with diloco (decoupled, open, any variant of diloco)

The user expanded on 2026-05-26: *"Decoupled DiLoCo (so that we can leverage modal or huggingface-jobs or other serverless training systems). we need this both on the dataset generation and the RL orchestration side of things."*

Spike 008 wrote `composer_replication.diloco.make_diloco_outer_loop` (wraps `torchft.local_sgd.DiLoCo`), but that is a single-process API. To realize "Decoupled DiLoCo across serverless executors" we need:

1. An abstraction layer that lets the framework launch N replicas on different serverless backends (Modal, HF Jobs, SageMaker, etc.) without per-backend code in the trainer.
2. A communication primitive that doesn't require inter-job NCCL/RDMA. Most serverless executors don't expose that, and DiLoCo doesn't need it: sync happens once per ~500-1000 inner steps.

## Options considered

`docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md` audited 6 executors:

| Executor | Inter-job network | Cold start | $/A100·hr | $/H100·hr |
|---|---|---|---|---|
| Modal | yes (cluster mode) | ~30s | $1.95 | $5.50 |
| HuggingFace Jobs | no | ~60s | $4.18 | $9.50 |
| AWS SageMaker training | yes (warm pools) | ~3-5min | ~$3.06 | ~$8.50 |
| GCP Vertex AI | yes (cluster) | ~5-10min | ~$3.67 | ~$10 |
| Azure ML | yes (cluster) | ~5-10min | ~$3.67 | ~$10 |
| k8s + Volcano/KubeRay | yes (cluster IP) | ~30-90s | (BYO) | (BYO) |

Most expose a "spin up a job, run a script" interface. Few expose inter-job networking out of the box; where it is available, it requires an explicit cluster mode (extra cost + config).
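To make requirement 2 concrete, here is a minimal sketch of a rendezvous through shared storage instead of a network connection. It assumes a local directory stands in for the bucket and plain Python lists stand in for pseudo-gradient tensors. The names `publish` and `gather_mean` and the `round_{k}/replica_{i}.json` layout are hypothetical, chosen for illustration only; they are not the project's API.

```python
import json
import time
from pathlib import Path


def publish(root: Path, round_idx: int, replica_id: int, pseudo_grad: list) -> None:
    """Write one replica's pseudo-gradient for one outer round (hypothetical layout)."""
    round_dir = root / f"round_{round_idx}"
    round_dir.mkdir(parents=True, exist_ok=True)
    tmp = round_dir / f"replica_{replica_id}.json.tmp"
    tmp.write_text(json.dumps(pseudo_grad))
    # Rename last so a reader on the local filesystem never sees a partial file.
    tmp.replace(round_dir / f"replica_{replica_id}.json")


def gather_mean(root: Path, round_idx: int, n_replicas: int,
                timeout_s: float = 60.0, poll_s: float = 0.05) -> list:
    """Poll until every replica has published, then average element-wise."""
    round_dir = root / f"round_{round_idx}"
    deadline = time.monotonic() + timeout_s
    while True:
        paths = sorted(round_dir.glob("replica_*.json")) if round_dir.is_dir() else []
        if len(paths) >= n_replicas:
            break
        if time.monotonic() > deadline:
            raise TimeoutError(f"round {round_idx}: {len(paths)}/{n_replicas} replicas published")
        time.sleep(poll_s)
    grads = [json.loads(p.read_text()) for p in paths]
    # zip(*grads) groups the i-th element of every replica's vector.
    return [sum(vals) / len(grads) for vals in zip(*grads)]


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as d:
        root = Path(d)
        publish(root, round_idx=0, replica_id=0, pseudo_grad=[1.0, 2.0])
        publish(root, round_idx=0, replica_id=1, pseudo_grad=[3.0, 6.0])
        print(gather_mean(root, round_idx=0, n_replicas=2))
```

The only coordination is "write your file, then wait until N files exist", so nothing here depends on the replicas being able to reach each other. A production version would additionally need tensor serialization and cleanup of old rounds, which this sketch omits.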
## Decision

**Adopt object-store rendezvous as the default DiLoCo communication primitive across all serverless executors.**

Specifically:

- `composer_replication.diloco.serverless` package
- `class ServerlessExecutor(Protocol)`: uniform interface with `launch_replicas / poll / stream_logs / cancel / collect / backend_name / supports_inter_replica_network`
- `class ObjectStoreAllReduce`: fsspec-backed pseudo-gradient exchange using s3:// / gs:// / az:// / hf:// / file:// (single code path, swappable bucket)
- v0 concrete adapters: `ModalExecutor` and `HFJobsExecutor`
- v0.1+ adapters: `RunPodExecutor`, `SageMakerExecutor`, `K8sExecutor`

### Why object-store rendezvous (not NCCL across jobs)

The DiLoCo paper (arXiv:2311.08105) shows the outer-loop sync happens **once per H = 500-1000 inner steps**, equivalent to ~10-30 minutes of wall-clock at typical post-training step rates. For a 1B-param model in bf16:

- Pseudo-gradient size: ~2 GB per replica per outer round
- Sync frequency: ~once per 30 minutes
- Therefore: ~2 GB × N_replicas, every ~30 min, durably written to object storage with a single `PutObject` per replica + one `GetObject` per other replica

Even with N=8 replicas, that's 8 × 2 GB = 16 GB written plus 8 × 14 GB = 112 GB read (each replica fetches the 7 other pseudo-gradients): 128 GB in total, spread over 30 minutes, or ~70 MB/s aggregate. **S3 handles this without breaking a sweat**, and S3 cross-job reads cost ~$0.0001 per GET. Total inter-replica communication cost: ~$0.05 per outer round. **Negligible compared to GPU spend.**

By contrast, cross-job NCCL would require:

- Inter-job networking (mostly unavailable on serverless)
- Sustained low-latency connections (vs. burst IO once per 30 min)
- Backend-specific cluster mode (Modal-only on some platforms)

Object-store rendezvous decouples the algorithm from the executor and matches DiLoCo's actual communication profile.

### Why Modal + HF Jobs as the v0 executors

- **Modal**: best dev velocity, sub-minute cold start, mature Python SDK, user already has CLI configured.
  Gives us a fast iteration loop for the serverless layer.
- **HuggingFace Jobs**: zero acquisition cost (HF token already wired up), brand-aligned with the framework's HF-native posture, ~$4.18/A100·hr. Not the cheapest, but the right "default executor for HF users."

These two cover the spectrum of "fast for development" + "natural HF integration." Other executors are documented and stubbed but not implemented in v0.

## Consequences

### Accepted

- New package `composer_replication.diloco.serverless`:
  - `executor.py`: `ServerlessExecutor` Protocol + base class
  - `allreduce.py`: `ObjectStoreAllReduce` mockManager that drops into `make_diloco_outer_loop` with no changes to the existing wrapper
  - `modal.py`: `ModalExecutor` (~150 LOC)
  - `hf_jobs.py`: `HFJobsExecutor` (~150 LOC)
  - `replica_entrypoint.py`: the script each replica runs (loaded from HF Datasets / object store)
- New optional dependency `[serverless]` extra: `pip install -e .[serverless]` pulls `fsspec`, `s3fs`, `huggingface_hub` (already a transitive dep), and `modal-client` (only if user opts in to Modal).
- Smoke test in `spikes/009-decoupled-diloco/` (new, deferred; not part of this wave's commit): local-only `file://` rendezvous between two Python processes in `tests/test_serverless_local.py`. Multi-cloud test is post-replication.

### Open / deferred

- **Real serverless smoke**: spinning up 2 Modal containers + S3 rendezvous + verifying both converge. Deferred to a small-budget post-Wave-13 spike ($2-5 estimated). Not blocking for the v0 packaging.
- **HF Jobs API stability**: HF Jobs is a relatively new product. The recon flagged "API may evolve through 2026"; we pin to a specific `huggingface_hub` minor and bump deliberately.

### Trade-offs explicitly accepted

- We do NOT use Modal's cluster/RDMA mode in v0. That gives sub-second cross-job NCCL but costs more and is Modal-only. Object-store rendezvous is the right default; users on Modal who want faster sync can override.
- We do NOT support job-internal multi-GPU training in this layer. The serverless layer is for **inter-replica** sync; intra-replica training uses the existing `make_diloco_outer_loop` (which itself can wrap multi-GPU FSDP via torchft).

## Source

`docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md` (2026-05-26 subagent recon, primary-sourced from each provider's official docs + pricing pages).
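
For reference, the `ServerlessExecutor` interface named in the Decision could look roughly like the sketch below. Only the member names (`launch_replicas`, `poll`, `stream_logs`, `cancel`, `collect`, `backend_name`, `supports_inter_replica_network`) come from this ADR. The argument lists, return types, and status strings are assumptions, and `InProcessExecutor` is a hypothetical toy backend included only to show that a class can satisfy the protocol structurally.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Protocol, runtime_checkable


@runtime_checkable
class ServerlessExecutor(Protocol):
    """Uniform interface; signatures below are assumed, not taken from the codebase."""

    backend_name: str
    supports_inter_replica_network: bool

    def launch_replicas(self, n: int, entrypoint: str) -> List[str]: ...
    def poll(self, job_id: str) -> str: ...
    def stream_logs(self, job_id: str) -> Iterator[str]: ...
    def cancel(self, job_id: str) -> None: ...
    def collect(self, job_id: str) -> Dict[str, str]: ...


@dataclass
class InProcessExecutor:
    """Hypothetical toy backend: records jobs in memory and runs nothing."""

    backend_name: str = "in-process"
    supports_inter_replica_network: bool = False
    _jobs: Dict[str, str] = field(default_factory=dict)

    def launch_replicas(self, n: int, entrypoint: str) -> List[str]:
        job_ids = [f"job-{i}" for i in range(n)]
        for job_id in job_ids:
            # A real adapter would submit `entrypoint` to its backend here.
            self._jobs[job_id] = "succeeded"
        return job_ids

    def poll(self, job_id: str) -> str:
        return self._jobs[job_id]

    def stream_logs(self, job_id: str) -> Iterator[str]:
        yield f"[{job_id}] status={self._jobs[job_id]}"

    def cancel(self, job_id: str) -> None:
        self._jobs[job_id] = "cancelled"

    def collect(self, job_id: str) -> Dict[str, str]:
        return {"job_id": job_id, "status": self._jobs[job_id]}


if __name__ == "__main__":
    executor = InProcessExecutor()
    # Structural check: no inheritance from ServerlessExecutor is needed.
    print(isinstance(executor, ServerlessExecutor))
    print(executor.launch_replicas(2, "replica_entrypoint.py"))
```

Using a `Protocol` rather than an abstract base class keeps the trainer free of per-backend imports: any adapter exposing these members type-checks, whether or not it subclasses anything.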