# ADR-005 — Decoupled DiLoCo over serverless training systems
**Status**: Accepted
**Date**: 2026-05-26
**Wave**: 13
## Context
The brief's V2 clause says:
> take that and combine it with diloco (decoupled, open, any variant of diloco)
The user expanded 2026-05-26: *"Decoupled DiLoCo (so that we can leverage
modal or huggingface-jobs or other serverless training systems). we need
this both on the dataset generation and the RL orchestration side of
things."*
Spike 008 wrote `composer_replication.diloco.make_diloco_outer_loop`
(wraps `torchft.local_sgd.DiLoCo`) but that's a single-process API. To
realize "Decoupled DiLoCo across serverless executors" we need:
1. An abstraction layer that lets the framework launch N replicas on
different serverless backends (Modal, HF Jobs, SageMaker, etc.) without
per-backend code in the trainer.
2. A communication primitive that doesn't require inter-job NCCL/RDMA
(most serverless executors don't expose that, and DiLoCo doesn't need
it — sync happens once per ~500-1000 inner steps).
## Options considered
`docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md` audited 6 executors:
| Executor | Inter-job network | Cold start | $/A100·hr | $/H100·hr |
|---|---|---|---|---|
| Modal | yes (cluster mode) | ~30s | $1.95 | $5.50 |
| HuggingFace Jobs | no | ~60s | $4.18 | $9.50 |
| AWS SageMaker training | yes (warm pools) | ~3-5min | ~$3.06 | ~$8.50 |
| GCP Vertex AI | yes (cluster) | ~5-10min | ~$3.67 | ~$10 |
| Azure ML | yes (cluster) | ~5-10min | ~$3.67 | ~$10 |
| k8s + Volcano/KubeRay | yes (cluster IP) | ~30-90s | (BYO) | (BYO) |
Most expose a "spin up a job, run a script" interface. Few expose inter-job
networking; the ones that do require explicit cluster mode (extra cost +
config).
## Decision
**Adopt object-store rendezvous as the default DiLoCo communication
primitive across all serverless executors.** Specifically:
- `composer_replication.diloco.serverless` package
- `class ServerlessExecutor(Protocol)` — uniform interface with
`launch_replicas / poll / stream_logs / cancel / collect /
backend_name / supports_inter_replica_network`
- `class ObjectStoreAllReduce` — fsspec-backed pseudo-gradient exchange
using s3:// / gs:// / az:// / hf:// / file:// — single code path, swappable
bucket
- v0 concrete adapters: `ModalExecutor` and `HFJobsExecutor`
- v0.1+ adapters: `RunPodExecutor`, `SageMakerExecutor`, `K8sExecutor`
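As an illustration of the uniform interface listed above, here is a minimal sketch of what the `ServerlessExecutor` protocol could look like. Only the member names come from this ADR. Every signature, the `ReplicaHandle` dataclass, and the toy `InProcessExecutor` are assumptions made for the example, not the package's actual API.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterator, Protocol, Sequence, runtime_checkable


@dataclass(frozen=True)
class ReplicaHandle:
    """Hypothetical opaque handle for one launched replica."""
    replica_id: int
    job_id: str


@runtime_checkable
class ServerlessExecutor(Protocol):
    # Member names follow the ADR; all signatures are assumed for illustration.
    backend_name: str
    supports_inter_replica_network: bool

    def launch_replicas(
        self, n: int, entrypoint: str, rendezvous_url: str
    ) -> Sequence[ReplicaHandle]: ...
    def poll(self, handle: ReplicaHandle) -> str: ...
    def stream_logs(self, handle: ReplicaHandle) -> Iterator[str]: ...
    def cancel(self, handle: ReplicaHandle) -> None: ...
    def collect(self, handle: ReplicaHandle) -> dict: ...


class InProcessExecutor:
    """Toy backend: 'launches' replicas as in-memory records, no real jobs."""
    backend_name = "in-process"
    supports_inter_replica_network = False

    def __init__(self) -> None:
        self._status: dict[str, str] = {}

    def launch_replicas(self, n, entrypoint, rendezvous_url):
        handles = [ReplicaHandle(i, f"local-{i}") for i in range(n)]
        for h in handles:
            self._status[h.job_id] = "running"
        return handles

    def poll(self, handle):
        return self._status[handle.job_id]

    def stream_logs(self, handle):
        yield f"[{handle.job_id}] started"

    def cancel(self, handle):
        self._status[handle.job_id] = "cancelled"

    def collect(self, handle):
        return {"replica_id": handle.replica_id,
                "status": self._status[handle.job_id]}
```

Because the protocol is structural, a backend adapter does not need to inherit from anything. Under these assumptions, the trainer would depend only on `ServerlessExecutor` and never import a backend SDK directly.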
### Why object-store rendezvous (not NCCL across jobs)
DiLoCo paper (arXiv:2311.08105) shows the outer-loop sync is **once per
H = 500-1000 inner steps**, equivalent to ~10-30 minutes of wall-clock at
typical post-training step rates. For a 1B-param model in bf16:
- Pseudo-gradient size: ~2 GB per replica per outer round
- Sync frequency: ~once per 30 minutes
- Therefore: ~2 GB × N_replicas, every ~30 min, durably written to object
storage with a single `PutObject` per replica + `GetObject` per other
replica
Even with N=8 replicas, that's 16 GB of writes (8 × 2 GB) plus 112 GB of reads
(each replica fetches 14 GB from its 7 peers), or 128 GB of total traffic
spread over 30 minutes = ~70 MB/s aggregate. **S3 handles this
without breaking a sweat**, and S3 cross-job reads cost ~$0.0001 per
GET. Total inter-replica communication cost: ~$0.05 per outer round.
**Negligible compared to GPU spend.**
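The traffic estimate can be reproduced with a few lines of arithmetic, counting writes and reads separately. The function name and its parameters are illustrative. The inputs (1B parameters, bf16, 8 replicas, a 30-minute round) are the ones assumed in this section.

```python
def outer_round_traffic(
    params: float, bytes_per_param: int, n_replicas: int, round_seconds: float
) -> dict:
    """Estimate object-store traffic for one DiLoCo outer round."""
    payload_gb = params * bytes_per_param / 1e9        # one pseudo-gradient
    write_gb = payload_gb * n_replicas                 # each replica uploads once
    # Each replica downloads the payloads of its (n_replicas - 1) peers.
    read_gb = payload_gb * (n_replicas - 1) * n_replicas
    total_gb = write_gb + read_gb
    return {
        "payload_gb": payload_gb,
        "write_gb": write_gb,
        "read_gb": read_gb,
        "total_gb": total_gb,
        "aggregate_mb_per_s": total_gb * 1000 / round_seconds,
    }


traffic = outer_round_traffic(
    params=1e9, bytes_per_param=2, n_replicas=8, round_seconds=30 * 60
)
print(traffic)
```

Read traffic grows quadratically with the replica count in this all-to-all scheme, so the same calculation is worth rerunning before scaling N well beyond 8.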
By contrast, cross-job NCCL would require:
- Inter-job networking (mostly unavailable on serverless)
- Sustained low-latency connections (vs. burst-IO once per 30min)
- Backend-specific cluster mode (extra cost and configuration, and not
portable across executors)
Object-store rendezvous decouples the algorithm from the executor and
matches DiLoCo's actual communication profile.
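The rendezvous pattern itself is small. The sketch below uses a local directory as a stand-in for the bucket and plain Python lists as a stand-in for tensors. The helper names `publish` and `gather_mean`, the key layout, and the JSON encoding are all hypothetical. The real `ObjectStoreAllReduce` is described here as fsspec-backed and may look quite different.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def publish(root: Path, round_idx: int, replica_id: int,
            pseudo_grad: list[float]) -> None:
    """Write this replica's pseudo-gradient for one outer round."""
    round_dir = root / f"round_{round_idx:06d}"
    round_dir.mkdir(parents=True, exist_ok=True)
    final = round_dir / f"replica_{replica_id}.json"
    tmp = final.with_suffix(".tmp")
    tmp.write_text(json.dumps(pseudo_grad))
    # Atomic rename so a polling reader never sees a half-written file.
    os.replace(tmp, final)


def gather_mean(root: Path, round_idx: int, n_replicas: int,
                timeout_s: float = 60.0, poll_s: float = 0.1) -> list[float]:
    """Poll until all replicas have published, then average element-wise."""
    round_dir = root / f"round_{round_idx:06d}"
    deadline = time.monotonic() + timeout_s
    while True:
        files = sorted(round_dir.glob("replica_*.json"))
        if len(files) >= n_replicas:
            break
        if time.monotonic() > deadline:
            raise TimeoutError(
                f"round {round_idx}: {len(files)}/{n_replicas} replicas published"
            )
        time.sleep(poll_s)
    grads = [json.loads(f.read_text()) for f in files]
    return [sum(col) / len(grads) for col in zip(*grads)]


# Two replicas exchanging a two-element pseudo-gradient for round 0.
with tempfile.TemporaryDirectory() as d:
    bucket = Path(d)
    publish(bucket, 0, 0, [1.0, 2.0])
    publish(bucket, 0, 1, [3.0, 6.0])
    print(gather_mean(bucket, 0, n_replicas=2))
```

Each replica calls `publish` and then `gather_mean` once per outer round, so no replica ever opens a connection to another. The timeout matters in practice: a preempted or crashed replica would otherwise stall every peer in the polling loop.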
### Why Modal + HF Jobs as the v0 executors
- **Modal**: best dev velocity, sub-minute cold start, mature Python SDK,
user already has CLI configured. Gives us a fast iteration loop for the
serverless layer.
- **HuggingFace Jobs**: zero acquisition cost (HF token already wired up),
brand-aligned with the framework's HF-native posture, ~$4.18/A100·hr.
Not the cheapest, but the right "default executor for HF users."
These two cover the spectrum of "fast for development" + "natural HF
integration." Other executors are documented and stubbed but not
implemented in v0.
## Consequences
### Accepted
- New package `composer_replication.diloco.serverless`:
  - `executor.py` — `ServerlessExecutor` Protocol + base class
  - `allreduce.py` — `ObjectStoreAllReduce`, a mock `Manager` that drops into
    `make_diloco_outer_loop` with no changes to the existing wrapper
  - `modal.py` — `ModalExecutor` (~150 LOC)
  - `hf_jobs.py` — `HFJobsExecutor` (~150 LOC)
  - `replica_entrypoint.py` — the script each replica runs (loaded from
    HF Datasets / object store)
- New optional dependency `[serverless]` extra: `pip install -e .[serverless]`
pulls `fsspec`, `s3fs`, `huggingface_hub` (already a transitive dep), and
`modal-client` (only if user opts in to Modal).
- Smoke test in `spikes/009-decoupled-diloco/` (new, deferred; not part
of this wave's commit): a local-only `file://` rendezvous between two
Python processes, in `tests/test_serverless_local.py`. The multi-cloud test
is post-replication.
### Open / deferred
- **Real serverless smoke**: spinning up 2 Modal containers + S3 rendezvous
+ verifying both converge. Deferred to a small-budget post-Wave-13 spike
($2-5 estimated). Not blocking for the v0 packaging.
- **HF Jobs API stability**: HF Jobs is a relatively new product. The
recon flagged "API may evolve through 2026"; we pin to a specific
`huggingface_hub` minor and bump deliberately.
### Trade-offs explicitly accepted
- We do NOT use Modal's cluster/RDMA mode in v0. That gives sub-second
cross-job NCCL but costs more and is Modal-only. Object-store rendezvous
is the right default; users on Modal who want faster sync can override.
- We do NOT support job-internal multi-GPU training in this layer. The
serverless layer is for **inter-replica** sync; intra-replica training
uses the existing `make_diloco_outer_loop` (which itself can wrap
multi-GPU FSDP via torchft).
## Source
`docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md` (2026-05-26 subagent
recon, primary-sourced from each provider's official docs + pricing pages).