
ADR-005 — Decoupled DiLoCo over serverless training systems

Status: Accepted
Date: 2026-05-26
Wave: 13

Context

The brief's V2 clause says:

take that and combine it with diloco (decoupled, open, any variant of diloco)

The user expanded on this on 2026-05-26: "Decoupled DiLoCo (so that we can leverage modal or huggingface-jobs or other serverless training systems). we need this both on the dataset generation and the RL orchestration side of things."

Spike 008 wrote composer_replication.diloco.make_diloco_outer_loop (wraps torchft.local_sgd.DiLoCo) but that's a single-process API. To realize "Decoupled DiLoCo across serverless executors" we need:

An abstraction layer that lets the framework launch N replicas on different serverless backends (Modal, HF Jobs, SageMaker, etc.) without per-backend code in the trainer.
A communication primitive that doesn't require inter-job NCCL/RDMA (most serverless executors don't expose that, and DiLoCo doesn't need it — sync happens once per ~500-1000 inner steps).

Options considered

docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md audited 6 executors:

Executor	Inter-job network	Cold start	$/A100·hr	$/H100·hr
Modal	yes (cluster mode)	~30s	$1.95	$5.50
HuggingFace Jobs	no	~60s	$4.18	$9.50
AWS SageMaker training	yes (warm pools)	~3-5min	~$3.06	~$8.50
GCP Vertex AI	yes (cluster)	~5-10min	~$3.67	~$10
Azure ML	yes (cluster)	~5-10min	~$3.67	~$10
k8s + Volcano/KubeRay	yes (cluster IP)	~30-90s	(BYO)	(BYO)

Most expose a "spin up a job, run a script" interface. None offers inter-job networking by default: HuggingFace Jobs has none at all, and the others require an explicit cluster mode or equivalent (extra cost + config).

Decision

Adopt object-store rendezvous as the default DiLoCo communication primitive across all serverless executors. Specifically:

composer_replication.diloco.serverless package
class ServerlessExecutor(Protocol) — uniform interface with launch_replicas / poll / stream_logs / cancel / collect / backend_name / supports_inter_replica_network
class ObjectStoreAllReduce — fsspec-backed pseudo-gradient exchange using s3:// / gs:// / az:// / hf:// / file:// — single code path, swappable bucket
v0 concrete adapters: ModalExecutor and HFJobsExecutor
v0.1+ adapters: RunPodExecutor, SageMakerExecutor, K8sExecutor
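
A minimal sketch of what the uniform interface could look like, to make the decision concrete. The method and attribute names come from the list above; every signature, the `ReplicaHandle` type, and the `InProcessExecutor` toy adapter are assumptions made for illustration, not the package's actual API.

```python
from dataclasses import dataclass, field
from typing import Iterator, Protocol, runtime_checkable


@dataclass
class ReplicaHandle:
    """Hypothetical handle for one launched replica (not named in the ADR)."""
    replica_id: int
    job_id: str
    status: str = "running"


@runtime_checkable
class ServerlessExecutor(Protocol):
    """Member names follow the ADR; all signatures here are assumptions."""

    backend_name: str
    supports_inter_replica_network: bool

    def launch_replicas(
        self, n: int, entrypoint: str, rendezvous_url: str
    ) -> list[ReplicaHandle]: ...
    def poll(self, handle: ReplicaHandle) -> str: ...
    def stream_logs(self, handle: ReplicaHandle) -> Iterator[str]: ...
    def cancel(self, handle: ReplicaHandle) -> None: ...
    def collect(self, handle: ReplicaHandle) -> dict: ...


@dataclass
class InProcessExecutor:
    """Toy adapter that only records launches; handy for trainer-side tests."""
    backend_name: str = "in-process"
    supports_inter_replica_network: bool = False
    _handles: list[ReplicaHandle] = field(default_factory=list)

    def launch_replicas(self, n, entrypoint, rendezvous_url):
        new = [ReplicaHandle(i, f"{self.backend_name}-{i}") for i in range(n)]
        self._handles.extend(new)
        return new

    def poll(self, handle):
        return handle.status

    def stream_logs(self, handle):
        yield f"[{handle.job_id}] started"

    def cancel(self, handle):
        handle.status = "cancelled"

    def collect(self, handle):
        return {"replica_id": handle.replica_id, "status": handle.status}


executor = InProcessExecutor()
# Structural typing: the adapter never inherits from the Protocol.
print(isinstance(executor, ServerlessExecutor))
```

Because the trainer would depend only on the Protocol, a real adapter such as ModalExecutor could be swapped in without touching trainer code, which is the point of the abstraction layer.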

Why object-store rendezvous (not NCCL across jobs)

DiLoCo paper (arXiv:2311.08105) shows the outer-loop sync is once per H = 500-1000 inner steps, equivalent to ~10-30 minutes of wall-clock at typical post-training step rates. For a 1B-param model in bf16:

Pseudo-gradient size: ~2 GB per replica per outer round
Sync frequency: ~once per 30 minutes
Therefore: ~2 GB × N_replicas, every ~30 min, durably written to object storage with a single PutObject per replica + GetObject per other replica

Even with N=8 replicas, that is 16 GB written plus 14 GB × 8 = 112 GB read, so 128 GB moved over 30 minutes, or ~70 MB/s aggregate. Standard S3 handles this comfortably, and S3 cross-job reads cost under $0.0001 per GET. Total inter-replica communication cost: ~$0.05 per outer round. Negligible compared to GPU spend.
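
The N=8 estimate can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch: the function name and its parameters are made up for illustration, and the GPU comparison assumes the Modal A100 rate from the executor table and a 30-minute outer round.

```python
def rendezvous_traffic(n_params, bytes_per_param, n_replicas, round_seconds):
    """Traffic for one outer round of all-to-all pseudo-gradient exchange."""
    payload = n_params * bytes_per_param            # one pseudo-gradient, in bytes
    written = payload * n_replicas                  # one PutObject per replica
    read = payload * (n_replicas - 1) * n_replicas  # each replica fetches every peer
    total = written + read
    return {
        "payload_gb": payload / 1e9,
        "written_gb": written / 1e9,
        "read_gb": read / 1e9,
        "total_gb": total / 1e9,
        "aggregate_mb_per_s": total / 1e6 / round_seconds,
        "puts": n_replicas,
        "gets": n_replicas * (n_replicas - 1),
    }


# 1B parameters in bf16 (2 bytes each), 8 replicas, one sync every 30 minutes.
t = rendezvous_traffic(1_000_000_000, 2, 8, 30 * 60)
for key, value in t.items():
    print(f"{key}: {value}")

# GPU spend over the same 30-minute round: 8 A100s at the table's Modal rate.
gpu_spend_per_round = 8 * 1.95 * 0.5
print(f"gpu_spend_per_round: ${gpu_spend_per_round:.2f}")
```

Under these assumptions the round moves 128 GB in total (16 GB written, 112 GB read) using 8 PUTs and 56 GETs, against roughly $7.80 of GPU time.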

By contrast, cross-job NCCL would require:

Inter-job networking (mostly unavailable on serverless)
Sustained low-latency connections (vs. burst-IO once per 30min)
Backend-specific cluster mode (for example, Modal's cluster/RDMA mode works only on Modal)

Object-store rendezvous decouples the algorithm from the executor and matches DiLoCo's actual communication profile.
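
To illustrate the rendezvous pattern itself, here is a small standard-library sketch in which a shared local directory stands in for the bucket. It is not the ObjectStoreAllReduce implementation: the `FileRendezvous` class, its method names, the `round-NNNNNN/replica-K.json` layout, and the use of JSON lists instead of tensors are all assumptions chosen to keep the example self-contained. A real version would go through fsspec so the same code path works for s3://, gs://, and the other schemes.

```python
import json
import tempfile
import time
from pathlib import Path


class FileRendezvous:
    """Averages pseudo-gradients through a shared directory (bucket stand-in)."""

    def __init__(self, root, replica_id, n_replicas):
        self.root = Path(root)
        self.replica_id = replica_id
        self.n_replicas = n_replicas

    def _round_dir(self, round_idx):
        return self.root / f"round-{round_idx:06d}"

    def publish(self, round_idx, pseudo_grad):
        """Write this replica's pseudo-gradient for the given outer round."""
        round_dir = self._round_dir(round_idx)
        round_dir.mkdir(parents=True, exist_ok=True)
        tmp = round_dir / f".replica-{self.replica_id}.tmp"
        tmp.write_text(json.dumps(pseudo_grad))
        # Rename last so readers never observe a half-written file.
        tmp.replace(round_dir / f"replica-{self.replica_id}.json")

    def gather_mean(self, round_idx, timeout_s=60.0, poll_s=0.05):
        """Poll until every replica has published, then return the mean."""
        round_dir = self._round_dir(round_idx)
        deadline = time.monotonic() + timeout_s
        while True:
            files = sorted(round_dir.glob("replica-*.json"))
            if len(files) == self.n_replicas:
                break
            if time.monotonic() > deadline:
                raise TimeoutError(
                    f"round {round_idx}: {len(files)}/{self.n_replicas} replicas arrived"
                )
            time.sleep(poll_s)
        grads = [json.loads(f.read_text()) for f in files]
        # Element-wise mean across replicas.
        return [sum(column) / self.n_replicas for column in zip(*grads)]


with tempfile.TemporaryDirectory() as bucket:
    replicas = [FileRendezvous(bucket, i, 2) for i in range(2)]
    replicas[0].publish(0, [1.0, 2.0])
    replicas[1].publish(0, [3.0, 6.0])
    mean = replicas[0].gather_mean(0)
    print(mean)
```

Polling is acceptable here because a round completes only once every 10 to 30 minutes, so a sub-second poll interval adds no meaningful latency. The explicit timeout matters more: a preempted replica would otherwise stall every other replica indefinitely.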

Why Modal + HF Jobs as the v0 executors

Modal: best dev velocity, sub-minute cold start, mature Python SDK, user already has CLI configured. Gives us a fast iteration loop for the serverless layer.
HuggingFace Jobs: zero acquisition cost (HF token already wired up), brand-aligned with the framework's HF-native posture, ~$4.18/A100·hr. Not the cheapest, but the right "default executor for HF users."

These two cover the spectrum of "fast for development" + "natural HF integration." Other executors are documented and stubbed but not implemented in v0.

Consequences

Accepted

New package composer_replication.diloco.serverless:
- executor.py — ServerlessExecutor Protocol + base class
- allreduce.py — ObjectStoreAllReduce, a mock Manager that drops into make_diloco_outer_loop with no changes to the existing wrapper
- modal.py — ModalExecutor (~150 LOC)
- hf_jobs.py — HFJobsExecutor (~150 LOC)
- replica_entrypoint.py — the script each replica runs (loaded from HF Datasets / object store)
New optional dependency [serverless] extra: pip install -e .[serverless] pulls fsspec, s3fs, huggingface_hub (already a transitive dep), and modal-client (only if user opts in to Modal).
Smoke test in spikes/009-decoupled-diloco/ (new, deferred — not part of this wave's commit) — local-only file:// rendezvous between two Python processes in tests/test_serverless_local.py. Multi-cloud test is post-replication.

Open / deferred

Real serverless smoke: spinning up 2 Modal containers + S3 rendezvous + verifying both converge. Deferred to a small-budget post-Wave-13 spike ($2-5 estimated). Not blocking for the v0 packaging.
HF Jobs API stability: HF Jobs is a relatively new product. The recon flagged "API may evolve through 2026"; we pin to a specific huggingface_hub minor and bump deliberately.

Trade-offs explicitly accepted

We do NOT use Modal's cluster/RDMA mode in v0. That gives sub-second cross-job NCCL but costs more and is Modal-only. Object-store rendezvous is the right default; users on Modal who want faster sync can override.
We do NOT support job-internal multi-GPU training in this layer. The serverless layer is for inter-replica sync; intra-replica training uses the existing make_diloco_outer_loop (which itself can wrap multi-GPU FSDP via torchft).

Source

docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md (2026-05-26 subagent recon, primary-sourced from each provider's official docs + pricing pages).