ADR-005 — Decoupled DiLoCo over serverless training systems

Status: Accepted | Date: 2026-05-26 | Wave: 13

Context

The brief's V2 clause says:

take that and combine it with diloco (decoupled, open, any variant of diloco)

The user expanded 2026-05-26: "Decoupled DiLoCo (so that we can leverage modal or huggingface-jobs or other serverless training systems). we need this both on the dataset generation and the RL orchestration side of things."

Spike 008 wrote composer_replication.diloco.make_diloco_outer_loop (wraps torchft.local_sgd.DiLoCo) but that's a single-process API. To realize "Decoupled DiLoCo across serverless executors" we need:

  1. An abstraction layer that lets the framework launch N replicas on different serverless backends (Modal, HF Jobs, SageMaker, etc.) without per-backend code in the trainer.
  2. A communication primitive that doesn't require inter-job NCCL/RDMA (most serverless executors don't expose that, and DiLoCo doesn't need it — sync happens once per ~500-1000 inner steps).

Options considered

docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md audited 6 executors:

| Executor | Inter-job network | Cold start | $/A100·hr | $/H100·hr |
| --- | --- | --- | --- | --- |
| Modal | yes (cluster mode) | ~30s | $1.95 | $5.50 |
| HuggingFace Jobs | no | ~60s | $4.18 | $9.50 |
| AWS SageMaker training | yes (warm pools) | ~3-5min | ~$3.06 | ~$8.50 |
| GCP Vertex AI | yes (cluster) | ~5-10min | ~$3.67 | ~$10 |
| Azure ML | yes (cluster) | ~5-10min | ~$3.67 | ~$10 |
| k8s + Volcano/KubeRay | yes (cluster IP) | ~30-90s | (BYO) | (BYO) |

Most expose a "spin up a job, run a script" interface. Few expose inter-job networking out of the box; those that offer it require an explicit cluster mode (extra cost + config).

Decision

Adopt object-store rendezvous as the default DiLoCo communication primitive across all serverless executors. Specifically:

  • composer_replication.diloco.serverless package
  • class ServerlessExecutor(Protocol) — uniform interface with launch_replicas / poll / stream_logs / cancel / collect / backend_name / supports_inter_replica_network
  • class ObjectStoreAllReduce — fsspec-backed pseudo-gradient exchange using s3:// / gs:// / az:// / hf:// / file:// — single code path, swappable bucket
  • v0 concrete adapters: ModalExecutor and HFJobsExecutor
  • v0.1+ adapters: RunPodExecutor, SageMakerExecutor, K8sExecutor
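The uniform interface above could look roughly like the following sketch. The member names (launch_replicas, poll, stream_logs, cancel, collect, backend_name, supports_inter_replica_network) come from this ADR; every signature, the ReplicaHandle type, and the InProcessExecutor toy backend are assumptions made for illustration, not the shipped API.

```python
from dataclasses import dataclass
from typing import Iterator, Protocol, Sequence, runtime_checkable


@dataclass(frozen=True)
class ReplicaHandle:
    """Hypothetical handle identifying one launched replica."""
    replica_id: int
    job_id: str


@runtime_checkable
class ServerlessExecutor(Protocol):
    """Member names are from the ADR; all signatures here are assumed."""

    backend_name: str
    supports_inter_replica_network: bool

    def launch_replicas(self, n: int, entrypoint: str) -> Sequence[ReplicaHandle]: ...
    def poll(self, handle: ReplicaHandle) -> str: ...
    def stream_logs(self, handle: ReplicaHandle) -> Iterator[str]: ...
    def cancel(self, handle: ReplicaHandle) -> None: ...
    def collect(self, handle: ReplicaHandle) -> dict: ...


class InProcessExecutor:
    """Toy backend (hypothetical): records replicas instead of launching jobs."""

    backend_name = "in-process"
    supports_inter_replica_network = False

    def __init__(self) -> None:
        self._status = {}  # job_id -> status string

    def launch_replicas(self, n, entrypoint):
        handles = [ReplicaHandle(i, f"local-{i}") for i in range(n)]
        for h in handles:
            self._status[h.job_id] = "running"
        return handles

    def poll(self, handle):
        return self._status[handle.job_id]

    def stream_logs(self, handle):
        yield f"[{handle.job_id}] started"

    def cancel(self, handle):
        self._status[handle.job_id] = "cancelled"

    def collect(self, handle):
        return {"replica_id": handle.replica_id,
                "status": self._status[handle.job_id]}
```

Because the interface is a structural Protocol, adapters such as ModalExecutor or HFJobsExecutor would not need to inherit from it; the trainer can depend only on the protocol and stay free of per-backend code.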

Why object-store rendezvous (not NCCL across jobs)

DiLoCo paper (arXiv:2311.08105) shows the outer-loop sync is once per H = 500-1000 inner steps, equivalent to ~10-30 minutes of wall-clock at typical post-training step rates. For a 1B-param model in bf16:

  • Pseudo-gradient size: ~2 GB per replica per outer round
  • Sync frequency: ~once per 30 minutes
  • Therefore: ~2 GB × N_replicas, every ~30 min, durably written to object storage with a single PutObject per replica + GetObject per other replica

Even with N=8 replicas, that's 16 GB written plus 14 GB × 8 = 112 GB read, 128 GB in total spread over 30 minutes, or ~70 MB/s aggregate. S3 handles this throughput comfortably, and request charges for cross-job reads are a small fraction of a cent per GET. Estimated total inter-replica communication cost: ~$0.05 per outer round, negligible compared to GPU spend.
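The traffic estimate can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch: the function name and its parameters are hypothetical, and it assumes each replica writes its own pseudo-gradient once and reads every other replica's object once per outer round.

```python
def rendezvous_traffic(n_params: float, bytes_per_param: int,
                       n_replicas: int, round_seconds: float) -> dict:
    """Per-outer-round object-store traffic for an all-to-all exchange."""
    payload = n_params * bytes_per_param          # one pseudo-gradient, in bytes
    written = payload * n_replicas                # one PutObject per replica
    read = payload * (n_replicas - 1) * n_replicas  # each replica reads N-1 objects
    total = written + read
    return {
        "payload_gb": payload / 1e9,
        "written_gb": written / 1e9,
        "read_gb": read / 1e9,
        "total_gb": total / 1e9,
        "aggregate_mb_per_s": total / 1e6 / round_seconds,
    }


# 1B parameters in bf16 (2 bytes each), 8 replicas, one round per 30 minutes.
estimate = rendezvous_traffic(1e9, 2, 8, 30 * 60)
```

For these inputs the sketch gives a 2 GB payload, 16 GB written, 112 GB read, and 128 GB total, which averages to roughly 71 MB/s over the round. Read traffic grows quadratically with the replica count, so it is the term to watch if N increases.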

By contrast, cross-job NCCL would require:

  • Inter-job networking (mostly unavailable on serverless)
  • Sustained low-latency connections (vs. burst-IO once per 30min)
  • Backend-specific cluster mode (available only on some platforms, e.g. Modal)

Object-store rendezvous decouples the algorithm from the executor and matches DiLoCo's actual communication profile.
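A minimal sketch of the rendezvous pattern follows, using only the standard library and a local directory as a stand-in for the bucket. It is not the ObjectStoreAllReduce implementation: the function name, the key layout (round-NNNNNN/replica-i.json), the JSON encoding, and the polling parameters are all assumptions for illustration, and a real version would go through fsspec and serialize tensors.

```python
import json
import os
import time
from pathlib import Path


def exchange_pseudo_gradients(root: Path, round_idx: int, replica_id: int,
                              n_replicas: int, pseudo_grad: list[float],
                              timeout_s: float = 60.0,
                              poll_s: float = 0.05) -> list[float]:
    """Publish this replica's pseudo-gradient, wait for all peers, return the mean."""
    round_dir = root / f"round-{round_idx:06d}"  # hypothetical key layout
    round_dir.mkdir(parents=True, exist_ok=True)

    # Publish: write under a temporary name, then rename, so that readers
    # never observe a partially written object.
    final = round_dir / f"replica-{replica_id}.json"
    tmp = round_dir / f".replica-{replica_id}.tmp"
    tmp.write_text(json.dumps(pseudo_grad))
    os.replace(tmp, final)

    # Rendezvous: poll until every replica's object for this round is visible.
    expected = [round_dir / f"replica-{i}.json" for i in range(n_replicas)]
    deadline = time.monotonic() + timeout_s
    while not all(p.exists() for p in expected):
        if time.monotonic() > deadline:
            raise TimeoutError(f"round {round_idx}: not all replicas reported")
        time.sleep(poll_s)

    # Reduce: element-wise mean across replicas. The caller would then apply
    # the outer optimizer step to this averaged pseudo-gradient.
    grads = [json.loads(p.read_text()) for p in expected]
    return [sum(vals) / n_replicas for vals in zip(*grads)]
```

Every replica runs the same function and computes the same mean locally, so no coordinator process and no inter-job network are needed; the only shared state is the set of objects under the round prefix.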

Why Modal + HF Jobs as the v0 executors

  • Modal: best dev velocity, sub-minute cold start, mature Python SDK, user already has CLI configured. Gives us a fast iteration loop for the serverless layer.
  • HuggingFace Jobs: zero acquisition cost (HF token already wired up), brand-aligned with the framework's HF-native posture, ~$4.18/A100·hr. Not the cheapest, but the right "default executor for HF users."

These two cover the spectrum of "fast for development" + "natural HF integration." Other executors are documented and stubbed but not implemented in v0.

Consequences

Accepted

  • New package composer_replication.diloco.serverless:
    • executor.py: ServerlessExecutor Protocol + base class
    • allreduce.py: ObjectStoreAllReduce, a mock Manager that drops into make_diloco_outer_loop with no changes to the existing wrapper
    • modal.py: ModalExecutor (~150 LOC)
    • hf_jobs.py: HFJobsExecutor (~150 LOC)
    • replica_entrypoint.py: the script each replica runs (loaded from HF Datasets / object store)
  • New optional dependency [serverless] extra: pip install -e .[serverless] pulls fsspec, s3fs, huggingface_hub (already a transitive dep), and modal-client (only if user opts in to Modal).
  • Smoke test in spikes/009-decoupled-diloco/ (new and deferred; not part of this wave's commit): a local-only file:// rendezvous between two Python processes, in tests/test_serverless_local.py. Multi-cloud testing is post-replication.

Open / deferred

  • Real serverless smoke: spinning up 2 Modal containers + S3 rendezvous + verifying both converge. Deferred to a small-budget post-Wave-13 spike ($2-5 estimated). Not blocking for the v0 packaging.
  • HF Jobs API stability: HF Jobs is a relatively new product. The recon flagged "API may evolve through 2026"; we pin to a specific huggingface_hub minor and bump deliberately.

Trade-offs explicitly accepted

  • We do NOT use Modal's cluster/RDMA mode in v0. That gives sub-second cross-job NCCL but costs more and is Modal-only. Object-store rendezvous is the right default; users on Modal who want faster sync can override.
  • We do NOT support job-internal multi-GPU training in this layer. The serverless layer is for inter-replica sync; intra-replica training uses the existing make_diloco_outer_loop (which itself can wrap multi-GPU FSDP via torchft).

Source

docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md (2026-05-26 subagent recon, primary-sourced from each provider's official docs + pricing pages).