# ADR-005 — Decoupled DiLoCo over serverless training systems

**Status**: Accepted
**Date**: 2026-05-26
**Wave**: 13

## Context

The brief's V2 clause says:

> take that and combine it with diloco (decoupled, open, any variant of diloco)

The user expanded on this on 2026-05-26: *"Decoupled DiLoCo (so that we can leverage
modal or huggingface-jobs or other serverless training systems). we need
this both on the dataset generation and the RL orchestration side of
things."*

Spike 008 wrote `composer_replication.diloco.make_diloco_outer_loop`
(wraps `torchft.local_sgd.DiLoCo`) but that's a single-process API. To
realize "Decoupled DiLoCo across serverless executors" we need:

1. An abstraction layer that lets the framework launch N replicas on
   different serverless backends (Modal, HF Jobs, SageMaker, etc.) without
   per-backend code in the trainer.
2. A communication primitive that doesn't require inter-job NCCL/RDMA
   (most serverless executors don't expose that, and DiLoCo doesn't need
   it — sync happens once per ~500-1000 inner steps).

## Options considered

`docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md` audited 6 executors:

| Executor | Inter-job network | Cold start | $/A100·hr | $/H100·hr |
|---|---|---|---|---|
| Modal | yes (cluster mode) | ~30s | $1.95 | $5.50 |
| HuggingFace Jobs | no | ~60s | $4.18 | $9.50 |
| AWS SageMaker training | yes (warm pools) | ~3-5min | ~$3.06 | ~$8.50 |
| GCP Vertex AI | yes (cluster) | ~5-10min | ~$3.67 | ~$10 |
| Azure ML | yes (cluster) | ~5-10min | ~$3.67 | ~$10 |
| k8s + Volcano/KubeRay | yes (cluster IP) | ~30-90s | (BYO) | (BYO) |

Most expose a "spin up a job, run a script" interface. Few expose inter-job
networking by default; every "yes" in the table above requires an explicit
cluster mode (extra cost and config).

## Decision

**Adopt object-store rendezvous as the default DiLoCo communication
primitive across all serverless executors.** Specifically:

- `composer_replication.diloco.serverless` package
- `class ServerlessExecutor(Protocol)` — uniform interface with
  `launch_replicas / poll / stream_logs / cancel / collect /
  backend_name / supports_inter_replica_network`
- `class ObjectStoreAllReduce` — fsspec-backed pseudo-gradient exchange
  using s3:// / gs:// / az:// / hf:// / file:// — single code path, swappable
  bucket
- v0 concrete adapters: `ModalExecutor` and `HFJobsExecutor`
- v0.1+ adapters: `RunPodExecutor`, `SageMakerExecutor`, `K8sExecutor`
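
A minimal sketch of what the `ServerlessExecutor` Protocol could look like. Only the seven member names come from this ADR; the signatures, the `ReplicaHandle` type, and the toy `InProcessExecutor` are assumptions made for illustration, not the package's actual API.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Protocol, Sequence, runtime_checkable


@dataclass(frozen=True)
class ReplicaHandle:
    """Hypothetical handle for one launched replica (not defined by the ADR)."""
    replica_id: int
    job_id: str


@runtime_checkable
class ServerlessExecutor(Protocol):
    """Uniform executor interface. Member names are from the ADR; signatures are assumed."""

    backend_name: str
    supports_inter_replica_network: bool

    def launch_replicas(self, n: int, entrypoint: str) -> Sequence[ReplicaHandle]: ...
    def poll(self, handle: ReplicaHandle) -> str: ...
    def stream_logs(self, handle: ReplicaHandle) -> Iterator[str]: ...
    def cancel(self, handle: ReplicaHandle) -> None: ...
    def collect(self, handle: ReplicaHandle) -> dict: ...


class InProcessExecutor:
    """Toy executor that records replicas in memory instead of launching jobs."""

    backend_name = "in-process"
    supports_inter_replica_network = False

    def __init__(self) -> None:
        self._state: Dict[str, str] = {}

    def launch_replicas(self, n, entrypoint):
        handles = [ReplicaHandle(i, f"local-{i}") for i in range(n)]
        for h in handles:
            self._state[h.job_id] = "running"
        return handles

    def poll(self, handle):
        return self._state[handle.job_id]

    def stream_logs(self, handle):
        yield f"[{handle.job_id}] started"

    def cancel(self, handle):
        self._state[handle.job_id] = "cancelled"

    def collect(self, handle):
        return {"replica_id": handle.replica_id, "status": self._state[handle.job_id]}


executor = InProcessExecutor()
handles = executor.launch_replicas(2, "replica_entrypoint.py")
```

Because the Protocol is structural, an adapter such as `ModalExecutor` would not need to inherit from anything; it only has to expose the same members, which keeps per-backend code out of the trainer.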

### Why object-store rendezvous (not NCCL across jobs)

The DiLoCo paper (arXiv:2311.08105) shows that the outer-loop sync happens only **once per
H = 500-1000 inner steps**, equivalent to ~10-30 minutes of wall-clock at
typical post-training step rates. For a 1B-param model in bf16:

- Pseudo-gradient size: ~2 GB per replica per outer round
- Sync frequency: ~once per 30 minutes
- Therefore: ~2 GB × N_replicas, every ~30 min, durably written to object
  storage with a single `PutObject` per replica + `GetObject` per other
  replica

Even with N=8 replicas, that's 16 GB written plus 14 GB × 8 = 112 GB read
per outer round: 128 GB of total traffic spread over 30 minutes, or ~70 MB/s
aggregate. **S3 handles this without breaking a sweat**, and S3 cross-job
reads cost ~$0.0001 per GET. Total inter-replica communication cost: ~$0.05
per outer round. **Negligible compared to GPU spend.**

By contrast, cross-job NCCL would require:
- Inter-job networking (mostly unavailable on serverless)
- Sustained low-latency connections (vs. burst-IO once per 30min)
- Backend-specific cluster mode (e.g. Modal's, which does not carry over to other executors)

Object-store rendezvous decouples the algorithm from the executor and
matches DiLoCo's actual communication profile.
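
The per-round traffic estimate in this section can be reproduced with a few lines of arithmetic. The function name and parameters below are illustrative; the model assumes each replica writes its pseudo-gradient once and reads the pseudo-gradients of the other N-1 replicas.

```python
def outer_round_traffic(n_replicas: int, n_params: float,
                        bytes_per_param: int, round_seconds: float) -> dict:
    """Estimate object-store traffic for one DiLoCo outer round."""
    pseudo_grad_gb = n_params * bytes_per_param / 1e9
    written_gb = n_replicas * pseudo_grad_gb                   # one PUT per replica
    read_gb = n_replicas * (n_replicas - 1) * pseudo_grad_gb   # each reads N-1 peers
    total_gb = written_gb + read_gb
    return {
        "pseudo_grad_gb": pseudo_grad_gb,
        "written_gb": written_gb,
        "read_gb": read_gb,
        "total_gb": total_gb,
        "aggregate_mb_per_s": total_gb * 1000 / round_seconds,
    }


# 1B parameters in bf16 (2 bytes each), 8 replicas, one sync every 30 minutes.
traffic = outer_round_traffic(8, 1e9, 2, 30 * 60)
print(traffic)
```

For these inputs the result is 16 GB written, 112 GB read, and 128 GB total, or about 71 MB/s aggregate. Read traffic grows as N × (N-1), so larger replica counts may warrant a reduce-then-broadcast layout rather than all-to-all reads.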

### Why Modal + HF Jobs as the v0 executors

- **Modal**: best dev velocity, sub-minute cold start, mature Python SDK,
  user already has CLI configured. Gives us a fast iteration loop for the
  serverless layer.
- **HuggingFace Jobs**: zero acquisition cost (HF token already wired up),
  brand-aligned with the framework's HF-native posture, ~$4.18/A100·hr.
  Not the cheapest, but the right "default executor for HF users."

These two cover the spectrum of "fast for development" + "natural HF
integration." Other executors are documented and stubbed but not
implemented in v0.

## Consequences

### Accepted

- New package `composer_replication.diloco.serverless`:
  - `executor.py` — `ServerlessExecutor` Protocol + base class
  - `allreduce.py` — `ObjectStoreAllReduce`, a mock `Manager` that drops into
    `make_diloco_outer_loop` with no changes to the existing wrapper
  - `modal.py` — `ModalExecutor` (~150 LOC)
  - `hf_jobs.py` — `HFJobsExecutor` (~150 LOC)
  - `replica_entrypoint.py` — the script each replica runs (loaded from
    HF Datasets / object store)
- New optional dependency `[serverless]` extra: `pip install -e .[serverless]`
  pulls `fsspec`, `s3fs`, `huggingface_hub` (already a transitive dep), and
  `modal-client` (only if user opts in to Modal).
- Smoke test in `spikes/009-decoupled-diloco/` (new and deferred; not part
  of this wave's commit): a local-only `file://` rendezvous between two
  Python processes, exercised by `tests/test_serverless_local.py`. The
  multi-cloud test is post-replication.
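
The local-only rendezvous can be sketched with the standard library alone. This is a simplified stand-in under stated assumptions, not the `ObjectStoreAllReduce` implementation: the class name `LocalDirAllReduce`, the file layout, and the plain float32 encoding are all hypothetical, and a real version would presumably go through fsspec so that the same code path covers `s3://`, `gs://`, and the other schemes.

```python
import os
import tempfile
import time
from array import array
from pathlib import Path


class LocalDirAllReduce:
    """Hypothetical directory-backed exchange of pseudo-gradients between replicas."""

    def __init__(self, root: str, n_replicas: int, replica_id: int) -> None:
        self.root = Path(root)
        self.n_replicas = n_replicas
        self.replica_id = replica_id

    def _path(self, round_idx: int, replica_id: int) -> Path:
        return self.root / f"round-{round_idx:06d}" / f"replica-{replica_id}.f32"

    def publish(self, round_idx: int, pseudo_grad: list) -> None:
        """Write this replica's pseudo-gradient; the rename makes it visible atomically."""
        dest = self._path(round_idx, self.replica_id)
        dest.parent.mkdir(parents=True, exist_ok=True)
        tmp = dest.with_suffix(".tmp")
        tmp.write_bytes(array("f", pseudo_grad).tobytes())
        os.replace(tmp, dest)

    def gather_mean(self, round_idx: int, timeout_s: float = 60.0,
                    poll_s: float = 0.05) -> list:
        """Poll until every replica has published, then return the element-wise mean."""
        deadline = time.monotonic() + timeout_s
        paths = [self._path(round_idx, r) for r in range(self.n_replicas)]
        while not all(p.exists() for p in paths):
            if time.monotonic() > deadline:
                raise TimeoutError(f"round {round_idx}: not all replicas published")
            time.sleep(poll_s)
        total = None
        for p in paths:
            vals = array("f")
            vals.frombytes(p.read_bytes())
            total = list(vals) if total is None else [a + b for a, b in zip(total, vals)]
        return [v / self.n_replicas for v in total]


# Two replicas share one directory; here they run in the same process for brevity.
with tempfile.TemporaryDirectory() as root:
    r0 = LocalDirAllReduce(root, n_replicas=2, replica_id=0)
    r1 = LocalDirAllReduce(root, n_replicas=2, replica_id=1)
    r0.publish(0, [1.0, 2.0])
    r1.publish(0, [3.0, 4.0])
    averaged = r0.gather_mean(0)
    averaged_from_r1 = r1.gather_mean(0)
```

Both replicas compute the same mean without talking to each other directly, which is the property the smoke test is meant to check. The timeout matters in practice: a replica that is preempted mid-round should surface as an error, not as an indefinite hang.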

### Open / deferred

- **Real serverless smoke**: spin up 2 Modal containers with S3 rendezvous
  and verify that both converge. Deferred to a small-budget post-Wave-13
  spike ($2-5 estimated). Not blocking for the v0 packaging.
- **HF Jobs API stability**: HF Jobs is a relatively new product. The
  recon flagged "API may evolve through 2026"; we pin to a specific
  `huggingface_hub` minor and bump deliberately.

### Trade-offs explicitly accepted

- We do NOT use Modal's cluster/RDMA mode in v0. That gives sub-second
  cross-job NCCL but costs more and is Modal-only. Object-store rendezvous
  is the right default; users on Modal who want faster sync can override.
- We do NOT support job-internal multi-GPU training in this layer. The
  serverless layer is for **inter-replica** sync; intra-replica training
  uses the existing `make_diloco_outer_loop` (which itself can wrap
  multi-GPU FSDP via torchft).

## Source

`docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md` (2026-05-26 subagent
recon, primary-sourced from each provider's official docs + pricing pages).