# ADR-005 — Decoupled DiLoCo over serverless training systems

**Status**: Accepted
**Date**: 2026-05-26
**Wave**: 13

## Context

The brief's V2 clause says:

> take that and combine it with diloco (decoupled, open, any variant of diloco)

The user expanded on this on 2026-05-26: *"Decoupled DiLoCo (so that we can leverage
modal or huggingface-jobs or other serverless training systems). we need
this both on the dataset generation and the RL orchestration side of
things."*

Spike 008 wrote `composer_replication.diloco.make_diloco_outer_loop`
(wraps `torchft.local_sgd.DiLoCo`) but that's a single-process API. To
realize "Decoupled DiLoCo across serverless executors" we need:

1. An abstraction layer that lets the framework launch N replicas on
   different serverless backends (Modal, HF Jobs, SageMaker, etc.) without
   per-backend code in the trainer.
2. A communication primitive that doesn't require inter-job NCCL/RDMA
   (most serverless executors don't expose that, and DiLoCo doesn't need
   it — sync happens once per ~500-1000 inner steps).

## Options considered

`docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md` audited 6 executors:

| Executor | Inter-job network | Cold start | $/A100·hr | $/H100·hr |
|---|---|---|---|---|
| Modal | yes (cluster mode) | ~30s | $1.95 | $5.50 |
| HuggingFace Jobs | no | ~60s | $4.18 | $9.50 |
| AWS SageMaker training | yes (warm pools) | ~3-5min | ~$3.06 | ~$8.50 |
| GCP Vertex AI | yes (cluster) | ~5-10min | ~$3.67 | ~$10 |
| Azure ML | yes (cluster) | ~5-10min | ~$3.67 | ~$10 |
| k8s + Volcano/KubeRay | yes (cluster IP) | ~30-90s | (BYO) | (BYO) |

Most expose a "spin up a job, run a script" interface. Few expose inter-job
networking by default; the ones that do require an explicit cluster mode
(extra cost + config).

## Decision

**Adopt object-store rendezvous as the default DiLoCo communication
primitive across all serverless executors.** Specifically:

- `composer_replication.diloco.serverless` package
- `class ServerlessExecutor(Protocol)` — uniform interface with
  `launch_replicas / poll / stream_logs / cancel / collect /
  backend_name / supports_inter_replica_network`
- `class ObjectStoreAllReduce` — fsspec-backed pseudo-gradient exchange
  using s3:// / gs:// / az:// / hf:// / file:// — single code path, swappable
  bucket
- v0 concrete adapters: `ModalExecutor` and `HFJobsExecutor`
- v0.1+ adapters: `RunPodExecutor`, `SageMakerExecutor`, `K8sExecutor`
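
The executor interface above could be sketched roughly as follows. Only the member names come from this ADR; the signatures, the `ReplicaHandle` type, the status strings, and the toy `InProcessExecutor` are assumptions made for illustration, not the package's actual API.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterator, Protocol, Sequence, runtime_checkable


@dataclass(frozen=True)
class ReplicaHandle:
    """Hypothetical handle for one launched replica (not defined in the ADR)."""
    replica_id: int
    job_id: str


@runtime_checkable
class ServerlessExecutor(Protocol):
    """Member names follow the ADR; every signature below is an assumption."""

    backend_name: str
    supports_inter_replica_network: bool

    def launch_replicas(self, n: int, entrypoint: str) -> Sequence[ReplicaHandle]: ...
    def poll(self, handle: ReplicaHandle) -> str: ...
    def stream_logs(self, handle: ReplicaHandle) -> Iterator[str]: ...
    def cancel(self, handle: ReplicaHandle) -> None: ...
    def collect(self, handle: ReplicaHandle) -> dict: ...


class InProcessExecutor:
    """Toy backend for illustration: 'launches' replicas as in-memory records."""

    backend_name = "in-process"
    supports_inter_replica_network = False

    def __init__(self) -> None:
        self._status: dict[str, str] = {}

    def launch_replicas(self, n, entrypoint):
        handles = [ReplicaHandle(i, f"job-{i}") for i in range(n)]
        for h in handles:
            self._status[h.job_id] = "succeeded"  # nothing actually runs here
        return handles

    def poll(self, handle):
        return self._status[handle.job_id]

    def stream_logs(self, handle):
        yield f"[{handle.job_id}] no-op replica {handle.replica_id}"

    def cancel(self, handle):
        self._status[handle.job_id] = "cancelled"

    def collect(self, handle):
        return {"replica_id": handle.replica_id, "status": self.poll(handle)}


def run_all(executor: ServerlessExecutor, n: int, entrypoint: str) -> list[dict]:
    """Trainer-side code depends only on the Protocol, never on a backend."""
    handles = executor.launch_replicas(n, entrypoint)
    return [executor.collect(h) for h in handles]
```

Because `ServerlessExecutor` is a structural `Protocol`, an adapter such as `ModalExecutor` would not need to inherit from it; matching the member names is enough for `run_all` to accept it.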

### Why object-store rendezvous (not NCCL across jobs)

The DiLoCo paper (arXiv:2311.08105) shows that the outer loop only needs to
sync **once per H = 500-1000 inner steps**, equivalent to ~10-30 minutes of
wall-clock time at typical post-training step rates. For a 1B-param model in
bf16 (2 bytes per parameter):

- Pseudo-gradient size: ~2 GB per replica per outer round
- Sync frequency: ~once per 30 minutes
- Therefore: ~2 GB × N_replicas, every ~30 min, durably written to object
  storage with a single `PutObject` per replica + `GetObject` per other
  replica

Even with N=8 replicas, that's 16 GB written + 14 GB read × 8 replicas =
112 GB read, i.e. 128 GB of total traffic spread over 30 minutes = ~70 MB/s
aggregate. **S3 free-tier handles this without breaking a sweat**, and S3
cross-job reads cost ~$0.0001 per GET. Total inter-replica communication
cost: ~$0.05 per outer round. **Negligible compared to GPU spend.**
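
The traffic estimate above can be reproduced with a few lines of arithmetic. This is a back-of-envelope sketch: the function name is hypothetical, and it assumes every replica uploads its full pseudo-gradient as one object and downloads every other replica's object once per outer round.

```python
def rendezvous_traffic(n_params: float, bytes_per_param: int,
                       n_replicas: int, round_seconds: float) -> dict:
    """Estimate object-store traffic for one DiLoCo outer round."""
    payload = n_params * bytes_per_param            # one pseudo-gradient, in bytes
    written = payload * n_replicas                  # each replica uploads once
    read = payload * (n_replicas - 1) * n_replicas  # each replica downloads all others
    total = written + read
    return {
        "payload_gb": payload / 1e9,
        "written_gb": written / 1e9,
        "read_gb": read / 1e9,
        "total_gb": total / 1e9,
        "aggregate_mb_per_s": total / round_seconds / 1e6,
        "put_requests": n_replicas,
        "get_requests": n_replicas * (n_replicas - 1),
    }


# 1B parameters in bf16, 8 replicas, one outer round every 30 minutes.
stats = rendezvous_traffic(n_params=1e9, bytes_per_param=2,
                           n_replicas=8, round_seconds=30 * 60)
print(stats)
```

For these inputs the sketch gives 16 GB written, 112 GB read, 128 GB in total, and roughly 71 MB/s aggregate, using only 8 PUT and 56 GET requests per round.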

By contrast, cross-job NCCL would require:
- Inter-job networking (mostly unavailable on serverless)
- Sustained low-latency connections (vs. burst-IO once per 30min)
- Backend-specific cluster mode (Modal-only on some platforms)

Object-store rendezvous decouples the algorithm from the executor and
matches DiLoCo's actual communication profile.
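
The rendezvous pattern can be illustrated with a minimal local sketch. It is not the `ObjectStoreAllReduce` implementation: a plain local directory stands in for the fsspec-backed bucket, JSON lists of floats stand in for serialized tensors, and the `publish` / `all_reduce_mean` names and the `round_*/replica_*.json` key layout are hypothetical.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def publish(root: Path, round_idx: int, replica_id: int,
            pseudo_grad: list[float]) -> None:
    """Write one replica's pseudo-gradient for the given outer round."""
    round_dir = root / f"round_{round_idx:06d}"
    round_dir.mkdir(parents=True, exist_ok=True)
    # Write to a temp file, then rename, so readers never see a partial file.
    # (Only needed on a local filesystem; object-store PUTs are all-or-nothing.)
    fd, tmp = tempfile.mkstemp(dir=round_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(pseudo_grad, f)
    os.replace(tmp, round_dir / f"replica_{replica_id}.json")


def all_reduce_mean(root: Path, round_idx: int, n_replicas: int,
                    timeout_s: float = 60.0, poll_s: float = 0.05) -> list[float]:
    """Poll until every replica has published, then average element-wise."""
    round_dir = root / f"round_{round_idx:06d}"
    paths = [round_dir / f"replica_{i}.json" for i in range(n_replicas)]
    deadline = time.monotonic() + timeout_s
    while not all(p.exists() for p in paths):
        if time.monotonic() > deadline:
            raise TimeoutError(f"round {round_idx}: not all replicas published")
        time.sleep(poll_s)
    grads = [json.loads(p.read_text()) for p in paths]
    return [sum(vals) / n_replicas for vals in zip(*grads)]


# Two "replicas" publish into the same directory, then reduce.
with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    publish(root, 0, 0, [1.0, 2.0])
    publish(root, 0, 1, [3.0, 6.0])
    print(all_reduce_mean(root, 0, n_replicas=2))  # [2.0, 4.0]
```

Every replica runs the same two calls and arrives at the same average without talking to any other replica directly, which is why no inter-job networking is needed. Namespacing keys by round keeps a slow replica from mixing pseudo-gradients across rounds.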

### Why Modal + HF Jobs as the v0 executors

- **Modal**: best dev velocity, sub-minute cold start, mature Python SDK,
  user already has CLI configured. Gives us a fast iteration loop for the
  serverless layer.
- **HuggingFace Jobs**: zero acquisition cost (HF token already wired up),
  brand-aligned with the framework's HF-native posture, ~$4.18/A100·hr.
  Not the cheapest, but the right "default executor for HF users."

These two cover the spectrum of "fast for development" + "natural HF
integration." Other executors are documented and stubbed but not
implemented in v0.

## Consequences

### Accepted

- New package `composer_replication.diloco.serverless`:
  - `executor.py` — `ServerlessExecutor` Protocol + base class
  - `allreduce.py` — `ObjectStoreAllReduce` mock Manager that drops into
    `make_diloco_outer_loop` with no changes to the existing wrapper
  - `modal.py` — `ModalExecutor` (~150 LOC)
  - `hf_jobs.py` — `HFJobsExecutor` (~150 LOC)
  - `replica_entrypoint.py` — the script each replica runs (loaded from
    HF Datasets / object store)
- New optional dependency `[serverless]` extra: `pip install -e .[serverless]`
  pulls `fsspec`, `s3fs`, `huggingface_hub` (already a transitive dep), and
  `modal-client` (only if user opts in to Modal).
- Smoke test in `spikes/009-decoupled-diloco/` (new, deferred; not part
  of this wave's commit): a local-only `file://` rendezvous between two
  Python processes, in `tests/test_serverless_local.py`. The multi-cloud
  test is post-replication.

### Open / deferred

- **Real serverless smoke**: spinning up 2 Modal containers + S3 rendezvous
  + verifying both converge. Deferred to a small-budget post-Wave-13 spike
  ($2-5 estimated). Not blocking for the v0 packaging.
- **HF Jobs API stability**: HF Jobs is a relatively new product. The
  recon flagged "API may evolve through 2026"; we pin to a specific
  `huggingface_hub` minor and bump deliberately.

### Trade-offs explicitly accepted

- We do NOT use Modal's cluster/RDMA mode in v0. That gives sub-second
  cross-job NCCL but costs more and is Modal-only. Object-store rendezvous
  is the right default; users on Modal who want faster sync can override.
- We do NOT support job-internal multi-GPU training in this layer. The
  serverless layer is for **inter-replica** sync; intra-replica training
  uses the existing `make_diloco_outer_loop` (which itself can wrap
  multi-GPU FSDP via torchft).

## Source

`docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md` (2026-05-26 subagent
recon, primary-sourced from each provider's official docs + pricing pages).