	# ADR-005 — Decoupled DiLoCo over serverless training systems

	Status: Accepted
	Date: 2026-05-26
	Wave: 13

	## Context

	The brief's V2 clause says:

	> take that and combine it with diloco (decoupled, open, any variant of diloco)

	The user expanded on this on 2026-05-26: *"Decoupled DiLoCo (so that we can leverage
	modal or huggingface-jobs or other serverless training systems). we need
	this both on the dataset generation and the RL orchestration side of
	things."*

	Spike 008 wrote `composer_replication.diloco.make_diloco_outer_loop`
	(wraps `torchft.local_sgd.DiLoCo`) but that's a single-process API. To
	realize "Decoupled DiLoCo across serverless executors" we need:

	1. An abstraction layer that lets the framework launch N replicas on
	different serverless backends (Modal, HF Jobs, SageMaker, etc.) without
	per-backend code in the trainer.
	2. A communication primitive that doesn't require inter-job NCCL/RDMA
	(most serverless executors don't expose that, and DiLoCo doesn't need
	it — sync happens once per ~500-1000 inner steps).

	## Options considered

	`docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md` audited 6 executors:

	| Executor | Inter-job network | Cold start | $/A100·hr | $/H100·hr |
	|---|---|---|---|---|
	| Modal | yes (cluster mode) | ~30s | $1.95 | $5.50 |
	| HuggingFace Jobs | no | ~60s | $4.18 | $9.50 |
	| AWS SageMaker training | yes (warm pools) | ~3-5min | ~$3.06 | ~$8.50 |
	| GCP Vertex AI | yes (cluster) | ~5-10min | ~$3.67 | ~$10 |
	| Azure ML | yes (cluster) | ~5-10min | ~$3.67 | ~$10 |
	| k8s + Volcano/KubeRay | yes (cluster IP) | ~30-90s | (BYO) | (BYO) |

	Most expose a "spin up a job, run a script" interface. Few expose inter-job
	networking by default; those that do require an explicit cluster mode
	(extra cost and configuration).

	## Decision

	**Adopt object-store rendezvous as the default DiLoCo communication
	primitive across all serverless executors.** Specifically:

	- `composer_replication.diloco.serverless` package
	- `class ServerlessExecutor(Protocol)` — uniform interface with
	`launch_replicas / poll / stream_logs / cancel / collect /
	backend_name / supports_inter_replica_network`
	- `class ObjectStoreAllReduce` — fsspec-backed pseudo-gradient exchange
	using s3:// / gs:// / az:// / hf:// / file:// — single code path, swappable
	bucket
	- v0 concrete adapters: `ModalExecutor` and `HFJobsExecutor`
	- v0.1+ adapters: `RunPodExecutor`, `SageMakerExecutor`, `K8sExecutor`
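
To make the interface concrete, here is a minimal sketch of what the `ServerlessExecutor` Protocol could look like. Only the member names come from the list above; the argument lists, return types, the `ReplicaHandle` dataclass, and the `InProcessExecutor` toy backend are assumptions made for illustration, not the package's actual API.

```python
from dataclasses import dataclass
from typing import Iterator, Protocol, Sequence, runtime_checkable


@dataclass(frozen=True)
class ReplicaHandle:
    """Hypothetical handle identifying one launched replica job."""
    backend: str
    job_id: str


@runtime_checkable
class ServerlessExecutor(Protocol):
    """Uniform interface over serverless backends (signatures are assumed)."""

    backend_name: str
    supports_inter_replica_network: bool

    def launch_replicas(
        self, n: int, entrypoint: str, env: dict[str, str]
    ) -> Sequence[ReplicaHandle]: ...
    def poll(self, handle: ReplicaHandle) -> str: ...
    def stream_logs(self, handle: ReplicaHandle) -> Iterator[str]: ...
    def cancel(self, handle: ReplicaHandle) -> None: ...
    def collect(self, handle: ReplicaHandle) -> dict[str, str]: ...


class InProcessExecutor:
    """Toy backend that only records launches; handy for trainer unit tests."""

    backend_name = "in-process"
    supports_inter_replica_network = False

    def __init__(self) -> None:
        self._status: dict[str, str] = {}

    def launch_replicas(self, n, entrypoint, env):
        handles = [ReplicaHandle(self.backend_name, f"replica-{i}") for i in range(n)]
        for handle in handles:
            self._status[handle.job_id] = "running"
        return handles

    def poll(self, handle):
        return self._status[handle.job_id]

    def stream_logs(self, handle):
        yield f"[{handle.job_id}] started"

    def cancel(self, handle):
        self._status[handle.job_id] = "cancelled"

    def collect(self, handle):
        return {"job_id": handle.job_id, "status": self._status[handle.job_id]}
```

Because the Protocol is structural, a trainer can accept any object with these members, so adapters such as `ModalExecutor` would not need to inherit from a shared base class.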

	### Why object-store rendezvous (not NCCL across jobs)

	The DiLoCo paper (arXiv:2311.08105) shows the outer-loop sync is **once per
	H = 500-1000 inner steps**, equivalent to ~10-30 minutes of wall-clock time at
	typical post-training step rates. For a 1B-param model in bf16:

	- Pseudo-gradient size: ~2 GB per replica per outer round
	- Sync frequency: ~once per 30 minutes
	- Therefore: ~2 GB × N_replicas, every ~30 min, durably written to object
	storage with a single `PutObject` per replica + `GetObject` per other
	replica

	Even with N=8 replicas, that's 16 GB written plus 14 GB × 8 = 112 GB read,
	i.e. 128 GB of total traffic spread over 30 minutes = ~70 MB/s aggregate.
	**S3 handles this without breaking a sweat**, and S3 cross-job reads cost
	~$0.0001 per GET. Total inter-replica communication cost: ~$0.05 per outer
	round. Negligible compared to GPU spend.
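
The traffic estimate can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch: the helper name `sync_traffic` is hypothetical, and it assumes every replica writes its pseudo-gradient once and reads the other N-1 replicas' objects once per outer round.

```python
def sync_traffic(
    n_params: float, bytes_per_param: int, n_replicas: int, interval_s: float
) -> dict[str, float]:
    """Estimate object-store traffic for one DiLoCo outer round."""
    payload_gb = n_params * bytes_per_param / 1e9          # one pseudo-gradient
    written_gb = payload_gb * n_replicas                   # one PutObject per replica
    read_gb = payload_gb * (n_replicas - 1) * n_replicas   # each replica reads its peers
    total_gb = written_gb + read_gb
    return {
        "payload_gb": payload_gb,
        "written_gb": written_gb,
        "read_gb": read_gb,
        "total_gb": total_gb,
        # Average rate if the round's traffic were spread over the whole interval.
        "aggregate_mb_per_s": total_gb * 1000 / interval_s,
    }


# 1B parameters in bf16 (2 bytes each), 8 replicas, one sync every 30 minutes.
estimate = sync_traffic(n_params=1e9, bytes_per_param=2, n_replicas=8, interval_s=30 * 60)
for name, value in estimate.items():
    print(f"{name}: {value:.1f}")
```

In practice the transfer happens in a burst at the end of each round rather than evenly, so the average rate is a lower bound on the peak bandwidth each replica sees.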

	By contrast, cross-job NCCL would require:
	- Inter-job networking (mostly unavailable on serverless)
	- Sustained low-latency connections (vs. burst-IO once per 30min)
	- Backend-specific cluster mode (e.g. Modal's cluster/RDMA mode, which does not carry over to other executors)

	Object-store rendezvous decouples the algorithm from the executor and
	matches DiLoCo's actual communication profile.
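
As an illustration of the rendezvous pattern, the sketch below averages pseudo-gradients through a shared directory. It is not the `ObjectStoreAllReduce` implementation: the class name `FileRendezvousAllReduce`, the key layout, and the JSON encoding are assumptions, and a plain local directory stands in for an fsspec-backed bucket so the example runs with the standard library only.

```python
import json
import os
import tempfile
import time
from pathlib import Path


class FileRendezvousAllReduce:
    """Averages per-replica pseudo-gradients through a shared directory."""

    def __init__(self, root: str, replica_id: int, world_size: int,
                 timeout_s: float = 60.0, poll_s: float = 0.05) -> None:
        self.root = Path(root)
        self.replica_id = replica_id
        self.world_size = world_size
        self.timeout_s = timeout_s
        self.poll_s = poll_s

    def _key(self, round_idx: int, replica_id: int) -> Path:
        # Hypothetical layout: one "object" per replica per outer round.
        return self.root / f"round-{round_idx:06d}" / f"replica-{replica_id:04d}.json"

    def publish(self, round_idx: int, pseudo_grad: list[float]) -> None:
        path = self._key(round_idx, self.replica_id)
        path.parent.mkdir(parents=True, exist_ok=True)
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps(pseudo_grad))
        # Rename so readers never observe a partial file, mimicking an atomic PutObject.
        os.replace(tmp, path)

    def gather_mean(self, round_idx: int) -> list[float]:
        paths = [self._key(round_idx, r) for r in range(self.world_size)]
        deadline = time.monotonic() + self.timeout_s
        # Poll until every replica's object for this round is visible.
        while not all(p.exists() for p in paths):
            if time.monotonic() > deadline:
                raise TimeoutError(f"round {round_idx}: not all replicas published")
            time.sleep(self.poll_s)
        grads = [json.loads(p.read_text()) for p in paths]
        return [sum(column) / self.world_size for column in zip(*grads)]

    def allreduce(self, round_idx: int, pseudo_grad: list[float]) -> list[float]:
        self.publish(round_idx, pseudo_grad)
        return self.gather_mean(round_idx)


# Two replicas simulated sequentially in one process.
with tempfile.TemporaryDirectory() as bucket:
    r0 = FileRendezvousAllReduce(bucket, replica_id=0, world_size=2)
    r1 = FileRendezvousAllReduce(bucket, replica_id=1, world_size=2)
    r0.publish(0, [1.0, 2.0])
    r1.publish(0, [3.0, 6.0])
    mean0 = r0.gather_mean(0)
    mean1 = r1.gather_mean(0)
    print(mean0, mean1)
```

Keying objects by round index means a slow replica can never mix pseudo-gradients from different outer rounds, and the executor never needs to know which replicas are peers.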

	### Why Modal + HF Jobs as the v0 executors

	- Modal: best dev velocity, sub-minute cold start, mature Python SDK,
	user already has CLI configured. Gives us a fast iteration loop for the
	serverless layer.
	- HuggingFace Jobs: zero acquisition cost (HF token already wired up),
	brand-aligned with the framework's HF-native posture, ~$4.18/A100·hr.
	Not the cheapest, but the right "default executor for HF users."

	These two cover the spectrum of "fast for development" + "natural HF
	integration." Other executors are documented and stubbed but not
	implemented in v0.

	## Consequences

	### Accepted

	- New package `composer_replication.diloco.serverless`:
	  - `executor.py`: `ServerlessExecutor` Protocol + base class
	  - `allreduce.py`: `ObjectStoreAllReduce`, a mock `Manager` that drops into
	    `make_diloco_outer_loop` with no changes to the existing wrapper
	  - `modal.py`: `ModalExecutor` (~150 LOC)
	  - `hf_jobs.py`: `HFJobsExecutor` (~150 LOC)
	  - `replica_entrypoint.py`: the script each replica runs (loaded from
	    HF Datasets / object store)
	- New optional dependency `[serverless]` extra: `pip install -e .[serverless]`
	pulls `fsspec`, `s3fs`, `huggingface_hub` (already a transitive dep), and
	`modal-client` (only if user opts in to Modal).
	- Smoke test in `spikes/009-decoupled-diloco/` (new and deferred; not part
	of this wave's commit): a local-only `file://` rendezvous between two
	Python processes, in `tests/test_serverless_local.py`. The multi-cloud
	test is post-replication.
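
A local two-process smoke test of this kind might look roughly like the sketch below. It is not the contents of `tests/test_serverless_local.py`: the inline replica script, the `run_local_smoke` helper, and the use of a scalar as a stand-in pseudo-gradient are all assumptions, chosen so the example needs nothing beyond the standard library.

```python
import subprocess
import sys
import tempfile
import textwrap

# Each replica publishes one value, waits for its peers, then prints the mean.
REPLICA_SCRIPT = textwrap.dedent("""
    import json, os, sys, time
    from pathlib import Path

    root, rank, world = Path(sys.argv[1]), int(sys.argv[2]), int(sys.argv[3])
    value = float(rank + 1)                  # stand-in for a pseudo-gradient
    tmp = root / f"{rank}.tmp"
    tmp.write_text(json.dumps(value))
    os.replace(tmp, root / f"{rank}.json")   # atomic publish

    paths = [root / f"{r}.json" for r in range(world)]
    deadline = time.monotonic() + 30
    while not all(p.exists() for p in paths):
        if time.monotonic() > deadline:
            sys.exit("rendezvous timed out")
        time.sleep(0.05)
    print(sum(json.loads(p.read_text()) for p in paths) / world)
""")


def run_local_smoke(world_size: int = 2) -> list[float]:
    """Launch replicas as separate Python processes sharing one directory."""
    with tempfile.TemporaryDirectory() as root:
        procs = [
            subprocess.Popen(
                [sys.executable, "-c", REPLICA_SCRIPT, root, str(rank), str(world_size)],
                stdout=subprocess.PIPE,
                text=True,
            )
            for rank in range(world_size)
        ]
        outputs = [proc.communicate(timeout=60)[0] for proc in procs]
        if any(proc.returncode != 0 for proc in procs):
            raise RuntimeError("a replica failed to rendezvous")
    return [float(out) for out in outputs]


smoke_result = run_local_smoke()
print(smoke_result)
```

The check that matters is that every process prints the same mean, which shows that all replicas saw the same set of published objects.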

	### Open / deferred

	- Real serverless smoke: spinning up 2 Modal containers + S3 rendezvous
	+ verifying both converge. Deferred to a small-budget post-Wave-13 spike
	($2-5 estimated). Not blocking for the v0 packaging.
	- HF Jobs API stability: HF Jobs is a relatively new product. The
	recon flagged "API may evolve through 2026"; we pin to a specific
	`huggingface_hub` minor and bump deliberately.

	### Trade-offs explicitly accepted

	- We do NOT use Modal's cluster/RDMA mode in v0. That gives sub-second
	cross-job NCCL but costs more and is Modal-only. Object-store rendezvous
	is the right default; users on Modal who want faster sync can override.
	- We do NOT support job-internal multi-GPU training in this layer. The
	serverless layer is for inter-replica sync; intra-replica training
	uses the existing `make_diloco_outer_loop` (which itself can wrap
	multi-GPU FSDP via torchft).

	## Source

	`docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md` (2026-05-26 subagent
	recon, primary-sourced from each provider's official docs + pricing pages).