"""composer_replication.diloco.serverless — run Decoupled DiLoCo across serverless training systems (Modal, HuggingFace Jobs, SageMaker, k8s, …). Per ADR-005, the design rests on two abstractions: 1. `ServerlessExecutor` Protocol — a uniform interface for spinning up N replicas on different cloud backends. Each backend (Modal, HF Jobs, SageMaker, etc.) gets a concrete adapter that implements the Protocol. 2. `ObjectStoreAllReduce` — fsspec-backed pseudo-gradient exchange that replaces the in-process `torchft.Manager.allreduce` call. The communication pattern is `S3 PutObject + N GetObjects` once per ~500-1000 inner steps, which matches DiLoCo's actual sync cadence (paper arXiv:2311.08105 §3.2). Bandwidth: ~2 GB / 30 minutes per replica for 1B-param bf16, well within S3 free-tier. The framework's existing `composer_replication.diloco.make_diloco_outer_loop` wraps `torchft.local_sgd.DiLoCo`. To run that across N serverless replicas: >>> from composer_replication.diloco.serverless import ( ... LocalProcessExecutor, ... ObjectStoreAllReduce, ... ) >>> rendezvous = ObjectStoreAllReduce("s3://my-bucket/diloco-runs/run42/") >>> executor = LocalProcessExecutor() >>> handles = executor.launch_replicas( ... n_replicas=4, ... entrypoint="composer_replication.diloco.serverless.replica_entrypoint", ... entrypoint_args={"rendezvous": rendezvous.uri, "rank_env": "REPLICA_RANK"}, ... ) >>> result = executor.collect(handles, timeout=3600) Module layout: - `executor.py` — `ServerlessExecutor` Protocol + base classes + `LocalProcessExecutor` - `allreduce.py` — `ObjectStoreAllReduce` + `MockManager` (drops into torchft path) - `modal.py` — `ModalExecutor` (skeleton — implements when modal-client is available) - `hf_jobs.py` — `HFJobsExecutor` (skeleton — uses huggingface_hub.run_job) - `replica_entrypoint.py` — script each replica runs (loaded from object store) Optional dependency: `pip install -e .[serverless]` pulls fsspec + s3fs + gcsfs. 
Modal/HF Jobs adapters require `modal` and `huggingface_hub` respectively; both
are checked at adapter init time, not at module import.
"""
from __future__ import annotations

from composer_replication.diloco.serverless.allreduce import (
    MockManager,
    ObjectStoreAllReduce,
)
from composer_replication.diloco.serverless.executor import (
    LocalProcessExecutor,
    ReplicaHandle,
    ServerlessExecutor,
)
from composer_replication.diloco.serverless.hf_jobs import HFJobsExecutor
from composer_replication.diloco.serverless.modal import ModalExecutor

__all__ = [
    "HFJobsExecutor",
    "LocalProcessExecutor",
    "MockManager",
    "ModalExecutor",
    "ObjectStoreAllReduce",
    "ReplicaHandle",
    "ServerlessExecutor",
]