ADR-005 — Decoupled DiLoCo over serverless training systems
Status: Accepted Date: 2026-05-26 Wave: 13
Context
The brief's V2 clause says:
take that and combine it with diloco (decoupled, open, any variant of diloco)
The user expanded on this on 2026-05-26: "Decoupled DiLoCo (so that we can leverage modal or huggingface-jobs or other serverless training systems). we need this both on the dataset generation and the RL orchestration side of things."
Spike 008 wrote composer_replication.diloco.make_diloco_outer_loop
(wraps torchft.local_sgd.DiLoCo) but that's a single-process API. To
realize "Decoupled DiLoCo across serverless executors" we need:
- An abstraction layer that lets the framework launch N replicas on different serverless backends (Modal, HF Jobs, SageMaker, etc.) without per-backend code in the trainer.
- A communication primitive that doesn't require inter-job NCCL/RDMA (most serverless executors don't expose that, and DiLoCo doesn't need it — sync happens once per ~500-1000 inner steps).
Options considered
docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md audited 6 executors:
| Executor | Inter-job network | Cold start | $/A100·hr | $/H100·hr |
|---|---|---|---|---|
| Modal | yes (cluster mode) | ~30s | $1.95 | $5.50 |
| HuggingFace Jobs | no | ~60s | $4.18 | $9.50 |
| AWS SageMaker training | yes (warm pools) | ~3-5min | ~$3.06 | ~$8.50 |
| GCP Vertex AI | yes (cluster) | ~5-10min | ~$3.67 | ~$10 |
| Azure ML | yes (cluster) | ~5-10min | ~$3.67 | ~$10 |
| k8s + Volcano/KubeRay | yes (cluster IP) | ~30-90s | (BYO) | (BYO) |
Most expose a "spin up a job, run a script" interface. Few expose inter-job networking; the ones that do require explicit cluster mode (extra cost + config).
Decision
Adopt object-store rendezvous as the default DiLoCo communication primitive across all serverless executors. Specifically:
- composer_replication.diloco.serverless package
  - class ServerlessExecutor(Protocol): uniform interface with launch_replicas / poll / stream_logs / cancel / collect / backend_name / supports_inter_replica_network
  - class ObjectStoreAllReduce: fsspec-backed pseudo-gradient exchange using s3:// / gs:// / az:// / hf:// / file:// (single code path, swappable bucket)
- v0 concrete adapters: ModalExecutor and HFJobsExecutor
- v0.1+ adapters: RunPodExecutor, SageMakerExecutor, K8sExecutor
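As a rough illustration of the uniform interface, the sketch below expresses the seven listed members as a typing.Protocol. Only the member names come from this ADR. Every signature, the ReplicaHandle dataclass, and the InProcessExecutor toy backend are assumptions made for the example, not the framework's actual API.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterator, Protocol, Sequence, runtime_checkable


@dataclass(frozen=True)
class ReplicaHandle:
    """Hypothetical handle identifying one launched replica job."""
    backend: str
    job_id: str


@runtime_checkable
class ServerlessExecutor(Protocol):
    """Member names follow the ADR; all signatures are assumed for illustration."""

    backend_name: str
    supports_inter_replica_network: bool

    def launch_replicas(
        self, n: int, entrypoint: str, env: dict[str, str]
    ) -> Sequence[ReplicaHandle]: ...
    def poll(self, handle: ReplicaHandle) -> str: ...
    def stream_logs(self, handle: ReplicaHandle) -> Iterator[str]: ...
    def cancel(self, handle: ReplicaHandle) -> None: ...
    def collect(self, handle: ReplicaHandle) -> dict[str, str]: ...


class InProcessExecutor:
    """Toy backend that only records job state; it launches nothing real."""

    backend_name = "in-process"
    supports_inter_replica_network = False

    def __init__(self) -> None:
        self._status: dict[str, str] = {}

    def launch_replicas(self, n, entrypoint, env):
        handles = [ReplicaHandle(self.backend_name, f"replica-{i}") for i in range(n)]
        for handle in handles:
            self._status[handle.job_id] = "running"
        return handles

    def poll(self, handle):
        return self._status[handle.job_id]

    def stream_logs(self, handle):
        yield f"[{handle.job_id}] started"

    def cancel(self, handle):
        self._status[handle.job_id] = "cancelled"

    def collect(self, handle):
        return {"job_id": handle.job_id, "status": self.poll(handle)}
```

Because the Protocol is structural, a trainer written against ServerlessExecutor would not need to import any backend SDK; a concrete adapter only has to provide the same members.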
Why object-store rendezvous (not NCCL across jobs)
The DiLoCo paper (arXiv:2311.08105) shows that the outer-loop sync runs once per H = 500-1000 inner steps, equivalent to ~10-30 minutes of wall-clock at typical post-training step rates. For a 1B-param model in bf16:
- Pseudo-gradient size: ~2 GB per replica per outer round
- Sync frequency: ~once per 30 minutes
- Therefore: ~2 GB × N_replicas, every ~30 min, durably written to object storage with a single PutObject per replica + GetObject per other replica
Even with N=8 replicas, that's 16 GB written plus 14 GB × 8 = 112 GB read, so 128 GB of total traffic spread over 30 minutes, or ~70 MB/s aggregate. S3 handles this throughput comfortably, and S3 cross-job reads cost ~$0.0001 per GET. Total inter-replica communication cost: ~$0.05 per outer round. Negligible compared to GPU spend.
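The traffic estimate can be reproduced with a few lines of arithmetic. The sketch assumes each replica writes its pseudo-gradient once per outer round and reads the pseudo-gradient of every other replica once; the function name and the returned keys are illustrative only.

```python
def outer_round_traffic(
    n_params: float, bytes_per_param: int, n_replicas: int, round_seconds: float
) -> dict[str, float]:
    """Estimate object-store traffic for one DiLoCo outer round."""
    payload_gb = n_params * bytes_per_param / 1e9        # one pseudo-gradient
    write_gb = payload_gb * n_replicas                   # one put per replica
    read_gb = payload_gb * (n_replicas - 1) * n_replicas # each reads all peers
    total_gb = write_gb + read_gb
    return {
        "payload_gb": payload_gb,
        "write_gb": write_gb,
        "read_gb": read_gb,
        "total_gb": total_gb,
        "aggregate_mb_per_s": total_gb * 1000 / round_seconds,
    }


# 1B parameters in bf16 (2 bytes each), 8 replicas, one sync every 30 minutes.
traffic = outer_round_traffic(1e9, 2, 8, 30 * 60)
print(traffic)
```

For these inputs the result is 2 GB per pseudo-gradient, 16 GB written, 112 GB read, and 128 GB in total, which averages to roughly 71 MB/s over the round. Read traffic grows quadratically with the replica count, so the estimate is worth rerunning before scaling N.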
By contrast, cross-job NCCL would require:
- Inter-job networking (mostly unavailable on serverless)
- Sustained low-latency connections (vs. burst-IO once per 30min)
- Backend-specific cluster mode (extra cost and config on each platform, with no portability between them)
Object-store rendezvous decouples the algorithm from the executor and matches DiLoCo's actual communication profile.
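A minimal sketch of the rendezvous pattern follows. It is not the framework's ObjectStoreAllReduce: the class name LocalDirAllReduce, its methods, the key layout, and the JSON encoding are all hypothetical, and a local directory stands in for the bucket so the example runs with the standard library alone. A real implementation would presumably go through fsspec and exchange tensors rather than lists of floats.

```python
from __future__ import annotations

import json
import tempfile
import time
from pathlib import Path


class LocalDirAllReduce:
    """Hypothetical stand-in: a local directory plays the role of the bucket."""

    def __init__(self, root: Path, rank: int, world_size: int) -> None:
        self.root = Path(root)
        self.rank = rank
        self.world_size = world_size

    def _key(self, round_idx: int, rank: int) -> Path:
        # One "object" per replica per outer round.
        return self.root / f"round-{round_idx:06d}" / f"rank-{rank}.json"

    def publish(self, round_idx: int, pseudo_grad: list[float]) -> None:
        path = self._key(round_idx, self.rank)
        path.parent.mkdir(parents=True, exist_ok=True)
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps(pseudo_grad))
        tmp.replace(path)  # rename so readers never observe a partial file

    def gather_mean(
        self, round_idx: int, timeout_s: float = 5.0, poll_s: float = 0.05
    ) -> list[float]:
        """Wait until every replica has published, then average elementwise."""
        paths = [self._key(round_idx, r) for r in range(self.world_size)]
        deadline = time.monotonic() + timeout_s
        while not all(p.exists() for p in paths):
            if time.monotonic() > deadline:
                raise TimeoutError(f"round {round_idx}: peers did not publish")
            time.sleep(poll_s)
        grads = [json.loads(p.read_text()) for p in paths]
        return [sum(column) / self.world_size for column in zip(*grads)]


def demo() -> list[float]:
    with tempfile.TemporaryDirectory() as bucket:
        replicas = [LocalDirAllReduce(Path(bucket), r, 2) for r in range(2)]
        replicas[0].publish(0, [1.0, 2.0])
        replicas[1].publish(0, [3.0, 6.0])
        return replicas[0].gather_mean(0)
```

The only coordination is "write your object, then poll until all peer objects exist", which is why no inter-job network is needed. The write-then-rename step matters on a local filesystem; on an object store, where a completed put becomes visible as a whole object, that step would likely be unnecessary.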
Why Modal + HF Jobs as the v0 executors
- Modal: best dev velocity, sub-minute cold start, mature Python SDK, user already has CLI configured. Gives us a fast iteration loop for the serverless layer.
- HuggingFace Jobs: zero acquisition cost (HF token already wired up), brand-aligned with the framework's HF-native posture, ~$4.18/A100·hr. Not the cheapest, but the right "default executor for HF users."
These two cover the spectrum of "fast for development" + "natural HF integration." Other executors are documented and stubbed but not implemented in v0.
Consequences
Accepted
- New package composer_replication.diloco.serverless:
  - executor.py: ServerlessExecutor Protocol + base class
  - allreduce.py: ObjectStoreAllReduce mock Manager that drops into make_diloco_outer_loop with no changes to the existing wrapper
  - modal.py: ModalExecutor (~150 LOC)
  - hf_jobs.py: HFJobsExecutor (~150 LOC)
  - replica_entrypoint.py: the script each replica runs (loaded from HF Datasets / object store)
- New optional dependency [serverless] extra: pip install -e .[serverless] pulls fsspec, s3fs, huggingface_hub (already a transitive dep), and modal-client (only if user opts in to Modal).
- Smoke test in spikes/009-decoupled-diloco/ (new, deferred, not part of this wave's commit): local-only file:// rendezvous between two Python processes in tests/test_serverless_local.py. Multi-cloud test is post-replication.
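One way such a two-process local smoke test might look is sketched below. It does not reproduce tests/test_serverless_local.py, whose contents are not shown here: the replica script, the file naming, and the fake pseudo-gradients are assumptions, and plain files in a temporary directory stand in for a file:// rendezvous.

```python
import json
import subprocess
import sys
import tempfile

# Script run by each replica process: publish a fake pseudo-gradient,
# wait for all peers, then print the elementwise mean as JSON.
REPLICA_SCRIPT = r"""
import json, sys, time
from pathlib import Path

root, rank, world = Path(sys.argv[1]), int(sys.argv[2]), int(sys.argv[3])
local = [float(rank + 1)] * 4
tmp = root / f"rank-{rank}.tmp"
tmp.write_text(json.dumps(local))
tmp.replace(root / f"rank-{rank}.json")  # publish via rename
paths = [root / f"rank-{r}.json" for r in range(world)]
deadline = time.monotonic() + 30
while not all(p.exists() for p in paths):
    if time.monotonic() > deadline:
        sys.exit("timed out waiting for peers")
    time.sleep(0.05)
grads = [json.loads(p.read_text()) for p in paths]
print(json.dumps([sum(c) / world for c in zip(*grads)]))
"""


def run_smoke(world_size: int = 2) -> list:
    """Launch world_size replica processes and return each one's averaged result."""
    with tempfile.TemporaryDirectory() as root:
        procs = [
            subprocess.Popen(
                [sys.executable, "-c", REPLICA_SCRIPT, root, str(rank), str(world_size)],
                stdout=subprocess.PIPE,
                text=True,
            )
            for rank in range(world_size)
        ]
        outputs = [proc.communicate(timeout=60)[0] for proc in procs]
        if any(proc.returncode != 0 for proc in procs):
            raise RuntimeError("a replica process failed")
    return [json.loads(out) for out in outputs]
```

With two replicas publishing vectors of 1.0 and 2.0, every process should report a mean of 1.5 in each position; agreement across processes is the property the smoke test checks.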
Open / deferred
- Real serverless smoke: spinning up 2 Modal containers + S3 rendezvous + verifying both converge. Deferred to a small-budget post-Wave-13 spike ($2-5 estimated). Not blocking for the v0 packaging.
- HF Jobs API stability: HF Jobs is a relatively new product. The recon flagged "API may evolve through 2026"; we pin to a specific huggingface_hub minor and bump deliberately.
Trade-offs explicitly accepted
- We do NOT use Modal's cluster/RDMA mode in v0. That gives sub-second cross-job NCCL but costs more and is Modal-only. Object-store rendezvous is the right default; users on Modal who want faster sync can override.
- We do NOT support job-internal multi-GPU training in this layer. The serverless layer is for inter-replica sync; intra-replica training uses the existing make_diloco_outer_loop (which itself can wrap multi-GPU FSDP via torchft).
Source
docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md (2026-05-26 subagent
recon, primary-sourced from each provider's official docs + pricing pages).