| # ADR-005 — Decoupled DiLoCo over serverless training systems | |
| **Status**: Accepted | |
| **Date**: 2026-05-26 | |
| **Wave**: 13 | |
| ## Context | |
| The brief's V2 clause says: | |
| > take that and combine it with diloco (decoupled, open, any variant of diloco) | |
| The user expanded on this on 2026-05-26: *"Decoupled DiLoCo (so that we can leverage | |
| modal or huggingface-jobs or other serverless training systems). we need | |
| this both on the dataset generation and the RL orchestration side of | |
| things."* | |
| Spike 008 wrote `composer_replication.diloco.make_diloco_outer_loop` | |
| (wraps `torchft.local_sgd.DiLoCo`) but that's a single-process API. To | |
| realize "Decoupled DiLoCo across serverless executors" we need: | |
| 1. An abstraction layer that lets the framework launch N replicas on | |
| different serverless backends (Modal, HF Jobs, SageMaker, etc.) without | |
| per-backend code in the trainer. | |
| 2. A communication primitive that doesn't require inter-job NCCL/RDMA | |
| (most serverless executors don't expose that, and DiLoCo doesn't need | |
| it — sync happens once per ~500-1000 inner steps). | |
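For reference, the reason sync can be this infrequent is visible in the outer step of DiLoCo as described in the paper (arXiv:2311.08105). The block below is a summary in the paper's notation, not this framework's code: $k$ replicas each start from the shared parameters $\theta^{(t-1)}$, run $H$ inner steps independently to reach $\theta_i^{(t)}$, and only then exchange one averaged pseudo-gradient $\Delta^{(t)}$.

```latex
\Delta^{(t)} = \frac{1}{k} \sum_{i=1}^{k} \left( \theta^{(t-1)} - \theta_i^{(t)} \right),
\qquad
\theta^{(t)} = \mathrm{OuterOpt}\!\left( \theta^{(t-1)},\, \Delta^{(t)} \right)
```

The only cross-replica quantity is the sum inside $\Delta^{(t)}$, which is why any medium that can move one tensor per replica per round is sufficient.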
| ## Options considered | |
| `docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md` audited 6 executors: | |
| | Executor | Inter-job network | Cold start | $/A100·hr | $/H100·hr | | |
| |---|---|---|---|---| | |
| | Modal | yes (cluster mode) | ~30s | $1.95 | $5.50 | | |
| | HuggingFace Jobs | no | ~60s | $4.18 | $9.50 | | |
| | AWS SageMaker training | yes (warm pools) | ~3-5min | ~$3.06 | ~$8.50 | | |
| | GCP Vertex AI | yes (cluster) | ~5-10min | ~$3.67 | ~$10 | | |
| | Azure ML | yes (cluster) | ~5-10min | ~$3.67 | ~$10 | | |
| | k8s + Volcano/KubeRay | yes (cluster IP) | ~30-90s | (BYO) | (BYO) | | |
| Most expose a "spin up a job, run a script" interface. Inter-job networking | |
| is less uniform: HuggingFace Jobs has none, and the executors that do offer | |
| it require an explicit cluster mode (extra cost + config). | |
| ## Decision | |
| **Adopt object-store rendezvous as the default DiLoCo communication | |
| primitive across all serverless executors.** Specifically: | |
| - `composer_replication.diloco.serverless` package | |
| - `class ServerlessExecutor(Protocol)` — uniform interface with | |
| `launch_replicas / poll / stream_logs / cancel / collect / | |
| backend_name / supports_inter_replica_network` | |
| - `class ObjectStoreAllReduce` — fsspec-backed pseudo-gradient exchange | |
| using s3:// / gs:// / az:// / hf:// / file:// — single code path, swappable | |
| bucket | |
| - v0 concrete adapters: `ModalExecutor` and `HFJobsExecutor` | |
| - v0.1+ adapters: `RunPodExecutor`, `SageMakerExecutor`, `K8sExecutor` | |
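A minimal sketch of what the `ServerlessExecutor` protocol could look like. Only the member names come from the list above. The signatures, the `ReplicaHandle` type, and the toy `InProcessExecutor` are assumptions made for illustration, not the package's actual API.

```python
from dataclasses import dataclass
from typing import Iterator, Protocol, Sequence, runtime_checkable


@dataclass(frozen=True)
class ReplicaHandle:
    """Hypothetical handle for one launched replica (not named in the ADR)."""
    replica_id: int
    job_id: str


@runtime_checkable
class ServerlessExecutor(Protocol):
    """Uniform executor interface; member names follow the ADR, signatures are assumed."""

    @property
    def backend_name(self) -> str: ...

    @property
    def supports_inter_replica_network(self) -> bool: ...

    def launch_replicas(self, n: int, entrypoint: str) -> Sequence[ReplicaHandle]: ...
    def poll(self, handle: ReplicaHandle) -> str: ...
    def stream_logs(self, handle: ReplicaHandle) -> Iterator[str]: ...
    def cancel(self, handle: ReplicaHandle) -> None: ...
    def collect(self, handle: ReplicaHandle) -> dict: ...


class InProcessExecutor:
    """Toy backend that pretends every replica finishes instantly."""

    backend_name = "in-process"
    supports_inter_replica_network = False

    def launch_replicas(self, n, entrypoint):
        # A real adapter would submit n jobs to its backend here.
        return [ReplicaHandle(i, f"local-{i}") for i in range(n)]

    def poll(self, handle):
        return "succeeded"

    def stream_logs(self, handle):
        yield f"[{handle.job_id}] done"

    def cancel(self, handle):
        return None

    def collect(self, handle):
        return {"replica_id": handle.replica_id, "status": "succeeded"}
```

Because the protocol is structural, a backend adapter does not need to inherit from anything; the trainer can depend on `ServerlessExecutor` alone and check `supports_inter_replica_network` to decide whether a faster sync path is even possible.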
| ### Why object-store rendezvous (not NCCL across jobs) | |
| The DiLoCo paper (arXiv:2311.08105) syncs the outer loop only **once per | |
| H = 500-1000 inner steps**, equivalent to ~10-30 minutes of wall-clock at | |
| typical post-training step rates. For a 1B-param model in bf16: | |
| - Pseudo-gradient size: ~2 GB per replica per outer round | |
| - Sync frequency: ~once per 30 minutes | |
| - Therefore: ~2 GB × N_replicas, every ~30 min, durably written to object | |
| storage with a single `PutObject` per replica + one `GetObject` per other | |
| replica | |
| Even with N=8 replicas, that's 16 GB written + 14 GB read × 8 replicas = | |
| 112 GB read, i.e. 128 GB total spread over 30 minutes = ~70 MB/s aggregate. | |
| **S3 handles this throughput without breaking a sweat**, and S3 cross-job | |
| reads cost well under $0.0001 per GET. Estimated inter-replica communication | |
| cost: ~$0.05 per outer round (request charges only; data-egress fees, where | |
| a provider bills them, are extra). **Negligible compared to GPU spend.** | |
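The traffic estimate can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch: the function name and its parameters are illustrative, and it assumes every replica writes its pseudo-gradient once and reads every other replica's copy once per outer round.

```python
def outer_round_traffic(n_params, bytes_per_param, n_replicas, round_seconds):
    """Estimate object-store traffic for one DiLoCo outer round."""
    payload_gb = n_params * bytes_per_param / 1e9           # one pseudo-gradient
    written_gb = payload_gb * n_replicas                    # one upload per replica
    read_gb = payload_gb * (n_replicas - 1) * n_replicas    # each reads all others
    total_gb = written_gb + read_gb
    return {
        "payload_gb": payload_gb,
        "written_gb": written_gb,
        "read_gb": read_gb,
        "total_gb": total_gb,
        "mb_per_s": total_gb * 1000 / round_seconds,        # aggregate throughput
    }


# 1B parameters in bf16 (2 bytes each), 8 replicas, one sync every 30 minutes.
traffic = outer_round_traffic(1_000_000_000, 2, 8, 30 * 60)
print(traffic)
```

For these inputs the sketch gives 16 GB written, 112 GB read, 128 GB in total, and roughly 71 MB/s aggregate. Read volume grows quadratically with the replica count, so the same function is a quick way to see where a larger N would start to matter.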
| By contrast, cross-job NCCL would require: | |
| - Inter-job networking (unavailable by default on most serverless executors) | |
| - Sustained low-latency connections (vs. burst I/O once per ~30 min) | |
| - A backend-specific cluster mode (extra cost and config, not portable across executors) | |
| Object-store rendezvous decouples the algorithm from the executor and | |
| matches DiLoCo's actual communication profile. | |
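The rendezvous pattern itself is small. The sketch below is a hypothetical stand-in for `ObjectStoreAllReduce`, written against a local directory with the standard library only so that it runs anywhere. The class name `LocalDirAllReduce`, the key layout, and the JSON payload are all assumptions; the real class is described as fsspec-backed, so the same publish, wait, and average steps would presumably go through a filesystem object for `s3://`, `gs://`, and so on.

```python
import json
import os
import tempfile
import time
from pathlib import Path


class LocalDirAllReduce:
    """Hypothetical stand-in for ObjectStoreAllReduce over a local directory."""

    def __init__(self, root, replica_id, n_replicas, poll_s=0.05, timeout_s=30.0):
        self.root = Path(root)
        self.replica_id = replica_id
        self.n_replicas = n_replicas
        self.poll_s = poll_s
        self.timeout_s = timeout_s

    def _path(self, round_idx, replica_id):
        # Assumed key layout: one object per replica per outer round.
        return self.root / f"round-{round_idx:06d}" / f"replica-{replica_id}.json"

    def publish(self, round_idx, pseudo_grad):
        """Write this replica's pseudo-gradient for the round, atomically."""
        dest = self._path(round_idx, self.replica_id)
        dest.parent.mkdir(parents=True, exist_ok=True)
        fd, tmp = tempfile.mkstemp(dir=dest.parent, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump(pseudo_grad, f)
        os.replace(tmp, dest)  # readers never observe a half-written file

    def all_reduce_mean(self, round_idx, pseudo_grad):
        """Publish, wait for every replica, then return the element-wise mean."""
        self.publish(round_idx, pseudo_grad)
        paths = [self._path(round_idx, r) for r in range(self.n_replicas)]
        deadline = time.monotonic() + self.timeout_s
        while not all(p.exists() for p in paths):
            if time.monotonic() > deadline:
                raise TimeoutError(f"round {round_idx}: some replicas never published")
            time.sleep(self.poll_s)
        grads = [json.loads(p.read_text()) for p in paths]
        return [sum(column) / self.n_replicas for column in zip(*grads)]
```

Each replica constructs its own instance with a shared `root` and a distinct `replica_id`, then calls `all_reduce_mean(round_idx, pseudo_grad)` once per outer round. No replica ever talks to another directly, which is what lets the executors stay unaware of each other.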
| ### Why Modal + HF Jobs as the v0 executors | |
| - **Modal**: best dev velocity, sub-minute cold start, mature Python SDK, | |
| user already has CLI configured. Gives us a fast iteration loop for the | |
| serverless layer. | |
| - **HuggingFace Jobs**: zero acquisition cost (HF token already wired up), | |
| brand-aligned with the framework's HF-native posture, ~$4.18/A100·hr. | |
| Not the cheapest, but the right "default executor for HF users." | |
| These two cover the spectrum of "fast for development" + "natural HF | |
| integration." Other executors are documented and stubbed but not | |
| implemented in v0. | |
| ## Consequences | |
| ### Accepted | |
| - New package `composer_replication.diloco.serverless`: | |
| - `executor.py` — `ServerlessExecutor` Protocol + base class | |
| - `allreduce.py` — `ObjectStoreAllReduce`, a mock `Manager` that drops into | |
| `make_diloco_outer_loop` with no changes to the existing wrapper | |
| - `modal.py` — `ModalExecutor` (~150 LOC) | |
| - `hf_jobs.py` — `HFJobsExecutor` (~150 LOC) | |
| - `replica_entrypoint.py` — the script each replica runs (loaded from | |
| HF Datasets / object store) | |
| - New optional dependency `[serverless]` extra: `pip install -e .[serverless]` | |
| pulls `fsspec`, `s3fs`, `huggingface_hub` (already a transitive dep), and | |
| `modal-client` (only if user opts in to Modal). | |
| - Smoke test in `spikes/009-decoupled-diloco/` (new and deferred; not part | |
| of this wave's commit): a local-only `file://` rendezvous between two | |
| Python processes, in `tests/test_serverless_local.py`. Multi-cloud testing | |
| is post-replication. | |
| ### Open / deferred | |
| - **Real serverless smoke**: spinning up 2 Modal containers + S3 rendezvous | |
| + verifying both converge. Deferred to a small-budget post-Wave-13 spike | |
| ($2-5 estimated). Not blocking for the v0 packaging. | |
| - **HF Jobs API stability**: HF Jobs is a relatively new product. The | |
| recon flagged "API may evolve through 2026"; we pin to a specific | |
| `huggingface_hub` minor and bump deliberately. | |
| ### Trade-offs explicitly accepted | |
| - We do NOT use Modal's cluster/RDMA mode in v0. That gives sub-second | |
| cross-job NCCL but costs more and is Modal-only. Object-store rendezvous | |
| is the right default; users on Modal who want faster sync can override. | |
| - We do NOT support job-internal multi-GPU training in this layer. The | |
| serverless layer is for **inter-replica** sync; intra-replica training | |
| uses the existing `make_diloco_outer_loop` (which itself can wrap | |
| multi-GPU FSDP via torchft). | |
| ## Source | |
| `docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md` (2026-05-26 subagent | |
| recon, primary-sourced from each provider's official docs + pricing pages). | |