# V3 Substrate Coverage: Monarch / TorchForge / OpenEnv / VeRL / TRL / DiLoCo

The brief's V3 clause asks the framework to cover six substrates. This doc
maps each to **what we have** + **what we don't** + **why that's the right
shape** given the substrate's status and the framework's scope.

## TRL: `huggingface/trl`

**Status**: ✅ **Production target for v0.1.** Working code.

**What we have**:

- Research deep-dive: `research/04-verl-trl.md` § 3 (algorithm coverage:
  GRPO / DAPO / DPO / PRM, extension points, `_compute_loss` vs `compute_advantages`)
- Integration recipe: `docs/INTEGRATION_ARCHITECTURE.md` Recipe A
- Working code: `composer_replication.trainer.ComposerReplicationTrainer`
  subclasses `GRPOTrainer` and overrides `_compute_loss(model, inputs)` to
  compose 3 channels (`grpo + α·sdpo + β·trace_replay_dpo`)
- Data collator: `composer_replication.trainer.data_collator.ComposerDataCollator`
  builds the `inputs` dict the trainer expects
- DeepWiki audit: extension surface verified against TRL HEAD as of 2026-05-25

**What we don't**:

- A full end-to-end training run (gated on real GPU rollouts and
  reward calculations, which are out of scope for the CPU-budget deep-work-loop)

**Why this shape**: TRL is the most-supported substrate for GRPO post-training.
Subclassing `GRPOTrainer` and overriding `_compute_loss` is the
cleanest extension path. Production v0.1 lives here.
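
The three-channel composition named above (`grpo + α·sdpo + β·trace_replay_dpo`) can be sketched in isolation. This is a minimal illustration, not the code in `ComposerReplicationTrainer`: the `ChannelWeights` container, the `compose_loss` helper, and the example weights are all hypothetical, and plain floats stand in for the per-batch loss tensors a real trainer would produce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelWeights:
    """Hypothetical container for the two mixing coefficients."""
    alpha: float  # weight on the SDPO channel
    beta: float   # weight on the trace-replay DPO channel


def compose_loss(grpo, sdpo, trace_replay_dpo, weights: ChannelWeights):
    """Return grpo + alpha * sdpo + beta * trace_replay_dpo.

    Only + and * are used, so the same function would also accept
    tensor-like objects that support those operators.
    """
    return grpo + weights.alpha * sdpo + weights.beta * trace_replay_dpo


# Example values are made up for illustration.
weights = ChannelWeights(alpha=0.5, beta=0.25)
total = compose_loss(grpo=1.0, sdpo=2.0, trace_replay_dpo=4.0, weights=weights)
print(total)  # 1.0 + 0.5 * 2.0 + 0.25 * 4.0 = 3.0
```

Setting `alpha` and `beta` to zero recovers the plain GRPO loss, which is one way to check that an override has not changed baseline behaviour.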

---

## VeRL: `volcengine/verl`

**Status**: 🟡 **Production target for v0.2 (multi-node scale).** Skeleton, not yet runnable.

**What we have**:

- Research deep-dive: `research/04-verl-trl.md` § 4 (3D-HybridEngine,
  resharding pattern, advantage estimator registry)
- Integration recipe: `docs/INTEGRATION_ARCHITECTURE.md` Recipe B
- Skeleton code: `spikes/005-integrated-trainer-skeleton/verl_path/`
  - `composer_adv.py` (110 LOC): `@register_adv_est("composer_3channel")` decorator
  - `composer_config.yaml` (89 LOC): full PPO trainer config with our advantage estimator wired in
- DeepWiki audit: extension surface verified against VeRL HEAD as of 2026-05-25

**What we don't**:

- A working VeRL run on real hardware (VeRL itself has steep setup;
  v0.1 prioritizes TRL because it's faster to iterate on)

**Why this shape**: VeRL's 3D-HybridEngine and decentralized scheduler
scale better than TRL's beyond 32 GPUs. We build the recipe but don't make it
the default. The framework supports either path; users on clusters larger
than 8 GPUs should use VeRL.

---

## DiLoCo: `meta-pytorch/torchft`

**Status**: 🟡 **Outer-loop wrapper integrated.** Multi-replica convergence GPU-gated.

**What we have**:

- Research deep-dive: `research/02-diloco-family.md` (DiLoCo / OpenDiLoCo /
  Streaming DiLoCo / PRIME-RL / INTELLECT-1+2: full audit with primary
  source links and license/maturity assessment)
- ADR: `docs/adrs/ADR-003-diloco-impl.md` chose `torchft.local_sgd.DiLoCo`
  (BSD-3, Meta-maintained, library-not-research-code) over 4 alternatives
- Working code: the `composer_replication.diloco.make_diloco_outer_loop`
  wrapper, which documents the sign convention (pseudo-grad = θ_initial - θ_local)
- Spike 008: 5/5 single-process tests. The **sign-convention test** is the
  single best test in the framework (per cross-model review).
- Reconnaissance: `docs/research/DILOCO_RECONNAISSANCE.md`

**What we don't**:

- A true multi-replica convergence test. Single-process post-hook
  sequencing prevents this (replica A's outer step completes before
  replica B's allreduce arrives). The real multi-process test is deferred to
  the GPU phase.
- Trainer integration. The wrapper is a context manager; wiring it into the
  `ComposerReplicationTrainer.train()` lifecycle is a separate spike.

**Why this shape**: DiLoCo's value proposition (decentralized inner training
with sparse outer sync) only matters at multi-cluster scale. Our v0.1
target is single-cluster training with TRL. The DiLoCo wrapper is in place
so that, once trainer integration lands, v0.2 multi-cluster training can
switch it on with one config change.
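
The sign convention above (pseudo-grad = θ_initial - θ_local) is easy to get backwards, so a small worked example may help. This is a sketch under simplifying assumptions: parameters are plain Python lists, the outer optimizer is plain SGD (the DiLoCo paper uses Nesterov momentum for the outer step), and `pseudo_gradient` and `outer_step` are hypothetical helper names, not the `torchft` or `composer_replication` API.

```python
from typing import List, Sequence


def pseudo_gradient(theta_initial: Sequence[float],
                    theta_local: Sequence[float]) -> List[float]:
    """Sign convention from the text: pseudo-grad = theta_initial - theta_local."""
    return [g - l for g, l in zip(theta_initial, theta_local)]


def outer_step(theta_initial: Sequence[float],
               replica_params: Sequence[Sequence[float]],
               outer_lr: float = 1.0) -> List[float]:
    """Average the replicas' pseudo-gradients, then take one plain SGD step."""
    grads = [pseudo_gradient(theta_initial, p) for p in replica_params]
    n = len(grads)
    avg = [sum(col) / n for col in zip(*grads)]
    return [t - outer_lr * g for t, g in zip(theta_initial, avg)]


theta0 = [1.0, 2.0]                      # shared starting point
replicas = [[0.5, 2.5], [0.0, 3.5]]      # parameters after local inner steps
theta1 = outer_step(theta0, replicas, outer_lr=1.0)
print(theta1)  # [0.25, 3.0], the mean of the replica parameters
```

With `outer_lr=1.0` and plain SGD, the correct sign makes the outer step land exactly on the replica average. Flipping the sign would instead move the global parameters away from where the replicas went, which is what a sign-convention test can catch.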

---

## OpenEnv

**Status**: 📋 **Reference pattern (substrate, not a choice).**

**What we have**:

- Research deep-dive: `research/03-monarch-torchforge-openenv.md` § OpenEnv
  (the env-format standard, how it interacts with TRL's `environment_factory=`)
- Integration recipe: `docs/INTEGRATION_ARCHITECTURE.md` Recipe D,
  "OpenEnv is a substrate, not a choice"

**What we don't**:

- Direct OpenEnv code dependency. The framework's data path is
  OpenEnv-compatible by virtue of using TRL's API, which accepts
  `environment_factory=` kwargs that OpenEnv environments satisfy.

**Why this shape**: OpenEnv is a *protocol* (how an env exposes itself
to a trainer), not a library you depend on. You either implement an
OpenEnv-compatible environment or you don't. Composer 2.5's "Feature
Deletion" environment is OpenEnv-shaped; if a user provides one, our
TRL trainer accepts it via `environment_factory=`.
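
The "protocol, not a library" point can be made concrete with structural typing. The sketch below is an assumption-heavy illustration: the `Env` interface with Gymnasium-style `reset`/`step` methods, the toy `CountdownEnv`, and the `run_episode` driver are all hypothetical and are not the OpenEnv specification or TRL's `environment_factory=` contract. It only shows how a consumer can depend on the shape of an environment without importing an environment library.

```python
from typing import Callable, Iterable, Protocol, Tuple


class Env(Protocol):
    """Hypothetical minimal environment interface (not OpenEnv's actual API)."""

    def reset(self) -> str: ...

    def step(self, action: str) -> Tuple[str, float, bool]: ...


class CountdownEnv:
    """Toy environment: the episode ends after `budget` steps."""

    def __init__(self, budget: int = 2) -> None:
        self.budget = budget
        self.remaining = budget

    def reset(self) -> str:
        self.remaining = self.budget
        return f"{self.remaining} steps left"

    def step(self, action: str) -> Tuple[str, float, bool]:
        self.remaining -= 1
        done = self.remaining <= 0
        # Reward only a "finish" action issued on the final step.
        reward = 1.0 if action == "finish" and done else 0.0
        return f"{self.remaining} steps left", reward, done


def run_episode(environment_factory: Callable[[], Env],
                actions: Iterable[str]) -> float:
    """Build a fresh env from the factory and sum rewards until done."""
    env = environment_factory()
    env.reset()
    total = 0.0
    for action in actions:
        _, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total


episode_return = run_episode(CountdownEnv, ["edit", "finish"])
print(episode_return)  # 1.0
```

`CountdownEnv` never inherits from `Env`; it satisfies the interface by having the right methods, which mirrors the claim that an environment is either protocol-compatible or not.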

---

## Monarch (Meta)

**Status**: 📋 **Reference pattern (alternative coordination model).**

**What we have**:

- Research deep-dive: `research/03-monarch-torchforge-openenv.md` § Monarch
  (actor mesh, hardware abstractions, comparison to Ray)
- Integration recipe: `docs/INTEGRATION_ARCHITECTURE.md` Recipe C,
  "TorchForge + Monarch (reference patterns only, not a production target)"

**What we don't**:

- Direct Monarch code dependency. We use DiLoCo's pseudo-gradient sync
  as our coordination model; Monarch's actor mesh is an alternative.

**Why this shape**: Monarch is alive (Meta is shipping it) but it's a
*coordination layer*, not an *algorithm*. Our framework integrates with
PyTorch + TRL + torchft directly; Monarch would replace the coordination
layer underneath. Documented as a future option; not a v0.1 dependency.

---

## TorchForge (Meta, paused)

**Status**: 📋 **Reference only (upstream paused).**

**What we have**:

- Research deep-dive: `research/03-monarch-torchforge-openenv.md` § TorchForge
  (design lessons captured)

**What we don't**:

- Code dependency. TorchForge as a project was paused by Meta.

**Why this shape**: The brief asked us to research TorchForge. We did.
The headline finding is "Meta paused this." That's a real research output
even if it doesn't translate to code.

---

## Summary

| Substrate | Research | Recipe | Code | Tests | v0.1 production? |
|---|---|---|---|---|---|
| TRL | ✅ | ✅ | ✅ | 38 + 9 + 3 = 50 | ✅ |
| VeRL | ✅ | ✅ | 🟡 (skeleton) | n/a | v0.2 |
| **PRIME-RL** (Wave 13) | ✅ | ✅ | 🟡 (loss adapter + config) | n/a | v0.2 (cleanest hook) |
| DiLoCo (single-process) | ✅ | ✅ | ✅ | 5 (single-replica) | optional |
| **DiLoCo over serverless** (Wave 13) | ✅ | ✅ ADR-005 | ✅ Local + 🟡 Modal/HF Jobs | 9 multi-process | ✅ (local) / future (cloud) |
| OpenEnv | ✅ | ✅ | n/a (protocol) | n/a | substrate |
| **Monarch** (Wave 13) | ✅ | ✅ (actor layout) | 🟡 (skeleton) | n/a | v0.2+ |
| TorchForge | ✅ | n/a (paused) | n/a | n/a | n/a |

**8/8 substrates covered** (was 6/6 pre-Wave-13). New since Wave 13:
PRIME-RL (the cleanest custom-loss hook), Monarch (Meta's actively-shipped
agentic-stack component), and serverless DiLoCo (Modal/HF Jobs adapters
plus object-store rendezvous). The framework can now realize Decoupled
DiLoCo across cloud executors **without any cross-job NCCL**; see
ADR-005 for the design rationale.
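
To illustrate what an object-store rendezvous might look like in place of a cross-job collective, here is a rough sketch. It is not the ADR-005 design, which is not reproduced in this doc: a local temporary directory stands in for the object store, and `publish`, `try_reduce`, and the `round-<n>/replica-<i>.json` key layout are hypothetical. Real object stores also differ in atomicity and consistency guarantees, which this sketch ignores.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Optional


def publish(store: Path, round_id: int, replica_id: int,
            pseudo_grad: List[float]) -> None:
    """Write one replica's pseudo-gradient under a per-round prefix."""
    prefix = store / f"round-{round_id}"
    prefix.mkdir(parents=True, exist_ok=True)
    tmp = prefix / f"replica-{replica_id}.json.tmp"
    tmp.write_text(json.dumps(pseudo_grad))
    # Rename last so readers never see a half-written file.
    tmp.rename(prefix / f"replica-{replica_id}.json")


def try_reduce(store: Path, round_id: int,
               world_size: int) -> Optional[List[float]]:
    """Return the averaged pseudo-gradient once every replica has published."""
    files = sorted((store / f"round-{round_id}").glob("replica-*.json"))
    if len(files) < world_size:
        return None  # rendezvous incomplete; the caller would poll again later
    grads = [json.loads(f.read_text()) for f in files]
    return [sum(col) / len(grads) for col in zip(*grads)]


with tempfile.TemporaryDirectory() as tmp_dir:
    store = Path(tmp_dir)
    publish(store, round_id=0, replica_id=0, pseudo_grad=[0.5, -0.5])
    partial = try_reduce(store, round_id=0, world_size=2)    # one replica missing
    publish(store, round_id=0, replica_id=1, pseudo_grad=[1.0, -1.5])
    averaged = try_reduce(store, round_id=0, world_size=2)
    print(partial, averaged)  # None [0.75, -1.0]
```

Because replicas only read and write files, jobs started by different executors never need a shared process group; the cost is polling latency instead of a blocking allreduce.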