# V3 Substrate Coverage: Monarch / TorchForge / OpenEnv / VeRL / TRL / DiLoCo
The brief's V3 clause asks the framework to cover six substrates. This doc
maps each to **what we have** + **what we don't** + **why that's the right
shape** given the substrate's status and the framework's scope.
## TRL — `huggingface/trl`
**Status**: ✅ **Production target for v0.1.** Working code.
**What we have**:
- Research deep-dive: `research/04-verl-trl.md` § 3 (algorithm coverage:
GRPO / DAPO / DPO / PRM, extension points, `_compute_loss` vs `compute_advantages`)
- Integration recipe: `docs/INTEGRATION_ARCHITECTURE.md` Recipe A
- Working code: `composer_replication.trainer.ComposerReplicationTrainer`
subclasses `GRPOTrainer`, overrides `_compute_loss(model, inputs)` to
compose 3 channels (`grpo + α·sdpo + β·trace_replay_dpo`)
- Data collator: `composer_replication.trainer.data_collator.ComposerDataCollator`
builds the `inputs` dict the trainer expects
- DeepWiki audit: extension surface verified against TRL HEAD as of 2026-05-25
**What we don't**:
- A full end-to-end training run (gated on real GPU rollouts +
reward calculations; out of scope for the CPU-budget deep-work-loop)
**Why this shape**: TRL is the most-supported substrate for GRPO post-training.
Subclassing `GRPOTrainer` and overriding `_compute_loss` is the cleanest
extension path. Production v0.1 lives here.
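The three-channel composition (`grpo + α·sdpo + β·trace_replay_dpo`) can be sketched without TRL installed. This is a minimal illustration, not the code in `composer_replication.trainer`: `BaseTrainerStandIn` stands in for `GRPOTrainer`, and `ChannelWeights`, `compose_loss`, the weight defaults, and the `inputs` keys are all hypothetical names chosen for the example. Plain floats replace tensors so the arithmetic is easy to check.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class ChannelWeights:
    """Hypothetical container for the two mixing coefficients."""
    alpha: float = 0.1   # weight on the SDPO channel (illustrative default)
    beta: float = 0.05   # weight on the trace-replay DPO channel (illustrative default)


def compose_loss(grpo: float, sdpo: float, trace_replay_dpo: float,
                 weights: ChannelWeights) -> float:
    """Return grpo + alpha * sdpo + beta * trace_replay_dpo."""
    return grpo + weights.alpha * sdpo + weights.beta * trace_replay_dpo


class BaseTrainerStandIn:
    """Stand-in for GRPOTrainer: only the overridden hook is modelled."""

    def _compute_loss(self, model: Any, inputs: Dict[str, float]) -> float:
        # Assumption: the base hook yields the plain GRPO loss.
        return inputs["grpo_loss"]


class ThreeChannelTrainerSketch(BaseTrainerStandIn):
    """Overrides the loss hook and adds the two extra channels."""

    def __init__(self, weights: ChannelWeights) -> None:
        self.weights = weights

    def _compute_loss(self, model: Any, inputs: Dict[str, float]) -> float:
        grpo = super()._compute_loss(model, inputs)
        return compose_loss(
            grpo,
            inputs["sdpo_loss"],              # hypothetical key
            inputs["trace_replay_dpo_loss"],  # hypothetical key
            self.weights,
        )
```

In the real trainer the extra channel losses would be computed from the batch rather than read from the `inputs` dict; the sketch only shows where the weighted sum sits relative to the parent hook.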
---
## VeRL — `volcengine/verl`
**Status**: 🟡 **Production target for v0.2 (multi-node scale).** Skeleton, not yet runnable.
**What we have**:
- Research deep-dive: `research/04-verl-trl.md` § 4 (3D-HybridEngine,
resharding pattern, advantage estimator registry)
- Integration recipe: `docs/INTEGRATION_ARCHITECTURE.md` Recipe B
- Skeleton code: `spikes/005-integrated-trainer-skeleton/verl_path/`
- `composer_adv.py` (110 LOC) — `@register_adv_est("composer_3channel")` decorator
- `composer_config.yaml` (89 LOC) — full PPO trainer config with our advantage estimator wired in
- DeepWiki audit: extension surface verified against VeRL HEAD as of 2026-05-25
**What we don't**:
- A working VeRL run on real hardware (VeRL itself has a steep setup cost;
v0.1 prioritizes TRL because it's faster to iterate on)
**Why this shape**: VeRL's 3D-HybridEngine and decentralized scheduler
scale better than TRL's stack beyond 32 GPUs. We build the recipe but don't
make it the default. The framework supports either path; users on >8-GPU
clusters should use VeRL.
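The registry pattern behind `@register_adv_est("composer_3channel")` can be illustrated with a local stand-in. The sketch below does not import VeRL: `ADV_ESTIMATORS` and this `register_adv_est` are re-implemented here for illustration only, and the estimator body (mean-centering rewards within a group) is a placeholder assumption, not the contents of `composer_adv.py`.

```python
from typing import Callable, Dict, List

# Local stand-in registry; VeRL keeps its own registry internally.
ADV_ESTIMATORS: Dict[str, Callable[[List[float]], List[float]]] = {}


def register_adv_est(name: str):
    """Decorator that records an advantage estimator under a string key."""
    def decorator(fn: Callable[[List[float]], List[float]]):
        if name in ADV_ESTIMATORS:
            raise ValueError(f"advantage estimator {name!r} already registered")
        ADV_ESTIMATORS[name] = fn
        return fn
    return decorator


@register_adv_est("composer_3channel")
def composer_3channel(group_rewards: List[float]) -> List[float]:
    """Placeholder estimator: subtract the group mean from each reward."""
    mean = sum(group_rewards) / len(group_rewards)
    return [r - mean for r in group_rewards]


def estimate(name: str, group_rewards: List[float]) -> List[float]:
    """Look up an estimator by the name a config file would carry."""
    return ADV_ESTIMATORS[name](group_rewards)
```

The point of the pattern is that a trainer config only needs the string key, so swapping estimators does not require touching the training loop.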
---
## DiLoCo — `meta-pytorch/torchft`
**Status**: 🟡 **Outer-loop wrapper integrated.** Multi-replica convergence GPU-gated.
**What we have**:
- Research deep-dive: `research/02-diloco-family.md` (DiLoCo / OpenDiLoCo /
Streaming DiLoCo / PRIME-RL / INTELLECT-1+2 — full audit with primary
source links and license/maturity assessment)
- ADR: `docs/adrs/ADR-003-diloco-impl.md` — chose `torchft.local_sgd.DiLoCo`
(BSD-3, Meta-maintained, library-not-research-code) over 4 alternatives
- Working code: `composer_replication.diloco.make_diloco_outer_loop`
wrapper. Documents the sign convention (pseudo-grad = θ_initial - θ_local).
- Spike 008: 5/5 single-process tests. **Sign-convention test** is the
single best test in the framework (per cross-model review).
- Reconnaissance: `docs/research/DILOCO_RECONNAISSANCE.md`
**What we don't**:
- A true multi-replica convergence test. Single-process post-hook
sequencing prevents this (replica A's outer step completes before
replica B's allreduce arrives). A real multi-process test is deferred to
the GPU phase.
- Trainer integration. The wrapper is a context manager; wiring it into
the `ComposerReplicationTrainer.train()` lifecycle is a separate spike.
**Why this shape**: DiLoCo's value proposition (decentralized inner training
with sparse outer sync) only matters at multi-cluster scale. Our v0.1
target is single-cluster training with TRL. The DiLoCo wrapper is wired
up so v0.2 multi-cluster training can switch it on with one config change.
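The sign convention documented by the wrapper (pseudo-grad = θ_initial - θ_local) can be checked with a few lines of arithmetic. The sketch below is an illustration under simplifying assumptions: parameters are plain lists of floats, the outer optimizer is plain SGD (DiLoCo setups commonly use a momentum-based outer optimizer instead), and `pseudo_gradient` and `outer_step` are hypothetical helper names, not part of `torchft` or `composer_replication.diloco`.

```python
from typing import List

Params = List[float]


def pseudo_gradient(theta_initial: Params, theta_local: Params) -> Params:
    """Pseudo-gradient with the documented sign: theta_initial - theta_local."""
    return [g - l for g, l in zip(theta_initial, theta_local)]


def outer_step(theta_initial: Params, replica_params: List[Params],
               outer_lr: float = 1.0) -> Params:
    """Average the replicas' pseudo-gradients and take one plain SGD step."""
    grads = [pseudo_gradient(theta_initial, p) for p in replica_params]
    avg = [sum(col) / len(grads) for col in zip(*grads)]
    # SGD subtracts the gradient, so this sign moves the shared parameters
    # toward the replicas' local parameters.
    return [t - outer_lr * g for t, g in zip(theta_initial, avg)]
```

With `outer_lr=1.0` the step reduces to averaging the replicas' local parameters. Flipping the sign of the pseudo-gradient would push the shared parameters away from them, which is the failure a sign-convention test guards against.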
---
## OpenEnv
**Status**: 📋 **Reference pattern (substrate, not a choice).**
**What we have**:
- Research deep-dive: `research/03-monarch-torchforge-openenv.md` § OpenEnv
(the env-format standard, how it interacts with TRL's `environment_factory=`)
- Integration recipe: `docs/INTEGRATION_ARCHITECTURE.md` Recipe D —
"OpenEnv is a substrate, not a choice"
**What we don't**:
- Direct OpenEnv code dependency. The framework's data path is
OpenEnv-compatible by virtue of using TRL's API, which accepts
`environment_factory=` kwargs that OpenEnv environments satisfy.
**Why this shape**: OpenEnv is a *protocol* (how an env exposes itself
to a trainer), not a library you depend on. You either implement an
OpenEnv-compatible environment or you don't. Composer 2.5's "Feature
Deletion" environment is OpenEnv-shaped; if a user provides one, our
TRL trainer accepts it via `environment_factory=`.
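The "protocol, not library" idea can be sketched with a structural type. Everything here is an assumption made for illustration: the `reset`/`step` shape of `EnvLike`, the toy `EchoEnv`, and the `make_env_factory` and `run_episode` helpers are hypothetical and are not taken from OpenEnv or from TRL's `environment_factory=` contract.

```python
from typing import Callable, Protocol, Tuple


class EnvLike(Protocol):
    """Assumed minimal environment surface (hypothetical, for illustration)."""

    def reset(self) -> str: ...

    def step(self, action: str) -> Tuple[str, float, bool]: ...


class EchoEnv:
    """Toy environment: reward 1.0 when the action matches the target string."""

    def __init__(self, target: str) -> None:
        self.target = target

    def reset(self) -> str:
        return f"Reply with: {self.target}"

    def step(self, action: str) -> Tuple[str, float, bool]:
        reward = 1.0 if action.strip() == self.target else 0.0
        return "episode finished", reward, True


def make_env_factory(target: str) -> Callable[[], EnvLike]:
    """Return a zero-argument factory, the shape a trainer would call per rollout."""
    def factory() -> EnvLike:
        return EchoEnv(target)
    return factory


def run_episode(factory: Callable[[], EnvLike], policy: Callable[[str], str]) -> float:
    """Build a fresh environment, take one action, and return its reward."""
    env = factory()
    observation = env.reset()
    _, reward, _ = env.step(policy(observation))
    return reward
```

`EchoEnv` never inherits from `EnvLike`; it satisfies the protocol purely by having the right methods, which mirrors the claim that the framework needs no direct dependency to accept a compatible environment.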
---
## Monarch (Meta)
**Status**: 📋 **Reference pattern (alternative coordination model).**
**What we have**:
- Research deep-dive: `research/03-monarch-torchforge-openenv.md` § Monarch
(actor mesh, hardware abstractions, comparison to Ray)
- Integration recipe: `docs/INTEGRATION_ARCHITECTURE.md` Recipe C —
"TorchForge + Monarch (reference patterns only, not a production target)"
**What we don't**:
- Direct Monarch code dependency. We use DiLoCo's pseudo-gradient sync
as our coordination model; Monarch's actor mesh is an alternative.
**Why this shape**: Monarch is alive (Meta is shipping it) but it's a
*coordination layer*, not an *algorithm*. Our framework integrates with
PyTorch + TRL + torchft directly; Monarch would replace the coordination
layer underneath. Documented as a future option; not a v0.1 dependency.
---
## TorchForge (Meta, paused)
**Status**: 📋 **Reference only (upstream paused).**
**What we have**:
- Research deep-dive: `research/03-monarch-torchforge-openenv.md` § TorchForge
— design lessons captured
**What we don't**:
- Code dependency. TorchForge as a project was paused by Meta.
**Why this shape**: The brief asked us to research TorchForge. We did.
The headline finding is "Meta paused this." That's a real research output
even if it doesn't translate to code.
---
## Summary
| Substrate | Research | Recipe | Code | Tests | v0.1 production? |
|---|---|---|---|---|---|
| TRL | ✅ | ✅ | ✅ | 38 + 9 + 3 = 50 | ✅ |
| VeRL | ✅ | ✅ | 🟡 (skeleton) | — | v0.2 |
| **PRIME-RL** (Wave 13) | ✅ | ✅ | 🟡 (loss adapter + config) | — | v0.2 (cleanest hook) |
| DiLoCo (single-process) | ✅ | ✅ | ✅ | 5 (single-replica) | optional |
| **DiLoCo over serverless** (Wave 13) | ✅ | ✅ ADR-005 | ✅ Local + 🟡 Modal/HFJobs | 9 multi-process | ✅ (local) / future (cloud) |
| OpenEnv | ✅ | ✅ | n/a (protocol) | — | substrate |
| **Monarch** (Wave 13) | ✅ | ✅ (actor layout) | 🟡 (skeleton) | — | v0.2+ |
| TorchForge | ✅ | n/a (paused) | n/a | — | n/a |
**8/8 substrates covered** (was 6/6 pre-Wave-13). New since Wave 13:
PRIME-RL (the cleanest custom-loss hook), Monarch (Meta's actively-shipped
agentic-stack component), and serverless DiLoCo (Modal/HF Jobs adapters
+ object-store rendezvous). The framework can now realize Decoupled
DiLoCo across cloud executors **without any cross-job NCCL** — see
ADR-005 for the design rationale.
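One way to picture an object-store rendezvous with no cross-job NCCL is a shared location where each worker publishes its pseudo-gradient and any worker can reduce once all peers have arrived. The sketch below is an assumption-laden illustration, not ADR-005's design: a local directory stands in for the object store, JSON files stand in for tensor blobs, and `publish` and `try_reduce` are hypothetical names.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Optional


def publish(store: Path, round_id: int, worker_id: int,
            pseudo_grad: List[float]) -> None:
    """Write one worker's pseudo-gradient for a given outer round."""
    path = store / f"round-{round_id}" / f"worker-{worker_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(pseudo_grad))
    tmp.replace(path)  # rename so readers never see a half-written file


def try_reduce(store: Path, round_id: int,
               world_size: int) -> Optional[List[float]]:
    """Return the averaged pseudo-gradient, or None if peers are still missing."""
    round_dir = store / f"round-{round_id}"
    if not round_dir.is_dir():
        return None
    files = sorted(round_dir.glob("worker-*.json"))
    if len(files) < world_size:
        return None
    grads = [json.loads(f.read_text()) for f in files]
    return [sum(col) / world_size for col in zip(*grads)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        store = Path(tmp_dir)
        publish(store, round_id=0, worker_id=0, pseudo_grad=[1.0, 0.0])
        publish(store, round_id=0, worker_id=1, pseudo_grad=[0.0, 2.0])
        print(try_reduce(store, round_id=0, world_size=2))
```

Because workers only read and write files keyed by round and worker id, jobs can start, stop, and poll independently; a real deployment would still need timeouts and handling for stragglers, which this sketch omits.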