| # ADR-006 — RL framework strategy: TRL + VeRL + PRIME-RL | |
| **Status**: Accepted | |
| **Date**: 2026-05-26 | |
| **Wave**: 13 | |
| ## Context | |
| The brief's V3 clause names six substrates: **Monarch, TorchForge, | |
| OpenEnv, VeRL, TRL**, plus DiLoCo. Cross-model review (Wave 11) flagged | |
| that V3 was thin on the RL-framework side: TRL has working code, VeRL has | |
| only a config skeleton, and Monarch/TorchForge/OpenEnv are research-only. | |
| The user's 2026-05-26 expansion of the brief: *"see if there are other frameworks that are | |
| more popular that we could try to use. meta's pytorch agentic stack | |
| components are something that I'd like to explore."* | |
| `docs/research/RL_FRAMEWORKS_LANDSCAPE.md` audited: | |
| - 6 RL frameworks: OpenRLHF, PRIME-RL, NeMo-Aligner, Unsloth, LLaMA-Factory, | |
| DeepSpeed-Chat | |
| - 4 Meta PyTorch stack components: Monarch, TorchTitan, TorchForge, torchchat | |
| ## Options considered | |
| | Framework | License | GRPO/DAPO? | Custom-loss extension | Verdict | | |
| |---|---|---|---|---| | |
| | OpenRLHF | Apache-2 | ✅ DAPO | Fork `openrlhf/models/loss.py` + Trainer subclass (~400-600 LOC) | Strong but heavyweight | | |
| | **PRIME-RL** | **Apache-2** | **✅ GRPO + DAPO** | **First-class `CustomLossConfig` with `LossInputs` struct (~200-300 LOC)** | **Chosen** | | |
| | NeMo-Aligner | Apache-2 | ❌ no GRPO/DAPO | n/a | Reject | | |
| | Unsloth | Apache-2 | TRL patcher | Closed `unsloth_zoo` loss kernels — unhookable | Reject | | |
| | LLaMA-Factory | Apache-2 | ❌ delegates to EasyR1 | n/a | Reject | | |
| | DeepSpeed-Chat | Apache-2 | ❌ PPO+DPO only | feature-stale since 2023 | Reject | | |
| | Meta stack | License | Active? | Role | | |
| |---|---|---|---| | |
| | **Monarch** | **BSD-3** | **✅ v0.4.1 stable, v0.5 dev** | **Actor mesh — coordination layer for any SPMD trainer** | | |
| | TorchTitan | BSD-3 | ✅ active | Distributed-training stack (already a transitive dep of PRIME-RL) | | |
| | TorchForge | BSD-3 | ❌ paused | Patterns only, per repo banner | | |
| | torchchat | BSD-3 | ✅ active | Inference only, out of scope | |
| ## Decision | |
| **Add PRIME-RL as the third RL framework after TRL+VeRL, and Monarch as the | |
| agentic-stack coordination layer.** | |
| ### Why PRIME-RL | |
| PRIME-RL ships a **first-class `CustomLossConfig` with an `import_path`** | |
| that lets us drop in a Python function returning a tensor. The config | |
| exposes a `LossInputs` struct with exactly the tensors we need: | |
| `trainer_logprobs`, `inference_logprobs`, `teacher_logprobs`, | |
| `advantages`, `loss_mask`. Among the audited frameworks, this is **the | |
| cleanest extension point for a 3-channel loss**: no fork, no Trainer | |
| subclass, and no monkey-patching. | |
| It also uses the `verifiers` env protocol (OpenEnv-compatible by design), | |
| so it slots into the framework's existing data path without translation. | |
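As a rough illustration of what a function plugged into such an extension point might compute, the sketch below combines a clipped policy-gradient term with a teacher-distillation term over the five fields listed above. It is written in plain Python over lists so that it runs without dependencies. The `LossInputs` dataclass here is a hypothetical stand-in, not PRIME-RL's actual class, and the clipping range, the `distill_weight` parameter, and the choice of terms are assumptions rather than the framework's 3-channel loss.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class LossInputs:
    """Hypothetical stand-in mirroring the field names listed in this ADR."""
    trainer_logprobs: List[float]    # log-probs under the policy being trained
    inference_logprobs: List[float]  # log-probs recorded when sampling
    teacher_logprobs: List[float]    # log-probs under the teacher model
    advantages: List[float]          # per-token advantages
    loss_mask: List[bool]            # True where a token contributes to the loss


def sketch_loss(inputs: LossInputs, clip: float = 0.2,
                distill_weight: float = 0.1) -> float:
    """Masked mean of a clipped policy-gradient term plus a distillation term."""
    total, count = 0.0, 0
    rows = zip(inputs.trainer_logprobs, inputs.inference_logprobs,
               inputs.teacher_logprobs, inputs.advantages, inputs.loss_mask)
    for lp, ilp, tlp, adv, keep in rows:
        if not keep:
            continue  # masked-out tokens contribute nothing
        ratio = math.exp(lp - ilp)  # importance ratio: trainer vs. sampler
        clipped = min(max(ratio, 1.0 - clip), 1.0 + clip)
        pg_term = -min(ratio * adv, clipped * adv)
        distill_term = lp - tlp  # log-ratio to the teacher (assumed form)
        total += pg_term + distill_weight * distill_term
        count += 1
    return total / max(count, 1)


example = LossInputs(
    trainer_logprobs=[-1.0, -2.0, -3.0],
    inference_logprobs=[-1.0, -2.0, -3.0],
    teacher_logprobs=[-1.0, -2.0, -3.0],
    advantages=[1.0, 2.0, 3.0],
    loss_mask=[True, True, False],
)
print(sketch_loss(example))  # -1.5
```

A real adapter would operate on tensors and return a tensor; the list-based version only shows how the five fields interact.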
| PRIME-RL was used to train INTELLECT-2 (32B, based on QwQ), so it has been | |
| exercised on a real distributed RL run. INTELLECT-1 (10B base, 30 nodes) | |
| was a pretraining run on Prime Intellect's related `prime` framework. | |
| ### Why Monarch (not TorchForge or TorchTitan as a top-level dependency) | |
| - **Monarch is what's actually shipping** from Meta's agentic stack: v0.4.1 | |
| is the stable release, v0.5 is under daily development, and the license is BSD-3. | |
| - **TorchForge is paused** per its own repo banner. We document it | |
| (research/03) but don't depend on it. | |
| - **TorchTitan is a transitive dep** of PRIME-RL already, so we get its | |
| benefits without needing to build a direct integration. If we wanted a | |
| TorchTitan-only path, it would be redundant with PRIME-RL. | |
| - **torchchat is inference-only** and doesn't fit the training-framework | |
| conversation. | |
| Monarch's role in our stack: **the actor mesh that hosts trainer/generator/ | |
| rewarder/judge actors**. PRIME-RL's three-actor split (trainer, generator, | |
| rewarder) maps naturally onto Monarch primitives. | |
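The three-role split can be pictured with a dependency-free stand-in. The sketch below uses `asyncio` queues rather than Monarch's API, which this ADR does not show. The role functions, the toy reward rule, and the queue wiring are all illustrative assumptions.

```python
import asyncio


async def generator(prompts, rollouts: asyncio.Queue) -> None:
    """Produce one toy completion per prompt, then signal completion."""
    for prompt in prompts:
        await rollouts.put((prompt, prompt.upper()))  # placeholder completion
    await rollouts.put(None)  # sentinel: no more rollouts


async def rewarder(rollouts: asyncio.Queue, scored: asyncio.Queue) -> None:
    """Attach a toy reward (completion length) to each rollout."""
    while (item := await rollouts.get()) is not None:
        prompt, completion = item
        await scored.put((prompt, completion, float(len(completion))))
    await scored.put(None)  # forward the sentinel downstream


async def trainer(scored: asyncio.Queue) -> list:
    """Collect scored rollouts; a real trainer would take gradient steps here."""
    batch = []
    while (item := await scored.get()) is not None:
        batch.append(item)
    return batch


async def run_pipeline(prompts) -> list:
    """Run the three roles concurrently and return what the trainer received."""
    rollouts, scored = asyncio.Queue(), asyncio.Queue()
    _, _, batch = await asyncio.gather(
        generator(prompts, rollouts),
        rewarder(rollouts, scored),
        trainer(scored),
    )
    return batch


if __name__ == "__main__":
    print(asyncio.run(run_pipeline(["ab", "cde"])))
```

Each role only touches its own queue, which is the property that makes the split easy to place on separate actors.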
| ## Consequences | |
| ### Accepted | |
| - `composer_replication/recipes/prime_rl/` directory: | |
| - `prime_rl_recipe.md` — integration recipe (parallel to TRL Recipe A, | |
| VeRL Recipe B) | |
| - `composer_loss.py` — the 3-channel loss adapted to PRIME-RL's | |
| `LossInputs` struct (~200-300 LOC) | |
| - `prime_rl_config.yaml` — example PRIME-RL config wiring our loss in | |
| - `composer_replication/recipes/monarch/` directory: | |
| - `monarch_actor_layout.md` — design doc for the actor mesh | |
| - `actors.py` — placeholder Monarch actor definitions (skeleton only; | |
| full integration is post-replication) | |
| - New optional dependencies in `pyproject.toml`: | |
| - `[prime-rl]` extra: `prime-rl>=0.5,<0.6` (pinned to the 0.5.x series) | |
| - `[monarch]` extra: `monarch>=0.4.1` | |
| - `docs/V3_SUBSTRATE_COVERAGE.md` updated to reflect the new additions. | |
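Because both integrations are optional extras, a recipe may want to fail early with an actionable message when its backend is absent. The helper below is a minimal sketch of that check. The importable module names (`prime_rl`, `monarch`) and the distribution name in the error message are assumptions and may differ from the real packages.

```python
import importlib.util

# Assumed mapping from extra name to importable top-level module name.
OPTIONAL_BACKENDS = {"prime-rl": "prime_rl", "monarch": "monarch"}


def available_backends(candidates: dict = OPTIONAL_BACKENDS) -> dict:
    """Report which optional backends are importable, without importing them."""
    return {
        extra: importlib.util.find_spec(module) is not None
        for extra, module in candidates.items()
    }


def require_backend(extra: str, candidates: dict = OPTIONAL_BACKENDS) -> None:
    """Raise an actionable error when a recipe's optional extra is missing."""
    if not available_backends(candidates).get(extra, False):
        raise ImportError(
            f"This recipe needs the optional '{extra}' extra; install it with: "
            f"pip install 'composer-replication[{extra}]'"  # hypothetical name
        )


if __name__ == "__main__":
    print(available_backends())
```

Calling `require_backend("prime-rl")` at the top of a recipe entry point keeps the core package importable when the extra is not installed.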
| ### Three-recipe production matrix | |
| | User scenario | Recommended recipe | | |
| |---|---| | |
| | Quick start, single-cluster, ≤7B | TRL Recipe A | | |
| | Production multi-node, ≤32B | VeRL Recipe B | | |
| | Decentralized / DiLoCo-shape, any size | PRIME-RL recipe (NEW) | | |
| | Coordination-heavy multi-actor RL | Monarch + any of the above | | |
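The matrix can also be read as a small decision function. The sketch below is one possible encoding. It assumes sizes are in billions of parameters, treats "single-cluster" as `multi_node=False`, and returns `None` for scenarios the matrix does not cover. The function and parameter names are hypothetical.

```python
from typing import Optional


def recommend_recipe(
    params_b: float,
    multi_node: bool = False,
    decentralized: bool = False,
    coordination_heavy: bool = False,
) -> Optional[str]:
    """Map a scenario onto a row of the matrix; None if no row applies."""
    if decentralized:
        base = "PRIME-RL recipe"  # any model size
    elif multi_node and params_b <= 32:
        base = "VeRL Recipe B"
    elif not multi_node and params_b <= 7:
        base = "TRL Recipe A"
    else:
        return None  # scenario not covered by the matrix
    # Monarch is layered on top of a recipe rather than replacing it.
    return f"Monarch + {base}" if coordination_heavy else base


if __name__ == "__main__":
    print(recommend_recipe(7))
    print(recommend_recipe(32, multi_node=True, coordination_heavy=True))
```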
| ### Trade-offs explicitly accepted | |
| - **Three RL frameworks is a maintenance burden.** We accept this because | |
| no single one covers all the user scenarios above. The framework's | |
| contribution is the 3-channel loss + the trace-replay channel, expressed | |
| in three different framework idioms. Each recipe is ~200-300 LOC, so the | |
| three total ~700 LOC, versus ~200-300 LOC had we picked one framework. | |
| - **Monarch is BSD-3, not MIT.** The framework is MIT; users opting in to | |
| Monarch take on its license. This is documented alongside the optional | |
| extras in `pyproject.toml`. | |
| - **PRIME-RL's API may evolve.** The `LossInputs` struct is currently the | |
| contract; if PRIME-RL stabilizes on a different shape, `composer_loss.py` | |
| must be updated and the pin bumped. Until then, the optional extra pins PRIME-RL to v0.5.x. | |
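One way to notice a change in the `LossInputs` contract early is a field check that runs when the adapter is loaded. The sketch below is a hedged example, not part of PRIME-RL. It assumes the struct exposes its fields either as a dataclass or through class annotations, which this ADR does not confirm.

```python
from dataclasses import dataclass, fields, is_dataclass

# Field names this ADR says the loss relies on.
EXPECTED_FIELDS = frozenset({
    "trainer_logprobs",
    "inference_logprobs",
    "teacher_logprobs",
    "advantages",
    "loss_mask",
})


def missing_loss_fields(struct_type) -> set:
    """Return the expected field names that `struct_type` does not declare."""
    if is_dataclass(struct_type):
        declared = {f.name for f in fields(struct_type)}
    else:
        # Fall back to class annotations for non-dataclass structs.
        declared = set(getattr(struct_type, "__annotations__", {}))
    return set(EXPECTED_FIELDS) - declared


@dataclass
class DriftedInputs:
    """Hypothetical struct that has dropped `teacher_logprobs`."""
    trainer_logprobs: list
    inference_logprobs: list
    advantages: list
    loss_mask: list


if __name__ == "__main__":
    print(missing_loss_fields(DriftedInputs))
```

Raising when the returned set is non-empty would turn a silent shape change into an explicit failure after a version bump.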
| ## Source | |
| `docs/research/RL_FRAMEWORKS_LANDSCAPE.md` (2026-05-26 subagent recon, | |
| primary-sourced from DeepWiki audits + GitHub repo READMEs + PyPI release | |
| metadata). | |