# ADR-006 — RL framework strategy: TRL + VeRL + PRIME-RL

**Status**: Accepted
**Date**: 2026-05-26
**Wave**: 13

## Context

The brief's V3 clause names six substrates: **monarch, torchforge, openenv, VeRL, TRL** (plus DiLoCo). Cross-model review (Wave 11) flagged that V3 was thin on the RL-framework side: TRL has working code, VeRL has a config skeleton, and Monarch/TorchForge/OpenEnv are research-only.

User's 2026-05-26 expansion: *"see if there are other frameworks that are more popular that we could try to use. meta's pytorch agentic stack components are something that I'd like to explore."*

`docs/research/RL_FRAMEWORKS_LANDSCAPE.md` audited:

- 6 RL frameworks: OpenRLHF, PRIME-RL, NeMo-Aligner, Unsloth, LLaMA-Factory, DeepSpeed-Chat
- 4 Meta PyTorch stack components: Monarch, TorchTitan, TorchForge, torchchat

## Options considered

| Framework | License | GRPO/DAPO? | Custom-loss extension | Verdict |
|---|---|---|---|---|
| OpenRLHF | Apache-2 | ✅ DAPO | Fork `openrlhf/models/loss.py` + Trainer subclass (~400-600 LOC) | Strong but heavyweight |
| **PRIME-RL** | **Apache-2** | **✅ GRPO + DAPO** | **First-class `CustomLossConfig` with `LossInputs` struct (~200-300 LOC)** | **Chosen** |
| NeMo-Aligner | Apache-2 | ❌ no GRPO/DAPO | n/a | Reject |
| Unsloth | Apache-2 | TRL patcher | Closed `unsloth_zoo` loss kernels — unhookable | Reject |
| LLaMA-Factory | Apache-2 | ❌ delegates to EasyR1 | n/a | Reject |
| DeepSpeed-Chat | Apache-2 | ❌ PPO+DPO only | feature-stale since 2023 | Reject |

| Meta stack | License | Active? | Role |
|---|---|---|---|
| **Monarch** | **BSD-3** | **✅ v0.4.1 stable, v0.5 dev** | **Actor mesh — coordination layer for any SPMD trainer** |
| TorchTitan | BSD-3 | ✅ active | Distributed-training stack (already a transitive dep of PRIME-RL) |
| TorchForge | BSD-3 | ❌ paused | Patterns only, per repo banner |
| torchchat | BSD-3 | active | Inference only — out of scope |

## Decision

**Add PRIME-RL as the third RL framework after TRL+VeRL, and Monarch as the agentic-stack coordination layer.**

### Why PRIME-RL

PRIME-RL ships a **first-class `CustomLossConfig` with an `import_path`** that lets us drop in a Python function returning a tensor. The config exposes a `LossInputs` struct with exactly the tensors we need: `trainer_logprobs`, `inference_logprobs`, `teacher_logprobs`, `advantages`, `loss_mask`.

This is **the cleanest possible extension point for a 3-channel loss** — no fork, no Trainer subclass, no monkey-patching. It also uses the `verifiers` env protocol (OpenEnv-compatible by design), so it slots into the framework's existing data path without translation.

PRIME-RL was used to train INTELLECT-1 (10B base, 30 nodes) and INTELLECT-2 (32B QwQ); production-tested on real distributed runs.

### Why Monarch (not TorchForge or TorchTitan as a top-level)

- **Monarch is what's actually shipping** from Meta's agentic stack. v0.4.1 is stable, v0.5 dev daily. BSD-3.
- **TorchForge is paused** per its own repo banner. We document it (research/03) but don't depend on it.
- **TorchTitan is a transitive dep** of PRIME-RL already, so we get its benefits without needing to build a direct integration. If we wanted a TorchTitan-only path, it would be redundant with PRIME-RL.
- **torchchat is inference-only** and doesn't fit the training-framework conversation.

Monarch's role in our stack: **the actor mesh that hosts trainer/generator/rewarder/judge actors**. PRIME-RL's three-actor split (trainer, generator, rewarder) maps naturally onto Monarch primitives.
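
To make the `CustomLossConfig` extension point from "Why PRIME-RL" more concrete, here is a minimal sketch of a loss function that consumes the five fields this ADR lists. Everything in it is an assumption made for illustration: the `LossInputs` dataclass is a hypothetical stand-in that only mirrors the field names above (it is not PRIME-RL's actual class), plain Python floats stand in for tensors so the sketch runs without dependencies, and the two terms (an importance-weighted policy-gradient term and a teacher term) are placeholders, since this ADR does not define the three channels. The `teacher_weight` parameter is likewise hypothetical.

```python
from dataclasses import dataclass
from math import exp
from typing import Sequence


@dataclass
class LossInputs:
    """Hypothetical stand-in: field names come from this ADR, not from PRIME-RL's source."""
    trainer_logprobs: Sequence[float]    # log-probs under the policy being trained
    inference_logprobs: Sequence[float]  # log-probs recorded when the sample was generated
    teacher_logprobs: Sequence[float]    # log-probs under a reference/teacher model
    advantages: Sequence[float]          # per-token advantage estimates
    loss_mask: Sequence[bool]            # True for tokens that contribute to the loss


def composer_loss(inputs: LossInputs, teacher_weight: float = 0.1) -> float:
    """Masked mean of two illustrative per-token terms (placeholders, not the real channels)."""
    total = 0.0
    kept = 0
    rows = zip(
        inputs.trainer_logprobs,
        inputs.inference_logprobs,
        inputs.teacher_logprobs,
        inputs.advantages,
        inputs.loss_mask,
    )
    for trainer_lp, inference_lp, teacher_lp, advantage, keep in rows:
        if not keep:
            continue
        # Importance ratio corrects for the gap between trainer and inference policies.
        ratio = exp(trainer_lp - inference_lp)
        policy_term = -ratio * advantage
        # Single-sample estimate of how far the trainer has drifted from the teacher.
        teacher_term = trainer_lp - teacher_lp
        total += policy_term + teacher_weight * teacher_term
        kept += 1
    # Return 0.0 rather than dividing by zero when every token is masked out.
    return total / kept if kept else 0.0


if __name__ == "__main__":
    batch = LossInputs(
        trainer_logprobs=[-1.0, -2.0, -3.0],
        inference_logprobs=[-1.0, -2.0, -3.0],
        teacher_logprobs=[-1.0, -2.0, -3.0],
        advantages=[1.0, 2.0, 3.0],
        loss_mask=[True, True, False],
    )
    print(composer_loss(batch))
```

In a real integration the function would operate on tensors and be referenced from the PRIME-RL config through `import_path`. The exact config keys and function signature are not shown here because this ADR does not specify them.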
## Consequences

### Accepted

- `composer_replication/recipes/prime_rl/` directory:
  - `prime_rl_recipe.md` — integration recipe (parallel to TRL Recipe A, VeRL Recipe B)
  - `composer_loss.py` — the 3-channel loss adapted to PRIME-RL's `LossInputs` struct (~200-300 LOC)
  - `prime_rl_config.yaml` — example PRIME-RL config wiring our loss in
- `composer_replication/recipes/monarch/` directory:
  - `monarch_actor_layout.md` — design doc for the actor mesh
  - `actors.py` — placeholder Monarch actor definitions (skeleton only; full integration is post-replication)
- New optional dependencies in `pyproject.toml`:
  - `[prime-rl]` extra: `prime-rl>=0.5`
  - `[monarch]` extra: `monarch>=0.4.1`
- `docs/V3_SUBSTRATE_COVERAGE.md` updated to reflect the new additions.

### Three-recipe production matrix

| User scenario | Recommended recipe |
|---|---|
| Quick start, single-cluster, ≤7B | TRL Recipe A |
| Production multi-node, ≤32B | VeRL Recipe B |
| Decentralized / DiLoCo-shape, any size | PRIME-RL recipe (NEW) |
| Coordination-heavy multi-actor RL | Monarch + any of the above |

### Trade-offs explicitly accepted

- **Three RL frameworks is a maintenance burden.** We accept this because no single one covers all the user scenarios above. The framework's contribution is the 3-channel loss + the trace-replay channel, expressed in three different framework idioms. Each recipe is ~200-300 LOC; total triplication tax ~700 LOC vs. picking one framework.
- **Monarch is BSD-3 not MIT.** The framework is MIT; users opting in to Monarch take on its license. Documented in pyproject.toml's optional extras.
- **PRIME-RL's API may evolve.** The `LossInputs` struct is currently the contract; if PRIME-RL stabilizes a different shape we'd need to bump. Pin to v0.5.x in our optional extras.

## Source

`docs/research/RL_FRAMEWORKS_LANDSCAPE.md` (2026-05-26 subagent recon, primary-sourced from DeepWiki audits + GitHub repo READMEs + PyPI release metadata).