# ADR-006 — RL framework strategy: TRL + VeRL + PRIME-RL
**Status**: Accepted
**Date**: 2026-05-26
**Wave**: 13
## Context
The brief's V3 clause names six substrates: **monarch, torchforge,
openenv, VeRL, TRL** (plus DiLoCo). Cross-model review (Wave 11) flagged
that V3 was thin on the RL-framework side: TRL has working code, VeRL has
a config skeleton, and Monarch/TorchForge/OpenEnv are research-only.
The user's 2026-05-26 expansion: *"see if there are other frameworks that are
more popular that we could try to use. meta's pytorch agentic stack
components are something that I'd like to explore."*
`docs/research/RL_FRAMEWORKS_LANDSCAPE.md` audited:
- 6 RL frameworks: OpenRLHF, PRIME-RL, NeMo-Aligner, Unsloth, LLaMA-Factory,
DeepSpeed-Chat
- 4 Meta PyTorch stack components: Monarch, TorchTitan, TorchForge, torchchat
## Options considered
| Framework | License | GRPO/DAPO? | Custom-loss extension | Verdict |
|---|---|---|---|---|
| OpenRLHF | Apache-2 | ✅ DAPO | Fork `openrlhf/models/loss.py` + Trainer subclass (~400-600 LOC) | Strong but heavyweight |
| **PRIME-RL** | **Apache-2** | **✅ GRPO + DAPO** | **First-class `CustomLossConfig` with `LossInputs` struct (~200-300 LOC)** | **Chosen** |
| NeMo-Aligner | Apache-2 | ❌ no GRPO/DAPO | n/a | Reject |
| Unsloth | Apache-2 | TRL patcher | Closed `unsloth_zoo` loss kernels — unhookable | Reject |
| LLaMA-Factory | Apache-2 | ❌ delegates to EasyR1 | n/a | Reject |
| DeepSpeed-Chat | Apache-2 | ❌ PPO+DPO only | feature-stale since 2023 | Reject |

| Meta stack | License | Active? | Role |
|---|---|---|---|
| **Monarch** | **BSD-3** | **✅ v0.4.1 stable, v0.5 dev** | **Actor mesh — coordination layer for any SPMD trainer** |
| TorchTitan | BSD-3 | ✅ active | Distributed-training stack (already a transitive dep of PRIME-RL) |
| TorchForge | BSD-3 | ❌ paused | Patterns only, per repo banner |
| torchchat | BSD-3 | ✅ active | Inference only, out of scope |
## Decision
**Add PRIME-RL as the third RL framework after TRL+VeRL, and Monarch as the
agentic-stack coordination layer.**
### Why PRIME-RL
PRIME-RL ships a **first-class `CustomLossConfig` with an `import_path`**
that lets us drop in a Python function returning a tensor. The config
exposes a `LossInputs` struct with exactly the tensors we need:
`trainer_logprobs`, `inference_logprobs`, `teacher_logprobs`,
`advantages`, `loss_mask`. This is **the cleanest extension point for a
3-channel loss** among the audited frameworks: no fork, no Trainer
subclass, no monkey-patching.
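A minimal sketch of what such a drop-in loss function could look like. The `LossInputs` dataclass below is a hypothetical stand-in that only mirrors the field names listed above; it is not PRIME-RL's actual class, and plain Python lists replace tensors so the sketch runs without PyTorch. The two terms combined here (an importance-weighted policy-gradient term and a teacher-distillation term) and the `distill_weight` parameter are illustrative assumptions, not the framework's definition of the 3-channel loss.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class LossInputs:
    """Hypothetical stand-in: field names follow the ADR, types are simplified."""
    trainer_logprobs: List[float]    # log-probs under the policy being trained
    inference_logprobs: List[float]  # log-probs recorded by the rollout engine
    teacher_logprobs: List[float]    # log-probs under the teacher model
    advantages: List[float]          # per-token advantages
    loss_mask: List[bool]            # True for tokens that contribute to the loss


def composer_loss(inputs: LossInputs, distill_weight: float = 0.1) -> float:
    """Masked mean of a policy-gradient term plus a weighted distillation term."""
    total = 0.0
    count = 0
    for lp, rollout_lp, teacher_lp, adv, keep in zip(
        inputs.trainer_logprobs,
        inputs.inference_logprobs,
        inputs.teacher_logprobs,
        inputs.advantages,
        inputs.loss_mask,
    ):
        if not keep:
            continue
        # Importance ratio corrects for trainer/inference policy mismatch.
        ratio = math.exp(lp - rollout_lp)
        pg_term = -ratio * adv
        # Single-sample estimate of the reverse KL towards the teacher.
        distill_term = lp - teacher_lp
        total += pg_term + distill_weight * distill_term
        count += 1
    # Guard against an all-masked batch.
    return total / max(count, 1)
```

In a real recipe the same logic would operate on tensors and be referenced from the config through `import_path`; the exact path format and function signature PRIME-RL expects should be taken from its documentation rather than from this sketch.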
It also uses the `verifiers` env protocol (OpenEnv-compatible by design),
so it slots into the framework's existing data path without translation.
PRIME-RL was used to train INTELLECT-1 (10B base, 30 nodes) and INTELLECT-2
(32B QwQ), so it has been production-tested on real distributed runs.
### Why Monarch (not TorchForge or TorchTitan as a top-level dependency)
- **Monarch is what's actually shipping** from Meta's agentic stack. v0.4.1
is stable, v0.5 is under daily development, and the license is BSD-3.
- **TorchForge is paused** per its own repo banner. We document it
(research/03) but don't depend on it.
- **TorchTitan is a transitive dep** of PRIME-RL already, so we get its
benefits without needing to build a direct integration. If we wanted a
TorchTitan-only path, it would be redundant with PRIME-RL.
- **torchchat is inference-only** and doesn't fit the training-framework
conversation.
Monarch's role in our stack: **the actor mesh that hosts trainer, generator,
rewarder, and judge actors**. PRIME-RL's three-actor split (trainer,
generator, rewarder) maps naturally onto Monarch primitives.
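The trainer/generator/rewarder message flow can be illustrated with a toy pipeline. This sketch deliberately uses plain `asyncio` coroutines and queues as stand-ins for actors; it does not use Monarch's API, and the function names and the length-based toy reward are hypothetical. It only shows the shape of the hand-offs an actor mesh would host.

```python
import asyncio
from typing import List, Optional, Tuple

Rollout = Tuple[str, str]
Scored = Tuple[str, str, float]


async def generator(prompts: List[str], out: asyncio.Queue) -> None:
    """Stand-in generator actor: emits one (prompt, completion) pair per prompt."""
    for prompt in prompts:
        await out.put((prompt, prompt.upper()))  # toy "rollout"
    await out.put(None)  # sentinel: no more rollouts


async def rewarder(inp: asyncio.Queue, out: asyncio.Queue) -> None:
    """Stand-in rewarder actor: attaches a toy reward to each rollout."""
    while True:
        item: Optional[Rollout] = await inp.get()
        if item is None:
            await out.put(None)  # forward the sentinel downstream
            return
        prompt, completion = item
        await out.put((prompt, completion, float(len(completion))))


async def trainer(inp: asyncio.Queue) -> List[Scored]:
    """Stand-in trainer actor: collects scored rollouts into one batch."""
    batch: List[Scored] = []
    while True:
        item: Optional[Scored] = await inp.get()
        if item is None:
            return batch
        batch.append(item)


async def run_pipeline(prompts: List[str]) -> List[Scored]:
    """Wire the three stand-in actors together and return the trainer's batch."""
    rollouts: asyncio.Queue = asyncio.Queue()
    scored: asyncio.Queue = asyncio.Queue()
    _, _, batch = await asyncio.gather(
        generator(prompts, rollouts),
        rewarder(rollouts, scored),
        trainer(scored),
    )
    return batch
```

Calling `asyncio.run(run_pipeline(["ab", "cde"]))` drives all three stages concurrently in one process. In the intended design each stage would instead be an actor on the mesh, possibly on a different host.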
## Consequences
### Accepted
- `composer_replication/recipes/prime_rl/` directory:
  - `prime_rl_recipe.md`: integration recipe (parallel to TRL Recipe A,
    VeRL Recipe B)
  - `composer_loss.py`: the 3-channel loss adapted to PRIME-RL's
    `LossInputs` struct (~200-300 LOC)
  - `prime_rl_config.yaml`: example PRIME-RL config wiring our loss in
- `composer_replication/recipes/monarch/` directory:
  - `monarch_actor_layout.md`: design doc for the actor mesh
  - `actors.py`: placeholder Monarch actor definitions (skeleton only;
    full integration is post-replication)
- New optional dependencies in `pyproject.toml`:
  - `[prime-rl]` extra: `prime-rl>=0.5`
  - `[monarch]` extra: `monarch>=0.4.1`
- `docs/V3_SUBSTRATE_COVERAGE.md` updated to reflect the new additions.
### Three-recipe production matrix
| User scenario | Recommended recipe |
|---|---|
| Quick start, single-cluster, ≤7B | TRL Recipe A |
| Production multi-node, ≤32B | VeRL Recipe B |
| Decentralized / DiLoCo-shape, any size | PRIME-RL recipe (NEW) |
| Coordination-heavy multi-actor RL | Monarch + any of the above |
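The matrix can be read as a small decision function. The sketch below is one possible encoding; the function name, parameter names, and the choice to return `None` for combinations the matrix does not cover (for example, a non-decentralized multi-node run above 32B) are assumptions made for illustration.

```python
from typing import Optional


def recommend_recipe(
    params_b: float,
    multi_node: bool = False,
    decentralized: bool = False,
    coordination_heavy: bool = False,
) -> Optional[str]:
    """Return the recipe the matrix suggests, or None where the matrix is silent.

    params_b is the model size in billions of parameters.
    """
    if decentralized:
        recipe = "PRIME-RL recipe"  # any size
    elif multi_node and params_b <= 32:
        recipe = "VeRL Recipe B"
    elif not multi_node and params_b <= 7:
        recipe = "TRL Recipe A"
    else:
        return None  # not covered by the matrix
    # Monarch layers on top of whichever trainer recipe was chosen.
    if coordination_heavy:
        return f"Monarch + {recipe}"
    return recipe
```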
### Trade-offs explicitly accepted
- **Three RL frameworks is a maintenance burden.** We accept this because
no single one covers all the user scenarios above. The framework's
contribution is the 3-channel loss + the trace-replay channel, expressed
in three different framework idioms. Each recipe is ~200-300 LOC, so the
three together total ~700 LOC, versus ~200-300 LOC for a single framework.
- **Monarch is BSD-3, not MIT.** The framework is MIT; users opting in to
Monarch take on its license. This is documented alongside the optional
extras in `pyproject.toml`.
- **PRIME-RL's API may evolve.** The `LossInputs` struct is currently the
contract; if PRIME-RL stabilizes on a different shape, we would need to
update `composer_loss.py` and bump the pinned version. We pin to v0.5.x in
our optional extras.
## Source
`docs/research/RL_FRAMEWORKS_LANDSCAPE.md` (2026-05-26 subagent recon,
primary-sourced from DeepWiki audits + GitHub repo READMEs + PyPI release
metadata).