
ADR-006 — RL framework strategy: TRL + VeRL + PRIME-RL

Status: Accepted | Date: 2026-05-26 | Wave: 13

Context

The brief's V3 clause names six substrates: Monarch, TorchForge, OpenEnv, VeRL, TRL, and DiLoCo. Cross-model review (Wave 11) flagged that V3 was thin on the RL-framework side: TRL has working code, VeRL has only a config skeleton, and Monarch, TorchForge, and OpenEnv are covered by research notes only.

User's 2026-05-26 expansion: "see if there are other frameworks that are more popular that we could try to use. meta's pytorch agentic stack components are something that I'd like to explore."

docs/research/RL_FRAMEWORKS_LANDSCAPE.md audited:

  • 6 RL frameworks: OpenRLHF, PRIME-RL, NeMo-Aligner, Unsloth, LLaMA-Factory, DeepSpeed-Chat
  • 4 Meta PyTorch stack components: Monarch, TorchTitan, TorchForge, torchchat

Options considered

| Framework | License | GRPO/DAPO? | Custom-loss extension | Verdict |
|---|---|---|---|---|
| OpenRLHF | Apache-2 | ✅ DAPO | Fork openrlhf/models/loss.py + Trainer subclass (~400-600 LOC) | Strong but heavyweight |
| PRIME-RL | Apache-2 | ✅ GRPO + DAPO | First-class CustomLossConfig with LossInputs struct (~200-300 LOC) | Chosen |
| NeMo-Aligner | Apache-2 | ❌ no GRPO/DAPO | n/a | Reject |
| Unsloth | Apache-2 | TRL patcher | Closed unsloth_zoo loss kernels; unhookable | Reject |
| LLaMA-Factory | Apache-2 | ❌ delegates to EasyR1 | n/a | Reject |
| DeepSpeed-Chat | Apache-2 | ❌ PPO+DPO only | Feature-stale since 2023 | Reject |

| Meta stack | License | Active? | Role |
|---|---|---|---|
| Monarch | BSD-3 | ✅ v0.4.1 stable, v0.5 dev | Actor mesh: coordination layer for any SPMD trainer |
| TorchTitan | BSD-3 | ✅ active | Distributed-training stack (already a transitive dep of PRIME-RL) |
| TorchForge | BSD-3 | ❌ paused | Patterns only, per repo banner |
| torchchat | BSD-3 | active | Inference only; out of scope |

Decision

Add PRIME-RL as the third RL framework after TRL+VeRL, and Monarch as the agentic-stack coordination layer.

Why PRIME-RL

PRIME-RL ships a first-class CustomLossConfig with an import_path that lets us drop in a Python function returning a tensor. The config exposes a LossInputs struct with exactly the tensors we need: trainer_logprobs, inference_logprobs, teacher_logprobs, advantages, and loss_mask. This is the cleanest extension point we found for a 3-channel loss: no fork, no Trainer subclass, no monkey-patching.

It also uses the verifiers env protocol (OpenEnv-compatible by design), so it slots into the framework's existing data path without translation.

PRIME-RL was used to train INTELLECT-1 (10B base, 30 nodes) and INTELLECT-2 (32B QwQ), so it has been production-tested on real distributed runs.
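The custom-loss extension point described above can be sketched as follows. This is a minimal illustration, not PRIME-RL's API: the `LossInputs` class here is a local stand-in that borrows only the five field names listed in this ADR, plain Python floats stand in for tensors so the sketch runs without PRIME-RL or PyTorch, and the two terms (a capped importance-weighted policy term and a teacher-distillation term) as well as the `distill_weight` and `ratio_cap` parameters are assumptions chosen for illustration. The project's actual 3-channel loss is not defined in this document.

```python
from dataclasses import dataclass
from math import exp
from typing import Sequence


@dataclass
class LossInputs:
    """Local stand-in: only the five field names are taken from this ADR."""
    trainer_logprobs: Sequence[float]
    inference_logprobs: Sequence[float]
    teacher_logprobs: Sequence[float]
    advantages: Sequence[float]
    loss_mask: Sequence[bool]


def composer_loss(inputs: LossInputs, distill_weight: float = 0.1,
                  ratio_cap: float = 5.0) -> float:
    """Masked mean of a policy term plus a teacher-distillation term (illustrative)."""
    total, kept = 0.0, 0
    rows = zip(inputs.trainer_logprobs, inputs.inference_logprobs,
               inputs.teacher_logprobs, inputs.advantages, inputs.loss_mask)
    for lp, inf_lp, teacher_lp, adv, keep in rows:
        if not keep:
            continue  # masked tokens (e.g. prompt tokens) contribute nothing
        # Capped importance ratio between the trainer and inference policies.
        ratio = min(exp(lp - inf_lp), ratio_cap)
        policy_term = -ratio * adv
        # Per-token gap between trainer and teacher log-probabilities.
        distill_term = lp - teacher_lp
        total += policy_term + distill_weight * distill_term
        kept += 1
    return total / kept if kept else 0.0
```

In a real integration the function would operate on tensors and return a differentiable scalar, and the config's import_path would point at it; the exact path format and function signature should be taken from the PRIME-RL documentation rather than from this sketch.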

Why Monarch (not TorchForge or TorchTitan as a top-level dependency)

  • Monarch is what's actually shipping from Meta's agentic stack. v0.4.1 is stable, and v0.5 is under daily development. BSD-3.
  • TorchForge is paused per its own repo banner. We document it (research/03) but don't depend on it.
  • TorchTitan is a transitive dep of PRIME-RL already, so we get its benefits without needing to build a direct integration. If we wanted a TorchTitan-only path, it would be redundant with PRIME-RL.
  • torchchat is inference-only and doesn't fit the training-framework conversation.

Monarch's role in our stack: the actor mesh that hosts trainer/generator/rewarder/judge actors. PRIME-RL's three-actor split (trainer, generator, rewarder) maps naturally onto Monarch primitives.
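The three-actor split can be illustrated with a small role sketch. It deliberately uses plain `asyncio` classes rather than Monarch's API, so nothing here should be read as a Monarch call; the class names, the `Rollout` type, and the toy generation and scoring rules are all hypothetical and exist only to show how work flows from generator to rewarder to trainer.

```python
import asyncio
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    prompt: str
    completion: str
    reward: float = 0.0


class Generator:
    async def generate(self, prompt: str) -> Rollout:
        # Toy "policy": upper-case the prompt instead of sampling from a model.
        return Rollout(prompt=prompt, completion=prompt.upper())


class Rewarder:
    async def score(self, rollout: Rollout) -> Rollout:
        # Toy reward: completion length stands in for a verifier score.
        rollout.reward = float(len(rollout.completion))
        return rollout


class Trainer:
    async def step(self, batch: List[Rollout]) -> float:
        # Toy "update": report the mean reward of the batch.
        return sum(r.reward for r in batch) / len(batch) if batch else 0.0


async def run_round(prompts: List[str]) -> float:
    generator, rewarder, trainer = Generator(), Rewarder(), Trainer()
    # Generation and scoring fan out concurrently; training consumes the batch.
    rollouts = await asyncio.gather(*(generator.generate(p) for p in prompts))
    scored = await asyncio.gather(*(rewarder.score(r) for r in rollouts))
    return await trainer.step(list(scored))
```

Calling `asyncio.run(run_round(["ab", "cdef"]))` runs one round locally. In the intended design each role would instead be hosted as an actor on the Monarch mesh, with a judge actor added alongside the rewarder.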

Consequences

Accepted

  • composer_replication/recipes/prime_rl/ directory:
    • prime_rl_recipe.md — integration recipe (parallel to TRL Recipe A, VeRL Recipe B)
    • composer_loss.py — the 3-channel loss adapted to PRIME-RL's LossInputs struct (~200-300 LOC)
    • prime_rl_config.yaml — example PRIME-RL config wiring our loss in
  • composer_replication/recipes/monarch/ directory:
    • monarch_actor_layout.md — design doc for the actor mesh
    • actors.py — placeholder Monarch actor definitions (skeleton only; full integration is post-replication)
  • New optional dependencies in pyproject.toml:
    • [prime-rl] extra: prime-rl>=0.5
    • [monarch] extra: monarch>=0.4.1
  • docs/V3_SUBSTRATE_COVERAGE.md updated to reflect the new additions.

Three-recipe production matrix

| User scenario | Recommended recipe |
|---|---|
| Quick start, single-cluster, ≤7B | TRL Recipe A |
| Production multi-node, ≤32B | VeRL Recipe B |
| Decentralized / DiLoCo-shape, any size | PRIME-RL recipe (NEW) |
| Coordination-heavy multi-actor RL | Monarch + any of the above |

Trade-offs explicitly accepted

  • Three RL frameworks is a maintenance burden. We accept this because no single one covers all the user scenarios above. The framework's contribution is the 3-channel loss plus the trace-replay channel, expressed in three different framework idioms. Each recipe is ~200-300 LOC, so the three recipes total ~700 LOC, versus ~200-300 LOC had we picked a single framework.
  • Monarch is BSD-3, not MIT. The framework is MIT; users opting in to Monarch take on its license. Documented in pyproject.toml's optional extras.
  • PRIME-RL's API may evolve. The LossInputs struct is currently the contract; if PRIME-RL stabilizes a different shape, we would need to update composer_loss.py and bump the pinned version. Pin to v0.5.x in our optional extras.
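Because the LossInputs contract is tied to the 0.5.x series, a recipe could check the installed version before wiring in the loss. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the distribution name `prime-rl` is taken from the extras listed above rather than verified against PyPI.

```python
from importlib import metadata


def in_series(version: str, series: str = "0.5") -> bool:
    """True when `version` is `series` itself or a release within it (e.g. 0.5.3)."""
    return version == series or version.startswith(series + ".")


def check_pin(dist_name: str = "prime-rl", series: str = "0.5") -> str:
    """Return 'missing', 'ok', or 'mismatch' for the installed distribution."""
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return "missing"  # the optional extra was not installed
    return "ok" if in_series(installed, series) else "mismatch"
```

A recipe entry point could call `check_pin()` once at startup and fail with a clear message on "mismatch" instead of hitting an attribute error deep inside the loss function.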

Source

docs/research/RL_FRAMEWORKS_LANDSCAPE.md (2026-05-26 subagent recon, primary-sourced from DeepWiki audits + GitHub repo READMEs + PyPI release metadata).