
ADR-006 — RL framework strategy: TRL + VeRL + PRIME-RL

Status: Accepted
Date: 2026-05-26
Wave: 13

Context

The brief's V3 clause names six substrates: Monarch, TorchForge, OpenEnv, VeRL, TRL (plus DiLoCo). Cross-model review (Wave 11) flagged that V3 was thin on the RL-framework side: TRL has working code, VeRL has only a config skeleton, and Monarch, TorchForge, and OpenEnv are research-only.

User's 2026-05-26 expansion: "see if there are other frameworks that are more popular that we could try to use. meta's pytorch agentic stack components are something that I'd like to explore."

docs/research/RL_FRAMEWORKS_LANDSCAPE.md audited:

- 6 RL frameworks: OpenRLHF, PRIME-RL, NeMo-Aligner, Unsloth, LLaMA-Factory, DeepSpeed-Chat
- 4 Meta PyTorch stack components: Monarch, TorchTitan, TorchForge, torchchat

Options considered

| Framework | License | GRPO/DAPO? | Custom-loss extension | Verdict |
| --- | --- | --- | --- | --- |
| OpenRLHF | Apache-2 | ✅ DAPO | Fork `openrlhf/models/loss.py` + Trainer subclass (~400-600 LOC) | Strong but heavyweight |
| PRIME-RL | Apache-2 | ✅ GRPO + DAPO | First-class `CustomLossConfig` with `LossInputs` struct (~200-300 LOC) | Chosen |
| NeMo-Aligner | Apache-2 | ❌ no GRPO/DAPO | n/a | Reject |
| Unsloth | Apache-2 | TRL patcher | Closed `unsloth_zoo` loss kernels; unhookable | Reject |
| LLaMA-Factory | Apache-2 | ❌ delegates to EasyR1 | n/a | Reject |
| DeepSpeed-Chat | Apache-2 | ❌ PPO+DPO only | Feature-stale since 2023 | Reject |

| Meta stack | License | Active? | Role |
| --- | --- | --- | --- |
| Monarch | BSD-3 | ✅ v0.4.1 stable, v0.5 dev | Actor mesh: coordination layer for any SPMD trainer |
| TorchTitan | BSD-3 | ✅ active | Distributed-training stack (already a transitive dep of PRIME-RL) |
| TorchForge | BSD-3 | ❌ paused | Patterns only, per repo banner |
| torchchat | BSD-3 | ✅ active | Inference only; out of scope |

Decision

Add PRIME-RL as the third RL framework after TRL+VeRL, and Monarch as the agentic-stack coordination layer.

Why PRIME-RL

PRIME-RL ships a first-class `CustomLossConfig` with an `import_path` that lets us drop in a Python function returning a tensor. The config exposes a `LossInputs` struct with exactly the tensors we need: `trainer_logprobs`, `inference_logprobs`, `teacher_logprobs`, `advantages`, and `loss_mask`. This is the cleanest extension point available for a 3-channel loss: no fork, no Trainer subclass, no monkey-patching.

It also uses the verifiers env protocol (OpenEnv-compatible by design), so it slots into the framework's existing data path without translation.

PRIME-RL was used to train INTELLECT-1 (10B base, 30 nodes) and INTELLECT-2 (32B QwQ), so it has been production-tested on real distributed runs.
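To make the extension point described above concrete, here is a minimal sketch of a loss function that consumes the five fields. Only the field names come from this ADR. The `LossInputs` dataclass below is a hypothetical stand-in that uses plain Python lists instead of tensors, `composer_loss` and `distill_weight` are illustrative names, and the two terms shown (an importance-weighted RL term and a teacher log-ratio term) are assumptions chosen to show how the fields would be consumed, not the framework's actual 3-channel loss.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class LossInputs:
    """Hypothetical stand-in; only the field names come from the ADR."""
    trainer_logprobs: List[float]
    inference_logprobs: List[float]
    teacher_logprobs: List[float]
    advantages: List[float]
    loss_mask: List[bool]


def composer_loss(inputs: LossInputs, distill_weight: float = 0.1) -> float:
    """Masked mean of an RL term plus a teacher-distillation term."""
    total, count = 0.0, 0
    for lp, lp_inf, lp_teacher, adv, keep in zip(
        inputs.trainer_logprobs,
        inputs.inference_logprobs,
        inputs.teacher_logprobs,
        inputs.advantages,
        inputs.loss_mask,
    ):
        if not keep:
            continue  # masked-out tokens contribute nothing
        # Importance ratio corrects for trainer/inference policy mismatch.
        ratio = math.exp(lp - lp_inf)
        rl_term = -ratio * adv
        # Per-token log-ratio against the teacher (assumed distillation term).
        distill_term = lp - lp_teacher
        total += rl_term + distill_weight * distill_term
        count += 1
    # Guard against an all-False mask.
    return total / max(count, 1)
```

In a real recipe the same arithmetic would operate on tensors, and the function would be referenced through the `import_path` of `CustomLossConfig`; the exact call signature PRIME-RL expects should be checked against its documentation.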

Why Monarch (not TorchForge or TorchTitan as a top-level)

- Monarch is what is actually shipping from Meta's agentic stack: v0.4.1 is stable, and v0.5 dev builds are published daily. It is BSD-3 licensed.
- TorchForge is paused per its own repo banner. We document it (research/03) but don't depend on it.
- TorchTitan is already a transitive dep of PRIME-RL, so we get its benefits without building a direct integration. A TorchTitan-only path would be redundant with PRIME-RL.
- torchchat is inference-only and doesn't fit the training-framework conversation.

Monarch's role in our stack is the actor mesh that hosts the trainer, generator, rewarder, and judge actors. PRIME-RL's three-actor split (trainer, generator, rewarder) maps naturally onto Monarch primitives.
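The three-actor split can be pictured as a pipeline in which each role only talks to its neighbour through a queue. The sketch below uses plain `asyncio` from the standard library to illustrate that shape. It deliberately does not use the Monarch API, and the function names, the placeholder "generation" (upper-casing the prompt), and the placeholder reward (completion length) are all hypothetical.

```python
import asyncio


async def generator(prompts, rollouts: asyncio.Queue) -> None:
    """Produce one rollout per prompt, then signal completion with None."""
    for prompt in prompts:
        # Placeholder generation step; a real generator would sample a policy.
        await rollouts.put({"prompt": prompt, "completion": prompt.upper()})
    await rollouts.put(None)


async def rewarder(rollouts: asyncio.Queue, scored: asyncio.Queue) -> None:
    """Attach a reward to each rollout and forward it to the trainer."""
    while (item := await rollouts.get()) is not None:
        # Placeholder reward: length of the completion.
        item["reward"] = float(len(item["completion"]))
        await scored.put(item)
    await scored.put(None)


async def trainer(scored: asyncio.Queue) -> list:
    """Consume scored rollouts; here we only collect the rewards."""
    rewards = []
    while (item := await scored.get()) is not None:
        rewards.append(item["reward"])
    return rewards


async def run_pipeline(prompts) -> list:
    """Run the three roles concurrently and return the trainer's result."""
    rollouts, scored = asyncio.Queue(), asyncio.Queue()
    _, _, rewards = await asyncio.gather(
        generator(prompts, rollouts),
        rewarder(rollouts, scored),
        trainer(scored),
    )
    return rewards
```

Calling `asyncio.run(run_pipeline(["ab", "cde"]))` runs all three roles concurrently in one process. In the intended design each role would instead be an actor hosted on the mesh, with a judge actor added alongside the rewarder.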

Consequences

Accepted

- composer_replication/recipes/prime_rl/ directory:
  - prime_rl_recipe.md: integration recipe (parallel to TRL Recipe A, VeRL Recipe B)
  - composer_loss.py: the 3-channel loss adapted to PRIME-RL's LossInputs struct (~200-300 LOC)
  - prime_rl_config.yaml: example PRIME-RL config wiring our loss in
- composer_replication/recipes/monarch/ directory:
  - monarch_actor_layout.md: design doc for the actor mesh
  - actors.py: placeholder Monarch actor definitions (skeleton only; full integration is post-replication)
- New optional dependencies in pyproject.toml:
  - [prime-rl] extra: prime-rl>=0.5
  - [monarch] extra: monarch>=0.4.1
- docs/V3_SUBSTRATE_COVERAGE.md updated to reflect the new additions.

Three-recipe production matrix

| User scenario | Recommended recipe |
| --- | --- |
| Quick start, single-cluster, ≤7B | TRL Recipe A |
| Production multi-node, ≤32B | VeRL Recipe B |
| Decentralized / DiLoCo-shape, any size | PRIME-RL recipe (NEW) |
| Coordination-heavy multi-actor RL | Monarch + any of the above |
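The matrix can also be read as a small decision function. The sketch below is one possible encoding. `recommend_recipe` and its parameters are hypothetical names, checking the decentralized case first is an assumption, and scenarios the matrix does not list (for example a single cluster above 7B) return `None` rather than a guess.

```python
from typing import Optional


def recommend_recipe(
    params_b: float,
    multi_node: bool = False,
    decentralized: bool = False,
    coordination_heavy: bool = False,
) -> Optional[str]:
    """Map a scenario onto the recipe matrix; None means no matching row."""
    if decentralized:
        base = "PRIME-RL recipe"      # any model size
    elif multi_node and params_b <= 32:
        base = "VeRL Recipe B"        # production multi-node, up to 32B
    elif not multi_node and params_b <= 7:
        base = "TRL Recipe A"         # quick start, single cluster, up to 7B
    else:
        base = None                   # the matrix has no row for this case
    if coordination_heavy:
        # Monarch layers on top of whichever recipe applies.
        return "Monarch + " + (base or "any of the above")
    return base
```

For example, `recommend_recipe(7)` selects the TRL recipe and `recommend_recipe(32, multi_node=True)` selects the VeRL recipe.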

Trade-offs explicitly accepted

- Three RL frameworks is a maintenance burden. We accept this because no single framework covers all the user scenarios above. The framework's contribution is the 3-channel loss plus the trace-replay channel, expressed in three different framework idioms. Each recipe is ~200-300 LOC, so the total triplication tax is ~700 LOC compared with picking one framework.
- Monarch is BSD-3, not MIT. The framework is MIT; users who opt in to Monarch take on its license. This is documented in pyproject.toml's optional extras.
- PRIME-RL's API may evolve. The `LossInputs` struct is currently the contract; if PRIME-RL stabilizes on a different shape, we would need to update composer_loss.py and bump the pin. Pin to v0.5.x in our optional extras; note that the `prime-rl>=0.5` lower bound listed under Consequences does not by itself cap the version at 0.5.x.
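The v0.5.x pin could additionally be checked at import time, so that a mismatched install fails early instead of at the first loss call. A minimal sketch, assuming plain dotted numeric version strings; `is_within_pin` is a hypothetical helper and not part of PRIME-RL.

```python
def is_within_pin(version: str, major: int = 0, minor: int = 5) -> bool:
    """Return True when `version` belongs to the pinned major.minor series."""
    parts = version.split(".")
    if len(parts) < 2:
        return False
    try:
        return (int(parts[0]), int(parts[1])) == (major, minor)
    except ValueError:
        # Non-numeric components (e.g. "5rc1") are treated as outside the pin.
        return False
```

The version string could come from `importlib.metadata.version(...)`, with the distribution name taken from whatever the `[prime-rl]` extra actually installs.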

Source

docs/research/RL_FRAMEWORKS_LANDSCAPE.md (2026-05-26 subagent recon, primary-sourced from DeepWiki audits + GitHub repo READMEs + PyPI release metadata).