	# ADR-006 — RL framework strategy: TRL + VeRL + PRIME-RL

	Status: Accepted
	Date: 2026-05-26
	Wave: 13

	## Context

	The brief's V3 clause names six substrates: **Monarch, TorchForge,
	OpenEnv, VeRL, TRL**, plus DiLoCo. Cross-model review (Wave 11) flagged
	that V3 was thin on the RL-framework side: TRL has working code, VeRL has
	only a config skeleton, and Monarch/TorchForge/OpenEnv are research-only.

	User's 2026-05-26 expansion: *"see if there are other frameworks that are
	more popular that we could try to use. meta's pytorch agentic stack
	components are something that I'd like to explore."*

	`docs/research/RL_FRAMEWORKS_LANDSCAPE.md` audited:
	- 6 RL frameworks: OpenRLHF, PRIME-RL, NeMo-Aligner, Unsloth, LLaMA-Factory,
	DeepSpeed-Chat
	- 4 Meta PyTorch stack components: Monarch, TorchTitan, TorchForge, torchchat

	## Options considered

	| Framework | License | GRPO/DAPO? | Custom-loss extension | Verdict |
	|---|---|---|---|---|
	| OpenRLHF | Apache-2 | ✅ DAPO | Fork `openrlhf/models/loss.py` + Trainer subclass (~400-600 LOC) | Strong but heavyweight |
	| PRIME-RL | Apache-2 | ✅ GRPO + DAPO | First-class `CustomLossConfig` with `LossInputs` struct (~200-300 LOC) | Chosen |
	| NeMo-Aligner | Apache-2 | ❌ no GRPO/DAPO | n/a | Reject |
	| Unsloth | Apache-2 | TRL patcher | Closed `unsloth_zoo` loss kernels (unhookable) | Reject |
	| LLaMA-Factory | Apache-2 | ❌ delegates to EasyR1 | n/a | Reject |
	| DeepSpeed-Chat | Apache-2 | ❌ PPO+DPO only | feature-stale since 2023 | Reject |

	| Meta stack | License | Active? | Role |
	|---|---|---|---|
	| Monarch | BSD-3 | ✅ v0.4.1 stable, v0.5 dev | Actor mesh: coordination layer for any SPMD trainer |
	| TorchTitan | BSD-3 | ✅ active | Distributed-training stack (already a transitive dep of PRIME-RL) |
	| TorchForge | BSD-3 | ❌ paused | Patterns only, per repo banner |
	| torchchat | BSD-3 | ✅ active | Inference only; out of scope |

	## Decision

	**Add PRIME-RL as the third RL framework after TRL+VeRL, and Monarch as the
	agentic-stack coordination layer.**

	### Why PRIME-RL

	PRIME-RL ships a first-class `CustomLossConfig` with an `import_path`
	that lets us drop in a Python function returning a tensor. The config
	exposes a `LossInputs` struct with exactly the tensors we need:
	`trainer_logprobs`, `inference_logprobs`, `teacher_logprobs`,
	`advantages`, `loss_mask`. Among the frameworks audited, this is **the
	cleanest extension point for a 3-channel loss**: no fork, no Trainer
	subclass, no monkey-patching.

	It also uses the `verifiers` env protocol (OpenEnv-compatible by design),
	so it slots into the framework's existing data path without translation.

	PRIME-RL was used to train INTELLECT-1 (10B base, 30 nodes) and INTELLECT-2
	(32B QwQ); production-tested on real distributed runs.
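
To make the shape of such a drop-in function concrete, here is a minimal sketch. It is not PRIME-RL's API: the `LossInputs` dataclass below is a hypothetical stand-in that only reuses the five field names listed above, plain Python lists stand in for tensors so the example runs without PyTorch, and the three terms and their weights (`distill_weight`, `mismatch_weight`) are illustrative guesses, not the framework's actual channels.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class LossInputs:
    """Hypothetical stand-in; only the field names come from the ADR."""
    trainer_logprobs: List[float]    # log-probs under the policy being trained
    inference_logprobs: List[float]  # log-probs recorded when the rollout was sampled
    teacher_logprobs: List[float]    # log-probs under a teacher model
    advantages: List[float]          # per-token advantage estimates
    loss_mask: List[bool]            # True where a token contributes to the loss


def composer_loss(inputs: LossInputs,
                  distill_weight: float = 0.1,
                  mismatch_weight: float = 0.01) -> float:
    """Masked mean of three illustrative per-token terms (assumed, not sourced)."""
    total, count = 0.0, 0
    rows = zip(inputs.trainer_logprobs, inputs.inference_logprobs,
               inputs.teacher_logprobs, inputs.advantages, inputs.loss_mask)
    for lp, ilp, tlp, adv, keep in rows:
        if not keep:
            continue
        ratio = math.exp(lp - ilp)        # importance ratio trainer / sampler
        policy_term = -ratio * adv        # policy-gradient surrogate
        distill_term = lp - tlp           # sampled estimate of KL(student || teacher)
        mismatch_term = (lp - ilp) ** 2   # penalty on trainer/sampler drift
        total += (policy_term
                  + distill_weight * distill_term
                  + mismatch_weight * mismatch_term)
        count += 1
    return total / count if count else 0.0


example = LossInputs(
    trainer_logprobs=[-1.0, -2.0, -0.5],
    inference_logprobs=[-1.0, -2.0, -0.5],
    teacher_logprobs=[-0.5, -2.0, -1.0],
    advantages=[1.0, -1.0, 2.0],
    loss_mask=[True, True, False],
)
loss_value = composer_loss(example)
```

In a real recipe the same arithmetic would be written with tensor operations and returned as a scalar tensor; the point of the sketch is only that all inputs arrive in one struct, so the loss can be a single pure function.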

	### Why Monarch (not TorchForge or TorchTitan as a top-level)

	- Monarch is the part of Meta's agentic stack that is actually shipping:
	v0.4.1 is stable and v0.5 is in daily development. BSD-3.
	- TorchForge is paused per its own repo banner. We document it
	(research/03) but don't depend on it.
	- TorchTitan is a transitive dep of PRIME-RL already, so we get its
	benefits without needing to build a direct integration. If we wanted a
	TorchTitan-only path, it would be redundant with PRIME-RL.
	- torchchat is inference-only and doesn't fit the training-framework
	conversation.

	Monarch's role in our stack: **the actor mesh that hosts trainer/generator/
	rewarder/judge actors**. PRIME-RL's three-actor split (trainer, generator,
	rewarder) maps naturally onto Monarch primitives.
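
The trainer/generator/rewarder split can be pictured as three long-lived workers connected by queues. The sketch below uses only `asyncio` from the standard library as a stand-in; it does not use Monarch's API, and the function names, the toy "completion", and the length-based reward are all hypothetical. A judge actor is omitted for brevity.

```python
import asyncio


async def generator(prompts, rollouts: asyncio.Queue) -> None:
    """Produce one toy rollout per prompt, then send None as an end marker."""
    for prompt in prompts:
        await rollouts.put({"prompt": prompt, "completion": prompt.upper()})
    await rollouts.put(None)


async def rewarder(rollouts: asyncio.Queue, scored: asyncio.Queue) -> None:
    """Attach a toy reward (completion length) and forward each rollout."""
    while (item := await rollouts.get()) is not None:
        item["reward"] = float(len(item["completion"]))
        await scored.put(item)
    await scored.put(None)


async def trainer(scored: asyncio.Queue) -> list:
    """Collect scored rollouts; a real trainer would take optimizer steps here."""
    batch = []
    while (item := await scored.get()) is not None:
        batch.append(item)
    return batch


async def run_pipeline(prompts) -> list:
    """Run the three workers concurrently and return the trainer's batch."""
    rollouts, scored = asyncio.Queue(), asyncio.Queue()
    _, _, batch = await asyncio.gather(
        generator(prompts, rollouts),
        rewarder(rollouts, scored),
        trainer(scored),
    )
    return batch


batch = asyncio.run(run_pipeline(["ab", "cde"]))
```

In an actor-mesh deployment each worker would be a separately placed actor and the queues would be replaced by the mesh's messaging; the data flow (generator to rewarder to trainer) is the part this sketch is meant to show.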

	## Consequences

	### Accepted

	- `composer_replication/recipes/prime_rl/` directory:
	  - `prime_rl_recipe.md`: integration recipe (parallel to TRL Recipe A,
	    VeRL Recipe B)
	  - `composer_loss.py`: the 3-channel loss adapted to PRIME-RL's
	    `LossInputs` struct (~200-300 LOC)
	  - `prime_rl_config.yaml`: example PRIME-RL config wiring our loss in
	- `composer_replication/recipes/monarch/` directory:
	  - `monarch_actor_layout.md`: design doc for the actor mesh
	  - `actors.py`: placeholder Monarch actor definitions (skeleton only;
	    full integration is post-replication)
	- New optional dependencies in `pyproject.toml`:
	  - `[prime-rl]` extra: `prime-rl>=0.5`
	  - `[monarch]` extra: `monarch>=0.4.1`
	- `docs/V3_SUBSTRATE_COVERAGE.md` updated to reflect the new additions.

	### Three-recipe production matrix

	| User scenario | Recommended recipe |
	|---|---|
	| Quick start, single-cluster, ≤7B | TRL Recipe A |
	| Production multi-node, ≤32B | VeRL Recipe B |
	| Decentralized / DiLoCo-shape, any size | PRIME-RL recipe (NEW) |
	| Coordination-heavy multi-actor RL | Monarch + any of the above |
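
The matrix can also be read as a small decision function. The sketch below is one possible encoding, not part of the framework: `Scenario`, its field names, and `recommend_recipe` are hypothetical, and scenarios the matrix does not cover (for example single-cluster above 7B) raise an error instead of guessing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical description of a user's training setup."""
    params_billions: float
    multi_node: bool = False
    decentralized: bool = False
    coordination_heavy: bool = False


def recommend_recipe(s: Scenario) -> str:
    """Map a scenario onto the matrix rows; raise if no row applies."""
    if s.decentralized:
        base = "PRIME-RL recipe"      # any model size
    elif s.multi_node and s.params_billions <= 32:
        base = "VeRL Recipe B"
    elif not s.multi_node and s.params_billions <= 7:
        base = "TRL Recipe A"
    else:
        raise ValueError("scenario is not covered by the recipe matrix")
    # Monarch is layered on top of whichever recipe was selected.
    return f"Monarch + {base}" if s.coordination_heavy else base


quick_start = recommend_recipe(Scenario(params_billions=7))
```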

	### Trade-offs explicitly accepted

	- Three RL frameworks is a maintenance burden. We accept this because
	no single one covers all the user scenarios above. The framework's
	contribution is the 3-channel loss + the trace-replay channel, expressed
	in three different framework idioms. Each recipe is ~200-300 LOC, so the
	three recipes total roughly 600-900 LOC (the ~700 LOC triplication tax),
	about 400-600 LOC more than picking one framework.
	- Monarch is BSD-3, not MIT. The framework is MIT; users opting in to
	Monarch take on its license. This is documented alongside the optional
	extras in `pyproject.toml`.
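	- PRIME-RL's API may evolve. The `LossInputs` struct is currently the
	contract; if PRIME-RL stabilizes a different shape, we would need to bump
	our adapter. Pin to v0.5.x in our optional extras; note that the
	`[prime-rl]` extra needs an upper bound (e.g. `prime-rl>=0.5,<0.6`) to
	enforce this, since `>=0.5` alone also admits later series.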
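
Because the `LossInputs` contract is tied to one release series, a recipe could check the installed version before importing anything from PRIME-RL. The following is a minimal sketch using only the standard library; the helper names are hypothetical, the distribution name `prime-rl` is taken from the extras listed earlier, and the parser handles only simple dotted versions (a full implementation would more likely use the `packaging` library).

```python
from importlib import metadata
from itertools import takewhile
from typing import Optional, Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Return the leading numeric release segment of a dotted version string."""
    parts = []
    for piece in version.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
        if digits != piece:  # stop at suffixes such as 'rc1' or '.dev0'
            break
    return tuple(parts)


def within_pin(version: str, series: Tuple[int, int] = (0, 5)) -> bool:
    """True if the version belongs to the pinned major.minor series."""
    return parse_release(version)[:2] == series


def installed_within_pin(dist_name: str = "prime-rl") -> Optional[bool]:
    """Check the installed distribution; None means it is not installed."""
    try:
        return within_pin(metadata.version(dist_name))
    except metadata.PackageNotFoundError:
        return None
```

A recipe entry point could call `installed_within_pin()` and fail early with a clear message when it returns `False`, rather than failing later on a changed struct field.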

	## Source

	`docs/research/RL_FRAMEWORKS_LANDSCAPE.md` (2026-05-26 subagent recon,
	primary-sourced from DeepWiki audits + GitHub repo READMEs + PyPI release
	metadata).