# ADR-006 — RL framework strategy: TRL + VeRL + PRIME-RL

**Status**: Accepted
**Date**: 2026-05-26
**Wave**: 13

## Context

The brief's V3 clause names six substrates: **Monarch, TorchForge,
OpenEnv, VeRL, TRL** (plus DiLoCo). Cross-model review (Wave 11) flagged
that V3 was thin on the RL-framework side: TRL has working code, VeRL has
only a config skeleton, and Monarch/TorchForge/OpenEnv are research-only.

User's 2026-05-26 expansion: *"see if there are other frameworks that are
more popular that we could try to use. meta's pytorch agentic stack
components are something that I'd like to explore."*

`docs/research/RL_FRAMEWORKS_LANDSCAPE.md` audited:
- 6 RL frameworks: OpenRLHF, PRIME-RL, NeMo-Aligner, Unsloth, LLaMA-Factory,
  DeepSpeed-Chat
- 4 Meta PyTorch stack components: Monarch, TorchTitan, TorchForge, torchchat

## Options considered

| Framework | License | GRPO/DAPO? | Custom-loss extension | Verdict |
|---|---|---|---|---|
| OpenRLHF | Apache-2 | ✅ DAPO | Fork `openrlhf/models/loss.py` + Trainer subclass (~400-600 LOC) | Strong but heavyweight |
| **PRIME-RL** | **Apache-2** | **✅ GRPO + DAPO** | **First-class `CustomLossConfig` with `LossInputs` struct (~200-300 LOC)** | **Chosen** |
| NeMo-Aligner | Apache-2 | ❌ no GRPO/DAPO | n/a | Reject |
| Unsloth | Apache-2 | TRL patcher | Closed `unsloth_zoo` loss kernels — unhookable | Reject |
| LLaMA-Factory | Apache-2 | ❌ delegates to EasyR1 | n/a | Reject |
| DeepSpeed-Chat | Apache-2 | ❌ PPO+DPO only | feature-stale since 2023 | Reject |

| Meta stack | License | Active? | Role |
|---|---|---|---|
| **Monarch** | **BSD-3** | **✅ v0.4.1 stable, v0.5 dev** | **Actor mesh — coordination layer for any SPMD trainer** |
| TorchTitan | BSD-3 | ✅ active | Distributed-training stack (already a transitive dep of PRIME-RL) |
| TorchForge | BSD-3 | ❌ paused | Patterns only, per repo banner |
| torchchat | BSD-3 | active | Inference only — out of scope |

## Decision

**Add PRIME-RL as the third RL framework after TRL+VeRL, and Monarch as the
agentic-stack coordination layer.**

### Why PRIME-RL

PRIME-RL ships a **first-class `CustomLossConfig` with an `import_path`**
that lets us drop in a Python function returning a tensor. The config
exposes a `LossInputs` struct with exactly the tensors we need:
`trainer_logprobs`, `inference_logprobs`, `teacher_logprobs`,
`advantages`, `loss_mask`. Of the frameworks audited, this is **the cleanest
extension point for a 3-channel loss**: no fork, no Trainer subclass, no
monkey-patching.
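
As a rough illustration of what such a drop-in function could look like, the sketch below defines a stand-in `LossInputs` with the five fields named above and a hypothetical `composer_loss`. The three channels shown (an importance-weighted policy-gradient term, a teacher-distillation term, and a trainer/inference mismatch penalty), the weights, and the function signature are all assumptions made for this example. They are not PRIME-RL's actual API and not the framework's real loss. Plain Python lists stand in for tensors so the sketch runs without PRIME-RL installed.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class LossInputs:
    """Stand-in for PRIME-RL's LossInputs; only the field names come from the ADR."""
    trainer_logprobs: List[float]    # log-probs under the policy being trained
    inference_logprobs: List[float]  # log-probs recorded when the rollout was sampled
    teacher_logprobs: List[float]    # log-probs under a teacher model
    advantages: List[float]          # per-token advantages
    loss_mask: List[bool]            # True where the token contributes to the loss


def composer_loss(inputs: LossInputs,
                  distill_weight: float = 0.1,
                  mismatch_weight: float = 0.01) -> float:
    """Hypothetical 3-channel loss, averaged over unmasked tokens."""
    total, count = 0.0, 0
    rows = zip(inputs.trainer_logprobs, inputs.inference_logprobs,
               inputs.teacher_logprobs, inputs.advantages, inputs.loss_mask)
    for lp, inf_lp, teacher_lp, adv, keep in rows:
        if not keep:
            continue
        # Channel 1 (assumed): importance-weighted policy-gradient term.
        policy = -math.exp(lp - inf_lp) * adv
        # Channel 2 (assumed): single-sample estimate of trainer-to-teacher KL.
        distill = lp - teacher_lp
        # Channel 3 (assumed): penalty on trainer/inference log-prob mismatch.
        mismatch = (lp - inf_lp) ** 2
        total += policy + distill_weight * distill + mismatch_weight * mismatch
        count += 1
    return total / max(count, 1)
```

In a real recipe the same arithmetic would operate on tensors and return a tensor, and the function would be referenced from the config through `import_path`.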

It also uses the `verifiers` env protocol (OpenEnv-compatible by design),
so it slots into the framework's existing data path without translation.

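PRIME-RL was used to train INTELLECT-2 (32B, QwQ-based), so it has been
exercised on a real distributed run. INTELLECT-1 (10B base, 30 nodes) was
trained with Prime Intellect's separate `prime` pretraining framework, not
PRIME-RL.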

### Why Monarch (not TorchForge or TorchTitan as a top-level)

- **Monarch is what's actually shipping** from Meta's agentic stack: v0.4.1
  is stable and v0.5 is in daily development. It is BSD-3 licensed.
- **TorchForge is paused** per its own repo banner. We document it
  (research/03) but don't depend on it.
- **TorchTitan is a transitive dep** of PRIME-RL already, so we get its
  benefits without needing to build a direct integration. If we wanted a
  TorchTitan-only path, it would be redundant with PRIME-RL.
- **torchchat is inference-only** and doesn't fit the training-framework
  conversation.

Monarch's role in our stack: **the actor mesh that hosts the trainer,
generator, rewarder, and judge actors**. PRIME-RL's three-actor split
(trainer, generator, rewarder) maps naturally onto Monarch primitives.
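
To make the three-actor split concrete, here is a minimal sketch of the data flow using only the standard library's `asyncio`. It deliberately does not use Monarch's API: the coroutine names, the queue wiring, and the toy reward are assumptions for illustration, and the judge actor is omitted. Each coroutine plays the role that a Monarch actor would host.

```python
import asyncio


async def generator(prompts, rollouts: asyncio.Queue) -> None:
    """Produce one toy rollout per prompt (here: the prompt upper-cased)."""
    for prompt in prompts:
        await rollouts.put(prompt.upper())
    await rollouts.put(None)  # sentinel: no more rollouts


async def rewarder(rollouts: asyncio.Queue, scored: asyncio.Queue) -> None:
    """Attach a toy reward (the rollout length) to each rollout."""
    while (rollout := await rollouts.get()) is not None:
        await scored.put((rollout, float(len(rollout))))
    await scored.put(None)  # forward the sentinel to the trainer


async def trainer(scored: asyncio.Queue) -> list:
    """Collect scored rollouts; a real trainer would take a gradient step."""
    batch = []
    while (item := await scored.get()) is not None:
        batch.append(item)
    return batch


async def run_pipeline(prompts) -> list:
    """Run the three roles concurrently and return what the trainer saw."""
    rollouts, scored = asyncio.Queue(), asyncio.Queue()
    _, _, batch = await asyncio.gather(
        generator(prompts, rollouts),
        rewarder(rollouts, scored),
        trainer(scored),
    )
    return batch
```

Calling `asyncio.run(run_pipeline(["ab", "cde"]))` drives all three roles in one process. In the intended layout each role would live in its own actor on the mesh, with message passing in place of the in-process queues.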

## Consequences

### Accepted

- `composer_replication/recipes/prime_rl/` directory:
  - `prime_rl_recipe.md` — integration recipe (parallel to TRL Recipe A,
    VeRL Recipe B)
  - `composer_loss.py` — the 3-channel loss adapted to PRIME-RL's
    `LossInputs` struct (~200-300 LOC)
  - `prime_rl_config.yaml` — example PRIME-RL config wiring our loss in
- `composer_replication/recipes/monarch/` directory:
  - `monarch_actor_layout.md` — design doc for the actor mesh
  - `actors.py` — placeholder Monarch actor definitions (skeleton only;
    full integration is post-replication)
- New optional dependencies in `pyproject.toml`:
  - `[prime-rl]` extra: `prime-rl>=0.5,<0.6` (pinned to v0.5.x)
  - `[monarch]` extra: `monarch>=0.4.1`
- `docs/V3_SUBSTRATE_COVERAGE.md` updated to reflect the new additions.
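
Because both extras are optional, recipe code may need to check which backends are installed before importing them. A minimal sketch follows, assuming the importable module names are `prime_rl` and `monarch`. Those names are guesses for illustration and may differ from what the packages actually expose.

```python
import importlib.util

# Assumed mapping from extra name to importable module name (hypothetical).
OPTIONAL_BACKENDS = {"prime-rl": "prime_rl", "monarch": "monarch"}


def available_backends(backends=OPTIONAL_BACKENDS):
    """Return the extras whose top-level module can be located, without importing it."""
    return sorted(
        extra
        for extra, module in backends.items()
        if importlib.util.find_spec(module) is not None
    )
```

A recipe could call `available_backends()` at startup and fail with a message naming the missing extra, instead of surfacing a bare `ImportError`.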

### Three-recipe production matrix

| User scenario | Recommended recipe |
|---|---|
| Quick start, single-cluster, ≤7B | TRL Recipe A |
| Production multi-node, ≤32B | VeRL Recipe B |
| Decentralized / DiLoCo-shape, any size | PRIME-RL recipe (NEW) |
| Coordination-heavy multi-actor RL | Monarch + any of the above |
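
The matrix can also be read as a small decision function. The sketch below is one literal encoding of the rows above. The function name, its parameters, and the choice to raise on scenarios the table does not cover are assumptions of this example, not part of the framework.

```python
def recommend_recipe(params_b: float,
                     multi_node: bool = False,
                     decentralized: bool = False,
                     coordination_heavy: bool = False) -> str:
    """Map a scenario to a recipe, following the matrix rows literally.

    params_b is the model size in billions of parameters.
    """
    if decentralized:
        recipe = "PRIME-RL recipe"   # any size
    elif multi_node and params_b <= 32:
        recipe = "VeRL Recipe B"
    elif not multi_node and params_b <= 7:
        recipe = "TRL Recipe A"
    else:
        # The matrix has no row for this combination, so do not guess.
        raise ValueError("scenario not covered by the recipe matrix")
    # Monarch is a coordination layer on top of any recipe, not a replacement.
    return f"Monarch + {recipe}" if coordination_heavy else recipe
```

For example, `recommend_recipe(32, multi_node=True)` returns `"VeRL Recipe B"`, while a 70B multi-node run that is not decentralized raises, which flags a gap in the matrix instead of hiding it.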

### Trade-offs explicitly accepted

- **Three RL frameworks is a maintenance burden.** We accept this because
  no single one covers all the user scenarios above. The framework's
  contribution is the 3-channel loss + the trace-replay channel, expressed
  in three different framework idioms. Each recipe is ~200-300 LOC, so
  carrying three costs roughly 700 LOC in total, where a single framework
  would cost ~200-300 LOC.
- **Monarch is BSD-3 not MIT.** The framework is MIT; users opting in to
  Monarch take on its license. Documented in pyproject.toml's optional
  extras.
- **PRIME-RL's API may evolve.** The `LossInputs` struct is currently the
  contract; if PRIME-RL stabilizes a different shape, we would need to
  update `composer_loss.py` and bump the pinned version. Until then, pin to
  v0.5.x in our optional extras.

## Source

`docs/research/RL_FRAMEWORKS_LANDSCAPE.md` (2026-05-26 subagent recon,
primary-sourced from DeepWiki audits + GitHub repo READMEs + PyPI release
metadata).