# RL Post-Training Frameworks Landscape & Meta PyTorch Stack Audit

> **Generated:** 2026-05-25
> **Scope:** Audit of RL post-training frameworks beyond TRL+VeRL plus Meta's PyTorch agentic stack components, with a recommendation of two additions to the Composer Replication Framework.
> **Feeds:** ADR-006 (Algorithm-substrate selection)
> **Companion docs:** `~/wiki/research/post-training-framework/04-verl-trl.md`, `~/wiki/research/post-training-framework/03-monarch-torchforge-openenv.md`, `~/wiki/research/post-training-framework/02-diloco-family.md`

---

## TL;DR — Recommendation

| Slot | Pick | Why |
|---|---|---|
| **RL framework #3 (after TRL, VeRL)** | **PRIME-RL (PrimeIntellect-ai/prime-rl)** | First-class `CustomLossConfig` extension point (`trainer.loss.type=custom` + `import_path`) — the cleanest place we have to drop our **3-channel loss (RLVR + hint-distill + trace-replay)** without forking. Already uses the `verifiers` env protocol that bridges to OpenEnv. Async, decentralized substrate. Apache-2.0. INTELLECT-2 production receipts. |
| **Infra component (Meta stack)** | **Monarch (`meta-pytorch/monarch`)** as the actor-mesh control plane; **TorchTitan** is *also* tracked as the FSDP2/TP/PP training core but is already the trainer inside both PRIME-RL and TorchForge, so we adopt it transitively. The single net-new dependency is **Monarch**. | Monarch is the only Meta-stack component that is (a) actively shipped (v0.4 GA, v0.5 dev wheels published regularly), (b) decoupled from the now-paused TorchForge, and (c) able to host *any* SPMD trainer (TRL, VeRL, PRIME-RL) as an `ActorMesh`. BSD-3. Replaces Ray when our v0.2 lands. |

**What we do NOT add:**
- OpenRLHF — strong production framework (v0.9.10, 9.3K★, supports DAPO) but its custom-loss path requires modifying `openrlhf/models/loss.py` + a `Trainer` subclass. Strictly worse extension story than PRIME-RL for our specific need (3-channel loss).
- NeMo-Aligner — no GRPO, no DAPO, heavy NeMo/Megatron dependency. Wrong shape.
- Unsloth — TRL wrapper; the RL loss kernels live in the separate `unsloth_zoo` package. We'd have to fork.
- LLaMA-Factory — TRL wrapper, no GRPO/DAPO (delegates to EasyR1).
- DeepSpeed-Chat — effectively unmaintained for new RL algos since Aug 2023; PPO/DPO only.
- TorchForge — Meta has marked the repo "development paused, consolidating into TorchTitan." Borrow patterns; do not depend on it.
- torchchat — inference / local deployment only; no training. Out of scope.

---

## Table of Contents

1. [Audit Methodology](#1-audit-methodology)
2. [RL Framework Audit](#2-rl-framework-audit)
   1. [OpenRLHF](#21-openrlhf)
   2. [PRIME-RL](#22-prime-rl)
   3. [NeMo-Aligner](#23-nemo-aligner)
   4. [Unsloth (RL)](#24-unsloth-rl)
   5. [LLaMA-Factory](#25-llama-factory)
   6. [DeepSpeed-Chat](#26-deepspeed-chat)
3. [Meta PyTorch Agentic Stack — Infra vs Training Split](#3-meta-pytorch-agentic-stack)
   1. [Monarch (coordination/infra)](#31-monarch)
   2. [TorchTitan (training stack)](#32-torchtitan)
   3. [TorchForge (paused)](#33-torchforge)
   4. [torchchat (out of scope)](#34-torchchat)
4. [Comparison Matrix](#4-comparison-matrix)
5. [Recommendation Rationale](#5-recommendation-rationale)
6. [Integration Sketches](#6-integration-sketches)
7. [Sources](#7-sources)

---

## 1. Audit Methodology

For each framework, we capture five fields that determine whether it can host the Composer Replication Framework's three-channel loss (RLVR + hint-distill + trace-replay) on our existing OpenEnv-compatible TRL data path:

1. **Repo + license + last commit + maturity** — primary GitHub source, license grade for redistribution, recency, and whether the project is *production*, *research*, or *archived*.
2. **Algorithm coverage** — does it ship GRPO and DAPO out of the box? (DAPO matters because Composer-style training inherits its decoupled clip + dynamic sampling fixes for length and std biases.)
3. **Custom-loss extension point** — concrete file/class/config where a custom 3-channel loss can be plugged. We strongly prefer a stable public hook over forking.
4. **Integration cost** — rough lines of code needed for a `Recipe` doc + a skeleton `Trainer` subclass that runs end-to-end on a small env.
5. **OpenEnv data-path fit** — does it already consume the OpenEnv contract (typed `reset`/`step`/`close`, MCP tool-calling) directly, or do we have to write a shim?
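
Field 2 singles out DAPO's decoupled clip and dynamic sampling. The two ideas can be sketched in a few lines of plain Python. This is a minimal illustration, not any framework's API: the function names are hypothetical and the `eps_low`/`eps_high` defaults are illustrative.

```python
import math

def decoupled_clip_term(logp_new, logp_old, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token clipped surrogate with separate lower and upper clip ranges."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # PPO-style pessimistic objective; negate it to obtain a loss.
    return min(ratio * advantage, clipped * advantage)

def keep_group(rewards):
    """Dynamic sampling: drop prompt groups whose rewards are all identical,
    because their group-normalised advantages are all zero."""
    return len(set(rewards)) > 1
```

A group such as `[1, 1, 1]` is filtered out by `keep_group`, while `[0, 1, 1]` is kept and contributes a gradient signal.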

Primary sources: each repo's `README.md`, official releases page, and DeepWiki audits (where indexed). Secondary checks: PyPI release timelines for Meta packages.

---

## 2. RL Framework Audit

### 2.1 OpenRLHF

| Field | Value |
|---|---|
| **Repo** | https://github.com/OpenRLHF/OpenRLHF |
| **License** | Apache-2.0 |
| **Stars / contributors** | 9,312 ★ / 90 contributors |
| **Latest release** | v0.9.10, 2026-04-04 |
| **Last push** | 2026-04-05 |
| **Maturity** | **Production** — used in many public RLHF runs since 2023; tagline "An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & TIS & vLLM & Ray & Async RL)" |
| **Algorithms** | PPO, GRPO, **DAPO** (release notes; advertised as a primary feature in v0.9.x), REINFORCE++, REINFORCE++-baseline, RLOO, GSPO, Async RL, TIS (truncated importance sampling) |
| **Custom-loss extension point** | `openrlhf/models/loss.py` — `PolicyLoss`, `DPOLoss`, `SFTLoss`, `PairWiseLoss`, `LogExpLoss` are concrete `nn.Module`s. To add a 3-channel loss you would (a) add a new `nn.Module` (e.g. `ThreeChannelLoss`) here, then (b) subclass the relevant `Trainer` (e.g. `PPOTrainer` / a new GRPO-derived trainer) and replace `self.loss_fn`. There is **no config-driven custom-loss hook** equivalent to PRIME-RL's `CustomLossConfig` — you fork or vendor. |
| **Integration cost** | Higher than PRIME-RL. Estimated **~400–600 LOC**: ~150 LOC for a `ThreeChannelLoss` module, ~200 LOC for a `ComposerGRPOTrainer` subclass that routes the three signals (RLVR scalar, hint-distill teacher logprobs, trace-replay teacher logits), ~50 LOC for a `Recipe` doc, plus reward-fn glue. |
| **Data-path fit** | OpenRLHF's input is HF chat templates + a Python reward function or a remote reward URL (`--reward.remote_url`, `--train.agent_func_path`). It does **not** speak the OpenEnv `reset/step` protocol natively, but our existing OpenEnv→TRL adapter could be reused as a callable behind `agent_func_path`. **Medium** lift to wire OpenEnv. |

**Verdict:** Strong, mature, well-funded codebase with the *most* complete algorithm coverage of any candidate. Loses to PRIME-RL only because PRIME-RL has a first-class config-driven custom-loss hook that fits our exact need, and PRIME-RL already has the `verifiers`/OpenEnv shape baked into the orchestrator. We keep OpenRLHF on the radar as a fallback substrate if PRIME-RL's decentralized story is overkill for v0.1.

---

### 2.2 PRIME-RL

| Field | Value |
|---|---|
| **Repo** | https://github.com/PrimeIntellect-ai/prime-rl |
| **License** | Apache-2.0 |
| **Stars / contributors** | 1,398 ★ / 60 contributors |
| **Latest release** | v0.5.0, 2026-03-30 |
| **Last push** | 2026-05-25 (active today) |
| **Maturity** | **Production-research hybrid** — substrate behind INTELLECT-1/2 multi-DC runs; tagline "Async RL Training at Scale". Decentralized DiLoCo-shape compute is its differentiator. |
| **Algorithms** | **GRPO**, GSPO, on-policy distillation with a teacher model. `default_loss_fn` = DPPO + KL (a GRPO variant; similar lineage to DAPO's decoupled-clip idea but the upstream "DAPO" label is not used verbatim). |
| **Custom-loss extension point** | **Best in class.** `src/prime_rl/trainer/rl/loss.py` exposes a `LossInputs`/`LossOutputs` interface and `setup_loss_fn` resolves a config: `trainer.loss.type = "custom"` + `trainer.loss.import_path = "your_pkg.your_module.your_loss_fn"` + optional kwargs. The custom function receives `trainer_logprobs`, `inference_logprobs`, `teacher_logprobs`, `advantages`, `loss_mask` — i.e., the exact tensor inputs needed for a 3-channel loss (RLVR uses `advantages`, hint-distill uses `teacher_logprobs`, trace-replay can be threaded through `kwargs` as a precomputed reference). |
| **Integration cost** | **Lowest.** Estimated **~200–300 LOC total**: ~120 LOC for a `composer_three_channel_loss` function in our package + ~30 LOC of config (`recipes/composer_v0.toml`), ~80 LOC `Recipe` doc. No subclassing required for the loss. A small adapter is needed if we precompute the trace-replay teacher distribution outside the `LossInputs` struct. |
| **Data-path fit** | **Already aligned.** PRIME-RL's orchestrator consumes `verifiers` environments via `vf.EnvServer`. The OpenEnv ↔ verifiers shim is a known small adapter (the `verifiers` library is the Hub-side env runner that OpenEnv's TRL guide already uses). Our existing OpenEnv-compatible TRL data path drops in with a thin wrapper. |

**Verdict:** Best fit for the framework. The combination of (i) config-driven custom loss with the right tensor signatures already present, (ii) verifiers/OpenEnv shape, (iii) decentralized async training that maps to our DiLoCo plans, makes PRIME-RL the substrate of choice for v0.1. **Recommended addition #1.**
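
To make the config-driven hook concrete, the following is a minimal sketch of how an import-path string can be resolved to a callable and bound to config kwargs. It is not PRIME-RL's `setup_loss_fn`: `resolve_import_path` and `bind_kwargs` are hypothetical names, and a standard-library function stands in for a real loss function.

```python
import importlib
from functools import partial

def resolve_import_path(path):
    """Resolve 'pkg.module:attr' or 'pkg.module.attr' to the named object."""
    if ":" in path:
        module_name, attr = path.split(":", 1)
    else:
        module_name, _, attr = path.rpartition(".")
    if not module_name or not attr:
        raise ValueError(f"not an import path: {path!r}")
    return getattr(importlib.import_module(module_name), attr)

def bind_kwargs(path, **kwargs):
    """Return the resolved callable with the config kwargs pre-bound."""
    return partial(resolve_import_path(path), **kwargs)

# Stand-in target from the standard library instead of a real loss function.
round_to_2 = bind_kwargs("builtins:round", ndigits=2)
```

The appeal of this pattern for our use case is that the loss lives entirely in our package; the framework only needs the string and the kwargs.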

---

### 2.3 NeMo-Aligner

| Field | Value |
|---|---|
| **Repo** | https://github.com/NVIDIA/NeMo-Aligner |
| **License** | Apache-2.0 |
| **Maturity** | **Research-leaning production** — NVIDIA-maintained, tied to NeMo/Megatron-LM. Advertised as "early stages of development" in its own README. |
| **Algorithms** | PPO, REINFORCE, RS (Rejection Sampling), DPO, RPO. **No GRPO. No DAPO.** |
| **Custom-loss extension point** | `loss_func` method on Megatron model classes (e.g. `MegatronGPTDPOModel.loss_func`). Requires NeMo model-class subclassing and Megatron-LM familiarity. |
| **Integration cost** | High. Estimated **~800–1,200 LOC** including .nemo conversion of HF weights, Megatron model wrapping, custom Megatron `loss_func`, and a recipe. Plus the operational cost of running on Megatron-LM (Triton kernels, NeMo container). |
| **Data-path fit** | JSONL only; no OpenEnv. We'd write a full env adapter. |

**Verdict:** Wrong shape. No GRPO/DAPO and tightly bound to the NeMo ecosystem. Only relevant if we ever need NVIDIA-supported large-scale Megatron RL, which we don't for the Composer Replication v0.1/v0.2 horizon. **Reject.**

---

### 2.4 Unsloth (RL)

| Field | Value |
|---|---|
| **Repo** | https://github.com/unslothai/unsloth |
| **License** | Apache-2.0 (per public README; not surfaced by DeepWiki snapshot but well-known) |
| **Maturity** | **Production** for SFT and LoRA/QLoRA; **research/preview** for RL — RL support shipped in 2025 as a TRL patcher. |
| **Algorithms** | Wraps TRL → inherits TRL's GRPO; loss-type switch supports `"grpo"`, `"bnpo"`, `"dr_grpo"`, `"dapo"`, `"cispo"`. So **GRPO and DAPO are both available** through the patched-TRL path. |
| **Custom-loss extension point** | Problematic. The actual loss kernels live in `unsloth_zoo` (a *separate* compiled dependency). The patcher (`patch_trl_rl_trainers()`) generates modified TRL trainer classes via `exec()` from string templates. To add a new loss type you would have to (a) modify or fork `unsloth_zoo` to add a kernel, (b) extend `RL_REPLACEMENTS`, and (c) extend the `compute_loss()` switch in the patcher template. **There is no public Python subclass hook that survives the patching.** |
| **Integration cost** | Very high if we want our own loss. Forking `unsloth_zoo` defeats the purpose of using Unsloth (which is the optimized kernels). Estimated ~1,000+ LOC plus an external repo to maintain. |
| **Data-path fit** | TRL-shaped, so OpenEnv via TRL is fine — but only for *stock* TRL losses. Our 3-channel loss does not survive Unsloth's patching. |

**Verdict:** Excellent for memory-efficient SFT and stock-GRPO LoRA. Wrong tool for a custom loss. **Reject** as the substrate; we may still use it as an *optional* QLoRA accelerator inside a stock-GRPO ablation run.

---

### 2.5 LLaMA-Factory

| Field | Value |
|---|---|
| **Repo** | https://github.com/hiyouga/LLaMA-Factory |
| **License** | Apache-2.0 |
| **Maturity** | **Production** for breadth (50+ model families, SFT/DPO/PPO recipes), but RL is a thin TRL wrapper. |
| **Algorithms** | PPO, DPO, KTO, ORPO, SimPO via `Custom*Trainer` subclasses of the corresponding `trl.*Trainer` classes. **No GRPO. No DAPO** in the repo itself; the README points to **EasyR1** (an external GRPO framework) for those. |
| **Custom-loss extension point** | `compute_preference_loss` switch on `CustomDPOTrainer` (selects `sigmoid` / `hinge` / `ipo` / `kto_pair` / `orpo` / `simpo`). For PPO, you would subclass `CustomPPOTrainer` → which is `trl.PPOTrainer`. Effectively the same extension story as plain TRL, with a configuration layer on top. |
| **Integration cost** | Moderate, ~400 LOC, but you are essentially using TRL through one extra layer. |
| **Data-path fit** | Text/dataset-shaped, not OpenEnv-aware. Same OpenEnv-via-TRL story. |

**Verdict:** Useful as a multi-model SFT laboratory but does not move the ball for our RL-side requirements. **Reject** as substrate; we already have TRL.

---

### 2.6 DeepSpeed-Chat

| Field | Value |
|---|---|
| **Repo** | https://github.com/deepspeedai/DeepSpeedExamples (the `applications/DeepSpeed-Chat/` subtree) |
| **License** | Apache-2.0 |
| **Maturity** | **Effectively stale.** The README's "Latest News" cuts off in August 2023. CI patches in 2025 (e.g., #6982, #7015, #7052) are dependency-pinning fixes, not feature work. The roadmap to "generalize DeepSpeed-RLHF abstraction for a wider range of RL algorithms" has not landed. |
| **Algorithms** | PPO (3-stage RLHF) + DPO. **No GRPO. No DAPO.** |
| **Custom-loss extension point** | `DeepSpeedPPOTrainer.train_rlhf` / `actor_loss_fn` / `critic_loss_fn`. Editable but not config-hooked. |
| **Integration cost** | Moderate, but you inherit a frozen architecture. ~500 LOC. |
| **Data-path fit** | Prompt-dataset-shaped; no OpenEnv. |

**Verdict:** Pioneering for its time, no longer competitive on algorithm coverage. **Reject.**

---

## 3. Meta PyTorch Agentic Stack — Infra vs Training Split

The brief asked specifically to **distinguish coordination/infra from training-stack** components. The answer is:

| Component | Layer | Status (May 2026) | In our framework? |
|---|---|---|---|
| **Monarch** (`meta-pytorch/monarch`) | **Coordination / Infra** — actor mesh, RDMA data plane, supervision trees | **Active.** v0.4 GA (2026-03-26), v0.5 dev wheels daily, BSD-3 | **Yes — recommended addition.** |
| **TorchTitan** (`pytorch/torchtitan`) | **Training stack** — FSDP2 / TP / PP / CP / float8 / MXFP8 | **Active.** BSD-3, "extensive development". Has an experimental GRPO recipe (`experiments/rl/simple_grpo_sum_digits.py`) on Monarch. | **Indirectly** — already the trainer inside PRIME-RL and TorchForge. We adopt it transitively, not as a direct dependency. |
| **TorchForge** (`meta-pytorch/forge`) | RL post-training library | **Development paused** per the repo banner; consolidating into TorchTitan. ~685★. | **Pattern reference only.** Lift the Generator/Trainer/Rewarder *shape* but do not depend on the package. |
| **torchchat** (`pytorch/torchchat`) | **Inference / local deployment** | Active for its own scope, but: not a training framework; no RL surface. | **Out of scope.** |
| **OpenEnv** (`meta-pytorch/OpenEnv`) | Environment standard (covered separately) | Active. Already a v0 dependency of the framework. | Already adopted. |

### 3.1 Monarch

| Field | Value |
|---|---|
| **Repo** | https://github.com/meta-pytorch/monarch |
| **License** | BSD-3-Clause |
| **PyPI** | `torchmonarch`; v0.4.1 stable (2026-04-08), v0.5.0 dev wheels published daily through 2026-05-05 |
| **Maturity** | **Experimental but actively shipped.** "Currently in an experimental stage" per the repo's own status note, but with a functioning K8s operator, regularly published dev wheels, and `ProcMesh`/`ActorMesh` APIs stable enough for VeRL backend experiments. |
| **Role in our stack** | **Pure coordination/infra.** It does not train models. It hosts whatever trainer you bring (TRL, VeRL, PRIME-RL, TorchTitan) as `Actor` subclasses on a `ProcMesh`. The `monarch.spmd.SPMDActor` automatically configures `RANK`/`LOCAL_RANK`/`WORLD_SIZE` for any PyTorch-distributed script — i.e., we can lift our existing TRL or PRIME-RL workers into Monarch with minimal change. |
| **Key abstractions** | `ProcMesh` (processes × hosts × GPUs), `ActorMesh` (typed actors with `@endpoint` methods), supervision trees, RDMA buffers, distributed tensors / DTensor integration. Underlying runtime: `hyperactor` (Rust). |
| **Why over Ray** | Tighter PyTorch/DTensor integration; explicit RDMA data plane (Ray uses object store + standard networking); single-controller mental model maps directly to RL post-training (one controller orchestrates Generator + Trainer + Rewarder + Env actors). |
| **Integration cost into Composer Replication** | **~300 LOC + ops**: (a) wrap our PRIME-RL trainer as an `SPMDActor`; (b) wrap our vLLM rollout server as an `Actor` with an `@endpoint generate(prompts)` method; (c) write a single controller script that creates a `ProcMesh`, spawns both meshes, and shuttles `DataProto`-shaped messages; (d) Recipe doc. The ops cost is the harder half — Monarch's K8s operator is new (v0.2.0+). |
| **Risk** | Pre-1.0; API churn possible (e.g., `KubernetesJob.add_mesh` signature changed in v0.5). Mitigation: pin to `torchmonarch==0.4.1` for v0.2 of our framework. |
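
The `RANK`/`LOCAL_RANK`/`WORLD_SIZE` variables mentioned in the table follow the usual PyTorch-distributed convention. A minimal sketch of that mapping for a hosts × GPUs mesh, assuming row-major rank ordering; `spmd_env` is a hypothetical helper for illustration, not Monarch's implementation.

```python
def spmd_env(host_idx, gpu_idx, num_hosts, gpus_per_host):
    """Environment variables for one worker in a hosts x GPUs mesh."""
    if not (0 <= host_idx < num_hosts and 0 <= gpu_idx < gpus_per_host):
        raise ValueError("coordinate outside the mesh")
    return {
        "RANK": str(host_idx * gpus_per_host + gpu_idx),  # global rank
        "LOCAL_RANK": str(gpu_idx),                       # rank within the host
        "WORLD_SIZE": str(num_hosts * gpus_per_host),
    }

env = spmd_env(host_idx=3, gpu_idx=5, num_hosts=4, gpus_per_host=8)
# env == {"RANK": "29", "LOCAL_RANK": "5", "WORLD_SIZE": "32"}
```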

### 3.2 TorchTitan

| Field | Value |
|---|---|
| **Repo** | https://github.com/pytorch/torchtitan |
| **License** | BSD-3-Clause |
| **Maturity** | **Active development** for pretraining; **experimental** for RL. The GRPO experiment (`torchtitan/experiments/rl/simple_grpo_sum_digits.py`) is in `experiments/`, which the repo explicitly disclaims as removable. |
| **Role** | **Training stack only.** Provides FSDP2 (per-parameter sharding), Tensor Parallel (incl. async TP), Pipeline Parallel (zero-bubble), Context Parallel (long-context), `torch.compile`, Float8, MXFP8, DDP, HSDP. |
| **OpenEnv-aware?** | No, but the experimental `RLTrainer` integrates `vLLM` + Monarch actors, which is the same shape PRIME-RL uses. |
| **Why we don't add it directly** | **PRIME-RL already uses TorchTitan-equivalent FSDP2 internals**, and TorchForge's training core was TorchTitan. Adding TorchTitan as a *direct* dependency would mean writing our own RL loop on top of it — that's TorchForge's job, and Meta paused exactly that effort. The right move is to depend on PRIME-RL, which has battle-tested distributed training patterns equivalent to TorchTitan's, and revisit TorchTitan directly only when we genuinely need its experimental zero-bubble PP or MXFP8 paths. |

### 3.3 TorchForge (Paused)

- Repo banner: **"Development paused — LLM training consolidating in TorchTitan."**
- ~685 ★, 100+ open issues, last meaningful release in early 2026.
- Patterns we should still copy:
  - Generator/Trainer/Rewarder ActorMesh decomposition
  - TorchStore-style RDMA weight broadcast
  - Async toggle between sync PPO-like and fully async off-policy
- **We do not add a TorchForge dependency.** Architectural reference only.
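
One way to read the sync/async toggle listed above is as a bound on how many policy versions a rollout may lag behind the trainer. The sketch below assumes that interpretation; `accept_rollout` and `max_staleness` are hypothetical names, not TorchForge APIs.

```python
def accept_rollout(rollout_version, trainer_version, max_staleness):
    """Decide whether a rollout is fresh enough to train on.

    max_staleness=0 reproduces synchronous, on-policy training; larger
    values admit progressively more off-policy rollouts.
    """
    lag = trainer_version - rollout_version
    if lag < 0:
        raise ValueError("rollout is newer than the trainer's policy")
    return lag <= max_staleness
```

With `max_staleness=0` only rollouts from the current policy version pass; with `max_staleness=4` a rollout generated four optimizer steps ago is still accepted.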

### 3.4 torchchat (Out of Scope)

- Inference / local deployment of LLMs (Eager / `torch.compile` / AOT Inductor / ExecuTorch / mobile).
- No training, no RL.
- Mentioned in the brief for completeness; ruled out cleanly.

---

## 4. Comparison Matrix

### 4.1 RL Frameworks

| Framework | License | Last release | Maturity | GRPO | DAPO | Custom-loss hook | OpenEnv fit | Est. integration LOC |
|---|---|---|---|---|---|---|---|---|
| **TRL** (baseline) | Apache-2.0 | Active | Production | ✅ | partial (tricks land per release) | Subclass `GRPOTrainer.compute_loss` | ✅ native (Oct 2025 OpenEnv guide) | already integrated |
| **VeRL** (baseline) | Apache-2.0 | Active | Production | ✅ | ✅ | `core_algos.py` + worker subclass | shim via Ray dataloader | skeleton already in place |
| **OpenRLHF** | Apache-2.0 | v0.9.10 (2026-04-04) | Production | ✅ | ✅ | `openrlhf/models/loss.py` + Trainer subclass; **no config hook** | shim via `agent_func_path` | ~400–600 |
| **PRIME-RL** ⭐ | Apache-2.0 | v0.5.0 (2026-03-30) | Prod-research | ✅ | partial (DPPO+KL variant; not labeled DAPO) | **`CustomLossConfig` import_path — first-class** | ✅ via `verifiers` (OpenEnv-compatible) | **~200–300** |
| **NeMo-Aligner** | Apache-2.0 | Active | Research-leaning | ❌ | ❌ | Megatron model `loss_func` | none; JSONL only | ~800–1,200 |
| **Unsloth (RL)** | Apache-2.0 | Active | Production (SFT) / preview (RL) | ✅ (via TRL patch) | ✅ (via TRL patch) | Loss kernels in the separate `unsloth_zoo` package; effectively unhookable | TRL-shaped | ~1,000+ (forking) |
| **LLaMA-Factory** | Apache-2.0 | Active | Production | ❌ (delegates to EasyR1) | ❌ | TRL `Custom*Trainer` subclass | TRL-shaped | ~400 |
| **DeepSpeed-Chat** | Apache-2.0 | Stale (Aug 2023 features; 2025 only CI fixes) | Maintenance-only | ❌ | ❌ | `DeepSpeedPPOTrainer` subclass | none | ~500 |
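
The matrix can also be read as a filter-then-rank query. A minimal sketch using values transcribed from the table above (GRPO support, presence of a config-driven loss hook, and the lower bound of each LOC estimate); `Candidate` and `shortlist` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    grpo: bool
    config_loss_hook: bool  # custom loss pluggable via config, no fork
    est_loc_low: int        # lower bound of the integration estimate

CANDIDATES = [
    Candidate("OpenRLHF", True, False, 400),
    Candidate("PRIME-RL", True, True, 200),
    Candidate("NeMo-Aligner", False, False, 800),
    Candidate("Unsloth (RL)", True, False, 1000),
    Candidate("LLaMA-Factory", False, False, 400),
    Candidate("DeepSpeed-Chat", False, False, 500),
]

def shortlist(candidates, require_hook=True):
    """Keep GRPO-capable candidates, optionally requiring a config hook,
    and order them by estimated integration cost."""
    kept = [c for c in candidates
            if c.grpo and (c.config_loss_hook or not require_hook)]
    return sorted(kept, key=lambda c: c.est_loc_low)

print([c.name for c in shortlist(CANDIDATES)])
# ['PRIME-RL']
print([c.name for c in shortlist(CANDIDATES, require_hook=False)])
# ['PRIME-RL', 'OpenRLHF', 'Unsloth (RL)']
```

Relaxing the hook requirement brings OpenRLHF back as the next-cheapest option, which matches its role as the fallback substrate.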

### 4.2 Meta PyTorch Stack

| Component | Layer | License | Status | In recommendation? |
|---|---|---|---|---|
| **Monarch** ⭐ | Coordination / actor mesh | BSD-3 | Active (v0.4 GA, v0.5 dev) | **Yes** |
| **TorchTitan** | Training stack | BSD-3 | Active; RL experimental | Indirect (via PRIME-RL) |
| **TorchForge** | RL library | BSD-3 | **Paused** | No — patterns only |
| **torchchat** | Inference / deployment | BSD-3 | Active | No — out of scope |
| **OpenEnv** | Environment standard | (Hub) | Active | Already adopted |

---

## 5. Recommendation Rationale

### 5.1 Why PRIME-RL, not OpenRLHF

OpenRLHF is in many ways the safer pick: more stars, more contributors, more algorithm coverage (it explicitly ships DAPO). The deciding factor is **the shape of our custom loss**.

The Composer Replication Framework's signature contribution is the **three-channel loss**:

1. **RLVR** — tests-pass scalar from the OpenEnv environment.
2. **Composer-style hint-distill (SDPO/OPSD)** — the model self-teaches against its own hint-conditioned roll-outs; needs `teacher_logprobs` aligned to the rollout token grid.
3. **Trace-replay multi-teacher PRM** (the novel bit) — N frozen external teachers' precomputed token-level distributions, replayed against the on-policy rollout.

PRIME-RL's `LossInputs` dataclass already exposes exactly the tensors we need:
```
trainer_logprobs, inference_logprobs, teacher_logprobs, advantages, loss_mask
```
A custom 3-channel loss is roughly:
```python
def composer_three_channel_loss(li: LossInputs, *, hint_weight, replay_weight, replay_logits) -> LossOutputs:
    # grpo_term and kl_term are placeholder helpers that would live in our package.
    rlvr = grpo_term(li.trainer_logprobs, li.inference_logprobs, li.advantages, li.loss_mask)
    hint = kl_term(li.trainer_logprobs, li.teacher_logprobs, li.loss_mask)
    replay = kl_term(li.trainer_logprobs, replay_logits, li.loss_mask)
    total = rlvr + hint_weight * hint + replay_weight * replay
    return LossOutputs(loss=total, metrics={})  # add per-channel metrics as needed
```
We register this with `trainer.loss.type = "custom"` + `import_path` and we're done. No subclassing, no `exec()`-patched template, no Megatron model wrapping.
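
For intuition, the same combination can be written as a toy, dependency-free computation over per-token log-probabilities. This is a sketch under stated assumptions, not PRIME-RL's loss: all helper names are hypothetical, the trace-replay channel takes per-token log-probabilities rather than full logits, the KL terms use the simple `log p_policy - log p_teacher` sample estimate, and the clip range is illustrative.

```python
import math

def masked_mean(values, mask):
    """Mean over positions where the mask is truthy (0.0 if none are)."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0

def grpo_term(trainer_lp, inference_lp, advantages, mask, eps=0.2):
    """Negative clipped surrogate, averaged over unmasked tokens."""
    losses = []
    for t, i, a in zip(trainer_lp, inference_lp, advantages):
        ratio = math.exp(t - i)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        losses.append(-min(ratio * a, clipped * a))
    return masked_mean(losses, mask)

def kl_term(trainer_lp, teacher_lp, mask):
    """Sample estimate of KL(policy || teacher) on the rollout tokens."""
    return masked_mean([t - r for t, r in zip(trainer_lp, teacher_lp)], mask)

def three_channel_loss(trainer_lp, inference_lp, teacher_lp, replay_lp,
                       advantages, mask, hint_weight=0.5, replay_weight=0.25):
    """Weighted sum of the RLVR, hint-distill and trace-replay channels."""
    rlvr = grpo_term(trainer_lp, inference_lp, advantages, mask)
    hint = kl_term(trainer_lp, teacher_lp, mask)
    replay = kl_term(trainer_lp, replay_lp, mask)
    return rlvr + hint_weight * hint + replay_weight * replay
```

Setting `hint_weight` or `replay_weight` to zero switches the corresponding channel off, which is a convenient way to run single-channel ablations with the same function.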

OpenRLHF would require us to (a) add a `ThreeChannelLoss` `nn.Module` to `openrlhf/models/loss.py`, (b) subclass `PPOTrainer` (or equivalent GRPO trainer) to construct it with the right teacher-logprob plumbing, and (c) carry that fork forward. ~2× the LOC, plus a fork to maintain.

A second factor: PRIME-RL's `verifiers` env protocol is a direct precursor of OpenEnv's wire shape (HTTP/WebSocket env servers, typed observations). Our existing OpenEnv-compatible TRL data path translates with a thin adapter. OpenRLHF's `agent_func_path` is more of an escape hatch than a contract.

A third factor: PRIME-RL was *built for decentralized training* (INTELLECT-1/2). Even though our v0.1 stays on a single cluster, the v0.2 multi-DC story drops in cleanly. OpenRLHF is Ray-on-one-cluster by design.

### 5.2 Why Monarch, not TorchTitan or TorchForge

Among the four Meta-stack components in the brief, only one both (a) makes sense for us to add directly and (b) brings genuinely new functionality:

- **TorchForge** is paused — depending on it now is a known dead end.
- **TorchTitan** is already inside PRIME-RL transitively (PRIME-RL uses FSDP2 plus a SHARDCAST weight-broadcast layer that is morally equivalent to what TorchTitan offers). Adding TorchTitan as a *direct* dependency means writing our own RL loop on top of it, which is exactly what TorchForge tried and paused. We get TorchTitan's benefits without owning the integration.
- **torchchat** is for local inference / mobile deployment — out of scope.
- **Monarch** is the unique value: a PyTorch-native actor mesh that lets us replace Ray (PRIME-RL's current orchestration substrate) with something that has explicit RDMA, supervision trees, and ProcMesh/ActorMesh primitives that map directly onto our (Generator, Trainer, Rewarder, EnvServer) topology.

The migration path is incremental:
- **v0.1:** PRIME-RL on Ray (current). Monarch listed as roadmap.
- **v0.2:** Wrap PRIME-RL's Trainer as a `monarch.spmd.SPMDActor`, vLLM Generator as an `Actor` with an `@endpoint generate()`. Switch the orchestrator from `ray.init()` to `this_host().spawn_procs()`.
- Risk-mitigation: pin to `torchmonarch==0.4.1` (the last GA release before v0.5 dev). Keep a Ray fallback path active until v0.2 is stable.
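
The Ray fallback in the last bullet can be as simple as probing which backend is importable at start-up. A minimal sketch, assuming the backends are exposed as top-level modules named `monarch` and `ray`; `pick_backend` is a hypothetical helper.

```python
import importlib.util

def pick_backend(preferred="monarch", fallback="ray"):
    """Return the first importable orchestration backend module name."""
    for name in (preferred, fallback):
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    raise RuntimeError(f"neither {preferred!r} nor {fallback!r} is installed")
```

The controller would branch on the returned name once at start-up, so the rest of the recipe stays backend-agnostic.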

---

## 6. Integration Sketches

### 6.1 PRIME-RL Recipe skeleton

> **Realised in v0.1 (Wave 17 update):** Wave 14b shipped the PRIME-RL
> recipe at `composer_replication/recipes/prime_rl/prime_rl_config.yaml`
> as **YAML** with a different kwarg surface than the TOML sketch below.
> The actual recipe shape:
>
> ```yaml
> # composer_replication/recipes/prime_rl/prime_rl_config.yaml
> model:
>   base: "Qwen/Qwen2.5-0.5B"
>   attn_implementation: "flash_attention_2"
>   dtype: "bfloat16"
> env:
>   protocol: "verifiers"
>   config: { name: "math/gsm8k", split: "train" }
> loss:
>   custom:
>     import_path: "composer_replication.recipes.prime_rl.composer_loss:loss_fn"
>     kwargs:
>       alpha_sdpo:     0.0      # channel 2 deferred in v0
>       beta_dpo:       0.0      # channel 3 out-of-scope for PRIME-RL v0
>       dppo_mask_high: 0.2      # PRIME-RL DPPO convention (NOT textbook PPO)
>       dppo_mask_low:  0.2      #   both must be >= 0 per Field(..., ge=0)
>       adv_tau:        1.0      # advantage normalization
>       kl_tau:         0.04     # KL coefficient
> ```
>
> The realised `loss_fn(inputs, **kwargs)` matches PRIME-RL's
> `LossInputs`/`LossOutputs` interface (read upstream `prime_rl/loss.py`
> for parity verification — Wave 14b's shadow-parity test independently
> restates the formula in
> `composer_replication/recipes/prime_rl/tests/test_composer_loss.py`).
>
> The pre-Wave-14b TOML/`hint_weight`/`replay_weight` sketch below is
> preserved as historical proposal context.

`recipes/composer_v0_prime_rl.toml` (~30 LOC):

```toml
# composer_v0_prime_rl.toml
[model]
name = "Qwen/Qwen3-32B"  # or Kimi-K2.5 when MoE support lands

[data]
env = "swe_bench_lite"   # via verifiers EnvServer; wraps our OpenEnv adapter
batch_size = 64
group_size = 16

[trainer]
algorithm = "grpo"

[trainer.loss]
type = "custom"
import_path = "composer_replication.recipes.prime_rl.composer_loss:loss_fn"
[trainer.loss.kwargs]
hint_weight = 0.5
replay_weight = 0.25
replay_logits_path = "/data/teachers/precomputed_replay.zarr"

[teacher]
model = "Qwen/Qwen3-32B"  # same as policy = self-teacher for hint-distill
hint_template = "composer.hint_v1"

[orchestrator]
sync_mode = "async"
shardcast = true
```

`composer_replication/recipes/prime_rl/composer_loss.py` (~120 LOC). The current Wave 14b
implementation defines `loss_fn(inputs, **kwargs)`; the
`composer_three_channel_loss(li, *, hint_weight, replay_weight, replay_logits_path)`
signature sketched below is the earlier proposal, with its keyword names aligned to the `[trainer.loss.kwargs]` keys of the TOML sketch:

```python
# composer_replication/recipes/prime_rl/composer_loss.py (sketch only);
# the actual signature evolved during Wave 14b. See module docstring for
# the current `loss_fn` contract.
from prime_rl.trainer.rl.loss import LossInputs, LossOutputs

# grpo_surrogate, masked_kl and trace_replay_kl are helpers this module would
# define itself; they are placeholders here, not PRIME-RL APIs.

def composer_three_channel_loss(
    li: LossInputs,
    *,
    hint_weight: float,
    replay_weight: float,
    replay_logits_path: str,  # same name as the TOML kwarg, so it binds by keyword
) -> LossOutputs:
    # 1. RLVR via GRPO surrogate
    rlvr = grpo_surrogate(li.trainer_logprobs, li.inference_logprobs,
                          li.advantages, li.loss_mask)

    # 2. Hint-distill: KL(policy || hint-conditioned teacher)
    hint = masked_kl(li.trainer_logprobs, li.teacher_logprobs, li.loss_mask)

    # 3. Trace-replay: KL(policy || precomputed multi-teacher mixture)
    replay = trace_replay_kl(li.trainer_logprobs, replay_logits_path, li.loss_mask)

    total = rlvr + hint_weight * hint + replay_weight * replay
    return LossOutputs(
        loss=total,
        metrics={"rlvr": rlvr.item(), "hint": hint.item(), "replay": replay.item()},
    )
```

Plus `docs/recipes/composer_v0_prime_rl.md` (~50 LOC) describing data layout, teacher precomputation, and reproducibility hashes.

**Total: ~200 LOC of code + ~30 LOC config + ~50 LOC docs ≈ 280 LOC.**

### 6.2 Monarch wrap-up sketch (v0.2)

```python
# composer_replication/orchestrator/monarch_runner.py  (~120 LOC)
# Sketch only: spawn calls and mesh sizes are schematic; verify them against
# the pinned torchmonarch release before use.
from monarch.actor import Actor, endpoint, this_host

class TrainerActor(Actor):
    @endpoint
    async def step(self, batch): ...

class GeneratorActor(Actor):
    @endpoint
    async def generate(self, prompts): ...

class RewarderActor(Actor):
    @endpoint
    async def score(self, traj): ...

async def main(cfg, env):
    # Target topology: 4x8 GPUs for training, 2x8 for generation, 1x2 for rewards.
    # this_host() only covers the local host; the multi-host allocation step
    # is omitted from this sketch.
    train_mesh = this_host().spawn_procs(per_host={"gpus": 8}).spawn("trainer", TrainerActor)
    gen_mesh   = this_host().spawn_procs(per_host={"gpus": 8}).spawn("generator", GeneratorActor)
    rew_mesh   = this_host().spawn_procs(per_host={"gpus": 2}).spawn("rewarder", RewarderActor)

    for step in range(cfg.steps):  # range() is a plain iterator, so no `async for`
        prompts = await env.batch()
        # .call() waits for every actor's reply; .broadcast() is fire-and-forget.
        traj = await gen_mesh.generate.call(prompts)
        rewards = await rew_mesh.score.call(traj)
        await train_mesh.step.call({"traj": traj, "rewards": rewards})
```

**Total: ~120 LOC controller + ~50 LOC ops (K8s operator manifest) + ~80 LOC recipe doc ≈ 250 LOC.**

---

## 7. Sources

### Primary

- **OpenRLHF** — https://github.com/OpenRLHF/OpenRLHF (README, Releases v0.9.10), Apache-2.0; DeepWiki: `openrlhf/models/loss.py`, `agent_func_path`.
- **PRIME-RL** — https://github.com/PrimeIntellect-ai/prime-rl (README, Releases v0.5.0), Apache-2.0; DeepWiki: `src/prime_rl/trainer/rl/loss.py`, `CustomLossConfig`, `LossInputs`/`LossOutputs`, `verifiers` integration.
- **NeMo-Aligner** — https://github.com/NVIDIA/NeMo-Aligner, Apache-2.0; DeepWiki: PPO/REINFORCE/DPO/RPO; `loss_func` on Megatron model classes.
- **Unsloth** — https://github.com/unslothai/unsloth, README RL section; DeepWiki: `patch_trl_rl_trainers()`, `unsloth_zoo` kernels, DAPO loss-type switch.
- **LLaMA-Factory** — https://github.com/hiyouga/LLaMA-Factory, Apache-2.0; DeepWiki: `CustomPPOTrainer`/`CustomDPOTrainer`, EasyR1 reference for GRPO.
- **DeepSpeed-Chat** — https://github.com/deepspeedai/DeepSpeedExamples (`applications/DeepSpeed-Chat/`), Apache-2.0; DeepWiki: 3-stage PPO, DPO; "Latest News" cutoff Aug 2023; 2025 PRs (#6982, #7015, #7052) confirming maintenance-only mode.
- **Monarch** — https://github.com/meta-pytorch/monarch, BSD-3; PyPI `torchmonarch` v0.4.1 (2026-04-08), v0.5.0 dev wheels through 2026-05-05; DeepWiki: `ProcMesh`, `ActorMesh`, `monarch.spmd.SPMDActor`.
- **TorchTitan** — https://github.com/pytorch/torchtitan, BSD-3; DeepWiki: FSDP2/TP/PP/CP, `torchtitan/experiments/rl/simple_grpo_sum_digits.py`, integration with vLLM and Monarch.
- **TorchForge** — https://github.com/meta-pytorch/forge, BSD-3, repo banner "development paused — consolidating in TorchTitan".
- **torchchat** — https://github.com/pytorch/torchchat, BSD-3; DeepWiki: inference-only (eager / `torch.compile` / AOT Inductor / ExecuTorch).

### Companion repository docs (already present)

- `~/wiki/research/post-training-framework/04-verl-trl.md` — VeRL vs TRL deep dive.
- `~/wiki/research/post-training-framework/03-monarch-torchforge-openenv.md` — full Meta-stack survey.
- `~/wiki/research/post-training-framework/02-diloco-family.md` — DiLoCo / OpenDiLoCo / PRIME-RL / INTELLECT-2.
- `~/wiki/projects/composer-replication-framework.md` — current TL;DR and stage plan.

### Notes on accuracy

- "DAPO" labeling: OpenRLHF and Unsloth both advertise DAPO as a first-class loss type; PRIME-RL implements a DAPO-equivalent (decoupled-clip + KL) but uses the internal name `DPPO+KL` in its default loss. For our purposes this is the same family.
- Last-commit dates and release versions are pulled from GitHub release pages (OpenRLHF, PRIME-RL) and PyPI release history (`torchmonarch`).
- Star counts and contributor counts reflect the snapshots returned by web search at the time of writing (May 2026) and will drift; the relative ordering is stable.