# RL Post-Training Frameworks Landscape & Meta PyTorch Stack Audit

> **Generated:** 2026-05-25
>
> **Scope:** Audit of RL post-training frameworks beyond TRL+VeRL plus Meta's PyTorch agentic stack components, with a recommendation of two additions to the Composer Replication Framework.
>
> **Feeds:** ADR-006 (Algorithm-substrate selection)
>
> **Companion docs:** `~/wiki/research/post-training-framework/04-verl-trl.md`, `~/wiki/research/post-training-framework/03-monarch-torchforge-openenv.md`, `~/wiki/research/post-training-framework/02-diloco-family.md`

---

## TL;DR: Recommendation

| Slot | Pick | Why |
|---|---|---|
| **RL framework #3 (after TRL, VeRL)** | **PRIME-RL (PrimeIntellect-ai/prime-rl)** | First-class `CustomLossConfig` extension point (`trainer.loss.type=custom` + `import_path`), the cleanest place we have to drop our **3-channel loss (RLVR + hint-distill + trace-replay)** without forking. Already uses the `verifiers` env protocol that bridges to OpenEnv. Async, decentralized substrate. Apache-2.0. INTELLECT-2 production receipts. |
| **Infra component (Meta stack)** | **Monarch (`meta-pytorch/monarch`)** as the actor-mesh control plane; **TorchTitan** is *also* tracked as the FSDP2/TP/PP training core but is already the trainer inside both PRIME-RL and TorchForge, so we adopt it transitively. The single net-new dependency is **Monarch**. | Monarch is the only Meta-stack component that is (a) actively shipped (v0.4 GA, v0.5 dev, weekly wheels), (b) decoupled from the now-paused TorchForge, and (c) able to host *any* SPMD trainer (TRL, VeRL, PRIME-RL) as an `ActorMesh`. BSD-3. Replaces Ray when our v0.2 lands. |

**What we do NOT add:**

- OpenRLHF: strong production framework (v0.9.10, 9.3K★, supports DAPO) but its custom-loss path requires modifying `openrlhf/models/loss.py` + a `Trainer` subclass. Strictly worse extension story than PRIME-RL for our specific need (3-channel loss).
- NeMo-Aligner: no GRPO, no DAPO, heavy NeMo/Megatron dependency. Wrong shape.
- Unsloth: TRL wrapper, RL kernels live in closed `unsloth_zoo`. We'd have to fork.
- LLaMA-Factory: TRL wrapper, no GRPO/DAPO (delegates to EasyR1).
- DeepSpeed-Chat: effectively unmaintained for new RL algos since Aug 2023; PPO/DPO only.
- TorchForge: Meta has marked the repo "development paused, consolidating into TorchTitan." Borrow patterns; do not depend on it.
- torchchat: inference / local deployment only; no training. Out of scope.

---

## Table of Contents

1. [Audit Methodology](#1-audit-methodology)
2. [RL Framework Audit](#2-rl-framework-audit)
   1. [OpenRLHF](#21-openrlhf)
   2. [PRIME-RL](#22-prime-rl)
   3. [NeMo-Aligner](#23-nemo-aligner)
   4. [Unsloth (RL)](#24-unsloth-rl)
   5. [LLaMA-Factory](#25-llama-factory)
   6. [DeepSpeed-Chat](#26-deepspeed-chat)
3. [Meta PyTorch Agentic Stack — Infra vs Training Split](#3-meta-pytorch-agentic-stack)
   1. [Monarch (coordination/infra)](#31-monarch)
   2. [TorchTitan (training stack)](#32-torchtitan)
   3. [TorchForge (paused)](#33-torchforge)
   4. [torchchat (out of scope)](#34-torchchat)
4. [Comparison Matrix](#4-comparison-matrix)
5. [Recommendation Rationale](#5-recommendation-rationale)
6. [Integration Sketches](#6-integration-sketches)
7. [Sources](#7-sources)

---

## 1. Audit Methodology

For each framework, we capture five fields that determine whether it can host the Composer Replication Framework's three-channel loss (RLVR + hint-distill + trace-replay) on our existing OpenEnv-compatible TRL data path:

1. **Repo + license + last commit + maturity**: primary GitHub source, license grade for redistribution, recency, and whether the project is *production*, *research*, or *archived*.
2. **Algorithm coverage**: does it ship GRPO and DAPO out of the box? (DAPO matters because Composer-style training inherits its decoupled clip + dynamic sampling fixes for length and std biases.)
3. **Custom-loss extension point**: concrete file/class/config where a custom 3-channel loss can be plugged.
   We strongly prefer a stable public hook over forking.
4. **Integration cost**: rough lines of code needed for a `Recipe` doc + a skeleton `Trainer` subclass that runs end-to-end on a small env.
5. **OpenEnv data-path fit**: does it already consume the OpenEnv contract (typed `reset`/`step`/`close`, MCP tool-calling) directly, or do we have to write a shim?

Primary sources: each repo's `README.md`, official releases page, and DeepWiki audits (where indexed). Secondary checks: PyPI release timelines for Meta packages.

---

## 2. RL Framework Audit

### 2.1 OpenRLHF

| Field | Value |
|---|---|
| **Repo** | https://github.com/OpenRLHF/OpenRLHF |
| **License** | Apache-2.0 |
| **Stars / contributors** | 9,312 ★ / 90 contributors |
| **Latest release** | v0.9.10, 2026-04-04 |
| **Last push** | 2026-04-05 |
| **Maturity** | **Production**: used in many public RLHF runs since 2023; tagline "An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & TIS & vLLM & Ray & Async RL)" |
| **Algorithms** | PPO, GRPO, **DAPO** (release notes; advertised as a primary feature in v0.9.x), REINFORCE++, REINFORCE++-baseline, RLOO, GSPO, Async RL, TIS (truncated importance sampling) |
| **Custom-loss extension point** | `openrlhf/models/loss.py`: `PolicyLoss`, `DPOLoss`, `SFTLoss`, `PairWiseLoss`, `LogExpLoss` are concrete `nn.Module`s. To add a 3-channel loss you would (a) add a new `nn.Module` (e.g. `ThreeChannelLoss`) here, then (b) subclass the relevant `Trainer` (e.g. `PPOTrainer` / a new GRPO-derived trainer) and replace `self.loss_fn`. There is **no config-driven custom-loss hook** equivalent to PRIME-RL's `CustomLossConfig`; you fork or vendor. |
| **Integration cost** | Higher than PRIME-RL. Estimated **~400–600 LOC**: ~150 LOC for a `ThreeChannelLoss` module, ~200 LOC for a `ComposerGRPOTrainer` subclass that routes the three signals (RLVR scalar, hint-distill teacher logprobs, trace-replay teacher logits), ~50 LOC for a `Recipe` doc, plus reward-fn glue. |
| **Data-path fit** | OpenRLHF's input is HF chat templates + a Python reward function or a remote reward URL (`--reward.remote_url`, `--train.agent_func_path`). It does **not** speak the OpenEnv `reset/step` protocol natively, but our existing OpenEnv→TRL adapter could be reused as a callable behind `agent_func_path`. **Medium** lift to wire OpenEnv. |

**Verdict:** Strong, mature, well-funded codebase with the *most* complete algorithm coverage of any candidate. Loses to PRIME-RL only because PRIME-RL has a first-class config-driven custom-loss hook that fits our exact need, and PRIME-RL already has the `verifiers`/OpenEnv shape baked into the orchestrator. We keep OpenRLHF on the radar as a fallback substrate if PRIME-RL's decentralized story is overkill for v0.1.

---

### 2.2 PRIME-RL

| Field | Value |
|---|---|
| **Repo** | https://github.com/PrimeIntellect-ai/prime-rl |
| **License** | Apache-2.0 |
| **Stars / contributors** | 1,398 ★ / 60 contributors |
| **Latest release** | v0.5.0, 2026-03-30 |
| **Last push** | 2026-05-25 (active today) |
| **Maturity** | **Production-research hybrid**: substrate behind INTELLECT-1/2 multi-DC runs; tagline "Async RL Training at Scale". Decentralized DiLoCo-shape compute is its differentiator. |
| **Algorithms** | **GRPO**, GSPO, on-policy distillation with a teacher model. `default_loss_fn` = DPPO + KL (a GRPO variant; similar lineage to DAPO's decoupled-clip idea but the upstream "DAPO" label is not used verbatim). |
| **Custom-loss extension point** | **Best in class.** `src/prime_rl/trainer/rl/loss.py` exposes a `LossInputs`/`LossOutputs` interface and `setup_loss_fn` resolves a config: `trainer.loss.type = "custom"` + `trainer.loss.import_path = "your_pkg.your_module.your_loss_fn"` + optional kwargs. The custom function receives `trainer_logprobs`, `inference_logprobs`, `teacher_logprobs`, `advantages`, `loss_mask`, i.e., the exact tensor inputs needed for a 3-channel loss (RLVR uses `advantages`, hint-distill uses `teacher_logprobs`, trace-replay can be threaded through `kwargs` as a precomputed reference). |
| **Integration cost** | **Lowest.** Estimated **~200–300 LOC total**: ~120 LOC for a `composer_three_channel_loss` function in our package + ~30 LOC of config (`recipes/composer_v0.toml`), ~80 LOC `Recipe` doc. No subclassing required for the loss. A small adapter is needed if we precompute the trace-replay teacher distribution outside the `LossInputs` struct. |
| **Data-path fit** | **Already aligned.** PRIME-RL's orchestrator consumes `verifiers` environments via `vf.EnvServer`. The OpenEnv ↔ verifiers shim is a known small adapter (the `verifiers` library is the Hub-side env runner that OpenEnv's TRL guide already uses). Our existing OpenEnv-compatible TRL data path drops in with a thin wrapper. |

**Verdict:** Best fit for the framework. The combination of (i) config-driven custom loss with the right tensor signatures already present, (ii) verifiers/OpenEnv shape, (iii) decentralized async training that maps to our DiLoCo plans, makes PRIME-RL the substrate of choice for v0.1. **Recommended addition #1.**

---

### 2.3 NeMo-Aligner

| Field | Value |
|---|---|
| **Repo** | https://github.com/NVIDIA/NeMo-Aligner |
| **License** | Apache-2.0 |
| **Maturity** | **Research-leaning production**: NVIDIA-maintained, tied to NeMo/Megatron-LM. Advertised as "early stages of development" in its own README. |
| **Algorithms** | PPO, REINFORCE, RS (Rejection Sampling), DPO, RPO. **No GRPO. No DAPO.** |
| **Custom-loss extension point** | `loss_func` method on Megatron model classes (e.g. `MegatronGPTDPOModel.loss_func`). Requires NeMo model-class subclassing and Megatron-LM familiarity. |
| **Integration cost** | High. Estimated **~800–1,200 LOC** including .nemo conversion of HF weights, Megatron model wrapping, custom Megatron `loss_func`, and a recipe. Plus the operational cost of running on Megatron-LM (Triton kernels, NeMo container). |
| **Data-path fit** | JSONL only; no OpenEnv. We'd write a full env adapter. |

**Verdict:** Wrong shape. No GRPO/DAPO and tightly bound to the NeMo ecosystem. Only relevant if we ever need NVIDIA-supported large-scale Megatron RL, which we don't for the Composer Replication v0.1/v0.2 horizon. **Reject.**

---

### 2.4 Unsloth (RL)

| Field | Value |
|---|---|
| **Repo** | https://github.com/unslothai/unsloth |
| **License** | Apache-2.0 (per public README; not surfaced by DeepWiki snapshot but well-known) |
| **Maturity** | **Production** for SFT and LoRA/QLoRA; **research/preview** for RL. RL support shipped in 2025 as a TRL patcher. |
| **Algorithms** | Wraps TRL → inherits TRL's GRPO; loss-type switch supports `"grpo"`, `"bnpo"`, `"dr_grpo"`, `"dapo"`, `"cispo"`. So **GRPO and DAPO are both available** through the patched-TRL path. |
| **Custom-loss extension point** | Problematic. The actual loss kernels live in `unsloth_zoo` (a *separate* compiled dependency). The patcher (`patch_trl_rl_trainers()`) generates modified TRL trainer classes via `exec()` from string templates. To add a new loss type you would have to (a) modify or fork `unsloth_zoo` to add a kernel, (b) extend `RL_REPLACEMENTS`, and (c) extend the `compute_loss()` switch in the patcher template. **There is no public Python subclass hook that survives the patching.** |
| **Integration cost** | Very high if we want our own loss. Forking `unsloth_zoo` defeats the purpose of using Unsloth (which is the optimized kernels). Estimated ~1,000+ LOC plus an external repo to maintain. |
| **Data-path fit** | TRL-shaped, so OpenEnv via TRL is fine, but only for *stock* TRL losses. Our 3-channel loss does not survive Unsloth's patching. |

**Verdict:** Excellent for memory-efficient SFT and stock-GRPO LoRA. Wrong tool for a custom loss. **Reject** as the substrate; we may still use it as an *optional* QLoRA accelerator inside a stock-GRPO ablation run.

---

### 2.5 LLaMA-Factory

| Field | Value |
|---|---|
| **Repo** | https://github.com/hiyouga/LLaMA-Factory |
| **License** | Apache-2.0 |
| **Maturity** | **Production** for breadth (50+ model families, SFT/DPO/PPO recipes), but RL is a thin TRL wrapper. |
| **Algorithms** | PPO, DPO, KTO, ORPO, SimPO via `Custom*Trainer` subclasses of the corresponding `trl.*Trainer` classes. **No GRPO. No DAPO** in the repo itself; the README points to **EasyR1** (an external GRPO framework) for those. |
| **Custom-loss extension point** | `compute_preference_loss` switch on `CustomDPOTrainer` (selects `sigmoid` / `hinge` / `ipo` / `kto_pair` / `orpo` / `simpo`). For PPO, you would subclass `CustomPPOTrainer` → which is `trl.PPOTrainer`. Effectively the same extension story as plain TRL, with a configuration layer on top. |
| **Integration cost** | Moderate, ~400 LOC, but you are essentially using TRL through one extra layer. |
| **Data-path fit** | Text/dataset-shaped, not OpenEnv-aware. Same OpenEnv-via-TRL story. |

**Verdict:** Useful as a multi-model SFT laboratory but does not move the ball for our RL-side requirements. **Reject** as substrate; we already have TRL.

---

### 2.6 DeepSpeed-Chat

| Field | Value |
|---|---|
| **Repo** | https://github.com/deepspeedai/DeepSpeedExamples (the `applications/DeepSpeed-Chat/` subtree) |
| **License** | Apache-2.0 |
| **Maturity** | **Effectively stale.** The README's "Latest News" cuts off in August 2023. CI patches in 2025 (e.g., #6982, #7015, #7052) are dependency-pinning fixes, not feature work. The roadmap to "generalize DeepSpeed-RLHF abstraction for a wider range of RL algorithms" has not landed. |
| **Algorithms** | PPO (3-stage RLHF) + DPO. **No GRPO. No DAPO.** |
| **Custom-loss extension point** | `DeepSpeedPPOTrainer.train_rlhf` / `actor_loss_fn` / `critic_loss_fn`. Editable but not config-hooked. |
| **Integration cost** | Moderate, but you inherit a frozen architecture. ~500 LOC. |
| **Data-path fit** | Prompt-dataset-shaped; no OpenEnv. |

**Verdict:** Pioneering for its time, no longer competitive on algorithm coverage. **Reject.**

---

## 3. Meta PyTorch Agentic Stack: Infra vs Training Split

The brief asked specifically to **distinguish coordination/infra from training-stack** components. The answer is:

| Component | Layer | Status (May 2026) | In our framework? |
|---|---|---|---|
| **Monarch** (`meta-pytorch/monarch`) | **Coordination / Infra**: actor mesh, RDMA data plane, supervision trees | **Active.** v0.4 GA (2026-03-26), v0.5 dev wheels daily, BSD-3 | **Yes, recommended addition.** |
| **TorchTitan** (`pytorch/torchtitan`) | **Training stack**: FSDP2 / TP / PP / CP / float8 / MXFP8 | **Active.** BSD-3, "extensive development". Has an experimental GRPO recipe (`experiments/rl/simple_grpo_sum_digits.py`) on Monarch. | **Indirectly**: already the trainer inside PRIME-RL and TorchForge. We adopt it transitively, not as a direct dependency. |
| **TorchForge** (`meta-pytorch/forge`) | RL post-training library | **Development paused** per the repo banner; consolidating into TorchTitan. ~685★. | **Pattern reference only.** Lift the Generator/Trainer/Rewarder *shape* but do not depend on the package. |
| **torchchat** (`pytorch/torchchat`) | **Inference / local deployment** | Active for its own scope, but: not a training framework; no RL surface. | **Out of scope.** |
| **OpenEnv** (`meta-pytorch/OpenEnv`) | Environment standard (covered separately) | Active. Already a v0 dependency of the framework. | Already adopted. |

### 3.1 Monarch

| Field | Value |
|---|---|
| **Repo** | https://github.com/meta-pytorch/monarch |
| **License** | BSD-3-Clause |
| **PyPI** | `torchmonarch`; v0.4.1 stable (2026-04-08), v0.5.0 dev wheels published daily through 2026-05-05 |
| **Maturity** | **Experimental but actively shipped.** "Currently in an experimental stage" per the repo's own status note, but with a functioning K8s operator, weekly wheels, and `ProcMesh`/`ActorMesh` APIs stable enough for VeRL backend experiments. |
| **Role in our stack** | **Pure coordination/infra.** It does not train models. It hosts whatever trainer you bring (TRL, VeRL, PRIME-RL, TorchTitan) as `Actor` subclasses on a `ProcMesh`. The `monarch.spmd.SPMDActor` automatically configures `RANK`/`LOCAL_RANK`/`WORLD_SIZE` for any PyTorch-distributed script, i.e., we can lift our existing TRL or PRIME-RL workers into Monarch with minimal change. |
| **Key abstractions** | `ProcMesh` (processes × hosts × GPUs), `ActorMesh` (typed actors with `@endpoint` methods), supervision trees, RDMA buffers, distributed tensors / DTensor integration. Underlying runtime: `hyperactor` (Rust). |
| **Why over Ray** | Tighter PyTorch/DTensor integration; explicit RDMA data plane (Ray uses object store + standard networking); single-controller mental model maps directly to RL post-training (one controller orchestrates Generator + Trainer + Rewarder + Env actors). |
| **Integration cost into Composer Replication** | **~300 LOC + ops**: (a) wrap our PRIME-RL trainer as an `SPMDActor`; (b) wrap our vLLM rollout server as an `Actor` with an `@endpoint generate(prompts)` method; (c) write a single controller script that creates a `ProcMesh`, spawns both meshes, and shuttles `DataProto`-shaped messages; (d) Recipe doc. The ops cost is the harder half: Monarch's K8s operator is new (v0.2.0+). |
| **Risk** | Pre-1.0; API churn possible (e.g., `KubernetesJob.add_mesh` signature changed in v0.5). Mitigation: pin to `torchmonarch==0.4.1` for v0.2 of our framework. |

### 3.2 TorchTitan

| Field | Value |
|---|---|
| **Repo** | https://github.com/pytorch/torchtitan |
| **License** | BSD-3-Clause |
| **Maturity** | **Active development** for pretraining; **experimental** for RL. The GRPO experiment (`torchtitan/experiments/rl/simple_grpo_sum_digits.py`) is in `experiments/`, which the repo explicitly disclaims as removable. |
| **Role** | **Training stack only.** Provides FSDP2 (per-parameter sharding), Tensor Parallel (incl. async TP), Pipeline Parallel (zero-bubble), Context Parallel (long-context), `torch.compile`, Float8, MXFP8, DDP, HSDP. |
| **OpenEnv-aware?** | No, but the experimental `RLTrainer` integrates `vLLM` + Monarch actors, which is the same shape PRIME-RL uses. |
| **Why we don't add it directly** | **PRIME-RL already uses TorchTitan-equivalent FSDP2 internals**, and TorchForge's training core was TorchTitan. Adding TorchTitan as a *direct* dependency would mean writing our own RL loop on top of it; that's TorchForge's job, and Meta paused exactly that effort. The right move is to depend on PRIME-RL, which has battle-tested distributed training patterns equivalent to TorchTitan's, and revisit TorchTitan directly only when we genuinely need its experimental zero-bubble PP or MXFP8 paths. |

### 3.3 TorchForge (Paused)

- Repo banner: **"Development paused, LLM training consolidating in TorchTitan."**
- ~685 ★, 100+ open issues, last meaningful release in early 2026.
- Patterns we should still copy:
  - Generator/Trainer/Rewarder ActorMesh decomposition
  - TorchStore-style RDMA weight broadcast
  - Async toggle between sync PPO-like and fully async off-policy
- **We do not add a TorchForge dependency.** Architectural reference only.
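
The Generator/Trainer/Rewarder split and the sync/async toggle listed above can be illustrated without any TorchForge or Monarch dependency. The following is a minimal sketch in plain `asyncio`; the class names, the `fully_async` flag, and the staleness bookkeeping are hypothetical stand-ins chosen for illustration, not TorchForge's API. It only shows why the fully async mode trains on rollouts from an older policy version.

```python
import asyncio


class Generator:
    """Produces rollouts tagged with the policy version it last received."""

    def __init__(self):
        self.policy_version = 0

    async def generate(self, prompt):
        await asyncio.sleep(0)  # stand-in for inference latency
        return {"prompt": prompt, "text": prompt.upper(), "version": self.policy_version}


class Rewarder:
    async def score(self, rollout):
        # Toy reward: 1.0 for any non-empty completion.
        return 1.0 if rollout["text"] else 0.0


class Trainer:
    def __init__(self):
        self.version = 0
        self.staleness = []  # trainer version minus rollout version, per step

    async def step(self, rollout, reward):
        self.staleness.append(self.version - rollout["version"])
        self.version += 1  # pretend one optimizer step was taken


async def run(prompts, *, fully_async):
    gen, rew, trainer = Generator(), Rewarder(), Trainer()
    if fully_async:
        # Off-policy: all rollouts are produced up front with version-0 weights.
        rollouts = await asyncio.gather(*(gen.generate(p) for p in prompts))
        for rollout in rollouts:
            await trainer.step(rollout, await rew.score(rollout))
    else:
        # Sync, PPO-like: push fresh weights to the generator after every step.
        for p in prompts:
            rollout = await gen.generate(p)
            await trainer.step(rollout, await rew.score(rollout))
            gen.policy_version = trainer.version
    return trainer.staleness


sync_lag = asyncio.run(run(["a", "b", "c"], fully_async=False))
async_lag = asyncio.run(run(["a", "b", "c"], fully_async=True))
print(sync_lag, async_lag)
```

In the sync mode every rollout matches the trainer's current version, while in the async mode the lag grows with each step. That lag is what an importance-weighted or clipped loss has to absorb in a real system.
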
### 3.4 torchchat (Out of Scope)

- Inference / local deployment of LLMs (Eager / `torch.compile` / AOT Inductor / ExecuTorch / mobile).
- No training, no RL.
- Mentioned in the brief for completeness; ruled out cleanly.

---

## 4. Comparison Matrix

### 4.1 RL Frameworks

| Framework | License | Last release | Maturity | GRPO | DAPO | Custom-loss hook | OpenEnv fit | Est. integration LOC |
|---|---|---|---|---|---|---|---|---|
| **TRL** (baseline) | Apache-2.0 | Active | Production | ✅ | partial (tricks land per release) | Subclass `GRPOTrainer.compute_loss` | ✅ native (Oct 2025 OpenEnv guide) | already integrated |
| **VeRL** (baseline) | Apache-2.0 | Active | Production | ✅ | ✅ | `core_algos.py` + worker subclass | shim via Ray dataloader | already skeleton |
| **OpenRLHF** | Apache-2.0 | v0.9.10 (2026-04-04) | Production | ✅ | ✅ | `openrlhf/models/loss.py` + Trainer subclass; **no config hook** | shim via `agent_func_path` | ~400–600 |
| **PRIME-RL** ⭐ | Apache-2.0 | v0.5.0 (2026-03-30) | Prod-research | ✅ | partial (DPPO+KL variant; not labeled DAPO) | **`CustomLossConfig` import_path, first-class** | ✅ via `verifiers` (OpenEnv-compatible) | **~200–300** |
| **NeMo-Aligner** | Apache-2.0 | Active | Research-leaning | ❌ | ❌ | Megatron model `loss_func` | none; JSONL only | ~800–1,200 |
| **Unsloth (RL)** | Apache-2.0 | Active | Production (SFT) / preview (RL) | ✅ (via TRL patch) | ✅ (via TRL patch) | Loss kernels in closed `unsloth_zoo`; effectively unhookable | TRL-shaped | ~1,000+ (forking) |
| **LLaMA-Factory** | Apache-2.0 | Active | Production | ❌ (delegates to EasyR1) | ❌ | TRL `Custom*Trainer` subclass | TRL-shaped | ~400 |
| **DeepSpeed-Chat** | Apache-2.0 | Stale (Aug 2023 features; 2025 only CI fixes) | Effectively maintenance-only | ❌ | ❌ | `DeepSpeedPPOTrainer` subclass | none | ~500 |

### 4.2 Meta PyTorch Stack

| Component | Layer | License | Status | In recommendation? |
|---|---|---|---|---|
| **Monarch** ⭐ | Coordination / actor mesh | BSD-3 | Active (v0.4 GA, v0.5 dev) | **Yes** |
| **TorchTitan** | Training stack | BSD-3 | Active; RL experimental | Indirect (via PRIME-RL) |
| **TorchForge** | RL library | BSD-3 | **Paused** | No, patterns only |
| **torchchat** | Inference / deployment | BSD-3 | Active | No, out of scope |
| **OpenEnv** | Environment standard | (Hub) | Active | Already adopted |

---

## 5. Recommendation Rationale

### 5.1 Why PRIME-RL, not OpenRLHF

OpenRLHF is in many ways the safer pick: more stars, more contributors, more algorithm coverage (it explicitly ships DAPO). The deciding factor is **the shape of our custom loss**.

The Composer Replication Framework's signature contribution is the **three-channel reward**:

1. **RLVR**: tests-pass scalar from the OpenEnv environment.
2. **Composer-style hint-distill (SDPO/OPSD)**: the model self-teaches against its own hint-conditioned roll-outs; needs `teacher_logprobs` aligned to the rollout token grid.
3. **Trace-replay multi-teacher PRM** (the novel bit): N frozen external teachers' precomputed token-level distributions, replayed against the on-policy rollout.

PRIME-RL's `LossInputs` dataclass already exposes exactly the tensors we need:

```
trainer_logprobs, inference_logprobs, teacher_logprobs, advantages, loss_mask
```

A custom 3-channel loss is roughly:

```python
# Sketch only: grpo_term and kl_term are placeholder helpers, not PRIME-RL functions.
def composer_three_channel_loss(li: LossInputs, *, hint_weight, replay_weight, replay_logits) -> LossOutputs:
    rlvr = grpo_term(li.trainer_logprobs, li.inference_logprobs, li.advantages, li.loss_mask)
    hint = kl_term(li.trainer_logprobs, li.teacher_logprobs, li.loss_mask)
    replay = kl_term(li.trainer_logprobs, replay_logits, li.loss_mask)
    total = rlvr + hint_weight * hint + replay_weight * replay
    return LossOutputs(loss=total)  # remaining fields (e.g. metrics) elided in this sketch
```

We register this with `trainer.loss.type = "custom"` + `import_path` and we're done. No subclassing, no `exec()`-patched template, no Megatron model wrapping.
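
To make the arithmetic of the three channels concrete, here is a self-contained toy version that runs on plain Python lists instead of tensors. Everything in it is an assumption made for illustration: `ToyLossInputs` only borrows the field names listed above, the RLVR term is a generic PPO-style clipped surrogate (not PRIME-RL's default DPPO + KL loss), and both distillation terms use the simple sampled-token estimate of KL(policy || reference), the masked mean of `log p - log q`, which presumes the reference log-probabilities are for the same sampled tokens.

```python
import math
from dataclasses import dataclass


@dataclass
class ToyLossInputs:
    # Hypothetical stand-in: same field names as the list above, flat lists per token.
    trainer_logprobs: list[float]
    inference_logprobs: list[float]
    teacher_logprobs: list[float]
    advantages: list[float]
    loss_mask: list[int]


def masked_mean(values, mask):
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / max(len(kept), 1)


def clipped_surrogate(li, eps=0.2):
    """Generic clipped policy-gradient loss (to be minimised)."""
    terms = []
    for lp, ilp, adv in zip(li.trainer_logprobs, li.inference_logprobs, li.advantages):
        ratio = math.exp(lp - ilp)  # importance ratio trainer / inference
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        terms.append(-min(ratio * adv, clipped * adv))
    return masked_mean(terms, li.loss_mask)


def sampled_kl(policy_logprobs, ref_logprobs, mask):
    """Sampled-token estimate of KL(policy || ref); can be negative on small batches."""
    return masked_mean([p - q for p, q in zip(policy_logprobs, ref_logprobs)], mask)


def three_channel_loss(li, *, hint_weight, replay_weight, replay_logprobs):
    rlvr = clipped_surrogate(li)
    hint = sampled_kl(li.trainer_logprobs, li.teacher_logprobs, li.loss_mask)
    replay = sampled_kl(li.trainer_logprobs, replay_logprobs, li.loss_mask)
    total = rlvr + hint_weight * hint + replay_weight * replay
    return {"loss": total, "rlvr": rlvr, "hint": hint, "replay": replay}


li = ToyLossInputs(
    trainer_logprobs=[-1.0, -2.0, -0.5],
    inference_logprobs=[-1.0, -2.0, -0.5],
    teacher_logprobs=[-1.5, -2.0, -0.5],
    advantages=[1.0, -1.0, 1.0],
    loss_mask=[1, 1, 0],  # third token is masked out of every channel
)
out = three_channel_loss(li, hint_weight=0.5, replay_weight=0.25,
                         replay_logprobs=[-1.0, -3.0, -0.5])
print(out)
```

The point of the sketch is the shape of the combination: one shared mask, one scalar per channel, and two weights. A tensor implementation would swap the list comprehensions for masked reductions and would likely want a lower-variance KL estimator.
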
OpenRLHF would require us to (a) add a `ThreeChannelLoss` `nn.Module` to `openrlhf/models/loss.py`, (b) subclass `PPOTrainer` (or equivalent GRPO trainer) to construct it with the right teacher-logprob plumbing, and (c) carry that fork forward. ~2× the LOC, plus a fork to maintain.

A second factor: PRIME-RL's `verifiers` env protocol is a direct precursor of OpenEnv's wire shape (HTTP/WebSocket env servers, typed observations). Our existing OpenEnv-compatible TRL data path translates with a thin adapter. OpenRLHF's `agent_func_path` is more of an escape hatch than a contract.

A third factor: PRIME-RL was *built for decentralized training* (INTELLECT-1/2). Even though our v0.1 stays on a single cluster, the v0.2 multi-DC story drops in cleanly. OpenRLHF is Ray-on-one-cluster by design.

### 5.2 Why Monarch, not TorchTitan or TorchForge

Among the four Meta-stack components in the brief, only one is both (a) ours to add and (b) genuinely new functionality:

- **TorchForge** is paused; depending on it now is a known dead end.
- **TorchTitan** is already inside PRIME-RL transitively (PRIME-RL uses FSDP2 plus a SHARDCAST weight-broadcast layer that is morally equivalent to what TorchTitan offers). Adding TorchTitan as a *direct* dependency means writing our own RL loop on top of it, which is exactly what TorchForge tried and paused. We get TorchTitan's benefits without owning the integration.
- **torchchat** is for local inference / mobile deployment, so it is out of scope.
- **Monarch** is the unique value: a PyTorch-native actor mesh that lets us replace Ray (PRIME-RL's current orchestration substrate) with something that has explicit RDMA, supervision trees, and ProcMesh/ActorMesh primitives that map directly onto our (Generator, Trainer, Rewarder, EnvServer) topology.

The migration path is incremental:

- **v0.1:** PRIME-RL on Ray (current). Monarch listed as roadmap.
- **v0.2:** Wrap PRIME-RL's Trainer as a `monarch.spmd.SPMDActor`, vLLM Generator as an `Actor` with an `@endpoint generate()`. Switch the orchestrator from `ray.init()` to `this_host().spawn_procs()`.
- Risk-mitigation: pin to `torchmonarch==0.4.1` (the last GA release before v0.5 dev). Keep a Ray fallback path active until v0.2 is stable.

---

## 6. Integration Sketches

### 6.1 PRIME-RL Recipe skeleton

> **Realised in v0.1 (Wave 17 update):** Wave 14b shipped the PRIME-RL
> recipe at `composer_replication/recipes/prime_rl/prime_rl_config.yaml`
> as **YAML** with a different kwarg surface than the TOML sketch below.
> The actual recipe shape:
>
> ```yaml
> # composer_replication/recipes/prime_rl/prime_rl_config.yaml
> model:
>   base: "Qwen/Qwen2.5-0.5B"
>   attn_implementation: "flash_attention_2"
>   dtype: "bfloat16"
> env:
>   protocol: "verifiers"
>   config: { name: "math/gsm8k", split: "train" }
> loss:
>   custom:
>     import_path: "composer_replication.recipes.prime_rl.composer_loss:loss_fn"
>     kwargs:
>       alpha_sdpo: 0.0       # channel 2 deferred in v0
>       beta_dpo: 0.0         # channel 3 out-of-scope for PRIME-RL v0
>       dppo_mask_high: 0.2   # PRIME-RL DPPO convention (NOT textbook PPO)
>       dppo_mask_low: 0.2    # both must be >= 0 per Field(..., ge=0)
>       adv_tau: 1.0          # advantage normalization
>       kl_tau: 0.04          # KL coefficient
> ```
>
> The realised `loss_fn(inputs, **kwargs)` matches PRIME-RL's
> `LossInputs`/`LossOutputs` interface (read upstream `prime_rl/loss.py`
> for parity verification; Wave 14b's shadow-parity test independently
> restates the formula in
> `composer_replication/recipes/prime_rl/tests/test_composer_loss.py`).
>
> The pre-Wave-14b TOML/`hint_weight`/`replay_weight` sketch below is
> preserved as historical proposal context.

`recipes/composer_v0_prime_rl.toml` (~30 LOC):

```toml
# composer_v0_prime_rl.toml
[model]
name = "Qwen/Qwen3-32B"  # or Kimi-K2.5 when MoE support lands

[data]
env = "swe_bench_lite"   # via verifiers EnvServer; wraps our OpenEnv adapter
batch_size = 64
group_size = 16

[trainer]
algorithm = "grpo"

[trainer.loss]
type = "custom"
import_path = "composer_replication.recipes.prime_rl.composer_loss:loss_fn"

[trainer.loss.kwargs]
hint_weight = 0.5
replay_weight = 0.25
replay_logits_path = "/data/teachers/precomputed_replay.zarr"

[teacher]
model = "Qwen/Qwen3-32B"  # same as policy = self-teacher for hint-distill
hint_template = "composer.hint_v1"

[orchestrator]
sync_mode = "async"
shardcast = true
```

`composer_replication/recipes/prime_rl/composer_loss.py` (~120 LOC; current Wave 14b implementation defines `loss_fn(inputs, **kwargs)` rather than the `composer_three_channel_loss(li, *, hint_weight, replay_weight, replay_logits_path)` signature sketched below):

```python
# composer_replication/recipes/prime_rl/composer_loss.py (sketch only;
# the actual signature evolved during Wave 14b). See module docstring for
# the current `loss_fn` contract.
from prime_rl.trainer.rl.loss import LossInputs, LossOutputs

# grpo_surrogate, masked_kl, and trace_replay_kl are helpers this module
# would define; they are not part of PRIME-RL.


def composer_three_channel_loss(
    li: LossInputs,
    *,
    hint_weight: float,
    replay_weight: float,
    replay_logits_path: str,  # keyword name matches [trainer.loss.kwargs] in the TOML
) -> LossOutputs:
    # 1. RLVR via GRPO surrogate
    rlvr = grpo_surrogate(li.trainer_logprobs, li.inference_logprobs,
                          li.advantages, li.loss_mask)
    # 2. Hint-distill: KL(policy || hint-conditioned teacher)
    hint = masked_kl(li.trainer_logprobs, li.teacher_logprobs, li.loss_mask)
    # 3. Trace-replay: KL(policy || precomputed multi-teacher mixture)
    replay = trace_replay_kl(li.trainer_logprobs, replay_logits_path, li.loss_mask)
    total = rlvr + hint_weight * hint + replay_weight * replay
    return LossOutputs(
        loss=total,
        metrics={"rlvr": rlvr.item(), "hint": hint.item(), "replay": replay.item()},
    )
```

Plus `docs/recipes/composer_v0_prime_rl.md` (~50 LOC) describing data layout, teacher precomputation, and reproducibility hashes.
**Total: ~200 LOC of code + ~30 LOC config + ~50 LOC docs ≈ 280 LOC.**

### 6.2 Monarch wrap-up sketch (v0.2)

```python
# composer_replication/orchestrator/monarch_runner.py (~120 LOC)
# Sketch only: the spawn_procs(...) arguments and the endpoint call style are
# schematic and have not been checked against a pinned torchmonarch release.
from monarch.actor import Actor, endpoint
from monarch.proc_mesh import this_host, ProcMesh


class TrainerActor(Actor):
    @endpoint
    async def step(self, batch): ...


class GeneratorActor(Actor):
    @endpoint
    async def generate(self, prompts): ...


class RewarderActor(Actor):
    @endpoint
    async def score(self, traj): ...


async def main(cfg, env):
    # `env` is the environment client that yields prompt batches; it is passed
    # in explicitly so the controller has no hidden globals.
    train_mesh = await this_host().spawn_procs(TrainerActor, hosts=4, gpus=8)
    gen_mesh = await this_host().spawn_procs(GeneratorActor, hosts=2, gpus=8)
    rew_mesh = await this_host().spawn_procs(RewarderActor, hosts=1, gpus=2)

    # range() is a plain iterable, so this is a regular `for`, not `async for`.
    for step in range(cfg.steps):
        prompts = await env.batch()
        traj = await gen_mesh.generate.broadcast(prompts)
        rewards = await rew_mesh.score.broadcast(traj)
        await train_mesh.step.broadcast({"traj": traj, "rewards": rewards})
```

**Total: ~120 LOC controller + ~50 LOC ops (K8s operator manifest) + ~80 LOC recipe doc ≈ 250 LOC.**

---

## 7. Sources

### Primary

- **OpenRLHF**: https://github.com/OpenRLHF/OpenRLHF (README, Releases v0.9.10), Apache-2.0; DeepWiki: `openrlhf/models/loss.py`, `agent_func_path`.
- **PRIME-RL**: https://github.com/PrimeIntellect-ai/prime-rl (README, Releases v0.5.0), Apache-2.0; DeepWiki: `src/prime_rl/trainer/rl/loss.py`, `CustomLossConfig`, `LossInputs`/`LossOutputs`, `verifiers` integration.
- **NeMo-Aligner**: https://github.com/NVIDIA/NeMo-Aligner, Apache-2.0; DeepWiki: PPO/REINFORCE/DPO/RPO; `loss_func` on Megatron model classes.
- **Unsloth**: https://github.com/unslothai/unsloth, README RL section; DeepWiki: `patch_trl_rl_trainers()`, `unsloth_zoo` kernels, DAPO loss-type switch.
- **LLaMA-Factory**: https://github.com/hiyouga/LLaMA-Factory, Apache-2.0; DeepWiki: `CustomPPOTrainer`/`CustomDPOTrainer`, EasyR1 reference for GRPO.
- **DeepSpeed-Chat**: https://github.com/deepspeedai/DeepSpeedExamples (`applications/DeepSpeed-Chat/`), Apache-2.0; DeepWiki: 3-stage PPO, DPO; "Latest News" cutoff Aug 2023; 2025 PRs (#6982, #7015, #7052) confirming maintenance-only mode.
- **Monarch**: https://github.com/meta-pytorch/monarch, BSD-3; PyPI `torchmonarch` v0.4.1 (2026-04-08), v0.5.0 dev wheels through 2026-05-05; DeepWiki: `ProcMesh`, `ActorMesh`, `monarch.spmd.SPMDActor`.
- **TorchTitan**: https://github.com/pytorch/torchtitan, BSD-3; DeepWiki: FSDP2/TP/PP/CP, `torchtitan/experiments/rl/simple_grpo_sum_digits.py`, integration with vLLM and Monarch.
- **TorchForge**: https://github.com/meta-pytorch/forge, BSD-3, repo banner "development paused, consolidating in TorchTitan".
- **torchchat**: https://github.com/pytorch/torchchat, BSD-3; DeepWiki: inference-only (eager / `torch.compile` / AOT Inductor / ExecuTorch).

### Companion repository docs (already present)

- `~/wiki/research/post-training-framework/04-verl-trl.md`: VeRL vs TRL deep dive.
- `~/wiki/research/post-training-framework/03-monarch-torchforge-openenv.md`: full Meta-stack survey.
- `~/wiki/research/post-training-framework/02-diloco-family.md`: DiLoCo / OpenDiLoCo / PRIME-RL / INTELLECT-2.
- `~/wiki/projects/composer-replication-framework.md`: current TL;DR and stage plan.

### Notes on accuracy

- "DAPO" labeling: OpenRLHF and Unsloth both advertise DAPO as a first-class loss type; PRIME-RL implements a DAPO-equivalent (decoupled-clip + KL) but uses the internal name `DPPO+KL` in its default loss. For our purposes this is the same family.
- Last-commit dates and release versions are pulled from GitHub release pages (OpenRLHF, PRIME-RL) and PyPI release history (`torchmonarch`).
- Star counts and contributor counts reflect the snapshots returned by web search at the time of writing (May 2026) and will drift; the relative ordering is stable.
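
The first note above treats decoupled clipping as the defining trait of the DAPO family. As a small numeric illustration, the sketch below evaluates a generic PPO-style per-token objective with separate lower and upper clip bounds. It is not code from OpenRLHF, Unsloth, or PRIME-RL; the function name and the 0.2 / 0.28 bounds are illustrative assumptions.

```python
def clipped_objective(ratio, advantage, eps_low, eps_high):
    """Per-token clipped objective with independent lower/upper clip ranges."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


# A token whose probability rose by 25% and that carries a positive advantage.
symmetric = clipped_objective(1.25, 1.0, eps_low=0.2, eps_high=0.2)
decoupled = clipped_objective(1.25, 1.0, eps_low=0.2, eps_high=0.28)

# A token whose probability fell by 30% and that carries a negative advantage.
symmetric_neg = clipped_objective(0.7, -1.0, eps_low=0.2, eps_high=0.2)
decoupled_neg = clipped_objective(0.7, -1.0, eps_low=0.2, eps_high=0.28)

print(symmetric, decoupled, symmetric_neg, decoupled_neg)
```

With the symmetric range the first token is clipped at 1.2 and stops contributing gradient, while the wider upper bound leaves it unclipped at 1.25. The lower bound is unchanged, so the negative-advantage token is treated identically in both cases.
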