	# RL Post-Training Frameworks Landscape & Meta PyTorch Stack Audit

	> Generated: 2026-05-25
	> Scope: Audit of RL post-training frameworks beyond TRL+VeRL plus Meta's PyTorch agentic stack components, with a recommendation of two additions to the Composer Replication Framework.
	> Feeds: ADR-006 (Algorithm-substrate selection)
	> Companion docs: `~/wiki/research/post-training-framework/04-verl-trl.md`, `~/wiki/research/post-training-framework/03-monarch-torchforge-openenv.md`, `~/wiki/research/post-training-framework/02-diloco-family.md`

	---

	## TL;DR — Recommendation

	\| Slot \| Pick \| Why \|
	\|---\|---\|---\|
	\| RL framework #3 (after TRL, VeRL) \| PRIME-RL (PrimeIntellect-ai/prime-rl) \| First-class `CustomLossConfig` extension point (`trainer.loss.type=custom` + `import_path`) — the cleanest place we have to drop our 3-channel loss (RLVR + hint-distill + trace-replay) without forking. Already uses the `verifiers` env protocol that bridges to OpenEnv. Async, decentralized substrate. Apache-2.0. INTELLECT-2 production receipts. \|
	\| Infra component (Meta stack) \| Monarch (`meta-pytorch/monarch`) as the actor-mesh control plane; TorchTitan is also tracked as the FSDP2/TP/PP training core but is already the trainer inside both PRIME-RL and TorchForge, so we adopt it transitively. The single net-new dependency is Monarch. \| Monarch is the only Meta-stack component that is (a) actively shipped (v0.4 GA, v0.5 dev wheels published daily), (b) decoupled from the now-paused TorchForge, and (c) able to host any SPMD trainer (TRL, VeRL, PRIME-RL) as an `ActorMesh`. BSD-3. Replaces Ray when our v0.2 lands. \|

	What we do NOT add:
	- OpenRLHF — strong production framework (v0.9.10, 9.3K★, supports DAPO) but its custom-loss path requires modifying `openrlhf/models/loss.py` + a `Trainer` subclass. Strictly worse extension story than PRIME-RL for our specific need (3-channel loss).
	- NeMo-Aligner — no GRPO, no DAPO, heavy NeMo/Megatron dependency. Wrong shape.
	- Unsloth — TRL wrapper, RL kernels live in closed `unsloth_zoo`. We'd have to fork.
	- LLaMA-Factory — TRL wrapper, no GRPO/DAPO (delegates to EasyR1).
	- DeepSpeed-Chat — effectively unmaintained for new RL algos since Aug 2023; PPO/DPO only.
	- TorchForge — Meta has marked the repo "development paused, consolidating into TorchTitan." Borrow patterns; do not depend on it.
	- torchchat — inference / local deployment only; no training. Out of scope.

	---

	## Table of Contents

	1. [Audit Methodology](#1-audit-methodology)
	2. [RL Framework Audit](#2-rl-framework-audit)
	   1. [OpenRLHF](#21-openrlhf)
	   2. [PRIME-RL](#22-prime-rl)
	   3. [NeMo-Aligner](#23-nemo-aligner)
	   4. [Unsloth (RL)](#24-unsloth-rl)
	   5. [LLaMA-Factory](#25-llama-factory)
	   6. [DeepSpeed-Chat](#26-deepspeed-chat)
	3. [Meta PyTorch Agentic Stack — Infra vs Training Split](#3-meta-pytorch-agentic-stack)
	   1. [Monarch (coordination/infra)](#31-monarch)
	   2. [TorchTitan (training stack)](#32-torchtitan)
	   3. [TorchForge (paused)](#33-torchforge)
	   4. [torchchat (out of scope)](#34-torchchat)
	4. [Comparison Matrix](#4-comparison-matrix)
	5. [Recommendation Rationale](#5-recommendation-rationale)
	6. [Integration Sketches](#6-integration-sketches)
	7. [Sources](#7-sources)

	---

	## 1. Audit Methodology

	For each framework, we capture five fields that determine whether it can host the Composer Replication Framework's three-channel loss (RLVR + hint-distill + trace-replay) on our existing OpenEnv-compatible TRL data path:

	1. Repo + license + last commit + maturity — primary GitHub source, license grade for redistribution, recency, and whether the project is production, research, or archived.
	2. Algorithm coverage — does it ship GRPO and DAPO out of the box? (DAPO matters because Composer-style training inherits its decoupled clip + dynamic sampling fixes for length and std biases.)
	3. Custom-loss extension point — concrete file/class/config where a custom 3-channel loss can be plugged. We strongly prefer a stable public hook over forking.
	4. Integration cost — rough lines of code needed for a `Recipe` doc + a skeleton `Trainer` subclass that runs end-to-end on a small env.
	5. OpenEnv data-path fit — does it already consume the OpenEnv contract (typed `reset`/`step`/`close`, MCP tool-calling) directly, or do we have to write a shim?

	Primary sources: each repo's `README.md`, official releases page, and DeepWiki audits (where indexed). Secondary checks: PyPI release timelines for Meta packages.

	---

	## 2. RL Framework Audit

	### 2.1 OpenRLHF

	\| Field \| Value \|
	\|---\|---\|
	\| Repo \| https://github.com/OpenRLHF/OpenRLHF \|
	\| License \| Apache-2.0 \|
	\| Stars / contributors \| 9,312 ★ / 90 contributors \|
	\| Latest release \| v0.9.10, 2026-04-04 \|
	\| Last push \| 2026-04-05 \|
	\| Maturity \| Production — used in many public RLHF runs since 2023; tagline "An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & TIS & vLLM & Ray & Async RL)" \|
	\| Algorithms \| PPO, GRPO, DAPO (release notes; advertised as a primary feature in v0.9.x), REINFORCE++, REINFORCE++-baseline, RLOO, GSPO, Async RL, TIS (truncated importance sampling) \|
	\| Custom-loss extension point \| `openrlhf/models/loss.py` — `PolicyLoss`, `DPOLoss`, `SFTLoss`, `PairWiseLoss`, `LogExpLoss` are concrete `nn.Module`s. To add a 3-channel loss you would (a) add a new `nn.Module` (e.g. `ThreeChannelLoss`) here, then (b) subclass the relevant `Trainer` (e.g. `PPOTrainer` / a new GRPO-derived trainer) and replace `self.loss_fn`. There is no config-driven custom-loss hook equivalent to PRIME-RL's `CustomLossConfig` — you fork or vendor. \|
	\| Integration cost \| Higher than PRIME-RL. Estimated ~400–600 LOC: ~150 LOC for a `ThreeChannelLoss` module, ~200 LOC for a `ComposerGRPOTrainer` subclass that routes the three signals (RLVR scalar, hint-distill teacher logprobs, trace-replay teacher logits), ~50 LOC for a `Recipe` doc, plus reward-fn glue. \|
	\| Data-path fit \| OpenRLHF's input is HF chat templates + a Python reward function or a remote reward URL (`--reward.remote_url`, `--train.agent_func_path`). It does not speak the OpenEnv `reset/step` protocol natively, but our existing OpenEnv→TRL adapter could be reused as a callable behind `agent_func_path`. Medium lift to wire OpenEnv. \|

	Verdict: Strong, mature, well-funded codebase with the most complete algorithm coverage of any candidate. Loses to PRIME-RL only because PRIME-RL has a first-class config-driven custom-loss hook that fits our exact need, and PRIME-RL already has the `verifiers`/OpenEnv shape baked into the orchestrator. We keep OpenRLHF on the radar as a fallback substrate if PRIME-RL's decentralized story is overkill for v0.1.
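
The "callable behind `agent_func_path`" idea can be illustrated with a minimal sketch. It assumes only the three method names of the OpenEnv contract (`reset`/`step`/`close`); `StepResult`, `EnvLike`, `EchoEnv`, and `episode_reward` are hypothetical names invented for this illustration and are not part of OpenEnv or OpenRLHF.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class StepResult:
    """Hypothetical step payload: observation text, scalar reward, done flag."""
    observation: str
    reward: float
    done: bool


class EnvLike(Protocol):
    """Stand-in for an OpenEnv-style client; only reset/step/close are assumed."""

    def reset(self) -> str: ...
    def step(self, action: str) -> StepResult: ...
    def close(self) -> None: ...


class EchoEnv:
    """Toy environment: pays 1.0 when the action equals the target string."""

    def __init__(self, target: str) -> None:
        self.target = target
        self.closed = False

    def reset(self) -> str:
        return f"say: {self.target}"

    def step(self, action: str) -> StepResult:
        hit = action.strip() == self.target
        return StepResult("ok" if hit else "retry", float(hit), hit)

    def close(self) -> None:
        self.closed = True


def episode_reward(env: EnvLike, actions: list[str]) -> float:
    """Run one episode and return the summed reward; the env is always closed."""
    total = 0.0
    try:
        env.reset()
        for action in actions:
            result = env.step(action)
            total += result.reward
            if result.done:
                break
    finally:
        env.close()
    return total
```

A function shaped like `episode_reward` is what a reward-function hook would call; the real adapter would also have to translate chat-template outputs into environment actions, which this sketch leaves out.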

	---

	### 2.2 PRIME-RL

	\| Field \| Value \|
	\|---\|---\|
	\| Repo \| https://github.com/PrimeIntellect-ai/prime-rl \|
	\| License \| Apache-2.0 \|
	\| Stars / contributors \| 1,398 ★ / 60 contributors \|
	\| Latest release \| v0.5.0, 2026-03-30 \|
	\| Last push \| 2026-05-25 (the date this audit was generated) \|
	\| Maturity \| Production-research hybrid — substrate behind INTELLECT-1/2 multi-DC runs; tagline "Async RL Training at Scale". Decentralized DiLoCo-shape compute is its differentiator. \|
	\| Algorithms \| GRPO, GSPO, on-policy distillation with a teacher model. `default_loss_fn` = DPPO + KL (a GRPO variant; similar lineage to DAPO's decoupled-clip idea but the upstream "DAPO" label is not used verbatim). \|
	\| Custom-loss extension point \| Best in class. `src/prime_rl/trainer/rl/loss.py` exposes a `LossInputs`/`LossOutputs` interface and `setup_loss_fn` resolves a config: `trainer.loss.type = "custom"` + `trainer.loss.import_path = "your_pkg.your_module.your_loss_fn"` + optional kwargs. The custom function receives `trainer_logprobs`, `inference_logprobs`, `teacher_logprobs`, `advantages`, `loss_mask` — i.e., the exact tensor inputs needed for a 3-channel loss (RLVR uses `advantages`, hint-distill uses `teacher_logprobs`, trace-replay can be threaded through `kwargs` as a precomputed reference). \|
	\| Integration cost \| Lowest. Estimated ~200–300 LOC total: ~120 LOC for a `composer_three_channel_loss` function in our package + ~30 LOC of config (`recipes/composer_v0.toml`), ~80 LOC `Recipe` doc. No subclassing required for the loss. A small adapter is needed if we precompute the trace-replay teacher distribution outside the `LossInputs` struct. \|
	\| Data-path fit \| Already aligned. PRIME-RL's orchestrator consumes `verifiers` environments via `vf.EnvServer`. The OpenEnv ↔ verifiers shim is a known small adapter (the `verifiers` library is the Hub-side env runner that OpenEnv's TRL guide already uses). Our existing OpenEnv-compatible TRL data path drops in with a thin wrapper. \|

	Verdict: Best fit for the framework. The combination of (i) config-driven custom loss with the right tensor signatures already present, (ii) verifiers/OpenEnv shape, (iii) decentralized async training that maps to our DiLoCo plans, makes PRIME-RL the substrate of choice for v0.1. Recommended addition #1.
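
To make the config-driven hook concrete, here is a generic sketch of how an `import_path` string plus kwargs can be turned into a callable. It is not PRIME-RL's actual `setup_loss_fn`; `resolve_import_path` and `build_loss_fn` are hypothetical helpers, and accepting both the dotted and the `module:attr` spelling is an assumption made because both appear in this document.

```python
import importlib
from functools import partial
from typing import Any, Callable


def resolve_import_path(import_path: str) -> Callable[..., Any]:
    """Resolve 'pkg.module:attr' or 'pkg.module.attr' to a callable."""
    if ":" in import_path:
        module_name, _, attr = import_path.partition(":")
    else:
        module_name, _, attr = import_path.rpartition(".")
    if not module_name or not attr:
        raise ValueError(f"not a valid import path: {import_path!r}")
    obj = getattr(importlib.import_module(module_name), attr)
    if not callable(obj):
        raise TypeError(f"{import_path!r} does not resolve to a callable")
    return obj


def build_loss_fn(import_path: str, kwargs: dict[str, Any]) -> Callable[..., Any]:
    """Bind config kwargs once so the trainer only passes the loss inputs."""
    return partial(resolve_import_path(import_path), **kwargs)
```

The point of the pattern is that the loss lives in our package and is selected by configuration, so upgrading the framework does not require carrying a fork.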

	---

	### 2.3 NeMo-Aligner

	\| Field \| Value \|
	\|---\|---\|
	\| Repo \| https://github.com/NVIDIA/NeMo-Aligner \|
	\| License \| Apache-2.0 \|
	\| Maturity \| Research-leaning production — NVIDIA-maintained, tied to NeMo/Megatron-LM. Advertised as "early stages of development" in its own README. \|
	\| Algorithms \| PPO, REINFORCE, RS (Rejection Sampling), DPO, RPO. No GRPO. No DAPO. \|
	\| Custom-loss extension point \| `loss_func` method on Megatron model classes (e.g. `MegatronGPTDPOModel.loss_func`). Requires NeMo model-class subclassing and Megatron-LM familiarity. \|
	\| Integration cost \| High. Estimated ~800–1,200 LOC including .nemo conversion of HF weights, Megatron model wrapping, custom Megatron `loss_func`, and a recipe. Plus the operational cost of running on Megatron-LM (Triton kernels, NeMo container). \|
	\| Data-path fit \| JSONL only; no OpenEnv. We'd write a full env adapter. \|

	Verdict: Wrong shape. No GRPO/DAPO and tightly bound to the NeMo ecosystem. Only relevant if we ever need NVIDIA-supported large-scale Megatron RL, which we don't for the Composer Replication v0.1/v0.2 horizon. Reject.

	---

	### 2.4 Unsloth (RL)

	\| Field \| Value \|
	\|---\|---\|
	\| Repo \| https://github.com/unslothai/unsloth \|
	\| License \| Apache-2.0 (per the public README; the DeepWiki snapshot does not surface the license) \|
	\| Maturity \| Production for SFT and LoRA/QLoRA; research/preview for RL — RL support shipped in 2025 as a TRL patcher. \|
	\| Algorithms \| Wraps TRL → inherits TRL's GRPO; loss-type switch supports `"grpo"`, `"bnpo"`, `"dr_grpo"`, `"dapo"`, `"cispo"`. So GRPO and DAPO are both available through the patched-TRL path. \|
	\| Custom-loss extension point \| Problematic. The actual loss kernels live in `unsloth_zoo` (a separate compiled dependency). The patcher (`patch_trl_rl_trainers()`) generates modified TRL trainer classes via `exec()` from string templates. To add a new loss type you would have to (a) modify or fork `unsloth_zoo` to add a kernel, (b) extend `RL_REPLACEMENTS`, and (c) extend the `compute_loss()` switch in the patcher template. There is no public Python subclass hook that survives the patching. \|
	\| Integration cost \| Very high if we want our own loss. Forking `unsloth_zoo` defeats the purpose of using Unsloth, whose value is its optimized kernels. Estimated ~1,000+ LOC plus an external repo to maintain. \|
	\| Data-path fit \| TRL-shaped, so OpenEnv via TRL is fine — but only for stock TRL losses. Our 3-channel loss does not survive Unsloth's patching. \|

	Verdict: Excellent for memory-efficient SFT and stock-GRPO LoRA. Wrong tool for a custom loss. Reject as the substrate; we may still use it as an optional QLoRA accelerator inside a stock-GRPO ablation run.

	---

	### 2.5 LLaMA-Factory

	\| Field \| Value \|
	\|---\|---\|
	\| Repo \| https://github.com/hiyouga/LLaMA-Factory \|
	\| License \| Apache-2.0 \|
	\| Maturity \| Production for breadth (50+ model families, SFT/DPO/PPO recipes), but RL is a thin TRL wrapper. \|
	\| Algorithms \| PPO, DPO, KTO, ORPO, SimPO via `CustomTrainer` subclasses of the corresponding `trl.Trainer` classes. No GRPO. No DAPO in the repo itself; the README points to EasyR1 (an external GRPO framework) for those. \|
	\| Custom-loss extension point \| `compute_preference_loss` switch on `CustomDPOTrainer` (selects `sigmoid` / `hinge` / `ipo` / `kto_pair` / `orpo` / `simpo`). For PPO, you would subclass `CustomPPOTrainer` → which is `trl.PPOTrainer`. Effectively the same extension story as plain TRL, with a configuration layer on top. \|
	\| Integration cost \| Moderate, ~400 LOC, but you are essentially using TRL through one extra layer. \|
	\| Data-path fit \| Text/dataset-shaped, not OpenEnv-aware. Same OpenEnv-via-TRL story. \|

	Verdict: Useful as a multi-model SFT laboratory but does not move the ball for our RL-side requirements. Reject as substrate; we already have TRL.

	---

	### 2.6 DeepSpeed-Chat

	\| Field \| Value \|
	\|---\|---\|
	\| Repo \| https://github.com/deepspeedai/DeepSpeedExamples (the `applications/DeepSpeed-Chat/` subtree) \|
	\| License \| Apache-2.0 \|
	\| Maturity \| Effectively stale. The README's "Latest News" cuts off in August 2023. CI patches in 2025 (e.g., #6982, #7015, #7052) are dependency-pinning fixes, not feature work. The roadmap to "generalize DeepSpeed-RLHF abstraction for a wider range of RL algorithms" has not landed. \|
	\| Algorithms \| PPO (3-stage RLHF) + DPO. No GRPO. No DAPO. \|
	\| Custom-loss extension point \| `DeepSpeedPPOTrainer.train_rlhf` / `actor_loss_fn` / `critic_loss_fn`. Editable but not config-hooked. \|
	\| Integration cost \| Moderate, but you inherit a frozen architecture. ~500 LOC. \|
	\| Data-path fit \| Prompt-dataset-shaped; no OpenEnv. \|

	Verdict: Pioneering for its time, no longer competitive on algorithm coverage. Reject.

	---

	## 3. Meta PyTorch Agentic Stack — Infra vs Training Split

	The brief asked specifically to distinguish coordination/infra from training-stack components. The answer is:

	\| Component \| Layer \| Status (May 2026) \| In our framework? \|
	\|---\|---\|---\|---\|
	\| Monarch (`meta-pytorch/monarch`) \| Coordination / Infra — actor mesh, RDMA data plane, supervision trees \| Active. v0.4 GA (2026-03-26), v0.5 dev wheels daily, BSD-3 \| Yes — recommended addition. \|
	\| TorchTitan (`pytorch/torchtitan`) \| Training stack — FSDP2 / TP / PP / CP / float8 / MXFP8 \| Active. BSD-3, "extensive development". Has an experimental GRPO recipe (`experiments/rl/simple_grpo_sum_digits.py`) on Monarch. \| Indirectly — already the trainer inside PRIME-RL and TorchForge. We adopt it transitively, not as a direct dependency. \|
	\| TorchForge (`meta-pytorch/forge`) \| RL post-training library \| Development paused per the repo banner; consolidating into TorchTitan. ~685★. \| Pattern reference only. Lift the Generator/Trainer/Rewarder shape but do not depend on the package. \|
	\| torchchat (`pytorch/torchchat`) \| Inference / local deployment \| Active for its own scope, but: not a training framework; no RL surface. \| Out of scope. \|
	\| OpenEnv (`meta-pytorch/OpenEnv`) \| Environment standard (covered separately) \| Active. Already a v0 dependency of the framework. \| Already adopted. \|

	### 3.1 Monarch

	\| Field \| Value \|
	\|---\|---\|
	\| Repo \| https://github.com/meta-pytorch/monarch \|
	\| License \| BSD-3-Clause \|
	\| PyPI \| `torchmonarch`; v0.4.1 stable (2026-04-08), v0.5.0 dev wheels published daily through 2026-05-05 \|
	\| Maturity \| Experimental but actively shipped. "Currently in an experimental stage" per the repo's own status note, but with a functioning K8s operator, daily dev wheels, and `ProcMesh`/`ActorMesh` APIs stable enough for VeRL backend experiments. \|
	\| Role in our stack \| Pure coordination/infra. It does not train models. It hosts whatever trainer you bring (TRL, VeRL, PRIME-RL, TorchTitan) as `Actor` subclasses on a `ProcMesh`. The `monarch.spmd.SPMDActor` automatically configures `RANK`/`LOCAL_RANK`/`WORLD_SIZE` for any PyTorch-distributed script — i.e., we can lift our existing TRL or PRIME-RL workers into Monarch with minimal change. \|
	\| Key abstractions \| `ProcMesh` (processes × hosts × GPUs), `ActorMesh` (typed actors with `@endpoint` methods), supervision trees, RDMA buffers, distributed tensors / DTensor integration. Underlying runtime: `hyperactor` (Rust). \|
	\| Why over Ray \| Tighter PyTorch/DTensor integration; explicit RDMA data plane (Ray uses object store + standard networking); single-controller mental model maps directly to RL post-training (one controller orchestrates Generator + Trainer + Rewarder + Env actors). \|
	\| Integration cost into Composer Replication \| ~300 LOC + ops: (a) wrap our PRIME-RL trainer as an `SPMDActor`; (b) wrap our vLLM rollout server as an `Actor` with an `@endpoint generate(prompts)` method; (c) write a single controller script that creates a `ProcMesh`, spawns both meshes, and shuttles `DataProto`-shaped messages; (d) Recipe doc. The ops cost is the harder half — Monarch's K8s operator is new (v0.2.0+). \|
	\| Risk \| Pre-1.0; API churn possible (e.g., `KubernetesJob.add_mesh` signature changed in v0.5). Mitigation: pin to `torchmonarch==0.4.1` for v0.2 of our framework. \|
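
For readers unfamiliar with the `RANK`/`LOCAL_RANK`/`WORLD_SIZE` convention mentioned in the table, the sketch below shows the values a hosts × GPUs layout implies for each process. It illustrates the environment-variable convention that PyTorch-distributed scripts read, not Monarch's implementation; `spmd_env` is a hypothetical helper.

```python
def spmd_env(num_hosts: int, gpus_per_host: int) -> list[dict[str, str]]:
    """Return one env-var dict per process, ordered by global rank."""
    world_size = num_hosts * gpus_per_host
    envs = []
    for host in range(num_hosts):
        for local_rank in range(gpus_per_host):
            envs.append({
                # Global rank counts processes host by host.
                "RANK": str(host * gpus_per_host + local_rank),
                # Local rank is the GPU index within the host.
                "LOCAL_RANK": str(local_rank),
                "WORLD_SIZE": str(world_size),
            })
    return envs
```

For example, with 2 hosts of 4 GPUs each, the second process on the second host gets `RANK=5`, `LOCAL_RANK=1`, `WORLD_SIZE=8`.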

	### 3.2 TorchTitan

	\| Field \| Value \|
	\|---\|---\|
	\| Repo \| https://github.com/pytorch/torchtitan \|
	\| License \| BSD-3-Clause \|
	\| Maturity \| Active development for pretraining; experimental for RL. The GRPO experiment (`torchtitan/experiments/rl/simple_grpo_sum_digits.py`) is in `experiments/`, which the repo explicitly disclaims as removable. \|
	\| Role \| Training stack only. Provides FSDP2 (per-parameter sharding), Tensor Parallel (incl. async TP), Pipeline Parallel (zero-bubble), Context Parallel (long-context), `torch.compile`, Float8, MXFP8, DDP, HSDP. \|
	\| OpenEnv-aware? \| No, but the experimental `RLTrainer` integrates `vLLM` + Monarch actors, which is the same shape PRIME-RL uses. \|
	\| Why we don't add it directly \| PRIME-RL already uses TorchTitan-equivalent FSDP2 internals, and TorchForge's training core was TorchTitan. Adding TorchTitan as a direct dependency would mean writing our own RL loop on top of it — that's TorchForge's job, and Meta paused exactly that effort. The right move is to depend on PRIME-RL, which has battle-tested distributed training patterns equivalent to TorchTitan's, and revisit TorchTitan directly only when we genuinely need its experimental zero-bubble PP or MXFP8 paths. \|

	### 3.3 TorchForge (Paused)

	- Repo banner: "Development paused — LLM training consolidating in TorchTitan."
	- ~685 ★, 100+ open issues, last meaningful release in early 2026.
	- Patterns we should still copy:
	  - Generator/Trainer/Rewarder ActorMesh decomposition
	  - TorchStore-style RDMA weight broadcast
	  - Async toggle between sync PPO-like and fully async off-policy
	- We do not add a TorchForge dependency. Architectural reference only.

	### 3.4 torchchat (Out of Scope)

	- Inference / local deployment of LLMs (Eager / `torch.compile` / AOT Inductor / ExecuTorch / mobile).
	- No training, no RL.
	- Mentioned in the brief for completeness; ruled out cleanly.

	---

	## 4. Comparison Matrix

	### 4.1 RL Frameworks

	\| Framework \| License \| Last release \| Maturity \| GRPO \| DAPO \| Custom-loss hook \| OpenEnv fit \| Est. integration LOC \|
	\|---\|---\|---\|---\|---\|---\|---\|---\|---\|
	\| TRL (baseline) \| Apache-2.0 \| Active \| Production \| ✅ \| partial (tricks land per release) \| Subclass `GRPOTrainer.compute_loss` \| ✅ native (Oct 2025 OpenEnv guide) \| already integrated \|
	\| VeRL (baseline) \| Apache-2.0 \| Active \| Production \| ✅ \| ✅ \| `core_algos.py` + worker subclass \| shim via Ray dataloader \| already skeleton \|
	\| OpenRLHF \| Apache-2.0 \| v0.9.10 (2026-04-04) \| Production \| ✅ \| ✅ \| `openrlhf/models/loss.py` + Trainer subclass; no config hook \| shim via `agent_func_path` \| ~400–600 \|
	\| PRIME-RL ⭐ \| Apache-2.0 \| v0.5.0 (2026-03-30) \| Prod-research \| ✅ \| partial (DPPO+KL variant; not labeled DAPO) \| `CustomLossConfig` import_path — first-class \| ✅ via `verifiers` (OpenEnv-compatible) \| ~200–300 \|
	\| NeMo-Aligner \| Apache-2.0 \| Active \| Research-leaning \| ❌ \| ❌ \| Megatron model `loss_func` \| none; JSONL only \| ~800–1,200 \|
	\| Unsloth (RL) \| Apache-2.0 \| Active \| Production (SFT) / preview (RL) \| ✅ (via TRL patch) \| ✅ (via TRL patch) \| Loss kernels in closed `unsloth_zoo`; effectively unhookable \| TRL-shaped \| ~1,000+ (forking) \|
	\| LLaMA-Factory \| Apache-2.0 \| Active \| Production \| ❌ (delegates to EasyR1) \| ❌ \| TRL `Custom*Trainer` subclass \| TRL-shaped \| ~400 \|
	\| DeepSpeed-Chat \| Apache-2.0 \| Stale (Aug 2023 features; 2025 only CI fixes) \| Maintenance-only \| ❌ \| ❌ \| `DeepSpeedPPOTrainer` subclass \| none \| ~500 \|

	### 4.2 Meta PyTorch Stack

	\| Component \| Layer \| License \| Status \| In recommendation? \|
	\|---\|---\|---\|---\|---\|
	\| Monarch ⭐ \| Coordination / actor mesh \| BSD-3 \| Active (v0.4 GA, v0.5 dev) \| Yes \|
	\| TorchTitan \| Training stack \| BSD-3 \| Active; RL experimental \| Indirect (via PRIME-RL) \|
	\| TorchForge \| RL library \| BSD-3 \| Paused \| No — patterns only \|
	\| torchchat \| Inference / deployment \| BSD-3 \| Active \| No — out of scope \|
	\| OpenEnv \| Environment standard \| (Hub) \| Active \| Already adopted \|

	---

	## 5. Recommendation Rationale

	### 5.1 Why PRIME-RL, not OpenRLHF

	OpenRLHF is in many ways the safer pick: more stars, more contributors, more algorithm coverage (it explicitly ships DAPO). The deciding factor is the shape of our custom loss.

	The Composer Replication Framework's signature contribution is the three-channel loss, which combines:

	1. RLVR — tests-pass scalar from the OpenEnv environment.
	2. Composer-style hint-distill (SDPO/OPSD) — the model self-teaches against its own hint-conditioned roll-outs; needs `teacher_logprobs` aligned to the rollout token grid.
	3. Trace-replay multi-teacher PRM (the novel bit) — N frozen external teachers' precomputed token-level distributions, replayed against the on-policy rollout.
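
Channel 3 leaves open how the N teachers are combined. One plausible reading, shown here purely as an assumption, is a weighted mixture: each teacher scores the rollout tokens, and the per-token mixture log-probability is a log-sum-exp over teachers. `mixture_logprobs` is a hypothetical helper, and uniform weights are an illustrative default.

```python
import math
from typing import Optional


def mixture_logprobs(
    teacher_logprobs: list[list[float]],
    weights: Optional[list[float]] = None,
) -> list[float]:
    """Per-token log-prob under a weighted mixture of N teachers.

    teacher_logprobs[n][t] is teacher n's log-prob of the rollout token at t.
    """
    n = len(teacher_logprobs)
    if n == 0:
        raise ValueError("need at least one teacher")
    if weights is None:
        weights = [1.0 / n] * n  # assumption: uniform mixture
    if len(weights) != n or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must match the teachers and sum to 1")
    mixed = []
    for t in range(len(teacher_logprobs[0])):
        terms = [
            math.log(w) + lp[t]
            for w, lp in zip(weights, teacher_logprobs)
            if w > 0.0
        ]
        peak = max(terms)  # subtract the max for numerical stability
        mixed.append(peak + math.log(sum(math.exp(x - peak) for x in terms)))
    return mixed
```

Because the teachers are frozen, a mixture like this can be precomputed once per rollout batch and stored, which fits the "precomputed" wording of channel 3.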

	PRIME-RL's `LossInputs` dataclass already exposes exactly the tensors we need:
	```
	trainer_logprobs, inference_logprobs, teacher_logprobs, advantages, loss_mask
	```
	A custom 3-channel loss is roughly:
	```python
	# Sketch only: grpo_term and kl_term are placeholder helpers, not PRIME-RL APIs.
	def composer_three_channel_loss(li: LossInputs, *, hint_weight, replay_weight, replay_logits) -> LossOutputs:
	    rlvr = grpo_term(li.trainer_logprobs, li.inference_logprobs, li.advantages, li.loss_mask)
	    hint = kl_term(li.trainer_logprobs, li.teacher_logprobs, li.loss_mask)
	    replay = kl_term(li.trainer_logprobs, replay_logits, li.loss_mask)
	    return LossOutputs(
	        loss=rlvr + hint_weight * hint + replay_weight * replay,
	        metrics={"rlvr": rlvr.item(), "hint": hint.item(), "replay": replay.item()},
	    )
	```
	We register this with `trainer.loss.type = "custom"` + `import_path` and we're done. No subclassing, no `exec()`-patched template, no Megatron model wrapping.
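
The arithmetic of the three-channel combination can be checked without any framework. The sketch below works on plain Python lists and rests on two assumptions that are not taken from PRIME-RL: the RLVR term is a textbook PPO-style clipped surrogate (PRIME-RL's DPPO masking differs), and each KL term is the sampled-token estimate, that is, the masked mean of the log-ratio. All function names are hypothetical.

```python
import math


def masked_mean(values: list[float], mask: list[int]) -> float:
    """Mean over positions where mask is truthy; 0.0 if nothing is kept."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


def grpo_term(trainer_lp, inference_lp, advantages, mask, clip_eps=0.2):
    """Negated clipped surrogate, so that lower is better (assumed form)."""
    per_token = []
    for lp_new, lp_old, adv in zip(trainer_lp, inference_lp, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        per_token.append(-min(ratio * adv, clipped * adv))
    return masked_mean(per_token, mask)


def kl_term(trainer_lp, reference_lp, mask):
    """Sampled-token estimate of KL(policy || reference): mean log-ratio."""
    return masked_mean([p - q for p, q in zip(trainer_lp, reference_lp)], mask)


def three_channel_loss(trainer_lp, inference_lp, teacher_lp, replay_lp,
                       advantages, mask, hint_weight, replay_weight):
    """Combine the RLVR, hint-distill, and trace-replay channels."""
    rlvr = grpo_term(trainer_lp, inference_lp, advantages, mask)
    hint = kl_term(trainer_lp, teacher_lp, mask)
    replay = kl_term(trainer_lp, replay_lp, mask)
    total = rlvr + hint_weight * hint + replay_weight * replay
    return {"loss": total, "rlvr": rlvr, "hint": hint, "replay": replay}
```

With on-policy samples (ratio 1) and unit advantages the RLVR term is -1; a teacher that assigns each masked token 0.5 nats less than the policy adds `hint_weight * 0.5`. Masked-out positions contribute nothing, whatever their values.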

	OpenRLHF would require us to (a) add a `ThreeChannelLoss` `nn.Module` to `openrlhf/models/loss.py`, (b) subclass `PPOTrainer` (or equivalent GRPO trainer) to construct it with the right teacher-logprob plumbing, and (c) carry that fork forward. ~2× the LOC, plus a fork to maintain.

	A second factor: PRIME-RL's `verifiers` env protocol is a direct precursor of OpenEnv's wire shape (HTTP/WebSocket env servers, typed observations). Our existing OpenEnv-compatible TRL data path translates with a thin adapter. OpenRLHF's `agent_func_path` is more of an escape hatch than a contract.

	A third factor: PRIME-RL was built for decentralized training (INTELLECT-1/2). Even though our v0.1 stays on a single cluster, the v0.2 multi-DC story drops in cleanly. OpenRLHF is Ray-on-one-cluster by design.

	### 5.2 Why Monarch, not TorchTitan or TorchForge

	Among the four Meta-stack components in the brief, only one is both (a) ours to add and (b) genuinely new functionality:

	- TorchForge is paused — depending on it now is a known dead end.
	- TorchTitan is already inside PRIME-RL transitively (PRIME-RL uses FSDP2 plus a SHARDCAST weight-broadcast layer that is morally equivalent to what TorchTitan offers). Adding TorchTitan as a direct dependency means writing our own RL loop on top of it, which is exactly what TorchForge tried and paused. We get TorchTitan's benefits without owning the integration.
	- torchchat is for local inference / mobile deployment — out of scope.
	- Monarch is the unique value: a PyTorch-native actor mesh that lets us replace Ray (PRIME-RL's current orchestration substrate) with something that has explicit RDMA, supervision trees, and ProcMesh/ActorMesh primitives that map directly onto our (Generator, Trainer, Rewarder, EnvServer) topology.

	The migration path is incremental:
	- v0.1: PRIME-RL on Ray (current). Monarch listed as roadmap.
	- v0.2: Wrap PRIME-RL's Trainer as a `monarch.spmd.SPMDActor`, vLLM Generator as an `Actor` with an `@endpoint generate()`. Switch the orchestrator from `ray.init()` to `this_host().spawn_procs()`.
	- Risk-mitigation: pin to `torchmonarch==0.4.1` (the last GA release before v0.5 dev). Keep a Ray fallback path active until v0.2 is stable.
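
The "pin plus Ray fallback" policy can be expressed as a small startup guard. This is a sketch under stated assumptions: the backend names `"monarch"` and `"ray"` and the helper names are hypothetical, and an exact-match check against the pin is one possible policy rather than the framework's actual behaviour.

```python
from importlib import metadata
from typing import Callable, Optional

PINNED_MONARCH = "0.4.1"  # the pin named in the migration notes


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def choose_backend(
    lookup: Callable[[str], Optional[str]] = installed_version,
    pinned: str = PINNED_MONARCH,
) -> str:
    """Use Monarch only when the pinned release is installed; else fall back."""
    return "monarch" if lookup("torchmonarch") == pinned else "ray"
```

Injecting `lookup` keeps the guard testable without installing either orchestrator.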

	---

	## 6. Integration Sketches

	### 6.1 PRIME-RL Recipe skeleton

	> Realised in v0.1 (Wave 17 update): Wave 14b shipped the PRIME-RL
	> recipe at `composer_replication/recipes/prime_rl/prime_rl_config.yaml`
	> as YAML with a different kwarg surface than the TOML sketch below.
	> The actual recipe shape:
	>
	> ```yaml
	> # composer_replication/recipes/prime_rl/prime_rl_config.yaml
	> model:
	>   base: "Qwen/Qwen2.5-0.5B"
	>   attn_implementation: "flash_attention_2"
	>   dtype: "bfloat16"
	> env:
	>   protocol: "verifiers"
	>   config: { name: "math/gsm8k", split: "train" }
	> loss:
	>   custom:
	>     import_path: "composer_replication.recipes.prime_rl.composer_loss:loss_fn"
	>     kwargs:
	>       alpha_sdpo: 0.0 # channel 2 deferred in v0
	>       beta_dpo: 0.0 # channel 3 out-of-scope for PRIME-RL v0
	>       dppo_mask_high: 0.2 # PRIME-RL DPPO convention (NOT textbook PPO)
	>       dppo_mask_low: 0.2 # both must be >= 0 per Field(..., ge=0)
	>       adv_tau: 1.0 # advantage normalization
	>       kl_tau: 0.04 # KL coefficient
	> ```
	>
	> The realised `loss_fn(inputs, **kwargs)` matches PRIME-RL's
	> `LossInputs`/`LossOutputs` interface (read upstream `prime_rl/loss.py`
	> for parity verification; Wave 14b's shadow-parity test independently
	> restates the formula in
	> `composer_replication/recipes/prime_rl/tests/test_composer_loss.py`).
	>
	> The pre-Wave-14b TOML/`hint_weight`/`replay_weight` sketch below is
	> preserved as historical proposal context.

	`recipes/composer_v0_prime_rl.toml` (~30 LOC):

	```toml
	# composer_v0_prime_rl.toml
	[model]
	name = "Qwen/Qwen3-32B" # or Kimi-K2.5 when MoE support lands

	[data]
	env = "swe_bench_lite" # via verifiers EnvServer; wraps our OpenEnv adapter
	batch_size = 64
	group_size = 16

	[trainer]
	algorithm = "grpo"

	[trainer.loss]
	type = "custom"
	import_path = "composer_replication.recipes.prime_rl.composer_loss:loss_fn"
	[trainer.loss.kwargs]
	hint_weight = 0.5
	replay_weight = 0.25
	replay_logits_path = "/data/teachers/precomputed_replay.zarr"

	[teacher]
	model = "Qwen/Qwen3-32B" # same as policy = self-teacher for hint-distill
	hint_template = "composer.hint_v1"

	[orchestrator]
	sync_mode = "async"
	shardcast = true
	```

	`composer_replication/recipes/prime_rl/composer_loss.py` (~120 LOC). The current Wave 14b
	implementation defines `loss_fn(inputs, **kwargs)` rather than the
	`composer_three_channel_loss(li, *, hint_weight, replay_weight, replay_logits_path)`
	signature sketched below:

	```python
	# composer_replication/recipes/prime_rl/composer_loss.py: sketch only;
	# the actual signature evolved during Wave 14b. See module docstring for
	# the current `loss_fn` contract.
	# grpo_surrogate, masked_kl, and trace_replay_kl are placeholder helpers
	# that this sketch assumes are defined elsewhere in the module.
	from prime_rl.trainer.rl.loss import LossInputs, LossOutputs

	def composer_three_channel_loss(
	    li: LossInputs,
	    *,
	    hint_weight: float,
	    replay_weight: float,
	    replay_logits_path: str,  # same name as the kwarg in the TOML sketch
	) -> LossOutputs:
	    # 1. RLVR via GRPO surrogate
	    rlvr = grpo_surrogate(li.trainer_logprobs, li.inference_logprobs,
	                          li.advantages, li.loss_mask)

	    # 2. Hint-distill: KL(policy || hint-conditioned teacher)
	    hint = masked_kl(li.trainer_logprobs, li.teacher_logprobs, li.loss_mask)

	    # 3. Trace-replay: KL(policy || precomputed multi-teacher mixture)
	    replay = trace_replay_kl(li.trainer_logprobs, replay_logits_path, li.loss_mask)

	    total = rlvr + hint_weight * hint + replay_weight * replay
	    return LossOutputs(
	        loss=total,
	        metrics={"rlvr": rlvr.item(), "hint": hint.item(), "replay": replay.item()},
	    )
	```

	Plus `docs/recipes/composer_v0_prime_rl.md` (~50 LOC) describing data layout, teacher precomputation, and reproducibility hashes.

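	Total: ~120 LOC of loss code + ~30 LOC config + ~50 LOC docs ≈ 200 LOC, the low end of the ~200–300 LOC estimate in §2.2. The rest of that range leaves headroom for the helper functions and the optional trace-replay adapter.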

	### 6.2 Monarch wrap-up sketch (v0.2)

	```python
	# composer_replication/orchestrator/monarch_runner.py (~120 LOC)
	# Sketch only: import paths, spawn_procs arguments, and the endpoint call
	# style are schematic; verify them against the pinned torchmonarch release.
	from monarch.actor import Actor, endpoint
	from monarch.proc_mesh import this_host, ProcMesh

	class TrainerActor(Actor):
	    @endpoint
	    async def step(self, batch): ...

	class GeneratorActor(Actor):
	    @endpoint
	    async def generate(self, prompts): ...

	class RewarderActor(Actor):
	    @endpoint
	    async def score(self, traj): ...

	async def main(cfg, env):
	    # `env` is the prompt source (e.g. the verifiers/OpenEnv adapter), supplied by the caller.
	    train_mesh = await this_host().spawn_procs(TrainerActor, hosts=4, gpus=8)
	    gen_mesh = await this_host().spawn_procs(GeneratorActor, hosts=2, gpus=8)
	    rew_mesh = await this_host().spawn_procs(RewarderActor, hosts=1, gpus=2)

	    # range() is a plain iterable, so this is a regular for loop, not `async for`.
	    for step in range(cfg.steps):
	        prompts = await env.batch()
	        traj = await gen_mesh.generate.broadcast(prompts)
	        rewards = await rew_mesh.score.broadcast(traj)
	        await train_mesh.step.broadcast({"traj": traj, "rewards": rewards})
	```

	Total: ~120 LOC controller + ~50 LOC ops (K8s operator manifest) + ~80 LOC recipe doc ≈ 250 LOC.
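
The single-controller loop itself can be exercised without Monarch. The sketch below replaces the actor meshes with plain `asyncio` classes so the Generator → Rewarder → Trainer data flow can be run locally; every class and the toy reward rule are hypothetical stand-ins, and nothing here uses or imitates the Monarch API.

```python
import asyncio


class Generator:
    async def generate(self, prompts: list[str]) -> list[str]:
        await asyncio.sleep(0)  # yield to the event loop, like a remote call
        return [p.upper() for p in prompts]


class Rewarder:
    async def score(self, trajectories: list[str]) -> list[float]:
        await asyncio.sleep(0)
        # Toy rule: reward 1.0 when the trajectory contains the letter "A".
        return [1.0 if "A" in t else 0.0 for t in trajectories]


class Trainer:
    def __init__(self) -> None:
        self.steps = 0
        self.mean_rewards: list[float] = []

    async def step(self, batch: dict) -> None:
        await asyncio.sleep(0)
        rewards = batch["rewards"]
        self.mean_rewards.append(sum(rewards) / len(rewards))
        self.steps += 1


async def run(prompt_batches, generator, rewarder, trainer) -> Trainer:
    """One controller drives generate -> score -> train for each batch."""
    for prompts in prompt_batches:
        traj = await generator.generate(prompts)
        rewards = await rewarder.score(traj)
        await trainer.step({"traj": traj, "rewards": rewards})
    return trainer
```

Running `asyncio.run(run([["ab", "cd"], ["aa"]], Generator(), Rewarder(), Trainer()))` performs two training steps. A local stand-in like this is a cheap way to test controller logic before the ops work on the K8s operator is done.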

	---

	## 7. Sources

	### Primary

	- OpenRLHF — https://github.com/OpenRLHF/OpenRLHF (README, Releases v0.9.10), Apache-2.0; DeepWiki: `openrlhf/models/loss.py`, `agent_func_path`.
	- PRIME-RL — https://github.com/PrimeIntellect-ai/prime-rl (README, Releases v0.5.0), Apache-2.0; DeepWiki: `src/prime_rl/trainer/rl/loss.py`, `CustomLossConfig`, `LossInputs`/`LossOutputs`, `verifiers` integration.
	- NeMo-Aligner — https://github.com/NVIDIA/NeMo-Aligner, Apache-2.0; DeepWiki: PPO/REINFORCE/DPO/RPO; `loss_func` on Megatron model classes.
	- Unsloth — https://github.com/unslothai/unsloth, README RL section; DeepWiki: `patch_trl_rl_trainers()`, `unsloth_zoo` kernels, DAPO loss-type switch.
	- LLaMA-Factory — https://github.com/hiyouga/LLaMA-Factory, Apache-2.0; DeepWiki: `CustomPPOTrainer`/`CustomDPOTrainer`, EasyR1 reference for GRPO.
	- DeepSpeed-Chat — https://github.com/deepspeedai/DeepSpeedExamples (`applications/DeepSpeed-Chat/`), Apache-2.0; DeepWiki: 3-stage PPO, DPO; "Latest News" cutoff Aug 2023; 2025 PRs (#6982, #7015, #7052) confirming maintenance-only mode.
	- Monarch — https://github.com/meta-pytorch/monarch, BSD-3; PyPI `torchmonarch` v0.4.1 (2026-04-08), v0.5.0 dev wheels through 2026-05-05; DeepWiki: `ProcMesh`, `ActorMesh`, `monarch.spmd.SPMDActor`.
	- TorchTitan — https://github.com/pytorch/torchtitan, BSD-3; DeepWiki: FSDP2/TP/PP/CP, `torchtitan/experiments/rl/simple_grpo_sum_digits.py`, integration with vLLM and Monarch.
	- TorchForge — https://github.com/meta-pytorch/forge, BSD-3, repo banner "development paused — consolidating in TorchTitan".
	- torchchat — https://github.com/pytorch/torchchat, BSD-3; DeepWiki: inference-only (eager / `torch.compile` / AOT Inductor / ExecuTorch).

	### Companion repository docs (already present)

	- `~/wiki/research/post-training-framework/04-verl-trl.md` — VeRL vs TRL deep dive.
	- `~/wiki/research/post-training-framework/03-monarch-torchforge-openenv.md` — full Meta-stack survey.
	- `~/wiki/research/post-training-framework/02-diloco-family.md` — DiLoCo / OpenDiLoCo / PRIME-RL / INTELLECT-2.
	- `~/wiki/projects/composer-replication-framework.md` — current TL;DR and stage plan.

	### Notes on accuracy

	- "DAPO" labeling: OpenRLHF and Unsloth both advertise DAPO as a first-class loss type. PRIME-RL's default loss (`DPPO+KL`) shares DAPO's decoupled-clip lineage but is not labeled DAPO upstream, which is why §4.1 marks it "partial". For our purposes we treat it as the same family.
	- Last-commit dates and release versions are pulled from GitHub release pages (OpenRLHF, PRIME-RL) and PyPI release history (`torchmonarch`).
	- Star counts and contributor counts reflect the snapshots returned by web search at the time of writing (May 2026) and will drift; the relative ordering is stable.