| # RL Post-Training Frameworks Landscape & Meta PyTorch Stack Audit | |
| > **Generated:** 2026-05-25 | |
| > **Scope:** Audit of RL post-training frameworks beyond TRL+VeRL plus Meta's PyTorch agentic stack components, with a recommendation of two additions to the Composer Replication Framework. | |
| > **Feeds:** ADR-006 (Algorithm-substrate selection) | |
| > **Companion docs:** `~/wiki/research/post-training-framework/04-verl-trl.md`, `~/wiki/research/post-training-framework/03-monarch-torchforge-openenv.md`, `~/wiki/research/post-training-framework/02-diloco-family.md` | |
| --- | |
| ## TL;DR — Recommendation | |
| | Slot | Pick | Why | | |
| |---|---|---| | |
| | **RL framework #3 (after TRL, VeRL)** | **PRIME-RL (PrimeIntellect-ai/prime-rl)** | First-class `CustomLossConfig` extension point (`trainer.loss.type=custom` + `import_path`) — the cleanest place we have to drop our **3-channel loss (RLVR + hint-distill + trace-replay)** without forking. Already uses the `verifiers` env protocol that bridges to OpenEnv. Async, decentralized substrate. Apache-2.0. INTELLECT-2 production receipts. | | |
| | **Infra component (Meta stack)** | **Monarch (`meta-pytorch/monarch`)** as the actor-mesh control plane; **TorchTitan** is *also* tracked as the FSDP2/TP/PP training core but is already the trainer inside both PRIME-RL and TorchForge, so we adopt it transitively. The single net-new dependency is **Monarch**. | Monarch is the only Meta-stack component that is (a) actively shipped (v0.4 GA, v0.5 dev, weekly wheels), (b) decoupled from the now-paused TorchForge, and (c) able to host *any* SPMD trainer (TRL, VeRL, PRIME-RL) as an `ActorMesh`. BSD-3. Replaces Ray when our v0.2 lands. | | |
| **What we do NOT add:** | |
| - OpenRLHF — strong production framework (v0.9.10, 9.3K★, supports DAPO) but its custom-loss path requires modifying `openrlhf/models/loss.py` + a `Trainer` subclass. Strictly worse extension story than PRIME-RL for our specific need (3-channel loss). | |
| - NeMo-Aligner — no GRPO, no DAPO, heavy NeMo/Megatron dependency. Wrong shape. | |
| - Unsloth — TRL wrapper; the RL loss kernels live in the separate `unsloth_zoo` package, so adding our own loss would mean forking it. | |
| - LLaMA-Factory — TRL wrapper, no GRPO/DAPO (delegates to EasyR1). | |
| - DeepSpeed-Chat — effectively unmaintained for new RL algos since Aug 2023; PPO/DPO only. | |
| - TorchForge — Meta has marked the repo "development paused, consolidating into TorchTitan." Borrow patterns; do not depend on it. | |
| - torchchat — inference / local deployment only; no training. Out of scope. | |
| --- | |
| ## Table of Contents | |
| 1. [Audit Methodology](#1-audit-methodology) | |
| 2. [RL Framework Audit](#2-rl-framework-audit) | |
|    1. [OpenRLHF](#21-openrlhf) | |
|    2. [PRIME-RL](#22-prime-rl) | |
|    3. [NeMo-Aligner](#23-nemo-aligner) | |
|    4. [Unsloth (RL)](#24-unsloth-rl) | |
|    5. [LLaMA-Factory](#25-llama-factory) | |
|    6. [DeepSpeed-Chat](#26-deepspeed-chat) | |
| 3. [Meta PyTorch Agentic Stack — Infra vs Training Split](#3-meta-pytorch-agentic-stack) | |
|    1. [Monarch (coordination/infra)](#31-monarch) | |
|    2. [TorchTitan (training stack)](#32-torchtitan) | |
|    3. [TorchForge (paused)](#33-torchforge) | |
|    4. [torchchat (out of scope)](#34-torchchat) | |
| 4. [Comparison Matrix](#4-comparison-matrix) | |
| 5. [Recommendation Rationale](#5-recommendation-rationale) | |
| 6. [Integration Sketches](#6-integration-sketches) | |
| 7. [Sources](#7-sources) | |
| --- | |
| ## 1. Audit Methodology | |
| For each framework, we capture five fields that determine whether it can host the Composer Replication Framework's three-channel loss (RLVR + hint-distill + trace-replay) on our existing OpenEnv-compatible TRL data path: | |
| 1. **Repo + license + last commit + maturity** — primary GitHub source, license grade for redistribution, recency, and whether the project is *production*, *research*, or *archived*. | |
| 2. **Algorithm coverage** — does it ship GRPO and DAPO out of the box? (DAPO matters because Composer-style training inherits its decoupled clip + dynamic sampling fixes for length and std biases.) | |
| 3. **Custom-loss extension point** — concrete file/class/config where a custom 3-channel loss can be plugged. We strongly prefer a stable public hook over forking. | |
| 4. **Integration cost** — rough lines of code needed for a `Recipe` doc + a skeleton `Trainer` subclass that runs end-to-end on a small env. | |
| 5. **OpenEnv data-path fit** — does it already consume the OpenEnv contract (typed `reset`/`step`/`close`, MCP tool-calling) directly, or do we have to write a shim? | |
| Primary sources: each repo's `README.md`, official releases page, and DeepWiki audits (where indexed). Secondary checks: PyPI release timelines for Meta packages. | |
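The five audit fields can be held in a small record so candidates are compared on the same axes. The sketch below is illustrative only: the class name, field names, and ranking rule are assumptions made for this example, and the sample values are copied from the audit tables in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameworkAudit:
    """One audit row. Field names are illustrative, not a shared schema."""
    name: str
    license: str
    maturity: str              # e.g. "production", "research", "archived"
    grpo: bool
    dapo: str                  # "yes", "partial", or "no"
    loss_hook: str             # where a custom loss plugs in
    config_driven_hook: bool   # True if no fork or subclass is needed
    est_loc: tuple[int, int]   # (low, high) integration estimate
    openenv_fit: str           # short description of the data-path fit


def rank_candidates(audits):
    """Assumed rule: prefer config-driven hooks, then the lower upper-bound LOC."""
    return sorted(audits, key=lambda a: (not a.config_driven_hook, a.est_loc[1]))


audits = [
    FrameworkAudit("OpenRLHF", "Apache-2.0", "production", True, "yes",
                   "openrlhf/models/loss.py + Trainer subclass", False,
                   (400, 600), "shim via agent_func_path"),
    FrameworkAudit("PRIME-RL", "Apache-2.0", "production-research", True, "partial",
                   "trainer.loss.type=custom + import_path", True,
                   (200, 300), "via verifiers"),
    FrameworkAudit("NeMo-Aligner", "Apache-2.0", "research-leaning", False, "no",
                   "Megatron model loss_func", False,
                   (800, 1200), "none; JSONL only"),
]

ranked = [a.name for a in rank_candidates(audits)]
print(ranked)
```

The ranking rule encodes the stated preference for a stable public hook over forking; a different weighting would be equally easy to express.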
| --- | |
| ## 2. RL Framework Audit | |
| ### 2.1 OpenRLHF | |
| | Field | Value | | |
| |---|---| | |
| | **Repo** | https://github.com/OpenRLHF/OpenRLHF | | |
| | **License** | Apache-2.0 | | |
| | **Stars / contributors** | 9,312 ★ / 90 contributors | | |
| | **Latest release** | v0.9.10, 2026-04-04 | | |
| | **Last push** | 2026-04-05 | | |
| | **Maturity** | **Production** — used in many public RLHF runs since 2023; tagline "An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & TIS & vLLM & Ray & Async RL)" | | |
| | **Algorithms** | PPO, GRPO, **DAPO** (release notes; advertised as a primary feature in v0.9.x), REINFORCE++, REINFORCE++-baseline, RLOO, GSPO, Async RL, TIS (truncated importance sampling) | | |
| | **Custom-loss extension point** | `openrlhf/models/loss.py` — `PolicyLoss`, `DPOLoss`, `SFTLoss`, `PairWiseLoss`, `LogExpLoss` are concrete `nn.Module`s. To add a 3-channel loss you would (a) add a new `nn.Module` (e.g. `ThreeChannelLoss`) here, then (b) subclass the relevant `Trainer` (e.g. `PPOTrainer` / a new GRPO-derived trainer) and replace `self.loss_fn`. There is **no config-driven custom-loss hook** equivalent to PRIME-RL's `CustomLossConfig` — you fork or vendor. | | |
| | **Integration cost** | Higher than PRIME-RL. Estimated **~400–600 LOC**: ~150 LOC for a `ThreeChannelLoss` module, ~200 LOC for a `ComposerGRPOTrainer` subclass that routes the three signals (RLVR scalar, hint-distill teacher logprobs, trace-replay teacher logits), ~50 LOC for a `Recipe` doc, plus reward-fn glue. | | |
| | **Data-path fit** | OpenRLHF's input is HF chat templates + a Python reward function or a remote reward URL (`--reward.remote_url`, `--train.agent_func_path`). It does **not** speak the OpenEnv `reset/step` protocol natively, but our existing OpenEnv→TRL adapter could be reused as a callable behind `agent_func_path`. **Medium** lift to wire OpenEnv. | | |
| **Verdict:** Strong, mature, well-funded codebase with the *most* complete algorithm coverage of any candidate. Loses to PRIME-RL only because PRIME-RL has a first-class config-driven custom-loss hook that fits our exact need, and PRIME-RL already has the `verifiers`/OpenEnv shape baked into the orchestrator. We keep OpenRLHF on the radar as a fallback substrate if PRIME-RL's decentralized story is overkill for v0.1. | |
| --- | |
| ### 2.2 PRIME-RL | |
| | Field | Value | | |
| |---|---| | |
| | **Repo** | https://github.com/PrimeIntellect-ai/prime-rl | | |
| | **License** | Apache-2.0 | | |
| | **Stars / contributors** | 1,398 ★ / 60 contributors | | |
| | **Latest release** | v0.5.0, 2026-03-30 | | |
| | **Last push** | 2026-05-25 (the day this audit was generated) | |
| | **Maturity** | **Production-research hybrid** — substrate behind INTELLECT-1/2 multi-DC runs; tagline "Async RL Training at Scale". Decentralized DiLoCo-shape compute is its differentiator. | | |
| | **Algorithms** | **GRPO**, GSPO, on-policy distillation with a teacher model. `default_loss_fn` = DPPO + KL (a GRPO variant; similar lineage to DAPO's decoupled-clip idea but the upstream "DAPO" label is not used verbatim). | | |
| | **Custom-loss extension point** | **Best in class.** `src/prime_rl/trainer/rl/loss.py` exposes a `LossInputs`/`LossOutputs` interface and `setup_loss_fn` resolves a config: `trainer.loss.type = "custom"` + `trainer.loss.import_path = "your_pkg.your_module.your_loss_fn"` + optional kwargs. The custom function receives `trainer_logprobs`, `inference_logprobs`, `teacher_logprobs`, `advantages`, `loss_mask` — i.e., the exact tensor inputs needed for a 3-channel loss (RLVR uses `advantages`, hint-distill uses `teacher_logprobs`, trace-replay can be threaded through `kwargs` as a precomputed reference). | | |
| | **Integration cost** | **Lowest.** Estimated **~200–300 LOC total**: ~120 LOC for a `composer_three_channel_loss` function in our package + ~30 LOC of config (`recipes/composer_v0.toml`), ~80 LOC `Recipe` doc. No subclassing required for the loss. A small adapter is needed if we precompute the trace-replay teacher distribution outside the `LossInputs` struct. | | |
| | **Data-path fit** | **Already aligned.** PRIME-RL's orchestrator consumes `verifiers` environments via `vf.EnvServer`. The OpenEnv ↔ verifiers shim is a known small adapter (the `verifiers` library is the Hub-side env runner that OpenEnv's TRL guide already uses). Our existing OpenEnv-compatible TRL data path drops in with a thin wrapper. | | |
| **Verdict:** Best fit for the framework. The combination of (i) config-driven custom loss with the right tensor signatures already present, (ii) verifiers/OpenEnv shape, (iii) decentralized async training that maps to our DiLoCo plans, makes PRIME-RL the substrate of choice for v0.1. **Recommended addition #1.** | |
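To make the "thin wrapper" claim concrete, here is a minimal sketch of driving an environment that exposes the `reset`/`step`/`close` surface described in the methodology. Everything in it is hypothetical: `EnvLike`, `StepResult`, `rollout`, and `EchoEnv` are names invented for this example, and neither the OpenEnv nor the `verifiers` API is reproduced here (both use typed actions and observations served over the network).

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class StepResult:
    observation: str
    reward: float
    done: bool


class EnvLike(Protocol):
    """Hypothetical minimal surface: reset / step / close."""
    def reset(self) -> str: ...
    def step(self, action: str) -> StepResult: ...
    def close(self) -> None: ...


def rollout(env: EnvLike, policy: Callable[[str], str], max_turns: int = 8):
    """Run one episode and return (transcript, total_reward)."""
    transcript, total = [], 0.0
    obs = env.reset()
    try:
        for _ in range(max_turns):
            action = policy(obs)
            result = env.step(action)
            transcript.append((obs, action, result.reward))
            total += result.reward
            obs = result.observation
            if result.done:
                break
    finally:
        env.close()  # always release the environment, even on error
    return transcript, total


class EchoEnv:
    """Toy environment: rewards the action 'pass' and ends the episode."""
    def __init__(self):
        self.closed = False

    def reset(self):
        return "run tests"

    def step(self, action):
        ok = action == "pass"
        return StepResult("done" if ok else "run tests", 1.0 if ok else 0.0, ok)

    def close(self):
        self.closed = True


env = EchoEnv()
transcript, total = rollout(env, lambda obs: "pass")
```

An adapter between two env protocols is mostly a translation of these three calls plus the observation and action types, which is why the lift is expected to be small.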
| --- | |
| ### 2.3 NeMo-Aligner | |
| | Field | Value | | |
| |---|---| | |
| | **Repo** | https://github.com/NVIDIA/NeMo-Aligner | | |
| | **License** | Apache-2.0 | | |
| | **Maturity** | **Research-leaning production** — NVIDIA-maintained, tied to NeMo/Megatron-LM. Advertised as "early stages of development" in its own README. | | |
| | **Algorithms** | PPO, REINFORCE, RS (Rejection Sampling), DPO, RPO. **No GRPO. No DAPO.** | | |
| | **Custom-loss extension point** | `loss_func` method on Megatron model classes (e.g. `MegatronGPTDPOModel.loss_func`). Requires NeMo model-class subclassing and Megatron-LM familiarity. | | |
| | **Integration cost** | High. Estimated **~800–1,200 LOC** including .nemo conversion of HF weights, Megatron model wrapping, custom Megatron `loss_func`, and a recipe. Plus the operational cost of running on Megatron-LM (Triton kernels, NeMo container). | | |
| | **Data-path fit** | JSONL only; no OpenEnv. We'd write a full env adapter. | | |
| **Verdict:** Wrong shape. No GRPO/DAPO and tightly bound to the NeMo ecosystem. Only relevant if we ever need NVIDIA-supported large-scale Megatron RL, which we don't for the Composer Replication v0.1/v0.2 horizon. **Reject.** | |
| --- | |
| ### 2.4 Unsloth (RL) | |
| | Field | Value | | |
| |---|---| | |
| | **Repo** | https://github.com/unslothai/unsloth | | |
| | **License** | Apache-2.0 (per public README; not surfaced by DeepWiki snapshot but well-known) | | |
| | **Maturity** | **Production** for SFT and LoRA/QLoRA; **research/preview** for RL — RL support shipped in 2025 as a TRL patcher. | | |
| | **Algorithms** | Wraps TRL → inherits TRL's GRPO; loss-type switch supports `"grpo"`, `"bnpo"`, `"dr_grpo"`, `"dapo"`, `"cispo"`. So **GRPO and DAPO are both available** through the patched-TRL path. | | |
| | **Custom-loss extension point** | Problematic. The actual loss kernels live in `unsloth_zoo` (a *separate* compiled dependency). The patcher (`patch_trl_rl_trainers()`) generates modified TRL trainer classes via `exec()` from string templates. To add a new loss type you would have to (a) modify or fork `unsloth_zoo` to add a kernel, (b) extend `RL_REPLACEMENTS`, and (c) extend the `compute_loss()` switch in the patcher template. **There is no public Python subclass hook that survives the patching.** | | |
| | **Integration cost** | Very high if we want our own loss. Forking `unsloth_zoo` defeats the purpose of using Unsloth (which is the optimized kernels). Estimated ~1,000+ LOC plus an external repo to maintain. | | |
| | **Data-path fit** | TRL-shaped, so OpenEnv via TRL is fine — but only for *stock* TRL losses. Our 3-channel loss does not survive Unsloth's patching. | | |
| **Verdict:** Excellent for memory-efficient SFT and stock-GRPO LoRA. Wrong tool for a custom loss. **Reject** as the substrate; we may still use it as an *optional* QLoRA accelerator inside a stock-GRPO ablation run. | |
| --- | |
| ### 2.5 LLaMA-Factory | |
| | Field | Value | | |
| |---|---| | |
| | **Repo** | https://github.com/hiyouga/LLaMA-Factory | | |
| | **License** | Apache-2.0 | | |
| | **Maturity** | **Production** for breadth (50+ model families, SFT/DPO/PPO recipes), but RL is a thin TRL wrapper. | | |
| | **Algorithms** | PPO, DPO, KTO, ORPO, SimPO via `Custom*Trainer` subclasses of the corresponding `trl.*Trainer` classes. **No GRPO. No DAPO** in the repo itself; the README points to **EasyR1** (an external GRPO framework) for those. | | |
| | **Custom-loss extension point** | `compute_preference_loss` switch on `CustomDPOTrainer` (selects `sigmoid` / `hinge` / `ipo` / `kto_pair` / `orpo` / `simpo`). For PPO, you would subclass `CustomPPOTrainer` → which is `trl.PPOTrainer`. Effectively the same extension story as plain TRL, with a configuration layer on top. | | |
| | **Integration cost** | Moderate, ~400 LOC, but you are essentially using TRL through one extra layer. | | |
| | **Data-path fit** | Text/dataset-shaped, not OpenEnv-aware. Same OpenEnv-via-TRL story. | | |
| **Verdict:** Useful as a multi-model SFT laboratory but does not move the ball for our RL-side requirements. **Reject** as substrate; we already have TRL. | |
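For readers unfamiliar with what a loss-type switch of this kind does, the sketch below shows three of the listed variants in their commonly used forms, written on a scalar so the arithmetic is visible. It is not LLaMA-Factory's code: the function name and signature are assumptions, `logits` stands for the policy-minus-reference log-ratio difference between chosen and rejected responses, and the `kto_pair`, `orpo`, and `simpo` branches are omitted.

```python
import math


def preference_loss(logits: float, beta: float = 0.1, loss_type: str = "sigmoid") -> float:
    """Scalar preference loss for one (chosen, rejected) pair."""
    if loss_type == "sigmoid":
        # -log(sigmoid(x)) computed as a numerically stable softplus(-x).
        x = beta * logits
        return math.log1p(math.exp(-x)) if x >= 0 else -x + math.log1p(math.exp(x))
    if loss_type == "hinge":
        return max(0.0, 1.0 - beta * logits)
    if loss_type == "ipo":
        return (logits - 1.0 / (2.0 * beta)) ** 2
    raise ValueError(f"unsupported loss_type: {loss_type!r}")
```

Adding a new branch to such a switch is easy for pairwise preference losses, but it does not provide the per-token teacher log-probabilities that the hint-distill and trace-replay channels need.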
| --- | |
| ### 2.6 DeepSpeed-Chat | |
| | Field | Value | | |
| |---|---| | |
| | **Repo** | https://github.com/deepspeedai/DeepSpeedExamples (the `applications/DeepSpeed-Chat/` subtree) | | |
| | **License** | Apache-2.0 | | |
| | **Maturity** | **Effectively stale.** The README's "Latest News" cuts off in August 2023. CI patches in 2025 (e.g., #6982, #7015, #7052) are dependency-pinning fixes, not feature work. The roadmap to "generalize DeepSpeed-RLHF abstraction for a wider range of RL algorithms" has not landed. | | |
| | **Algorithms** | PPO (3-stage RLHF) + DPO. **No GRPO. No DAPO.** | | |
| | **Custom-loss extension point** | `DeepSpeedPPOTrainer.train_rlhf` / `actor_loss_fn` / `critic_loss_fn`. Editable but not config-hooked. | | |
| | **Integration cost** | Moderate, but you inherit a frozen architecture. ~500 LOC. | | |
| | **Data-path fit** | Prompt-dataset-shaped; no OpenEnv. | | |
| **Verdict:** Pioneering for its time, no longer competitive on algorithm coverage. **Reject.** | |
| --- | |
| ## 3. Meta PyTorch Agentic Stack — Infra vs Training Split | |
| The brief asked specifically to **distinguish coordination/infra from training-stack** components. The answer is: | |
| | Component | Layer | Status (May 2026) | In our framework? | | |
| |---|---|---|---| | |
| | **Monarch** (`meta-pytorch/monarch`) | **Coordination / Infra** — actor mesh, RDMA data plane, supervision trees | **Active.** v0.4 GA (2026-03-26), v0.5 dev wheels daily, BSD-3 | **Yes — recommended addition.** | | |
| | **TorchTitan** (`pytorch/torchtitan`) | **Training stack** — FSDP2 / TP / PP / CP / float8 / MXFP8 | **Active.** BSD-3, "extensive development". Has an experimental GRPO recipe (`experiments/rl/simple_grpo_sum_digits.py`) on Monarch. | **Indirectly** — already the trainer inside PRIME-RL and TorchForge. We adopt it transitively, not as a direct dependency. | | |
| | **TorchForge** (`meta-pytorch/forge`) | RL post-training library | **Development paused** per the repo banner; consolidating into TorchTitan. ~685★. | **Pattern reference only.** Lift the Generator/Trainer/Rewarder *shape* but do not depend on the package. | | |
| | **torchchat** (`pytorch/torchchat`) | **Inference / local deployment** | Active for its own scope, but: not a training framework; no RL surface. | **Out of scope.** | | |
| | **OpenEnv** (`meta-pytorch/OpenEnv`) | Environment standard (covered separately) | Active. Already a v0 dependency of the framework. | Already adopted. | | |
| ### 3.1 Monarch | |
| | Field | Value | | |
| |---|---| | |
| | **Repo** | https://github.com/meta-pytorch/monarch | | |
| | **License** | BSD-3-Clause | | |
| | **PyPI** | `torchmonarch`; v0.4.1 stable (2026-04-08), v0.5.0 dev wheels published daily through 2026-05-05 | | |
| | **Maturity** | **Experimental but actively shipped.** "Currently in an experimental stage" per the repo's own status note, but with a functioning K8s operator, weekly wheels, ProcessMesh/ActorMesh APIs stable enough for VeRL backend experiments. | | |
| | **Role in our stack** | **Pure coordination/infra.** It does not train models. It hosts whatever trainer you bring (TRL, VeRL, PRIME-RL, TorchTitan) as `Actor` subclasses on a `ProcMesh`. The `monarch.spmd.SPMDActor` automatically configures `RANK`/`LOCAL_RANK`/`WORLD_SIZE` for any PyTorch-distributed script — i.e., we can lift our existing TRL or PRIME-RL workers into Monarch with minimal change. | | |
| | **Key abstractions** | `ProcMesh` (processes × hosts × GPUs), `ActorMesh` (typed actors with `@endpoint` methods), supervision trees, RDMA buffers, distributed tensors / DTensor integration. Underlying runtime: `hyperactor` (Rust). | | |
| | **Why over Ray** | Tighter PyTorch/DTensor integration; explicit RDMA data plane (Ray uses object store + standard networking); single-controller mental model maps directly to RL post-training (one controller orchestrates Generator + Trainer + Rewarder + Env actors). | | |
| | **Integration cost into Composer Replication** | **~300 LOC + ops**: (a) wrap our PRIME-RL trainer as an `SPMDActor`; (b) wrap our vLLM rollout server as an `Actor` with an `@endpoint generate(prompts)` method; (c) write a single controller script that creates a `ProcMesh`, spawns both meshes, and shuttles `DataProto`-shaped messages; (d) Recipe doc. The ops cost is the harder half — Monarch's K8s operator is new (v0.2.0+). | | |
| | **Risk** | Pre-1.0; API churn possible (e.g., `KubernetesJob.add_mesh` signature changed in v0.5). Mitigation: pin to `torchmonarch==0.4.1` for v0.2 of our framework. | | |
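As a reminder of what configuring `RANK`/`LOCAL_RANK`/`WORLD_SIZE` means for an SPMD script, the sketch below computes those variables for a hosts × GPUs layout. It is not Monarch's implementation: `spmd_env` is a hypothetical helper, and the host-major rank ordering is an assumption.

```python
def spmd_env(hosts: int, gpus_per_host: int) -> list[dict[str, str]]:
    """Return one environment dict per process, in host-major order."""
    world_size = hosts * gpus_per_host
    envs = []
    for host in range(hosts):
        for local_rank in range(gpus_per_host):
            envs.append({
                "RANK": str(host * gpus_per_host + local_rank),  # global index
                "LOCAL_RANK": str(local_rank),                   # index on this host
                "WORLD_SIZE": str(world_size),
            })
    return envs


envs = spmd_env(hosts=2, gpus_per_host=4)
```

A script that reads only these variables (as `torch.distributed` launchers conventionally provide them) does not need to know which launcher set them, which is what makes lifting an existing worker into a different control plane cheap.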
| ### 3.2 TorchTitan | |
| | Field | Value | | |
| |---|---| | |
| | **Repo** | https://github.com/pytorch/torchtitan | | |
| | **License** | BSD-3-Clause | | |
| | **Maturity** | **Active development** for pretraining; **experimental** for RL. The GRPO experiment (`torchtitan/experiments/rl/simple_grpo_sum_digits.py`) is in `experiments/`, which the repo explicitly disclaims as removable. | | |
| | **Role** | **Training stack only.** Provides FSDP2 (per-parameter sharding), Tensor Parallel (incl. async TP), Pipeline Parallel (zero-bubble), Context Parallel (long-context), `torch.compile`, Float8, MXFP8, DDP, HSDP. | | |
| | **OpenEnv-aware?** | No, but the experimental `RLTrainer` integrates `vLLM` + Monarch actors, which is the same shape PRIME-RL uses. | | |
| | **Why we don't add it directly** | **PRIME-RL already uses TorchTitan-equivalent FSDP2 internals**, and TorchForge's training core was TorchTitan. Adding TorchTitan as a *direct* dependency would mean writing our own RL loop on top of it — that's TorchForge's job, and Meta paused exactly that effort. The right move is to depend on PRIME-RL, which has battle-tested distributed training patterns equivalent to TorchTitan's, and revisit TorchTitan directly only when we genuinely need its experimental zero-bubble PP or MXFP8 paths. | | |
| ### 3.3 TorchForge (Paused) | |
| - Repo banner: **"Development paused — LLM training consolidating in TorchTitan."** | |
| - ~685 ★, 100+ open issues, last meaningful release in early 2026. | |
| - Patterns we should still copy: | |
|   - Generator/Trainer/Rewarder ActorMesh decomposition | |
|   - TorchStore-style RDMA weight broadcast | |
|   - Async toggle between sync PPO-like and fully async off-policy | |
| - **We do not add a TorchForge dependency.** Architectural reference only. | |
| ### 3.4 torchchat (Out of Scope) | |
| - Inference / local deployment of LLMs (Eager / `torch.compile` / AOT Inductor / ExecuTorch / mobile). | |
| - No training, no RL. | |
| - Mentioned in the brief for completeness; ruled out cleanly. | |
| --- | |
| ## 4. Comparison Matrix | |
| ### 4.1 RL Frameworks | |
| | Framework | License | Last release | Maturity | GRPO | DAPO | Custom-loss hook | OpenEnv fit | Est. integration LOC | | |
| |---|---|---|---|---|---|---|---|---| | |
| | **TRL** (baseline) | Apache-2.0 | Active | Production | ✅ | partial (tricks land per release) | Subclass `GRPOTrainer.compute_loss` | ✅ native (Oct 2025 OpenEnv guide) | already integrated | | |
| | **VeRL** (baseline) | Apache-2.0 | Active | Production | ✅ | ✅ | `core_algos.py` + worker subclass | shim via Ray dataloader | already skeleton | | |
| | **OpenRLHF** | Apache-2.0 | v0.9.10 (2026-04-04) | Production | ✅ | ✅ | `openrlhf/models/loss.py` + Trainer subclass; **no config hook** | shim via `agent_func_path` | ~400–600 | | |
| | **PRIME-RL** ⭐ | Apache-2.0 | v0.5.0 (2026-03-30) | Prod-research | ✅ | partial (DPPO+KL variant; not labeled DAPO) | **`CustomLossConfig` import_path — first-class** | ✅ via `verifiers` (OpenEnv-compatible) | **~200–300** | | |
| | **NeMo-Aligner** | Apache-2.0 | Active | Research-leaning | ❌ | ❌ | Megatron model `loss_func` | none; JSONL only | ~800–1,200 | | |
| | **Unsloth (RL)** | Apache-2.0 | Active | Production (SFT) / preview (RL) | ✅ (via TRL patch) | ✅ (via TRL patch) | Loss kernels in the separate `unsloth_zoo` package; effectively unhookable | TRL-shaped | ~1,000+ (forking) | |
| | **LLaMA-Factory** | Apache-2.0 | Active | Production | ❌ (delegates to EasyR1) | ❌ | TRL `Custom*Trainer` subclass | TRL-shaped | ~400 | | |
| | **DeepSpeed-Chat** | Apache-2.0 | Stale (last feature news Aug 2023; 2025 activity is CI fixes only) | Maintenance-only | ❌ | ❌ | `DeepSpeedPPOTrainer` subclass | none | ~500 | |
| ### 4.2 Meta PyTorch Stack | |
| | Component | Layer | License | Status | In recommendation? | | |
| |---|---|---|---|---| | |
| | **Monarch** ⭐ | Coordination / actor mesh | BSD-3 | Active (v0.4 GA, v0.5 dev) | **Yes** | | |
| | **TorchTitan** | Training stack | BSD-3 | Active; RL experimental | Indirect (via PRIME-RL) | | |
| | **TorchForge** | RL library | BSD-3 | **Paused** | No — patterns only | | |
| | **torchchat** | Inference / deployment | BSD-3 | Active | No — out of scope | | |
| | **OpenEnv** | Environment standard | (Hub) | Active | Already adopted | | |
| --- | |
| ## 5. Recommendation Rationale | |
| ### 5.1 Why PRIME-RL, not OpenRLHF | |
| OpenRLHF is in many ways the safer pick: more stars, more contributors, more algorithm coverage (it explicitly ships DAPO). The deciding factor is **the shape of our custom loss**. | |
| The Composer Replication Framework's signature contribution is the **three-channel loss**: | |
| 1. **RLVR** — tests-pass scalar from the OpenEnv environment. | |
| 2. **Composer-style hint-distill (SDPO/OPSD)** — the model self-teaches against its own hint-conditioned roll-outs; needs `teacher_logprobs` aligned to the rollout token grid. | |
| 3. **Trace-replay multi-teacher PRM** (the novel bit) — N frozen external teachers' precomputed token-level distributions, replayed against the on-policy rollout. | |
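The way the three channels combine into one scalar can be shown on plain Python lists. This is a sketch under stated assumptions, not the framework's implementation: the clipped surrogate is the generic PPO/GRPO form, both distillation channels use a sampled-token estimate of KL(policy ‖ teacher) (the mean log-probability difference on the rollout tokens), and all function names are hypothetical. The weights 0.5 and 0.25 are the values used in the TOML sketch in §6.1.

```python
import math


def masked_mean(values, mask):
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


def clipped_surrogate(trainer_lp, inference_lp, advantages, mask, eps=0.2):
    """Generic clipped policy-gradient loss on per-token log-probs."""
    terms = []
    for t, i, a in zip(trainer_lp, inference_lp, advantages):
        ratio = math.exp(t - i)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        terms.append(-min(ratio * a, clipped * a))
    return masked_mean(terms, mask)


def sampled_kl(trainer_lp, teacher_lp, mask):
    """Sampled-token estimate of KL(policy || teacher)."""
    return masked_mean([t - q for t, q in zip(trainer_lp, teacher_lp)], mask)


def three_channel(trainer_lp, inference_lp, advantages, hint_lp, replay_lp, mask,
                  hint_weight=0.5, replay_weight=0.25):
    rlvr = clipped_surrogate(trainer_lp, inference_lp, advantages, mask)
    hint = sampled_kl(trainer_lp, hint_lp, mask)
    replay = sampled_kl(trainer_lp, replay_lp, mask)
    total = rlvr + hint_weight * hint + replay_weight * replay
    return total, {"rlvr": rlvr, "hint": hint, "replay": replay}


total, parts = three_channel(
    trainer_lp=[-1.0, -2.0, -0.5],
    inference_lp=[-1.0, -2.0, -0.5],   # on-policy: importance ratio is 1
    advantages=[1.0, 1.0, 1.0],
    hint_lp=[-1.5, -2.5, -9.0],
    replay_lp=[-1.0, -2.0, -9.0],
    mask=[1, 1, 0],                    # last token is excluded from the loss
)
```

Each channel needs only per-token log-probabilities aligned to the rollout token grid plus a mask, which is the requirement the following paragraph checks against the available loss inputs.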
| PRIME-RL's `LossInputs` dataclass already exposes exactly the tensors we need: | |
| ``` | |
| trainer_logprobs, inference_logprobs, teacher_logprobs, advantages, loss_mask | |
| ``` | |
| A custom 3-channel loss is roughly: | |
| ```python | |
| def composer_three_channel_loss(li: LossInputs, *, hint_weight, replay_weight, replay_logits) -> LossOutputs: | |
|     rlvr = grpo_term(li.trainer_logprobs, li.inference_logprobs, li.advantages, li.loss_mask) | |
|     hint = kl_term(li.trainer_logprobs, li.teacher_logprobs, li.loss_mask) | |
|     replay = kl_term(li.trainer_logprobs, replay_logits, li.loss_mask) | |
|     return LossOutputs(loss=rlvr + hint_weight * hint + replay_weight * replay, ...) | |
| ``` | |
| We register this with `trainer.loss.type = "custom"` + `import_path` and we're done. No subclassing, no `exec()`-patched template, no Megatron model wrapping. | |
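A config-driven hook of this kind usually reduces to resolving a string to a callable and binding the configured kwargs. The sketch below shows that general mechanism with the standard library only; it is not PRIME-RL's `setup_loss_fn`, and `resolve_import_path` and `build_loss_fn` are hypothetical names. It accepts both spellings that appear in this document (`pkg.module:attr` and `pkg.module.attr`).

```python
import importlib
from functools import partial
from typing import Any, Callable


def resolve_import_path(path: str) -> Callable[..., Any]:
    """Resolve 'pkg.module:attr' or 'pkg.module.attr' to a callable."""
    if ":" in path:
        module_name, attr = path.split(":", 1)
    else:
        module_name, _, attr = path.rpartition(".")
    if not module_name or not attr:
        raise ValueError(f"not an import path: {path!r}")
    obj = getattr(importlib.import_module(module_name), attr)
    if not callable(obj):
        raise TypeError(f"{path!r} does not resolve to a callable")
    return obj


def build_loss_fn(loss_cfg: dict) -> Callable[..., Any]:
    """Bind the configured kwargs onto the resolved function."""
    fn = resolve_import_path(loss_cfg["import_path"])
    return partial(fn, **loss_cfg.get("kwargs", {}))


# Demo: a stdlib function stands in for a real loss function.
loss_fn = build_loss_fn({
    "type": "custom",
    "import_path": "json:dumps",
    "kwargs": {"sort_keys": True},
})
print(loss_fn({"b": 1, "a": 2}))
```

Because the loss lives in our own package and is referenced by string, upgrading the trainer does not require rebasing a fork, only re-checking that the input and output types still match.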
| OpenRLHF would require us to (a) add a `ThreeChannelLoss` `nn.Module` to `openrlhf/models/loss.py`, (b) subclass `PPOTrainer` (or equivalent GRPO trainer) to construct it with the right teacher-logprob plumbing, and (c) carry that fork forward. ~2× the LOC, plus a fork to maintain. | |
| A second factor: PRIME-RL's `verifiers` env protocol is a direct precursor of OpenEnv's wire shape (HTTP/WebSocket env servers, typed observations). Our existing OpenEnv-compatible TRL data path translates with a thin adapter. OpenRLHF's `agent_func_path` is more of an escape hatch than a contract. | |
| A third factor: PRIME-RL was *built for decentralized training* (INTELLECT-1/2). Even though our v0.1 stays on a single cluster, the v0.2 multi-DC story drops in cleanly. OpenRLHF is Ray-on-one-cluster by design. | |
| ### 5.2 Why Monarch, not TorchTitan or TorchForge | |
| Among the four Meta-stack components in the brief, only one is both (a) ours to add and (b) genuinely new functionality: | |
| - **TorchForge** is paused — depending on it now is a known dead end. | |
| - **TorchTitan** is already inside PRIME-RL transitively (PRIME-RL uses FSDP2 plus a SHARDCAST weight-broadcast layer that is morally equivalent to what TorchTitan offers). Adding TorchTitan as a *direct* dependency means writing our own RL loop on top of it, which is exactly what TorchForge tried and paused. We get TorchTitan's benefits without owning the integration. | |
| - **torchchat** is for local inference / mobile deployment — out of scope. | |
| - **Monarch** is the unique value: a PyTorch-native actor mesh that lets us replace Ray (PRIME-RL's current orchestration substrate) with something that has explicit RDMA, supervision trees, and ProcMesh/ActorMesh primitives that map directly onto our (Generator, Trainer, Rewarder, EnvServer) topology. | |
| The migration path is incremental: | |
| - **v0.1:** PRIME-RL on Ray (current). Monarch listed as roadmap. | |
| - **v0.2:** Wrap PRIME-RL's Trainer as a `monarch.spmd.SPMDActor`, vLLM Generator as an `Actor` with an `@endpoint generate()`. Switch the orchestrator from `ray.init()` to `this_host().spawn_procs()`. | |
| - Risk-mitigation: pin to `torchmonarch==0.4.1` (the last GA release before v0.5 dev). Keep a Ray fallback path active until v0.2 is stable. | |
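The pin-plus-fallback policy can be checked at start-up. The following is a minimal sketch using only `importlib.metadata`; `check_pin` and `choose_backend` are hypothetical helpers, and the backend names are placeholders rather than configuration values from any of the frameworks discussed.

```python
from importlib import metadata


def check_pin(package: str, pinned: str):
    """Return (ok, message) comparing the installed version with the pin."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return False, f"{package} is not installed (expected {pinned})"
    if installed != pinned:
        return False, f"{package}=={installed} installed, expected {pinned}"
    return True, f"{package}=={pinned}"


def choose_backend(pin_ok: bool) -> str:
    """Use the Monarch path only when the pinned release is present."""
    return "monarch" if pin_ok else "ray"


ok, message = check_pin("torchmonarch", "0.4.1")
backend = choose_backend(ok)
print(backend, "-", message)
```

Failing closed to the existing orchestration path keeps a pre-1.0 dependency from blocking a run when the environment drifts.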
| --- | |
| ## 6. Integration Sketches | |
| ### 6.1 PRIME-RL Recipe skeleton | |
| > **Realised in v0.1 (Wave 17 update):** Wave 14b shipped the PRIME-RL | |
| > recipe at `composer_replication/recipes/prime_rl/prime_rl_config.yaml` | |
| > as **YAML** with a different kwarg surface than the TOML sketch below. | |
| > The actual recipe shape: | |
| > | |
| > ```yaml | |
| > # composer_replication/recipes/prime_rl/prime_rl_config.yaml | |
| > model: | |
| >   base: "Qwen/Qwen2.5-0.5B" | |
| >   attn_implementation: "flash_attention_2" | |
| >   dtype: "bfloat16" | |
| > env: | |
| >   protocol: "verifiers" | |
| >   config: { name: "math/gsm8k", split: "train" } | |
| > loss: | |
| >   custom: | |
| >     import_path: "composer_replication.recipes.prime_rl.composer_loss:loss_fn" | |
| >     kwargs: | |
| >       alpha_sdpo: 0.0       # channel 2 deferred in v0 | |
| >       beta_dpo: 0.0         # channel 3 out-of-scope for PRIME-RL v0 | |
| >       dppo_mask_high: 0.2   # PRIME-RL DPPO convention (NOT textbook PPO) | |
| >       dppo_mask_low: 0.2    # both must be >= 0 per Field(..., ge=0) | |
| >       adv_tau: 1.0          # advantage normalization | |
| >       kl_tau: 0.04          # KL coefficient | |
| > ``` | |
| > | |
| > The realised `loss_fn(inputs, **kwargs)` matches PRIME-RL's | |
| > `LossInputs`/`LossOutputs` interface (read upstream | |
| > `src/prime_rl/trainer/rl/loss.py` for parity verification; Wave 14b's | |
| > shadow-parity test independently restates the formula in | |
| > `composer_replication/recipes/prime_rl/tests/test_composer_loss.py`). | |
| > | |
| > The pre-Wave-14b TOML/`hint_weight`/`replay_weight` sketch below is | |
| > preserved as historical proposal context. | |
| `recipes/composer_v0_prime_rl.toml` (~30 LOC): | |
| ```toml | |
| # composer_v0_prime_rl.toml | |
| [model] | |
| name = "Qwen/Qwen3-32B" # or Kimi-K2.5 when MoE support lands | |
| [data] | |
| env = "swe_bench_lite" # via verifiers EnvServer; wraps our OpenEnv adapter | |
| batch_size = 64 | |
| group_size = 16 | |
| [trainer] | |
| algorithm = "grpo" | |
| [trainer.loss] | |
| type = "custom" | |
| import_path = "composer_replication.recipes.prime_rl.composer_loss:loss_fn" | |
| [trainer.loss.kwargs] | |
| hint_weight = 0.5 | |
| replay_weight = 0.25 | |
| replay_logits_path = "/data/teachers/precomputed_replay.zarr" | |
| [teacher] | |
| model = "Qwen/Qwen3-32B" # same as policy = self-teacher for hint-distill | |
| hint_template = "composer.hint_v1" | |
| [orchestrator] | |
| sync_mode = "async" | |
| shardcast = true | |
| ``` | |
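Because the realised YAML recipe and the historical TOML sketch use different kwarg names, it can help to validate the loss kwargs before a run starts. The sketch below is an assumption-laden illustration: `ComposerLossKwargs` and `load_kwargs` are hypothetical names, the defaults are copied from the realised YAML recipe, and only the non-negativity constraint on the two DPPO mask bounds comes from the source.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ComposerLossKwargs:
    """Hypothetical container for the realised recipe's loss kwargs."""
    alpha_sdpo: float = 0.0
    beta_dpo: float = 0.0
    dppo_mask_high: float = 0.2
    dppo_mask_low: float = 0.2
    adv_tau: float = 1.0
    kl_tau: float = 0.04

    def __post_init__(self):
        # Both DPPO mask bounds must be non-negative.
        for name in ("dppo_mask_high", "dppo_mask_low"):
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must be >= 0")


def load_kwargs(raw: dict) -> ComposerLossKwargs:
    """Reject keys that are not part of the realised kwarg surface."""
    known = {f.name for f in fields(ComposerLossKwargs)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown loss kwargs: {sorted(unknown)}")
    return ComposerLossKwargs(**raw)


cfg = load_kwargs({"kl_tau": 0.04, "adv_tau": 1.0})
```

With this check, passing the TOML sketch's `hint_weight` or `replay_weight` to the realised `loss_fn` fails fast instead of being silently ignored.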
| `composer_replication/recipes/prime_rl/composer_loss.py` (~120 LOC; the current Wave 14b | |
| implementation defines `loss_fn(inputs, **kwargs)` rather than the | |
| `composer_three_channel_loss(li, *, hint_weight, replay_weight, replay_logits_handle)` | |
| signature sketched below): | |
| ```python | |
| # composer_replication/recipes/prime_rl/composer_loss.py (sketch only; | |
| # the actual signature evolved during Wave 14b). See module docstring for | |
| # the current `loss_fn` contract. | |
| # grpo_surrogate, masked_kl, and trace_replay_kl are helpers assumed to be | |
| # defined elsewhere in this module; they are not PRIME-RL APIs. | |
| from prime_rl.trainer.rl.loss import LossInputs, LossOutputs | |
| def composer_three_channel_loss( | |
|     li: LossInputs, | |
|     *, | |
|     hint_weight: float, | |
|     replay_weight: float, | |
|     replay_logits_handle: str, | |
| ) -> LossOutputs: | |
|     # 1. RLVR via GRPO surrogate | |
|     rlvr = grpo_surrogate(li.trainer_logprobs, li.inference_logprobs, | |
|                           li.advantages, li.loss_mask) | |
|     # 2. Hint-distill: KL(policy || hint-conditioned teacher) | |
|     hint = masked_kl(li.trainer_logprobs, li.teacher_logprobs, li.loss_mask) | |
|     # 3. Trace-replay: KL(policy || precomputed multi-teacher mixture) | |
|     replay = trace_replay_kl(li.trainer_logprobs, replay_logits_handle, li.loss_mask) | |
|     total = rlvr + hint_weight * hint + replay_weight * replay | |
|     return LossOutputs( | |
|         loss=total, | |
|         metrics={"rlvr": rlvr.item(), "hint": hint.item(), "replay": replay.item()}, | |
|     ) | |
| ``` | |
| Plus `docs/recipes/composer_v0_prime_rl.md` (~50 LOC) describing data layout, teacher precomputation, and reproducibility hashes. | |
| **Total: ~200 LOC of code + ~30 LOC config + ~50 LOC docs ≈ 280 LOC.** | |
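The trace-replay term in the loss sketch is described as a KL against a precomputed multi-teacher mixture. One way that could be computed is shown below on full per-token distributions. This is an assumption-heavy illustration: the uniform mixture weights, the function names, and the use of dense distributions (rather than a stored handle to precomputed logits) are choices made for this example only.

```python
import math


def mixture(teacher_dists, weights=None):
    """Weighted average of N teacher distributions over the same vocabulary."""
    n = len(teacher_dists)
    weights = weights or [1.0 / n] * n
    vocab = len(teacher_dists[0])
    return [sum(w * d[v] for w, d in zip(weights, teacher_dists)) for v in range(vocab)]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; zero-mass terms in p are skipped."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def trace_replay_kl(policy_dists, teacher_dists_per_token, loss_mask):
    """Masked mean over token positions of KL(policy_t || teacher mixture_t)."""
    total, count = 0.0, 0
    for p, teachers, m in zip(policy_dists, teacher_dists_per_token, loss_mask):
        if m:
            total += kl(p, mixture(teachers))
            count += 1
    return total / count if count else 0.0


policy = [[0.5, 0.5], [0.9, 0.1]]
teachers = [
    [[0.5, 0.5], [0.5, 0.5]],   # token 0: both teachers agree with the policy
    [[1.0, 0.0], [0.8, 0.2]],   # token 1: the uniform mixture is [0.9, 0.1]
]
value = trace_replay_kl(policy, teachers, loss_mask=[1, 1])
```

Since the teachers are frozen, their distributions can be computed once and stored, which is why the sketches pass a path or handle rather than live teacher models.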
| ### 6.2 Monarch wrap-up sketch (v0.2) | |
| ```python | |
| # composer_replication/orchestrator/monarch_runner.py (~120 LOC) | |
| # Sketch only: the import paths and mesh-spawning calls are illustrative | |
| # and must be checked against the pinned torchmonarch release. | |
| from monarch.actor import Actor, endpoint | |
| from monarch.proc_mesh import this_host, ProcMesh | |
| class TrainerActor(Actor): | |
|     @endpoint | |
|     async def step(self, batch): ... | |
| class GeneratorActor(Actor): | |
|     @endpoint | |
|     async def generate(self, prompts): ... | |
| class RewarderActor(Actor): | |
|     @endpoint | |
|     async def score(self, traj): ... | |
| async def main(cfg): | |
|     train_mesh = await this_host().spawn_procs(TrainerActor, hosts=4, gpus=8) | |
|     gen_mesh = await this_host().spawn_procs(GeneratorActor, hosts=2, gpus=8) | |
|     rew_mesh = await this_host().spawn_procs(RewarderActor, hosts=1, gpus=2) | |
|     # Plain loop: range() is not an async iterator. | |
|     for step in range(cfg.steps): | |
|         prompts = await env.batch()  # `env`: environment client, defined elsewhere | |
|         traj = await gen_mesh.generate.broadcast(prompts) | |
|         rewards = await rew_mesh.score.broadcast(traj) | |
|         await train_mesh.step.broadcast({"traj": traj, "rewards": rewards}) | |
| ``` | |
| **Total: ~120 LOC controller + ~50 LOC ops (K8s operator manifest) + ~80 LOC recipe doc ≈ 250 LOC.** | |
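The control flow of the single-controller loop can be exercised without Monarch at all. The stand-in below uses plain `asyncio` objects in place of actor meshes; every class and function in it is hypothetical and exists only to show the generate → score → train sequence per step, not any Monarch API.

```python
import asyncio


class Generator:
    async def generate(self, prompts):
        await asyncio.sleep(0)  # stand-in for an inference call
        return [p + " -> patch" for p in prompts]


class Rewarder:
    async def score(self, trajectories):
        await asyncio.sleep(0)  # stand-in for running tests
        return [1.0 if t.endswith("patch") else 0.0 for t in trajectories]


class Trainer:
    def __init__(self):
        self.steps = 0
        self.seen_rewards = []

    async def step(self, batch):
        await asyncio.sleep(0)  # stand-in for an optimizer step
        self.steps += 1
        self.seen_rewards.extend(batch["rewards"])


async def run(num_steps, prompts):
    gen, rew, trainer = Generator(), Rewarder(), Trainer()
    for _ in range(num_steps):
        traj = await gen.generate(prompts)
        rewards = await rew.score(traj)
        await trainer.step({"traj": traj, "rewards": rewards})
    return trainer


trainer = asyncio.run(run(num_steps=3, prompts=["fix bug A", "fix bug B"]))
print(trainer.steps, trainer.seen_rewards)
```

Keeping the controller logic this small is what allows the orchestration substrate underneath it to be swapped while the loop itself stays unchanged.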
| --- | |
| ## 7. Sources | |
| ### Primary | |
| - **OpenRLHF** — https://github.com/OpenRLHF/OpenRLHF (README, Releases v0.9.10), Apache-2.0; DeepWiki: `openrlhf/models/loss.py`, `agent_func_path`. | |
| - **PRIME-RL** — https://github.com/PrimeIntellect-ai/prime-rl (README, Releases v0.5.0), Apache-2.0; DeepWiki: `src/prime_rl/trainer/rl/loss.py`, `CustomLossConfig`, `LossInputs`/`LossOutputs`, `verifiers` integration. | |
| - **NeMo-Aligner** — https://github.com/NVIDIA/NeMo-Aligner, Apache-2.0; DeepWiki: PPO/REINFORCE/DPO/RPO; `loss_func` on Megatron model classes. | |
| - **Unsloth** — https://github.com/unslothai/unsloth, README RL section; DeepWiki: `patch_trl_rl_trainers()`, `unsloth_zoo` kernels, DAPO loss-type switch. | |
| - **LLaMA-Factory** — https://github.com/hiyouga/LLaMA-Factory, Apache-2.0; DeepWiki: `CustomPPOTrainer`/`CustomDPOTrainer`, EasyR1 reference for GRPO. | |
| - **DeepSpeed-Chat** — https://github.com/deepspeedai/DeepSpeedExamples (`applications/DeepSpeed-Chat/`), Apache-2.0; DeepWiki: 3-stage PPO, DPO; "Latest News" cutoff Aug 2023; 2025 PRs (#6982, #7015, #7052) confirming maintenance-only mode. | |
| - **Monarch** — https://github.com/meta-pytorch/monarch, BSD-3; PyPI `torchmonarch` v0.4.1 (2026-04-08), v0.5.0 dev wheels through 2026-05-05; DeepWiki: `ProcMesh`, `ActorMesh`, `monarch.spmd.SPMDActor`. | |
| - **TorchTitan** — https://github.com/pytorch/torchtitan, BSD-3; DeepWiki: FSDP2/TP/PP/CP, `torchtitan/experiments/rl/simple_grpo_sum_digits.py`, integration with vLLM and Monarch. | |
| - **TorchForge** — https://github.com/meta-pytorch/forge, BSD-3, repo banner "development paused — consolidating in TorchTitan". | |
| - **torchchat** — https://github.com/pytorch/torchchat, BSD-3; DeepWiki: inference-only (eager / `torch.compile` / AOT Inductor / ExecuTorch). | |
| ### Companion repository docs (already present) | |
| - `~/wiki/research/post-training-framework/04-verl-trl.md` — VeRL vs TRL deep dive. | |
| - `~/wiki/research/post-training-framework/03-monarch-torchforge-openenv.md` — full Meta-stack survey. | |
| - `~/wiki/research/post-training-framework/02-diloco-family.md` — DiLoCo / OpenDiLoCo / PRIME-RL / INTELLECT-2. | |
| - `~/wiki/projects/composer-replication-framework.md` — current TL;DR and stage plan. | |
| ### Notes on accuracy | |
| - "DAPO" labeling: OpenRLHF and Unsloth both advertise DAPO as a first-class loss type; PRIME-RL implements a DAPO-equivalent (decoupled-clip + KL) but uses the internal name `DPPO+KL` in its default loss. For our purposes this is the same family. | |
| - Last-commit dates and release versions are pulled from GitHub release pages (OpenRLHF, PRIME-RL) and PyPI release history (`torchmonarch`). | |
| - Star counts and contributor counts reflect the snapshots returned by web search at the time of writing (May 2026) and will drift; the relative ordering is stable. | |
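For reference on the DAPO labeling note above, the two DAPO ingredients this audit mentions (decoupled clipping and dynamic sampling) can be sketched in a few lines. This is a generic illustration, not code from any audited framework: the function names are hypothetical, the default bounds 0.2 and 0.28 are the values commonly quoted for DAPO, and the group-normalized advantage uses the population standard deviation as an assumption.

```python
import math
from statistics import pstdev


def decoupled_clip_term(logp_new, logp_old, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token clipped loss with separate lower and upper clip bounds."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return -min(ratio * advantage, clipped * advantage)


def keep_group(rewards):
    """Dynamic sampling: drop groups whose rewards are all identical,
    since their group-normalized advantages are all zero."""
    return pstdev(rewards) > 0.0


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# A ratio of 1.25 with a positive advantage survives the wider upper bound
# but would be clipped to 1.2 under a symmetric 0.2 bound.
wide = decoupled_clip_term(math.log(1.25), 0.0, 1.0)
symmetric = decoupled_clip_term(math.log(1.25), 0.0, 1.0, eps_high=0.2)
```

A loss that exposes separate lower and upper bounds, as the realised recipe's `dppo_mask_low`/`dppo_mask_high` kwargs do, can express this asymmetry by configuration.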