RL Post-Training Frameworks Landscape & Meta PyTorch Stack Audit
Generated: 2026-05-25
Scope: Audit of RL post-training frameworks beyond TRL+VeRL plus Meta's PyTorch agentic stack components, with a recommendation of two additions to the Composer Replication Framework.
Feeds: ADR-006 (Algorithm-substrate selection)
Companion docs:
- `~/wiki/research/post-training-framework/04-verl-trl.md`
- `~/wiki/research/post-training-framework/03-monarch-torchforge-openenv.md`
- `~/wiki/research/post-training-framework/02-diloco-family.md`
TL;DR — Recommendation
| Slot | Pick | Why |
|---|---|---|
| RL framework #3 (after TRL, VeRL) | PRIME-RL (PrimeIntellect-ai/prime-rl) | First-class CustomLossConfig extension point (trainer.loss.type=custom + import_path) — the cleanest place we have to drop our 3-channel loss (RLVR + hint-distill + trace-replay) without forking. Already uses the verifiers env protocol that bridges to OpenEnv. Async, decentralized substrate. Apache-2.0. INTELLECT-2 production receipts. |
| Infra component (Meta stack) | Monarch (meta-pytorch/monarch) as the actor-mesh control plane; TorchTitan is also tracked as the FSDP2/TP/PP training core but is already the trainer inside both PRIME-RL and TorchForge, so we adopt it transitively. The single net-new dependency is Monarch. | Monarch is the only Meta-stack component that is (a) actively shipped (v0.4 GA, v0.5 dev, weekly wheels), (b) decoupled from the now-paused TorchForge, and (c) able to host any SPMD trainer (TRL, VeRL, PRIME-RL) as an ActorMesh. BSD-3. Replaces Ray when our v0.2 lands. |
What we do NOT add:
- OpenRLHF: strong production framework (v0.9.10, 9.3K★, supports DAPO) but its custom-loss path requires modifying `openrlhf/models/loss.py` + a `Trainer` subclass. Strictly worse extension story than PRIME-RL for our specific need (3-channel loss).
- NeMo-Aligner: no GRPO, no DAPO, heavy NeMo/Megatron dependency. Wrong shape.
- Unsloth: TRL wrapper, RL kernels live in closed `unsloth_zoo`. We'd have to fork.
- LLaMA-Factory: TRL wrapper, no GRPO/DAPO (delegates to EasyR1).
- DeepSpeed-Chat: effectively unmaintained for new RL algos since Aug 2023; PPO/DPO only.
- TorchForge: Meta has marked the repo "development paused, consolidating into TorchTitan." Borrow patterns; do not depend on it.
- torchchat: inference / local deployment only; no training. Out of scope.
Table of Contents
- Audit Methodology
- RL Framework Audit
- Meta PyTorch Agentic Stack — Infra vs Training Split
- Comparison Matrix
- Recommendation Rationale
- Integration Sketches
- Sources
1. Audit Methodology
For each framework, we capture five fields that determine whether it can host the Composer Replication Framework's three-channel loss (RLVR + hint-distill + trace-replay) on our existing OpenEnv-compatible TRL data path:
- Repo + license + last commit + maturity: primary GitHub source, license grade for redistribution, recency, and whether the project is production, research, or archived.
- Algorithm coverage: does it ship GRPO and DAPO out of the box? (DAPO matters because Composer-style training inherits its decoupled clip + dynamic sampling fixes for length and std biases.)
- Custom-loss extension point: concrete file/class/config where a custom 3-channel loss can be plugged. We strongly prefer a stable public hook over forking.
- Integration cost: rough lines of code needed for a `Recipe` doc + a skeleton `Trainer` subclass that runs end-to-end on a small env.
- OpenEnv data-path fit: does it already consume the OpenEnv contract (typed `reset`/`step`/`close`, MCP tool-calling) directly, or do we have to write a shim?
Primary sources: each repo's README.md, official releases page, and DeepWiki audits (where indexed). Secondary checks: PyPI release timelines for Meta packages.
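To make the "Algorithm coverage" criterion concrete, the following is a minimal pure-Python sketch of the two DAPO mechanisms named above: a decoupled (asymmetric) clip range and a dynamic-sampling filter that drops prompt groups with no reward spread. The function names and the clip values are illustrative assumptions, not any framework's API.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps) within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_group(rewards):
    """Dynamic-sampling filter: a group whose rewards are all identical has
    zero advantage everywhere, so it contributes no gradient and is dropped."""
    return max(rewards) != min(rewards)


def decoupled_clip(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token clipped surrogate with separate lower and upper clip ranges
    (eps_low / eps_high are illustrative values)."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


# Two prompts, four rollouts each; the second group has no reward spread.
groups = {"p1": [1.0, 0.0, 0.0, 1.0], "p2": [1.0, 1.0, 1.0, 1.0]}
kept = {name: group_advantages(r) for name, r in groups.items() if keep_group(r)}
```

A framework that "ships DAPO" is expected to expose both knobs; one that only ships GRPO typically has a single symmetric clip range and no group filter.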
2. RL Framework Audit
2.1 OpenRLHF
| Field | Value |
|---|---|
| Repo | https://github.com/OpenRLHF/OpenRLHF |
| License | Apache-2.0 |
| Stars / contributors | 9,312 ★ / 90 contributors |
| Latest release | v0.9.10, 2026-04-04 |
| Last push | 2026-04-05 |
| Maturity | Production — used in many public RLHF runs since 2023; tagline "An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & TIS & vLLM & Ray & Async RL)" |
| Algorithms | PPO, GRPO, DAPO (release notes; advertised as a primary feature in v0.9.x), REINFORCE++, REINFORCE++-baseline, RLOO, GSPO, Async RL, TIS (truncated importance sampling) |
| Custom-loss extension point | openrlhf/models/loss.py — PolicyLoss, DPOLoss, SFTLoss, PairWiseLoss, LogExpLoss are concrete nn.Modules. To add a 3-channel loss you would (a) add a new nn.Module (e.g. ThreeChannelLoss) here, then (b) subclass the relevant Trainer (e.g. PPOTrainer / a new GRPO-derived trainer) and replace self.loss_fn. There is no config-driven custom-loss hook equivalent to PRIME-RL's CustomLossConfig — you fork or vendor. |
| Integration cost | Higher than PRIME-RL. Estimated ~400–600 LOC: ~150 LOC for a ThreeChannelLoss module, ~200 LOC for a ComposerGRPOTrainer subclass that routes the three signals (RLVR scalar, hint-distill teacher logprobs, trace-replay teacher logits), ~50 LOC for a Recipe doc, plus reward-fn glue. |
| Data-path fit | OpenRLHF's input is HF chat templates + a Python reward function or a remote reward URL (--reward.remote_url, --train.agent_func_path). It does not speak the OpenEnv reset/step protocol natively, but our existing OpenEnv→TRL adapter could be reused as a callable behind agent_func_path. Medium lift to wire OpenEnv. |
Verdict: Strong, mature, well-funded codebase with the most complete algorithm coverage of any candidate. Loses to PRIME-RL only because PRIME-RL has a first-class config-driven custom-loss hook that fits our exact need, and PRIME-RL already has the verifiers/OpenEnv shape baked into the orchestrator. We keep OpenRLHF on the radar as a fallback substrate if PRIME-RL's decentralized story is overkill for v0.1.
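The "callable behind a function path" idea can be illustrated with a small shim that turns a `reset`/`step`/`close` environment into a plain reward function. This is a sketch under stated assumptions: the `Env` protocol below is a simplified stand-in (real OpenEnv environments use typed observation and action objects served over HTTP/WebSocket), and the callable signature is hypothetical rather than OpenRLHF's actual agent-function contract.

```python
from typing import Callable, Protocol, Tuple


class Env(Protocol):
    """Simplified stand-in for a reset/step/close environment contract."""

    def reset(self) -> str: ...
    def step(self, action: str) -> Tuple[str, float, bool]: ...
    def close(self) -> None: ...


def make_reward_fn(env_factory: Callable[[], Env], max_steps: int = 8):
    """Wrap an environment factory as `reward_fn(policy) -> total reward`."""

    def reward_fn(policy: Callable[[str], str]) -> float:
        env = env_factory()
        total = 0.0
        try:
            obs = env.reset()
            for _ in range(max_steps):
                obs, reward, done = env.step(policy(obs))
                total += reward
                if done:
                    break
        finally:
            env.close()  # always release the environment, even on error
        return total

    return reward_fn


class EchoEnv:
    """Toy environment: one step, reward 1.0 when the action is 'ok'."""

    def reset(self):
        return "say: ok"

    def step(self, action):
        return ("done", 1.0 if action == "ok" else 0.0, True)

    def close(self):
        pass


reward_fn = make_reward_fn(EchoEnv)
```

The point of the shim is that the framework only ever sees a function; the episode loop and cleanup stay on our side of the boundary.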
2.2 PRIME-RL
| Field | Value |
|---|---|
| Repo | https://github.com/PrimeIntellect-ai/prime-rl |
| License | Apache-2.0 |
| Stars / contributors | 1,398 ★ / 60 contributors |
| Latest release | v0.5.0, 2026-03-30 |
| Last push | 2026-05-25 (active today) |
| Maturity | Production-research hybrid — substrate behind INTELLECT-1/2 multi-DC runs; tagline "Async RL Training at Scale". Decentralized DiLoCo-shape compute is its differentiator. |
| Algorithms | GRPO, GSPO, on-policy distillation with a teacher model. default_loss_fn = DPPO + KL (a GRPO variant; similar lineage to DAPO's decoupled-clip idea but the upstream "DAPO" label is not used verbatim). |
| Custom-loss extension point | Best in class. src/prime_rl/trainer/rl/loss.py exposes a LossInputs/LossOutputs interface and setup_loss_fn resolves a config: trainer.loss.type = "custom" + trainer.loss.import_path = "your_pkg.your_module.your_loss_fn" + optional kwargs. The custom function receives trainer_logprobs, inference_logprobs, teacher_logprobs, advantages, loss_mask — i.e., the exact tensor inputs needed for a 3-channel loss (RLVR uses advantages, hint-distill uses teacher_logprobs, trace-replay can be threaded through kwargs as a precomputed reference). |
| Integration cost | Lowest. Estimated ~200–300 LOC total: ~120 LOC for a composer_three_channel_loss function in our package + ~30 LOC of config (recipes/composer_v0.toml), ~80 LOC Recipe doc. No subclassing required for the loss. A small adapter is needed if we precompute the trace-replay teacher distribution outside the LossInputs struct. |
| Data-path fit | Already aligned. PRIME-RL's orchestrator consumes verifiers environments via vf.EnvServer. The OpenEnv ↔ verifiers shim is a known small adapter (the verifiers library is the Hub-side env runner that OpenEnv's TRL guide already uses). Our existing OpenEnv-compatible TRL data path drops in with a thin wrapper. |
Verdict: Best fit for the framework. The combination of (i) config-driven custom loss with the right tensor signatures already present, (ii) verifiers/OpenEnv shape, (iii) decentralized async training that maps to our DiLoCo plans, makes PRIME-RL the substrate of choice for v0.1. Recommended addition #1.
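A config-driven hook of this kind usually reduces to resolving a string into a callable and binding the configured kwargs. The sketch below shows that pattern with the standard library only; `resolve_callable` and `bind_custom_loss` are hypothetical names, and this is not PRIME-RL's `setup_loss_fn`. It accepts both spellings that appear in this document (`pkg.module.fn` and `pkg.module:fn`).

```python
import importlib
from functools import partial


def resolve_callable(import_path: str):
    """Resolve 'pkg.module:attr' or 'pkg.module.attr' to the named object."""
    if ":" in import_path:
        module_name, attr = import_path.split(":", 1)
    else:
        module_name, _, attr = import_path.rpartition(".")
    if not module_name or not attr:
        raise ValueError(f"not a valid import path: {import_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


def bind_custom_loss(import_path: str, kwargs=None):
    """Return the resolved function with the configured kwargs pre-bound."""
    return partial(resolve_callable(import_path), **(kwargs or {}))


# Demonstrated with stdlib callables so the snippet runs anywhere.
sqrt = resolve_callable("math.sqrt")
round_1dp = bind_custom_loss("builtins:round", {"ndigits": 1})
```

Because the kwargs are bound at setup time, the trainer can call the loss with only its own inputs, which is what keeps a custom loss free of subclassing.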
2.3 NeMo-Aligner
| Field | Value |
|---|---|
| Repo | https://github.com/NVIDIA/NeMo-Aligner |
| License | Apache-2.0 |
| Maturity | Research-leaning production — NVIDIA-maintained, tied to NeMo/Megatron-LM. Advertised as "early stages of development" in its own README. |
| Algorithms | PPO, REINFORCE, RS (Rejection Sampling), DPO, RPO. No GRPO. No DAPO. |
| Custom-loss extension point | loss_func method on Megatron model classes (e.g. MegatronGPTDPOModel.loss_func). Requires NeMo model-class subclassing and Megatron-LM familiarity. |
| Integration cost | High. Estimated ~800–1,200 LOC including .nemo conversion of HF weights, Megatron model wrapping, custom Megatron loss_func, and a recipe. Plus the operational cost of running on Megatron-LM (Triton kernels, NeMo container). |
| Data-path fit | JSONL only; no OpenEnv. We'd write a full env adapter. |
Verdict: Wrong shape. No GRPO/DAPO and tightly bound to the NeMo ecosystem. Only relevant if we ever need NVIDIA-supported large-scale Megatron RL, which we don't for the Composer Replication v0.1/v0.2 horizon. Reject.
2.4 Unsloth (RL)
| Field | Value |
|---|---|
| Repo | https://github.com/unslothai/unsloth |
| License | Apache-2.0 (per public README; not surfaced by DeepWiki snapshot but well-known) |
| Maturity | Production for SFT and LoRA/QLoRA; research/preview for RL — RL support shipped in 2025 as a TRL patcher. |
| Algorithms | Wraps TRL → inherits TRL's GRPO; loss-type switch supports "grpo", "bnpo", "dr_grpo", "dapo", "cispo". So GRPO and DAPO are both available through the patched-TRL path. |
| Custom-loss extension point | Problematic. The actual loss kernels live in unsloth_zoo (a separate compiled dependency). The patcher (patch_trl_rl_trainers()) generates modified TRL trainer classes via exec() from string templates. To add a new loss type you would have to (a) modify or fork unsloth_zoo to add a kernel, (b) extend RL_REPLACEMENTS, and (c) extend the compute_loss() switch in the patcher template. There is no public Python subclass hook that survives the patching. |
| Integration cost | Very high if we want our own loss. Forking unsloth_zoo defeats the purpose of using Unsloth (which is the optimized kernels). Estimated ~1,000+ LOC plus an external repo to maintain. |
| Data-path fit | TRL-shaped, so OpenEnv via TRL is fine — but only for stock TRL losses. Our 3-channel loss does not survive Unsloth's patching. |
Verdict: Excellent for memory-efficient SFT and stock-GRPO LoRA. Wrong tool for a custom loss. Reject as the substrate; we may still use it as an optional QLoRA accelerator inside a stock-GRPO ablation run.
2.5 LLaMA-Factory
| Field | Value |
|---|---|
| Repo | https://github.com/hiyouga/LLaMA-Factory |
| License | Apache-2.0 |
| Maturity | Production for breadth (50+ model families, SFT/DPO/PPO recipes), but RL is a thin TRL wrapper. |
| Algorithms | PPO, DPO, KTO, ORPO, SimPO via Custom*Trainer subclasses of the corresponding trl.*Trainer classes. No GRPO. No DAPO in the repo itself; the README points to EasyR1 (an external GRPO framework) for those. |
| Custom-loss extension point | compute_preference_loss switch on CustomDPOTrainer (selects sigmoid / hinge / ipo / kto_pair / orpo / simpo). For PPO, you would subclass CustomPPOTrainer → which is trl.PPOTrainer. Effectively the same extension story as plain TRL, with a configuration layer on top. |
| Integration cost | Moderate, ~400 LOC, but you are essentially using TRL through one extra layer. |
| Data-path fit | Text/dataset-shaped, not OpenEnv-aware. Same OpenEnv-via-TRL story. |
Verdict: Useful as a multi-model SFT laboratory but does not move the ball for our RL-side requirements. Reject as substrate; we already have TRL.
2.6 DeepSpeed-Chat
| Field | Value |
|---|---|
| Repo | https://github.com/deepspeedai/DeepSpeedExamples (the applications/DeepSpeed-Chat/ subtree) |
| License | Apache-2.0 |
| Maturity | Effectively stale. The README's "Latest News" cuts off in August 2023. CI patches in 2025 (e.g., #6982, #7015, #7052) are dependency-pinning fixes, not feature work. The roadmap to "generalize DeepSpeed-RLHF abstraction for a wider range of RL algorithms" has not landed. |
| Algorithms | PPO (3-stage RLHF) + DPO. No GRPO. No DAPO. |
| Custom-loss extension point | DeepSpeedPPOTrainer.train_rlhf / actor_loss_fn / critic_loss_fn. Editable but not config-hooked. |
| Integration cost | Moderate, but you inherit a frozen architecture. ~500 LOC. |
| Data-path fit | Prompt-dataset-shaped; no OpenEnv. |
Verdict: Pioneering for its time, no longer competitive on algorithm coverage. Reject.
3. Meta PyTorch Agentic Stack — Infra vs Training Split
The brief asked specifically to distinguish coordination/infra from training-stack components. The answer is:
| Component | Layer | Status (May 2026) | In our framework? |
|---|---|---|---|
| Monarch (meta-pytorch/monarch) | Coordination / Infra: actor mesh, RDMA data plane, supervision trees | Active. v0.4 GA (2026-03-26), v0.5 dev wheels daily, BSD-3 | Yes, recommended addition. |
| TorchTitan (pytorch/torchtitan) | Training stack: FSDP2 / TP / PP / CP / float8 / MXFP8 | Active. BSD-3, "extensive development". Has an experimental GRPO recipe (experiments/rl/simple_grpo_sum_digits.py) on Monarch. | Indirectly: already the trainer inside PRIME-RL and TorchForge. We adopt it transitively, not as a direct dependency. |
| TorchForge (meta-pytorch/forge) | RL post-training library | Development paused per the repo banner; consolidating into TorchTitan. ~685★. | Pattern reference only. Lift the Generator/Trainer/Rewarder shape but do not depend on the package. |
| torchchat (pytorch/torchchat) | Inference / local deployment | Active for its own scope, but: not a training framework; no RL surface. | Out of scope. |
| OpenEnv (meta-pytorch/OpenEnv) | Environment standard (covered separately) | Active. Already a v0 dependency of the framework. | Already adopted. |
3.1 Monarch
| Field | Value |
|---|---|
| Repo | https://github.com/meta-pytorch/monarch |
| License | BSD-3-Clause |
| PyPI | torchmonarch; v0.4.1 stable (2026-04-08), v0.5.0 dev wheels published daily through 2026-05-05 |
| Maturity | Experimental but actively shipped. "Currently in an experimental stage" per the repo's own status note, but with a functioning K8s operator, weekly wheels, ProcessMesh/ActorMesh APIs stable enough for VeRL backend experiments. |
| Role in our stack | Pure coordination/infra. It does not train models. It hosts whatever trainer you bring (TRL, VeRL, PRIME-RL, TorchTitan) as Actor subclasses on a ProcMesh. The monarch.spmd.SPMDActor automatically configures RANK/LOCAL_RANK/WORLD_SIZE for any PyTorch-distributed script — i.e., we can lift our existing TRL or PRIME-RL workers into Monarch with minimal change. |
| Key abstractions | ProcMesh (processes × hosts × GPUs), ActorMesh (typed actors with @endpoint methods), supervision trees, RDMA buffers, distributed tensors / DTensor integration. Underlying runtime: hyperactor (Rust). |
| Why over Ray | Tighter PyTorch/DTensor integration; explicit RDMA data plane (Ray uses object store + standard networking); single-controller mental model maps directly to RL post-training (one controller orchestrates Generator + Trainer + Rewarder + Env actors). |
| Integration cost into Composer Replication | ~300 LOC + ops: (a) wrap our PRIME-RL trainer as an SPMDActor; (b) wrap our vLLM rollout server as an Actor with an @endpoint generate(prompts) method; (c) write a single controller script that creates a ProcMesh, spawns both meshes, and shuttles DataProto-shaped messages; (d) Recipe doc. The ops cost is the harder half — Monarch's K8s operator is new (v0.2.0+). |
| Risk | Pre-1.0; API churn possible (e.g., KubernetesJob.add_mesh signature changed in v0.5). Mitigation: pin to torchmonarch==0.4.1 for v0.2 of our framework. |
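For readers unfamiliar with what "configures RANK/LOCAL_RANK/WORLD_SIZE" means in practice, here is a small sketch of the environment variables a PyTorch-distributed script expects on a hosts × GPUs mesh. It assumes a row-major rank layout and uses a hypothetical helper name; it illustrates the values only and is not Monarch's implementation.

```python
def spmd_env(hosts: int, gpus_per_host: int):
    """Per-process env vars for a hosts x gpus_per_host mesh (row-major ranks)."""
    world_size = hosts * gpus_per_host
    envs = []
    for host in range(hosts):
        for local_rank in range(gpus_per_host):
            envs.append(
                {
                    "RANK": str(host * gpus_per_host + local_rank),
                    "LOCAL_RANK": str(local_rank),
                    "WORLD_SIZE": str(world_size),
                }
            )
    return envs


# 2 hosts x 4 GPUs: global ranks 0..7, local ranks 0..3 on each host.
envs = spmd_env(hosts=2, gpus_per_host=4)
```

Any trainer that reads these three variables (as `torch.distributed` launchers conventionally do) can in principle be lifted onto the mesh without code changes, which is the property the table row relies on.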
3.2 TorchTitan
| Field | Value |
|---|---|
| Repo | https://github.com/pytorch/torchtitan |
| License | BSD-3-Clause |
| Maturity | Active development for pretraining; experimental for RL. The GRPO experiment (torchtitan/experiments/rl/simple_grpo_sum_digits.py) is in experiments/, which the repo explicitly disclaims as removable. |
| Role | Training stack only. Provides FSDP2 (per-parameter sharding), Tensor Parallel (incl. async TP), Pipeline Parallel (zero-bubble), Context Parallel (long-context), torch.compile, Float8, MXFP8, DDP, HSDP. |
| OpenEnv-aware? | No, but the experimental RLTrainer integrates vLLM + Monarch actors, which is the same shape PRIME-RL uses. |
| Why we don't add it directly | PRIME-RL already uses TorchTitan-equivalent FSDP2 internals, and TorchForge's training core was TorchTitan. Adding TorchTitan as a direct dependency would mean writing our own RL loop on top of it — that's TorchForge's job, and Meta paused exactly that effort. The right move is to depend on PRIME-RL, which has battle-tested distributed training patterns equivalent to TorchTitan's, and revisit TorchTitan directly only when we genuinely need its experimental zero-bubble PP or MXFP8 paths. |
3.3 TorchForge (Paused)
- Repo banner: "Development paused — LLM training consolidating in TorchTitan."
- ~685 ★, 100+ open issues, last meaningful release in early 2026.
- Patterns we should still copy:
  - Generator/Trainer/Rewarder ActorMesh decomposition
  - TorchStore-style RDMA weight broadcast
  - Async toggle between sync PPO-like and fully async off-policy
- We do not add a TorchForge dependency. Architectural reference only.
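One way to read the "async toggle" pattern is as a single staleness bound on the rollout buffer: a bound of zero gives synchronous, strictly on-policy batches, and a larger bound admits off-policy rollouts from older policy versions. The sketch below is an assumption about how such a toggle could look, with hypothetical names; it is not TorchForge code.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    policy_version: int  # version of the weights that generated this rollout
    payload: str


def select_batch(buffer, trainer_version: int, max_staleness: int):
    """Keep rollouts at most `max_staleness` versions behind the trainer.

    max_staleness == 0  -> synchronous, on-policy only
    max_staleness  > 0  -> asynchronous, bounded off-policy
    """
    return [
        r for r in buffer
        if 0 <= trainer_version - r.policy_version <= max_staleness
    ]


buffer = [Rollout(3, "a"), Rollout(4, "b"), Rollout(5, "c")]
sync_batch = select_batch(buffer, trainer_version=5, max_staleness=0)
async_batch = select_batch(buffer, trainer_version=5, max_staleness=1)
```

Expressing the toggle as one integer keeps the sync and async paths on the same code, so an ablation between them only changes a config value.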
3.4 torchchat (Out of Scope)
- Inference / local deployment of LLMs (Eager / `torch.compile` / AOT Inductor / ExecuTorch / mobile).
- No training, no RL.
- Mentioned in the brief for completeness; ruled out cleanly.
4. Comparison Matrix
4.1 RL Frameworks
| Framework | License | Last release | Maturity | GRPO | DAPO | Custom-loss hook | OpenEnv fit | Est. integration LOC |
|---|---|---|---|---|---|---|---|---|
| TRL (baseline) | Apache-2.0 | Active | Production | ✅ | partial (tricks land per release) | Subclass GRPOTrainer.compute_loss | ✅ native (Oct 2025 OpenEnv guide) | already integrated |
| VeRL (baseline) | Apache-2.0 | Active | Production | ✅ | ✅ | core_algos.py + worker subclass | shim via Ray dataloader | already skeleton |
| OpenRLHF | Apache-2.0 | v0.9.10 (2026-04-04) | Production | ✅ | ✅ | openrlhf/models/loss.py + Trainer subclass; no config hook | shim via agent_func_path | ~400–600 |
| PRIME-RL ⭐ | Apache-2.0 | v0.5.0 (2026-03-30) | Prod-research | ✅ | partial (DPPO+KL variant; not labeled DAPO) | CustomLossConfig import_path (first-class) | ✅ via verifiers (OpenEnv-compatible) | ~200–300 |
| NeMo-Aligner | Apache-2.0 | Active | Research-leaning | ❌ | ❌ | Megatron model loss_func | none; JSONL only | ~800–1,200 |
| Unsloth (RL) | Apache-2.0 | Active | Production (SFT) / preview (RL) | ✅ (via TRL patch) | ✅ (via TRL patch) | Loss kernels in closed unsloth_zoo; effectively unhookable | TRL-shaped | ~1,000+ (forking) |
| LLaMA-Factory | Apache-2.0 | Active | Production | ❌ (delegates to EasyR1) | ❌ | TRL Custom*Trainer subclass | TRL-shaped | ~400 |
| DeepSpeed-Chat | Apache-2.0 | Stale (Aug 2023 features; 2025 only CI fixes) | Effectively maintenance-only | ❌ | ❌ | DeepSpeedPPOTrainer subclass | none | ~500 |
4.2 Meta PyTorch Stack
| Component | Layer | License | Status | In recommendation? |
|---|---|---|---|---|
| Monarch ⭐ | Coordination / actor mesh | BSD-3 | Active (v0.4 GA, v0.5 dev) | Yes |
| TorchTitan | Training stack | BSD-3 | Active; RL experimental | Indirect (via PRIME-RL) |
| TorchForge | RL library | BSD-3 | Paused | No — patterns only |
| torchchat | Inference / deployment | BSD-3 | Active | No — out of scope |
| OpenEnv | Environment standard | (Hub) | Active | Already adopted |
5. Recommendation Rationale
5.1 Why PRIME-RL, not OpenRLHF
OpenRLHF is in many ways the safer pick: more stars, more contributors, more algorithm coverage (it explicitly ships DAPO). The deciding factor is the shape of our custom loss.
The Composer Replication Framework's signature contribution is the three-channel reward:
- RLVR: tests-pass scalar from the OpenEnv environment.
- Composer-style hint-distill (SDPO/OPSD): the model self-teaches against its own hint-conditioned roll-outs; needs `teacher_logprobs` aligned to the rollout token grid.
- Trace-replay multi-teacher PRM (the novel bit): N frozen external teachers' precomputed token-level distributions, replayed against the on-policy rollout.
PRIME-RL's LossInputs dataclass already exposes exactly the tensors we need:
trainer_logprobs, inference_logprobs, teacher_logprobs, advantages, loss_mask
A custom 3-channel loss is roughly:
```python
def composer_three_channel_loss(li: LossInputs, *, hint_weight, replay_weight, replay_logits) -> LossOutputs:
    rlvr = grpo_term(li.trainer_logprobs, li.inference_logprobs, li.advantages, li.loss_mask)
    hint = kl_term(li.trainer_logprobs, li.teacher_logprobs, li.loss_mask)
    replay = kl_term(li.trainer_logprobs, replay_logits, li.loss_mask)
    return LossOutputs(loss=rlvr + hint_weight * hint + replay_weight * replay, ...)
```
We register this with trainer.loss.type = "custom" + import_path and we're done. No subclassing, no exec()-patched template, no Megatron model wrapping.
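To show how the three terms combine numerically, here is a pure-Python sketch on per-token log-probabilities. The bodies of `grpo_term` and `kl_term` are assumptions (a symmetric clipped surrogate and a sampled-token estimate of KL(policy || teacher)), and the replay channel is simplified to per-token teacher log-probabilities rather than full distributions; none of this is PRIME-RL code.

```python
import math


def masked_mean(values, mask):
    """Mean over positions where the mask is truthy (0.0 if none)."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


def grpo_term(trainer_lp, inference_lp, advantages, mask, eps=0.2):
    """Negative clipped importance-weighted surrogate (lower is better)."""
    per_token = []
    for lp_new, lp_old, adv in zip(trainer_lp, inference_lp, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        per_token.append(-min(ratio * adv, clipped * adv))
    return masked_mean(per_token, mask)


def kl_term(trainer_lp, teacher_lp, mask):
    """Sampled-token estimate of KL(policy || teacher): mean of log p - log q."""
    return masked_mean([p - q for p, q in zip(trainer_lp, teacher_lp)], mask)


def three_channel_loss(trainer_lp, inference_lp, teacher_lp, replay_lp,
                       advantages, mask, hint_weight, replay_weight):
    rlvr = grpo_term(trainer_lp, inference_lp, advantages, mask)
    hint = kl_term(trainer_lp, teacher_lp, mask)
    replay = kl_term(trainer_lp, replay_lp, mask)
    return rlvr + hint_weight * hint + replay_weight * replay


# Three tokens, the last one masked out (e.g. padding).
total = three_channel_loss(
    trainer_lp=[-1.0, -2.0, -0.5],
    inference_lp=[-1.0, -2.0, -0.5],   # ratio == 1 on every token
    teacher_lp=[-1.5, -2.0, -3.0],
    replay_lp=[-1.0, -3.0, -0.5],
    advantages=[1.0, 1.0, 1.0],
    mask=[1, 1, 0],
    hint_weight=0.5,
    replay_weight=0.25,
)
```

With these inputs the RLVR term is -1.0, the hint term 0.25, and the replay term 0.5, so the weighted total is -0.75; the masked third token does not affect any channel.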
OpenRLHF would require us to (a) add a ThreeChannelLoss nn.Module to openrlhf/models/loss.py, (b) subclass PPOTrainer (or equivalent GRPO trainer) to construct it with the right teacher-logprob plumbing, and (c) carry that fork forward. ~2× the LOC, plus a fork to maintain.
A second factor: PRIME-RL's verifiers env protocol is a direct precursor of OpenEnv's wire shape (HTTP/WebSocket env servers, typed observations). Our existing OpenEnv-compatible TRL data path translates with a thin adapter. OpenRLHF's agent_func_path is more of an escape hatch than a contract.
A third factor: PRIME-RL was built for decentralized training (INTELLECT-1/2). Even though our v0.1 stays on a single cluster, the v0.2 multi-DC story drops in cleanly. OpenRLHF is Ray-on-one-cluster by design.
5.2 Why Monarch, not TorchTitan or TorchForge
Among the four Meta-stack components in the brief, only one is both (a) ours to add and (b) genuinely new functionality:
- TorchForge is paused — depending on it now is a known dead end.
- TorchTitan is already inside PRIME-RL transitively (PRIME-RL uses FSDP2 plus a SHARDCAST weight-broadcast layer that is morally equivalent to what TorchTitan offers). Adding TorchTitan as a direct dependency means writing our own RL loop on top of it, which is exactly what TorchForge tried and paused. We get TorchTitan's benefits without owning the integration.
- torchchat is for local inference / mobile deployment — out of scope.
- Monarch is the unique value: a PyTorch-native actor mesh that lets us replace Ray (PRIME-RL's current orchestration substrate) with something that has explicit RDMA, supervision trees, and ProcMesh/ActorMesh primitives that map directly onto our (Generator, Trainer, Rewarder, EnvServer) topology.
The migration path is incremental:
- v0.1: PRIME-RL on Ray (current). Monarch listed as roadmap.
- v0.2: Wrap PRIME-RL's Trainer as a `monarch.spmd.SPMDActor`, vLLM Generator as an `Actor` with an `@endpoint generate()`. Switch the orchestrator from `ray.init()` to `this_host().spawn_procs()`.
- Risk-mitigation: pin to `torchmonarch==0.4.1` (the last GA release before v0.5 dev). Keep a Ray fallback path active until v0.2 is stable.
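Keeping a fallback path active can be as simple as choosing the orchestration backend by what is importable at startup. The sketch below uses only `importlib.util.find_spec`; the function name and the `"local"` fallback label are hypothetical, and the module names `monarch` and `ray` are the import names assumed from the rest of this document.

```python
import importlib.util


def pick_backend(preferred=("monarch", "ray"), fallback="local"):
    """Return the first importable backend name, else the fallback label."""
    for name in preferred:
        if importlib.util.find_spec(name) is not None:
            return name
    return fallback


backend = pick_backend()
```

Ordering `preferred` as `("monarch", "ray")` makes Monarch opt-in by installation: a v0.1 environment without `torchmonarch` keeps running on Ray, and removing the pin is enough to roll back.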
6. Integration Sketches
6.1 PRIME-RL Recipe skeleton
> **Realised in v0.1 (Wave 17 update):** Wave 14b shipped the PRIME-RL
> recipe at `composer_replication/recipes/prime_rl/prime_rl_config.yaml`
> as **YAML** with a different kwarg surface than the TOML sketch below.
> The actual recipe shape:
>
> ```yaml
> # composer_replication/recipes/prime_rl/prime_rl_config.yaml
> model:
>   base: "Qwen/Qwen2.5-0.5B"
>   attn_implementation: "flash_attention_2"
>   dtype: "bfloat16"
> env:
>   protocol: "verifiers"
>   config: { name: "math/gsm8k", split: "train" }
> loss:
>   custom:
>     import_path: "composer_replication.recipes.prime_rl.composer_loss:loss_fn"
>     kwargs:
>       alpha_sdpo: 0.0        # channel 2 deferred in v0
>       beta_dpo: 0.0          # channel 3 out-of-scope for PRIME-RL v0
>       dppo_mask_high: 0.2    # PRIME-RL DPPO convention (NOT textbook PPO)
>       dppo_mask_low: 0.2     # both must be >= 0 per Field(..., ge=0)
>       adv_tau: 1.0           # advantage normalization
>       kl_tau: 0.04           # KL coefficient
> ```
>
> The realised `loss_fn(inputs, **kwargs)` matches PRIME-RL's
> `LossInputs`/`LossOutputs` interface (read upstream
> `src/prime_rl/trainer/rl/loss.py` for parity verification; Wave 14b's
> shadow-parity test independently restates the formula in
> `composer_replication/recipes/prime_rl/tests/test_composer_loss.py`).
>
> The pre-Wave-14b TOML/`hint_weight`/`replay_weight` sketch below is
> preserved as historical proposal context.

recipes/composer_v0_prime_rl.toml (~30 LOC):

```toml
# composer_v0_prime_rl.toml
[model]
name = "Qwen/Qwen3-32B"          # or Kimi-K2.5 when MoE support lands

[data]
env = "swe_bench_lite"           # via verifiers EnvServer; wraps our OpenEnv adapter
batch_size = 64
group_size = 16

[trainer]
algorithm = "grpo"

[trainer.loss]
type = "custom"
import_path = "composer_replication.recipes.prime_rl.composer_loss:loss_fn"

[trainer.loss.kwargs]
hint_weight = 0.5
replay_weight = 0.25
replay_logits_path = "/data/teachers/precomputed_replay.zarr"

[teacher]
model = "Qwen/Qwen3-32B"         # same as policy = self-teacher for hint-distill
hint_template = "composer.hint_v1"

[orchestrator]
sync_mode = "async"
shardcast = true
```
composer_replication/recipes/prime_rl/composer_loss.py (~120 LOC; the current Wave 14b implementation defines `loss_fn(inputs, **kwargs)` rather than the `composer_three_channel_loss(li, *, hint_weight, replay_weight, replay_logits_path)` signature sketched below):

```python
# composer_replication/recipes/prime_rl/composer_loss.py (sketch only;
# the actual signature evolved during Wave 14b. See the module docstring
# for the current `loss_fn` contract.)
from prime_rl.trainer.rl.loss import LossInputs, LossOutputs

# grpo_surrogate, masked_kl and trace_replay_kl are project helpers that
# this sketch assumes exist; they are not defined here.


def composer_three_channel_loss(
    li: LossInputs,
    *,
    hint_weight: float,
    replay_weight: float,
    replay_logits_path: str,  # matches the [trainer.loss.kwargs] key
) -> LossOutputs:
    # 1. RLVR via GRPO surrogate
    rlvr = grpo_surrogate(li.trainer_logprobs, li.inference_logprobs,
                          li.advantages, li.loss_mask)
    # 2. Hint-distill: KL(policy || hint-conditioned teacher)
    hint = masked_kl(li.trainer_logprobs, li.teacher_logprobs, li.loss_mask)
    # 3. Trace-replay: KL(policy || precomputed multi-teacher mixture)
    replay = trace_replay_kl(li.trainer_logprobs, replay_logits_path, li.loss_mask)
    total = rlvr + hint_weight * hint + replay_weight * replay
    return LossOutputs(
        loss=total,
        metrics={"rlvr": rlvr.item(), "hint": hint.item(), "replay": replay.item()},
    )
```
Plus docs/recipes/composer_v0_prime_rl.md (~50 LOC) describing data layout, teacher precomputation, and reproducibility hashes.
Total: ~200 LOC of code + ~30 LOC config + ~50 LOC docs ≈ 280 LOC.
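The trace-replay channel compares the policy against a mixture of several teachers' token-level distributions. The sketch below works on explicit probability vectors in pure Python; the uniform mixture weights and the function names are assumptions, and a real implementation would operate on batched tensors loaded from the precomputed store.

```python
import math


def mixture(teacher_dists, weights=None):
    """Weighted average of N teacher distributions over the same vocabulary."""
    n = len(teacher_dists)
    weights = weights or [1.0 / n] * n
    vocab = len(teacher_dists[0])
    return [sum(w * d[v] for w, d in zip(weights, teacher_dists)) for v in range(vocab)]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; q is floored at eps."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def trace_replay_kl(policy_dists, teacher_dists_per_token, mask):
    """Masked mean over tokens of KL(policy || teacher mixture)."""
    total, count = 0.0, 0
    for p, teachers, m in zip(policy_dists, teacher_dists_per_token, mask):
        if m:
            total += kl(p, mixture(teachers))
            count += 1
    return total / count if count else 0.0


# Two tokens, two teachers, vocabulary of size 2.
policy = [[0.5, 0.5], [0.5, 0.5]]
teachers = [
    [[1.0, 0.0], [0.0, 1.0]],      # mixture is [0.5, 0.5]: matches the policy
    [[0.25, 0.75], [0.25, 0.75]],  # mixture is [0.25, 0.75]: disagrees
]
first_only = trace_replay_kl(policy, teachers, mask=[1, 0])
both = trace_replay_kl(policy, teachers, mask=[1, 1])
```

Averaging probabilities (rather than logits) keeps the mixture a valid distribution even when teachers disagree sharply, which is why two one-hot teachers above still yield a finite KL.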
6.2 Monarch wrap-up sketch (v0.2)
```python
# composer_replication/orchestrator/monarch_runner.py (~120 LOC)
# Schematic only: verify the import paths and the spawn/endpoint call
# signatures against the pinned torchmonarch release before relying on them.
from monarch.actor import Actor, endpoint
from monarch.proc_mesh import this_host, ProcMesh


class TrainerActor(Actor):
    @endpoint
    async def step(self, batch): ...


class GeneratorActor(Actor):
    @endpoint
    async def generate(self, prompts): ...


class RewarderActor(Actor):
    @endpoint
    async def score(self, traj): ...


async def main(cfg, env):
    # `env` is the OpenEnv-backed prompt source, constructed by the caller.
    train_mesh = await this_host().spawn_procs(TrainerActor, hosts=4, gpus=8)
    gen_mesh = await this_host().spawn_procs(GeneratorActor, hosts=2, gpus=8)
    rew_mesh = await this_host().spawn_procs(RewarderActor, hosts=1, gpus=2)
    # Plain `for`: range() is not an async iterator.
    for step in range(cfg.steps):
        prompts = await env.batch()
        traj = await gen_mesh.generate.broadcast(prompts)
        rewards = await rew_mesh.score.broadcast(traj)
        await train_mesh.step.broadcast({"traj": traj, "rewards": rewards})
```
Total: ~120 LOC controller + ~50 LOC ops (K8s operator manifest) + ~80 LOC recipe doc ≈ 250 LOC.
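Before any Monarch dependency is installed, the controller's data flow (generate, score, train) can be dry-run with plain `asyncio` and stub objects. The stub classes below are hypothetical stand-ins for the Generator, Rewarder, and Trainer actors; they exercise the loop shape only and say nothing about Monarch's runtime behaviour.

```python
import asyncio


class StubGenerator:
    async def generate(self, prompts):
        return [p + " -> answer" for p in prompts]


class StubRewarder:
    async def score(self, trajectories):
        return [1.0 if t.endswith("answer") else 0.0 for t in trajectories]


class StubTrainer:
    def __init__(self):
        self.steps = 0

    async def step(self, batch):
        self.steps += 1
        # Report the mean reward of the batch as a stand-in for a loss metric.
        return sum(batch["rewards"]) / len(batch["rewards"])


async def run(steps, prompts):
    gen, rew, trainer = StubGenerator(), StubRewarder(), StubTrainer()
    history = []
    for _ in range(steps):
        traj = await gen.generate(prompts)
        rewards = await rew.score(traj)
        history.append(await trainer.step({"traj": traj, "rewards": rewards}))
    return trainer.steps, history


result = asyncio.run(run(3, ["p1", "p2"]))
```

A dry run like this is useful as a unit test for the message shapes passed between actors, so that swapping the stubs for real actor meshes changes the transport but not the controller logic.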
7. Sources
Primary
- OpenRLHF: https://github.com/OpenRLHF/OpenRLHF (README, Releases v0.9.10), Apache-2.0; DeepWiki: `openrlhf/models/loss.py`, `agent_func_path`.
- PRIME-RL: https://github.com/PrimeIntellect-ai/prime-rl (README, Releases v0.5.0), Apache-2.0; DeepWiki: `src/prime_rl/trainer/rl/loss.py`, `CustomLossConfig`, `LossInputs`/`LossOutputs`, `verifiers` integration.
- NeMo-Aligner: https://github.com/NVIDIA/NeMo-Aligner, Apache-2.0; DeepWiki: PPO/REINFORCE/DPO/RPO; `loss_func` on Megatron model classes.
- Unsloth: https://github.com/unslothai/unsloth, README RL section; DeepWiki: `patch_trl_rl_trainers()`, `unsloth_zoo` kernels, DAPO loss-type switch.
- LLaMA-Factory: https://github.com/hiyouga/LLaMA-Factory, Apache-2.0; DeepWiki: `CustomPPOTrainer`/`CustomDPOTrainer`, EasyR1 reference for GRPO.
- DeepSpeed-Chat: https://github.com/deepspeedai/DeepSpeedExamples (`applications/DeepSpeed-Chat/`), Apache-2.0; DeepWiki: 3-stage PPO, DPO; "Latest News" cutoff Aug 2023; 2025 PRs (#6982, #7015, #7052) confirming maintenance-only mode.
- Monarch: https://github.com/meta-pytorch/monarch, BSD-3; PyPI `torchmonarch` v0.4.1 (2026-04-08), v0.5.0 dev wheels through 2026-05-05; DeepWiki: `ProcMesh`, `ActorMesh`, `monarch.spmd.SPMDActor`.
- TorchTitan: https://github.com/pytorch/torchtitan, BSD-3; DeepWiki: FSDP2/TP/PP/CP, `torchtitan/experiments/rl/simple_grpo_sum_digits.py`, integration with vLLM and Monarch.
- TorchForge: https://github.com/meta-pytorch/forge, BSD-3, repo banner "development paused, consolidating in TorchTitan".
- torchchat: https://github.com/pytorch/torchchat, BSD-3; DeepWiki: inference-only (eager / `torch.compile` / AOT Inductor / ExecuTorch).
Companion repository docs (already present)
- `~/wiki/research/post-training-framework/04-verl-trl.md`: VeRL vs TRL deep dive.
- `~/wiki/research/post-training-framework/03-monarch-torchforge-openenv.md`: full Meta-stack survey.
- `~/wiki/research/post-training-framework/02-diloco-family.md`: DiLoCo / OpenDiLoCo / PRIME-RL / INTELLECT-2.
- `~/wiki/projects/composer-replication-framework.md`: current TL;DR and stage plan.
Notes on accuracy
- "DAPO" labeling: OpenRLHF and Unsloth both advertise DAPO as a first-class loss type; PRIME-RL implements a DAPO-equivalent (decoupled-clip + KL) but uses the internal name `DPPO+KL` in its default loss. For our purposes this is the same family.
- Last-commit dates and release versions are pulled from GitHub release pages (OpenRLHF, PRIME-RL) and PyPI release history (`torchmonarch`).
- Star counts and contributor counts reflect the snapshots returned by web search at the time of writing (May 2026) and will drift; the relative ordering is stable.