# V3 Substrate Coverage — Monarch / TorchForge / OpenEnv / VeRL / TRL / DiLoCo

The brief's V3 clause asks the framework to cover six substrates. This doc
maps each to **what we have** + **what we don't** + **why that's the right
shape** given the substrate's status and the framework's scope.

## TRL — `huggingface/trl`

**Status**: ✅ **Production target for v0.1.** Working code.

**What we have**:
- Research deep-dive: `research/04-verl-trl.md` § 3 (algorithm coverage:
  GRPO / DAPO / DPO / PRM, extension points, `_compute_loss` vs `compute_advantages`)
- Integration recipe: `docs/INTEGRATION_ARCHITECTURE.md` Recipe A
- Working code: `composer_replication.trainer.ComposerReplicationTrainer`
  subclasses `GRPOTrainer`, overrides `_compute_loss(model, inputs)` to
  compose 3 channels (`grpo + α·sdpo + β·trace_replay_dpo`)
- Data collator: `composer_replication.trainer.data_collator.ComposerDataCollator`
  builds the `inputs` dict the trainer expects
- DeepWiki audit: extension surface verified against TRL HEAD as of 2026-05-25
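
The three-channel composition named above (`grpo + α·sdpo + β·trace_replay_dpo`) can be sketched as follows. This is a minimal illustration, not the code in `composer_replication.trainer`: `StandInGRPOTrainer` is a stand-in for TRL's `GRPOTrainer` so the example runs without TRL installed, and the `inputs` keys, the `ChannelWeights` class, and its default values are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class ChannelWeights:
    """Hypothetical container for the two mixing coefficients."""
    alpha: float = 0.5  # weight on the SDPO channel (illustrative default)
    beta: float = 0.1   # weight on the trace-replay DPO channel (illustrative default)


def compose_loss(grpo: float, sdpo: float, trace_replay_dpo: float,
                 weights: ChannelWeights) -> float:
    """Return grpo + alpha * sdpo + beta * trace_replay_dpo."""
    return grpo + weights.alpha * sdpo + weights.beta * trace_replay_dpo


class StandInGRPOTrainer:
    """Stand-in base class. NOT trl.GRPOTrainer; it only mimics the hook name."""

    def _compute_loss(self, model, inputs):
        # Assumption: the base loss is already available in the inputs dict.
        return inputs["grpo_loss"]


class ThreeChannelTrainer(StandInGRPOTrainer):
    """Overrides the loss hook to add the two auxiliary channels."""

    def __init__(self, weights: ChannelWeights):
        self.weights = weights

    def _compute_loss(self, model, inputs):
        grpo = super()._compute_loss(model, inputs)
        # The key names below are hypothetical; a real collator defines its own.
        return compose_loss(
            grpo,
            inputs["sdpo_loss"],
            inputs["trace_replay_dpo_loss"],
            self.weights,
        )


trainer = ThreeChannelTrainer(ChannelWeights(alpha=0.5, beta=0.25))
loss = trainer._compute_loss(
    model=None,
    inputs={"grpo_loss": 1.0, "sdpo_loss": 2.0, "trace_replay_dpo_loss": 4.0},
)
```

Keeping the composition in a free function such as `compose_loss` lets the weighting be unit-tested on CPU without constructing a trainer.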

**What we don't**:
- A full end-to-end training run (gated on real GPU rollouts and
  reward calculations, which are out of scope for the CPU-budget deep-work loop)

**Why this shape**: TRL is the most-supported substrate for GRPO post-training.
Subclassing `GRPOTrainer` and overriding `_compute_loss` is the cleanest
extension path, so production v0.1 lives here.

---

## VeRL — `volcengine/verl`

**Status**: 🟡 **Production target for v0.2 (multi-node scale).** Skeleton, not yet runnable.

**What we have**:
- Research deep-dive: `research/04-verl-trl.md` § 4 (3D-HybridEngine,
  resharding pattern, advantage estimator registry)
- Integration recipe: `docs/INTEGRATION_ARCHITECTURE.md` Recipe B
- Skeleton code: `spikes/005-integrated-trainer-skeleton/verl_path/`
  - `composer_adv.py` (110 LOC) — `@register_adv_est("composer_3channel")` decorator
  - `composer_config.yaml` (89 LOC) — full PPO trainer config with our advantage estimator wired in
- DeepWiki audit: extension surface verified against VeRL HEAD as of 2026-05-25
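
The decorator-based registration mentioned for `composer_adv.py` follows a common registry pattern, sketched below. The registry here is a self-contained stand-in and not VeRL's own implementation, whose signature may differ. The estimator body is also a placeholder (group-mean-centred rewards), because the real three-channel logic is not shown in this document; `ADV_ESTIMATORS` and `get_adv_est` are hypothetical names.

```python
from typing import Callable, Dict, List

# Stand-in registry. VeRL ships its own; this only illustrates the pattern.
ADV_ESTIMATORS: Dict[str, Callable[[List[float]], List[float]]] = {}


def register_adv_est(name: str):
    """Return a decorator that stores the wrapped function under `name`."""
    def decorator(fn: Callable[[List[float]], List[float]]):
        if name in ADV_ESTIMATORS:
            raise ValueError(f"advantage estimator {name!r} already registered")
        ADV_ESTIMATORS[name] = fn
        return fn
    return decorator


@register_adv_est("composer_3channel")
def composer_3channel(rewards: List[float]) -> List[float]:
    """Placeholder body: subtract the group mean from each reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def get_adv_est(name: str) -> Callable[[List[float]], List[float]]:
    """Look up an estimator by the name a config file would reference."""
    return ADV_ESTIMATORS[name]


advantages = get_adv_est("composer_3channel")([1.0, 2.0, 3.0])
```

With this pattern a trainer config only needs to carry the string `composer_3channel`; the estimator is resolved by name at startup.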

**What we don't**:
- A working VeRL run on real hardware (VeRL itself has a steep setup cost;
  v0.1 prioritizes TRL because it's faster to iterate on)

**Why this shape**: VeRL's 3D-HybridEngine and decentralized scheduler are
better than TRL's at >32 GPU scale. We build the recipe but don't make it
the default. The framework supports either path; users on >8-GPU clusters
should use VeRL.

---

## DiLoCo — `meta-pytorch/torchft`

**Status**: 🟡 **Outer-loop wrapper integrated.** Multi-replica convergence GPU-gated.

**What we have**:
- Research deep-dive: `research/02-diloco-family.md` (DiLoCo / OpenDiLoCo /
  Streaming DiLoCo / PRIME-RL / INTELLECT-1+2 — full audit with primary
  source links and license/maturity assessment)
- ADR: `docs/adrs/ADR-003-diloco-impl.md` — chose `torchft.local_sgd.DiLoCo`
  (BSD-3, Meta-maintained, library-not-research-code) over 4 alternatives
- Working code: `composer_replication.diloco.make_diloco_outer_loop`
  wrapper. Documents the sign convention (pseudo-grad = θ_initial - θ_local).
- Spike 008: 5/5 single-process tests. **Sign-convention test** is the
  single best test in the framework (per cross-model review).
- Reconnaissance: `docs/research/DILOCO_RECONNAISSANCE.md`
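
The sign convention documented by the wrapper (pseudo-grad = θ_initial - θ_local) can be illustrated with a small numeric sketch. This is not the `torchft` API or the wrapper's code: parameters are plain Python lists, the function names are hypothetical, and the outer optimizer is plain SGD as a simplification (DiLoCo as published uses Nesterov momentum for the outer step).

```python
from typing import List


def pseudo_gradient(theta_initial: List[float], theta_local: List[float]) -> List[float]:
    """Pseudo-gradient under the documented convention: initial minus local."""
    return [init - local for init, local in zip(theta_initial, theta_local)]


def outer_step(theta_initial: List[float],
               replica_params: List[List[float]],
               outer_lr: float = 1.0) -> List[float]:
    """Average the replicas' pseudo-gradients and apply one plain-SGD outer step."""
    n = len(replica_params)
    grads = [pseudo_gradient(theta_initial, params) for params in replica_params]
    avg_grad = [sum(column) / n for column in zip(*grads)]
    # Descent step: subtracting the pseudo-gradient moves toward the replicas.
    return [init - outer_lr * g for init, g in zip(theta_initial, avg_grad)]


theta_initial = [1.0, 2.0]
replicas = [[0.5, 2.5], [0.0, 1.5]]  # parameters after the inner steps
theta_new = outer_step(theta_initial, replicas, outer_lr=1.0)
```

With `outer_lr=1.0` and plain SGD, the outer step lands exactly on the mean of the replica parameters. A flipped sign would instead move the global parameters away from the replicas, which is the failure a sign-convention test guards against.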

**What we don't**:
- A true multi-replica convergence test. Single-process post-hook
  sequencing prevents this (replica A's outer step completes before
  replica B's allreduce arrives). A real multi-process test is deferred
  to the GPU phase.
- Trainer integration. The wrapper is a context manager; wiring it into
  the `ComposerReplicationTrainer.train()` lifecycle is a separate spike.

**Why this shape**: DiLoCo's value proposition (decentralized inner training
with sparse outer sync) only matters at multi-cluster scale. Our v0.1
target is single-cluster training with TRL. The DiLoCo wrapper is wired
up so v0.2 multi-cluster training can switch it on with one config change.

---

## OpenEnv

**Status**: 📋 **Reference pattern (substrate, not a choice).**

**What we have**:
- Research deep-dive: `research/03-monarch-torchforge-openenv.md` § OpenEnv
  (the env-format standard, how it interacts with TRL's `environment_factory=`)
- Integration recipe: `docs/INTEGRATION_ARCHITECTURE.md` Recipe D —
  "OpenEnv is a substrate, not a choice"

**What we don't**:
- Direct OpenEnv code dependency. The framework's data path is
  OpenEnv-compatible by virtue of using TRL's API, which accepts
  `environment_factory=` kwargs that OpenEnv environments satisfy.

**Why this shape**: OpenEnv is a *protocol* (how an env exposes itself
to a trainer), not a library you depend on. You either implement an
OpenEnv-compatible environment or you don't. Composer 2.5's "Feature
Deletion" environment is OpenEnv-shaped; if a user provides one, our
TRL trainer accepts it via `environment_factory=`.

---

## Monarch (Meta)

**Status**: 📋 **Reference pattern (alternative coordination model).**

**What we have**:
- Research deep-dive: `research/03-monarch-torchforge-openenv.md` § Monarch
  (actor mesh, hardware abstractions, comparison to Ray)
- Integration recipe: `docs/INTEGRATION_ARCHITECTURE.md` Recipe C —
  "TorchForge + Monarch (reference patterns only, not a production target)"

**What we don't**:
- Direct Monarch code dependency. We use DiLoCo's pseudo-gradient sync
  as our coordination model; Monarch's actor mesh is an alternative.

**Why this shape**: Monarch is alive (Meta is shipping it) but it's a
*coordination layer*, not an *algorithm*. Our framework integrates with
PyTorch + TRL + torchft directly; Monarch would replace the coordination
layer underneath. Documented as a future option; not a v0.1 dependency.

---

## TorchForge (Meta, paused)

**Status**: 📋 **Reference only (upstream paused).**

**What we have**:
- Research deep-dive: `research/03-monarch-torchforge-openenv.md` § TorchForge
  — design lessons captured

**What we don't**:
- Code dependency. TorchForge as a project was paused by Meta.

**Why this shape**: The brief asked us to research TorchForge. We did.
The headline finding is "Meta paused this." That's a real research output
even if it doesn't translate to code.

---

## Summary

| Substrate | Research | Recipe | Code | Tests | v0.1 production? |
|---|---|---|---|---|---|
| TRL | ✅ | ✅ | ✅ | 38 + 9 + 3 = 50 | ✅ |
| VeRL | ✅ | ✅ | 🟡 (skeleton) | — | v0.2 |
| **PRIME-RL** (Wave 13) | ✅ | ✅ | 🟡 (loss adapter + config) | — | v0.2 (cleanest hook) |
| DiLoCo (single-process) | ✅ | ✅ | ✅ | 5 (single-replica) | optional |
| **DiLoCo over serverless** (Wave 13) | ✅ | ✅ ADR-005 | ✅ Local + 🟡 Modal/HFJobs | 9 multi-process | ✅ (local) / future (cloud) |
| OpenEnv | ✅ | ✅ | n/a (protocol) | — | substrate |
| **Monarch** (Wave 13) | ✅ | ✅ (actor layout) | 🟡 (skeleton) | — | v0.2+ |
| TorchForge | ✅ | n/a (paused) | n/a | — | n/a |

**8/8 substrates covered** (was 6/6 pre-Wave-13). New since Wave 13:
PRIME-RL (the cleanest custom-loss hook), Monarch (Meta's actively-shipped
agentic-stack component), and serverless DiLoCo (Modal/HF Jobs adapters
+ object-store rendezvous). The framework can now realize Decoupled
DiLoCo across cloud executors **without any cross-job NCCL** — see
ADR-005 for the design rationale.