	# V3 Substrate Coverage — Monarch / TorchForge / OpenEnv / VeRL / TRL / DiLoCo

	The brief's V3 clause asks the framework to cover six substrates. This doc
	maps each to what we have + what we don't + **why that's the right
	shape** given the substrate's status and the framework's scope.

	## TRL — `huggingface/trl`

	Status: ✅ Production target for v0.1. Working code.

	What we have:
	- Research deep-dive: `research/04-verl-trl.md` § 3 (algorithm coverage:
	GRPO / DAPO / DPO / PRM, extension points, `_compute_loss` vs `compute_advantages`)
	- Integration recipe: `docs/INTEGRATION_ARCHITECTURE.md` Recipe A
	- Working code: `composer_replication.trainer.ComposerReplicationTrainer`
	subclasses `GRPOTrainer`, overrides `_compute_loss(model, inputs)` to
	compose 3 channels (`grpo + α·sdpo + β·trace_replay_dpo`)
	- Data collator: `composer_replication.trainer.data_collator.ComposerDataCollator`
	builds the `inputs` dict the trainer expects
	- DeepWiki audit: extension surface verified against TRL HEAD as of 2026-05-25

	What we don't:
	- A full end-to-end training run (gated on real GPU rollouts and
	reward computation, which is out of scope for the CPU-budget deep-work loop)

	Why this shape: TRL is the most-supported substrate for GRPO post-training.
	Subclassing `GRPOTrainer` and overriding `_compute_loss` is the
	cleanest extension path. Production v0.1 lives here.
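
To make the 3-channel composition concrete, here is a minimal sketch of the weighting `grpo + α·sdpo + β·trace_replay_dpo` on its own, outside any trainer. `ChannelWeights`, `compose_loss`, and the default coefficients are hypothetical names and values chosen for illustration; they are not taken from `ComposerReplicationTrainer`, and how each channel loss is computed is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelWeights:
    """Hypothetical container for the two mixing coefficients."""

    alpha: float = 0.1  # weight on the SDPO channel (illustrative default)
    beta: float = 0.1   # weight on the trace-replay DPO channel (illustrative default)


def compose_loss(grpo, sdpo, trace_replay_dpo, weights: ChannelWeights):
    """Return grpo + alpha * sdpo + beta * trace_replay_dpo.

    The operands only need to support + and *, so the same function
    works on plain floats (as here) or on scalar tensors.
    """
    return grpo + weights.alpha * sdpo + weights.beta * trace_replay_dpo


if __name__ == "__main__":
    w = ChannelWeights(alpha=0.5, beta=0.25)
    # 1.0 + 0.5 * 2.0 + 0.25 * 4.0
    print(compose_loss(1.0, 2.0, 4.0, w))
```

In a `GRPOTrainer` subclass, a function like this would presumably be the last step of the overridden `_compute_loss(model, inputs)`, after the three channel losses have been computed from `inputs`.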

	---

	## VeRL — `volcengine/verl`

	Status: 🟡 Production target for v0.2 (multi-node scale). Skeleton, not yet runnable.

	What we have:
	- Research deep-dive: `research/04-verl-trl.md` § 4 (3D-HybridEngine,
	resharding pattern, advantage estimator registry)
	- Integration recipe: `docs/INTEGRATION_ARCHITECTURE.md` Recipe B
	- Skeleton code: `spikes/005-integrated-trainer-skeleton/verl_path/`
	- `composer_adv.py` (110 LOC) — `@register_adv_est("composer_3channel")` decorator
	- `composer_config.yaml` (89 LOC) — full PPO trainer config with our advantage estimator wired in
	- DeepWiki audit: extension surface verified against VeRL HEAD as of 2026-05-25

	What we don't:
	- A working VeRL run on real hardware (VeRL has a steep setup cost;
	v0.1 prioritizes TRL because it's faster to iterate on)

	Why this shape: VeRL's 3D-HybridEngine and decentralized scheduler scale
	better than TRL's training stack at >32-GPU scale. We build the recipe but
	don't make it the default. The framework supports either path; users on
	>8-GPU clusters should use VeRL.
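
The advantage-estimator registry is a decorator-based lookup table. The sketch below reimplements that pattern in plain Python so it runs without VeRL installed. The registry dict, `get_adv_est`, and the body of `composer_3channel` are stand-ins written for illustration: the real `composer_adv.py` and VeRL's own `register_adv_est` will differ in signature and behavior.

```python
from typing import Callable, Dict, List

# Stand-in registry: maps an estimator name to the function that implements it.
_ADV_ESTIMATORS: Dict[str, Callable[[List[float]], List[float]]] = {}


def register_adv_est(name: str):
    """Decorator factory that records a function under `name` (local stand-in)."""

    def decorator(fn):
        if name in _ADV_ESTIMATORS:
            raise ValueError(f"advantage estimator {name!r} already registered")
        _ADV_ESTIMATORS[name] = fn
        return fn

    return decorator


def get_adv_est(name: str) -> Callable[[List[float]], List[float]]:
    """Look up an estimator by the name a config file would reference."""
    return _ADV_ESTIMATORS[name]


@register_adv_est("composer_3channel")
def composer_3channel(rewards: List[float]) -> List[float]:
    """Placeholder body: mean-centered rewards within one group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


if __name__ == "__main__":
    estimator = get_adv_est("composer_3channel")
    print(estimator([1.0, 2.0, 3.0]))
```

The point of the pattern is that the trainer config only needs the string name; the estimator implementation stays in our package and no VeRL source is patched.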

	---

	## DiLoCo — `meta-pytorch/torchft`

	Status: 🟡 Outer-loop wrapper integrated. Multi-replica convergence GPU-gated.

	What we have:
	- Research deep-dive: `research/02-diloco-family.md` (DiLoCo / OpenDiLoCo /
	Streaming DiLoCo / PRIME-RL / INTELLECT-1+2 — full audit with primary
	source links and license/maturity assessment)
	- ADR: `docs/adrs/ADR-003-diloco-impl.md` — chose `torchft.local_sgd.DiLoCo`
	(BSD-3, Meta-maintained, library-not-research-code) over 4 alternatives
	- Working code: `composer_replication.diloco.make_diloco_outer_loop`
	wrapper. Documents the sign convention (pseudo-grad = θ_initial - θ_local).
	- Spike 008: 5/5 single-process tests. Sign-convention test is the
	single best test in the framework (per cross-model review).
	- Reconnaissance: `docs/research/DILOCO_RECONNAISSANCE.md`

	What we don't:
	- True multi-replica convergence test. Single-process post-hook
	sequencing prevents this (replica A's outer step completes before
	replica B's allreduce arrives). Real-multi-process test deferred to
	GPU phase.
	- Trainer integration. The wrapper is a context manager; wiring it into
	`ComposerReplicationTrainer.train()` lifecycle is a separate spike.

	Why this shape: DiLoCo's value proposition (decentralized inner training
	with sparse outer sync) only matters at multi-cluster scale. Our v0.1
	target is single-cluster training with TRL. The DiLoCo wrapper is wired
	up so v0.2 multi-cluster training can switch it on with one config change.
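
The sign convention is easy to get backwards, so here is a small worked sketch of one outer step using pseudo-grad = θ_initial - θ_local. It uses plain lists and a plain SGD outer update for readability; the function names are hypothetical, and this is not the `torchft` implementation or our wrapper (DiLoCo setups commonly use a momentum-based outer optimizer, which is omitted here).

```python
from typing import List

Vector = List[float]


def pseudo_gradient(theta_initial: Vector, theta_local: Vector) -> Vector:
    """pseudo-grad = theta_initial - theta_local, elementwise."""
    return [t0 - tl for t0, tl in zip(theta_initial, theta_local)]


def average(vectors: List[Vector]) -> Vector:
    """Elementwise mean across replicas (what an allreduce-mean would yield)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def outer_step(theta_initial: Vector, local_thetas: List[Vector],
               outer_lr: float = 1.0) -> Vector:
    """One outer update with plain SGD: theta <- theta_initial - lr * avg pseudo-grad."""
    grads = [pseudo_gradient(theta_initial, t) for t in local_thetas]
    avg = average(grads)
    return [t0 - outer_lr * g for t0, g in zip(theta_initial, avg)]


if __name__ == "__main__":
    theta0 = [1.0, 2.0]
    replicas = [[0.5, 2.5], [0.0, 1.5]]
    # With outer_lr=1.0 the result equals the mean of the replica weights.
    print(outer_step(theta0, replicas, outer_lr=1.0))
```

With this sign, a gradient-descent style outer update moves the global weights toward the replicas. Flipping the subtraction would move them away, which is the failure mode a sign-convention test guards against.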

	---

	## OpenEnv

	Status: 📋 Reference pattern (substrate, not a choice).

	What we have:
	- Research deep-dive: `research/03-monarch-torchforge-openenv.md` § OpenEnv
	(the env-format standard, how it interacts with TRL's `environment_factory=`)
	- Integration recipe: `docs/INTEGRATION_ARCHITECTURE.md` Recipe D —
	"OpenEnv is a substrate, not a choice"

	What we don't:
	- Direct OpenEnv code dependency. The framework's data path is
	OpenEnv-compatible by virtue of using TRL's API, which accepts
	`environment_factory=` kwargs that OpenEnv environments satisfy.

	Why this shape: OpenEnv is a protocol (how an env exposes itself
	to a trainer), not a library you depend on. You either implement an
	OpenEnv-compatible environment or you don't. Composer 2.5's "Feature
	Deletion" environment is OpenEnv-shaped; if a user provides one, our
	TRL trainer accepts it via `environment_factory=`.
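
"Protocol, not library" can be illustrated with structural typing: any object with the right methods qualifies, and no shared base class is imported. The sketch below is an assumption-heavy toy. `EnvLike`, its `reset`/`step` signatures, `EchoEnv`, and the zero-argument factory shape are all hypothetical and are not the OpenEnv specification or TRL's actual `environment_factory=` contract.

```python
from typing import Callable, Protocol, Tuple, runtime_checkable


@runtime_checkable
class EnvLike(Protocol):
    """Hypothetical minimal env interface (NOT the OpenEnv spec)."""

    def reset(self) -> str: ...

    def step(self, action: str) -> Tuple[str, float, bool]: ...


class EchoEnv:
    """Toy environment: reward 1.0 if the action repeats the prompt."""

    def __init__(self, prompt: str = "hello") -> None:
        self.prompt = prompt

    def reset(self) -> str:
        return self.prompt

    def step(self, action: str) -> Tuple[str, float, bool]:
        reward = 1.0 if action == self.prompt else 0.0
        return "", reward, True  # single-turn episode: always done


def make_env() -> EnvLike:
    """Assumed factory shape: a zero-argument callable returning a fresh env."""
    return EchoEnv()


def run_episode(factory: Callable[[], EnvLike], action: str) -> float:
    """Build an env from the factory, take one action, return the reward."""
    env = factory()
    env.reset()
    _, reward, _ = env.step(action)
    return reward


if __name__ == "__main__":
    print(isinstance(make_env(), EnvLike))  # EchoEnv never subclasses EnvLike
    print(run_episode(make_env, "hello"))
```

`EchoEnv` satisfies `EnvLike` purely by having matching methods, which mirrors why the framework needs no direct OpenEnv dependency.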

	---

	## Monarch (Meta)

	Status: 📋 Reference pattern (alternative coordination model).

	What we have:
	- Research deep-dive: `research/03-monarch-torchforge-openenv.md` § Monarch
	(actor mesh, hardware abstractions, comparison to Ray)
	- Integration recipe: `docs/INTEGRATION_ARCHITECTURE.md` Recipe C —
	"TorchForge + Monarch (reference patterns only, not a production target)"

	What we don't:
	- Direct Monarch code dependency. We use DiLoCo's pseudo-gradient sync
	as our coordination model; Monarch's actor mesh is an alternative.

	Why this shape: Monarch is alive (Meta is shipping it) but it's a
	coordination layer, not an algorithm. Our framework integrates with
	PyTorch + TRL + torchft directly; Monarch would replace the coordination
	layer underneath. Documented as a future option; not a v0.1 dependency.

	---

	## TorchForge (Meta, paused)

	Status: 📋 Reference only (upstream paused).

	What we have:
	- Research deep-dive: `research/03-monarch-torchforge-openenv.md` § TorchForge
	— design lessons captured

	What we don't:
	- Code dependency. TorchForge as a project was paused by Meta.

	Why this shape: The brief asked us to research TorchForge. We did.
	The headline finding is "Meta paused this." That's a real research output
	even if it doesn't translate to code.

	---

	## Summary

	| Substrate | Research | Recipe | Code | Tests | v0.1 production? |
	|---|---|---|---|---|---|
	| TRL | ✅ | ✅ | ✅ | 38 + 9 + 3 = 50 | ✅ |
	| VeRL | ✅ | ✅ | 🟡 (skeleton) | — | v0.2 |
	| PRIME-RL (Wave 13) | ✅ | ✅ | 🟡 (loss adapter + config) | — | v0.2 (cleanest hook) |
	| DiLoCo (single-process) | ✅ | ✅ | ✅ | 5 (single-replica) | optional |
	| DiLoCo over serverless (Wave 13) | ✅ | ✅ ADR-005 | ✅ Local + 🟡 Modal/HFJobs | 9 multi-process | ✅ (local) / future (cloud) |
	| OpenEnv | ✅ | ✅ | n/a (protocol) | — | substrate |
	| Monarch (Wave 13) | ✅ | ✅ (actor layout) | 🟡 (skeleton) | — | v0.2+ |
	| TorchForge | ✅ | n/a (paused) | n/a | — | n/a |

	8/8 substrates covered (was 6/6 pre-Wave-13). New since Wave 13:
	PRIME-RL (the cleanest custom-loss hook), Monarch (Meta's actively-shipped
	agentic-stack component), and serverless DiLoCo (Modal/HF Jobs adapters
	+ object-store rendezvous). The framework can now realize Decoupled
	DiLoCo across cloud executors without any cross-job NCCL — see
	ADR-005 for the design rationale.
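
As a rough illustration of object-store rendezvous, the sketch below has each replica publish its pseudo-gradient under a round-scoped key and then poll until all replicas have written. A local directory stands in for the object store. The key layout, JSON format, and the `publish`/`gather`/`average` names are hypothetical and are not the design in ADR-005.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import List


def publish(store: Path, round_id: int, replica_id: int,
            pseudo_grad: List[float]) -> None:
    """Write one replica's pseudo-gradient under a round-scoped key."""
    round_dir = store / f"round-{round_id}"
    round_dir.mkdir(parents=True, exist_ok=True)
    tmp = round_dir / f"replica-{replica_id}.json.tmp"
    tmp.write_text(json.dumps(pseudo_grad))
    # Rename last so readers never observe a half-written object.
    tmp.replace(round_dir / f"replica-{replica_id}.json")


def gather(store: Path, round_id: int, world_size: int,
           timeout_s: float = 5.0, poll_s: float = 0.05) -> List[List[float]]:
    """Poll the store until `world_size` replicas have published, then read all."""
    round_dir = store / f"round-{round_id}"
    deadline = time.monotonic() + timeout_s
    while True:
        files = sorted(round_dir.glob("replica-*.json"))
        if len(files) >= world_size:
            return [json.loads(f.read_text()) for f in files]
        if time.monotonic() > deadline:
            raise TimeoutError(f"round {round_id}: {len(files)}/{world_size} replicas")
        time.sleep(poll_s)


def average(vectors: List[List[float]]) -> List[float]:
    """Elementwise mean, replacing the allreduce a collective backend would do."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        store = Path(d)
        publish(store, round_id=0, replica_id=0, pseudo_grad=[0.5, -0.5])
        publish(store, round_id=0, replica_id=1, pseudo_grad=[1.0, 0.5])
        print(average(gather(store, round_id=0, world_size=2)))
```

Every replica computes the same average from the same objects, so no job ever opens a connection to another job. The cost is polling latency, which matters less when outer syncs are sparse.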