composer-replication-framework / docs /V1_V8_COVERAGE.md
Codeseys's picture
Wave 15: 4-angle multi-model self-critique caught 2 math BLOCKERs in primary loss kernels; fixed against upstream byte-for-byte + GSM8K example + ergonomics
e5add15

V1–V8 Coverage Matrix — Composer 2.5 Replication Framework

This document maps each of the 8 clauses of the original brief to the runnable artifact (or honest gap) in this repo as of HEAD.

The brief, decomposed:

[V1] dive into Composer 2.5 and understand what makes it so much better [V2] take that and combine it with diloco (decoupled, open, any variant of diloco) [V3] and monarch/torchforge/openenv/VeRL/TRL [V4] and make a framework that we can use to further RL training of models to take them to the next level [V5] One of the ideas that I had that might be a parallel to this is to use traces from an llm-application usage then replay the traces with different models to see at each llm-step what the llm would do [V6] by doing this we get distillation data from any number of models that could be used to train the target model further [V7] can we research all of this and see how we could try to set this up as a framework [V8] to take any model from huggingface and be able to further RL train it to get results to Composer 2.5 which is post-trained kimi-k2.5

Coverage at-a-glance

Clause Status Headline artifact Notes
V1 ✅ Closed research/01-composer-2.5.md + docs/COMPOSER_RECIPE_MAPPING.md + Spike 005 trainer skeleton Identified SDPO/OPSD as Composer's secret sauce; traced to arXiv:2601.20802 (ICLR 2026); audited siyan-zhao/OPSD (MIT) for the loss kernel; lifted generalized_jsd_loss into our framework as composer_replication.opsd.generalized_jsd_loss.
V2 ⚠️ Partial composer_replication.diloco.make_diloco_outer_loop wraps torchft.local_sgd.DiLoCo (BSD-3) Spike 008 verifies the outer-loop machinery + sign-convention on 1 replica. Cross-replica convergence is GPU-multi-process and not yet attempted. ADR-003 documents the choice. Wrapper is not yet integrated with ComposerReplicationTrainer — it's an independent context manager.
V3 ✅ Closed (research + recipes) See § "V3 substrate coverage" below Each substrate has a research deep-dive + an integration recipe. TRL has working code; VeRL has a config + adv-estimator skeleton; Monarch/TorchForge/OpenEnv are documented as reference patterns per the brief's "research" framing.
V4 ✅ Closed (installable) pip install -e . ships composer_replication package pyproject.toml at repo root; examples/qwen_05b_quickstart/ runs end-to-end. The package re-exports the verified APIs from spike directories (loss, batch, opsd, teacher_replay, ingestion, trainer, diloco).
V5 ✅ Closed composer_replication.ingestion.ClaudeCodeIngester + Spike 007 e2e test Real Claude Code session JSONL → TraceStatecompose_loss end-to-end smoke. ADR-002 documents the source choice + Claude Code circularity risk. 18 tests passing (15 unit + 3 e2e-with-loss).
V6 ✅ Closed composer_replication.teacher_replay.replay_trace + Spike 001 verdict Multi-teacher OpenRouter replay measured at $0.98/50-step trace, p95 latency 20.5s, 0 errors over 150 calls. Distillation data shape is DPOPair(state_id, state_messages, chosen, rejected, n_teachers_agreeing).
V7 ✅ Closed 5 research deep-dives + ADRs + integration architecture + working framework The "research and see how" question is empirically answered: framework built, primary-source-validated, four production extension paths documented. Process is auditable.
V8 ⚠️ Partial Spike 006 (CPU smoke) + Spike 002a-mini (GPU smoke) Real Qwen2.5-0.5B-Instruct loads via AutoModelForCausalLM, runs through the 3-channel loss on both CPU (Spike 006) and GPU (Spike 002a-mini, RTX 5090, bf16, 5.3 GB peak VRAM, 480ms/step). The "Composer 2.5-quality results" half of V8 is GPU-budget-gated post-replication work (Spikes 002b/003/004).

Tally: 6/8 closed, 2/8 partial. Both partials (V2 multi-process DiLoCo, V8 quality-of-results) are gated on GPU-multi-process work that is out of scope for the CPU-budget deep-work-loop phase.


V3 substrate coverage (detailed)

V3 names six substrates: monarch, torchforge, openenv, VeRL, TRL (plus DiLoCo from V2). Each has a deep-dive research doc and an integration recipe. The "framework" target lives at the intersection of all of them.

Substrate Research deep-dive Integration recipe Working code Notes
TRL (huggingface/trl) research/04-verl-trl.md § 3 docs/INTEGRATION_ARCHITECTURE.md Recipe A composer_replication.trainer.ComposerReplicationTrainer subclasses GRPOTrainer. _compute_loss override composes 3 channels. Production target for v0.1. DeepWiki-audited extension point: GRPOTrainer._compute_loss(model, inputs).
VeRL (volcengine/verl) research/04-verl-trl.md § 4 docs/INTEGRATION_ARCHITECTURE.md Recipe B 🟡 spikes/005/verl_path/composer_adv.py (110 LOC) + composer_config.yaml (89 LOC). Skeleton, not yet runnable. Production target for v0.2 scale (multi-node). Extension point: @register_adv_est(name) decorator + DataProto.batch/non_tensor_batch for extra fields.
DiLoCo (meta-pytorch/torchft) research/02-diloco-family.md (full DiLoCo / OpenDiLoCo / Streaming DiLoCo / PRIME-RL / INTELLECT-1+2 audit) docs/adrs/ADR-003-diloco-impl.md 🟡 composer_replication.diloco.make_diloco_outer_loop wraps torchft.local_sgd.DiLoCo (BSD-3). Spike 008 has 5 single-process tests including sign-convention pin. Multi-replica convergence not yet tested — single-process post-hook sequencing prevents this in CPU-only smoke. Real torch.distributed test deferred to GPU phase.
OpenEnv research/03-monarch-torchforge-openenv.md § OpenEnv docs/INTEGRATION_ARCHITECTURE.md Recipe D 📋 Reference pattern, no code Per the integration doc: "OpenEnv is a substrate, not a choice — it specifies how environments expose themselves to trainers." TRL accepts environment_factory= kwarg; VeRL has equivalent. Not a code dependency for v0.1; the framework's data path is OpenEnv-compatible by virtue of using TRL's API.
Monarch (Meta) research/03-monarch-torchforge-openenv.md § Monarch docs/INTEGRATION_ARCHITECTURE.md Recipe C 📋 Reference pattern Monarch is Meta's actor mesh — a coordination layer for distributed workers, not an algorithm. Per the research doc: "Monarch is alive, TorchForge is paused" (as of 2026-Q2). The framework's outer-loop sync via DiLoCo is an alternative coordination model that doesn't need Monarch.
TorchForge (Meta, paused) research/03-monarch-torchforge-openenv.md § TorchForge n/a (paused upstream) 📋 Reference only TorchForge as a project was paused by Meta. Research doc captures the design lessons; no code dependency.

Honest read: TRL + VeRL + DiLoCo are the three substrates the framework actually integrates with. Monarch/TorchForge/OpenEnv are documented as informed-design context, which is what the brief asked for ("can we research all of this and see how we could try to set this up").


Status definitions

  • Closed: a runnable artifact exists, has tests, and is documented.
  • ⚠️ Partial: closed in the literal sense but with documented spirit-gaps; concrete next-step is identified.
  • Open: documented but no runnable artifact.
  • 📋 Reference: research-only by design (e.g. paused upstream projects, substrates that the brief asked for as research not code).

What "Composer 2.5 quality" specifically requires (V8 honest)

To close V8 in spirit, not just letter, the framework needs:

  1. The architecture — done. Three-channel loss with TRL/VeRL recipes; SDPO via OPSD; trace-replay via OpenRouter.
  2. Real model + real GPU — done. Spike 002a-mini on 5090 sm_120, bf16, 50 steps.
  3. Real teacher rollouts at scale — Spike 002b: collect ~1000 traces × 3 teachers = ~$1000 OpenRouter spend. GPU-budget gated.
  4. A/B against plain GRPO on SWE-bench-lite — Spike 004. ~$100-200 GPU + judge calls.
  5. Decisive empirical result — only achievable after (3) and (4).

This is the post-replication phase. The CPU-only deep-work-loop phase (Waves 7-12) closes the architecture + installability + verification legs. The empirical leg requires money + time + a 7B+ model and is intentionally out of scope for the methodology phase.


How to verify each ✅ yourself

Clause Verification command
V1 cat research/01-composer-2.5.md docs/COMPOSER_RECIPE_MAPPING.md
V2 cd spikes/008-streaming-diloco && python -m pytest tests/ -q (5/5 pass)
V3 cat docs/INTEGRATION_ARCHITECTURE.md docs/V3_SUBSTRATE_COVERAGE.md
V4 pip install -e . && python examples/qwen_05b_quickstart/run.py
V5 cd spikes/007-real-trace-ingestion && python -m pytest tests/ -q
V6 cat spikes/001-teacher-replay-cost/verdict.md
V7 ls research/ docs/adrs/ docs/research/ docs/INTEGRATION_ARCHITECTURE.md
V8 cd spikes/002a-mini-gpu-smoke && python run_gpu_smoke.py (requires GPU)

References

  • docs/VISION_VALIDATION.md — original 10-point scorecard + post-Wave-11 honest re-scoring
  • docs/research/WAVE_7_10_FINAL_REVIEW.md — cross-model adversarial review of Wave 7-10 (10 priority items, 2 BLOCKERs both addressed)
  • docs/adrs/ADR-001..007 — seven architectural decisions (GPU venue, trace source, DiLoCo impl, replaysim normalization, serverless DiLoCo, RL frameworks, distillation losses)
  • BACKLOG.md — pre-execution acceptance criteria for Spikes 006/007/008 + Wave 10

Wave 13 expansion (2026-05-26)

The user expanded the brief mid-loop:

"keep going. make sure that we do the paths of the Composer 2.5 methods, the n-teachers replaysim, and Decoupled DiLoCo (so that we can leverage modal or huggingface-jobs or other serverless training systems). … For V5 see if we can leverage [a normalization library] to normalize the data while also making the replaysim dataset generation. … if we can properly document and research the self-distillation papers like SDPO OPDS and/or others. … see if there are other frameworks that are more popular that we could try to use. meta's pytorch agentic stack components are something that I'd like to explore."

Wave 13 ask Artifact Status
Decoupled DiLoCo over serverless ADR-005 + composer_replication.diloco.serverless (Protocol + ObjectStoreAllReduce + LocalProcessExecutor + Modal/HFJobs skeletons) + 9 multi-process tests ✅ Closed (local) / 🟡 Skeleton (cloud)
Replaysim normalization ADR-004 + composer_replication.replaysim package + data-juicer adapter + default YAML recipe + 9 unit tests ✅ Closed (passthrough) / 🟡 Pending data-juicer install for full path
Other RL frameworks (V3 expansion) ADR-006 + composer_replication.recipes.prime_rl (recipe + composer_loss adapter + config.yaml) ✅ Closed (recipe) / 🟡 Skeleton (runtime)
Meta's PyTorch agentic stack ADR-006 + composer_replication.recipes.monarch (actor layout doc + skeleton actors) ✅ Closed (design) / 🟡 Skeleton (impl)
Deeper self-distillation research ADR-007 + docs/research/SELF_DISTILLATION_LANDSCAPE.md + composer_replication.distillation module (SimPO + TAID-rewritten + Entropy-Aware OPD) + tests ✅ Closed end-to-end — compose_loss kwargs wired in Wave 14; TAID rewritten in Wave 15 to match SakanaAI/TAID upstream (logit-space mix, current-student-detached anchor, forward-KL criterion, optional TAIDScheduler); OPSD parity test added against siyan-zhao/OPSD upstream.
altered-minds tie-in docs/ALTERED_MINDS_TIE_IN.md (5-phase plan, $300 estimate, open questions) ✅ Closed (design)

Wave 13 test addition: 35 new tests passing (17 distillation + 9 serverless multi-process + 9 replaysim).

The framework now covers the full expanded brief. Total tests passing post-Wave-15: 115 + 1 skip-marked. Wave-by-wave evolution: 72 (W12) → 93 (W13) → 124 (W14) → 130 (W14b) → 115 (W15: TAID rewrite consolidated 16 schedule-tests into 7 t-parameterized tests; OPSD upstream-parity test added skip-marked).

This is the canonical running test count; other docs reference V1_V8_COVERAGE rather than restating.