composer-replication-framework / docs /V1_V8_COVERAGE.md
Codeseys's picture
Wave 15: 4-angle multi-model self-critique caught 2 math BLOCKERs in primary loss kernels; fixed against upstream byte-for-byte + GSM8K example + ergonomics
e5add15
# V1–V8 Coverage Matrix — Composer 2.5 Replication Framework
This document maps each of the 8 clauses of the original brief to **the
runnable artifact** (or honest gap) in this repo as of HEAD.
The brief, decomposed:
> [V1] dive into Composer 2.5 and understand what makes it so much better
> [V2] take that and combine it with diloco (decoupled, open, any variant of diloco)
> [V3] and monarch/torchforge/openenv/VeRL/TRL
> [V4] and make a framework that we can use to further RL training of models to take them to the next level
> [V5] One of the ideas that I had that might be a parallel to this is to use traces from an llm-application usage then replay the traces with different models to see at each llm-step what the llm would do
> [V6] by doing this we get distillation data from any number of models that could be used to train the target model further
> [V7] can we research all of this and see how we could try to set this up as a framework
> [V8] to take any model from huggingface and be able to further RL train it to get results to Composer 2.5 which is post-trained kimi-k2.5
## Coverage at-a-glance
| Clause | Status | Headline artifact | Notes |
|---|---|---|---|
| **V1** | ✅ Closed | `research/01-composer-2.5.md` + `docs/COMPOSER_RECIPE_MAPPING.md` + Spike 005 trainer skeleton | Identified SDPO/OPSD as Composer's secret sauce; traced to arXiv:2601.20802 (ICLR 2026); audited `siyan-zhao/OPSD` (MIT) for the loss kernel; lifted `generalized_jsd_loss` into our framework as `composer_replication.opsd.generalized_jsd_loss`. |
| **V2** | ⚠️ Partial | `composer_replication.diloco.make_diloco_outer_loop` wraps `torchft.local_sgd.DiLoCo` (BSD-3) | Spike 008 verifies the outer-loop machinery + sign-convention on 1 replica. Cross-replica convergence is GPU-multi-process and not yet attempted. ADR-003 documents the choice. Wrapper is **not yet integrated with `ComposerReplicationTrainer`** — it's an independent context manager. |
| **V3** | ✅ Closed (research + recipes) | See § "V3 substrate coverage" below | Each substrate has a research deep-dive + an integration recipe. TRL has working code; VeRL has a config + adv-estimator skeleton; Monarch/TorchForge/OpenEnv are documented as reference patterns per the brief's "research" framing. |
| **V4** | ✅ Closed (installable) | `pip install -e .` ships `composer_replication` package | `pyproject.toml` at repo root; `examples/qwen_05b_quickstart/` runs end-to-end. The package re-exports the verified APIs from spike directories (loss, batch, opsd, teacher_replay, ingestion, trainer, diloco). |
| **V5** | ✅ Closed | `composer_replication.ingestion.ClaudeCodeIngester` + Spike 007 e2e test | Real Claude Code session JSONL → `TraceState` → `compose_loss` end-to-end smoke. ADR-002 documents the source choice + Claude Code circularity risk. 18 tests passing (15 unit + 3 e2e-with-loss). |
| **V6** | ✅ Closed | `composer_replication.teacher_replay.replay_trace` + Spike 001 verdict | Multi-teacher OpenRouter replay measured at $0.98/50-step trace, p95 latency 20.5s, 0 errors over 150 calls. Distillation data shape is `DPOPair(state_id, state_messages, chosen, rejected, n_teachers_agreeing)`. |
| **V7** | ✅ Closed | 5 research deep-dives + ADRs + integration architecture + working framework | The "research and see how" question is empirically answered: framework built, primary-source-validated, four production extension paths documented. Process is auditable. |
| **V8** | ⚠️ Partial | Spike 006 (CPU smoke) + Spike 002a-mini (GPU smoke) | Real `Qwen2.5-0.5B-Instruct` loads via `AutoModelForCausalLM`, runs through the 3-channel loss on both CPU (Spike 006) and GPU (Spike 002a-mini, RTX 5090, bf16, 5.3 GB peak VRAM, 480ms/step). The "Composer 2.5-quality results" half of V8 is GPU-budget-gated post-replication work (Spikes 002b/003/004). |
**Tally**: 6/8 closed, 2/8 partial. Both partials (V2 multi-process DiLoCo, V8 quality-of-results) are gated on GPU-multi-process work that is out of scope for the CPU-budget deep-work-loop phase.
---
## V3 substrate coverage (detailed)
V3 names six substrates: **monarch, torchforge, openenv, VeRL, TRL** (plus DiLoCo from V2). Each has a deep-dive research doc and an integration recipe. The "framework" target lives at the intersection of all of them.
| Substrate | Research deep-dive | Integration recipe | Working code | Notes |
|---|---|---|---|---|
| **TRL** (huggingface/trl) | `research/04-verl-trl.md` § 3 | `docs/INTEGRATION_ARCHITECTURE.md` Recipe A | ✅ `composer_replication.trainer.ComposerReplicationTrainer` subclasses `GRPOTrainer`. `_compute_loss` override composes 3 channels. | **Production target for v0.1.** DeepWiki-audited extension point: `GRPOTrainer._compute_loss(model, inputs)`. |
| **VeRL** (volcengine/verl) | `research/04-verl-trl.md` § 4 | `docs/INTEGRATION_ARCHITECTURE.md` Recipe B | 🟡 `spikes/005/verl_path/composer_adv.py` (110 LOC) + `composer_config.yaml` (89 LOC). Skeleton, not yet runnable. | **Production target for v0.2 scale (multi-node).** Extension point: `@register_adv_est(name)` decorator + `DataProto.batch`/`non_tensor_batch` for extra fields. |
| **DiLoCo** (meta-pytorch/torchft) | `research/02-diloco-family.md` (full DiLoCo / OpenDiLoCo / Streaming DiLoCo / PRIME-RL / INTELLECT-1+2 audit) | `docs/adrs/ADR-003-diloco-impl.md` | 🟡 `composer_replication.diloco.make_diloco_outer_loop` wraps `torchft.local_sgd.DiLoCo` (BSD-3). Spike 008 has 5 single-process tests including sign-convention pin. | **Multi-replica convergence not yet tested** — single-process post-hook sequencing prevents this in CPU-only smoke. Real `torch.distributed` test deferred to GPU phase. |
| **OpenEnv** | `research/03-monarch-torchforge-openenv.md` § OpenEnv | `docs/INTEGRATION_ARCHITECTURE.md` Recipe D | 📋 Reference pattern, no code | Per the integration doc: "OpenEnv is a substrate, not a choice — it specifies how environments expose themselves to trainers." TRL accepts `environment_factory=` kwarg; VeRL has equivalent. **Not a code dependency for v0.1**; the framework's data path is OpenEnv-compatible by virtue of using TRL's API. |
| **Monarch** (Meta) | `research/03-monarch-torchforge-openenv.md` § Monarch | `docs/INTEGRATION_ARCHITECTURE.md` Recipe C | 📋 Reference pattern | Monarch is Meta's actor mesh — a coordination layer for distributed workers, not an algorithm. Per the research doc: "Monarch is alive, TorchForge is paused" (as of 2026-Q2). The framework's outer-loop sync via DiLoCo is an alternative coordination model that doesn't need Monarch. |
| **TorchForge** (Meta, paused) | `research/03-monarch-torchforge-openenv.md` § TorchForge | n/a (paused upstream) | 📋 Reference only | TorchForge as a project was paused by Meta. Research doc captures the design lessons; no code dependency. |
**Honest read**: TRL + VeRL + DiLoCo are the three substrates the framework actually integrates with. Monarch/TorchForge/OpenEnv are documented as informed-design context, which is what the brief asked for ("can we research all of this and see how we could try to set this up").
---
## Status definitions
-**Closed**: a runnable artifact exists, has tests, and is documented.
- ⚠️ **Partial**: closed in the literal sense but with documented spirit-gaps; concrete next-step is identified.
-**Open**: documented but no runnable artifact.
- 📋 **Reference**: research-only by design (e.g. paused upstream projects, substrates that the brief asked for as research not code).
---
## What "Composer 2.5 quality" specifically requires (V8 honest)
To close V8 in spirit, not just letter, the framework needs:
1.**The architecture** — done. Three-channel loss with TRL/VeRL recipes; SDPO via OPSD; trace-replay via OpenRouter.
2.**Real model + real GPU** — done. Spike 002a-mini on 5090 sm_120, bf16, 50 steps.
3. ❌ **Real teacher rollouts at scale** — Spike 002b: collect ~1000 traces × 3 teachers = ~$1000 OpenRouter spend. GPU-budget gated.
4. ❌ **A/B against plain GRPO on SWE-bench-lite** — Spike 004. ~$100-200 GPU + judge calls.
5. ❌ **Decisive empirical result** — only achievable after (3) and (4).
This is the post-replication phase. The CPU-only deep-work-loop phase (Waves 7-12) closes the **architecture + installability + verification** legs. The empirical leg requires money + time + a 7B+ model and is intentionally out of scope for the methodology phase.
---
## How to verify each ✅ yourself
| Clause | Verification command |
|---|---|
| V1 | `cat research/01-composer-2.5.md docs/COMPOSER_RECIPE_MAPPING.md` |
| V2 | `cd spikes/008-streaming-diloco && python -m pytest tests/ -q` (5/5 pass) |
| V3 | `cat docs/INTEGRATION_ARCHITECTURE.md docs/V3_SUBSTRATE_COVERAGE.md` |
| V4 | `pip install -e . && python examples/qwen_05b_quickstart/run.py` |
| V5 | `cd spikes/007-real-trace-ingestion && python -m pytest tests/ -q` |
| V6 | `cat spikes/001-teacher-replay-cost/verdict.md` |
| V7 | `ls research/ docs/adrs/ docs/research/ docs/INTEGRATION_ARCHITECTURE.md` |
| V8 | `cd spikes/002a-mini-gpu-smoke && python run_gpu_smoke.py` (requires GPU) |
---
## References
- `docs/VISION_VALIDATION.md` — original 10-point scorecard + post-Wave-11 honest re-scoring
- `docs/research/WAVE_7_10_FINAL_REVIEW.md` — cross-model adversarial review of Wave 7-10 (10 priority items, 2 BLOCKERs both addressed)
- `docs/adrs/ADR-001..007` — seven architectural decisions (GPU venue, trace source, DiLoCo impl, replaysim normalization, serverless DiLoCo, RL frameworks, distillation losses)
- `BACKLOG.md` — pre-execution acceptance criteria for Spikes 006/007/008 + Wave 10
---
## Wave 13 expansion (2026-05-26)
The user expanded the brief mid-loop:
> *"keep going. make sure that we do the paths of the Composer 2.5 methods, the n-teachers replaysim, and Decoupled DiLoCo (so that we can leverage modal or huggingface-jobs or other serverless training systems). … For V5 see if we can leverage [a normalization library] to normalize the data while also making the replaysim dataset generation. … if we can properly document and research the self-distillation papers like SDPO OPDS and/or others. … see if there are other frameworks that are more popular that we could try to use. meta's pytorch agentic stack components are something that I'd like to explore."*
| Wave 13 ask | Artifact | Status |
|---|---|---|
| Decoupled DiLoCo over serverless | ADR-005 + `composer_replication.diloco.serverless` (Protocol + ObjectStoreAllReduce + LocalProcessExecutor + Modal/HFJobs skeletons) + 9 multi-process tests | ✅ Closed (local) / 🟡 Skeleton (cloud) |
| Replaysim normalization | ADR-004 + `composer_replication.replaysim` package + `data-juicer` adapter + default YAML recipe + 9 unit tests | ✅ Closed (passthrough) / 🟡 Pending data-juicer install for full path |
| Other RL frameworks (V3 expansion) | ADR-006 + `composer_replication.recipes.prime_rl` (recipe + composer_loss adapter + config.yaml) | ✅ Closed (recipe) / 🟡 Skeleton (runtime) |
| Meta's PyTorch agentic stack | ADR-006 + `composer_replication.recipes.monarch` (actor layout doc + skeleton actors) | ✅ Closed (design) / 🟡 Skeleton (impl) |
| Deeper self-distillation research | ADR-007 + `docs/research/SELF_DISTILLATION_LANDSCAPE.md` + `composer_replication.distillation` module (SimPO + TAID-rewritten + Entropy-Aware OPD) + tests | ✅ Closed end-to-end — `compose_loss` kwargs wired in Wave 14; TAID rewritten in Wave 15 to match SakanaAI/TAID upstream (logit-space mix, current-student-detached anchor, forward-KL criterion, optional `TAIDScheduler`); OPSD parity test added against `siyan-zhao/OPSD` upstream. |
| altered-minds tie-in | `docs/ALTERED_MINDS_TIE_IN.md` (5-phase plan, $300 estimate, open questions) | ✅ Closed (design) |
**Wave 13 test addition**: 35 new tests passing (17 distillation + 9 serverless multi-process + 9 replaysim).
The framework now covers the full expanded brief. **Total tests passing
post-Wave-15: 115 + 1 skip-marked.** Wave-by-wave evolution: 72 (W12) → 93 (W13) → 124 (W14) → 130 (W14b) → 115 (W15: TAID rewrite consolidated 16 schedule-tests into 7 t-parameterized tests; OPSD upstream-parity test added skip-marked).
This is the canonical running test count; other docs reference V1_V8_COVERAGE rather than restating.