Wave 15: 4-angle multi-model self-critique caught 2 math BLOCKERs in primary loss kernels; fixed against upstream byte-for-byte + GSM8K example + ergonomics

e5add15 12 days ago

preview code

raw

history blame contribute delete

12.4 kB

V1–V8 Coverage Matrix — Composer 2.5 Replication Framework

This document maps each of the 8 clauses of the original brief to the runnable artifact (or honest gap) in this repo as of HEAD.

The brief, decomposed:

[V1] dive into Composer 2.5 and understand what makes it so much better [V2] take that and combine it with diloco (decoupled, open, any variant of diloco) [V3] and monarch/torchforge/openenv/VeRL/TRL [V4] and make a framework that we can use to further RL training of models to take them to the next level [V5] One of the ideas that I had that might be a parallel to this is to use traces from an llm-application usage then replay the traces with different models to see at each llm-step what the llm would do [V6] by doing this we get distillation data from any number of models that could be used to train the target model further [V7] can we research all of this and see how we could try to set this up as a framework [V8] to take any model from huggingface and be able to further RL train it to get results to Composer 2.5 which is post-trained kimi-k2.5

Coverage at-a-glance

Clause	Status	Headline artifact	Notes
V1	✅ Closed	`research/01-composer-2.5.md` + `docs/COMPOSER_RECIPE_MAPPING.md` + Spike 005 trainer skeleton	Identified SDPO/OPSD as Composer's secret sauce; traced to arXiv:2601.20802 (ICLR 2026); audited `siyan-zhao/OPSD` (MIT) for the loss kernel; lifted `generalized_jsd_loss` into our framework as `composer_replication.opsd.generalized_jsd_loss`.
V2	⚠️ Partial	`composer_replication.diloco.make_diloco_outer_loop` wraps `torchft.local_sgd.DiLoCo` (BSD-3)	Spike 008 verifies the outer-loop machinery + sign-convention on 1 replica. Cross-replica convergence is GPU-multi-process and not yet attempted. ADR-003 documents the choice. Wrapper is not yet integrated with `ComposerReplicationTrainer` — it's an independent context manager.
V3	✅ Closed (research + recipes)	See § "V3 substrate coverage" below	Each substrate has a research deep-dive + an integration recipe. TRL has working code; VeRL has a config + adv-estimator skeleton; Monarch/TorchForge/OpenEnv are documented as reference patterns per the brief's "research" framing.
V4	✅ Closed (installable)	`pip install -e .` ships `composer_replication` package	`pyproject.toml` at repo root; `examples/qwen_05b_quickstart/` runs end-to-end. The package re-exports the verified APIs from spike directories (loss, batch, opsd, teacher_replay, ingestion, trainer, diloco).
V5	✅ Closed	`composer_replication.ingestion.ClaudeCodeIngester` + Spike 007 e2e test	Real Claude Code session JSONL → `TraceState` → `compose_loss` end-to-end smoke. ADR-002 documents the source choice + Claude Code circularity risk. 18 tests passing (15 unit + 3 e2e-with-loss).
V6	✅ Closed	`composer_replication.teacher_replay.replay_trace` + Spike 001 verdict	Multi-teacher OpenRouter replay measured at $0.98/50-step trace, p95 latency 20.5s, 0 errors over 150 calls. Distillation data shape is `DPOPair(state_id, state_messages, chosen, rejected, n_teachers_agreeing)`.
V7	✅ Closed	5 research deep-dives + ADRs + integration architecture + working framework	The "research and see how" question is empirically answered: framework built, primary-source-validated, four production extension paths documented. Process is auditable.
V8	⚠️ Partial	Spike 006 (CPU smoke) + Spike 002a-mini (GPU smoke)	Real `Qwen2.5-0.5B-Instruct` loads via `AutoModelForCausalLM`, runs through the 3-channel loss on both CPU (Spike 006) and GPU (Spike 002a-mini, RTX 5090, bf16, 5.3 GB peak VRAM, 480ms/step). The "Composer 2.5-quality results" half of V8 is GPU-budget-gated post-replication work (Spikes 002b/003/004).

Tally: 6/8 closed, 2/8 partial. Both partials (V2 multi-process DiLoCo, V8 quality-of-results) are gated on GPU-multi-process work that is out of scope for the CPU-budget deep-work-loop phase.

V3 substrate coverage (detailed)

V3 names six substrates: monarch, torchforge, openenv, VeRL, TRL (plus DiLoCo from V2). Each has a deep-dive research doc and an integration recipe. The "framework" target lives at the intersection of all of them.

Substrate	Research deep-dive	Integration recipe	Working code	Notes
TRL (huggingface/trl)	`research/04-verl-trl.md` § 3	`docs/INTEGRATION_ARCHITECTURE.md` Recipe A	✅ `composer_replication.trainer.ComposerReplicationTrainer` subclasses `GRPOTrainer`. `_compute_loss` override composes 3 channels.	Production target for v0.1. DeepWiki-audited extension point: `GRPOTrainer._compute_loss(model, inputs)`.
VeRL (volcengine/verl)	`research/04-verl-trl.md` § 4	`docs/INTEGRATION_ARCHITECTURE.md` Recipe B	🟡 `spikes/005/verl_path/composer_adv.py` (110 LOC) + `composer_config.yaml` (89 LOC). Skeleton, not yet runnable.	Production target for v0.2 scale (multi-node). Extension point: `@register_adv_est(name)` decorator + `DataProto.batch`/`non_tensor_batch` for extra fields.
DiLoCo (meta-pytorch/torchft)	`research/02-diloco-family.md` (full DiLoCo / OpenDiLoCo / Streaming DiLoCo / PRIME-RL / INTELLECT-1+2 audit)	`docs/adrs/ADR-003-diloco-impl.md`	🟡 `composer_replication.diloco.make_diloco_outer_loop` wraps `torchft.local_sgd.DiLoCo` (BSD-3). Spike 008 has 5 single-process tests including sign-convention pin.	Multi-replica convergence not yet tested — single-process post-hook sequencing prevents this in CPU-only smoke. Real `torch.distributed` test deferred to GPU phase.
OpenEnv	`research/03-monarch-torchforge-openenv.md` § OpenEnv	`docs/INTEGRATION_ARCHITECTURE.md` Recipe D	📋 Reference pattern, no code	Per the integration doc: "OpenEnv is a substrate, not a choice — it specifies how environments expose themselves to trainers." TRL accepts `environment_factory=` kwarg; VeRL has equivalent. Not a code dependency for v0.1; the framework's data path is OpenEnv-compatible by virtue of using TRL's API.
Monarch (Meta)	`research/03-monarch-torchforge-openenv.md` § Monarch	`docs/INTEGRATION_ARCHITECTURE.md` Recipe C	📋 Reference pattern	Monarch is Meta's actor mesh — a coordination layer for distributed workers, not an algorithm. Per the research doc: "Monarch is alive, TorchForge is paused" (as of 2026-Q2). The framework's outer-loop sync via DiLoCo is an alternative coordination model that doesn't need Monarch.
TorchForge (Meta, paused)	`research/03-monarch-torchforge-openenv.md` § TorchForge	n/a (paused upstream)	📋 Reference only	TorchForge as a project was paused by Meta. Research doc captures the design lessons; no code dependency.

Honest read: TRL + VeRL + DiLoCo are the three substrates the framework actually integrates with. Monarch/TorchForge/OpenEnv are documented as informed-design context, which is what the brief asked for ("can we research all of this and see how we could try to set this up").

Status definitions

✅ Closed: a runnable artifact exists, has tests, and is documented.
⚠️ Partial: closed in the literal sense but with documented spirit-gaps; concrete next-step is identified.
❌ Open: documented but no runnable artifact.
📋 Reference: research-only by design (e.g. paused upstream projects, substrates that the brief asked for as research not code).

What "Composer 2.5 quality" specifically requires (V8 honest)

To close V8 in spirit, not just letter, the framework needs:

✅ The architecture — done. Three-channel loss with TRL/VeRL recipes; SDPO via OPSD; trace-replay via OpenRouter.
✅ Real model + real GPU — done. Spike 002a-mini on 5090 sm_120, bf16, 50 steps.
❌ Real teacher rollouts at scale — Spike 002b: collect ~1000 traces × 3 teachers = ~$1000 OpenRouter spend. GPU-budget gated.
❌ A/B against plain GRPO on SWE-bench-lite — Spike 004. ~$100-200 GPU + judge calls.
❌ Decisive empirical result — only achievable after (3) and (4).

This is the post-replication phase. The CPU-only deep-work-loop phase (Waves 7-12) closes the architecture + installability + verification legs. The empirical leg requires money + time + a 7B+ model and is intentionally out of scope for the methodology phase.

How to verify each ✅ yourself

Clause	Verification command
V1	`cat research/01-composer-2.5.md docs/COMPOSER_RECIPE_MAPPING.md`
V2	`cd spikes/008-streaming-diloco && python -m pytest tests/ -q` (5/5 pass)
V3	`cat docs/INTEGRATION_ARCHITECTURE.md docs/V3_SUBSTRATE_COVERAGE.md`
V4	`pip install -e . && python examples/qwen_05b_quickstart/run.py`
V5	`cd spikes/007-real-trace-ingestion && python -m pytest tests/ -q`
V6	`cat spikes/001-teacher-replay-cost/verdict.md`
V7	`ls research/ docs/adrs/ docs/research/ docs/INTEGRATION_ARCHITECTURE.md`
V8	`cd spikes/002a-mini-gpu-smoke && python run_gpu_smoke.py` (requires GPU)

References

docs/VISION_VALIDATION.md — original 10-point scorecard + post-Wave-11 honest re-scoring
docs/research/WAVE_7_10_FINAL_REVIEW.md — cross-model adversarial review of Wave 7-10 (10 priority items, 2 BLOCKERs both addressed)
docs/adrs/ADR-001..007 — seven architectural decisions (GPU venue, trace source, DiLoCo impl, replaysim normalization, serverless DiLoCo, RL frameworks, distillation losses)
BACKLOG.md — pre-execution acceptance criteria for Spikes 006/007/008 + Wave 10

Wave 13 expansion (2026-05-26)

The user expanded the brief mid-loop:

"keep going. make sure that we do the paths of the Composer 2.5 methods, the n-teachers replaysim, and Decoupled DiLoCo (so that we can leverage modal or huggingface-jobs or other serverless training systems). … For V5 see if we can leverage [a normalization library] to normalize the data while also making the replaysim dataset generation. … if we can properly document and research the self-distillation papers like SDPO OPDS and/or others. … see if there are other frameworks that are more popular that we could try to use. meta's pytorch agentic stack components are something that I'd like to explore."

Wave 13 ask	Artifact	Status
Decoupled DiLoCo over serverless	ADR-005 + `composer_replication.diloco.serverless` (Protocol + ObjectStoreAllReduce + LocalProcessExecutor + Modal/HFJobs skeletons) + 9 multi-process tests	✅ Closed (local) / 🟡 Skeleton (cloud)
Replaysim normalization	ADR-004 + `composer_replication.replaysim` package + `data-juicer` adapter + default YAML recipe + 9 unit tests	✅ Closed (passthrough) / 🟡 Pending data-juicer install for full path
Other RL frameworks (V3 expansion)	ADR-006 + `composer_replication.recipes.prime_rl` (recipe + composer_loss adapter + config.yaml)	✅ Closed (recipe) / 🟡 Skeleton (runtime)
Meta's PyTorch agentic stack	ADR-006 + `composer_replication.recipes.monarch` (actor layout doc + skeleton actors)	✅ Closed (design) / 🟡 Skeleton (impl)
Deeper self-distillation research	ADR-007 + `docs/research/SELF_DISTILLATION_LANDSCAPE.md` + `composer_replication.distillation` module (SimPO + TAID-rewritten + Entropy-Aware OPD) + tests	✅ Closed end-to-end — `compose_loss` kwargs wired in Wave 14; TAID rewritten in Wave 15 to match SakanaAI/TAID upstream (logit-space mix, current-student-detached anchor, forward-KL criterion, optional `TAIDScheduler`); OPSD parity test added against `siyan-zhao/OPSD` upstream.
altered-minds tie-in	`docs/ALTERED_MINDS_TIE_IN.md` (5-phase plan, $300 estimate, open questions)	✅ Closed (design)

Wave 13 test addition: 35 new tests passing (17 distillation + 9 serverless multi-process + 9 replaysim).

The framework now covers the full expanded brief. Total tests passing post-Wave-15: 115 + 1 skip-marked. Wave-by-wave evolution: 72 (W12) → 93 (W13) → 124 (W14) → 130 (W14b) → 115 (W15: TAID rewrite consolidated 16 schedule-tests into 7 t-parameterized tests; OPSD upstream-parity test added skip-marked).

This is the canonical running test count; other docs reference V1_V8_COVERAGE rather than restating.