File size: 13,403 Bytes
d88715c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b266c31
d88715c
b266c31
 
 
 
 
 
 
 
 
 
 
 
 
 
 
e5add15
b266c31
 
 
 
c11cf49
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
e5add15
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
# V1–V8 Coverage Matrix — Composer 2.5 Replication Framework

This document maps each of the 8 clauses of the original brief to **the
runnable artifact** (or honest gap) in this repo as of HEAD.

The brief, decomposed:

> [V1] dive into Composer 2.5 and understand what makes it so much better
> [V2] take that and combine it with diloco (decoupled, open, any variant of diloco)
> [V3] and monarch/torchforge/openenv/VeRL/TRL
> [V4] and make a framework that we can use to further RL training of models to take them to the next level
> [V5] One of the ideas that I had that might be a parallel to this is to use traces from an llm-application usage then replay the traces with different models to see at each llm-step what the llm would do
> [V6] by doing this we get distillation data from any number of models that could be used to train the target model further
> [V7] can we research all of this and see how we could try to set this up as a framework
> [V8] to take any model from huggingface and be able to further RL train it to get results to Composer 2.5 which is post-trained kimi-k2.5

## Coverage at-a-glance

| Clause | Status | Headline artifact | Notes |
|---|---|---|---|
| **V1** | ✅ Closed | `research/01-composer-2.5.md` + `docs/COMPOSER_RECIPE_MAPPING.md` + Spike 005 trainer skeleton | Identified SDPO/OPSD as Composer's secret sauce; traced to arXiv:2601.20802 (ICLR 2026); audited `siyan-zhao/OPSD` (MIT) for the loss kernel; lifted `generalized_jsd_loss` into our framework as `composer_replication.opsd.generalized_jsd_loss`. |
| **V2** | ⚠️ Partial | `composer_replication.diloco.make_diloco_outer_loop` wraps `torchft.local_sgd.DiLoCo` (BSD-3) | Spike 008 verifies the outer-loop machinery + sign-convention on 1 replica. Cross-replica convergence is GPU-multi-process and not yet attempted. ADR-003 documents the choice. Wrapper is **not yet integrated with `ComposerReplicationTrainer`** — it's an independent context manager. |
| **V3** | ✅ Closed (research + recipes) | See § "V3 substrate coverage" below | Each substrate has a research deep-dive + an integration recipe. TRL has working code; VeRL has a config + adv-estimator skeleton; Monarch/TorchForge/OpenEnv are documented as reference patterns per the brief's "research" framing. |
| **V4** | ✅ Closed (installable) | `pip install -e .` ships `composer_replication` package | `pyproject.toml` at repo root; `examples/qwen_05b_quickstart/` runs end-to-end. The package re-exports the verified APIs from spike directories (loss, batch, opsd, teacher_replay, ingestion, trainer, diloco). |
| **V5** | ✅ Closed | `composer_replication.ingestion.ClaudeCodeIngester` + Spike 007 e2e test | Real Claude Code session JSONL → `TraceState` → `compose_loss` end-to-end smoke. ADR-002 documents the source choice + Claude Code circularity risk. 18 tests passing (15 unit + 3 e2e-with-loss). |
| **V6** | ✅ Closed | `composer_replication.teacher_replay.replay_trace` + Spike 001 verdict | Multi-teacher OpenRouter replay measured at $0.98/50-step trace, p95 latency 20.5s, 0 errors over 150 calls. Distillation data shape is `DPOPair(state_id, state_messages, chosen, rejected, n_teachers_agreeing)`. |
| **V7** | ✅ Closed | 5 research deep-dives + ADRs + integration architecture + working framework | The "research and see how" question is empirically answered: framework built, primary-source-validated, four production extension paths documented. Process is auditable. |
| **V8** | ⚠️ Partial | Spike 006 (CPU smoke) + Spike 002a-mini (GPU smoke) | Real `Qwen2.5-0.5B-Instruct` loads via `AutoModelForCausalLM`, runs through the 3-channel loss on both CPU (Spike 006) and GPU (Spike 002a-mini, RTX 5090, bf16, 5.3 GB peak VRAM, 480ms/step). The "Composer 2.5-quality results" half of V8 is GPU-budget-gated post-replication work (Spikes 002b/003/004). |

**Tally**: 6/8 closed, 2/8 partial. Both partials (V2 multi-process DiLoCo, V8 quality-of-results) are gated on GPU-multi-process work that is out of scope for the CPU-budget deep-work-loop phase.

---

## V3 substrate coverage (detailed)

V3 names six substrates: **monarch, torchforge, openenv, VeRL, TRL** (plus DiLoCo from V2). Each has a deep-dive research doc and an integration recipe. The "framework" target lives at the intersection of all of them.

| Substrate | Research deep-dive | Integration recipe | Working code | Notes |
|---|---|---|---|---|
| **TRL** (huggingface/trl) | `research/04-verl-trl.md` § 3 | `docs/INTEGRATION_ARCHITECTURE.md` Recipe A | ✅ `composer_replication.trainer.ComposerReplicationTrainer` subclasses `GRPOTrainer`. `_compute_loss` override composes 3 channels. | **Production target for v0.1.** DeepWiki-audited extension point: `GRPOTrainer._compute_loss(model, inputs)`. |
| **VeRL** (volcengine/verl) | `research/04-verl-trl.md` § 4 | `docs/INTEGRATION_ARCHITECTURE.md` Recipe B | 🟡 `spikes/005/verl_path/composer_adv.py` (110 LOC) + `composer_config.yaml` (89 LOC). Skeleton, not yet runnable. | **Production target for v0.2 scale (multi-node).** Extension point: `@register_adv_est(name)` decorator + `DataProto.batch`/`non_tensor_batch` for extra fields. |
| **DiLoCo** (meta-pytorch/torchft) | `research/02-diloco-family.md` (full DiLoCo / OpenDiLoCo / Streaming DiLoCo / PRIME-RL / INTELLECT-1+2 audit) | `docs/adrs/ADR-003-diloco-impl.md` | 🟡 `composer_replication.diloco.make_diloco_outer_loop` wraps `torchft.local_sgd.DiLoCo` (BSD-3). Spike 008 has 5 single-process tests including sign-convention pin. | **Multi-replica convergence not yet tested** — single-process post-hook sequencing prevents this in CPU-only smoke. Real `torch.distributed` test deferred to GPU phase. |
| **OpenEnv** | `research/03-monarch-torchforge-openenv.md` § OpenEnv | `docs/INTEGRATION_ARCHITECTURE.md` Recipe D | 📋 Reference pattern, no code | Per the integration doc: "OpenEnv is a substrate, not a choice — it specifies how environments expose themselves to trainers." TRL accepts `environment_factory=` kwarg; VeRL has equivalent. **Not a code dependency for v0.1**; the framework's data path is OpenEnv-compatible by virtue of using TRL's API. |
| **Monarch** (Meta) | `research/03-monarch-torchforge-openenv.md` § Monarch | `docs/INTEGRATION_ARCHITECTURE.md` Recipe C | 📋 Reference pattern | Monarch is Meta's actor mesh — a coordination layer for distributed workers, not an algorithm. Per the research doc: "Monarch is alive, TorchForge is paused" (as of 2026-Q2). The framework's outer-loop sync via DiLoCo is an alternative coordination model that doesn't need Monarch. |
| **TorchForge** (Meta, paused) | `research/03-monarch-torchforge-openenv.md` § TorchForge | n/a (paused upstream) | 📋 Reference only | TorchForge as a project was paused by Meta. Research doc captures the design lessons; no code dependency. |

**Honest read**: TRL + VeRL + DiLoCo are the three substrates the framework actually integrates with. Monarch/TorchForge/OpenEnv are documented as informed-design context, which is what the brief asked for ("can we research all of this and see how we could try to set this up").

---

## Status definitions

-**Closed**: a runnable artifact exists, has tests, and is documented.
- ⚠️ **Partial**: closed in the literal sense but with documented spirit-gaps; concrete next-step is identified.
-**Open**: documented but no runnable artifact.
- 📋 **Reference**: research-only by design (e.g. paused upstream projects, substrates that the brief asked for as research not code).

---

## What "Composer 2.5 quality" specifically requires (V8 honest)

To close V8 in spirit, not just letter, the framework needs:

1.**The architecture** — done. Three-channel loss with TRL/VeRL recipes; SDPO via OPSD; trace-replay via OpenRouter.
2.**Real model + real GPU** — done. Spike 002a-mini on 5090 sm_120, bf16, 50 steps.
3. ❌ **Real teacher rollouts at scale** — Spike 002b: collect ~1000 traces × 3 teachers = ~$1000 OpenRouter spend. GPU-budget gated.
4. ❌ **A/B against plain GRPO on SWE-bench-lite** — Spike 004. ~$100-200 GPU + judge calls.
5. ❌ **Decisive empirical result** — only achievable after (3) and (4).

This is the post-replication phase. The CPU-only deep-work-loop phase (Waves 7-12) closes the **architecture + installability + verification** legs. The empirical leg requires money + time + a 7B+ model and is intentionally out of scope for the methodology phase.

---

## How to verify each ✅ yourself

| Clause | Verification command |
|---|---|
| V1 | `cat research/01-composer-2.5.md docs/COMPOSER_RECIPE_MAPPING.md` |
| V2 | `cd spikes/008-streaming-diloco && python -m pytest tests/ -q` (5/5 pass) |
| V3 | `cat docs/INTEGRATION_ARCHITECTURE.md docs/V3_SUBSTRATE_COVERAGE.md` |
| V4 | `pip install -e . && python examples/qwen_05b_quickstart/run.py` |
| V5 | `cd spikes/007-real-trace-ingestion && python -m pytest tests/ -q` |
| V6 | `cat spikes/001-teacher-replay-cost/verdict.md` |
| V7 | `ls research/ docs/adrs/ docs/research/ docs/INTEGRATION_ARCHITECTURE.md` |
| V8 | `cd spikes/002a-mini-gpu-smoke && python run_gpu_smoke.py` (requires GPU) |

---

## References

- `docs/VISION_VALIDATION.md` — original 10-point scorecard + post-Wave-11 honest re-scoring
- `docs/research/WAVE_7_10_FINAL_REVIEW.md` — cross-model adversarial review of Wave 7-10 (10 priority items, 2 BLOCKERs both addressed)
- `docs/adrs/ADR-001..007` — seven architectural decisions (GPU venue, trace source, DiLoCo impl, replaysim normalization, serverless DiLoCo, RL frameworks, distillation losses)
- `BACKLOG.md` — pre-execution acceptance criteria for Spikes 006/007/008 + Wave 10

---

## Wave 13 expansion (2026-05-26)

The user expanded the brief mid-loop:

> *"keep going. make sure that we do the paths of the Composer 2.5 methods, the n-teachers replaysim, and Decoupled DiLoCo (so that we can leverage modal or huggingface-jobs or other serverless training systems). … For V5 see if we can leverage [a normalization library] to normalize the data while also making the replaysim dataset generation. … if we can properly document and research the self-distillation papers like SDPO OPDS and/or others. … see if there are other frameworks that are more popular that we could try to use. meta's pytorch agentic stack components are something that I'd like to explore."*

| Wave 13 ask | Artifact | Status |
|---|---|---|
| Decoupled DiLoCo over serverless | ADR-005 + `composer_replication.diloco.serverless` (Protocol + ObjectStoreAllReduce + LocalProcessExecutor + Modal/HFJobs skeletons) + 9 multi-process tests | ✅ Closed (local) / 🟡 Skeleton (cloud) |
| Replaysim normalization | ADR-004 + `composer_replication.replaysim` package + `data-juicer` adapter + default YAML recipe + 9 unit tests | ✅ Closed (passthrough) / 🟡 Pending data-juicer install for full path |
| Other RL frameworks (V3 expansion) | ADR-006 + `composer_replication.recipes.prime_rl` (recipe + composer_loss adapter + config.yaml) | ✅ Closed (recipe) / 🟡 Skeleton (runtime) |
| Meta's PyTorch agentic stack | ADR-006 + `composer_replication.recipes.monarch` (actor layout doc + skeleton actors) | ✅ Closed (design) / 🟡 Skeleton (impl) |
| Deeper self-distillation research | ADR-007 + `docs/research/SELF_DISTILLATION_LANDSCAPE.md` + `composer_replication.distillation` module (SimPO + TAID-rewritten + Entropy-Aware OPD) + tests | ✅ Closed end-to-end — `compose_loss` kwargs wired in Wave 14; TAID rewritten in Wave 15 to match SakanaAI/TAID upstream (logit-space mix, current-student-detached anchor, forward-KL criterion, optional `TAIDScheduler`); OPSD parity test added against `siyan-zhao/OPSD` upstream. |
| altered-minds tie-in | `docs/ALTERED_MINDS_TIE_IN.md` (5-phase plan, $300 estimate, open questions) | ✅ Closed (design) |

**Wave 13 test addition**: 35 new tests passing (17 distillation + 9 serverless multi-process + 9 replaysim).

The framework now covers the full expanded brief.

**Canonical test count (measured 2026-06-09 on this tree): 266 passed / 62 skipped / 328 collected.**
Wave-by-wave growth of the *passing-on-a-minimal-CPU-env* subset: 72 (W12) → 93 (W13)
→ 124 (W14) → 130 (W14b) → 115 (W15) → … → **266 (2026-06-09)** as later waves
(datagen, ADR-011/012/013/014, serverless, ingestion adapter) added subpackages and tests.

**Why the skip count varies by environment (and why older docs cite 115 / 176 / 210 / 232):**
the suite has ~328 collected tests; how many *run vs skip* depends on what optional
deps / host capabilities are present. Tests `skipif`-gate on: `torchft` (DiLoCo
integration — Linux-x86_64-only, absent on macOS arm64), `modal`, `data-juicer`,
`prime-rl`, the `/tmp/{opsd,taid}-clone` upstream-parity clones, a real Claude Code
session log, and a live **Docker** host. On a minimal CPU env many of those skip;
on a Docker-enabled host the substrate-E2E gates RUN (proven 2026-06-09). The
divergent historical numbers (115 Wave-15, 232/18, 210/16, 176/8) are point-in-time
snapshots under different dep/host matrices — they are not contradictions, but this
line is the one canonical figure; reproduce it with `pip install -e '.[dev]'` then
`pytest -q` (add `.[datagen]` + a Docker host to un-skip the substrate E2E).

This is the canonical running test count; other docs reference V1_V8_COVERAGE rather than restating.