Reinforcement Learning
Transformers
English
post-training
distillation
agentic-coding
composer-2.5
cursor
kimi-k2
grpo
dapo
diloco
openenv
trl
verl
research
methodology
Instructions to use Codeseys/composer-replication-framework with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Codeseys/composer-replication-framework with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Codeseys/composer-replication-framework", dtype="auto") - Notebooks
- Google Colab
- Kaggle
| # V1–V8 Coverage Matrix — Composer 2.5 Replication Framework | |
| This document maps each of the 8 clauses of the original brief to **the | |
| runnable artifact** (or honest gap) in this repo as of HEAD. | |
| The brief, decomposed: | |
| > [V1] dive into Composer 2.5 and understand what makes it so much better | |
| > [V2] take that and combine it with diloco (decoupled, open, any variant of diloco) | |
| > [V3] and monarch/torchforge/openenv/VeRL/TRL | |
| > [V4] and make a framework that we can use to further RL training of models to take them to the next level | |
| > [V5] One of the ideas that I had that might be a parallel to this is to use traces from an llm-application usage then replay the traces with different models to see at each llm-step what the llm would do | |
| > [V6] by doing this we get distillation data from any number of models that could be used to train the target model further | |
| > [V7] can we research all of this and see how we could try to set this up as a framework | |
| > [V8] to take any model from huggingface and be able to further RL train it to get results to Composer 2.5 which is post-trained kimi-k2.5 | |
| ## Coverage at-a-glance | |
| | Clause | Status | Headline artifact | Notes | | |
| |---|---|---|---| | |
| | **V1** | ✅ Closed | `research/01-composer-2.5.md` + `docs/COMPOSER_RECIPE_MAPPING.md` + Spike 005 trainer skeleton | Identified SDPO/OPSD as Composer's secret sauce; traced to arXiv:2601.20802 (ICLR 2026); audited `siyan-zhao/OPSD` (MIT) for the loss kernel; lifted `generalized_jsd_loss` into our framework as `composer_replication.opsd.generalized_jsd_loss`. | | |
| | **V2** | ⚠️ Partial | `composer_replication.diloco.make_diloco_outer_loop` wraps `torchft.local_sgd.DiLoCo` (BSD-3) | Spike 008 verifies the outer-loop machinery + sign-convention on 1 replica. Cross-replica convergence is GPU-multi-process and not yet attempted. ADR-003 documents the choice. Wrapper is **not yet integrated with `ComposerReplicationTrainer`** — it's an independent context manager. | | |
| | **V3** | ✅ Closed (research + recipes) | See § "V3 substrate coverage" below | Each substrate has a research deep-dive + an integration recipe. TRL has working code; VeRL has a config + adv-estimator skeleton; Monarch/TorchForge/OpenEnv are documented as reference patterns per the brief's "research" framing. | | |
| | **V4** | ✅ Closed (installable) | `pip install -e .` ships `composer_replication` package | `pyproject.toml` at repo root; `examples/qwen_05b_quickstart/` runs end-to-end. The package re-exports the verified APIs from spike directories (loss, batch, opsd, teacher_replay, ingestion, trainer, diloco). | | |
| | **V5** | ✅ Closed | `composer_replication.ingestion.ClaudeCodeIngester` + Spike 007 e2e test | Real Claude Code session JSONL → `TraceState` → `compose_loss` end-to-end smoke. ADR-002 documents the source choice + Claude Code circularity risk. 18 tests passing (15 unit + 3 e2e-with-loss). | | |
| | **V6** | ✅ Closed | `composer_replication.teacher_replay.replay_trace` + Spike 001 verdict | Multi-teacher OpenRouter replay measured at $0.98/50-step trace, p95 latency 20.5s, 0 errors over 150 calls. Distillation data shape is `DPOPair(state_id, state_messages, chosen, rejected, n_teachers_agreeing)`. | | |
| | **V7** | ✅ Closed | 5 research deep-dives + ADRs + integration architecture + working framework | The "research and see how" question is empirically answered: framework built, primary-source-validated, four production extension paths documented. Process is auditable. | | |
| | **V8** | ⚠️ Partial | Spike 006 (CPU smoke) + Spike 002a-mini (GPU smoke) | Real `Qwen2.5-0.5B-Instruct` loads via `AutoModelForCausalLM`, runs through the 3-channel loss on both CPU (Spike 006) and GPU (Spike 002a-mini, RTX 5090, bf16, 5.3 GB peak VRAM, 480ms/step). The "Composer 2.5-quality results" half of V8 is GPU-budget-gated post-replication work (Spikes 002b/003/004). | | |
| **Tally**: 6/8 closed, 2/8 partial. Both partials (V2 multi-process DiLoCo, V8 quality-of-results) are gated on GPU-multi-process work that is out of scope for the CPU-budget deep-work-loop phase. | |
| --- | |
| ## V3 substrate coverage (detailed) | |
| V3 names six substrates: **monarch, torchforge, openenv, VeRL, TRL** (plus DiLoCo from V2). Each has a deep-dive research doc and an integration recipe. The "framework" target lives at the intersection of all of them. | |
| | Substrate | Research deep-dive | Integration recipe | Working code | Notes | | |
| |---|---|---|---|---| | |
| | **TRL** (huggingface/trl) | `research/04-verl-trl.md` § 3 | `docs/INTEGRATION_ARCHITECTURE.md` Recipe A | ✅ `composer_replication.trainer.ComposerReplicationTrainer` subclasses `GRPOTrainer`. `_compute_loss` override composes 3 channels. | **Production target for v0.1.** DeepWiki-audited extension point: `GRPOTrainer._compute_loss(model, inputs)`. | | |
| | **VeRL** (volcengine/verl) | `research/04-verl-trl.md` § 4 | `docs/INTEGRATION_ARCHITECTURE.md` Recipe B | 🟡 `spikes/005/verl_path/composer_adv.py` (110 LOC) + `composer_config.yaml` (89 LOC). Skeleton, not yet runnable. | **Production target for v0.2 scale (multi-node).** Extension point: `@register_adv_est(name)` decorator + `DataProto.batch`/`non_tensor_batch` for extra fields. | | |
| | **DiLoCo** (meta-pytorch/torchft) | `research/02-diloco-family.md` (full DiLoCo / OpenDiLoCo / Streaming DiLoCo / PRIME-RL / INTELLECT-1+2 audit) | `docs/adrs/ADR-003-diloco-impl.md` | 🟡 `composer_replication.diloco.make_diloco_outer_loop` wraps `torchft.local_sgd.DiLoCo` (BSD-3). Spike 008 has 5 single-process tests including sign-convention pin. | **Multi-replica convergence not yet tested** — single-process post-hook sequencing prevents this in CPU-only smoke. Real `torch.distributed` test deferred to GPU phase. | | |
| | **OpenEnv** | `research/03-monarch-torchforge-openenv.md` § OpenEnv | `docs/INTEGRATION_ARCHITECTURE.md` Recipe D | 📋 Reference pattern, no code | Per the integration doc: "OpenEnv is a substrate, not a choice — it specifies how environments expose themselves to trainers." TRL accepts `environment_factory=` kwarg; VeRL has equivalent. **Not a code dependency for v0.1**; the framework's data path is OpenEnv-compatible by virtue of using TRL's API. | | |
| | **Monarch** (Meta) | `research/03-monarch-torchforge-openenv.md` § Monarch | `docs/INTEGRATION_ARCHITECTURE.md` Recipe C | 📋 Reference pattern | Monarch is Meta's actor mesh — a coordination layer for distributed workers, not an algorithm. Per the research doc: "Monarch is alive, TorchForge is paused" (as of 2026-Q2). The framework's outer-loop sync via DiLoCo is an alternative coordination model that doesn't need Monarch. | | |
| | **TorchForge** (Meta, paused) | `research/03-monarch-torchforge-openenv.md` § TorchForge | n/a (paused upstream) | 📋 Reference only | TorchForge as a project was paused by Meta. Research doc captures the design lessons; no code dependency. | | |
| **Honest read**: TRL + VeRL + DiLoCo are the three substrates the framework actually integrates with. Monarch/TorchForge/OpenEnv are documented as informed-design context, which is what the brief asked for ("can we research all of this and see how we could try to set this up"). | |
| --- | |
| ## Status definitions | |
| - ✅ **Closed**: a runnable artifact exists, has tests, and is documented. | |
| - ⚠️ **Partial**: closed in the literal sense but with documented spirit-gaps; concrete next-step is identified. | |
| - ❌ **Open**: documented but no runnable artifact. | |
| - 📋 **Reference**: research-only by design (e.g. paused upstream projects, substrates that the brief asked for as research not code). | |
| --- | |
| ## What "Composer 2.5 quality" specifically requires (V8 honest) | |
| To close V8 in spirit, not just letter, the framework needs: | |
| 1. ✅ **The architecture** — done. Three-channel loss with TRL/VeRL recipes; SDPO via OPSD; trace-replay via OpenRouter. | |
| 2. ✅ **Real model + real GPU** — done. Spike 002a-mini on 5090 sm_120, bf16, 50 steps. | |
| 3. ❌ **Real teacher rollouts at scale** — Spike 002b: collect ~1000 traces × 3 teachers = ~$1000 OpenRouter spend. GPU-budget gated. | |
| 4. ❌ **A/B against plain GRPO on SWE-bench-lite** — Spike 004. ~$100-200 GPU + judge calls. | |
| 5. ❌ **Decisive empirical result** — only achievable after (3) and (4). | |
| This is the post-replication phase. The CPU-only deep-work-loop phase (Waves 7-12) closes the **architecture + installability + verification** legs. The empirical leg requires money + time + a 7B+ model and is intentionally out of scope for the methodology phase. | |
| --- | |
| ## How to verify each ✅ yourself | |
| | Clause | Verification command | | |
| |---|---| | |
| | V1 | `cat research/01-composer-2.5.md docs/COMPOSER_RECIPE_MAPPING.md` | | |
| | V2 | `cd spikes/008-streaming-diloco && python -m pytest tests/ -q` (5/5 pass) | | |
| | V3 | `cat docs/INTEGRATION_ARCHITECTURE.md docs/V3_SUBSTRATE_COVERAGE.md` | | |
| | V4 | `pip install -e . && python examples/qwen_05b_quickstart/run.py` | | |
| | V5 | `cd spikes/007-real-trace-ingestion && python -m pytest tests/ -q` | | |
| | V6 | `cat spikes/001-teacher-replay-cost/verdict.md` | | |
| | V7 | `ls research/ docs/adrs/ docs/research/ docs/INTEGRATION_ARCHITECTURE.md` | | |
| | V8 | `cd spikes/002a-mini-gpu-smoke && python run_gpu_smoke.py` (requires GPU) | | |
| --- | |
| ## References | |
| - `docs/VISION_VALIDATION.md` — original 10-point scorecard + post-Wave-11 honest re-scoring | |
| - `docs/research/WAVE_7_10_FINAL_REVIEW.md` — cross-model adversarial review of Wave 7-10 (10 priority items, 2 BLOCKERs both addressed) | |
| - `docs/adrs/ADR-001..007` — seven architectural decisions (GPU venue, trace source, DiLoCo impl, replaysim normalization, serverless DiLoCo, RL frameworks, distillation losses) | |
| - `BACKLOG.md` — pre-execution acceptance criteria for Spikes 006/007/008 + Wave 10 | |
| --- | |
| ## Wave 13 expansion (2026-05-26) | |
| The user expanded the brief mid-loop: | |
| > *"keep going. make sure that we do the paths of the Composer 2.5 methods, the n-teachers replaysim, and Decoupled DiLoCo (so that we can leverage modal or huggingface-jobs or other serverless training systems). … For V5 see if we can leverage [a normalization library] to normalize the data while also making the replaysim dataset generation. … if we can properly document and research the self-distillation papers like SDPO OPDS and/or others. … see if there are other frameworks that are more popular that we could try to use. meta's pytorch agentic stack components are something that I'd like to explore."* | |
| | Wave 13 ask | Artifact | Status | | |
| |---|---|---| | |
| | Decoupled DiLoCo over serverless | ADR-005 + `composer_replication.diloco.serverless` (Protocol + ObjectStoreAllReduce + LocalProcessExecutor + Modal/HFJobs skeletons) + 9 multi-process tests | ✅ Closed (local) / 🟡 Skeleton (cloud) | | |
| | Replaysim normalization | ADR-004 + `composer_replication.replaysim` package + `data-juicer` adapter + default YAML recipe + 9 unit tests | ✅ Closed (passthrough) / 🟡 Pending data-juicer install for full path | | |
| | Other RL frameworks (V3 expansion) | ADR-006 + `composer_replication.recipes.prime_rl` (recipe + composer_loss adapter + config.yaml) | ✅ Closed (recipe) / 🟡 Skeleton (runtime) | | |
| | Meta's PyTorch agentic stack | ADR-006 + `composer_replication.recipes.monarch` (actor layout doc + skeleton actors) | ✅ Closed (design) / 🟡 Skeleton (impl) | | |
| | Deeper self-distillation research | ADR-007 + `docs/research/SELF_DISTILLATION_LANDSCAPE.md` + `composer_replication.distillation` module (SimPO + TAID-rewritten + Entropy-Aware OPD) + tests | ✅ Closed end-to-end — `compose_loss` kwargs wired in Wave 14; TAID rewritten in Wave 15 to match SakanaAI/TAID upstream (logit-space mix, current-student-detached anchor, forward-KL criterion, optional `TAIDScheduler`); OPSD parity test added against `siyan-zhao/OPSD` upstream. | | |
| | altered-minds tie-in | `docs/ALTERED_MINDS_TIE_IN.md` (5-phase plan, $300 estimate, open questions) | ✅ Closed (design) | | |
| **Wave 13 test addition**: 35 new tests passing (17 distillation + 9 serverless multi-process + 9 replaysim). | |
| The framework now covers the full expanded brief. **Total tests passing | |
| post-Wave-15: 115 + 1 skip-marked.** Wave-by-wave evolution: 72 (W12) → 93 (W13) → 124 (W14) → 130 (W14b) → 115 (W15: TAID rewrite consolidated 16 schedule-tests into 7 t-parameterized tests; OPSD upstream-parity test added skip-marked). | |
| This is the canonical running test count; other docs reference V1_V8_COVERAGE rather than restating. | |