docs(wave3): add OVERVIEW.md, index ADR-014, fold in adversarial-review fixes
New-docs + freshness wave, plus corrections from an adversarial review pass
on the wave1 commit. Docs-only; no source files touched.
New:
- docs/OVERVIEW.md — a 5-minute, honestly-scoped newcomer tour: what the
framework is, the three channels WITH provenance (1 Dr.GRPO + 2 SDPO =
genuine Composer replication; 3 trace-replay-DPO = the framework's own
additive channel, NOT Cursor's recipe), what's proven (CPU SDPO-fires, the
A1 8B Modal run, GSM8K GRPO, $0.98/trace), and what's gapped (Docker e2e,
A2-A4 ladder). Linked from README + both _archive READMEs.
ADR index:
- Add the missing ADR-014 row + a provenance note recording the channel-3
correction it carries.
Adversarial-review corrections (to the wave1 edits):
- Drop the parent-commit SHA mislabelled as "HEAD" in the VISION_VALIDATION
status banner and the ALTERED_MINDS runnability note; keep the date + ADR ref.
- Re-attribute the A2-A4 gap claim: cite ADR-014 only for what it states
("the A1 run used dr_grpo"; objective= threading is an open follow-up) and
ADR-013 for its sole-remaining user-gated box — instead of citing ADR-014's
acceptance gate, which does not contain the dataset-construction details.
- Fix a missed stale path examples/qwen3_05b_quickstart -> qwen_05b_quickstart
in VISION_VALIDATION.
- Add the master-branch guard pointer (Fact F) to the README Install block.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- README.md +6 -1
- docs/ALTERED_MINDS_TIE_IN.md +10 -7
- docs/OVERVIEW.md +96 -0
- docs/VISION_VALIDATION.md +10 -8
- docs/adrs/README.md +7 -0
|
@@ -28,11 +28,16 @@ pretty_name: "Composer 2.5 Replication Framework — Research Synthesis"
|
|
| 28 |
> **Author:** [Codeseys](https://huggingface.co/Codeseys)
|
| 29 |
> **Goal:** Replicate Cursor's [Composer 2.5](https://cursor.com/blog/composer-2-5) (a post-trained Kimi K2.5 specialised for agentic coding) on **any HuggingFace causal LM with a chat template** (Qwen, Llama, Mistral, DeepSeek, Phi, Gemma families), using a synthesis of decentralized RL post-training techniques. *(LM-CE + DPO channels empirically verified on Qwen2.5-0.5B-Instruct via Spike 006; the SDPO channel is verified via the `compose_loss` integration test suite + the `examples/gsm8k_grpo_with_sdpo/` and `examples/sdpo_with_real_traces/` end-to-end smokes — Spike 006's run zeroed the SDPO channel because the hint-context shape didn't align with the student context, the same alignment discipline `ComposerDataCollator` enforces in production. Encoder-decoder models, base models without chat templates, and VLMs are out of scope for v0.)*
|
| 30 |
|
| 31 |
-
This repository is the **"paper of the project"** — it is the methodology / research / framework specification for an open replication of Cursor's Composer 2.5 system, plus a **novel multi-teacher trace-replay distillation channel** that stacks on top of the Composer recipe.
|
| 32 |
|
| 33 |
## Install
|
| 34 |
|
| 35 |
```bash
|
| 36 |
pip install -e .
|
| 37 |
python examples/qwen_05b_quickstart/run.py
|
| 38 |
```
|
|
|
|
| 28 |
> **Author:** [Codeseys](https://huggingface.co/Codeseys)
|
| 29 |
> **Goal:** Replicate Cursor's [Composer 2.5](https://cursor.com/blog/composer-2-5) (a post-trained Kimi K2.5 specialised for agentic coding) on **any HuggingFace causal LM with a chat template** (Qwen, Llama, Mistral, DeepSeek, Phi, Gemma families), using a synthesis of decentralized RL post-training techniques. *(LM-CE + DPO channels empirically verified on Qwen2.5-0.5B-Instruct via Spike 006; the SDPO channel is verified via the `compose_loss` integration test suite + the `examples/gsm8k_grpo_with_sdpo/` and `examples/sdpo_with_real_traces/` end-to-end smokes — Spike 006's run zeroed the SDPO channel because the hint-context shape didn't align with the student context, the same alignment discipline `ComposerDataCollator` enforces in production. Encoder-decoder models, base models without chat templates, and VLMs are out of scope for v0.)*
|
| 30 |
|
| 31 |
+
This repository is the **"paper of the project"** — it is the methodology / research / framework specification for an open replication of Cursor's Composer 2.5 system, plus a **novel multi-teacher trace-replay distillation channel** that stacks on top of the Composer recipe (this channel is the *framework's own addition* — it is **not** part of Cursor's actual recipe; see [ADR-014](docs/adrs/ADR-014-policy-optimization-objective-menu.md)).
|
| 32 |
+
|
| 33 |
+
> 🧭 **New here?** Read [`docs/OVERVIEW.md`](docs/OVERVIEW.md) for a 5-minute, honestly-scoped tour: what this is, the three channels with their real provenance, what's proven, and what's still gapped.
|
| 34 |
|
| 35 |
## Install
|
| 36 |
|
| 37 |
```bash
|
| 38 |
+
# If cloning fresh from HF: `git checkout master` first — the Hub `main`
|
| 39 |
+
# branch LAGS `master`, and installing from `main` ImportErrors on
|
| 40 |
+
# make_dr_grpo_config. (See docs/HF_REPO_LAYOUT.md + docs/TROUBLESHOOTING.md.)
|
| 41 |
pip install -e .
|
| 42 |
python examples/qwen_05b_quickstart/run.py
|
| 43 |
```
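The branch guard in the comment above can also be checked after installing. A minimal sketch, assuming `make_dr_grpo_config` is re-exported at the top level of the `composer_replication` package (the exact module path is not stated here, so adjust it if the symbol lives in a submodule); `has_symbol` is a hypothetical helper, not part of the framework:

```python
import importlib


def has_symbol(module_name: str, attr: str) -> bool:
    """Return True if `module_name` imports cleanly and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The package is missing, or one of its own imports failed.
        return False
    return hasattr(module, attr)


if __name__ == "__main__":
    # Assumption: top-level re-export; change the module path if needed.
    ok = has_symbol("composer_replication", "make_dr_grpo_config")
    print("install looks current" if ok else "symbol missing: was this installed from `main`?")
```

Running this right after `pip install -e .` gives a quick signal before any training job is launched from a stale checkout.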
|
|
@@ -111,16 +111,19 @@ only — never rationale style, so distorted-but-persuasive reasoning is not
|
|
| 111 |
rewarded), and `dual_kl_logger` (logs KL-to-altered-init **and** KL-to-base each
|
| 112 |
step — the washout-vs-amplification instrument).
|
| 113 |
|
| 114 |
-
Train for ~500 steps per arm on a single GPU. **Runnability today (
|
| 115 |
-
only **A1 (GRPO-only)** has a real Modal runner (
|
| 116 |
for Llama-8B use Modal + the framework's `ServerlessExecutor` per ADR-005 — local 5090
|
| 117 |
is too small). **A2 (SDPO) / A3 (replay-DPO) / A4 (combined) are scaffold + plan-builder
|
| 118 |
-
only**
|
| 119 |
-
replay-DPO preference corpus, and an A100 entrypoint that don't exist yet
|
| 120 |
-
|
| 121 |
-
|
| 122 |
ships the ladder scaffolding + the A1 capability, proven CPU-only on a small model
|
| 123 |
-
(`examples/altered_minds_channel_ladder/`)
|
|
|
|
| 124 |
|
| 125 |
> **strip_thinking × SDPO foot-gun (A2/A4).** When the SDPO arms become runnable on real
|
| 126 |
> agent traces, SDPO REQUIRES `strip_thinking=False`: ~67% of error-recovery turns are
|
|
|
|
| 111 |
rewarded), and `dual_kl_logger` (logs KL-to-altered-init **and** KL-to-base each
|
| 112 |
step — the washout-vs-amplification instrument).
|
| 113 |
|
| 114 |
+
Train for ~500 steps per arm on a single GPU. **Runnability today (2026-06):**
|
| 115 |
+
only **A1 (GRPO-only)** has a real Modal runner — [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md)
|
| 116 |
+
records that "the A1 run used `dr_grpo`" and that wiring the `objective=` menu through the
|
| 117 |
+
rest of the ladder runners is an open follow-up (Qwen-0.5B feasibility-test confirmed;
|
| 118 |
for Llama-8B use Modal + the framework's `ServerlessExecutor` per ADR-005 — local 5090
|
| 119 |
is too small). **A2 (SDPO) / A3 (replay-DPO) / A4 (combined) are scaffold + plan-builder
|
| 120 |
+
only**: running them on a real 8B checkpoint additionally needs a real error-trace SDPO
|
| 121 |
+
dataset, a replay-DPO preference corpus, and an A100 entrypoint that don't exist yet —
|
| 122 |
+
none of those is a closed artifact today. The real 8B/LMA-checkpoint run is *additionally*
|
| 123 |
+
**user-gated** (it spends grant budget). [ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md)
|
| 124 |
ships the ladder scaffolding + the A1 capability, proven CPU-only on a small model
|
| 125 |
+
(`examples/altered_minds_channel_ladder/`); its sole remaining acceptance-gate box is that
|
| 126 |
+
user-gated real-spend go/no-go.
|
| 127 |
|
| 128 |
> **strip_thinking × SDPO foot-gun (A2/A4).** When the SDPO arms become runnable on real
|
| 129 |
> agent traces, SDPO REQUIRES `strip_thinking=False`: ~67% of error-recovery turns are
|
|
@@ -0,0 +1,96 @@
|
|
| 1 |
+
# Overview — Composer 2.5 Replication Framework (5-minute read)
|
| 2 |
+
|
| 3 |
+
*Current through [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md) (2026-06). For
|
| 4 |
+
the front-door pitch see [`README.md`](../README.md); for the honest gap list see
|
| 5 |
+
[`BACKLOG.md`](../BACKLOG.md); for the clause-by-clause vision audit see
|
| 6 |
+
[`docs/VISION_VALIDATION.md`](VISION_VALIDATION.md).*
|
| 7 |
+
|
| 8 |
+
## What it is
|
| 9 |
+
|
| 10 |
+
An **open, methodology-first replication of Cursor's [Composer 2.5](https://cursor.com/blog/composer-2-5)**
|
| 11 |
+
recipe — the post-training pipeline that turned a Kimi-K2.5 MoE base into a strong agentic
|
| 12 |
+
coder — generalized so it runs on **any HuggingFace causal LM with a chat template** (Qwen,
|
| 13 |
+
Llama, Mistral, DeepSeek, Phi, Gemma families). It ships as an installable Python package
|
| 14 |
+
(`pip install -e .` → `composer_replication`) plus a research corpus (ADRs, deep-dives,
|
| 15 |
+
recipes). Encoder-decoder models, base models without chat templates, and VLMs are out of
|
| 16 |
+
scope for v0.
|
| 17 |
+
|
| 18 |
+
This repo is the **methodology repo** ("the paper of the project"). Trained-variant model
|
| 19 |
+
repos and trace datasets are split out per [`docs/HF_REPO_LAYOUT.md`](HF_REPO_LAYOUT.md).
|
| 20 |
+
|
| 21 |
+
## The three channels — with honest provenance
|
| 22 |
+
|
| 23 |
+
The framework composes a single training loss out of three additive channels. **Two replicate
|
| 24 |
+
Cursor's published recipe; the third is the framework's own research addition.** Getting this
|
| 25 |
+
provenance right is the whole point — see [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md).
|
| 26 |
+
|
| 27 |
+
| # | Channel | What it is | Provenance |
|
| 28 |
+
|---|---|---|---|
|
| 29 |
+
| **1** | **Base policy optimization** | RL on verifiable rewards (RLVR). Default **Dr.GRPO**, now a **selectable menu** (`make_po_config(objective=…)` over `{grpo, dr_grpo, bnpo, dapo, gspo, cispo}`, per ADR-014). | ✅ **Genuine replication.** Composer 2's report (arXiv:2603.24477) resolves the base objective as Dr.GRPO. |
|
| 30 |
+
| **2** | **SDPO self-distillation** | Composer's "targeted RL with textual feedback": insert a hint into the context → use that hint-conditioned forward pass as a *self-teacher* → on-policy KL pulls the student toward it at the error turn. Published as SDPO/OPSD (arXiv:2601.20802 / 2601.18734, MIT code). | ✅ **Genuine replication.** This is Composer 2.5's headline trick; Cursor cites the SDPO/OPSD papers in the blog's footnote 1. |
|
| 31 |
+
| **3** | **Trace-replay-DPO** | Replay each step of a frozen agentic trace with N external teachers; turn teacher (dis)agreement into DPO preference pairs. A deliberate β-gated washout probe in the A0→A4 channel ladder ([ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md)). | ⚠️ **The framework's OWN additive research channel — NOT part of Cursor's recipe.** Composer's primary sources contain no DPO, no preference pairs, no reward models, no multiple teachers. Stacks *on top of* the genuine replication; it does not define it. |
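
The channel-1 menu in the table can be illustrated with a small validation sketch. This is not the real `make_po_config` signature, which is documented in the API reference; `resolve_objective` and the two constants are hypothetical names, and only the objective names and the Dr.GRPO default come from the table:

```python
# Objective names as listed in the channel-1 row; the default is Dr.GRPO.
PO_OBJECTIVES = frozenset({"grpo", "dr_grpo", "bnpo", "dapo", "gspo", "cispo"})
DEFAULT_OBJECTIVE = "dr_grpo"


def resolve_objective(objective=None):
    """Hypothetical helper: validate an objective name against the menu."""
    name = DEFAULT_OBJECTIVE if objective is None else objective
    if name not in PO_OBJECTIVES:
        raise ValueError(
            f"unknown objective {name!r}; expected one of {sorted(PO_OBJECTIVES)}"
        )
    return name
```

Failing fast on an unknown name keeps a typo from silently falling back to a different base objective than the one a run claims to use.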
|
| 32 |
+
|
| 33 |
+
> **Read this before citing the framework.** Any statement of the form "Composer does
|
| 34 |
+
> trace-replay-DPO" or "the replication target includes channel 3" is **wrong**. Cursor's
|
| 35 |
+
> recipe = channels 1 + 2. Channel 3 is our addition, and the docs are careful to say so.
|
| 36 |
+
|
| 37 |
+
The full loss (verification-harness form) is `total = lm_ce + α·sdpo_jsd + β·trace_replay_dpo`;
|
| 38 |
+
production uses `ComposerReplicationTrainer._compute_loss` (a real `trl.GRPOTrainer` subclass),
|
| 39 |
+
where channel 1 is real GRPO rather than the LM-CE stub. See
|
| 40 |
+
[`docs/USER_GUIDE.md`](USER_GUIDE.md) and [`docs/COMPOSER_RECIPE_MAPPING.md`](COMPOSER_RECIPE_MAPPING.md).
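
The verification-harness formula above is a plain weighted sum, sketched here with scalar stand-ins. The function name `compose_total` and the numeric values are illustrative assumptions; the real `compose_loss` works on tensors and its signature is not reproduced here:

```python
def compose_total(lm_ce, sdpo_jsd, trace_replay_dpo, alpha, beta):
    """total = lm_ce + alpha * sdpo_jsd + beta * trace_replay_dpo."""
    return lm_ce + alpha * sdpo_jsd + beta * trace_replay_dpo


# Illustrative per-channel values, not measured results.
with_channel_3 = compose_total(1.0, 0.25, 0.125, alpha=0.5, beta=2.0)
# Setting beta to 0 removes channel 3, leaving the channels 1 + 2 form.
without_channel_3 = compose_total(1.0, 0.25, 0.125, alpha=0.5, beta=0.0)
```

Because channel 3 enters only through `beta`, setting it to zero recovers the two-channel loss that corresponds to the genuine replication.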
|
| 41 |
+
|
| 42 |
+
## What's proven
|
| 43 |
+
|
| 44 |
+
- **CPU SDPO-fires.** On real Qwen2.5-0.5B-Instruct, the SDPO channel demonstrably fires
|
| 45 |
+
(`sdpo_jsd > 0`) and SDPO-on vs SDPO-off totals differ — the "is the loss decrease just
|
| 46 |
+
memorization?" critique is closed (Spike 006-strict).
|
| 47 |
+
- **Real GPU run.** Qwen2.5-0.5B in bf16 on a local 5090 (sm_120): 50 steps, loss
|
| 48 |
+
0.7354 → 0.00034, 5.31 GB peak VRAM (Spike 002a-mini).
|
| 49 |
+
- **A1 8B-ladder Modal run.** The GRPO-only arm (A1) of the LMA channel ladder has a real
|
| 50 |
+
Modal runner and has been run with `dr_grpo`.
|
| 51 |
+
- **GSM8K GRPO.** The `examples/gsm8k_grpo*` end-to-end examples exercise the production
|
| 52 |
+
trainer on a real reasoning benchmark.
|
| 53 |
+
- **Economic feasibility of channel 3.** 150 real OpenRouter calls, $0.98/trace mean, 0
|
| 54 |
+
errors (Spike 001).
|
| 55 |
+
- **Installable + tested.** `pip install -e .` works; **115 passing tests + 1 skip-marked**
|
| 56 |
+
(canonical count: [`docs/V1_V8_COVERAGE.md`](V1_V8_COVERAGE.md)).
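
To make the `sdpo_jsd > 0` check in the first bullet concrete, here is a toy sketch of a Jensen-Shannon divergence between a student distribution and a hint-conditioned teacher distribution. It assumes, from the name `sdpo_jsd`, that the channel measures a JSD-style divergence; the real channel operates on token logits at aligned positions, and the distributions below are made up for illustration:

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


student = [0.7, 0.2, 0.1]          # next-token probabilities without the hint
hinted_teacher = [0.2, 0.7, 0.1]   # same model, hint inserted into the context

fires = jsd(student, hinted_teacher) > 0      # the hint changed the distribution
silent = jsd(student, student) == 0.0         # no hint effect, no signal
```

The divergence is zero exactly when the hint leaves the distribution unchanged, which is why a strictly positive value is a useful "the channel fired" indicator.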
|
| 57 |
+
|
| 58 |
+
## What's gapped (honest, NOT closed)
|
| 59 |
+
|
| 60 |
+
1. **Docker / TorchForge substrate E2E** is **hardware-blocked** — the test exists and skips
|
| 61 |
+
cleanly, but there is no local multi-GPU rig to run the orchestrator layer end-to-end.
|
| 62 |
+
2. **The full 8B LMA channel ladder (A2–A4) is not yet runnable.** Only **A1 (GRPO-only)**
|
| 63 |
+
has a real Modal runner. **A2 (SDPO) / A3 (replay-DPO) / A4 (combined)** are scaffold +
|
| 64 |
+
plan-builder only — running them on a real 8B checkpoint additionally needs a real
|
| 65 |
+
error-trace SDPO dataset, a replay-DPO preference corpus, and an A100 entrypoint that
|
| 66 |
+
don't exist yet. The real 8B run is *additionally* user-budget-gated.
|
| 67 |
+
3. **The empirical question** — does the method actually beat plain GRPO at scale? — is the
|
| 68 |
+
GPU-budget-gated v0.1 work (Spikes 002b/003/004) and remains open by design.
|
| 69 |
+
|
| 70 |
+
See [`BACKLOG.md`](../BACKLOG.md) for the live gap list and [`docs/TROUBLESHOOTING.md`](TROUBLESHOOTING.md)
|
| 71 |
+
for known foot-guns.
|
| 72 |
+
|
| 73 |
+
## Foot-guns worth knowing on day one
|
| 74 |
+
|
| 75 |
+
- **HF `main` lags `master`.** After cloning from the Hub, `git checkout master` (or pin a
|
| 76 |
+
master SHA) before `pip install -e .`, or you `ImportError` on `make_dr_grpo_config`. Same
|
| 77 |
+
for any Modal / HF-Jobs clone-and-install step.
|
| 78 |
+
- **`strip_thinking` × SDPO.** On real agent traces, SDPO requires `strip_thinking=False`:
|
| 79 |
+
~67% of error-recovery turns are pure thinking, so stripping them yields empty SDPO masks.
|
| 80 |
+
- **KL estimator delta.** TRL uses the **k3** estimator; Composer's report describes **k1**.
|
| 81 |
+
This is a documented, intentional delta — the framework does not silently claim k1 parity.
|
| 82 |
+
- **`compose_loss` is the verification harness, not production.** Its channel-1 is an LM-CE
|
| 83 |
+
stub, not real GRPO. Production training is `ComposerReplicationTrainer`.
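
The k1 versus k3 delta in the list above can be seen on a single sampled token. A minimal sketch using the estimator names from John Schulman's note on approximating KL divergence, with the token sampled from the current policy; the log-probabilities are made-up values and the function names are illustrative:

```python
import math


def k1(logp, ref_logp):
    """k1: the log-ratio log(pi / pi_ref); unbiased, but can be negative per sample."""
    return logp - ref_logp


def k3(logp, ref_logp):
    """k3: (r - 1) - log r with r = pi_ref / pi; non-negative for every sample."""
    log_r = ref_logp - logp
    return math.exp(log_r) - log_r - 1.0


# A token the current policy finds less likely than the reference does.
logp, ref_logp = -1.5, -1.0
per_token_k1 = k1(logp, ref_logp)   # negative for this sample
per_token_k3 = k3(logp, ref_logp)   # positive for this sample
```

Both estimators target the same KL in expectation, but their per-token values and variance differ, so logged KL curves from a k3-based trainer are not directly comparable to numbers reported under k1.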
|
| 84 |
+
|
| 85 |
+
## Where to go next
|
| 86 |
+
|
| 87 |
+
| You want to… | Read |
|
| 88 |
+
|---|---|
|
| 89 |
+
| Pitch / status / roadmap | [`README.md`](../README.md) |
|
| 90 |
+
| Run it end-to-end | [`docs/USER_GUIDE.md`](USER_GUIDE.md) |
|
| 91 |
+
| Wire the loss into TRL / VeRL / PRIME-RL / DiLoCo / Monarch | [`docs/INTEGRATION_RECIPES.md`](INTEGRATION_RECIPES.md) |
|
| 92 |
+
| Exact kwargs / signatures | [`docs/API_REFERENCE.md`](API_REFERENCE.md) |
|
| 93 |
+
| Why each design decision | [`docs/adrs/README.md`](adrs/README.md) |
|
| 94 |
+
| How Cursor's recipe maps to our components | [`docs/COMPOSER_RECIPE_MAPPING.md`](COMPOSER_RECIPE_MAPPING.md) |
|
| 95 |
+
| Honest gaps / open work | [`BACKLOG.md`](../BACKLOG.md), [`docs/VISION_VALIDATION.md`](VISION_VALIDATION.md) |
|
| 96 |
+
| Fix a broken install / run | [`docs/TROUBLESHOOTING.md`](TROUBLESHOOTING.md) |
|
|
@@ -1,6 +1,6 @@
|
|
| 1 |
# Vision Validation: Does the Framework Encapsulate the Original Brief?
|
| 2 |
|
| 3 |
-
> **## Status as of 2026-06
|
| 4 |
> The framework is past-skeleton: 8 subpackages (`composer_replication/*`), 115 passing
|
| 5 |
> tests + 1 skip-marked (see [`docs/V1_V8_COVERAGE.md`](V1_V8_COVERAGE.md) for the
|
| 6 |
> canonical count), and operational end-to-end examples (`gsm8k_grpo`,
|
|
@@ -15,12 +15,14 @@
|
|
| 15 |
> 1. Docker/TorchForge substrate E2E is hardware-blocked — the test exists and skips
|
| 16 |
> cleanly, but lacking a local multi-GPU rig the orchestrator layer is unrun.
|
| 17 |
> 2. The 8B LMA channel-ladder is **not fully runnable today**: only **A1 (GRPO-only)**
|
| 18 |
-
> has a real Modal runner
|
| 19 |
-
>
|
| 20 |
-
>
|
| 21 |
-
>
|
| 22 |
-
>
|
| 23 |
-
>
|
| 24 |
|
| 25 |
> **Status:** Self-audit, 2026-05-25 (Wave 6).
|
| 26 |
> **Question:** Does what we've built reflect what was originally asked for, or did we drift?
|
|
@@ -274,7 +276,7 @@ In recommended-do-next order:
|
|
| 274 |
1. **Spike 006 (real-HF-model smoke)** — half a day, CPU-only. Closes V8's biggest credibility gap and surfaces any tokenizer / chat-template / vocab-size issues hiding in the skeleton. *Highest value per hour.*
|
| 275 |
2. **Spike 007 (real-trace ingestion)** — 1 day, CPU + ~$5 OpenRouter. Closes V5's real-vs-synthetic gap. Picks one source (Claude Code session JSONL? Cline transcripts?) and writes the adapter.
|
| 276 |
3. **Soften over-claims in README and paper** — half-hour. README: "TRL coded, VeRL design-only." Paper: "any causal LM with a chat template." `verl_path/README.md`: add `STATUS: design-only — validate via spike 002b before production use`.
|
| 277 |
-
4. **Wave 7 (packaging)** — half a day, after 006 + 007 land. `pyproject.toml` + `examples/
|
| 278 |
5. **Spike 008 (Streaming DiLoCo smoke)** — 2 days, CPU-only. Closes V2's biggest deviation from the brief. Lowest priority because the deferral is technically defensible, but worth doing for completeness.
|
| 279 |
|
| 280 |
Items 3 + 4 are documentation/packaging chores. Items 1 + 2 + 5 are real engineering. None require GPU budget. Total: ~5 days of sequential effort to take the framework from 5/10 to 9/10 vision encapsulation.
|
|
|
|
| 1 |
# Vision Validation: Does the Framework Encapsulate the Original Brief?
|
| 2 |
|
| 3 |
+
> **## Status as of 2026-06 (current through ADR-014)**
|
| 4 |
> The framework is past-skeleton: 8 subpackages (`composer_replication/*`), 115 passing
|
| 5 |
> tests + 1 skip-marked (see [`docs/V1_V8_COVERAGE.md`](V1_V8_COVERAGE.md) for the
|
| 6 |
> canonical count), and operational end-to-end examples (`gsm8k_grpo`,
|
|
|
|
| 15 |
> 1. Docker/TorchForge substrate E2E is hardware-blocked — the test exists and skips
|
| 16 |
> cleanly, but lacking a local multi-GPU rig the orchestrator layer is unrun.
|
| 17 |
> 2. The 8B LMA channel-ladder is **not fully runnable today**: only **A1 (GRPO-only)**
|
| 18 |
+
> has a real Modal runner — [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md)
|
| 19 |
+
> records that "the A1 run used `dr_grpo`" and that threading the `objective=` menu
|
| 20 |
+
> through the rest of the ladder runners is an open follow-up. **A2 (SDPO) / A3
|
| 21 |
+
> (replay-DPO) / A4 (combined)** are scaffold + plan-builder only; running them on a
|
| 22 |
+
> real 8B checkpoint additionally needs a real error-trace SDPO dataset, a replay-DPO
|
| 23 |
+
> preference corpus, and an A100 entrypoint that don't exist yet. The real 8B run is
|
| 24 |
+
> *additionally* user-budget-gated (the sole remaining acceptance-gate box in
|
| 25 |
+
> [ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md)).
|
| 26 |
|
| 27 |
> **Status:** Self-audit, 2026-05-25 (Wave 6).
|
| 28 |
> **Question:** Does what we've built reflect what was originally asked for, or did we drift?
|
|
|
|
| 276 |
1. **Spike 006 (real-HF-model smoke)** — half a day, CPU-only. Closes V8's biggest credibility gap and surfaces any tokenizer / chat-template / vocab-size issues hiding in the skeleton. *Highest value per hour.*
|
| 277 |
2. **Spike 007 (real-trace ingestion)** — 1 day, CPU + ~$5 OpenRouter. Closes V5's real-vs-synthetic gap. Picks one source (Claude Code session JSONL? Cline transcripts?) and writes the adapter.
|
| 278 |
3. **Soften over-claims in README and paper** — half-hour. README: "TRL coded, VeRL design-only." Paper: "any causal LM with a chat template." `verl_path/README.md`: add `STATUS: design-only — validate via spike 002b before production use`.
|
| 279 |
+
4. **Wave 7 (packaging)** — half a day, after 006 + 007 land. `pyproject.toml` + `examples/qwen_05b_quickstart/` directory + entry-point exposure.
|
| 280 |
5. **Spike 008 (Streaming DiLoCo smoke)** — 2 days, CPU-only. Closes V2's biggest deviation from the brief. Lowest priority because the deferral is technically defensible, but worth doing for completeness.
|
| 281 |
|
| 282 |
Items 3 + 4 are documentation/packaging chores. Items 1 + 2 + 5 are real engineering. None require GPU budget. Total: ~5 days of sequential effort to take the framework from 5/10 to 9/10 vision encapsulation.
|
|
@@ -15,5 +15,12 @@
|
|
| 15 |
| [ADR-011](ADR-011-sdpo-alignment-indices.md) | Collator-emitted SDPO alignment indices (close strict-guard regression) | accepted (amends ADR-008) | 2026-05-29 |
|
| 16 |
| [ADR-012](ADR-012-close-review-findings.md) | Close open cross-family-review findings (KL/hint-routing/provenance/curriculum) | accepted (amends 008/009/010) | 2026-05-29 |
|
| 17 |
| [ADR-013](ADR-013-lma-integration-channel-ladder.md) | LMA integration — isolated-channel ladder (supersedes tie-in Phase-3 hyperparams) | accepted | 2026-05-29 |
|
|
|
|
| 18 |
|
| 19 |
Sorted by number ascending. ADRs are immutable after `accepted`; supersede or amend rather than edit.
|
| 15 |
| [ADR-011](ADR-011-sdpo-alignment-indices.md) | Collator-emitted SDPO alignment indices (close strict-guard regression) | accepted (amends ADR-008) | 2026-05-29 |
|
| 16 |
| [ADR-012](ADR-012-close-review-findings.md) | Close open cross-family-review findings (KL/hint-routing/provenance/curriculum) | accepted (amends 008/009/010) | 2026-05-29 |
|
| 17 |
| [ADR-013](ADR-013-lma-integration-channel-ladder.md) | LMA integration — isolated-channel ladder (supersedes tie-in Phase-3 hyperparams) | accepted | 2026-05-29 |
|
| 18 |
+
| [ADR-014](ADR-014-policy-optimization-objective-menu.md) | Policy-optimization objective MENU: base RL objective selectable (default Dr.GRPO) over TRL 1.5.0 GRPOConfig (builds-on ADR-006/007/008) | accepted | 2026-05-30 |
|
| 19 |
|
| 20 |
Sorted by number ascending. ADRs are immutable after `accepted`; supersede or amend rather than edit.
|
| 21 |
+
|
| 22 |
+
> **Provenance note (ADR-014).** ADR-014 also records the canonical correction that the
|
| 23 |
+
> framework's **trace-replay-DPO channel (channel 3) is an additive research channel, NOT
|
| 24 |
+
> part of Cursor's Composer recipe** -- Composer's primary sources contain no DPO / preference
|
| 25 |
+
> pairs / reward models / multiple teachers. Genuine replication is channels 1 (Dr.GRPO base)
|
| 26 |
+
> + 2 (SDPO). See [`docs/OVERVIEW.md`](../OVERVIEW.md) for the honest three-channel summary.
|