docs(wave3): add OVERVIEW.md, index ADR-014, fold in adversarial-review fixes

New-docs + freshness wave, plus corrections from an adversarial review pass
on the wave1 commit. Docs-only; no source files touched.

New:
- docs/OVERVIEW.md — a 5-minute, honestly-scoped newcomer tour: what the
framework is, the three channels WITH provenance (1 Dr.GRPO + 2 SDPO =
genuine Composer replication; 3 trace-replay-DPO = the framework's own
additive channel, NOT Cursor's recipe), what's proven (CPU SDPO-fires, the
A1 8B Modal run, GSM8K GRPO, $0.98/trace), and what's gapped (Docker e2e,
A2-A4 ladder). Linked from README + both _archive READMEs.

ADR index:
- Add the missing ADR-014 row + a provenance note recording the channel-3
correction it carries.

Adversarial-review corrections (to the wave1 edits):
- Drop the parent-commit SHA mislabelled as "HEAD" in the VISION_VALIDATION
status banner and the ALTERED_MINDS runnability note; keep the date + ADR ref.
- Re-attribute the A2-A4 gap claim: cite ADR-014 only for what it states
("the A1 run used dr_grpo"; objective= threading is an open follow-up) and
ADR-013 for its sole-remaining user-gated box — instead of citing ADR-014's
acceptance gate, which does not contain the dataset-construction details.
- Fix a missed stale path examples/qwen3_05b_quickstart -> qwen_05b_quickstart
in VISION_VALIDATION.
- Add the master-branch guard pointer (Fact F) to the README Install block.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Files changed (5)

README.md +6 -1
docs/ALTERED_MINDS_TIE_IN.md +10 -7
docs/OVERVIEW.md +96 -0
docs/VISION_VALIDATION.md +10 -8
docs/adrs/README.md +7 -0

README.md CHANGED

@@ -28,11 +28,16 @@ pretty_name: "Composer 2.5 Replication Framework — Research Synthesis"
 > **Author:** [Codeseys](https://huggingface.co/Codeseys)
 > **Goal:** Replicate Cursor's [Composer 2.5](https://cursor.com/blog/composer-2-5) (a post-trained Kimi K2.5 specialised for agentic coding) on **any HuggingFace causal LM with a chat template** (Qwen, Llama, Mistral, DeepSeek, Phi, Gemma families), using a synthesis of decentralized RL post-training techniques. *(LM-CE + DPO channels empirically verified on Qwen2.5-0.5B-Instruct via Spike 006; the SDPO channel is verified via the `compose_loss` integration test suite + the `examples/gsm8k_grpo_with_sdpo/` and `examples/sdpo_with_real_traces/` end-to-end smokes — Spike 006's run zeroed the SDPO channel because the hint-context shape didn't align with the student context, the same alignment discipline `ComposerDataCollator` enforces in production. Encoder-decoder models, base models without chat templates, and VLMs are out of scope for v0.)*
-This repository is the **"paper of the project"** — it is the methodology / research / framework specification for an open replication of Cursor's Composer 2.5 system, plus a **novel multi-teacher trace-replay distillation channel** that stacks on top of the Composer recipe.
+This repository is the **"paper of the project"** — it is the methodology / research / framework specification for an open replication of Cursor's Composer 2.5 system, plus a **novel multi-teacher trace-replay distillation channel** that stacks on top of the Composer recipe (this channel is the *framework's own addition* — it is **not** part of Cursor's actual recipe; see [ADR-014](docs/adrs/ADR-014-policy-optimization-objective-menu.md)).
+> 🧭 **New here?** Read [`docs/OVERVIEW.md`](docs/OVERVIEW.md) for a 5-minute, honestly-scoped tour: what this is, the three channels with their real provenance, what's proven, and what's still gapped.
 ## Install
 ```bash
+# If cloning fresh from HF: `git checkout master` first — the Hub `main`
+# branch LAGS `master`, and installing from `main` ImportErrors on
+# make_dr_grpo_config. (See docs/HF_REPO_LAYOUT.md + docs/TROUBLESHOOTING.md.)
 pip install -e .
 python examples/qwen_05b_quickstart/run.py
 ```
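
The branch guard added to the Install block can also be checked programmatically after installing. The following is a minimal preflight sketch, not part of the repository: `symbol_available` is a hypothetical helper, and it assumes `make_dr_grpo_config` is importable from the top-level `composer_replication` package, which the diff does not confirm.

```python
import importlib


def symbol_available(module_name: str, symbol: str) -> bool:
    """Return True if `module_name` imports cleanly and exposes `symbol`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers ModuleNotFoundError too (it subclasses ImportError).
        return False
    return hasattr(module, symbol)


if __name__ == "__main__":
    # Assumption: the helper is exported from the package root; adjust the
    # module path if it lives in a submodule.
    if symbol_available("composer_replication", "make_dr_grpo_config"):
        print("install looks current")
    else:
        print("make_dr_grpo_config not found: check that you installed from master")
```

Running this right after `pip install -e .` would surface a stale `main` checkout before a longer example run fails on the same import.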

docs/ALTERED_MINDS_TIE_IN.md CHANGED

@@ -111,16 +111,19 @@ only — never rationale style, so distorted-but-persuasive reasoning is not
 rewarded), and `dual_kl_logger` (logs KL-to-altered-init **and** KL-to-base each
 step — the washout-vs-amplification instrument).
-Train for ~500 steps per arm on a single GPU. **Runnability today (HEAD `aae66fa`):**
-only **A1 (GRPO-only)** has a real Modal runner (Qwen-0.5B feasibility-test confirmed;
 for Llama-8B use Modal + the framework's `ServerlessExecutor` per ADR-005 — local 5090
 is too small). **A2 (SDPO) / A3 (replay-DPO) / A4 (combined) are scaffold + plan-builder
-only**, blocked on dataset construction — they need a real error-trace SDPO dataset, a
-replay-DPO preference corpus, and an A100 entrypoint that don't exist yet (see
-[ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md) acceptance gate). The real
-8B/LMA-checkpoint run is *additionally* **user-gated** (it spends grant budget). ADR-013
 ships the ladder scaffolding + the A1 capability, proven CPU-only on a small model
-(`examples/altered_minds_channel_ladder/`).
 > **strip_thinking × SDPO foot-gun (A2/A4).** When the SDPO arms become runnable on real
 > agent traces, SDPO REQUIRES `strip_thinking=False`: ~67% of error-recovery turns are

 rewarded), and `dual_kl_logger` (logs KL-to-altered-init **and** KL-to-base each
 step — the washout-vs-amplification instrument).
+Train for ~500 steps per arm on a single GPU. **Runnability today (2026-06):**
+only **A1 (GRPO-only)** has a real Modal runner — [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md)
+records that "the A1 run used `dr_grpo`" and that wiring the `objective=` menu through the
+rest of the ladder runners is an open follow-up (Qwen-0.5B feasibility-test confirmed;
 for Llama-8B use Modal + the framework's `ServerlessExecutor` per ADR-005 — local 5090
 is too small). **A2 (SDPO) / A3 (replay-DPO) / A4 (combined) are scaffold + plan-builder
+only**: running them on a real 8B checkpoint additionally needs a real error-trace SDPO
+dataset, a replay-DPO preference corpus, and an A100 entrypoint that don't exist yet —
+none of those is a closed artifact today. The real 8B/LMA-checkpoint run is *additionally*
+**user-gated** (it spends grant budget). [ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md)
 ships the ladder scaffolding + the A1 capability, proven CPU-only on a small model
+(`examples/altered_minds_channel_ladder/`); its sole remaining acceptance-gate box is that
+user-gated real-spend go/no-go.
 > **strip_thinking × SDPO foot-gun (A2/A4).** When the SDPO arms become runnable on real
 > agent traces, SDPO REQUIRES `strip_thinking=False`: ~67% of error-recovery turns are

docs/OVERVIEW.md ADDED

@@ -0,0 +1,96 @@
+# Overview — Composer 2.5 Replication Framework (5-minute read)
+*Current through [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md) (2026-06). For
+the front-door pitch see [`README.md`](../README.md); for the honest gap list see
+[`BACKLOG.md`](../BACKLOG.md); for the clause-by-clause vision audit see
+[`docs/VISION_VALIDATION.md`](VISION_VALIDATION.md).*
+## What it is
+An **open, methodology-first replication of Cursor's [Composer 2.5](https://cursor.com/blog/composer-2-5)**
+recipe — the post-training pipeline that turned a Kimi-K2.5 MoE base into a strong agentic
+coder — generalized so it runs on **any HuggingFace causal LM with a chat template** (Qwen,
+Llama, Mistral, DeepSeek, Phi, Gemma families). It ships as an installable Python package
+(`pip install -e .` → `composer_replication`) plus a research corpus (ADRs, deep-dives,
+recipes). Encoder-decoder models, base models without chat templates, and VLMs are out of
+scope for v0.
+This repo is the **methodology repo** ("the paper of the project"). Trained-variant model
+repos and trace datasets are split out per [`docs/HF_REPO_LAYOUT.md`](HF_REPO_LAYOUT.md).
+## The three channels — with honest provenance
+The framework composes a single training loss out of three additive channels. **Two replicate
+Cursor's published recipe; the third is the framework's own research addition.** Getting this
+provenance right is the whole point — see [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md).
+| # | Channel | What it is | Provenance |
+|---|---|---|---|
+| **1** | **Base policy optimization** | RL on verifiable rewards (RLVR). Default **Dr.GRPO**, now a **selectable menu** (`make_po_config(objective=…)` over `{grpo, dr_grpo, bnpo, dapo, gspo, cispo}`, per ADR-014). | ✅ **Genuine replication.** Composer 2's report (arXiv:2603.24477) resolves the base objective as Dr.GRPO. |
+| **2** | **SDPO self-distillation** | Composer's "targeted RL with textual feedback": insert a hint into the context → use that hint-conditioned forward pass as a *self-teacher* → on-policy KL pulls the student toward it at the error turn. Published as SDPO/OPSD (arXiv:2601.20802 / 2601.18734, MIT code). | ✅ **Genuine replication.** This is Composer 2.5's headline trick; Cursor cites the SDPO/OPSD papers in the blog's footnote 1. |
+| **3** | **Trace-replay-DPO** | Replay each step of a frozen agentic trace with N external teachers; turn teacher (dis)agreement into DPO preference pairs. A deliberate β-gated washout probe in the A0→A4 channel ladder ([ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md)). | ⚠️ **The framework's OWN additive research channel — NOT part of Cursor's recipe.** Composer's primary sources contain no DPO, no preference pairs, no reward models, no multiple teachers. Stacks *on top of* the genuine replication; it does not define it. |
+> **Read this before citing the framework.** Any statement of the form "Composer does
+> trace-replay-DPO" or "the replication target includes channel 3" is **wrong**. Cursor's
+> recipe = channels 1 + 2. Channel 3 is our addition, and the docs are careful to say so.
+The full loss (verification-harness form) is `total = lm_ce + α·sdpo_jsd + β·trace_replay_dpo`;
+production uses `ComposerReplicationTrainer._compute_loss` (a real `trl.GRPOTrainer` subclass),
+where channel 1 is real GRPO rather than the LM-CE stub. See
+[`docs/USER_GUIDE.md`](USER_GUIDE.md) and [`docs/COMPOSER_RECIPE_MAPPING.md`](COMPOSER_RECIPE_MAPPING.md).
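
The verification-harness formula above can be illustrated with a small numeric sketch. This is not the framework's `compose_loss`: `jsd` and `total_loss_sketch` are hypothetical names, the Jensen-Shannon divergence is the textbook form over a single discrete distribution (the real `sdpo_jsd` term may be computed per token over logits), and `alpha` and `beta` are the channel weights from the formula.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to KL(a || b).
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def total_loss_sketch(lm_ce, sdpo_jsd, trace_replay_dpo, alpha, beta):
    """total = lm_ce + alpha * sdpo_jsd + beta * trace_replay_dpo."""
    return lm_ce + alpha * sdpo_jsd + beta * trace_replay_dpo


if __name__ == "__main__":
    student = [0.7, 0.2, 0.1]  # illustrative next-token distribution
    teacher = [0.5, 0.3, 0.2]  # illustrative hint-conditioned self-teacher
    sdpo_term = jsd(student, teacher)
    # beta = 0 switches channel 3 off, leaving only channels 1 and 2.
    print(total_loss_sketch(lm_ce=1.2, sdpo_jsd=sdpo_term,
                            trace_replay_dpo=0.4, alpha=0.5, beta=0.0))
```

Setting `alpha` or `beta` to zero removes the corresponding channel from the total, which is how an SDPO-on versus SDPO-off comparison can be expressed in this form.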
+## What's proven
+- **CPU SDPO-fires.** On real Qwen2.5-0.5B-Instruct, the SDPO channel demonstrably fires
+  (`sdpo_jsd > 0`) and SDPO-on vs SDPO-off totals differ — the "is the loss decrease just
+  memorization?" critique is closed (Spike 006-strict).
+- **Real GPU run.** Qwen2.5-0.5B in bf16 on a local 5090 (sm_120): 50 steps, loss
+  0.7354 → 0.00034, 5.31 GB peak VRAM (Spike 002a-mini).
+- **A1 8B-ladder Modal run.** The GRPO-only arm (A1) of the LMA channel ladder has a real
+  Modal runner and has been run with `dr_grpo`.
+- **GSM8K GRPO.** The `examples/gsm8k_grpo*` end-to-end examples exercise the production
+  trainer on a real reasoning benchmark.
+- **Economic feasibility of channel 3.** 150 real OpenRouter calls, $0.98/trace mean, 0
+  errors (Spike 001).
+- **Installable + tested.** `pip install -e .` works; **115 passing tests + 1 skip-marked**
+  (canonical count: [`docs/V1_V8_COVERAGE.md`](V1_V8_COVERAGE.md)).
+## What's gapped (honest, NOT closed)
+1. **Docker / TorchForge substrate E2E** is **hardware-blocked** — the test exists and skips
+   cleanly, but there is no local multi-GPU rig to run the orchestrator layer end-to-end.
+2. **The full 8B LMA channel ladder (A2–A4) is not yet runnable.** Only **A1 (GRPO-only)**
+   has a real Modal runner. **A2 (SDPO) / A3 (replay-DPO) / A4 (combined)** are scaffold +
+   plan-builder only — running them on a real 8B checkpoint additionally needs a real
+   error-trace SDPO dataset, a replay-DPO preference corpus, and an A100 entrypoint that
+   don't exist yet. The real 8B run is *additionally* user-budget-gated.
+3. **The empirical question** — does the method actually beat plain GRPO at scale? — is the
+   GPU-budget-gated v0.1 work (Spikes 002b/003/004) and remains open by design.
+See [`BACKLOG.md`](../BACKLOG.md) for the live gap list and [`docs/TROUBLESHOOTING.md`](TROUBLESHOOTING.md)
+for known foot-guns.
+## Foot-guns worth knowing on day one
+- **HF `main` lags `master`.** After cloning from the Hub, `git checkout master` (or pin a
+  master SHA) before `pip install -e .`, or you `ImportError` on `make_dr_grpo_config`. Same
+  for any Modal / HF-Jobs clone-and-install step.
+- **`strip_thinking` × SDPO.** On real agent traces, SDPO requires `strip_thinking=False`:
+  ~67% of error-recovery turns are pure thinking, so stripping them yields empty SDPO masks.
+- **KL estimator delta.** TRL uses the **k3** estimator; Composer's report describes **k1**.
+  This is a documented, intentional delta — the framework does not silently claim k1 parity.
+- **`compose_loss` is the verification harness, not production.** Its channel-1 is an LM-CE
+  stub, not real GRPO. Production training is `ComposerReplicationTrainer`.
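
The k1 versus k3 delta noted above can be made concrete with the commonly used per-sample definitions: for a token sampled from the policy, with `log_ratio = logp_ref - logp_policy`, k1 is `-log_ratio` and k3 is `exp(log_ratio) - log_ratio - 1`. The sketch below is illustrative only; `kl_k1` and `kl_k3` are hypothetical names, not TRL's or the framework's code.

```python
import math


def kl_k1(logp_policy: float, logp_ref: float) -> float:
    """k1 estimator: unbiased, but individual samples can be negative."""
    return logp_policy - logp_ref


def kl_k3(logp_policy: float, logp_ref: float) -> float:
    """k3 estimator: unbiased and non-negative for every sample."""
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0


if __name__ == "__main__":
    # Illustrative per-token log-probabilities, not measured values.
    for logp_policy, logp_ref in [(-1.0, -0.5), (-0.5, -1.0), (-0.7, -0.7)]:
        print(logp_policy, logp_ref,
              kl_k1(logp_policy, logp_ref), kl_k3(logp_policy, logp_ref))
```

Both estimators target the same KL in expectation, so the delta concerns per-sample variance and sign rather than the quantity being penalised.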
+## Where to go next
+| You want to… | Read |
+|---|---|
+| Pitch / status / roadmap | [`README.md`](../README.md) |
+| Run it end-to-end | [`docs/USER_GUIDE.md`](USER_GUIDE.md) |
+| Wire the loss into TRL / VeRL / PRIME-RL / DiLoCo / Monarch | [`docs/INTEGRATION_RECIPES.md`](INTEGRATION_RECIPES.md) |
+| Exact kwargs / signatures | [`docs/API_REFERENCE.md`](API_REFERENCE.md) |
+| Why each design decision | [`docs/adrs/README.md`](adrs/README.md) |
+| How Cursor's recipe maps to our components | [`docs/COMPOSER_RECIPE_MAPPING.md`](COMPOSER_RECIPE_MAPPING.md) |
+| Honest gaps / open work | [`BACKLOG.md`](../BACKLOG.md), [`docs/VISION_VALIDATION.md`](VISION_VALIDATION.md) |
+| Fix a broken install / run | [`docs/TROUBLESHOOTING.md`](TROUBLESHOOTING.md) |

docs/VISION_VALIDATION.md CHANGED

@@ -1,6 +1,6 @@
 # Vision Validation: Does the Framework Encapsulate the Original Brief?
-> **## Status as of 2026-06-08 (HEAD `aae66fa`, ADR-014)**
 > The framework is past-skeleton: 8 subpackages (`composer_replication/*`), 115 passing
 > tests + 1 skip-marked (see [`docs/V1_V8_COVERAGE.md`](V1_V8_COVERAGE.md) for the
 > canonical count), and operational end-to-end examples (`gsm8k_grpo`,
@@ -15,12 +15,14 @@
 > 1. Docker/TorchForge substrate E2E is hardware-blocked — the test exists and skips
 >    cleanly, but lacking a local multi-GPU rig the orchestrator layer is unrun.
 > 2. The 8B LMA channel-ladder is **not fully runnable today**: only **A1 (GRPO-only)**
->    has a real Modal runner. **A2 (SDPO) / A3 (replay-DPO) / A4 (combined)** are scaffold
->    + plan-builder only, blocked on dataset construction (a real error-trace SDPO dataset
->    + a replay-DPO preference corpus + an A100 entrypoint that don't exist yet). The real
->    8B run is *additionally* user-budget-gated. See
->    [ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md) +
->    [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md).
 > **Status:** Self-audit, 2026-05-25 (Wave 6).
 > **Question:** Does what we've built reflect what was originally asked for, or did we drift?
@@ -274,7 +276,7 @@ In recommended-do-next order:
 1. **Spike 006 (real-HF-model smoke)** — half a day, CPU-only. Closes V8's biggest credibility gap and surfaces any tokenizer / chat-template / vocab-size issues hiding in the skeleton. *Highest value per hour.*
 2. **Spike 007 (real-trace ingestion)** — 1 day, CPU + ~$5 OpenRouter. Closes V5's real-vs-synthetic gap. Picks one source (Claude Code session JSONL? Cline transcripts?) and writes the adapter.
 3. **Soften over-claims in README and paper** — half-hour. README: "TRL coded, VeRL design-only." Paper: "any causal LM with a chat template." `verl_path/README.md`: add `STATUS: design-only — validate via spike 002b before production use`.
-4. **Wave 7 (packaging)** — half a day, after 006 + 007 land. `pyproject.toml` + `examples/qwen3_05b_quickstart/` directory + entry-point exposure.
 5. **Spike 008 (Streaming DiLoCo smoke)** — 2 days, CPU-only. Closes V2's biggest deviation from the brief. Lowest priority because the deferral is technically defensible, but worth doing for completeness.
 Items 3 + 4 are documentation/packaging chores. Items 1 + 2 + 5 are real engineering. None require GPU budget. Total: ~5 days of sequential effort to take the framework from 5/10 to 9/10 vision encapsulation.

 # Vision Validation: Does the Framework Encapsulate the Original Brief?
+> **## Status as of 2026-06 (current through ADR-014)**
 > The framework is past-skeleton: 8 subpackages (`composer_replication/*`), 115 passing
 > tests + 1 skip-marked (see [`docs/V1_V8_COVERAGE.md`](V1_V8_COVERAGE.md) for the
 > canonical count), and operational end-to-end examples (`gsm8k_grpo`,
 > 1. Docker/TorchForge substrate E2E is hardware-blocked — the test exists and skips
 >    cleanly, but lacking a local multi-GPU rig the orchestrator layer is unrun.
 > 2. The 8B LMA channel-ladder is **not fully runnable today**: only **A1 (GRPO-only)**
+>    has a real Modal runner — [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md)
+>    records that "the A1 run used `dr_grpo`" and that threading the `objective=` menu
+>    through the rest of the ladder runners is an open follow-up. **A2 (SDPO) / A3
+>    (replay-DPO) / A4 (combined)** are scaffold + plan-builder only; running them on a
+>    real 8B checkpoint additionally needs a real error-trace SDPO dataset, a replay-DPO
+>    preference corpus, and an A100 entrypoint that don't exist yet. The real 8B run is
+>    *additionally* user-budget-gated (the sole remaining acceptance-gate box in
+>    [ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md)).
 > **Status:** Self-audit, 2026-05-25 (Wave 6).
 > **Question:** Does what we've built reflect what was originally asked for, or did we drift?
 1. **Spike 006 (real-HF-model smoke)** — half a day, CPU-only. Closes V8's biggest credibility gap and surfaces any tokenizer / chat-template / vocab-size issues hiding in the skeleton. *Highest value per hour.*
 2. **Spike 007 (real-trace ingestion)** — 1 day, CPU + ~$5 OpenRouter. Closes V5's real-vs-synthetic gap. Picks one source (Claude Code session JSONL? Cline transcripts?) and writes the adapter.
 3. **Soften over-claims in README and paper** — half-hour. README: "TRL coded, VeRL design-only." Paper: "any causal LM with a chat template." `verl_path/README.md`: add `STATUS: design-only — validate via spike 002b before production use`.
+4. **Wave 7 (packaging)** — half a day, after 006 + 007 land. `pyproject.toml` + `examples/qwen_05b_quickstart/` directory + entry-point exposure.
 5. **Spike 008 (Streaming DiLoCo smoke)** — 2 days, CPU-only. Closes V2's biggest deviation from the brief. Lowest priority because the deferral is technically defensible, but worth doing for completeness.
 Items 3 + 4 are documentation/packaging chores. Items 1 + 2 + 5 are real engineering. None require GPU budget. Total: ~5 days of sequential effort to take the framework from 5/10 to 9/10 vision encapsulation.

docs/adrs/README.md CHANGED

@@ -15,5 +15,12 @@
 | [ADR-011](ADR-011-sdpo-alignment-indices.md) | Collator-emitted SDPO alignment indices (close strict-guard regression) | accepted (amends ADR-008) | 2026-05-29 |
 | [ADR-012](ADR-012-close-review-findings.md) | Close open cross-family-review findings (KL/hint-routing/provenance/curriculum) | accepted (amends 008/009/010) | 2026-05-29 |
 | [ADR-013](ADR-013-lma-integration-channel-ladder.md) | LMA integration — isolated-channel ladder (supersedes tie-in Phase-3 hyperparams) | accepted | 2026-05-29 |
+| [ADR-014](ADR-014-policy-optimization-objective-menu.md) | Policy-optimization objective MENU: base RL objective selectable (default Dr.GRPO) over TRL 1.5.0 GRPOConfig (builds-on ADR-006/007/008) | accepted | 2026-05-30 |
 Sorted by number ascending. ADRs are immutable after `accepted`; supersede or amend rather than edit.
+> **Provenance note (ADR-014).** ADR-014 also records the canonical correction that the
+> framework's **trace-replay-DPO channel (channel 3) is an additive research channel, NOT
+> part of Cursor's Composer recipe** -- Composer's primary sources contain no DPO / preference
+> pairs / reward models / multiple teachers. Genuine replication is channels 1 (Dr.GRPO base)
+> + 2 (SDPO). See [`docs/OVERVIEW.md`](../OVERVIEW.md) for the honest three-channel summary.