Codeseys Claude Opus 4.8 (1M context) committed on
Commit
e130879
·
1 Parent(s): f00833d

docs(wave3): add OVERVIEW.md, index ADR-014, fold in adversarial-review fixes

New-docs + freshness wave, plus corrections from an adversarial review pass
on the wave1 commit. Docs-only; no source files touched.

New:
- docs/OVERVIEW.md — a 5-minute, honestly-scoped newcomer tour: what the
framework is, the three channels WITH provenance (1 Dr.GRPO + 2 SDPO =
genuine Composer replication; 3 trace-replay-DPO = the framework's own
additive channel, NOT Cursor's recipe), what's proven (CPU SDPO-fires, the
A1 8B Modal run, GSM8K GRPO, $0.98/trace), and what's gapped (Docker e2e,
A2-A4 ladder). Linked from README + both _archive READMEs.

ADR index:
- Add the missing ADR-014 row + a provenance note recording the channel-3
correction it carries.

Adversarial-review corrections (to the wave1 edits):
- Drop the parent-commit SHA mislabelled as "HEAD" in the VISION_VALIDATION
status banner and the ALTERED_MINDS runnability note; keep the date + ADR ref.
- Re-attribute the A2-A4 gap claim: cite ADR-014 only for what it states
("the A1 run used dr_grpo"; objective= threading is an open follow-up) and
ADR-013 for its sole-remaining user-gated box — instead of citing ADR-014's
acceptance gate, which does not contain the dataset-construction details.
- Fix a missed stale path examples/qwen3_05b_quickstart -> qwen_05b_quickstart
in VISION_VALIDATION.
- Add the master-branch guard pointer (Fact F) to the README Install block.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

README.md CHANGED
@@ -28,11 +28,16 @@ pretty_name: "Composer 2.5 Replication Framework — Research Synthesis"
28
  > **Author:** [Codeseys](https://huggingface.co/Codeseys)
29
  > **Goal:** Replicate Cursor's [Composer 2.5](https://cursor.com/blog/composer-2-5) (a post-trained Kimi K2.5 specialised for agentic coding) on **any HuggingFace causal LM with a chat template** (Qwen, Llama, Mistral, DeepSeek, Phi, Gemma families), using a synthesis of decentralized RL post-training techniques. *(LM-CE + DPO channels empirically verified on Qwen2.5-0.5B-Instruct via Spike 006; the SDPO channel is verified via the `compose_loss` integration test suite + the `examples/gsm8k_grpo_with_sdpo/` and `examples/sdpo_with_real_traces/` end-to-end smokes — Spike 006's run zeroed the SDPO channel because the hint-context shape didn't align with the student context, the same alignment discipline `ComposerDataCollator` enforces in production. Encoder-decoder models, base models without chat templates, and VLMs are out of scope for v0.)*
30
 
31
- This repository is the **"paper of the project"** — it is the methodology / research / framework specification for an open replication of Cursor's Composer 2.5 system, plus a **novel multi-teacher trace-replay distillation channel** that stacks on top of the Composer recipe.
 
 
32
 
33
  ## Install
34
 
35
  ```bash
 
 
 
36
  pip install -e .
37
  python examples/qwen_05b_quickstart/run.py
38
  ```
 
28
  > **Author:** [Codeseys](https://huggingface.co/Codeseys)
29
  > **Goal:** Replicate Cursor's [Composer 2.5](https://cursor.com/blog/composer-2-5) (a post-trained Kimi K2.5 specialised for agentic coding) on **any HuggingFace causal LM with a chat template** (Qwen, Llama, Mistral, DeepSeek, Phi, Gemma families), using a synthesis of decentralized RL post-training techniques. *(LM-CE + DPO channels empirically verified on Qwen2.5-0.5B-Instruct via Spike 006; the SDPO channel is verified via the `compose_loss` integration test suite + the `examples/gsm8k_grpo_with_sdpo/` and `examples/sdpo_with_real_traces/` end-to-end smokes — Spike 006's run zeroed the SDPO channel because the hint-context shape didn't align with the student context, the same alignment discipline `ComposerDataCollator` enforces in production. Encoder-decoder models, base models without chat templates, and VLMs are out of scope for v0.)*
30
 
31
+ This repository is the **"paper of the project"** — it is the methodology / research / framework specification for an open replication of Cursor's Composer 2.5 system, plus a **novel multi-teacher trace-replay distillation channel** that stacks on top of the Composer recipe (this channel is the *framework's own addition* — it is **not** part of Cursor's actual recipe; see [ADR-014](docs/adrs/ADR-014-policy-optimization-objective-menu.md)).
32
+
33
+ > 🧭 **New here?** Read [`docs/OVERVIEW.md`](docs/OVERVIEW.md) for a 5-minute, honestly-scoped tour: what this is, the three channels with their real provenance, what's proven, and what's still gapped.
34
 
35
  ## Install
36
 
37
  ```bash
38
+ # If cloning fresh from HF: `git checkout master` first — the Hub `main`
39
+ # branch LAGS `master`, and installing from `main` ImportErrors on
40
+ # make_dr_grpo_config. (See docs/HF_REPO_LAYOUT.md + docs/TROUBLESHOOTING.md.)
41
  pip install -e .
42
  python examples/qwen_05b_quickstart/run.py
43
  ```
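A minimal pre-flight sketch for the branch guard added to the Install block above: before launching a run, check that the installed package actually exposes `make_dr_grpo_config`. The helper name `has_symbol` is hypothetical, and the assumption that the symbol is importable from the top-level `composer_replication` package is not confirmed by this commit; adjust the module path to wherever the symbol really lives.

```python
import importlib


def has_symbol(module_name: str, attr: str) -> bool:
    """Return True if `module_name` imports cleanly and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers ModuleNotFoundError too (it subclasses ImportError).
        return False
    return hasattr(module, attr)


if __name__ == "__main__":
    # Assumption: the symbol is re-exported at the package top level.
    ok = has_symbol("composer_replication", "make_dr_grpo_config")
    print("make_dr_grpo_config available:", ok)
    if not ok:
        print("Hint: run `git checkout master`, then `pip install -e .` again.")
```

Running this once after a fresh clone, or at the top of a Modal / HF-Jobs entrypoint, would surface a stale `main` install before any training job starts.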
docs/ALTERED_MINDS_TIE_IN.md CHANGED
@@ -111,16 +111,19 @@ only — never rationale style, so distorted-but-persuasive reasoning is not
111
  rewarded), and `dual_kl_logger` (logs KL-to-altered-init **and** KL-to-base each
112
  step — the washout-vs-amplification instrument).
113
 
114
- Train for ~500 steps per arm on a single GPU. **Runnability today (HEAD `aae66fa`):**
115
- only **A1 (GRPO-only)** has a real Modal runner (Qwen-0.5B feasibility-test confirmed;
 
 
116
  for Llama-8B use Modal + the framework's `ServerlessExecutor` per ADR-005 — local 5090
117
  is too small). **A2 (SDPO) / A3 (replay-DPO) / A4 (combined) are scaffold + plan-builder
118
- only**, blocked on dataset construction: they need a real error-trace SDPO dataset, a
119
- replay-DPO preference corpus, and an A100 entrypoint that don't exist yet (see
120
- [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md) acceptance gate). The real
121
- 8B/LMA-checkpoint run is *additionally* **user-gated** (it spends grant budget). ADR-013
122
  ships the ladder scaffolding + the A1 capability, proven CPU-only on a small model
123
- (`examples/altered_minds_channel_ladder/`).
 
124
 
125
  > **strip_thinking × SDPO foot-gun (A2/A4).** When the SDPO arms become runnable on real
126
  > agent traces, SDPO REQUIRES `strip_thinking=False`: ~67% of error-recovery turns are
 
111
  rewarded), and `dual_kl_logger` (logs KL-to-altered-init **and** KL-to-base each
112
  step — the washout-vs-amplification instrument).
113
 
114
+ Train for ~500 steps per arm on a single GPU. **Runnability today (2026-06):**
115
+ only **A1 (GRPO-only)** has a real Modal runner — [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md)
116
+ records that "the A1 run used `dr_grpo`" and that wiring the `objective=` menu through the
117
+ rest of the ladder runners is an open follow-up (Qwen-0.5B feasibility-test confirmed;
118
  for Llama-8B use Modal + the framework's `ServerlessExecutor` per ADR-005 — local 5090
119
  is too small). **A2 (SDPO) / A3 (replay-DPO) / A4 (combined) are scaffold + plan-builder
120
+ only**: running them on a real 8B checkpoint additionally needs a real error-trace SDPO
121
+ dataset, a replay-DPO preference corpus, and an A100 entrypoint that don't exist yet;
122
+ none of those is a closed artifact today. The real 8B/LMA-checkpoint run is *additionally*
123
+ **user-gated** (it spends grant budget). [ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md)
124
  ships the ladder scaffolding + the A1 capability, proven CPU-only on a small model
125
+ (`examples/altered_minds_channel_ladder/`); its sole remaining acceptance-gate box is that
126
+ user-gated real-spend go/no-go.
127
 
128
  > **strip_thinking × SDPO foot-gun (A2/A4).** When the SDPO arms become runnable on real
129
  > agent traces, SDPO REQUIRES `strip_thinking=False`: ~67% of error-recovery turns are
docs/OVERVIEW.md ADDED
@@ -0,0 +1,96 @@
1
+ # Overview — Composer 2.5 Replication Framework (5-minute read)
2
+
3
+ *Current through [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md) (2026-06). For
4
+ the front-door pitch see [`README.md`](../README.md); for the honest gap list see
5
+ [`BACKLOG.md`](../BACKLOG.md); for the clause-by-clause vision audit see
6
+ [`docs/VISION_VALIDATION.md`](VISION_VALIDATION.md).*
7
+
8
+ ## What it is
9
+
10
+ An **open, methodology-first replication of Cursor's [Composer 2.5](https://cursor.com/blog/composer-2-5)**
11
+ recipe — the post-training pipeline that turned a Kimi-K2.5 MoE base into a strong agentic
12
+ coder — generalized so it runs on **any HuggingFace causal LM with a chat template** (Qwen,
13
+ Llama, Mistral, DeepSeek, Phi, Gemma families). It ships as an installable Python package
14
+ (`pip install -e .` → `composer_replication`) plus a research corpus (ADRs, deep-dives,
15
+ recipes). Encoder-decoder models, base models without chat templates, and VLMs are out of
16
+ scope for v0.
17
+
18
+ This repo is the **methodology repo** ("the paper of the project"). Trained-variant model
19
+ repos and trace datasets are split out per [`docs/HF_REPO_LAYOUT.md`](HF_REPO_LAYOUT.md).
20
+
21
+ ## The three channels — with honest provenance
22
+
23
+ The framework composes a single training loss out of three additive channels. **Two replicate
24
+ Cursor's published recipe; the third is the framework's own research addition.** Getting this
25
+ provenance right is the whole point — see [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md).
26
+
27
+ | # | Channel | What it is | Provenance |
28
+ |---|---|---|---|
29
+ | **1** | **Base policy optimization** | RL on verifiable rewards (RLVR). Default **Dr.GRPO**, now a **selectable menu** (`make_po_config(objective=…)` over `{grpo, dr_grpo, bnpo, dapo, gspo, cispo}`, per ADR-014). | ✅ **Genuine replication.** Composer 2's report (arXiv:2603.24477) resolves the base objective as Dr.GRPO. |
30
+ | **2** | **SDPO self-distillation** | Composer's "targeted RL with textual feedback": insert a hint into the context → use that hint-conditioned forward pass as a *self-teacher* → on-policy KL pulls the student toward it at the error turn. Published as SDPO/OPSD (arXiv:2601.20802 / 2601.18734, MIT code). | ✅ **Genuine replication.** This is Composer 2.5's headline trick; Cursor cites the SDPO/OPSD papers in the blog's footnote 1. |
31
+ | **3** | **Trace-replay-DPO** | Replay each step of a frozen agentic trace with N external teachers; turn teacher (dis)agreement into DPO preference pairs. A deliberate β-gated washout probe in the A0→A4 channel ladder ([ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md)). | ⚠️ **The framework's OWN additive research channel — NOT part of Cursor's recipe.** Composer's primary sources contain no DPO, no preference pairs, no reward models, no multiple teachers. Stacks *on top of* the genuine replication; it does not define it. |
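To make the channel-3 row concrete, here is one way teacher (dis)agreement at a single replayed step could be turned into a DPO preference pair. This is an illustrative sketch only: the function name `build_preference_pair`, the strict-majority rule, and the `min_agreement` threshold are all assumptions, not the framework's actual pairing logic.

```python
from collections import Counter
from typing import Optional


def build_preference_pair(
    student_action: str,
    teacher_actions: list[str],
    min_agreement: float = 0.5,
) -> Optional[dict[str, str]]:
    """Build one DPO pair for a replayed step, or None if no clear signal.

    Hypothetical rule: if a strict majority of teachers agree on an action
    that differs from the student's, prefer the teachers' action.
    """
    if not teacher_actions:
        return None
    top_action, votes = Counter(teacher_actions).most_common(1)[0]
    if votes / len(teacher_actions) <= min_agreement:
        return None  # teachers disagree too much to trust a preference
    if top_action == student_action:
        return None  # student already matches the teachers; nothing to prefer
    return {"chosen": top_action, "rejected": student_action}


pair = build_preference_pair("rm -rf build", ["pytest -x", "pytest -x", "ls"])
print(pair)
```

Steps where the teachers split evenly, or where the student already agrees with them, yield no pair under this rule, so the channel contributes nothing to the loss for those steps.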
32
+
33
+ > **Read this before citing the framework.** Any statement of the form "Composer does
34
+ > trace-replay-DPO" or "the replication target includes channel 3" is **wrong**. Cursor's
35
+ > recipe = channels 1 + 2. Channel 3 is our addition, and the docs are careful to say so.
36
+
37
+ The full loss (verification-harness form) is `total = lm_ce + α·sdpo_jsd + β·trace_replay_dpo`;
38
+ production uses `ComposerReplicationTrainer._compute_loss` (a real `trl.GRPOTrainer` subclass),
39
+ where channel 1 is real GRPO rather than the LM-CE stub. See
40
+ [`docs/USER_GUIDE.md`](USER_GUIDE.md) and [`docs/COMPOSER_RECIPE_MAPPING.md`](COMPOSER_RECIPE_MAPPING.md).
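The verification-harness formula above can be sketched as plain arithmetic. The function name `compose_total` and the default weights are assumptions for illustration; the real `compose_loss` signature is documented in `docs/API_REFERENCE.md` and may differ.

```python
def compose_total(
    lm_ce: float,
    sdpo_jsd: float,
    trace_replay_dpo: float,
    alpha: float = 1.0,  # assumed default weight for channel 2
    beta: float = 0.0,   # assumed default: channel 3 switched off
) -> float:
    """total = lm_ce + alpha * sdpo_jsd + beta * trace_replay_dpo."""
    return lm_ce + alpha * sdpo_jsd + beta * trace_replay_dpo


# With beta = 0 only channels 1 and 2 (the genuine replication) contribute.
replication_only = compose_total(2.0, 0.5, 4.0, alpha=0.1, beta=0.0)
# A non-zero beta adds the framework's own channel 3 on top.
with_channel_3 = compose_total(2.0, 0.5, 4.0, alpha=0.1, beta=0.25)
print(replication_only, with_channel_3)
```

Because the channels are additive, setting `beta` to zero recovers the channels-1-plus-2 objective exactly, which is what makes channel 3 usable as a gated probe.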
41
+
42
+ ## What's proven
43
+
44
+ - **CPU SDPO-fires.** On real Qwen2.5-0.5B-Instruct, the SDPO channel demonstrably fires
45
+ (`sdpo_jsd > 0`) and SDPO-on vs SDPO-off totals differ — the "is the loss decrease just
46
+ memorization?" critique is closed (Spike 006-strict).
47
+ - **Real GPU run.** Qwen2.5-0.5B in bf16 on a local 5090 (sm_120): 50 steps, loss
48
+ 0.7354 → 0.00034, 5.31 GB peak VRAM (Spike 002a-mini).
49
+ - **A1 8B-ladder Modal run.** The GRPO-only arm (A1) of the LMA channel ladder has a real
50
+ Modal runner and has been run with `dr_grpo`.
51
+ - **GSM8K GRPO.** The `examples/gsm8k_grpo*` end-to-end examples exercise the production
52
+ trainer on a real reasoning benchmark.
53
+ - **Economic feasibility of channel 3.** 150 real OpenRouter calls, $0.98/trace mean, 0
54
+ errors (Spike 001).
55
+ - **Installable + tested.** `pip install -e .` works; **115 passing tests + 1 skip-marked**
56
+ (canonical count: [`docs/V1_V8_COVERAGE.md`](V1_V8_COVERAGE.md)).
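The `sdpo_jsd > 0` check in the first bullet is easier to read with a toy divergence in hand. The sketch below assumes, from the name alone, that `sdpo_jsd` is a Jensen-Shannon divergence between the student's next-token distribution and the hint-conditioned self-teacher's; the framework's real implementation (masking, reduction, any temperature) is not shown in this commit.

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) for two categorical distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence: symmetric, zero only when p equals q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


student = [0.7, 0.2, 0.1]        # toy next-token distribution without the hint
self_teacher = [0.2, 0.7, 0.1]   # toy distribution with the hint in context
print(jsd(student, self_teacher) > 0)  # the channel "fires"
print(jsd(student, student) == 0.0)    # identical contexts give no signal
```

Under this reading, a channel that stays at exactly zero means the student and the hint-conditioned pass produced identical distributions on every unmasked position, or that the mask was empty.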
57
+
58
+ ## What's gapped (honest, NOT closed)
59
+
60
+ 1. **Docker / TorchForge substrate E2E** is **hardware-blocked** — the test exists and skips
61
+ cleanly, but there is no local multi-GPU rig to run the orchestrator layer end-to-end.
62
+ 2. **The full 8B LMA channel ladder (A2–A4) is not yet runnable.** Only **A1 (GRPO-only)**
63
+ has a real Modal runner. **A2 (SDPO) / A3 (replay-DPO) / A4 (combined)** are scaffold +
64
+ plan-builder only — running them on a real 8B checkpoint additionally needs a real
65
+ error-trace SDPO dataset, a replay-DPO preference corpus, and an A100 entrypoint that
66
+ don't exist yet. The real 8B run is *additionally* user-budget-gated.
67
+ 3. **The empirical question** — does the method actually beat plain GRPO at scale? — is the
68
+ GPU-budget-gated v0.1 work (Spikes 002b/003/004) and remains open by design.
69
+
70
+ See [`BACKLOG.md`](../BACKLOG.md) for the live gap list and [`docs/TROUBLESHOOTING.md`](TROUBLESHOOTING.md)
71
+ for known foot-guns.
72
+
73
+ ## Foot-guns worth knowing on day one
74
+
75
+ - **HF `main` lags `master`.** After cloning from the Hub, `git checkout master` (or pin a
76
+ master SHA) before `pip install -e .`, or you `ImportError` on `make_dr_grpo_config`. Same
77
+ for any Modal / HF-Jobs clone-and-install step.
78
+ - **`strip_thinking` × SDPO.** On real agent traces, SDPO requires `strip_thinking=False`:
79
+ ~67% of error-recovery turns are pure thinking, so stripping them yields empty SDPO masks.
80
+ - **KL estimator delta.** TRL uses the **k3** estimator; Composer's report describes **k1**.
81
+ This is a documented, intentional delta — the framework does not silently claim k1 parity.
82
+ - **`compose_loss` is the verification harness, not production.** Its channel-1 is an LM-CE
83
+ stub, not real GRPO. Production training is `ComposerReplicationTrainer`.
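For the KL-estimator bullet above, the two estimators can be compared per sample. This sketch follows the standard k1/k3 definitions for estimating KL(policy || reference) from tokens sampled under the policy; the function names are illustrative and nothing here is taken from the framework's code.

```python
import math


def k1(logp_policy: float, logp_ref: float) -> float:
    """k1 = log pi(x) - log pi_ref(x). Unbiased, but can be negative."""
    return logp_policy - logp_ref


def k3(logp_policy: float, logp_ref: float) -> float:
    """k3 = r - 1 - log r with r = pi_ref(x) / pi(x). Always >= 0."""
    log_r = logp_ref - logp_policy
    return math.exp(log_r) - log_r - 1.0


# A token the reference model likes more than the policy does:
print(k1(-2.0, -1.0))  # negative single-sample estimate
print(k3(-2.0, -1.0))  # non-negative single-sample estimate
```

Both estimators agree at zero when the two log-probabilities match, but their per-token values differ elsewhere, so logged KL curves from a k3-based trainer are not directly comparable to k1 numbers in a report.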
84
+
85
+ ## Where to go next
86
+
87
+ | You want to… | Read |
88
+ |---|---|
89
+ | Pitch / status / roadmap | [`README.md`](../README.md) |
90
+ | Run it end-to-end | [`docs/USER_GUIDE.md`](USER_GUIDE.md) |
91
+ | Wire the loss into TRL / VeRL / PRIME-RL / DiLoCo / Monarch | [`docs/INTEGRATION_RECIPES.md`](INTEGRATION_RECIPES.md) |
92
+ | Exact kwargs / signatures | [`docs/API_REFERENCE.md`](API_REFERENCE.md) |
93
+ | Why each design decision | [`docs/adrs/README.md`](adrs/README.md) |
94
+ | How Cursor's recipe maps to our components | [`docs/COMPOSER_RECIPE_MAPPING.md`](COMPOSER_RECIPE_MAPPING.md) |
95
+ | Honest gaps / open work | [`BACKLOG.md`](../BACKLOG.md), [`docs/VISION_VALIDATION.md`](VISION_VALIDATION.md) |
96
+ | Fix a broken install / run | [`docs/TROUBLESHOOTING.md`](TROUBLESHOOTING.md) |
docs/VISION_VALIDATION.md CHANGED
@@ -1,6 +1,6 @@
1
  # Vision Validation: Does the Framework Encapsulate the Original Brief?
2
 
3
- > **## Status as of 2026-06-08 (HEAD `aae66fa`, ADR-014)**
4
  > The framework is past-skeleton: 8 subpackages (`composer_replication/*`), 115 passing
5
  > tests + 1 skip-marked (see [`docs/V1_V8_COVERAGE.md`](V1_V8_COVERAGE.md) for the
6
  > canonical count), and operational end-to-end examples (`gsm8k_grpo`,
@@ -15,12 +15,14 @@
15
  > 1. Docker/TorchForge substrate E2E is hardware-blocked — the test exists and skips
16
  > cleanly, but lacking a local multi-GPU rig the orchestrator layer is unrun.
17
  > 2. The 8B LMA channel-ladder is **not fully runnable today**: only **A1 (GRPO-only)**
18
- > has a real Modal runner. **A2 (SDPO) / A3 (replay-DPO) / A4 (combined)** are scaffold
19
- > + plan-builder only, blocked on dataset construction (a real error-trace SDPO dataset
20
- > + a replay-DPO preference corpus + an A100 entrypoint that don't exist yet). The real
21
- > 8B run is *additionally* user-budget-gated. See
22
- > [ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md) +
23
- > [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md).
 
 
24
 
25
  > **Status:** Self-audit, 2026-05-25 (Wave 6).
26
  > **Question:** Does what we've built reflect what was originally asked for, or did we drift?
@@ -274,7 +276,7 @@ In recommended-do-next order:
274
  1. **Spike 006 (real-HF-model smoke)** — half a day, CPU-only. Closes V8's biggest credibility gap and surfaces any tokenizer / chat-template / vocab-size issues hiding in the skeleton. *Highest value per hour.*
275
  2. **Spike 007 (real-trace ingestion)** — 1 day, CPU + ~$5 OpenRouter. Closes V5's real-vs-synthetic gap. Picks one source (Claude Code session JSONL? Cline transcripts?) and writes the adapter.
276
  3. **Soften over-claims in README and paper** — half-hour. README: "TRL coded, VeRL design-only." Paper: "any causal LM with a chat template." `verl_path/README.md`: add `STATUS: design-only — validate via spike 002b before production use`.
277
- 4. **Wave 7 (packaging)** — half a day, after 006 + 007 land. `pyproject.toml` + `examples/qwen3_05b_quickstart/` directory + entry-point exposure.
278
  5. **Spike 008 (Streaming DiLoCo smoke)** — 2 days, CPU-only. Closes V2's biggest deviation from the brief. Lowest priority because the deferral is technically defensible, but worth doing for completeness.
279
 
280
  Items 3 + 4 are documentation/packaging chores. Items 1 + 2 + 5 are real engineering. None require GPU budget. Total: ~5 days of sequential effort to take the framework from 5/10 to 9/10 vision encapsulation.
 
1
  # Vision Validation: Does the Framework Encapsulate the Original Brief?
2
 
3
+ > **## Status as of 2026-06 (current through ADR-014)**
4
  > The framework is past-skeleton: 8 subpackages (`composer_replication/*`), 115 passing
5
  > tests + 1 skip-marked (see [`docs/V1_V8_COVERAGE.md`](V1_V8_COVERAGE.md) for the
6
  > canonical count), and operational end-to-end examples (`gsm8k_grpo`,
 
15
  > 1. Docker/TorchForge substrate E2E is hardware-blocked — the test exists and skips
16
  > cleanly, but lacking a local multi-GPU rig the orchestrator layer is unrun.
17
  > 2. The 8B LMA channel-ladder is **not fully runnable today**: only **A1 (GRPO-only)**
18
+ > has a real Modal runner; [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md)
19
+ > records that "the A1 run used `dr_grpo`" and that threading the `objective=` menu
20
+ > through the rest of the ladder runners is an open follow-up. **A2 (SDPO) / A3
21
+ > (replay-DPO) / A4 (combined)** are scaffold + plan-builder only; running them on a
22
+ > real 8B checkpoint additionally needs a real error-trace SDPO dataset, a replay-DPO
23
+ > preference corpus, and an A100 entrypoint that don't exist yet. The real 8B run is
24
+ > *additionally* user-budget-gated (the sole remaining acceptance-gate box in
25
+ > [ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md)).
26
 
27
  > **Status:** Self-audit, 2026-05-25 (Wave 6).
28
  > **Question:** Does what we've built reflect what was originally asked for, or did we drift?
 
276
  1. **Spike 006 (real-HF-model smoke)** — half a day, CPU-only. Closes V8's biggest credibility gap and surfaces any tokenizer / chat-template / vocab-size issues hiding in the skeleton. *Highest value per hour.*
277
  2. **Spike 007 (real-trace ingestion)** — 1 day, CPU + ~$5 OpenRouter. Closes V5's real-vs-synthetic gap. Picks one source (Claude Code session JSONL? Cline transcripts?) and writes the adapter.
278
  3. **Soften over-claims in README and paper** — half-hour. README: "TRL coded, VeRL design-only." Paper: "any causal LM with a chat template." `verl_path/README.md`: add `STATUS: design-only — validate via spike 002b before production use`.
279
+ 4. **Wave 7 (packaging)** — half a day, after 006 + 007 land. `pyproject.toml` + `examples/qwen_05b_quickstart/` directory + entry-point exposure.
280
  5. **Spike 008 (Streaming DiLoCo smoke)** — 2 days, CPU-only. Closes V2's biggest deviation from the brief. Lowest priority because the deferral is technically defensible, but worth doing for completeness.
281
 
282
  Items 3 + 4 are documentation/packaging chores. Items 1 + 2 + 5 are real engineering. None require GPU budget. Total: ~5 days of sequential effort to take the framework from 5/10 to 9/10 vision encapsulation.
docs/adrs/README.md CHANGED
@@ -15,5 +15,12 @@
15
  | [ADR-011](ADR-011-sdpo-alignment-indices.md) | Collator-emitted SDPO alignment indices (close strict-guard regression) | accepted (amends ADR-008) | 2026-05-29 |
16
  | [ADR-012](ADR-012-close-review-findings.md) | Close open cross-family-review findings (KL/hint-routing/provenance/curriculum) | accepted (amends 008/009/010) | 2026-05-29 |
17
  | [ADR-013](ADR-013-lma-integration-channel-ladder.md) | LMA integration — isolated-channel ladder (supersedes tie-in Phase-3 hyperparams) | accepted | 2026-05-29 |
 
18
 
19
  Sorted by number ascending. ADRs are immutable after `accepted`; supersede or amend rather than edit.
 
 
 
 
 
 
 
15
  | [ADR-011](ADR-011-sdpo-alignment-indices.md) | Collator-emitted SDPO alignment indices (close strict-guard regression) | accepted (amends ADR-008) | 2026-05-29 |
16
  | [ADR-012](ADR-012-close-review-findings.md) | Close open cross-family-review findings (KL/hint-routing/provenance/curriculum) | accepted (amends 008/009/010) | 2026-05-29 |
17
  | [ADR-013](ADR-013-lma-integration-channel-ladder.md) | LMA integration — isolated-channel ladder (supersedes tie-in Phase-3 hyperparams) | accepted | 2026-05-29 |
18
+ | [ADR-014](ADR-014-policy-optimization-objective-menu.md) | Policy-optimization objective MENU: base RL objective selectable (default Dr.GRPO) over TRL 1.5.0 GRPOConfig (builds-on ADR-006/007/008) | accepted | 2026-05-30 |
19
 
20
  Sorted by number ascending. ADRs are immutable after `accepted`; supersede or amend rather than edit.
21
+
22
+ > **Provenance note (ADR-014).** ADR-014 also records the canonical correction that the
23
+ > framework's **trace-replay-DPO channel (channel 3) is an additive research channel, NOT
24
+ > part of Cursor's Composer recipe** -- Composer's primary sources contain no DPO / preference
25
+ > pairs / reward models / multiple teachers. Genuine replication is channels 1 (Dr.GRPO base)
26
+ > + 2 (SDPO). See [`docs/OVERVIEW.md`](../OVERVIEW.md) for the honest three-channel summary.