docs(wave1): correctness pass — channel-3 provenance, gap honesty, dead links

Propagate ADR-014 ground truth into the living docs and fix every dead
relative link. Docs-only; no source files touched.

Provenance (Fact A) — stop blurring the additive trace-replay-DPO channel
into Cursor's recipe:
- README v0.1 roadmap cell, HF_REPO_LAYOUT v0/v1 variant rows: reframe
"Composer recipe" = channels 1 (Dr.GRPO) + 2 (SDPO) only; trace-replay
is the framework's own addition (link ADR-014).

Gap honesty (Fact D) — do not imply the full A1-A4 LMA ladder is runnable:
- VISION_VALIDATION status banner + ALTERED_MINDS_TIE_IN Phase-3: only A1
(GRPO-only) has a real Modal runner; A2/A3/A4 are scaffold + plan-builder
only, blocked on dataset construction. Real 8B run is additionally
budget-gated.

PO menu / strip_thinking / main-lag (Facts B, E, F):
- VISION_VALIDATION: base RL objective is a selectable menu (default
Dr.GRPO) per ADR-014, not hardcoded.
- ALTERED_MINDS_TIE_IN: note SDPO requires strip_thinking=False on real
traces (~67% of error-recovery turns are pure thinking).
- USER_GUIDE: add `git checkout master` to the clone+install (HF main lags
master; otherwise ImportError on make_dr_grpo_config).

Freshness + dead links:
- VISION_VALIDATION: stale "210 tests" -> point to the canonical count in
V1_V8_COVERAGE.md (115 + 1 skip-marked); status banner dated to HEAD.
- BACKLOG: fix examples/qwen3_05b_quickstart -> examples/qwen_05b_quickstart.
- INTEGRATION_RECIPES: ADR-007-distillation-losses -> -self-distillation-.
- framework/ and publications/HF_DISCUSSION_POST: 9 root-relative links
that 404 from a subdirectory on HF Hub now use correct ../ prefixes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Files changed (9)

BACKLOG.md +2 -2
README.md +1 -1
docs/ALTERED_MINDS_TIE_IN.md +15 -5
docs/HF_REPO_LAYOUT.md +2 -2
docs/INTEGRATION_RECIPES.md +1 -1
docs/USER_GUIDE.md +8 -0
docs/VISION_VALIDATION.md +20 -5
framework/composer-replication-framework.md +2 -2
publications/HF_DISCUSSION_POST.md +6 -6

BACKLOG.md CHANGED

@@ -71,8 +71,8 @@ Updated 2026-05-29 to reflect shipped waves (ingestion, diloco, packaging, datag
 **Acceptance**:
 1. `pyproject.toml` at repo root, package name `composer_replication`.
 2. `composer_replication/` dir with `__init__.py` re-exporting `composer_total_loss`, `OPSDLoss`, `TeacherReplayBuffer`, `compose_loss`, `TraceIngester`, etc.
-3. `examples/qwen3_05b_quickstart/` with end-to-end script that loads model, runs 10 training steps, prints loss curve.
-4. README quickstart updated to `pip install -e .` + `python examples/qwen3_05b_quickstart/run.py`.
+3. `examples/qwen_05b_quickstart/` with end-to-end script that loads model, runs 10 training steps, prints loss curve.
+4. README quickstart updated to `pip install -e .` + `python examples/qwen_05b_quickstart/run.py`.
 5. `pip install -e .` succeeds and quickstart runs end-to-end on CPU.
 ### Post-Skeleton Waves (Datagen, Alignment, Quality)

README.md CHANGED

@@ -163,7 +163,7 @@ The novel contribution is channel (3) — no published work systematically repla
 | Phase | Timeline | Goal | Trained variant repo | Data repo |
 |---|---|---|---|---|
 | **v0.0 spike** | 1–2 weeks | Prove trace-replay-DPO beats plain GRPO on Qwen3-7B + SWE-bench-lite | `Codeseys/composer-replication-qwen3-7b-v0` | `Codeseys/composer-replication-traces-v0` |
-| **v0.1** | 1–2 months | Full Composer recipe (RLVR + hint-distill + trace-replay) on Qwen3-32B + Feature Deletion env. Match Cursor's ~50% SWE-bench-multilingual at 32B scale. | `Codeseys/composer-replication-qwen3-32b-v1` | `Codeseys/composer-replication-traces-v1` |
+| **v0.1** | 1–2 months | Full Composer recipe (RLVR + hint-distill, i.e. channels 1–2 = Dr.GRPO + SDPO) **plus** the framework's own additive trace-replay-DPO channel, on Qwen3-32B + Feature Deletion env. Match Cursor's ~50% SWE-bench-multilingual at 32B scale. *(trace-replay-DPO is our addition, not part of Cursor's recipe — see [ADR-014](docs/adrs/ADR-014-policy-optimization-objective-menu.md).)* | `Codeseys/composer-replication-qwen3-32b-v1` | `Codeseys/composer-replication-traces-v1` |
 | **v0.2** | 3–6 months | Decentralized scaling: Streaming DiLoCo + SHARDCAST + Monarch. Multi-cluster reproduction of v0.1 across **Modal + HF Jobs + on-prem** via the new serverless-DiLoCo executor abstraction (ADR-005). | `Codeseys/composer-replication-qwen3-32b-decentralized` | (re-uses v1 data) |
 Each variant will get its own model repo (LoRA adapters or full fine-tunes) per the [HF multi-artifact research project layout](https://huggingface.co/docs/hub/repositories). This methodology repo will be linked from each variant's README and via an HF Collection once v0.0 produces a result.

docs/ALTERED_MINDS_TIE_IN.md CHANGED

@@ -111,11 +111,21 @@ only — never rationale style, so distorted-but-persuasive reasoning is not
 rewarded), and `dual_kl_logger` (logs KL-to-altered-init **and** KL-to-base each
 step — the washout-vs-amplification instrument).
-Train for ~500 steps per arm on a single GPU (Qwen-0.5B feasibility-test
-already confirmed; for Llama-8B, use Modal + the framework's `ServerlessExecutor`
-per ADR-005 — local 5090 is too small). The real 8B/LMA-checkpoint run remains
-**user-gated** (it spends grant budget) — ADR-013 ships the capability, proven
-CPU-only on a small model (`examples/altered_minds_channel_ladder/`).
+Train for ~500 steps per arm on a single GPU. **Runnability today (HEAD `aae66fa`):**
+only **A1 (GRPO-only)** has a real Modal runner (Qwen-0.5B feasibility-test confirmed;
+for Llama-8B use Modal + the framework's `ServerlessExecutor` per ADR-005 — local 5090
+is too small). **A2 (SDPO) / A3 (replay-DPO) / A4 (combined) are scaffold + plan-builder
+only**, blocked on dataset construction — they need a real error-trace SDPO dataset, a
+replay-DPO preference corpus, and an A100 entrypoint that don't exist yet (see
+[ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md) acceptance gate). The real
+8B/LMA-checkpoint run is *additionally* **user-gated** (it spends grant budget). ADR-013
+ships the ladder scaffolding + the A1 capability, proven CPU-only on a small model
+(`examples/altered_minds_channel_ladder/`).
+> **strip_thinking × SDPO foot-gun (A2/A4).** When the SDPO arms become runnable on real
+> agent traces, SDPO REQUIRES `strip_thinking=False`: ~67% of error-recovery turns are
+> pure thinking, so stripping them yields empty SDPO masks (the channel silently
+> contributes nothing). Keep thinking tokens in the context for any SDPO-active arm.
 ### Phase 4 — re-evaluate
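
The empty-mask failure described in the foot-gun note above can be illustrated with a toy sketch. Everything in it is hypothetical: the `Segment` type, the `sdpo_mask` helper, and the segment-level view of a turn are illustrative assumptions, not the framework's actual `strip_thinking` implementation.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One span of an assistant turn (hypothetical representation)."""
    kind: str        # "thinking" or "answer"
    n_tokens: int


def sdpo_mask(turn: list[Segment], strip_thinking: bool) -> list[int]:
    """Return a per-token mask (1 = token receives SDPO loss).

    Assumption: stripping thinking removes those tokens from the context
    entirely, so they can no longer be supervised.
    """
    mask: list[int] = []
    for seg in turn:
        if seg.kind == "thinking" and strip_thinking:
            continue  # tokens dropped: nothing left to supervise
        mask.extend([1] * seg.n_tokens)
    return mask


# An error-recovery turn that is pure thinking (the common case per the note).
pure_thinking_turn = [Segment("thinking", 12)]

stripped = sdpo_mask(pure_thinking_turn, strip_thinking=True)
kept = sdpo_mask(pure_thinking_turn, strip_thinking=False)

print(sum(stripped))  # 0: the SDPO channel contributes nothing for this turn
print(sum(kept))      # 12: every thinking token is supervised
```

A guard that fails loudly when an SDPO-active arm produces an all-zero mask would be one way to keep this from passing silently; whether the framework already has such a check is not stated here.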

docs/HF_REPO_LAYOUT.md CHANGED

@@ -17,7 +17,7 @@ When the v0.0 spike produces a result, the following repos will be created:
 | Repo | Type | Created when | Contents |
 |---|---|---|---|
 | `Codeseys/composer-replication-traces-v0` | dataset | v0.0 spike data is collected | 100 frozen agentic-coding traces (JSON), used for trace-replay-distillation experiments |
-| `Codeseys/composer-replication-qwen3-7b-v0` | model | v0.0 spike produces a checkpoint | LoRA adapter or full fine-tune of Qwen3-7B trained with GRPO + trace-replay-DPO |
 | `Codeseys/composer-replication-qwen3-7b-v0-baseline` | model | v0.0 spike produces a baseline checkpoint | Same training, plain GRPO only (A/B comparison) |
 After v0.1:
@@ -26,7 +26,7 @@ After v0.1:
 |---|---|---|
 | `Codeseys/composer-replication-traces-v1` | dataset | Larger trace corpus + Feature-Deletion environment seed repos |
 | `Codeseys/composer-replication-feature-deletion-env-v1` | dataset | Repos with passing tests, with deletion masks for the env to apply |
-| `Codeseys/composer-replication-qwen3-32b-v1` | model | Full Composer-recipe v1 trained variant |
 All trained-variant repos will:
 - Link back to **this repo** (`Codeseys/composer-replication-framework`) in their `README.md` as the methodology source.

 | Repo | Type | Created when | Contents |
 |---|---|---|---|
 | `Codeseys/composer-replication-traces-v0` | dataset | v0.0 spike data is collected | 100 frozen agentic-coding traces (JSON), used for trace-replay-distillation experiments |
+| `Codeseys/composer-replication-qwen3-7b-v0` | model | v0.0 spike produces a checkpoint | LoRA adapter or full fine-tune of Qwen3-7B trained with Dr.GRPO + the framework's own additive trace-replay-DPO research channel (channel 3 is our addition, **not** part of Cursor's Composer recipe — see [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md)) |
 | `Codeseys/composer-replication-qwen3-7b-v0-baseline` | model | v0.0 spike produces a baseline checkpoint | Same training, plain GRPO only (A/B comparison) |
 After v0.1:
 |---|---|---|
 | `Codeseys/composer-replication-traces-v1` | dataset | Larger trace corpus + Feature-Deletion environment seed repos |
 | `Codeseys/composer-replication-feature-deletion-env-v1` | dataset | Repos with passing tests, with deletion masks for the env to apply |
+| `Codeseys/composer-replication-qwen3-32b-v1` | model | Full Composer-channel (Dr.GRPO + SDPO) v1 trained variant, combined with the framework's additive trace-replay-DPO research channel (trace-replay is our addition, not Composer's recipe) |
 All trained-variant repos will:
 - Link back to **this repo** (`Codeseys/composer-replication-framework`) in their `README.md` as the methodology source.

docs/INTEGRATION_RECIPES.md CHANGED

@@ -978,7 +978,7 @@ adapter boundary, not because the loss math is wrong.
 - ADRs:
   [`docs/adrs/ADR-005-serverless-diloco.md`](adrs/ADR-005-serverless-diloco.md),
   [`docs/adrs/ADR-006-rl-frameworks.md`](adrs/ADR-006-rl-frameworks.md),
-  [`docs/adrs/ADR-007-distillation-losses.md`](adrs/ADR-007-distillation-losses.md)
+  [`docs/adrs/ADR-007-self-distillation-losses.md`](adrs/ADR-007-self-distillation-losses.md)
 ---

docs/USER_GUIDE.md CHANGED

@@ -59,9 +59,17 @@ Always start with the core install:
 ```bash
 git clone https://huggingface.co/Codeseys/composer-replication-framework
 cd composer-replication-framework
+git checkout master   # HF 'main' LAGS 'master'; without this you ImportError on make_dr_grpo_config
 pip install -e .
 ```
+> **Branch foot-gun.** The HF Hub `main` branch lags `master`. Always
+> `git checkout master` (or pin a known-good master SHA) before `pip install -e .`
+> — otherwise newer symbols such as `make_dr_grpo_config` / `make_po_config`
+> are missing and you hit an `ImportError`. The same applies to any Modal /
+> HF-Jobs clone-and-install step. See [`docs/TROUBLESHOOTING.md`](TROUBLESHOOTING.md)
+> and [`docs/HF_REPO_LAYOUT.md`](HF_REPO_LAYOUT.md).
 That gets you `torch>=2.0` + `transformers>=4.46` and is enough for the
 verification harness on CPU (sections 3, 5, 6).
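
The branch foot-gun above could be caught before a long job starts with a small pre-flight check. This is a minimal sketch: `missing_symbols` is a hypothetical helper, and it assumes the factories named in the note are importable from the `composer_replication` package root, which the guide does not confirm.

```python
import importlib


def missing_symbols(module_name: str, symbols: list[str]) -> list[str]:
    """Return the names in `symbols` that `module_name` does not expose.

    If the module itself cannot be imported, every symbol counts as missing.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return list(symbols)
    return [name for name in symbols if not hasattr(module, name)]


if __name__ == "__main__":
    # Assumption: these factories live at the package root.
    missing = missing_symbols(
        "composer_replication", ["make_dr_grpo_config", "make_po_config"]
    )
    if missing:
        print(f"Missing {missing}: is the checkout on 'master' rather than 'main'?")
    else:
        print("Checkout looks current.")
```

Running a check like this as the first step of a Modal or HF-Jobs entrypoint would surface a stale checkout as a readable message instead of an `ImportError` deep inside training setup.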

docs/VISION_VALIDATION.md CHANGED

@@ -1,11 +1,26 @@
 # Vision Validation: Does the Framework Encapsulate the Original Brief?
-> **## Status as of 2026-05-29**
-> The framework is past-skeleton: 8 subpackages (`composer_replication/*`), 210 passing tests, and operational end-to-end examples (`gsm8k_grpo`, `sdpo_with_real_traces_production`). The 3-channel loss, layered hint-generation, trace-ingestion, and DiLoCo have all shipped and been cross-family reviewed.
 >
-> **Two remaining honest gaps:**
-> 1. Docker/TorchForge substrate E2E is hardware-blocked (lacking local multi-GPU rig for the orchestrator layer).
-> 2. Real LMA full-scale run (8B model, 10k SWE-bench traces) is user-budget-gated.
 > **Status:** Self-audit, 2026-05-25 (Wave 6).
 > **Question:** Does what we've built reflect what was originally asked for, or did we drift?

 # Vision Validation: Does the Framework Encapsulate the Original Brief?
+> **## Status as of 2026-06-08 (HEAD `aae66fa`, ADR-014)**
+> The framework is past-skeleton: 8 subpackages (`composer_replication/*`), 115 passing
+> tests + 1 skip-marked (see [`docs/V1_V8_COVERAGE.md`](V1_V8_COVERAGE.md) for the
+> canonical count), and operational end-to-end examples (`gsm8k_grpo`,
+> `sdpo_with_real_traces_production`). The 3-channel loss, layered hint-generation,
+> trace-ingestion, and DiLoCo have all shipped and been cross-family reviewed. The base
+> RL objective is now a **selectable menu** (default Dr.GRPO; ADR-014) rather than
+> hardcoded. **Channel 3 (trace-replay-DPO) is the framework's own additive research
+> channel — not part of Cursor's Composer recipe** (Composer = channels 1 Dr.GRPO + 2
+> SDPO only; ADR-014).
 >
+> **Two remaining honest gaps (NOT closed):**
+> 1. Docker/TorchForge substrate E2E is hardware-blocked — the test exists and skips
+>    cleanly, but lacking a local multi-GPU rig the orchestrator layer is unrun.
+> 2. The 8B LMA channel-ladder is **not fully runnable today**: only **A1 (GRPO-only)**
+>    has a real Modal runner. **A2 (SDPO) / A3 (replay-DPO) / A4 (combined)** are scaffold
+>    + plan-builder only, blocked on dataset construction (a real error-trace SDPO dataset
+>    + a replay-DPO preference corpus + an A100 entrypoint that don't exist yet). The real
+>    8B run is *additionally* user-budget-gated. See
+>    [ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md) +
+>    [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md).
 > **Status:** Self-audit, 2026-05-25 (Wave 6).
 > **Question:** Does what we've built reflect what was originally asked for, or did we drift?
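
The "selectable menu (default Dr.GRPO)" wording in the banner above can be pictured as a small registry. This is only a sketch under stated assumptions: `PO_MENU`, `select_objective`, and the key names are hypothetical and are not the ADR-014 API. The two advantage functions follow the commonly described difference between the objectives (GRPO divides the centered reward by the group standard deviation, Dr.GRPO does not), and the population standard deviation is used here purely for simplicity.

```python
from statistics import mean, pstdev
from typing import Callable

Advantage = Callable[[list[float]], list[float]]


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style: center on the group mean, then divide by the group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards: list[float]) -> list[float]:
    """Dr.GRPO-style: center on the group mean, no std normalization."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


# Hypothetical registry; the keys are illustrative, not the framework's names.
PO_MENU: dict[str, Advantage] = {
    "dr_grpo": dr_grpo_advantages,
    "grpo": grpo_advantages,
}


def select_objective(name: str = "dr_grpo") -> Advantage:
    """Look up an objective by name, defaulting to Dr.GRPO."""
    try:
        return PO_MENU[name]
    except KeyError:
        raise ValueError(
            f"unknown objective {name!r}; choose from {sorted(PO_MENU)}"
        ) from None


advantage_fn = select_objective()          # default entry
print(advantage_fn([1.0, 0.0]))            # [0.5, -0.5]
```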

framework/composer-replication-framework.md CHANGED

@@ -41,7 +41,7 @@ From `01-composer-2.5.md`:
 ## How the 5 component pieces fit together
-For the **rigorous integration architecture** — exact extension points in TRL (`GRPOTrainer._compute_loss` subclass), VeRL (`@register_adv_est` + `DataProto`), the OPSD loss `generalized_jsd_loss` lifted from `siyan-zhao/OPSD`, and the per-channel sequence diagrams — see [`docs/INTEGRATION_ARCHITECTURE.md`](docs/INTEGRATION_ARCHITECTURE.md). A working code skeleton with **38 passing unit tests** verifying the SDPO loss math, the trace-replay DPO-pair extraction, the data collator, and an end-to-end 5-step gradient run that decreases loss with all 3 channels active is at [`spikes/005-integrated-trainer-skeleton/`](spikes/005-integrated-trainer-skeleton/).
 The high-level topology:
@@ -128,7 +128,7 @@ From `05-trace-replay-distillation.md`:
 - Composer's hint-loss = **same-model self-teacher with hint context** pulls student at error sites (~1 extra forward pass / cheap, no API)
 - Trace-replay-loss = **N external teachers** pull student at all sites (or high-uncertainty sites with VOI gating; ~$0.30/trace with gating per spike 001)
-These are **complementary**, not competing. Both give per-step KL signals that bypass the long-horizon credit assignment problem, but they tap different supervision sources. v0.1 of the framework runs both simultaneously. See [`docs/COMPOSER_RECIPE_MAPPING.md`](docs/COMPOSER_RECIPE_MAPPING.md) for the precise mathematical distinction and the implementation-handle table.
 **Cost mitigation** (the report does this analysis well):
 - VOI gating (only query teachers when student entropy is high) → 60-80% savings

 ## How the 5 component pieces fit together
+For the **rigorous integration architecture** — exact extension points in TRL (`GRPOTrainer._compute_loss` subclass), VeRL (`@register_adv_est` + `DataProto`), the OPSD loss `generalized_jsd_loss` lifted from `siyan-zhao/OPSD`, and the per-channel sequence diagrams — see [`docs/INTEGRATION_ARCHITECTURE.md`](../docs/INTEGRATION_ARCHITECTURE.md). A working code skeleton with **38 passing unit tests** verifying the SDPO loss math, the trace-replay DPO-pair extraction, the data collator, and an end-to-end 5-step gradient run that decreases loss with all 3 channels active is at [`spikes/005-integrated-trainer-skeleton/`](../spikes/005-integrated-trainer-skeleton/).
 The high-level topology:
 - Composer's hint-loss = **same-model self-teacher with hint context** pulls student at error sites (~1 extra forward pass / cheap, no API)
 - Trace-replay-loss = **N external teachers** pull student at all sites (or high-uncertainty sites with VOI gating; ~$0.30/trace with gating per spike 001)
+These are **complementary**, not competing. Both give per-step KL signals that bypass the long-horizon credit assignment problem, but they tap different supervision sources. v0.1 of the framework runs both simultaneously. See [`docs/COMPOSER_RECIPE_MAPPING.md`](../docs/COMPOSER_RECIPE_MAPPING.md) for the precise mathematical distinction and the implementation-handle table.
 **Cost mitigation** (the report does this analysis well):
 - VOI gating (only query teachers when student entropy is high) → 60-80% savings
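
The `../` fixes in this file address links that only resolve from the repo root. A check along the following lines could flag such links before they ship. It is a rough sketch, not the tooling used for this commit: `dead_relative_links` is a hypothetical helper, and the regex only handles plain inline `[text](target)` links.

```python
import re
import tempfile
from pathlib import Path

LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
SCHEME_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*:")


def dead_relative_links(md_file: Path) -> list[str]:
    """Return relative link targets in `md_file` that do not exist on disk,
    resolved against the file's own directory."""
    dead = []
    for target in LINK_RE.findall(md_file.read_text(encoding="utf-8")):
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue  # skip absolute URLs (https:, mailto:, ...) and in-page anchors
        path_part = target.split("#", 1)[0]
        if not (md_file.parent / path_part).exists():
            dead.append(target)
    return dead


# Tiny demo tree mirroring the subdirectory situation fixed above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "docs").mkdir()
    (root / "framework").mkdir()
    (root / "docs" / "A.md").write_text("target", encoding="utf-8")
    page = root / "framework" / "page.md"
    page.write_text("[bad](docs/A.md) [good](../docs/A.md)", encoding="utf-8")
    demo_result = dead_relative_links(page)

print(demo_result)  # ['docs/A.md']
```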

publications/HF_DISCUSSION_POST.md CHANGED

@@ -12,15 +12,15 @@ I'm releasing this repo as a **pre-experimental methodology paper + integration
 ## What's in the box right now
-**1. Methodology paper.** [`publications/PAPER_v0.md`](publications/PAPER_v0.md) — longform writeup of the framework, audit of Cursor's blog vs secondary-source extrapolations, integration architecture across TRL/VeRL/OpenEnv, and a risk-ordered spike plan.
-**2. Composer 2.5 blog audit.** [`docs/COMPOSER_RECIPE_MAPPING.md`](docs/COMPOSER_RECIPE_MAPPING.md) — every claim tagged `[BLOG-VERIFIED]` / `[INFERRED]` / `[EXTRAPOLATED]`. Major finding: Cursor's "Targeted RL with Textual Feedback" is **mathematically the same as published SDPO** (Hübotter et al., ICLR 2026; [arXiv:2601.20802](https://arxiv.org/abs/2601.20802)) and OPSD (Zhao et al., [arXiv:2601.18734](https://arxiv.org/abs/2601.18734)), with **MIT-licensed reference code at [`siyan-zhao/OPSD`](https://github.com/siyan-zhao/OPSD)**. Cursor cites both papers in their blog's footnote 1. This was missed by the initial parallel-research dispatch I ran for this project — I only caught it when I read the blog directly.
-**3. Integration architecture doc.** [`docs/INTEGRATION_ARCHITECTURE.md`](docs/INTEGRATION_ARCHITECTURE.md) — verified extension points (via [DeepWiki](https://deepwiki.com/) audits) for TRL, VeRL, TorchForge, Monarch, OpenEnv. Sequence diagrams. The unified loss form `total = grpo + α·sdpo + β·trace_replay_dpo` and an argument that the three channels don't compete for any shared resource.
-**4. Empirical economic feasibility result for the novel channel.** [`spikes/001-teacher-replay-cost/verdict.md`](spikes/001-teacher-replay-cost/verdict.md) — 150 real OpenRouter calls (Opus 4.7 + GPT-5 + DeepSeek V4 Pro on 50 synthetic agentic-coding states), 0 errors, **mean per-trace cost $0.98 with 5× headroom on the $5 cap, p95 step latency 20.5s**. Reproducible: set `OPENROUTER_API_KEY` and run the three scripts.
-**5. Working code skeleton with 38 passing unit tests.** [`spikes/005-integrated-trainer-skeleton/`](spikes/005-integrated-trainer-skeleton/) — ports of OPSD's `generalized_jsd_loss`, the teacher-replay DPO-pair extractor, the data collator, both a `ComposerReplicationTrainer(GRPOTrainer)` for TRL and a `@register_adv_est("grpo_composer")` stub for VeRL. The end-to-end loss-composition smoke test runs all three channels on a 10K-parameter custom model and confirms a 5-step train run *decreases* loss with α=0.1, β=0.05 — the channels don't fight each other.
 ```
 $ python3 -m pytest tests/ -v
@@ -44,7 +44,7 @@ Translation: *I have a framework that compiles, integration that's verified at t
 2. **Adjacent-work pointers.** Does multi-teacher trace-replay-with-disagreement-as-DPO-signal already exist in published work I missed? My survey (rStar / Math-Shepherd / OmegaPRM / Magpie / MoA) didn't find it but absence of evidence isn't evidence of absence.
-3. **Reward-hacking ideas for the v0.1 environment.** Cursor mentions agents decompiling Java bytecode and reverse-engineering Python type-checking caches to recover deleted features in their Feature Deletion env. Their mitigation is opaque ("agentic monitoring tools"). I have proposals in [`PAPER_v0.md` §8](publications/PAPER_v0.md#8-reward-hacking-safeguards-proposed-for-v01) but more eyes welcome.
 4. **Collaboration interest for spike 002.** If you have a Modal account or a small GPU budget and want to run the trace-collection experiments — particularly the head-to-head TRL-vs-PRIME-RL comparison — I'd happily co-author a follow-up paper. Total budget for spikes 002–004 is ~$500 + a couple of weeks of wallclock.

 ## What's in the box right now
+**1. Methodology paper.** [`publications/PAPER_v0.md`](PAPER_v0.md) — longform writeup of the framework, audit of Cursor's blog vs secondary-source extrapolations, integration architecture across TRL/VeRL/OpenEnv, and a risk-ordered spike plan.
+**2. Composer 2.5 blog audit.** [`docs/COMPOSER_RECIPE_MAPPING.md`](../docs/COMPOSER_RECIPE_MAPPING.md) — every claim tagged `[BLOG-VERIFIED]` / `[INFERRED]` / `[EXTRAPOLATED]`. Major finding: Cursor's "Targeted RL with Textual Feedback" is **mathematically the same as published SDPO** (Hübotter et al., ICLR 2026; [arXiv:2601.20802](https://arxiv.org/abs/2601.20802)) and OPSD (Zhao et al., [arXiv:2601.18734](https://arxiv.org/abs/2601.18734)), with **MIT-licensed reference code at [`siyan-zhao/OPSD`](https://github.com/siyan-zhao/OPSD)**. Cursor cites both papers in their blog's footnote 1. This was missed by the initial parallel-research dispatch I ran for this project — I only caught it when I read the blog directly.
+**3. Integration architecture doc.** [`docs/INTEGRATION_ARCHITECTURE.md`](../docs/INTEGRATION_ARCHITECTURE.md) — verified extension points (via [DeepWiki](https://deepwiki.com/) audits) for TRL, VeRL, TorchForge, Monarch, OpenEnv. Sequence diagrams. The unified loss form `total = grpo + α·sdpo + β·trace_replay_dpo` and an argument that the three channels don't compete for any shared resource.
+**4. Empirical economic feasibility result for the novel channel.** [`spikes/001-teacher-replay-cost/verdict.md`](../spikes/001-teacher-replay-cost/verdict.md) — 150 real OpenRouter calls (Opus 4.7 + GPT-5 + DeepSeek V4 Pro on 50 synthetic agentic-coding states), 0 errors, **mean per-trace cost $0.98 with 5× headroom on the $5 cap, p95 step latency 20.5s**. Reproducible: set `OPENROUTER_API_KEY` and run the three scripts.
+**5. Working code skeleton with 38 passing unit tests.** [`spikes/005-integrated-trainer-skeleton/`](../spikes/005-integrated-trainer-skeleton/) — ports of OPSD's `generalized_jsd_loss`, the teacher-replay DPO-pair extractor, the data collator, both a `ComposerReplicationTrainer(GRPOTrainer)` for TRL and a `@register_adv_est("grpo_composer")` stub for VeRL. The end-to-end loss-composition smoke test runs all three channels on a 10K-parameter custom model and confirms a 5-step train run *decreases* loss with α=0.1, β=0.05 — the channels don't fight each other.
 ```
 $ python3 -m pytest tests/ -v
 2. **Adjacent-work pointers.** Does multi-teacher trace-replay-with-disagreement-as-DPO-signal already exist in published work I missed? My survey (rStar / Math-Shepherd / OmegaPRM / Magpie / MoA) didn't find it but absence of evidence isn't evidence of absence.
+3. **Reward-hacking ideas for the v0.1 environment.** Cursor mentions agents decompiling Java bytecode and reverse-engineering Python type-checking caches to recover deleted features in their Feature Deletion env. Their mitigation is opaque ("agentic monitoring tools"). I have proposals in [`PAPER_v0.md` §8](PAPER_v0.md#8-reward-hacking-safeguards-proposed-for-v01) but more eyes welcome.
 4. **Collaboration interest for spike 002.** If you have a Modal account or a small GPU budget and want to run the trace-collection experiments — particularly the head-to-head TRL-vs-PRIME-RL comparison — I'd happily co-author a follow-up paper. Total budget for spikes 002–004 is ~$500 + a couple of weeks of wallclock.
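
The unified loss form quoted in the post, `total = grpo + α·sdpo + β·trace_replay_dpo`, can be written out as a scalar sketch. `total_loss_sketch` is a hypothetical name chosen to avoid implying it is the package's `composer_total_loss`; the defaults reuse the α=0.1, β=0.05 values from the smoke test described above, and the input numbers are made up for illustration.

```python
def total_loss_sketch(
    grpo: float,
    sdpo: float,
    trace_replay_dpo: float,
    alpha: float = 0.1,
    beta: float = 0.05,
) -> float:
    """Weighted sum of the three channel losses."""
    return grpo + alpha * sdpo + beta * trace_replay_dpo


# All three channels active (illustrative loss values only).
three_channel = total_loss_sketch(grpo=1.0, sdpo=2.0, trace_replay_dpo=4.0)

# beta=0 switches off the additive trace-replay-DPO channel, leaving
# channels 1 and 2 only (the combination the commit attributes to Composer).
two_channel = total_loss_sketch(grpo=1.0, sdpo=2.0, trace_replay_dpo=4.0, beta=0.0)

print(three_channel, two_channel)
```

Setting `beta=0.0` is the simplest way to express the provenance distinction this commit draws: the first two terms are the Composer channels, and the third is the framework's own addition.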