Codeseys Claude Opus 4.8 (1M context) committed on
Commit 20e3bd9 · 1 Parent(s): aae66fa

docs(wave1): correctness pass — channel-3 provenance, gap honesty, dead links

Propagate ADR-014 ground truth into the living docs and fix every dead
relative link. Docs-only; no source files touched.

Provenance (Fact A) — stop blurring the additive trace-replay-DPO channel
into Cursor's recipe:
- README v0.1 roadmap cell, HF_REPO_LAYOUT v0/v1 variant rows: reframe
"Composer recipe" = channels 1 (Dr.GRPO) + 2 (SDPO) only; trace-replay
is the framework's own addition (link ADR-014).

Gap honesty (Fact D) — do not imply the full A1-A4 LMA ladder is runnable:
- VISION_VALIDATION status banner + ALTERED_MINDS_TIE_IN Phase-3: only A1
(GRPO-only) has a real Modal runner; A2/A3/A4 are scaffold + plan-builder
only, blocked on dataset construction. Real 8B run is additionally
budget-gated.

PO menu / strip_thinking / main-lag (Facts B, E, F):
- VISION_VALIDATION: base RL objective is a selectable menu (default
Dr.GRPO) per ADR-014, not hardcoded.
- ALTERED_MINDS_TIE_IN: note SDPO requires strip_thinking=False on real
traces (~67% of error-recovery turns are pure thinking).
- USER_GUIDE: add `git checkout master` to the clone+install (HF main lags
master; otherwise ImportError on make_dr_grpo_config).

Freshness + dead links:
- VISION_VALIDATION: stale "210 tests" -> point to the canonical count in
V1_V8_COVERAGE.md (115 + 1 skip-marked); status banner dated to HEAD.
- BACKLOG: fix examples/qwen3_05b_quickstart -> examples/qwen_05b_quickstart.
- INTEGRATION_RECIPES: ADR-007-distillation-losses -> -self-distillation-.
- framework/ and publications/HF_DISCUSSION_POST: 9 root-relative links
that 404 from a subdirectory on HF Hub now use correct ../ prefixes.
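
Why the `../` prefixes matter can be sketched in a few lines. This is a minimal illustration, assuming the Hub resolves a relative Markdown link against the directory of the file that contains it; `resolve_link` is a hypothetical helper and not part of the framework.

```python
import posixpath


def resolve_link(source_file: str, href: str) -> str:
    """Resolve a relative link against the directory of the file that contains it."""
    base_dir = posixpath.dirname(source_file)
    return posixpath.normpath(posixpath.join(base_dir, href))


# A link written as if it were relative to the repo root, placed in a subdirectory file.
broken = resolve_link(
    "publications/HF_DISCUSSION_POST.md", "docs/INTEGRATION_ARCHITECTURE.md"
)
# The same link with the ../ prefix this commit adds.
fixed = resolve_link(
    "publications/HF_DISCUSSION_POST.md", "../docs/INTEGRATION_ARCHITECTURE.md"
)

print(broken)
print(fixed)
```

The first call resolves to a path under `publications/` that does not exist in the repo, which is the 404 described above; the second resolves to the intended file under `docs/`.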

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

BACKLOG.md CHANGED
@@ -71,8 +71,8 @@ Updated 2026-05-29 to reflect shipped waves (ingestion, diloco, packaging, datag
71
  **Acceptance**:
72
  1. `pyproject.toml` at repo root, package name `composer_replication`.
73
  2. `composer_replication/` dir with `__init__.py` re-exporting `composer_total_loss`, `OPSDLoss`, `TeacherReplayBuffer`, `compose_loss`, `TraceIngester`, etc.
74
- 3. `examples/qwen3_05b_quickstart/` with end-to-end script that loads model, runs 10 training steps, prints loss curve.
75
- 4. README quickstart updated to `pip install -e .` + `python examples/qwen3_05b_quickstart/run.py`.
76
  5. `pip install -e .` succeeds and quickstart runs end-to-end on CPU.
77
 
78
  ### Post-Skeleton Waves (Datagen, Alignment, Quality)
 
71
  **Acceptance**:
72
  1. `pyproject.toml` at repo root, package name `composer_replication`.
73
  2. `composer_replication/` dir with `__init__.py` re-exporting `composer_total_loss`, `OPSDLoss`, `TeacherReplayBuffer`, `compose_loss`, `TraceIngester`, etc.
74
+ 3. `examples/qwen_05b_quickstart/` with end-to-end script that loads model, runs 10 training steps, prints loss curve.
75
+ 4. README quickstart updated to `pip install -e .` + `python examples/qwen_05b_quickstart/run.py`.
76
  5. `pip install -e .` succeeds and quickstart runs end-to-end on CPU.
77
 
78
  ### Post-Skeleton Waves (Datagen, Alignment, Quality)
README.md CHANGED
@@ -163,7 +163,7 @@ The novel contribution is channel (3) — no published work systematically repla
163
  | Phase | Timeline | Goal | Trained variant repo | Data repo |
164
  |---|---|---|---|---|
165
  | **v0.0 spike** | 1–2 weeks | Prove trace-replay-DPO beats plain GRPO on Qwen3-7B + SWE-bench-lite | `Codeseys/composer-replication-qwen3-7b-v0` | `Codeseys/composer-replication-traces-v0` |
166
- | **v0.1** | 1–2 months | Full Composer recipe (RLVR + hint-distill + trace-replay) on Qwen3-32B + Feature Deletion env. Match Cursor's ~50% SWE-bench-multilingual at 32B scale. | `Codeseys/composer-replication-qwen3-32b-v1` | `Codeseys/composer-replication-traces-v1` |
167
  | **v0.2** | 3–6 months | Decentralized scaling: Streaming DiLoCo + SHARDCAST + Monarch. Multi-cluster reproduction of v0.1 across **Modal + HF Jobs + on-prem** via the new serverless-DiLoCo executor abstraction (ADR-005). | `Codeseys/composer-replication-qwen3-32b-decentralized` | (re-uses v1 data) |
168
 
169
  Each variant will get its own model repo (LoRA adapters or full fine-tunes) per the [HF multi-artifact research project layout](https://huggingface.co/docs/hub/repositories). This methodology repo will be linked from each variant's README and via an HF Collection once v0.0 produces a result.
 
163
  | Phase | Timeline | Goal | Trained variant repo | Data repo |
164
  |---|---|---|---|---|
165
  | **v0.0 spike** | 1–2 weeks | Prove trace-replay-DPO beats plain GRPO on Qwen3-7B + SWE-bench-lite | `Codeseys/composer-replication-qwen3-7b-v0` | `Codeseys/composer-replication-traces-v0` |
166
+ | **v0.1** | 1–2 months | Full Composer recipe (RLVR + hint-distill, i.e. channels 1–2 = Dr.GRPO + SDPO) **plus** the framework's own additive trace-replay-DPO channel, on Qwen3-32B + Feature Deletion env. Match Cursor's ~50% SWE-bench-multilingual at 32B scale. *(trace-replay-DPO is our addition, not part of Cursor's recipe — see [ADR-014](docs/adrs/ADR-014-policy-optimization-objective-menu.md).)* | `Codeseys/composer-replication-qwen3-32b-v1` | `Codeseys/composer-replication-traces-v1` |
167
  | **v0.2** | 3–6 months | Decentralized scaling: Streaming DiLoCo + SHARDCAST + Monarch. Multi-cluster reproduction of v0.1 across **Modal + HF Jobs + on-prem** via the new serverless-DiLoCo executor abstraction (ADR-005). | `Codeseys/composer-replication-qwen3-32b-decentralized` | (re-uses v1 data) |
168
 
169
  Each variant will get its own model repo (LoRA adapters or full fine-tunes) per the [HF multi-artifact research project layout](https://huggingface.co/docs/hub/repositories). This methodology repo will be linked from each variant's README and via an HF Collection once v0.0 produces a result.
docs/ALTERED_MINDS_TIE_IN.md CHANGED
@@ -111,11 +111,21 @@ only — never rationale style, so distorted-but-persuasive reasoning is not
111
  rewarded), and `dual_kl_logger` (logs KL-to-altered-init **and** KL-to-base each
112
  step — the washout-vs-amplification instrument).
113
 
114
- Train for ~500 steps per arm on a single GPU (Qwen-0.5B feasibility-test
115
- already confirmed; for Llama-8B, use Modal + the framework's `ServerlessExecutor`
116
- per ADR-005; local 5090 is too small). The real 8B/LMA-checkpoint run remains
117
- **user-gated** (it spends grant budget); ADR-013 ships the capability, proven
118
- CPU-only on a small model (`examples/altered_minds_channel_ladder/`).
 
119
 
120
  ### Phase 4 — re-evaluate
121
 
 
111
  rewarded), and `dual_kl_logger` (logs KL-to-altered-init **and** KL-to-base each
112
  step — the washout-vs-amplification instrument).
113
 
114
+ Train for ~500 steps per arm on a single GPU. **Runnability today (HEAD `aae66fa`):**
115
+ only **A1 (GRPO-only)** has a real Modal runner (Qwen-0.5B feasibility-test confirmed;
116
+ for Llama-8B use Modal + the framework's `ServerlessExecutor` per ADR-005; local 5090
117
+ is too small). **A2 (SDPO) / A3 (replay-DPO) / A4 (combined) are scaffold + plan-builder
118
+ only**, blocked on dataset construction — they need a real error-trace SDPO dataset, a
119
+ replay-DPO preference corpus, and an A100 entrypoint that don't exist yet (see
120
+ [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md) acceptance gate). The real
121
+ 8B/LMA-checkpoint run is *additionally* **user-gated** (it spends grant budget). ADR-013
122
+ ships the ladder scaffolding + the A1 capability, proven CPU-only on a small model
123
+ (`examples/altered_minds_channel_ladder/`).
124
+
125
+ > **strip_thinking × SDPO foot-gun (A2/A4).** When the SDPO arms become runnable on real
126
+ > agent traces, SDPO REQUIRES `strip_thinking=False`: ~67% of error-recovery turns are
127
+ > pure thinking, so stripping them yields empty SDPO masks (the channel silently
128
+ > contributes nothing). Keep thinking tokens in the context for any SDPO-active arm.
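
The empty-mask failure mode described in the note can be illustrated with a toy example. This is a sketch under stated assumptions, not the framework's implementation: `Segment`, its `kind` labels, and `sdpo_mask` are hypothetical names, and the only behavior taken from the text is that stripping thinking tokens leaves nothing to supervise on a pure-thinking turn.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    kind: str      # hypothetical labels: "thinking" or "action"
    n_tokens: int  # number of tokens in this segment


def sdpo_mask(turn: List[Segment], strip_thinking: bool) -> List[int]:
    """Return one mask entry per token that remains available to the SDPO loss."""
    mask: List[int] = []
    for seg in turn:
        if strip_thinking and seg.kind == "thinking":
            continue  # stripped tokens never reach the loss
        mask.extend([1] * seg.n_tokens)
    return mask


pure_thinking_turn = [Segment("thinking", 12)]
mixed_turn = [Segment("thinking", 5), Segment("action", 3)]

stripped = sdpo_mask(pure_thinking_turn, strip_thinking=True)
kept = sdpo_mask(pure_thinking_turn, strip_thinking=False)
mixed_stripped = sdpo_mask(mixed_turn, strip_thinking=True)

print(len(stripped), len(kept), len(mixed_stripped))
```

With stripping enabled, the pure-thinking turn produces an empty mask, so a loss averaged over masked tokens has nothing to average; with stripping disabled, all twelve tokens remain.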
129
 
130
  ### Phase 4 — re-evaluate
131
 
docs/HF_REPO_LAYOUT.md CHANGED
@@ -17,7 +17,7 @@ When the v0.0 spike produces a result, the following repos will be created:
17
  | Repo | Type | Created when | Contents |
18
  |---|---|---|---|
19
  | `Codeseys/composer-replication-traces-v0` | dataset | v0.0 spike data is collected | 100 frozen agentic-coding traces (JSON), used for trace-replay-distillation experiments |
20
- | `Codeseys/composer-replication-qwen3-7b-v0` | model | v0.0 spike produces a checkpoint | LoRA adapter or full fine-tune of Qwen3-7B trained with GRPO + trace-replay-DPO |
21
  | `Codeseys/composer-replication-qwen3-7b-v0-baseline` | model | v0.0 spike produces a baseline checkpoint | Same training, plain GRPO only (A/B comparison) |
22
 
23
  After v0.1:
@@ -26,7 +26,7 @@ After v0.1:
26
  |---|---|---|
27
  | `Codeseys/composer-replication-traces-v1` | dataset | Larger trace corpus + Feature-Deletion environment seed repos |
28
  | `Codeseys/composer-replication-feature-deletion-env-v1` | dataset | Repos with passing tests, with deletion masks for the env to apply |
29
- | `Codeseys/composer-replication-qwen3-32b-v1` | model | Full Composer-recipe v1 trained variant |
30
 
31
  All trained-variant repos will:
32
  - Link back to **this repo** (`Codeseys/composer-replication-framework`) in their `README.md` as the methodology source.
 
17
  | Repo | Type | Created when | Contents |
18
  |---|---|---|---|
19
  | `Codeseys/composer-replication-traces-v0` | dataset | v0.0 spike data is collected | 100 frozen agentic-coding traces (JSON), used for trace-replay-distillation experiments |
20
+ | `Codeseys/composer-replication-qwen3-7b-v0` | model | v0.0 spike produces a checkpoint | LoRA adapter or full fine-tune of Qwen3-7B trained with Dr.GRPO + the framework's own additive trace-replay-DPO research channel (channel 3 is our addition, **not** part of Cursor's Composer recipe — see [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md)) |
21
  | `Codeseys/composer-replication-qwen3-7b-v0-baseline` | model | v0.0 spike produces a baseline checkpoint | Same training, plain GRPO only (A/B comparison) |
22
 
23
  After v0.1:
 
26
  |---|---|---|
27
  | `Codeseys/composer-replication-traces-v1` | dataset | Larger trace corpus + Feature-Deletion environment seed repos |
28
  | `Codeseys/composer-replication-feature-deletion-env-v1` | dataset | Repos with passing tests, with deletion masks for the env to apply |
29
+ | `Codeseys/composer-replication-qwen3-32b-v1` | model | Full Composer-channel (Dr.GRPO + SDPO) v1 trained variant, combined with the framework's additive trace-replay-DPO research channel (trace-replay is our addition, not Composer's recipe) |
30
 
31
  All trained-variant repos will:
32
  - Link back to **this repo** (`Codeseys/composer-replication-framework`) in their `README.md` as the methodology source.
docs/INTEGRATION_RECIPES.md CHANGED
@@ -978,7 +978,7 @@ adapter boundary, not because the loss math is wrong.
978
  - ADRs:
979
  [`docs/adrs/ADR-005-serverless-diloco.md`](adrs/ADR-005-serverless-diloco.md),
980
  [`docs/adrs/ADR-006-rl-frameworks.md`](adrs/ADR-006-rl-frameworks.md),
981
- [`docs/adrs/ADR-007-distillation-losses.md`](adrs/ADR-007-distillation-losses.md)
982
 
983
  ---
984
 
 
978
  - ADRs:
979
  [`docs/adrs/ADR-005-serverless-diloco.md`](adrs/ADR-005-serverless-diloco.md),
980
  [`docs/adrs/ADR-006-rl-frameworks.md`](adrs/ADR-006-rl-frameworks.md),
981
+ [`docs/adrs/ADR-007-self-distillation-losses.md`](adrs/ADR-007-self-distillation-losses.md)
982
 
983
  ---
984
 
docs/USER_GUIDE.md CHANGED
@@ -59,9 +59,17 @@ Always start with the core install:
59
  ```bash
60
  git clone https://huggingface.co/Codeseys/composer-replication-framework
61
  cd composer-replication-framework
 
62
  pip install -e .
63
  ```
64
 
65
  That gets you `torch>=2.0` + `transformers>=4.46` and is enough for the
66
  verification harness on CPU (sections 3, 5, 6).
67
 
 
59
  ```bash
60
  git clone https://huggingface.co/Codeseys/composer-replication-framework
61
  cd composer-replication-framework
62
+ git checkout master # HF 'main' LAGS 'master'; without this you ImportError on make_dr_grpo_config
63
  pip install -e .
64
  ```
65
 
66
+ > **Branch foot-gun.** The HF Hub `main` branch lags `master`. Always
67
+ > `git checkout master` (or pin a known-good master SHA) before `pip install -e .`
68
+ > — otherwise newer symbols such as `make_dr_grpo_config` / `make_po_config`
69
+ > are missing and you hit an `ImportError`. The same applies to any Modal /
70
+ > HF-Jobs clone-and-install step. See [`docs/TROUBLESHOOTING.md`](TROUBLESHOOTING.md)
71
+ > and [`docs/HF_REPO_LAYOUT.md`](HF_REPO_LAYOUT.md).
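
A quick pre-flight check for the branch foot-gun could look like the sketch below. It assumes, without confirmation from the source, that `make_dr_grpo_config` and `make_po_config` are importable from the top-level `composer_replication` package; adjust the module path if they live in a subpackage. `missing_symbols` is a hypothetical helper.

```python
import importlib
from typing import Iterable, List


def missing_symbols(module_name: str, names: Iterable[str]) -> List[str]:
    """Return the names that cannot be found on the given module."""
    wanted = list(names)
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Module not installed at all: every requested name is missing.
        return wanted
    return [name for name in wanted if not hasattr(module, name)]


# Assumed location of the symbols; not confirmed by the guide.
missing = missing_symbols(
    "composer_replication", ["make_dr_grpo_config", "make_po_config"]
)
if missing:
    print("Missing symbols (is the checkout on a stale branch?):", missing)
else:
    print("All expected symbols are present.")
```

Running this right after `pip install -e .` surfaces a stale checkout before a longer Modal or HF-Jobs run fails on import.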
72
+
73
  That gets you `torch>=2.0` + `transformers>=4.46` and is enough for the
74
  verification harness on CPU (sections 3, 5, 6).
75
 
docs/VISION_VALIDATION.md CHANGED
@@ -1,11 +1,26 @@
1
  # Vision Validation: Does the Framework Encapsulate the Original Brief?
2
 
3
- > **## Status as of 2026-05-29**
4
- > The framework is past-skeleton: 8 subpackages (`composer_replication/*`), 210 passing tests, and operational end-to-end examples (`gsm8k_grpo`, `sdpo_with_real_traces_production`). The 3-channel loss, layered hint-generation, trace-ingestion, and DiLoCo have all shipped and been cross-family reviewed.
5
  >
6
- > **Two remaining honest gaps:**
7
- > 1. Docker/TorchForge substrate E2E is hardware-blocked (lacking local multi-GPU rig for the orchestrator layer).
8
- > 2. Real LMA full-scale run (8B model, 10k SWE-bench traces) is user-budget-gated.
9
 
10
  > **Status:** Self-audit, 2026-05-25 (Wave 6).
11
  > **Question:** Does what we've built reflect what was originally asked for, or did we drift?
 
1
  # Vision Validation: Does the Framework Encapsulate the Original Brief?
2
 
3
+ > **## Status as of 2026-06-08 (HEAD `aae66fa`, ADR-014)**
4
+ > The framework is past-skeleton: 8 subpackages (`composer_replication/*`), 115 passing
5
+ > tests + 1 skip-marked (see [`docs/V1_V8_COVERAGE.md`](V1_V8_COVERAGE.md) for the
6
+ > canonical count), and operational end-to-end examples (`gsm8k_grpo`,
7
+ > `sdpo_with_real_traces_production`). The 3-channel loss, layered hint-generation,
8
+ > trace-ingestion, and DiLoCo have all shipped and been cross-family reviewed. The base
9
+ > RL objective is now a **selectable menu** (default Dr.GRPO; ADR-014) rather than
10
+ > hardcoded. **Channel 3 (trace-replay-DPO) is the framework's own additive research
11
+ > channel — not part of Cursor's Composer recipe** (Composer = channels 1 Dr.GRPO + 2
12
+ > SDPO only; ADR-014).
13
  >
14
+ > **Two remaining honest gaps (NOT closed):**
15
+ > 1. Docker/TorchForge substrate E2E is hardware-blocked: the test exists and skips
16
+ > cleanly, but, lacking a local multi-GPU rig, the orchestrator layer remains unrun.
17
+ > 2. The 8B LMA channel-ladder is **not fully runnable today**: only **A1 (GRPO-only)**
18
+ > has a real Modal runner. **A2 (SDPO) / A3 (replay-DPO) / A4 (combined)** are scaffold
19
+ > + plan-builder only, blocked on dataset construction (a real error-trace SDPO dataset
20
+ > + a replay-DPO preference corpus + an A100 entrypoint that don't exist yet). The real
21
+ > 8B run is *additionally* user-budget-gated. See
22
+ > [ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md) +
23
+ > [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md).
24
 
25
  > **Status:** Self-audit, 2026-05-25 (Wave 6).
26
  > **Question:** Does what we've built reflect what was originally asked for, or did we drift?
framework/composer-replication-framework.md CHANGED
@@ -41,7 +41,7 @@ From `01-composer-2.5.md`:
41
 
42
  ## How the 5 component pieces fit together
43
 
44
- For the **rigorous integration architecture** — exact extension points in TRL (`GRPOTrainer._compute_loss` subclass), VeRL (`@register_adv_est` + `DataProto`), the OPSD loss `generalized_jsd_loss` lifted from `siyan-zhao/OPSD`, and the per-channel sequence diagrams — see [`docs/INTEGRATION_ARCHITECTURE.md`](docs/INTEGRATION_ARCHITECTURE.md). A working code skeleton with **38 passing unit tests** verifying the SDPO loss math, the trace-replay DPO-pair extraction, the data collator, and an end-to-end 5-step gradient run that decreases loss with all 3 channels active is at [`spikes/005-integrated-trainer-skeleton/`](spikes/005-integrated-trainer-skeleton/).
45
 
46
  The high-level topology:
47
 
@@ -128,7 +128,7 @@ From `05-trace-replay-distillation.md`:
128
  - Composer's hint-loss = **same-model self-teacher with hint context** pulls student at error sites (~1 extra forward pass / cheap, no API)
129
  - Trace-replay-loss = **N external teachers** pull student at all sites (or high-uncertainty sites with VOI gating; ~$0.30/trace with gating per spike 001)
130
 
131
- These are **complementary**, not competing. Both give per-step KL signals that bypass the long-horizon credit assignment problem, but they tap different supervision sources. v0.1 of the framework runs both simultaneously. See [`docs/COMPOSER_RECIPE_MAPPING.md`](docs/COMPOSER_RECIPE_MAPPING.md) for the precise mathematical distinction and the implementation-handle table.
132
 
133
  **Cost mitigation** (the report does this analysis well):
134
  - VOI gating (only query teachers when student entropy is high) → 60-80% savings
 
41
 
42
  ## How the 5 component pieces fit together
43
 
44
+ For the **rigorous integration architecture** — exact extension points in TRL (`GRPOTrainer._compute_loss` subclass), VeRL (`@register_adv_est` + `DataProto`), the OPSD loss `generalized_jsd_loss` lifted from `siyan-zhao/OPSD`, and the per-channel sequence diagrams — see [`docs/INTEGRATION_ARCHITECTURE.md`](../docs/INTEGRATION_ARCHITECTURE.md). A working code skeleton with **38 passing unit tests** verifying the SDPO loss math, the trace-replay DPO-pair extraction, the data collator, and an end-to-end 5-step gradient run that decreases loss with all 3 channels active is at [`spikes/005-integrated-trainer-skeleton/`](../spikes/005-integrated-trainer-skeleton/).
45
 
46
  The high-level topology:
47
 
 
128
  - Composer's hint-loss = **same-model self-teacher with hint context** pulls student at error sites (~1 extra forward pass / cheap, no API)
129
  - Trace-replay-loss = **N external teachers** pull student at all sites (or high-uncertainty sites with VOI gating; ~$0.30/trace with gating per spike 001)
130
 
131
+ These are **complementary**, not competing. Both give per-step KL signals that bypass the long-horizon credit assignment problem, but they tap different supervision sources. v0.1 of the framework runs both simultaneously. See [`docs/COMPOSER_RECIPE_MAPPING.md`](../docs/COMPOSER_RECIPE_MAPPING.md) for the precise mathematical distinction and the implementation-handle table.
132
 
133
  **Cost mitigation** (the report does this analysis well):
134
  - VOI gating (only query teachers when student entropy is high) → 60-80% savings
publications/HF_DISCUSSION_POST.md CHANGED
@@ -12,15 +12,15 @@ I'm releasing this repo as a **pre-experimental methodology paper + integration
12
 
13
  ## What's in the box right now
14
 
15
- **1. Methodology paper.** [`publications/PAPER_v0.md`](publications/PAPER_v0.md) — longform writeup of the framework, audit of Cursor's blog vs secondary-source extrapolations, integration architecture across TRL/VeRL/OpenEnv, and a risk-ordered spike plan.
16
 
17
- **2. Composer 2.5 blog audit.** [`docs/COMPOSER_RECIPE_MAPPING.md`](docs/COMPOSER_RECIPE_MAPPING.md) — every claim tagged `[BLOG-VERIFIED]` / `[INFERRED]` / `[EXTRAPOLATED]`. Major finding: Cursor's "Targeted RL with Textual Feedback" is **mathematically the same as published SDPO** (Hübotter et al., ICLR 2026; [arXiv:2601.20802](https://arxiv.org/abs/2601.20802)) and OPSD (Zhao et al., [arXiv:2601.18734](https://arxiv.org/abs/2601.18734)), with **MIT-licensed reference code at [`siyan-zhao/OPSD`](https://github.com/siyan-zhao/OPSD)**. Cursor cites both papers in their blog's footnote 1. This was missed by the initial parallel-research dispatch I ran for this project — I only caught it when I read the blog directly.
18
 
19
- **3. Integration architecture doc.** [`docs/INTEGRATION_ARCHITECTURE.md`](docs/INTEGRATION_ARCHITECTURE.md) — verified extension points (via [DeepWiki](https://deepwiki.com/) audits) for TRL, VeRL, TorchForge, Monarch, OpenEnv. Sequence diagrams. The unified loss form `total = grpo + α·sdpo + β·trace_replay_dpo` and an argument that the three channels don't compete for any shared resource.
20
 
21
- **4. Empirical economic feasibility result for the novel channel.** [`spikes/001-teacher-replay-cost/verdict.md`](spikes/001-teacher-replay-cost/verdict.md) — 150 real OpenRouter calls (Opus 4.7 + GPT-5 + DeepSeek V4 Pro on 50 synthetic agentic-coding states), 0 errors, **mean per-trace cost $0.98 with 5× headroom on the $5 cap, p95 step latency 20.5s**. Reproducible: set `OPENROUTER_API_KEY` and run the three scripts.
22
 
23
- **5. Working code skeleton with 38 passing unit tests.** [`spikes/005-integrated-trainer-skeleton/`](spikes/005-integrated-trainer-skeleton/) — ports of OPSD's `generalized_jsd_loss`, the teacher-replay DPO-pair extractor, the data collator, both a `ComposerReplicationTrainer(GRPOTrainer)` for TRL and a `@register_adv_est("grpo_composer")` stub for VeRL. The end-to-end loss-composition smoke test runs all three channels on a 10K-parameter custom model and confirms a 5-step train run *decreases* loss with α=0.1, β=0.05 — the channels don't fight each other.
24
 
25
  ```
26
  $ python3 -m pytest tests/ -v
@@ -44,7 +44,7 @@ Translation: *I have a framework that compiles, integration that's verified at t
44
 
45
  2. **Adjacent-work pointers.** Does multi-teacher trace-replay-with-disagreement-as-DPO-signal already exist in published work I missed? My survey (rStar / Math-Shepherd / OmegaPRM / Magpie / MoA) didn't find it but absence of evidence isn't evidence of absence.
46
 
47
- 3. **Reward-hacking ideas for the v0.1 environment.** Cursor mentions agents decompiling Java bytecode and reverse-engineering Python type-checking caches to recover deleted features in their Feature Deletion env. Their mitigation is opaque ("agentic monitoring tools"). I have proposals in [`PAPER_v0.md` §8](publications/PAPER_v0.md#8-reward-hacking-safeguards-proposed-for-v01) but more eyes welcome.
48
 
49
  4. **Collaboration interest for spike 002.** If you have a Modal account or a small GPU budget and want to run the trace-collection experiments — particularly the head-to-head TRL-vs-PRIME-RL comparison — I'd happily co-author a follow-up paper. Total budget for spikes 002–004 is ~$500 + a couple of weeks of wallclock.
50
 
 
12
 
13
  ## What's in the box right now
14
 
15
+ **1. Methodology paper.** [`publications/PAPER_v0.md`](PAPER_v0.md) — longform writeup of the framework, audit of Cursor's blog vs secondary-source extrapolations, integration architecture across TRL/VeRL/OpenEnv, and a risk-ordered spike plan.
16
 
17
+ **2. Composer 2.5 blog audit.** [`docs/COMPOSER_RECIPE_MAPPING.md`](../docs/COMPOSER_RECIPE_MAPPING.md) — every claim tagged `[BLOG-VERIFIED]` / `[INFERRED]` / `[EXTRAPOLATED]`. Major finding: Cursor's "Targeted RL with Textual Feedback" is **mathematically the same as published SDPO** (Hübotter et al., ICLR 2026; [arXiv:2601.20802](https://arxiv.org/abs/2601.20802)) and OPSD (Zhao et al., [arXiv:2601.18734](https://arxiv.org/abs/2601.18734)), with **MIT-licensed reference code at [`siyan-zhao/OPSD`](https://github.com/siyan-zhao/OPSD)**. Cursor cites both papers in their blog's footnote 1. This was missed by the initial parallel-research dispatch I ran for this project — I only caught it when I read the blog directly.
18
 
19
+ **3. Integration architecture doc.** [`docs/INTEGRATION_ARCHITECTURE.md`](../docs/INTEGRATION_ARCHITECTURE.md) — verified extension points (via [DeepWiki](https://deepwiki.com/) audits) for TRL, VeRL, TorchForge, Monarch, OpenEnv. Sequence diagrams. The unified loss form `total = grpo + α·sdpo + β·trace_replay_dpo` and an argument that the three channels don't compete for any shared resource.
20
 
21
+ **4. Empirical economic feasibility result for the novel channel.** [`spikes/001-teacher-replay-cost/verdict.md`](../spikes/001-teacher-replay-cost/verdict.md) — 150 real OpenRouter calls (Opus 4.7 + GPT-5 + DeepSeek V4 Pro on 50 synthetic agentic-coding states), 0 errors, **mean per-trace cost $0.98 with 5× headroom on the $5 cap, p95 step latency 20.5s**. Reproducible: set `OPENROUTER_API_KEY` and run the three scripts.
22
 
23
+ **5. Working code skeleton with 38 passing unit tests.** [`spikes/005-integrated-trainer-skeleton/`](../spikes/005-integrated-trainer-skeleton/) — ports of OPSD's `generalized_jsd_loss`, the teacher-replay DPO-pair extractor, the data collator, both a `ComposerReplicationTrainer(GRPOTrainer)` for TRL and a `@register_adv_est("grpo_composer")` stub for VeRL. The end-to-end loss-composition smoke test runs all three channels on a 10K-parameter custom model and confirms a 5-step train run *decreases* loss with α=0.1, β=0.05 — the channels don't fight each other.
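
The unified loss form quoted in item 3 can be written out as a plain-Python sketch. This only restates `total = grpo + α·sdpo + β·trace_replay_dpo` with the α=0.1, β=0.05 values from the smoke test; `total_loss_sketch` is a hypothetical name and makes no claim about the signature of the repo's own `compose_loss` or `composer_total_loss`.

```python
def total_loss_sketch(
    grpo: float,
    sdpo: float,
    trace_replay_dpo: float,
    alpha: float = 0.1,
    beta: float = 0.05,
) -> float:
    """Weighted sum of the three channel losses: grpo + alpha*sdpo + beta*trace_replay_dpo."""
    return grpo + alpha * sdpo + beta * trace_replay_dpo


# Illustrative per-channel values only; these are not measured results.
example_total = total_loss_sketch(grpo=1.0, sdpo=2.0, trace_replay_dpo=4.0)
grpo_only = total_loss_sketch(grpo=1.0, sdpo=2.0, trace_replay_dpo=4.0, alpha=0.0, beta=0.0)

print(example_total, grpo_only)
```

Setting both weights to zero recovers the plain GRPO objective, which is the baseline arm the v0.0 spike compares against.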
24
 
25
  ```
26
  $ python3 -m pytest tests/ -v
 
44
 
45
  2. **Adjacent-work pointers.** Does multi-teacher trace-replay-with-disagreement-as-DPO-signal already exist in published work I missed? My survey (rStar / Math-Shepherd / OmegaPRM / Magpie / MoA) didn't find it but absence of evidence isn't evidence of absence.
46
 
47
+ 3. **Reward-hacking ideas for the v0.1 environment.** Cursor mentions agents decompiling Java bytecode and reverse-engineering Python type-checking caches to recover deleted features in their Feature Deletion env. Their mitigation is opaque ("agentic monitoring tools"). I have proposals in [`PAPER_v0.md` §8](PAPER_v0.md#8-reward-hacking-safeguards-proposed-for-v01) but more eyes welcome.
48
 
49
  4. **Collaboration interest for spike 002.** If you have a Modal account or a small GPU budget and want to run the trace-collection experiments — particularly the head-to-head TRL-vs-PRIME-RL comparison — I'd happily co-author a follow-up paper. Total budget for spikes 002–004 is ~$500 + a couple of weeks of wallclock.
50