docs(wave1): correctness pass — channel-3 provenance, gap honesty, dead links
Propagate ADR-014 ground truth into the living docs and fix every dead
relative link. Docs-only; no source files touched.
Provenance (Fact A) — stop blurring the additive trace-replay-DPO channel
into Cursor's recipe:
- README v0.1 roadmap cell, HF_REPO_LAYOUT v0/v1 variant rows: reframe
"Composer recipe" = channels 1 (Dr.GRPO) + 2 (SDPO) only; trace-replay
is the framework's own addition (link ADR-014).
Gap honesty (Fact D) — do not imply the full A1-A4 LMA ladder is runnable:
- VISION_VALIDATION status banner + ALTERED_MINDS_TIE_IN Phase-3: only A1
(GRPO-only) has a real Modal runner; A2/A3/A4 are scaffold + plan-builder
only, blocked on dataset construction. Real 8B run is additionally
budget-gated.
PO menu / strip_thinking / main-lag (Facts B, E, F):
- VISION_VALIDATION: base RL objective is a selectable menu (default
Dr.GRPO) per ADR-014, not hardcoded.
- ALTERED_MINDS_TIE_IN: note SDPO requires strip_thinking=False on real
traces (~67% of error-recovery turns are pure thinking).
- USER_GUIDE: add `git checkout master` to the clone+install (HF main lags
master; otherwise ImportError on make_dr_grpo_config).
Freshness + dead links:
- VISION_VALIDATION: stale "210 tests" -> point to the canonical count in
V1_V8_COVERAGE.md (115 + 1 skip-marked); status banner dated to HEAD.
- BACKLOG: fix examples/qwen3_05b_quickstart -> examples/qwen_05b_quickstart.
- INTEGRATION_RECIPES: ADR-007-distillation-losses -> -self-distillation-.
- framework/ and publications/HF_DISCUSSION_POST: 9 root-relative links
that 404 from a subdirectory on HF Hub now use correct ../ prefixes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
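
The dead-link fix described in this commit message could be guarded by a small check. The following is a minimal sketch, not part of the repository: `dead_relative_links` is a hypothetical helper that resolves each relative Markdown link against the directory of the file that contains it (the behaviour the commit message describes for files in a subdirectory) and reports targets that do not exist on disk. It assumes plain inline `[text](target)` links and ignores reference-style links.

```python
import re
from pathlib import Path

# Inline Markdown links: [text](target). Reference-style links are not handled.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Anything that starts with a URL scheme (https:, mailto:, ...) is not a file path.
SCHEME_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*:")


def dead_relative_links(markdown_file):
    """Return relative link targets in `markdown_file` that do not exist on disk.

    Hypothetical helper: targets are resolved against the file's own directory,
    so `docs/X.md` inside `framework/page.md` means `framework/docs/X.md`.
    """
    markdown_file = Path(markdown_file)
    dead = []
    for target in LINK_RE.findall(markdown_file.read_text(encoding="utf-8")):
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue  # external URL or in-page anchor
        path_part = target.split("#", 1)[0]  # drop any #fragment
        if not (markdown_file.parent / path_part).resolve().exists():
            dead.append(target)
    return dead
```

Run against `framework/composer-replication-framework.md` before this commit, a check like this would be expected to flag `docs/INTEGRATION_ARCHITECTURE.md` and accept `../docs/INTEGRATION_ARCHITECTURE.md`, assuming the target file exists in the checkout.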
- BACKLOG.md +2 -2
- README.md +1 -1
- docs/ALTERED_MINDS_TIE_IN.md +15 -5
- docs/HF_REPO_LAYOUT.md +2 -2
- docs/INTEGRATION_RECIPES.md +1 -1
- docs/USER_GUIDE.md +8 -0
- docs/VISION_VALIDATION.md +20 -5
- framework/composer-replication-framework.md +2 -2
- publications/HF_DISCUSSION_POST.md +6 -6
|
@@ -71,8 +71,8 @@ Updated 2026-05-29 to reflect shipped waves (ingestion, diloco, packaging, datag
|
|
| 71 |
**Acceptance**:
|
| 72 |
1. `pyproject.toml` at repo root, package name `composer_replication`.
|
| 73 |
2. `composer_replication/` dir with `__init__.py` re-exporting `composer_total_loss`, `OPSDLoss`, `TeacherReplayBuffer`, `compose_loss`, `TraceIngester`, etc.
|
| 74 |
-
3. `examples/
|
| 75 |
-
4. README quickstart updated to `pip install -e .` + `python examples/
|
| 76 |
5. `pip install -e .` succeeds and quickstart runs end-to-end on CPU.
|
| 77 |
|
| 78 |
### Post-Skeleton Waves (Datagen, Alignment, Quality)
|
|
|
|
| 71 |
**Acceptance**:
|
| 72 |
1. `pyproject.toml` at repo root, package name `composer_replication`.
|
| 73 |
2. `composer_replication/` dir with `__init__.py` re-exporting `composer_total_loss`, `OPSDLoss`, `TeacherReplayBuffer`, `compose_loss`, `TraceIngester`, etc.
|
| 74 |
+
3. `examples/qwen_05b_quickstart/` with end-to-end script that loads model, runs 10 training steps, prints loss curve.
|
| 75 |
+
4. README quickstart updated to `pip install -e .` + `python examples/qwen_05b_quickstart/run.py`.
|
| 76 |
5. `pip install -e .` succeeds and quickstart runs end-to-end on CPU.
|
| 77 |
|
| 78 |
### Post-Skeleton Waves (Datagen, Alignment, Quality)
|
|
@@ -163,7 +163,7 @@ The novel contribution is channel (3) — no published work systematically repla
|
|
| 163 |
| Phase | Timeline | Goal | Trained variant repo | Data repo |
|
| 164 |
|---|---|---|---|---|
|
| 165 |
| **v0.0 spike** | 1–2 weeks | Prove trace-replay-DPO beats plain GRPO on Qwen3-7B + SWE-bench-lite | `Codeseys/composer-replication-qwen3-7b-v0` | `Codeseys/composer-replication-traces-v0` |
|
| 166 |
-
| **v0.1** | 1–2 months | Full Composer recipe (RLVR + hint-distill + trace-replay
|
| 167 |
| **v0.2** | 3–6 months | Decentralized scaling: Streaming DiLoCo + SHARDCAST + Monarch. Multi-cluster reproduction of v0.1 across **Modal + HF Jobs + on-prem** via the new serverless-DiLoCo executor abstraction (ADR-005). | `Codeseys/composer-replication-qwen3-32b-decentralized` | (re-uses v1 data) |
|
| 168 |
|
| 169 |
Each variant will get its own model repo (LoRA adapters or full fine-tunes) per the [HF multi-artifact research project layout](https://huggingface.co/docs/hub/repositories). This methodology repo will be linked from each variant's README and via an HF Collection once v0.0 produces a result.
|
|
|
|
| 163 |
| Phase | Timeline | Goal | Trained variant repo | Data repo |
|
| 164 |
|---|---|---|---|---|
|
| 165 |
| **v0.0 spike** | 1–2 weeks | Prove trace-replay-DPO beats plain GRPO on Qwen3-7B + SWE-bench-lite | `Codeseys/composer-replication-qwen3-7b-v0` | `Codeseys/composer-replication-traces-v0` |
|
| 166 |
+
| **v0.1** | 1–2 months | Full Composer recipe (RLVR + hint-distill, i.e. channels 1–2 = Dr.GRPO + SDPO) **plus** the framework's own additive trace-replay-DPO channel, on Qwen3-32B + Feature Deletion env. Match Cursor's ~50% SWE-bench-multilingual at 32B scale. *(trace-replay-DPO is our addition, not part of Cursor's recipe — see [ADR-014](docs/adrs/ADR-014-policy-optimization-objective-menu.md).)* | `Codeseys/composer-replication-qwen3-32b-v1` | `Codeseys/composer-replication-traces-v1` |
|
| 167 |
| **v0.2** | 3–6 months | Decentralized scaling: Streaming DiLoCo + SHARDCAST + Monarch. Multi-cluster reproduction of v0.1 across **Modal + HF Jobs + on-prem** via the new serverless-DiLoCo executor abstraction (ADR-005). | `Codeseys/composer-replication-qwen3-32b-decentralized` | (re-uses v1 data) |
|
| 168 |
|
| 169 |
Each variant will get its own model repo (LoRA adapters or full fine-tunes) per the [HF multi-artifact research project layout](https://huggingface.co/docs/hub/repositories). This methodology repo will be linked from each variant's README and via an HF Collection once v0.0 produces a result.
|
|
@@ -111,11 +111,21 @@ only — never rationale style, so distorted-but-persuasive reasoning is not
|
|
| 111 |
rewarded), and `dual_kl_logger` (logs KL-to-altered-init **and** KL-to-base each
|
| 112 |
step — the washout-vs-amplification instrument).
|
| 113 |
|
| 114 |
-
Train for ~500 steps per arm on a single GPU (
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
**
|
| 118 |
-
|
| 119 |
|
| 120 |
### Phase 4 — re-evaluate
|
| 121 |
|
|
|
|
| 111 |
rewarded), and `dual_kl_logger` (logs KL-to-altered-init **and** KL-to-base each
|
| 112 |
step — the washout-vs-amplification instrument).
|
| 113 |
|
| 114 |
+
Train for ~500 steps per arm on a single GPU. **Runnability today (HEAD `aae66fa`):**
|
| 115 |
+
only **A1 (GRPO-only)** has a real Modal runner (Qwen-0.5B feasibility-test confirmed;
|
| 116 |
+
for Llama-8B use Modal + the framework's `ServerlessExecutor` per ADR-005 — local 5090
|
| 117 |
+
is too small). **A2 (SDPO) / A3 (replay-DPO) / A4 (combined) are scaffold + plan-builder
|
| 118 |
+
only**, blocked on dataset construction — they need a real error-trace SDPO dataset, a
|
| 119 |
+
replay-DPO preference corpus, and an A100 entrypoint that don't exist yet (see
|
| 120 |
+
[ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md) acceptance gate). The real
|
| 121 |
+
8B/LMA-checkpoint run is *additionally* **user-gated** (it spends grant budget). ADR-013
|
| 122 |
+
ships the ladder scaffolding + the A1 capability, proven CPU-only on a small model
|
| 123 |
+
(`examples/altered_minds_channel_ladder/`).
|
| 124 |
+
|
| 125 |
+
> **strip_thinking × SDPO foot-gun (A2/A4).** When the SDPO arms become runnable on real
|
| 126 |
+
> agent traces, SDPO REQUIRES `strip_thinking=False`: ~67% of error-recovery turns are
|
| 127 |
+
> pure thinking, so stripping them yields empty SDPO masks (the channel silently
|
| 128 |
+
> contributes nothing). Keep thinking tokens in the context for any SDPO-active arm.
|
| 129 |
|
| 130 |
### Phase 4 — re-evaluate
|
| 131 |
|
|
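
The `strip_thinking` foot-gun added in the hunk above can be illustrated with a toy model. This is a sketch under stated assumptions, not the framework's implementation: `Segment`, the `"thinking"` / `"response"` labels, `sdpo_loss_tokens`, and `empty_mask_fraction` are all hypothetical names introduced here. It only shows why a turn that consists purely of thinking tokens leaves nothing for the SDPO loss once thinking is stripped.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    kind: str          # hypothetical labels: "thinking" or "response"
    tokens: List[str]


def sdpo_loss_tokens(turn, strip_thinking):
    """Tokens of one turn that remain eligible for the SDPO loss."""
    kept = []
    for segment in turn:
        if strip_thinking and segment.kind == "thinking":
            continue  # stripped segments contribute no loss positions
        kept.extend(segment.tokens)
    return kept


def empty_mask_fraction(turns, strip_thinking):
    """Fraction of turns whose SDPO mask would be empty."""
    empty = sum(1 for turn in turns if not sdpo_loss_tokens(turn, strip_thinking))
    return empty / len(turns)


# Toy data: two pure-thinking error-recovery turns and one mixed turn.
pure_thinking = [Segment("thinking", ["the", "test", "failed", "because"])]
mixed = [Segment("thinking", ["retry"]), Segment("response", ["edit", "file"])]
turns = [pure_thinking, pure_thinking, mixed]
```

In this toy data, stripping thinking empties the mask for the two pure-thinking turns, while `strip_thinking=False` keeps every turn usable. The toy ratio is chosen for illustration only and is not a measurement.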
@@ -17,7 +17,7 @@ When the v0.0 spike produces a result, the following repos will be created:
|
|
| 17 |
| Repo | Type | Created when | Contents |
|
| 18 |
|---|---|---|---|
|
| 19 |
| `Codeseys/composer-replication-traces-v0` | dataset | v0.0 spike data is collected | 100 frozen agentic-coding traces (JSON), used for trace-replay-distillation experiments |
|
| 20 |
-
| `Codeseys/composer-replication-qwen3-7b-v0` | model | v0.0 spike produces a checkpoint | LoRA adapter or full fine-tune of Qwen3-7B trained with GRPO + trace-replay-DPO |
|
| 21 |
| `Codeseys/composer-replication-qwen3-7b-v0-baseline` | model | v0.0 spike produces a baseline checkpoint | Same training, plain GRPO only (A/B comparison) |
|
| 22 |
|
| 23 |
After v0.1:
|
|
@@ -26,7 +26,7 @@ After v0.1:
|
|
| 26 |
|---|---|---|
|
| 27 |
| `Codeseys/composer-replication-traces-v1` | dataset | Larger trace corpus + Feature-Deletion environment seed repos |
|
| 28 |
| `Codeseys/composer-replication-feature-deletion-env-v1` | dataset | Repos with passing tests, with deletion masks for the env to apply |
|
| 29 |
-
| `Codeseys/composer-replication-qwen3-32b-v1` | model | Full Composer-
|
| 30 |
|
| 31 |
All trained-variant repos will:
|
| 32 |
- Link back to **this repo** (`Codeseys/composer-replication-framework`) in their `README.md` as the methodology source.
|
|
|
|
| 17 |
| Repo | Type | Created when | Contents |
|
| 18 |
|---|---|---|---|
|
| 19 |
| `Codeseys/composer-replication-traces-v0` | dataset | v0.0 spike data is collected | 100 frozen agentic-coding traces (JSON), used for trace-replay-distillation experiments |
|
| 20 |
+
| `Codeseys/composer-replication-qwen3-7b-v0` | model | v0.0 spike produces a checkpoint | LoRA adapter or full fine-tune of Qwen3-7B trained with Dr.GRPO + the framework's own additive trace-replay-DPO research channel (channel 3 is our addition, **not** part of Cursor's Composer recipe — see [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md)) |
|
| 21 |
| `Codeseys/composer-replication-qwen3-7b-v0-baseline` | model | v0.0 spike produces a baseline checkpoint | Same training, plain GRPO only (A/B comparison) |
|
| 22 |
|
| 23 |
After v0.1:
|
|
|
|
| 26 |
|---|---|---|
|
| 27 |
| `Codeseys/composer-replication-traces-v1` | dataset | Larger trace corpus + Feature-Deletion environment seed repos |
|
| 28 |
| `Codeseys/composer-replication-feature-deletion-env-v1` | dataset | Repos with passing tests, with deletion masks for the env to apply |
|
| 29 |
+
| `Codeseys/composer-replication-qwen3-32b-v1` | model | Full Composer-channel (Dr.GRPO + SDPO) v1 trained variant, combined with the framework's additive trace-replay-DPO research channel (trace-replay is our addition, not Composer's recipe) |
|
| 30 |
|
| 31 |
All trained-variant repos will:
|
| 32 |
- Link back to **this repo** (`Codeseys/composer-replication-framework`) in their `README.md` as the methodology source.
|
|
@@ -978,7 +978,7 @@ adapter boundary, not because the loss math is wrong.
|
|
| 978 |
- ADRs:
|
| 979 |
[`docs/adrs/ADR-005-serverless-diloco.md`](adrs/ADR-005-serverless-diloco.md),
|
| 980 |
[`docs/adrs/ADR-006-rl-frameworks.md`](adrs/ADR-006-rl-frameworks.md),
|
| 981 |
-
[`docs/adrs/ADR-007-distillation-losses.md`](adrs/ADR-007-distillation-losses.md)
|
| 982 |
|
| 983 |
---
|
| 984 |
|
|
|
|
| 978 |
- ADRs:
|
| 979 |
[`docs/adrs/ADR-005-serverless-diloco.md`](adrs/ADR-005-serverless-diloco.md),
|
| 980 |
[`docs/adrs/ADR-006-rl-frameworks.md`](adrs/ADR-006-rl-frameworks.md),
|
| 981 |
+
[`docs/adrs/ADR-007-self-distillation-losses.md`](adrs/ADR-007-self-distillation-losses.md)
|
| 982 |
|
| 983 |
---
|
| 984 |
|
|
@@ -59,9 +59,17 @@ Always start with the core install:
|
|
| 59 |
```bash
|
| 60 |
git clone https://huggingface.co/Codeseys/composer-replication-framework
|
| 61 |
cd composer-replication-framework
|
|
|
|
| 62 |
pip install -e .
|
| 63 |
```
|
| 64 |
|
| 65 |
That gets you `torch>=2.0` + `transformers>=4.46` and is enough for the
|
| 66 |
verification harness on CPU (sections 3, 5, 6).
|
| 67 |
|
|
|
|
| 59 |
```bash
|
| 60 |
git clone https://huggingface.co/Codeseys/composer-replication-framework
|
| 61 |
cd composer-replication-framework
|
| 62 |
+
git checkout master # HF 'main' LAGS 'master'; without this you ImportError on make_dr_grpo_config
|
| 63 |
pip install -e .
|
| 64 |
```
|
| 65 |
|
| 66 |
+
> **Branch foot-gun.** The HF Hub `main` branch lags `master`. Always
|
| 67 |
+
> `git checkout master` (or pin a known-good master SHA) before `pip install -e .`
|
| 68 |
+
> — otherwise newer symbols such as `make_dr_grpo_config` / `make_po_config`
|
| 69 |
+
> are missing and you hit an `ImportError`. The same applies to any Modal /
|
| 70 |
+
> HF-Jobs clone-and-install step. See [`docs/TROUBLESHOOTING.md`](TROUBLESHOOTING.md)
|
| 71 |
+
> and [`docs/HF_REPO_LAYOUT.md`](HF_REPO_LAYOUT.md).
|
| 72 |
+
|
| 73 |
That gets you `torch>=2.0` + `transformers>=4.46` and is enough for the
|
| 74 |
verification harness on CPU (sections 3, 5, 6).
|
| 75 |
|
|
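
The branch foot-gun added in the hunk above can be caught before a long job starts. The following is a minimal sketch, assuming the newer symbols are exposed at the top level of the `composer_replication` package; the diff names `make_dr_grpo_config` and `make_po_config` but does not say which module exports them, so adjust the module name if they live elsewhere. `missing_symbols` and `check_install` are hypothetical helpers, not part of the framework.

```python
import importlib


def missing_symbols(module_name, names):
    """Return the subset of `names` that `module_name` does not expose."""
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


def check_install(module_name="composer_replication",
                  names=("make_dr_grpo_config", "make_po_config")):
    """Return "ok", or a message explaining what looks wrong with the install."""
    try:
        missing = missing_symbols(module_name, names)
    except ImportError as exc:
        return f"{module_name} is not importable: {exc}"
    if missing:
        return (f"{module_name} lacks {missing}; the checkout may be on the "
                "lagging 'main' branch, try `git checkout master` and reinstall")
    return "ok"


if __name__ == "__main__":
    print(check_install())
```

Running this right after `pip install -e .` in a Modal or HF-Jobs entrypoint would surface the stale-branch case as a readable message instead of a later `ImportError`.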
@@ -1,11 +1,26 @@
|
|
| 1 |
# Vision Validation: Does the Framework Encapsulate the Original Brief?
|
| 2 |
|
| 3 |
-
> **## Status as of 2026-
|
| 4 |
-
> The framework is past-skeleton: 8 subpackages (`composer_replication/*`),
|
| 5 |
>
|
| 6 |
-
> **Two remaining honest gaps:**
|
| 7 |
-
> 1. Docker/TorchForge substrate E2E is hardware-blocked
|
| 8 |
-
>
|
| 9 |
|
| 10 |
> **Status:** Self-audit, 2026-05-25 (Wave 6).
|
| 11 |
> **Question:** Does what we've built reflect what was originally asked for, or did we drift?
|
|
|
|
| 1 |
# Vision Validation: Does the Framework Encapsulate the Original Brief?
|
| 2 |
|
| 3 |
+
> **## Status as of 2026-06-08 (HEAD `aae66fa`, ADR-014)**
|
| 4 |
+
> The framework is past-skeleton: 8 subpackages (`composer_replication/*`), 115 passing
|
| 5 |
+
> tests + 1 skip-marked (see [`docs/V1_V8_COVERAGE.md`](V1_V8_COVERAGE.md) for the
|
| 6 |
+
> canonical count), and operational end-to-end examples (`gsm8k_grpo`,
|
| 7 |
+
> `sdpo_with_real_traces_production`). The 3-channel loss, layered hint-generation,
|
| 8 |
+
> trace-ingestion, and DiLoCo have all shipped and been cross-family reviewed. The base
|
| 9 |
+
> RL objective is now a **selectable menu** (default Dr.GRPO; ADR-014) rather than
|
| 10 |
+
> hardcoded. **Channel 3 (trace-replay-DPO) is the framework's own additive research
|
| 11 |
+
> channel — not part of Cursor's Composer recipe** (Composer = channels 1 Dr.GRPO + 2
|
| 12 |
+
> SDPO only; ADR-014).
|
| 13 |
>
|
| 14 |
+
> **Two remaining honest gaps (NOT closed):**
|
| 15 |
+
> 1. Docker/TorchForge substrate E2E is hardware-blocked — the test exists and skips
|
| 16 |
+
> cleanly, but lacking a local multi-GPU rig the orchestrator layer is unrun.
|
| 17 |
+
> 2. The 8B LMA channel-ladder is **not fully runnable today**: only **A1 (GRPO-only)**
|
| 18 |
+
> has a real Modal runner. **A2 (SDPO) / A3 (replay-DPO) / A4 (combined)** are scaffold
|
| 19 |
+
> + plan-builder only, blocked on dataset construction (a real error-trace SDPO dataset
|
| 20 |
+
> + a replay-DPO preference corpus + an A100 entrypoint that don't exist yet). The real
|
| 21 |
+
> 8B run is *additionally* user-budget-gated. See
|
| 22 |
+
> [ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md) +
|
| 23 |
+
> [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md).
|
| 24 |
|
| 25 |
> **Status:** Self-audit, 2026-05-25 (Wave 6).
|
| 26 |
> **Question:** Does what we've built reflect what was originally asked for, or did we drift?
|
|
@@ -41,7 +41,7 @@ From `01-composer-2.5.md`:
|
|
| 41 |
|
| 42 |
## How the 5 component pieces fit together
|
| 43 |
|
| 44 |
-
For the **rigorous integration architecture** — exact extension points in TRL (`GRPOTrainer._compute_loss` subclass), VeRL (`@register_adv_est` + `DataProto`), the OPSD loss `generalized_jsd_loss` lifted from `siyan-zhao/OPSD`, and the per-channel sequence diagrams — see [`docs/INTEGRATION_ARCHITECTURE.md`](docs/INTEGRATION_ARCHITECTURE.md). A working code skeleton with **38 passing unit tests** verifying the SDPO loss math, the trace-replay DPO-pair extraction, the data collator, and an end-to-end 5-step gradient run that decreases loss with all 3 channels active is at [`spikes/005-integrated-trainer-skeleton/`](spikes/005-integrated-trainer-skeleton/).
|
| 45 |
|
| 46 |
The high-level topology:
|
| 47 |
|
|
@@ -128,7 +128,7 @@ From `05-trace-replay-distillation.md`:
|
|
| 128 |
- Composer's hint-loss = **same-model self-teacher with hint context** pulls student at error sites (~1 extra forward pass / cheap, no API)
|
| 129 |
- Trace-replay-loss = **N external teachers** pull student at all sites (or high-uncertainty sites with VOI gating; ~$0.30/trace with gating per spike 001)
|
| 130 |
|
| 131 |
-
These are **complementary**, not competing. Both give per-step KL signals that bypass the long-horizon credit assignment problem, but they tap different supervision sources. v0.1 of the framework runs both simultaneously. See [`docs/COMPOSER_RECIPE_MAPPING.md`](docs/COMPOSER_RECIPE_MAPPING.md) for the precise mathematical distinction and the implementation-handle table.
|
| 132 |
|
| 133 |
**Cost mitigation** (the report does this analysis well):
|
| 134 |
- VOI gating (only query teachers when student entropy is high) → 60-80% savings
|
|
|
|
| 41 |
|
| 42 |
## How the 5 component pieces fit together
|
| 43 |
|
| 44 |
+
For the **rigorous integration architecture** — exact extension points in TRL (`GRPOTrainer._compute_loss` subclass), VeRL (`@register_adv_est` + `DataProto`), the OPSD loss `generalized_jsd_loss` lifted from `siyan-zhao/OPSD`, and the per-channel sequence diagrams — see [`docs/INTEGRATION_ARCHITECTURE.md`](../docs/INTEGRATION_ARCHITECTURE.md). A working code skeleton with **38 passing unit tests** verifying the SDPO loss math, the trace-replay DPO-pair extraction, the data collator, and an end-to-end 5-step gradient run that decreases loss with all 3 channels active is at [`spikes/005-integrated-trainer-skeleton/`](../spikes/005-integrated-trainer-skeleton/).
|
| 45 |
|
| 46 |
The high-level topology:
|
| 47 |
|
|
|
|
| 128 |
- Composer's hint-loss = **same-model self-teacher with hint context** pulls student at error sites (~1 extra forward pass / cheap, no API)
|
| 129 |
- Trace-replay-loss = **N external teachers** pull student at all sites (or high-uncertainty sites with VOI gating; ~$0.30/trace with gating per spike 001)
|
| 130 |
|
| 131 |
+
These are **complementary**, not competing. Both give per-step KL signals that bypass the long-horizon credit assignment problem, but they tap different supervision sources. v0.1 of the framework runs both simultaneously. See [`docs/COMPOSER_RECIPE_MAPPING.md`](../docs/COMPOSER_RECIPE_MAPPING.md) for the precise mathematical distinction and the implementation-handle table.
|
| 132 |
|
| 133 |
**Cost mitigation** (the report does this analysis well):
|
| 134 |
- VOI gating (only query teachers when student entropy is high) → 60-80% savings
|
|
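
The VOI gating mentioned in the hunk above (query teachers only when student entropy is high) can be sketched as follows. This is an illustration under assumptions, not the framework's code: the function names and the 1.0-nat threshold are hypothetical, and the savings figure quoted in the document is not reproduced by this toy.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_query_teachers(probs, threshold_nats=1.0):
    """Gate: only pay for teacher calls where the student is uncertain."""
    return entropy(probs) > threshold_nats


def gated_fraction(step_distributions, threshold_nats=1.0):
    """Fraction of steps that would trigger a teacher query."""
    queried = sum(
        1 for probs in step_distributions
        if should_query_teachers(probs, threshold_nats)
    )
    return queried / len(step_distributions)


confident = [0.97, 0.01, 0.01, 0.01]   # low entropy: skip the teachers
uncertain = [0.25, 0.25, 0.25, 0.25]   # entropy ln(4): query the teachers
```

The fraction of skipped steps, `1 - gated_fraction(...)`, is the quantity that translates into cost savings; the actual value depends on the student and the threshold.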
@@ -12,15 +12,15 @@ I'm releasing this repo as a **pre-experimental methodology paper + integration
|
|
| 12 |
|
| 13 |
## What's in the box right now
|
| 14 |
|
| 15 |
-
**1. Methodology paper.** [`publications/PAPER_v0.md`](
|
| 16 |
|
| 17 |
-
**2. Composer 2.5 blog audit.** [`docs/COMPOSER_RECIPE_MAPPING.md`](docs/COMPOSER_RECIPE_MAPPING.md) — every claim tagged `[BLOG-VERIFIED]` / `[INFERRED]` / `[EXTRAPOLATED]`. Major finding: Cursor's "Targeted RL with Textual Feedback" is **mathematically the same as published SDPO** (Hübotter et al., ICLR 2026; [arXiv:2601.20802](https://arxiv.org/abs/2601.20802)) and OPSD (Zhao et al., [arXiv:2601.18734](https://arxiv.org/abs/2601.18734)), with **MIT-licensed reference code at [`siyan-zhao/OPSD`](https://github.com/siyan-zhao/OPSD)**. Cursor cites both papers in their blog's footnote 1. This was missed by the initial parallel-research dispatch I ran for this project — I only caught it when I read the blog directly.
|
| 18 |
|
| 19 |
-
**3. Integration architecture doc.** [`docs/INTEGRATION_ARCHITECTURE.md`](docs/INTEGRATION_ARCHITECTURE.md) — verified extension points (via [DeepWiki](https://deepwiki.com/) audits) for TRL, VeRL, TorchForge, Monarch, OpenEnv. Sequence diagrams. The unified loss form `total = grpo + α·sdpo + β·trace_replay_dpo` and an argument that the three channels don't compete for any shared resource.
|
| 20 |
|
| 21 |
-
**4. Empirical economic feasibility result for the novel channel.** [`spikes/001-teacher-replay-cost/verdict.md`](spikes/001-teacher-replay-cost/verdict.md) — 150 real OpenRouter calls (Opus 4.7 + GPT-5 + DeepSeek V4 Pro on 50 synthetic agentic-coding states), 0 errors, **mean per-trace cost $0.98 with 5× headroom on the $5 cap, p95 step latency 20.5s**. Reproducible: set `OPENROUTER_API_KEY` and run the three scripts.
|
| 22 |
|
| 23 |
-
**5. Working code skeleton with 38 passing unit tests.** [`spikes/005-integrated-trainer-skeleton/`](spikes/005-integrated-trainer-skeleton/) — ports of OPSD's `generalized_jsd_loss`, the teacher-replay DPO-pair extractor, the data collator, both a `ComposerReplicationTrainer(GRPOTrainer)` for TRL and a `@register_adv_est("grpo_composer")` stub for VeRL. The end-to-end loss-composition smoke test runs all three channels on a 10K-parameter custom model and confirms a 5-step train run *decreases* loss with α=0.1, β=0.05 — the channels don't fight each other.
|
| 24 |
|
| 25 |
```
|
| 26 |
$ python3 -m pytest tests/ -v
|
|
@@ -44,7 +44,7 @@ Translation: *I have a framework that compiles, integration that's verified at t
|
|
| 44 |
|
| 45 |
2. **Adjacent-work pointers.** Does multi-teacher trace-replay-with-disagreement-as-DPO-signal already exist in published work I missed? My survey (rStar / Math-Shepherd / OmegaPRM / Magpie / MoA) didn't find it but absence of evidence isn't evidence of absence.
|
| 46 |
|
| 47 |
-
3. **Reward-hacking ideas for the v0.1 environment.** Cursor mentions agents decompiling Java bytecode and reverse-engineering Python type-checking caches to recover deleted features in their Feature Deletion env. Their mitigation is opaque ("agentic monitoring tools"). I have proposals in [`PAPER_v0.md` §8](
|
| 48 |
|
| 49 |
4. **Collaboration interest for spike 002.** If you have a Modal account or a small GPU budget and want to run the trace-collection experiments — particularly the head-to-head TRL-vs-PRIME-RL comparison — I'd happily co-author a follow-up paper. Total budget for spikes 002–004 is ~$500 + a couple of weeks of wallclock.
|
| 50 |
|
|
|
|
| 12 |
|
| 13 |
## What's in the box right now
|
| 14 |
|
| 15 |
+
**1. Methodology paper.** [`publications/PAPER_v0.md`](PAPER_v0.md) — longform writeup of the framework, audit of Cursor's blog vs secondary-source extrapolations, integration architecture across TRL/VeRL/OpenEnv, and a risk-ordered spike plan.
|
| 16 |
|
| 17 |
+
**2. Composer 2.5 blog audit.** [`docs/COMPOSER_RECIPE_MAPPING.md`](../docs/COMPOSER_RECIPE_MAPPING.md) — every claim tagged `[BLOG-VERIFIED]` / `[INFERRED]` / `[EXTRAPOLATED]`. Major finding: Cursor's "Targeted RL with Textual Feedback" is **mathematically the same as published SDPO** (Hübotter et al., ICLR 2026; [arXiv:2601.20802](https://arxiv.org/abs/2601.20802)) and OPSD (Zhao et al., [arXiv:2601.18734](https://arxiv.org/abs/2601.18734)), with **MIT-licensed reference code at [`siyan-zhao/OPSD`](https://github.com/siyan-zhao/OPSD)**. Cursor cites both papers in their blog's footnote 1. This was missed by the initial parallel-research dispatch I ran for this project — I only caught it when I read the blog directly.
|
| 18 |
|
| 19 |
+
**3. Integration architecture doc.** [`docs/INTEGRATION_ARCHITECTURE.md`](../docs/INTEGRATION_ARCHITECTURE.md) — verified extension points (via [DeepWiki](https://deepwiki.com/) audits) for TRL, VeRL, TorchForge, Monarch, OpenEnv. Sequence diagrams. The unified loss form `total = grpo + α·sdpo + β·trace_replay_dpo` and an argument that the three channels don't compete for any shared resource.
|
| 20 |
|
| 21 |
+
**4. Empirical economic feasibility result for the novel channel.** [`spikes/001-teacher-replay-cost/verdict.md`](../spikes/001-teacher-replay-cost/verdict.md) — 150 real OpenRouter calls (Opus 4.7 + GPT-5 + DeepSeek V4 Pro on 50 synthetic agentic-coding states), 0 errors, **mean per-trace cost $0.98 with 5× headroom on the $5 cap, p95 step latency 20.5s**. Reproducible: set `OPENROUTER_API_KEY` and run the three scripts.
|
| 22 |
|
| 23 |
+
**5. Working code skeleton with 38 passing unit tests.** [`spikes/005-integrated-trainer-skeleton/`](../spikes/005-integrated-trainer-skeleton/) — ports of OPSD's `generalized_jsd_loss`, the teacher-replay DPO-pair extractor, the data collator, both a `ComposerReplicationTrainer(GRPOTrainer)` for TRL and a `@register_adv_est("grpo_composer")` stub for VeRL. The end-to-end loss-composition smoke test runs all three channels on a 10K-parameter custom model and confirms a 5-step train run *decreases* loss with α=0.1, β=0.05 — the channels don't fight each other.
|
| 24 |
|
| 25 |
```
|
| 26 |
$ python3 -m pytest tests/ -v
|
|
|
|
| 44 |
|
| 45 |
2. **Adjacent-work pointers.** Does multi-teacher trace-replay-with-disagreement-as-DPO-signal already exist in published work I missed? My survey (rStar / Math-Shepherd / OmegaPRM / Magpie / MoA) didn't find it but absence of evidence isn't evidence of absence.
|
| 46 |
|
| 47 |
+
3. **Reward-hacking ideas for the v0.1 environment.** Cursor mentions agents decompiling Java bytecode and reverse-engineering Python type-checking caches to recover deleted features in their Feature Deletion env. Their mitigation is opaque ("agentic monitoring tools"). I have proposals in [`PAPER_v0.md` §8](PAPER_v0.md#8-reward-hacking-safeguards-proposed-for-v01) but more eyes welcome.
|
| 48 |
|
| 49 |
4. **Collaboration interest for spike 002.** If you have a Modal account or a small GPU budget and want to run the trace-collection experiments — particularly the head-to-head TRL-vs-PRIME-RL comparison — I'd happily co-author a follow-up paper. Total budget for spikes 002–004 is ~$500 + a couple of weeks of wallclock.
|
| 50 |
|
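
The unified loss form quoted in the post, `total = grpo + α·sdpo + β·trace_replay_dpo`, can be written out as a short sketch. This is not the signature of the repository's `composer_total_loss`; `ChannelWeights` and `total_loss` are hypothetical names, and scalar floats stand in for per-batch loss values. The default weights are the α=0.1, β=0.05 from the smoke test described in the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelWeights:
    alpha: float = 0.1    # weight on channel 2 (SDPO)
    beta: float = 0.05    # weight on channel 3 (trace-replay-DPO)


def total_loss(grpo, sdpo, trace_replay_dpo, weights=ChannelWeights()):
    """total = grpo + alpha * sdpo + beta * trace_replay_dpo"""
    return grpo + weights.alpha * sdpo + weights.beta * trace_replay_dpo


# All three channels active.
three_channel = total_loss(1.0, 2.0, 4.0)

# Channels 1 and 2 only: beta = 0 switches off the additive
# trace-replay-DPO channel.
composer_only = total_loss(1.0, 2.0, 4.0, ChannelWeights(beta=0.0))
```

Setting `beta=0.0` corresponds to the distinction this commit draws: channels 1 and 2 are the Composer recipe, and channel 3 is the framework's own addition.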