# HF Discussion thread (draft) — Pre-experimental release
> **Where to post:** the repo's [Community / Discussions tab](https://huggingface.co/Codeseys/composer-replication-framework/discussions).
> **Title (suggested):** `Methodology release: Composer 2.5 replication framework + novel trace-replay channel (pre-experimental, looking for feedback)`
> **Tags:** `release`, `discussion`
---
Hi all 👋
I'm releasing this repo as a **pre-experimental methodology paper + integration architecture + working code skeleton** for an open replication of Cursor's [Composer 2.5](https://cursor.com/blog/composer-2-5). I want to be upfront: there are no training results yet. The full empirical validation is gated on GPU compute commitment for spikes 002–004 (described below). This release is to invite feedback before I burn GPU budget.
## What's in the box right now
**1. Methodology paper.** [`publications/PAPER_v0.md`](publications/PAPER_v0.md) — longform writeup of the framework, audit of Cursor's blog vs secondary-source extrapolations, integration architecture across TRL/VeRL/OpenEnv, and a risk-ordered spike plan.
**2. Composer 2.5 blog audit.** [`docs/COMPOSER_RECIPE_MAPPING.md`](docs/COMPOSER_RECIPE_MAPPING.md) — every claim tagged `[BLOG-VERIFIED]` / `[INFERRED]` / `[EXTRAPOLATED]`. Major finding: Cursor's "Targeted RL with Textual Feedback" is **mathematically the same as published SDPO** (Hübotter et al., ICLR 2026; [arXiv:2601.20802](https://arxiv.org/abs/2601.20802)) and OPSD (Zhao et al., [arXiv:2601.18734](https://arxiv.org/abs/2601.18734)), with **MIT-licensed reference code at [`siyan-zhao/OPSD`](https://github.com/siyan-zhao/OPSD)**. Cursor cites both papers in their blog's footnote 1. This was missed by the initial parallel-research dispatch I ran for this project — I only caught it when I read the blog directly.
**3. Integration architecture doc.** [`docs/INTEGRATION_ARCHITECTURE.md`](docs/INTEGRATION_ARCHITECTURE.md) — verified extension points (via [DeepWiki](https://deepwiki.com/) audits) for TRL, VeRL, TorchForge, Monarch, OpenEnv. Sequence diagrams. The unified loss form `total = grpo + α·sdpo + β·trace_replay_dpo` and an argument that the three channels don't compete for any shared resource.
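As a rough illustration of the unified loss form in item 3, the sketch below combines three already-computed scalar channel losses. The function name `compose_total_loss` and the choice to treat a missing channel as contributing zero are assumptions made for illustration, not the repo's API. The default weights are the α=0.1, β=0.05 values quoted for the smoke test.

```python
from typing import Optional


def compose_total_loss(
    grpo: float,
    sdpo: Optional[float] = None,
    trace_replay_dpo: Optional[float] = None,
    alpha: float = 0.1,
    beta: float = 0.05,
) -> float:
    """Return grpo + alpha * sdpo + beta * trace_replay_dpo.

    Hypothetical helper: a channel passed as None contributes nothing
    (an assumption, e.g. for a batch that has no trace-replay pairs).
    """
    total = grpo
    if sdpo is not None:
        total += alpha * sdpo
    if trace_replay_dpo is not None:
        total += beta * trace_replay_dpo
    return total


if __name__ == "__main__":
    # All three channels present.
    print(compose_total_loss(1.0, sdpo=2.0, trace_replay_dpo=4.0))
    # GRPO only: the auxiliary channels are simply absent.
    print(compose_total_loss(1.0))
```

Because the auxiliary channels enter only as additive weighted terms, setting α or β to zero recovers plain GRPO in this sketch.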
**4. Empirical economic feasibility result for the novel channel.** [`spikes/001-teacher-replay-cost/verdict.md`](spikes/001-teacher-replay-cost/verdict.md) — 150 real OpenRouter calls (Opus 4.7 + GPT-5 + DeepSeek V4 Pro on 50 synthetic agentic-coding states), 0 errors, **mean per-trace cost $0.98 with 5× headroom on the $5 cap, p95 step latency 20.5s**. Reproducible: set `OPENROUTER_API_KEY` and run the three scripts.
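The headroom and p95 figures in item 4 are simple summary statistics. The sketch below shows one way to compute them; the helper names and the sample latencies are made up for illustration, and the nearest-rank percentile is an assumed convention (the spike scripts may use a different one).

```python
import math
from statistics import mean


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cost_headroom(mean_cost_usd, cap_usd):
    """How many times the mean per-trace cost fits under the cap."""
    return cap_usd / mean_cost_usd


if __name__ == "__main__":
    # Figures quoted in the post: $0.98 mean per-trace cost against a $5 cap.
    print(f"headroom: {cost_headroom(0.98, 5.0):.1f}x")

    # Made-up step latencies in seconds, for illustration only.
    latencies = [4.2, 6.8, 9.1, 12.5, 20.5]
    p95 = percentile_nearest_rank(latencies, 95)
    print(f"mean latency: {mean(latencies):.2f}s, p95 latency: {p95}s")
```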
**5. Working code skeleton with 38 passing unit tests.** [`spikes/005-integrated-trainer-skeleton/`](spikes/005-integrated-trainer-skeleton/) contains ports of OPSD's `generalized_jsd_loss`, the teacher-replay DPO-pair extractor, and the data collator, plus a `ComposerReplicationTrainer(GRPOTrainer)` for TRL and a `@register_adv_est("grpo_composer")` stub for VeRL. The end-to-end loss-composition smoke test runs all three channels on a 10K-parameter custom model and confirms that a 5-step training run *decreases* the loss with α=0.1, β=0.05, so the channels don't fight each other, at least on this toy model.
```
$ python3 -m pytest tests/ -v
============================== 38 passed in 3.43s ==============================
```
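For readers unfamiliar with the `generalized_jsd_loss` named in item 5, here is a minimal pure-Python sketch of one common definition of the generalized Jensen-Shannon divergence between a teacher and a student distribution. It is not the OPSD code or the port in this repo: the exact parameterisation, per-token reduction, and any temperature handling used there are not covered here, and the function names below are illustrative.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def generalized_jsd(teacher, student, beta=0.5):
    """beta * KL(teacher || m) + (1 - beta) * KL(student || m),
    where m = beta * teacher + (1 - beta) * student.

    Assumed definition for illustration; beta must lie strictly in (0, 1).
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must be strictly between 0 and 1")
    if len(teacher) != len(student):
        raise ValueError("distributions must have the same length")
    mixture = [beta * t + (1.0 - beta) * s for t, s in zip(teacher, student)]
    return beta * kl_divergence(teacher, mixture) + (1.0 - beta) * kl_divergence(student, mixture)


if __name__ == "__main__":
    teacher = [0.7, 0.2, 0.1]
    student = [0.4, 0.4, 0.2]
    print(generalized_jsd(teacher, student, beta=0.5))
    # Identical distributions give zero divergence.
    print(generalized_jsd(teacher, teacher, beta=0.5))
```

With β=0.5 this reduces to the ordinary symmetric Jensen-Shannon divergence, which is bounded above by ln 2 for fully disjoint distributions.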
## What I'm explicitly NOT claiming
| Claim | Status | Validated by |
|---|---|---|
| Trace-replay distillation improves SWE-bench-lite pass@1 over plain GRPO | **Open** | Spike 004 (Qwen3-7B A/B, ~$300 GPU + $50 eval, 6 runs × 3 seeds) |
| Teacher disagreement at the step level carries non-trivial signal on real traces | **Open** | Spike 003 (≥5 pairs/trace, non-trivial KL divergence from random pairs) |
| TRL+OpenEnv emits clean trace JSONL on a real agentic env | **Open** | Spike 002a (100 rollouts on Modal A100) |
| PRIME-RL is the better trace substrate at scale | **Open** | Spike 002b (head-to-head with 002a) |
Translation: *I have a framework that compiles, integration that's verified at the unit-test level, and economic feasibility for the novel channel. I do not yet have evidence that the method actually improves training.*
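To make the Spike 003 criterion in the table concrete, the sketch below shows one possible way to turn step-level teacher disagreement into (chosen, rejected) pairs and check the ≥5 pairs/trace threshold. Everything here is hypothetical: the trace schema, the teacher keys, and the plurality-vote rule are assumptions for illustration and may differ from the extractor in the repo.

```python
from collections import Counter
from typing import Dict, List, Tuple

# One trace step: teacher name -> the action that teacher proposed (hypothetical schema).
Step = Dict[str, str]


def extract_pairs(trace: List[Step]) -> List[Tuple[int, str, str]]:
    """Return (step_index, chosen, rejected) tuples from steps where teachers disagree.

    Assumed rule: the action with a clear plurality of votes is 'chosen' and
    every other distinct action is 'rejected'. Unanimous steps and steps with
    a tie for first place are skipped.
    """
    pairs = []
    for index, proposals in enumerate(trace):
        counts = Counter(proposals.values())
        if len(counts) < 2:
            continue  # unanimous (or empty) step: no disagreement signal
        ranked = counts.most_common()
        (top_action, top_votes), (_, second_votes) = ranked[0], ranked[1]
        if top_votes == second_votes:
            continue  # tie for first place: no clearly preferred action
        for action in counts:
            if action != top_action:
                pairs.append((index, top_action, action))
    return pairs


def meets_pair_threshold(trace: List[Step], minimum: int = 5) -> bool:
    """Check the '>= 5 pairs per trace' criterion on a single trace."""
    return len(extract_pairs(trace)) >= minimum


if __name__ == "__main__":
    trace = [
        {"teacher_a": "run_tests", "teacher_b": "run_tests", "teacher_c": "edit_file"},
        {"teacher_a": "edit_file", "teacher_b": "edit_file", "teacher_c": "edit_file"},
        {"teacher_a": "grep", "teacher_b": "ls", "teacher_c": "cat"},
    ]
    print(extract_pairs(trace))
    print(meets_pair_threshold(trace))
```

A plurality rule is only one option; weighting teachers by past accuracy or keeping tied steps as lower-confidence pairs would be equally plausible designs.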
## What I'm specifically asking for
1. **Critical reads of the integration architecture.** If you've worked with TRL's `GRPOTrainer._compute_loss`, VeRL's `@register_adv_est`, or OPSD's loss code, I'd love to know if I'm misreading any of the extension surfaces. The [DeepWiki audits](https://deepwiki.com/) gave me confidence but they're not the same as someone who's actually shipped on these frameworks calling out a foot-gun.
2. **Adjacent-work pointers.** Does multi-teacher trace-replay-with-disagreement-as-DPO-signal already exist in published work I missed? My survey (rStar / Math-Shepherd / OmegaPRM / Magpie / MoA) didn't find it, but absence of evidence isn't evidence of absence.
3. **Reward-hacking ideas for the v0.1 environment.** Cursor mentions agents decompiling Java bytecode and reverse-engineering Python type-checking caches to recover deleted features in their Feature Deletion env. Their mitigation is opaque ("agentic monitoring tools"). I have proposals in [`PAPER_v0.md` §8](publications/PAPER_v0.md#8-reward-hacking-safeguards-proposed-for-v01) but more eyes welcome.
4. **Collaboration interest for spike 002.** If you have a Modal account or a small GPU budget and want to run the trace-collection experiments — particularly the head-to-head TRL-vs-PRIME-RL comparison — I'd happily co-author a follow-up paper. Total budget for spikes 002–004 is ~$500 + a couple of weeks of wallclock.
## Why publish before experiments
Trade-off acknowledged: someone else may run spike 004 first. Two reasons I'm publishing now anyway:
- The integration architecture and the OPSD-lift insight are independently useful. Other teams designing similar Composer replications shouldn't have to rediscover that there's MIT-licensed reference code for the SDPO loss.
- Early feedback may catch design errors before I burn GPU budget. The cost of being scooped on the experiment is much smaller than the cost of running a failed experiment because of an integration bug I didn't catch.
## The repo
🤗 [huggingface.co/Codeseys/composer-replication-framework](https://huggingface.co/Codeseys/composer-replication-framework)
License: MIT (methodology + code; upstream papers and code retain their respective licenses).
Looking forward to feedback.
---
*Pre-experimental v0.0 release, 2026-05-25. v0.1 will incorporate spike 002–004 results.*