baladithyab

	# HF Discussion thread (draft) — Pre-experimental release

	> Where to post: the repo's [Community / Discussions tab](https://huggingface.co/Codeseys/composer-replication-framework/discussions).
	> Title (suggested): `Methodology release: Composer 2.5 replication framework + novel trace-replay channel (pre-experimental, looking for feedback)`
	> Tag: `release`, `discussion`

	---

	Hi all 👋

	I'm releasing this repo as a pre-experimental methodology paper + integration architecture + working code skeleton for an open replication of Cursor's [Composer 2.5](https://cursor.com/blog/composer-2-5). I want to be upfront: there are no training results yet. The full empirical validation is gated on GPU compute commitment for spikes 002–004 (described below). This release is to invite feedback before I burn GPU budget.

	## What's in the box right now

	1. Methodology paper. [`publications/PAPER_v0.md`](publications/PAPER_v0.md) — longform writeup of the framework, audit of Cursor's blog vs secondary-source extrapolations, integration architecture across TRL/VeRL/OpenEnv, and a risk-ordered spike plan.

	2. Composer 2.5 blog audit. [`docs/COMPOSER_RECIPE_MAPPING.md`](docs/COMPOSER_RECIPE_MAPPING.md) — every claim tagged `[BLOG-VERIFIED]` / `[INFERRED]` / `[EXTRAPOLATED]`. Major finding: Cursor's "Targeted RL with Textual Feedback" is mathematically the same as published SDPO (Hübotter et al., ICLR 2026; [arXiv:2601.20802](https://arxiv.org/abs/2601.20802)) and OPSD (Zhao et al., [arXiv:2601.18734](https://arxiv.org/abs/2601.18734)), with MIT-licensed reference code at [`siyan-zhao/OPSD`](https://github.com/siyan-zhao/OPSD). Cursor cites both papers in their blog's footnote 1. This was missed by the initial parallel-research dispatch I ran for this project — I only caught it when I read the blog directly.

	3. Integration architecture doc. [`docs/INTEGRATION_ARCHITECTURE.md`](docs/INTEGRATION_ARCHITECTURE.md) — verified extension points (via [DeepWiki](https://deepwiki.com/) audits) for TRL, VeRL, TorchForge, Monarch, OpenEnv. Sequence diagrams. The unified loss form `total = grpo + α·sdpo + β·trace_replay_dpo` and an argument that the three channels don't compete for any shared resource.

	4. Empirical economic feasibility result for the novel channel. [`spikes/001-teacher-replay-cost/verdict.md`](spikes/001-teacher-replay-cost/verdict.md) — 150 real OpenRouter calls (Opus 4.7 + GPT-5 + DeepSeek V4 Pro on 50 synthetic agentic-coding states), 0 errors, mean per-trace cost $0.98 with 5× headroom on the $5 cap, p95 step latency 20.5s. Reproducible: set `OPENROUTER_API_KEY` and run the three scripts.

	5. Working code skeleton with 38 passing unit tests. [`spikes/005-integrated-trainer-skeleton/`](spikes/005-integrated-trainer-skeleton/): ports of OPSD's `generalized_jsd_loss`, the teacher-replay DPO-pair extractor, the data collator, plus a `ComposerReplicationTrainer(GRPOTrainer)` for TRL and a `@register_adv_est("grpo_composer")` stub for VeRL. The end-to-end loss-composition smoke test runs all three channels on a 10K-parameter custom model and confirms that loss decreases over a 5-step training run with α=0.1, β=0.05. At this toy scale, at least, the channels don't fight each other.

	```
	$ python3 -m pytest tests/ -v
	============================== 38 passed in 3.43s ==============================
	```
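
For readers who want the unified loss form `total = grpo + α·sdpo + β·trace_replay_dpo` in concrete terms, here is a minimal sketch in plain Python. It is not the code from the skeleton: `ChannelLosses` and `compose_total_loss` are hypothetical names, the per-channel values are made-up scalars, and only the weights α=0.1 and β=0.05 come from the smoke test described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelLosses:
    """Per-batch scalar losses from the three channels (hypothetical container)."""
    grpo: float
    sdpo: float
    trace_replay_dpo: float


def compose_total_loss(losses: ChannelLosses,
                       alpha: float = 0.1,
                       beta: float = 0.05) -> float:
    """Weighted sum: total = grpo + alpha * sdpo + beta * trace_replay_dpo."""
    if alpha < 0 or beta < 0:
        raise ValueError("channel weights must be non-negative")
    return losses.grpo + alpha * losses.sdpo + beta * losses.trace_replay_dpo


# Illustrative numbers only; they are not measurements from the repo.
example = ChannelLosses(grpo=1.2, sdpo=0.8, trace_replay_dpo=0.6)
total = compose_total_loss(example)                           # 1.2 + 0.08 + 0.03
grpo_only = compose_total_loss(example, alpha=0.0, beta=0.0)  # auxiliary channels off
print(f"total={total:.2f} grpo_only={grpo_only:.2f}")
```

Setting both weights to zero recovers plain GRPO, which is the baseline arm in the Spike 004 A/B comparison.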

	## What I'm explicitly NOT claiming

	| Claim | Status | Validated by |
	|---|---|---|
	| Trace-replay distillation improves SWE-bench-lite pass@1 over plain GRPO | Open | Spike 004 (Qwen3-7B A/B, ~$300 GPU + $50 eval, 6 runs × 3 seeds) |
	| Teacher disagreement at the step level carries non-trivial signal on real traces | Open | Spike 003 (≥5 pairs/trace, non-trivial KL distance from random pairs) |
	| TRL+OpenEnv emits clean trace JSONL on a real agentic env | Open | Spike 002a (100 rollouts on Modal A100) |
	| PRIME-RL is the better trace substrate at scale | Open | Spike 002b (head-to-head with 002a) |

	Translation: I have a framework that compiles, integration that's verified at the unit-test level, and economic feasibility for the novel channel. I do not yet have evidence that the method actually improves training.
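
To make the "teacher disagreement at the step level" idea concrete, here is a minimal sketch of one plausible way to turn multi-teacher replays into DPO pairs. This is an assumption for illustration, not the extractor shipped in the skeleton: `PreferencePair`, `extract_pair`, `extract_pairs`, the majority-vote rule, and the `min_agreement` threshold are all hypothetical, and the action strings are invented.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PreferencePair:
    step: int
    chosen: str    # action that enough teachers agreed on
    rejected: str  # action the student actually took


def extract_pair(step: int, student_action: str, teacher_actions: list[str],
                 min_agreement: int = 2) -> Optional[PreferencePair]:
    """Emit a pair only when enough teachers agree on an action the student did not take."""
    if not teacher_actions:
        return None
    action, votes = Counter(teacher_actions).most_common(1)[0]
    if votes < min_agreement or action == student_action:
        return None
    return PreferencePair(step=step, chosen=action, rejected=student_action)


def extract_pairs(trace: list[tuple[str, list[str]]]) -> list[PreferencePair]:
    """Walk a trace of (student_action, teacher_actions) steps and collect pairs."""
    pairs = []
    for step, (student_action, teacher_actions) in enumerate(trace):
        pair = extract_pair(step, student_action, teacher_actions)
        if pair is not None:
            pairs.append(pair)
    return pairs


# Toy trace with three teachers per step (assumed action vocabulary).
trace = [
    ("run_tests", ["run_tests", "run_tests", "read_file"]),   # consensus matches student
    ("edit_file", ["read_file", "read_file", "run_tests"]),   # 2 of 3 prefer read_file
    ("git_commit", ["run_tests", "read_file", "edit_file"]),  # no consensus
]
pairs = extract_pairs(trace)
print(pairs)
```

Under this rule only the second step yields a pair. Whether real traces yield enough such pairs (the ≥5 pairs/trace bar in the table) is exactly what Spike 003 is meant to measure.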

	## What I'm specifically asking for

	1. Critical reads of the integration architecture. If you've worked with TRL's `GRPOTrainer._compute_loss`, VeRL's `@register_adv_est`, or OPSD's loss code, I'd love to know if I'm misreading any of the extension surfaces. The [DeepWiki audits](https://deepwiki.com/) gave me confidence, but they're not the same as someone who's actually shipped on these frameworks calling out a foot-gun.

	2. Adjacent-work pointers. Does multi-teacher trace-replay-with-disagreement-as-DPO-signal already exist in published work I missed? My survey (rStar / Math-Shepherd / OmegaPRM / Magpie / MoA) didn't find it, but absence of evidence isn't evidence of absence.

	3. Reward-hacking ideas for the v0.1 environment. Cursor mentions agents decompiling Java bytecode and reverse-engineering Python type-checking caches to recover deleted features in their Feature Deletion env. Their mitigation is opaque ("agentic monitoring tools"). I have proposals in [`PAPER_v0.md` §8](publications/PAPER_v0.md#8-reward-hacking-safeguards-proposed-for-v01) but more eyes welcome.

	4. Collaboration interest for spike 002. If you have a Modal account or a small GPU budget and want to run the trace-collection experiments — particularly the head-to-head TRL-vs-PRIME-RL comparison — I'd happily co-author a follow-up paper. Total budget for spikes 002–004 is ~$500 + a couple of weeks of wallclock.
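
The cost and budget figures quoted in this post can be sanity-checked with a few lines of arithmetic. The sketch below uses only numbers stated above. The variable names are illustrative, the "~" figures are treated as exact for the calculation, and the remainder for spikes 002 and 003 is derived here, not a stated allocation.

```python
# Figures quoted in this post; "~" values are treated as exact for the arithmetic.
MEAN_COST_PER_TRACE_USD = 0.98   # spike 001 mean per-trace cost
COST_CAP_PER_TRACE_USD = 5.00    # spike 001 per-trace cap
SPIKE_004_GPU_USD = 300          # approximate GPU cost for spike 004
SPIKE_004_EVAL_USD = 50          # eval cost for spike 004
TOTAL_BUDGET_USD = 500           # approximate total for spikes 002-004

# Headroom: how many times the mean cost fits under the cap.
headroom = COST_CAP_PER_TRACE_USD / MEAN_COST_PER_TRACE_USD

# What is left for spikes 002 and 003 once spike 004 is paid for (derived).
spike_004_total = SPIKE_004_GPU_USD + SPIKE_004_EVAL_USD
remaining_for_002_003 = TOTAL_BUDGET_USD - spike_004_total

print(f"cap headroom: {headroom:.1f}x")
print(f"spike 004: ${spike_004_total}, left for 002 and 003: ${remaining_for_002_003}")
```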

	## Why publish before experiments

	Trade-off acknowledged: someone else may run spike 004 first. Two reasons I'm publishing now anyway:

	- The integration architecture and the OPSD-lift insight are independently useful. Other teams designing similar Composer-replications shouldn't have to rediscover that there's MIT-licensed reference code for the SDPO loss.
	- Early feedback may catch design errors before I burn GPU budget. The cost of being scooped on the experiment is much smaller than the cost of running a failed experiment because of an integration bug I didn't catch.

	## The repo

	🤗 [huggingface.co/Codeseys/composer-replication-framework](https://huggingface.co/Codeseys/composer-replication-framework)

	License: MIT (methodology + code; upstream papers and code retain their respective licenses).

	Looking forward to feedback.

	---

	Pre-experimental v0.0 release, 2026-05-25. v0.1 will incorporate spike 002–004 results.
