HF Discussion thread (draft) — Pre-experimental release

Where to post: the repo's Community / Discussions tab.
Title (suggested): Methodology release: Composer 2.5 replication framework + novel trace-replay channel (pre-experimental, looking for feedback)
Tag: release, discussion

Hi all 👋

I'm releasing this repo as a pre-experimental methodology paper + integration architecture + working code skeleton for an open replication of Cursor's Composer 2.5. I want to be upfront: there are no training results yet. The full empirical validation is blocked on committing GPU compute to spikes 002–004 (described below). I'm releasing now to invite feedback before I burn GPU budget.

What's in the box right now

1. Methodology paper. publications/PAPER_v0.md — longform writeup of the framework, audit of Cursor's blog vs secondary-source extrapolations, integration architecture across TRL/VeRL/OpenEnv, and a risk-ordered spike plan.

2. Composer 2.5 blog audit. docs/COMPOSER_RECIPE_MAPPING.md — every claim tagged [BLOG-VERIFIED] / [INFERRED] / [EXTRAPOLATED]. Major finding: Cursor's "Targeted RL with Textual Feedback" is mathematically the same as published SDPO (Hübotter et al., ICLR 2026; arXiv:2601.20802) and OPSD (Zhao et al., arXiv:2601.18734), with MIT-licensed reference code at siyan-zhao/OPSD. Cursor cites both papers in their blog's footnote 1. This was missed by the initial parallel-research dispatch I ran for this project — I only caught it when I read the blog directly.

3. Integration architecture doc. docs/INTEGRATION_ARCHITECTURE.md — verified extension points (via DeepWiki audits) for TRL, VeRL, TorchForge, Monarch, OpenEnv. Sequence diagrams. The unified loss form total = grpo + α·sdpo + β·trace_replay_dpo and an argument that the three channels don't compete for any shared resource.

4. Empirical economic feasibility result for the novel channel. spikes/001-teacher-replay-cost/verdict.md — 150 real OpenRouter calls (Opus 4.7 + GPT-5 + DeepSeek V4 Pro on 50 synthetic agentic-coding states), 0 errors, mean per-trace cost $0.98 with 5× headroom on the $5 cap, p95 step latency 20.5s. Reproducible: set OPENROUTER_API_KEY and run the three scripts.

5. Working code skeleton with 38 passing unit tests. spikes/005-integrated-trainer-skeleton/ contains ports of OPSD's generalized_jsd_loss, the teacher-replay DPO-pair extractor, the data collator, a ComposerReplicationTrainer(GRPOTrainer) for TRL, and a @register_adv_est("grpo_composer") stub for VeRL. The end-to-end loss-composition smoke test runs all three channels on a 10K-parameter custom model and confirms that loss decreases over a 5-step training run with α=0.1, β=0.05, so at least at toy scale the channels don't fight each other.

$ python3 -m pytest tests/ -v
============================== 38 passed in 3.43s ==============================
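
To make the unified loss form from item 3 and the JSD loss named in item 5 concrete, here is a minimal pure-Python sketch. It is not the repo's code or OPSD's generalized_jsd_loss: it assumes the textbook generalized Jensen-Shannon divergence on plain probability lists, and the names kl_divergence, generalized_jsd, compose_total_loss, and jsd_beta are hypothetical. The default weights mirror the α=0.1, β=0.05 used in the smoke test.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def generalized_jsd(teacher, student, jsd_beta=0.5):
    """Textbook generalized JSD (assumed form, not OPSD's implementation).

    JSD = jsd_beta * KL(teacher || m) + (1 - jsd_beta) * KL(student || m),
    where m = jsd_beta * teacher + (1 - jsd_beta) * student.
    """
    if not 0.0 < jsd_beta < 1.0:
        raise ValueError("this sketch only handles 0 < jsd_beta < 1")
    mixture = [jsd_beta * t + (1.0 - jsd_beta) * s
               for t, s in zip(teacher, student)]
    return (jsd_beta * kl_divergence(teacher, mixture)
            + (1.0 - jsd_beta) * kl_divergence(student, mixture))


def compose_total_loss(grpo, sdpo, trace_replay_dpo, alpha=0.1, beta=0.05):
    """total = grpo + alpha * sdpo + beta * trace_replay_dpo."""
    return grpo + alpha * sdpo + beta * trace_replay_dpo


# Toy usage with made-up numbers: a JSD value stands in for the SDPO term.
sdpo_term = generalized_jsd([0.7, 0.2, 0.1], [0.5, 0.3, 0.2])
total = compose_total_loss(grpo=1.0, sdpo=sdpo_term, trace_replay_dpo=0.4)
```

Because the composition is a plain weighted sum, setting alpha=0 and beta=0 leaves only the GRPO term, which corresponds to a plain-GRPO baseline arm.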

What I'm explicitly NOT claiming

Claim	Status	Validates via
Trace-replay distillation improves SWE-bench-lite pass@1 over plain GRPO	Open	Spike 004 (Qwen3-7B A/B, ~$300 GPU + $50 eval, 6 runs × 3 seeds)
Teacher disagreement at the step level carries non-trivial signal on real traces	Open	Spike 003 (≥5 pairs/trace, non-trivial KL distance from random pairs)
TRL+OpenEnv emits clean trace JSONL on a real agentic env	Open	Spike 002a (100 rollouts on Modal A100)
PRIME-RL is the better trace substrate at scale	Open	Spike 002b (head-to-head with 002a)

Translation: I have a framework that compiles, integration that's verified at the unit-test level, and economic feasibility for the novel channel. I do not yet have evidence that the method actually improves training.
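
For readers new to the disagreement-as-DPO-signal idea behind spike 003, here is a minimal illustrative sketch. It is not the repo's extractor: the trace layout, the function name extract_dpo_pairs, and the majority-vote rule for choosing the preferred action are all assumptions made for illustration.

```python
from collections import Counter


def extract_dpo_pairs(trace):
    """Turn step-level teacher disagreement into (prompt, chosen, rejected) pairs.

    Assumed (hypothetical) trace layout: a list of steps, each a dict
    {"state": str, "proposals": {teacher_name: action_str}}.
    Assumed rule: the strict-majority action is "chosen"; every other
    proposed action at that step is "rejected".
    """
    pairs = []
    for step in trace:
        counts = Counter(step["proposals"].values())
        if len(counts) < 2:
            continue  # unanimous teachers: no disagreement signal
        ranked = counts.most_common()
        top_action, top_votes = ranked[0]
        if top_votes == ranked[1][1]:
            continue  # tie for first place: no clear preference
        for action, _ in ranked[1:]:
            pairs.append({
                "prompt": step["state"],
                "chosen": top_action,
                "rejected": action,
            })
    return pairs


# Toy trace with three hypothetical teachers.
trace = [
    {"state": "s0", "proposals": {"t1": "run tests", "t2": "run tests", "t3": "edit file"}},
    {"state": "s1", "proposals": {"t1": "commit", "t2": "commit", "t3": "commit"}},
    {"state": "s2", "proposals": {"t1": "a", "t2": "b", "t3": "c"}},
]
pairs = extract_dpo_pairs(trace)
```

In this toy trace only the first step yields a pair: the second step is unanimous and the third is a three-way tie. Whether such pairs carry real signal is exactly what spike 003 is meant to test.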

What I'm specifically asking for

1. Critical reads of the integration architecture. If you've worked with TRL's GRPOTrainer._compute_loss, VeRL's @register_adv_est, or OPSD's loss code, I'd love to know if I'm misreading any of the extension surfaces. The DeepWiki audits gave me confidence, but they're no substitute for someone who has actually shipped on these frameworks calling out a foot-gun.
2. Adjacent-work pointers. Does multi-teacher trace-replay-with-disagreement-as-DPO-signal already exist in published work I missed? My survey (rStar / Math-Shepherd / OmegaPRM / Magpie / MoA) didn't find it, but absence of evidence isn't evidence of absence.
3. Reward-hacking mitigation ideas for the v0.1 environment. Cursor mentions agents decompiling Java bytecode and reverse-engineering Python type-checking caches to recover deleted features in their Feature Deletion env. Their mitigation is opaque ("agentic monitoring tools"). I have proposals in PAPER_v0.md §8, but more eyes are welcome.
4. Collaboration interest for spike 002. If you have a Modal account or a small GPU budget and want to run the trace-collection experiments (particularly the head-to-head TRL-vs-PRIME-RL comparison), I'd happily co-author a follow-up paper. Total budget for spikes 002–004 is ~$500 plus a couple of weeks of wall-clock time.

Why publish before experiments

Trade-off acknowledged: someone else may run spike 004 first. Two reasons I'm publishing now anyway:

1. The integration architecture and the OPSD-lift insight are independently useful. Other teams designing similar Composer replications shouldn't have to rediscover that there's MIT-licensed reference code for the SDPO loss.
2. Early feedback may catch design errors before I burn GPU budget. The cost of being scooped on the experiment is much smaller than the cost of running a failed experiment because of an integration bug I didn't catch.

The repo

🤗 huggingface.co/Codeseys/composer-replication-framework

License: MIT (methodology + code; upstream papers and code retain their respective licenses).

Looking forward to feedback.

Pre-experimental v0.0 release, 2026-05-25. v0.1 will incorporate spike 002–004 results.