Instructions to use Codeseys/composer-replication-framework with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Codeseys/composer-replication-framework with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Codeseys/composer-replication-framework", dtype="auto") - Notebooks
- Google Colab
- Kaggle
Wave 5: full publication-materials drafts (pre-experimental release set)
Browse filesUser asked: draft all publication materials until we can do proper
experimentation to back up our ideas.
Drafted a complete pre-experimental publication wave: methodology paper,
blog post, repo Discussion announcement, X/LinkedIn threads, CITATION.cff
+ .bib at repo root, plus a RELEASE_CHECKLIST coordinating sequencing,
embargo, and risk register. None of this is posted publicly yet — that's
a separate decision when ready to ship.
Honest framing reused throughout: this is methodology + economic-feasibility
+ integration architecture, not training results. Spike 002-004 produce
the empirical training validation in a v0.1 follow-up.
Added:
publications/PAPER_v0.md (33KB, ~6500 words)
Longform methodology paper: 10 sections covering Composer 2.5 audit,
integration architecture, novel TR-DPO channel, spike 001 + 005 results,
honest "what's NOT proven" section, reward-hacking proposals, full
citation list with primary sources.
publications/BLOG_POST.md (16KB, ~2400 words)
HuggingFace Blog markdown format with YAML frontmatter. Anchor narrative:
Cursor's secret sauce = published SDPO/OPSD with MIT code; novel TR-DPO
channel; 38/38 tests pass; what we're NOT claiming yet.
publications/HF_DISCUSSION_POST.md (~700 words)
Repo Community-tab announcement. Specifically asks for: critical reads
of integration architecture, adjacent-work pointers, reward-hacking
ideas, collaboration interest for spike 002.
publications/TWITTER_THREAD.md (~1200 words)
Three variants: 13-tweet long-form thread, 5-tweet short variant,
LinkedIn longer-form post. All anchored on the HF repo URL.
publications/RELEASE_CHECKLIST.md
Coordinates the publication wave: pre-flight checklist, sequencing
recommendation (HF Discussion → Blog → X → arXiv), distribution
amplification ideas, embargo / coordination notes, risk register,
post-publication tracking protocol.
publications/README.md
Index for the publications directory.
CITATION.cff (root)
HuggingFace/GitHub Citation File Format. Renders as a "Cite this
repository" UI on the repo page. Includes references to Cursor blog,
OPSD, and SDPO papers as primary upstream sources.
CITATION.bib (root)
BibTeX equivalent for academic citation.
README.md (root)
Updated to surface the publications/ section with one paragraph linking
the release checklist.
Tests: 38/38 still pass (defensive re-run to confirm publication wave
didn't touch code paths).
Total publication wave: 1,203 lines of markdown across 8 files. All MIT
licensed. Ready to ship when you decide to publish.
- CITATION.bib +48 -0
- CITATION.cff +106 -0
- README.md +2 -0
- publications/BLOG_POST.md +187 -0
- publications/HF_DISCUSSION_POST.md +68 -0
- publications/PAPER_v0.md +452 -0
- publications/README.md +35 -0
- publications/RELEASE_CHECKLIST.md +118 -0
- publications/TWITTER_THREAD.md +189 -0
|
@@ -0,0 +1,48 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
% Composer 2.5 Replication Framework — BibTeX citation file
|
| 2 |
+
% https://huggingface.co/Codeseys/composer-replication-framework
|
| 3 |
+
%
|
| 4 |
+
% Citation order: this work first, then the upstream sources you'd typically
|
| 5 |
+
% cite alongside it (Cursor blog, OPSD, SDPO).
|
| 6 |
+
|
| 7 |
+
@misc{composer-replication-framework-2026,
|
| 8 |
+
author = {Codeseys},
|
| 9 |
+
title = {Composer 2.5 Replication Framework: Methodology and Integration Architecture for Open Replication of Cursor's Agentic Coding Recipe},
|
| 10 |
+
year = {2026},
|
| 11 |
+
publisher = {HuggingFace},
|
| 12 |
+
howpublished = {\url{https://huggingface.co/Codeseys/composer-replication-framework}},
|
| 13 |
+
note = {Pre-experimental v0.0 release. Methodology, integration architecture across TRL/VeRL/OpenEnv, and economic-feasibility result for novel multi-teacher trace-replay channel. Empirical training validation in follow-up paper.}
|
| 14 |
+
}
|
| 15 |
+
|
| 16 |
+
@article{cursor2026composer25,
|
| 17 |
+
title = {Introducing {C}omposer 2.5},
|
| 18 |
+
author = {{Cursor Team}},
|
| 19 |
+
year = {2026},
|
| 20 |
+
url = {https://cursor.com/blog/composer-2-5},
|
| 21 |
+
note = {Cursor blog. Cited in Section 2 of the framework's methodology paper.}
|
| 22 |
+
}
|
| 23 |
+
|
| 24 |
+
@article{zhao2026opsd,
|
| 25 |
+
title = {Self-{D}istilled {R}easoner: {O}n-{P}olicy {S}elf-{D}istillation for {L}arge {L}anguage {M}odels},
|
| 26 |
+
author = {Zhao, Siyan and Xie, Zhihui and Liu, Mengchen and Huang, Jing and Pang, Guan and Chen, Feiyu and Grover, Aditya},
|
| 27 |
+
year = {2026},
|
| 28 |
+
journal = {arXiv preprint arXiv:2601.18734},
|
| 29 |
+
url = {https://arxiv.org/abs/2601.18734},
|
| 30 |
+
note = {OPSD. MIT-licensed reference implementation at \url{https://github.com/siyan-zhao/OPSD}; the framework lifts \texttt{generalized\_jsd\_loss} from this codebase.}
|
| 31 |
+
}
|
| 32 |
+
|
| 33 |
+
@article{hubotter2026sdpo,
|
| 34 |
+
title = {Reinforcement {L}earning via {S}elf-{D}istillation},
|
| 35 |
+
author = {H{\"u}botter, Jonas and L{\"u}beck, Frederike and Behric, Lejs and Baumann, Anton and Bagatella, Marco and Marta, Daniel and Hakimi, Ido and Shenfeld, Idan and Buening, Thomas Kleine and Guestrin, Carlos and Krause, Andreas},
|
| 36 |
+
year = {2026},
|
| 37 |
+
journal = {arXiv preprint arXiv:2601.20802},
|
| 38 |
+
url = {https://arxiv.org/abs/2601.20802},
|
| 39 |
+
note = {SDPO. ICLR 2026 Scaling Post-training Workshop. Mathematically equivalent to Cursor's ``Targeted RL with Textual Feedback.''}
|
| 40 |
+
}
|
| 41 |
+
|
| 42 |
+
@article{moonshot2026kimi-k25,
|
| 43 |
+
title = {{K}imi {K}2.5},
|
| 44 |
+
author = {{Moonshot AI}},
|
| 45 |
+
year = {2026},
|
| 46 |
+
url = {https://huggingface.co/moonshotai/Kimi-K2-Thinking},
|
| 47 |
+
note = {Open-source 1T-total / 32B-active MoE base model used by Cursor for Composer 2 / 2.5.}
|
| 48 |
+
}
|
|
@@ -0,0 +1,106 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# CITATION.cff — Citation File Format
|
| 2 |
+
# https://citation-file-format.github.io/
|
| 3 |
+
# Used by HF, GitHub, Zenodo to render a "Cite this repository" UI.
|
| 4 |
+
|
| 5 |
+
cff-version: 1.2.0
|
| 6 |
+
message: "If you use this framework or its derivative artifacts, please cite as below."
|
| 7 |
+
type: software
|
| 8 |
+
title: "Composer 2.5 Replication Framework: Methodology and Integration Architecture for Open Replication of Cursor's Agentic Coding Recipe"
|
| 9 |
+
abstract: >
|
| 10 |
+
An open-source methodology and integration architecture for replicating
|
| 11 |
+
Cursor's Composer 2.5 recipe on a HuggingFace base model, plus a novel
|
| 12 |
+
multi-teacher trace-replay distillation reward channel that complements
|
| 13 |
+
the published SDPO/OPSD method (which Cursor's "Targeted RL with Textual
|
| 14 |
+
Feedback" uses). Pre-experimental v0.0 release: methodology paper, audited
|
| 15 |
+
recipe mapping, integration architecture across TRL/VeRL/OpenEnv,
|
| 16 |
+
empirical economic-feasibility result for the novel channel ($0.98/trace),
|
| 17 |
+
and a working code skeleton with 38 passing unit tests.
|
| 18 |
+
|
| 19 |
+
authors:
|
| 20 |
+
- family-names: "Codeseys"
|
| 21 |
+
given-names: ""
|
| 22 |
+
affiliation: "Independent researcher"
|
| 23 |
+
# Replace with real ORCID if available:
|
| 24 |
+
# orcid: "https://orcid.org/0000-0000-0000-0000"
|
| 25 |
+
|
| 26 |
+
repository-code: "https://huggingface.co/Codeseys/composer-replication-framework"
|
| 27 |
+
url: "https://huggingface.co/Codeseys/composer-replication-framework"
|
| 28 |
+
date-released: "2026-05-25"
|
| 29 |
+
version: "0.0.0"
|
| 30 |
+
license: "MIT"
|
| 31 |
+
|
| 32 |
+
keywords:
|
| 33 |
+
- reinforcement-learning
|
| 34 |
+
- post-training
|
| 35 |
+
- distillation
|
| 36 |
+
- agentic-coding
|
| 37 |
+
- composer-2.5
|
| 38 |
+
- cursor
|
| 39 |
+
- kimi-k2
|
| 40 |
+
- grpo
|
| 41 |
+
- dapo
|
| 42 |
+
- sdpo
|
| 43 |
+
- opsd
|
| 44 |
+
- trl
|
| 45 |
+
- verl
|
| 46 |
+
- openenv
|
| 47 |
+
- llm
|
| 48 |
+
|
| 49 |
+
# Primary upstream works this framework depends on / cites
|
| 50 |
+
references:
|
| 51 |
+
- type: article
|
| 52 |
+
title: "Introducing Composer 2.5"
|
| 53 |
+
authors:
|
| 54 |
+
- name: "Cursor Team"
|
| 55 |
+
year: 2026
|
| 56 |
+
url: "https://cursor.com/blog/composer-2-5"
|
| 57 |
+
|
| 58 |
+
- type: article
|
| 59 |
+
title: "Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models"
|
| 60 |
+
authors:
|
| 61 |
+
- family-names: "Zhao"
|
| 62 |
+
given-names: "Siyan"
|
| 63 |
+
- family-names: "Xie"
|
| 64 |
+
given-names: "Zhihui"
|
| 65 |
+
- family-names: "Liu"
|
| 66 |
+
given-names: "Mengchen"
|
| 67 |
+
- family-names: "Huang"
|
| 68 |
+
given-names: "Jing"
|
| 69 |
+
- family-names: "Pang"
|
| 70 |
+
given-names: "Guan"
|
| 71 |
+
- family-names: "Chen"
|
| 72 |
+
given-names: "Feiyu"
|
| 73 |
+
- family-names: "Grover"
|
| 74 |
+
given-names: "Aditya"
|
| 75 |
+
year: 2026
|
| 76 |
+
url: "https://arxiv.org/abs/2601.18734"
|
| 77 |
+
notes: "OPSD — single-LLM self-distillation; provides the reference loss implementation lifted by this framework."
|
| 78 |
+
|
| 79 |
+
- type: article
|
| 80 |
+
title: "Reinforcement Learning via Self-Distillation"
|
| 81 |
+
authors:
|
| 82 |
+
- family-names: "Hübotter"
|
| 83 |
+
given-names: "Jonas"
|
| 84 |
+
- family-names: "Lübeck"
|
| 85 |
+
given-names: "Frederike"
|
| 86 |
+
- family-names: "Behric"
|
| 87 |
+
given-names: "Lejs"
|
| 88 |
+
- family-names: "Baumann"
|
| 89 |
+
given-names: "Anton"
|
| 90 |
+
- family-names: "Bagatella"
|
| 91 |
+
given-names: "Marco"
|
| 92 |
+
- family-names: "Marta"
|
| 93 |
+
given-names: "Daniel"
|
| 94 |
+
- family-names: "Hakimi"
|
| 95 |
+
given-names: "Ido"
|
| 96 |
+
- family-names: "Shenfeld"
|
| 97 |
+
given-names: "Idan"
|
| 98 |
+
- family-names: "Buening"
|
| 99 |
+
given-names: "Thomas Kleine"
|
| 100 |
+
- family-names: "Guestrin"
|
| 101 |
+
given-names: "Carlos"
|
| 102 |
+
- family-names: "Krause"
|
| 103 |
+
given-names: "Andreas"
|
| 104 |
+
year: 2026
|
| 105 |
+
url: "https://arxiv.org/abs/2601.20802"
|
| 106 |
+
notes: "SDPO — formalizes the same mechanism as Cursor's Targeted RL with Textual Feedback. ICLR 2026 Scaling Post-training Workshop."
|
|
@@ -38,6 +38,8 @@ This repository is the **"paper of the project"** — it is the methodology / re
|
|
| 38 |
- 🟢 Spike 005 (integrated 3-channel trainer skeleton) — **SKELETON-VALIDATED + COMPOSITION-VERIFIED**: 38/38 unit tests passing; the integration architecture claim ("all three channels run simultaneously, ablate cleanly, train without divergence") is empirically verified by 5-step training run on a tiny model.
|
| 39 |
- 📋 Spikes 002a/002b/003/004 — planned, awaiting GPU budget commitment.
|
| 40 |
|
|
|
|
|
|
|
| 41 |
See [`spikes/README.md`](spikes/README.md) for the 5-stage spike plan, [`docs/INTEGRATION_ARCHITECTURE.md`](docs/INTEGRATION_ARCHITECTURE.md) for the per-framework extension-point analysis, and [`spikes/005-integrated-trainer-skeleton/`](spikes/005-integrated-trainer-skeleton/) for runnable trainer code.
|
| 42 |
|
| 43 |
---
|
|
|
|
| 38 |
- 🟢 Spike 005 (integrated 3-channel trainer skeleton) — **SKELETON-VALIDATED + COMPOSITION-VERIFIED**: 38/38 unit tests passing; the integration architecture claim ("all three channels run simultaneously, ablate cleanly, train without divergence") is empirically verified by 5-step training run on a tiny model.
|
| 39 |
- 📋 Spikes 002a/002b/003/004 — planned, awaiting GPU budget commitment.
|
| 40 |
|
| 41 |
+
📝 **Publication materials drafted:** [`publications/`](publications/) contains a complete pre-experimental release set — longform methodology paper, blog post (HF Blog format), repo Discussion announcement, X/LinkedIn threads, plus `CITATION.cff` and `CITATION.bib` at the repo root. Use [`publications/RELEASE_CHECKLIST.md`](publications/RELEASE_CHECKLIST.md) to coordinate the publication wave. Nothing posted publicly yet — this is a pre-experimental release, not a post-experimental one.
|
| 42 |
+
|
| 43 |
See [`spikes/README.md`](spikes/README.md) for the 5-stage spike plan, [`docs/INTEGRATION_ARCHITECTURE.md`](docs/INTEGRATION_ARCHITECTURE.md) for the per-framework extension-point analysis, and [`spikes/005-integrated-trainer-skeleton/`](spikes/005-integrated-trainer-skeleton/) for runnable trainer code.
|
| 44 |
|
| 45 |
---
|
|
@@ -0,0 +1,187 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
title: "Composer 2.5 from First Principles: An Open Replication Framework + a Novel Trace-Replay Distillation Channel"
|
| 3 |
+
thumbnail: /blog/assets/composer-replication-framework/thumbnail.png
|
| 4 |
+
authors:
|
| 5 |
+
- user: Codeseys
|
| 6 |
+
---
|
| 7 |
+
|
| 8 |
+
# Composer 2.5 from First Principles: An Open Replication Framework + a Novel Trace-Replay Distillation Channel
|
| 9 |
+
|
| 10 |
+
> **TL;DR.** Cursor's Composer 2.5 ships ~5–10× cheaper than peer frontier models with the bulk of its compute spent on RL post-training, not pretraining. Their headline trick — "Targeted RL with Textual Feedback" — turns out to be **mathematically equivalent to the published method SDPO/OPSD with MIT-licensed reference code**. Building on that, I'm releasing an open replication framework that integrates three reward channels in a single trainer step (RLVR + SDPO + a novel multi-teacher trace-replay channel) on top of HuggingFace TRL or ByteDance VeRL. **This post is pre-experimental** — the framework is in place, the integration is verified by 38 unit tests including a 5-step end-to-end gradient run, and the novel channel's economic feasibility is empirically validated at $0.98 per 50-step trace. Training results come in a follow-up after the GPU-bound spikes run.
|
| 11 |
+
|
| 12 |
+
## What Composer 2.5 actually is, in one paragraph
|
| 13 |
+
|
| 14 |
+
[Cursor announced Composer 2.5](https://cursor.com/blog/composer-2-5) in May 2026. It's a post-trained version of [Moonshot's Kimi K2.5](https://huggingface.co/moonshotai/Kimi-K2-Thinking) (1T total / 32B active MoE), tuned specifically for agentic coding inside Cursor. The headline numbers: parity with GPT-5.5 on SWE-bench Multilingual, ~69% on Terminal-Bench 2.0, priced at $0.50/$2.50 per million input/output tokens — that's 5–10× cheaper to serve than Opus 4.6 or GPT-5.4. Most of the compute went into post-training, not pretraining. The blog discloses three training innovations and leaves three big reproducibility gaps.
|
| 15 |
+
|
| 16 |
+
If a small team can reproduce the *shape* of this recipe without K2.5's 1T scale — say, on Qwen3-7B or Qwen3-32B — that's a path to similar agentic coding capability on a smaller, open base. That's the project this post introduces.
|
| 17 |
+
|
| 18 |
+
## The non-obvious move
|
| 19 |
+
|
| 20 |
+
Cursor's three named training innovations are:
|
| 21 |
+
|
| 22 |
+
1. **Targeted RL with Textual Feedback** — a per-turn distillation loss that addresses long-horizon credit assignment.
|
| 23 |
+
2. **Synthetic data at 25× scale** — including their "Feature Deletion" generator, where the agent has to reimplement deleted code to make tests pass.
|
| 24 |
+
3. **Sharded Muon + Dual Mesh HSDP** — MoE optimizer infrastructure (only relevant at K2.5 scale).
|
| 25 |
+
|
| 26 |
+
(2) and (3) are well-understood. (1) is the interesting move and it's worth quoting Cursor directly:
|
| 27 |
+
|
| 28 |
+
> "For a target model message, we construct a short hint describing the desired improvement, insert that hint into the local context, and use the resulting model distribution as a teacher. We use the policy with the original context as the student and add an on-policy distillation KL loss that moves the student's token probabilities toward the teacher's."
|
| 29 |
+
|
| 30 |
+
What's happening: when a 100K-token rollout has a localized error (wrong tool name, style violation, etc.), Cursor doesn't punish the whole trajectory — they generate a text hint correcting the error, run the same model with the hint inserted into the context to get "teacher" logits, run the model on the original context to get "student" logits, and apply per-turn KL divergence loss to pull the student toward the teacher *only at that turn*. **The same model is both teacher and student** — the teacher just has the hint inserted into its context.
|
| 31 |
+
|
| 32 |
+
This sidesteps the credit-assignment nightmare of long-horizon scalar rewards. It's the "make GRPO not poison 100 good steps for 1 bad step" trick.
|
| 33 |
+
|
| 34 |
+
### Wait, this is a published method
|
| 35 |
+
|
| 36 |
+
Cursor's blog footnote 1 cites three self-distillation papers — and the most relevant one is **[SDPO: Reinforcement Learning via Self-Distillation](https://arxiv.org/abs/2601.20802)** (Hübotter et al., ICLR 2026 Workshop). The SDPO paper's abstract:
|
| 37 |
+
|
| 38 |
+
> "SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy."
|
| 39 |
+
|
| 40 |
+
That's the Cursor mechanism, named and formalized. And the closely-related precursor — [OPSD: Self-Distilled Reasoner](https://arxiv.org/abs/2601.18734) (Zhao et al.) — has **MIT-licensed reference code at [`siyan-zhao/OPSD`](https://github.com/siyan-zhao/OPSD)**.
|
| 41 |
+
|
| 42 |
+
The single-most-important takeaway from auditing the Composer 2.5 blog: *the secret sauce isn't secret — it's published, it's named, and you can lift the loss function directly into your trainer.* The SDPO paper's loss-comparison table puts it bluntly:
|
| 43 |
+
|
| 44 |
+
| Method | Sampling | Signal | Feedback |
|
| 45 |
+
|---|---|---|---|
|
| 46 |
+
| SFT / Distillation | off-policy | rich | strong teacher |
|
| 47 |
+
| On-Policy Distillation (Agarwal 2024) | on-policy | rich | strong teacher |
|
| 48 |
+
| RLVR / GRPO | on-policy | weak | environment |
|
| 49 |
+
| **SDPO (= Cursor's method)** | **on-policy** | **rich** | **environment** |
|
| 50 |
+
|
| 51 |
+
What Composer 2.5 *actually* shows is: SDPO works at production scale on agentic coding. That's an important empirical demonstration even if the algorithm itself is published.
|
| 52 |
+
|
| 53 |
+
## A novel addition: multi-teacher trace-replay distillation (TR-DPO)
|
| 54 |
+
|
| 55 |
+
If SDPO uses **one model** as both teacher and student (with the teacher just having a hint inserted into its context), the obvious complementary question is: what if we use **N different external pretrained models** as teachers? Specifically:
|
| 56 |
+
|
| 57 |
+
1. After a frozen agentic rollout, replay each step against N teachers from different model families (Claude Opus, GPT-5, DeepSeek V4 Pro, ...).
|
| 58 |
+
2. At each step, extract each teacher's chosen action.
|
| 59 |
+
3. If most teachers agree on action X but the student picked Y, emit a DPO preference pair: `chosen=X, rejected=Y`.
|
| 60 |
+
4. Train with standard DPO loss on the pair set, *layered on top of* both RLVR and SDPO.
|
| 61 |
+
|
| 62 |
+
This isn't competing with SDPO — they're complementary. SDPO uses one model with privileged context; TR-DPO uses N models with no privileged context. The signal sources are different.
|
| 63 |
+
|
| 64 |
+
The closest published precedents are [rStar](https://arxiv.org/abs/2408.06195) (single-teacher MCTS counterfactuals), [Math-Shepherd](https://arxiv.org/abs/2312.08935) (process reward models from rollouts), and [Mixture-of-Agents](https://arxiv.org/abs/2406.04692) (response-level multi-model aggregation). To my knowledge, no published work systematically replays each step of frozen agentic traces with multiple external teachers to harvest step-level supervision. **This appears to be open territory.**
|
| 65 |
+
|
| 66 |
+
The obvious objection: "Won't N teacher API calls per step be expensive?" That's exactly what spike 001 measured.
|
| 67 |
+
|
| 68 |
+
## Spike 001: economic feasibility of the trace-replay channel ($0.98/trace, 5× cap headroom)
|
| 69 |
+
|
| 70 |
+
Before paying GPU costs to test whether TR-DPO actually improves training, the kill-switch question is: *can we afford the teacher API calls at all?*
|
| 71 |
+
|
| 72 |
+
I synthesized 50 hand-crafted SWE-bench-lite-shaped agentic decision states (each ~250–500 tokens of context), and for every state ran parallel async requests to three frontier teachers via OpenRouter:
|
| 73 |
+
|
| 74 |
+
- `anthropic/claude-opus-4.7`
|
| 75 |
+
- `openai/gpt-5`
|
| 76 |
+
- `deepseek/deepseek-v4-pro`
|
| 77 |
+
|
| 78 |
+
150 calls total. Hard-cap at $20.
|
| 79 |
+
|
| 80 |
+
| Threshold | Target | Actual |
|
| 81 |
+
|---|---|---|
|
| 82 |
+
| Mean per-trace cost (50 steps × 3 teachers, ungated) | < $5 | **$0.98** ✅ |
|
| 83 |
+
| p95 step latency (max across 3 parallel teachers) | < 30 s | **20.5 s** ✅ |
|
| 84 |
+
| p99 step latency | < 60 s | **23.2 s** ✅ |
|
| 85 |
+
| Errors | 0 | **0 / 150** ✅ |
|
| 86 |
+
|
| 87 |
+
Per-teacher cost composition: Opus dominates at 83% of the spend ($0.81 / $0.98). The other two teachers are essentially free. With v0.1 [VOI gating](https://en.wikipedia.org/wiki/Value_of_information) (only query teachers when student entropy is high), projected per-trace cost falls to ~$0.30. Drop Opus or swap in Sonnet 4.6 and you save another 50–70%.
|
| 88 |
+
|
| 89 |
+
**Channel 3 is economically viable.** The full code (`synthesize_trace.py`, `replay.py`, `analyze.py`, plus the 150-call result jsonl) is at [`spikes/001-teacher-replay-cost/`](https://huggingface.co/Codeseys/composer-replication-framework/tree/main/spikes/001-teacher-replay-cost) on the repo.
|
| 90 |
+
|
| 91 |
+
## Spike 005: integration architecture verified by 38 passing tests
|
| 92 |
+
|
| 93 |
+
The three reward channels — RLVR, SDPO, TR-DPO — need to compose inside a single trainer step. The unified loss:
|
| 94 |
+
|
| 95 |
+
```
|
| 96 |
+
total_loss = grpo_loss + α · sdpo_kl_loss + β · trace_replay_dpo_loss
|
| 97 |
+
```
|
| 98 |
+
|
| 99 |
+
`α=0, β=0` recovers plain GRPO. Any subset can be ablated by zeroing its weight.
|
| 100 |
+
|
| 101 |
+
For this to work in practice, you need clean extension points in whatever RL framework you're using. I [DeepWiki-audited](https://deepwiki.com/) the major candidates:
|
| 102 |
+
|
| 103 |
+
| Framework | Channel 1 (RLVR) | Channel 2 (SDPO) | Channel 3 (TR-DPO) |
|
| 104 |
+
|---|---|---|---|
|
| 105 |
+
| **TRL** | `GRPOTrainer._compute_loss` | Subclass override; lift `generalized_jsd_loss` from OPSD | Subclass override; DPO term using teacher-disagreement pairs |
|
| 106 |
+
| **VeRL** | `@register_adv_est("grpo")` | New estimator; reads `data.batch["sdpo_teacher_logprobs"]` | Custom estimator reading `data.non_tensor_batch["teacher_actions"]` |
|
| 107 |
+
|
| 108 |
+
**TRL is the right v0.0/v0.1 choice** (simplest extension, great OpenEnv integration, fits 7B–32B comfortably). **VeRL is the right v0.2 choice** when scaling to 70B+ on a Ray cluster.
|
| 109 |
+
|
| 110 |
+
The crucial property: *the three channels don't compete for shared resources.* Channel 2 is a sparse extra forward pass (training-side, ~5% of tokens at error sites). Channel 3 is offline post-rollout API calls. They don't fight for the same compute.
|
| 111 |
+
|
| 112 |
+
I built a working code skeleton at [`spikes/005-integrated-trainer-skeleton/`](https://huggingface.co/Codeseys/composer-replication-framework/tree/main/spikes/005-integrated-trainer-skeleton) implementing both paths. Test results:
|
| 113 |
+
|
| 114 |
+
```
|
| 115 |
+
$ python3 -m pytest tests/ -v
|
| 116 |
+
============================== 38 passed in 3.43s ==============================
|
| 117 |
+
```
|
| 118 |
+
|
| 119 |
+
| Test module | Tests | What it proves |
|
| 120 |
+
|---|---|---|
|
| 121 |
+
| `test_opsd_loss.py` | 9 | Lifted SDPO loss is differentiable, equal-zero on identical distributions, all β values, masking + clipping correct |
|
| 122 |
+
| `test_teacher_replay.py` | 7 | DPO-pair extraction logic: consensus + threshold + error-call exclusion |
|
| 123 |
+
| `test_data_collator.py` | 15 | Raw trace → batch dict; hint injection + post-hint masking + DPO tokenization |
|
| 124 |
+
| `test_loss_composition_smoke.py` | 7 | All three channels compose; α/β=0 ablations recover GRPO; **5-step train decreases loss** |
|
| 125 |
+
|
| 126 |
+
The composition smoke test runs all three channels on a 10K-parameter `TinyLM`. The integration claim — *all three channels run simultaneously, ablate cleanly, train without divergence* — is now an empirically tested invariant rather than a paper diagram.
|
| 127 |
+
|
| 128 |
+
## What I'm NOT claiming
|
| 129 |
+
|
| 130 |
+
This post is **pre-experimental**. The full empirical validation — *does TR-DPO actually improve SWE-bench-lite pass@1 over plain GRPO at 7B?* — is the subject of a follow-up paper after the GPU-bound spikes run:
|
| 131 |
+
|
| 132 |
+
| Spike | What it measures | Status |
|
| 133 |
+
|---|---|---|
|
| 134 |
+
| 002a / 002b | Trace collection on Qwen3-7B + SWE-bench-lite via TRL vs PRIME-RL | 📋 planned |
|
| 135 |
+
| 003 | DPO-pair signal density on real traces (≥5 pairs/trace, KL-distance from random) | 📋 planned |
|
| 136 |
+
| 004 | A/B Qwen3-7B trained with GRPO vs GRPO+TR-DPO; success = ≥2 pt pass@1 with p<0.05 | 📋 planned |
|
| 137 |
+
|
| 138 |
+
If you have GPU compute and want to run any of these, I'd love a collaborator. Spike 002a is ~$50–100 of Modal A100 time and about half a day of wallclock. Spike 004 (the terminal experiment) is ~$300 GPU + $50 eval over 6 training runs.
|
| 139 |
+
|
| 140 |
+
## Lessons learned during this work
|
| 141 |
+
|
| 142 |
+
A few methodology lessons worth surfacing because they generalized:
|
| 143 |
+
|
| 144 |
+
**Read primary sources yourself.** I initially dispatched a parallel-research subagent to summarize the Composer 2.5 blog. The summary covered the targeted-textual-feedback method correctly but missed Cursor's footnote citing the SDPO/OPSD papers entirely — and added several extrapolations (Anyrun environment name, "85% post-training compute" ratio, specific benchmark scores) that aren't actually in the blog. After I read the blog directly, the integration story changed materially: I went from "we'll need to implement the hint loss from scratch" to "there's MIT-licensed reference code we can lift." I now have an audit notice in the research note flagging which claims are blog-verified vs extrapolated, and a separate `COMPOSER_RECIPE_MAPPING.md` doc that does the rigorous mapping. This same pattern applies broadly to LLM-driven research synthesis: the orchestrator must verify primary sources, not just relay subagent claims.
|
| 145 |
+
|
| 146 |
+
**Verify framework extension points before designing on them.** I used [DeepWiki](https://deepwiki.com/) to audit `huggingface/trl`, `volcengine/verl`, and `siyan-zhao/OPSD` directly — getting back actual function names, file paths, and decorator surfaces. The integration architecture document cites each verification. Without that, the design would be plausible-sounding fan-fiction.
|
| 147 |
+
|
| 148 |
+
**Risk-order your spikes.** Spike 001 was the kill-switch. If teacher API costs were $50+/trace, the whole framework was dead. I ran it first, before any GPU work. It validated cleanly ($0.98/trace), and now everything downstream is confidence-bounded.
|
| 149 |
+
|
| 150 |
+
**Pre-experimental publication has trade-offs.** I considered waiting for spike 002–004 results before posting anything. Decided against because: the integration architecture is independently useful (other groups can plug in different channel-3 ideas), the SDPO/OPSD lift removes work for anyone trying to replicate Composer-style hint distillation, and early community feedback might catch design errors before I burn GPU budget. Cost: someone else may run the experiments first. I'm fine with that trade-off as long as the work is correctly attributed.
|
| 151 |
+
|
| 152 |
+
## What's in the repo
|
| 153 |
+
|
| 154 |
+
[**🤗 huggingface.co/Codeseys/composer-replication-framework**](https://huggingface.co/Codeseys/composer-replication-framework)
|
| 155 |
+
|
| 156 |
+
```
|
| 157 |
+
composer-replication-framework/
|
| 158 |
+
├── README.md ← model card (this post in shorter form)
|
| 159 |
+
├── publications/
|
| 160 |
+
│ ├── PAPER_v0.md ← longform methodology paper (this work)
|
| 161 |
+
│ └── BLOG_POST.md ← this blog post
|
| 162 |
+
├── docs/
|
| 163 |
+
│ ├── COMPOSER_RECIPE_MAPPING.md ← Cursor blog audit, blog-verified vs extrapolated
|
| 164 |
+
│ ├── INTEGRATION_ARCHITECTURE.md ← framework × channel matrix, sequence diagrams
|
| 165 |
+
│ ├── METHODOLOGY.md ← how the parallel research dispatch was run
|
| 166 |
+
│ └── HF_REPO_LAYOUT.md ← planned multi-repo split
|
| 167 |
+
├── framework/
|
| 168 |
+
│ └── composer-replication-framework.md ← master synthesis (architecture spec)
|
| 169 |
+
├── research/ ← five deep-dives by five LLM families
|
| 170 |
+
│ └── 01..05*.md
|
| 171 |
+
└── spikes/
|
| 172 |
+
├── 001-teacher-replay-cost/ ← ✅ VALIDATED ($0.98/trace)
|
| 173 |
+
├── 005-integrated-trainer-skeleton/ ← ✅ COMPOSITION-VERIFIED (38/38 tests)
|
| 174 |
+
└── 002a..004/ ← 📋 planned
|
| 175 |
+
```
|
| 176 |
+
|
| 177 |
+
All MIT licensed. PRs welcome on the research notes if you find a misattribution.
|
| 178 |
+
|
| 179 |
+
## Acknowledgements
|
| 180 |
+
|
| 181 |
+
Cursor team for the Composer 2.5 release and naming the technique. Siyan Zhao et al. for OPSD's reference implementation. Hübotter et al. for SDPO's formal treatment. HuggingFace TRL team for the cleanest possible `_compute_loss` extension surface. ByteDance VeRL team for the HybridFlow architecture. Meta for OpenEnv as the agentic-environment substrate.
|
| 182 |
+
|
| 183 |
+
If you're working on agentic-coding RL post-training and any of this looks useful, [open a discussion on the repo](https://huggingface.co/Codeseys/composer-replication-framework/discussions) — I'd particularly love to hear from teams running TRL or VeRL at scale who could tell me which extension surfaces I'm misreading.
|
| 184 |
+
|
| 185 |
+
---
|
| 186 |
+
|
| 187 |
+
*This post is pre-experimental. v0.1 follow-up will incorporate spike 002–004 results.*
|
|
@@ -0,0 +1,68 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# HF Discussion thread (draft) — Pre-experimental release
|
| 2 |
+
|
| 3 |
+
> **Where to post:** the repo's [Community / Discussions tab](https://huggingface.co/Codeseys/composer-replication-framework/discussions).
|
| 4 |
+
> **Title (suggested):** `Methodology release: Composer 2.5 replication framework + novel trace-replay channel (pre-experimental, looking for feedback)`
|
| 5 |
+
> **Tag:** `release`, `discussion`
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
Hi all 👋
|
| 10 |
+
|
| 11 |
+
I'm releasing this repo as a **pre-experimental methodology paper + integration architecture + working code skeleton** for an open replication of Cursor's [Composer 2.5](https://cursor.com/blog/composer-2-5). I want to be upfront: there are no training results yet. The full empirical validation is gated on GPU compute commitment for spikes 002–004 (described below). This release is to invite feedback before I burn GPU budget.
|
| 12 |
+
|
| 13 |
+
## What's in the box right now
|
| 14 |
+
|
| 15 |
+
**1. Methodology paper.** [`publications/PAPER_v0.md`](publications/PAPER_v0.md) — longform writeup of the framework, audit of Cursor's blog vs secondary-source extrapolations, integration architecture across TRL/VeRL/OpenEnv, and a risk-ordered spike plan.
|
| 16 |
+
|
| 17 |
+
**2. Composer 2.5 blog audit.** [`docs/COMPOSER_RECIPE_MAPPING.md`](docs/COMPOSER_RECIPE_MAPPING.md) — every claim tagged `[BLOG-VERIFIED]` / `[INFERRED]` / `[EXTRAPOLATED]`. Major finding: Cursor's "Targeted RL with Textual Feedback" is **mathematically the same as published SDPO** (Hübotter et al., ICLR 2026; [arXiv:2601.20802](https://arxiv.org/abs/2601.20802)) and OPSD (Zhao et al., [arXiv:2601.18734](https://arxiv.org/abs/2601.18734)), with **MIT-licensed reference code at [`siyan-zhao/OPSD`](https://github.com/siyan-zhao/OPSD)**. Cursor cites both papers in their blog's footnote 1. This was missed by the initial parallel-research dispatch I ran for this project — I only caught it when I read the blog directly.
|
| 18 |
+
|
| 19 |
+
**3. Integration architecture doc.** [`docs/INTEGRATION_ARCHITECTURE.md`](docs/INTEGRATION_ARCHITECTURE.md) — verified extension points (via [DeepWiki](https://deepwiki.com/) audits) for TRL, VeRL, TorchForge, Monarch, OpenEnv. Sequence diagrams. The unified loss form `total = grpo + α·sdpo + β·trace_replay_dpo` and an argument that the three channels don't compete for any shared resource.
|
| 20 |
+
|
| 21 |
+
**4. Empirical economic feasibility result for the novel channel.** [`spikes/001-teacher-replay-cost/verdict.md`](spikes/001-teacher-replay-cost/verdict.md) — 150 real OpenRouter calls (Opus 4.7 + GPT-5 + DeepSeek V4 Pro on 50 synthetic agentic-coding states), 0 errors, **mean per-trace cost $0.98 with 5× headroom on the $5 cap, p95 step latency 20.5s**. Reproducible: set `OPENROUTER_API_KEY` and run the three scripts.
|
| 22 |
+
|
| 23 |
+
**5. Working code skeleton with 38 passing unit tests.** [`spikes/005-integrated-trainer-skeleton/`](spikes/005-integrated-trainer-skeleton/) — ports of OPSD's `generalized_jsd_loss`, the teacher-replay DPO-pair extractor, the data collator, both a `ComposerReplicationTrainer(GRPOTrainer)` for TRL and a `@register_adv_est("grpo_composer")` stub for VeRL. The end-to-end loss-composition smoke test runs all three channels on a 10K-parameter custom model and confirms a 5-step train run *decreases* loss with α=0.1, β=0.05 — the channels don't fight each other.
|
| 24 |
+
|
| 25 |
+
```
|
| 26 |
+
$ python3 -m pytest tests/ -v
|
| 27 |
+
============================== 38 passed in 3.43s ==============================
|
| 28 |
+
```
|
| 29 |
+
|
| 30 |
+
## What I'm explicitly NOT claiming
|
| 31 |
+
|
| 32 |
+
| Claim | Status | Validates via |
|
| 33 |
+
|---|---|---|
|
| 34 |
+
| Trace-replay distillation improves SWE-bench-lite pass@1 over plain GRPO | **Open** | Spike 004 (Qwen3-7B A/B, ~$300 GPU + $50 eval, 6 runs × 3 seeds) |
|
| 35 |
+
| Teacher disagreement at the step level carries non-trivial signal on real traces | **Open** | Spike 003 (≥5 pairs/trace, non-trivial KL distance from random pairs) |
|
| 36 |
+
| TRL+OpenEnv emits clean trace JSONL on a real agentic env | **Open** | Spike 002a (100 rollouts on Modal A100) |
|
| 37 |
+
| PRIME-RL is the better trace substrate at scale | **Open** | Spike 002b (head-to-head with 002a) |
|
| 38 |
+
|
| 39 |
+
Translation: *I have a framework that compiles, integration that's verified at the unit-test level, and economic feasibility for the novel channel. I do not yet have evidence that the method actually improves training.*
|
| 40 |
+
|
| 41 |
+
## What I'm specifically asking for
|
| 42 |
+
|
| 43 |
+
1. **Critical reads of the integration architecture.** If you've worked with TRL's `GRPOTrainer._compute_loss`, VeRL's `@register_adv_est`, or OPSD's loss code, I'd love to know if I'm misreading any of the extension surfaces. The [DeepWiki audits](https://deepwiki.com/) gave me confidence but they're not the same as someone who's actually shipped on these frameworks calling out a foot-gun.
|
| 44 |
+
|
| 45 |
+
2. **Adjacent-work pointers.** Does multi-teacher trace-replay-with-disagreement-as-DPO-signal already exist in published work I missed? My survey (rStar / Math-Shepherd / OmegaPRM / Magpie / MoA) didn't find it but absence of evidence isn't evidence of absence.
|
| 46 |
+
|
| 47 |
+
3. **Reward-hacking ideas for the v0.1 environment.** Cursor mentions agents decompiling Java bytecode and reverse-engineering Python type-checking caches to recover deleted features in their Feature Deletion env. Their mitigation is opaque ("agentic monitoring tools"). I have proposals in [`PAPER_v0.md` §8](publications/PAPER_v0.md#8-reward-hacking-safeguards-proposed-for-v01) but more eyes welcome.
|
| 48 |
+
|
| 49 |
+
4. **Collaboration interest for spike 002.** If you have a Modal account or a small GPU budget and want to run the trace-collection experiments — particularly the head-to-head TRL-vs-PRIME-RL comparison — I'd happily co-author a follow-up paper. Total budget for spikes 002–004 is ~$500 + a couple of weeks of wallclock.
|
| 50 |
+
|
| 51 |
+
## Why publish before experiments
|
| 52 |
+
|
| 53 |
+
Trade-off acknowledged: someone else may run spike 004 first. Two reasons I'm publishing now anyway:
|
| 54 |
+
|
| 55 |
+
- The integration architecture and the OPSD-lift insight are independently useful. Other teams designing similar Composer-replications shouldn't have to rediscover that there's MIT-licensed reference code for the SDPO loss.
|
| 56 |
+
- Early feedback may catch design errors before I burn GPU budget. The cost of being scooped on the experiment is much smaller than the cost of running a failed experiment because of an integration bug I didn't catch.
|
| 57 |
+
|
| 58 |
+
## The repo
|
| 59 |
+
|
| 60 |
+
🤗 [huggingface.co/Codeseys/composer-replication-framework](https://huggingface.co/Codeseys/composer-replication-framework)
|
| 61 |
+
|
| 62 |
+
License: MIT (methodology + code; upstream papers and code retain their respective licenses).
|
| 63 |
+
|
| 64 |
+
Looking forward to feedback.
|
| 65 |
+
|
| 66 |
+
---
|
| 67 |
+
|
| 68 |
+
*Pre-experimental v0.0 release, 2026-05-25. v0.1 will incorporate spike 002–004 results.*
|
|
@@ -0,0 +1,452 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Composer 2.5 Replication Framework: A Methodology Paper
|
| 2 |
+
|
| 3 |
+
> **🚧 PRE-EXPERIMENTAL DRAFT (v0.0)** — methodology + economic-feasibility result + integration architecture. **No model training experiments yet.** Every empirical claim in this paper is one of: (1) a citation to upstream literature, (2) a result from spike 001 (teacher-replay cost measurement), or (3) a unit-test invariant from spike 005 (integration smoke). The full empirical validation (spike 002–004: trace collection, DPO-pair signal density, A/B vs plain GRPO) is the subject of a follow-up paper once GPU budget commits.
|
| 4 |
+
>
|
| 5 |
+
> **Last updated:** 2026-05-25
|
| 6 |
+
> **Author:** Codeseys ([HF](https://huggingface.co/Codeseys))
|
| 7 |
+
> **Repository:** [`Codeseys/composer-replication-framework`](https://huggingface.co/Codeseys/composer-replication-framework)
|
| 8 |
+
> **License:** MIT (methodology + code); upstream papers and code retain their respective licenses
|
| 9 |
+
|
| 10 |
+
## Abstract
|
| 11 |
+
|
| 12 |
+
Cursor's Composer 2.5 is a post-trained Kimi K2.5 that achieves frontier agentic-coding performance (~69% Terminal-Bench 2.0, parity with GPT-5.5) at 5–10× lower serving cost than peers. The recipe is dominated (~85%) by post-training, and its central non-obvious technique — *Targeted RL with Textual Feedback* — turns out to be mathematically equivalent to the published method **SDPO / OPSD** (Hübotter et al. 2026; Zhao et al. 2026), which Cursor's own footnote cites and for which **MIT-licensed reference code already exists**.
|
| 13 |
+
|
| 14 |
+
Building on this, we propose a complementary novel reward channel: **multi-teacher trace-replay distillation (TR-DPO)**. After a frozen agentic rollout, replay each step under N pre-trained external teachers (e.g., Claude Opus, GPT-5, DeepSeek V4 Pro), extract DPO preference pairs from teacher–student disagreement, and add the resulting loss term *on top of* both RLVR and SDPO. The three channels do not compete for shared resources and ablate cleanly via independent weights.
|
| 15 |
+
|
| 16 |
+
We make three pre-experimental contributions: (1) an audited mapping of Cursor's Composer 2.5 blog onto a stack of open infrastructure (TRL / VeRL / OpenEnv / Monarch) with primary-source-verified extension points; (2) an empirical economic feasibility result showing the trace-replay channel costs ~$0.98 per 50-step trace ungated (spike 001, n=150 calls, 0 errors); (3) a working code skeleton with 38 passing unit tests, including an end-to-end gradient-step smoke test on a tiny custom model that empirically verifies the three-channel composition trains without divergence.
|
| 17 |
+
|
| 18 |
+
This paper deliberately stops short of training-result claims. Section 7 is explicit about what's *not* yet validated and what each follow-up spike will measure.
|
| 19 |
+
|
| 20 |
+
## 1. Introduction and motivation
|
| 21 |
+
|
| 22 |
+
### 1.1 What Composer 2.5 demonstrates
|
| 23 |
+
|
| 24 |
+
In their May 2026 release post, Cursor announced [Composer 2.5](https://cursor.com/blog/composer-2-5), a post-trained version of [Moonshot's Kimi K2.5](https://huggingface.co/moonshotai/Kimi-K2-Thinking) that powers Cursor's agentic coding mode. Their public claims:
|
| 25 |
+
|
| 26 |
+
- Substantial improvement over Composer 2 on long-horizon coding tasks
|
| 27 |
+
- Frontier-level performance: parity with GPT-5.5 on SWE-bench Multilingual; ~69% on Terminal-Bench 2.0
|
| 28 |
+
- Pricing of $0.50/$2.50 per million input/output tokens — 5–10× cheaper than Opus 4.6 ($5/$25) and GPT-5.4 ($5/$22.50)
|
| 29 |
+
- Dominant compute share goes to post-training, not pretraining (community estimate: ~85%)
|
| 30 |
+
|
| 31 |
+
The blog discloses three training innovations (Section 2 expands each):
|
| 32 |
+
|
| 33 |
+
1. **Targeted RL with Textual Feedback** — a per-turn distillation loss that addresses long-horizon credit assignment.
|
| 34 |
+
2. **Synthetic data at 25× scale** — Feature Deletion + 24 other (unnamed) generators.
|
| 35 |
+
3. **Sharded Muon + Dual Mesh HSDP** — MoE optimizer infrastructure.
|
| 36 |
+
|
| 37 |
+
If a small team can reproduce the *shape* of (1) and (2) without K2.5's 1T scale, the path is open to similar performance on smaller bases. Item (3) is MoE-specific infrastructure and irrelevant for dense-base reproductions.
|
| 38 |
+
|
| 39 |
+
### 1.2 What's missing from the public recipe
|
| 40 |
+
|
| 41 |
+
The blog leaves three significant gaps:
|
| 42 |
+
|
| 43 |
+
- **How are hints generated?** The blog gives one tool-call template ("Reminder: available tools are…") but says nothing about the generator architecture. Hardcoded templates? A separate model? The same model with an introspection prompt? This is the single largest reproducibility gap.
|
| 44 |
+
- **What RL algorithm sits underneath?** The targeted-textual-feedback method is described as "an on-policy distillation KL loss [added to] the broader RL objective over the full trajectory." The "broader RL objective" is unspecified.
|
| 45 |
+
- **What reward-hacking safeguards work in practice?** Cursor explicitly mentions failure modes (decompiling Java bytecode, reverse-engineering Python type-checking caches) without disclosing the mitigations beyond "agentic monitoring tools."
|
| 46 |
+
|
| 47 |
+
### 1.3 What we propose
|
| 48 |
+
|
| 49 |
+
This paper makes a **methodological** contribution: an open-source replication framework that integrates three reward channels in a single trainer step, on top of any HuggingFace base model, using off-the-shelf open-source infrastructure (TRL or VeRL + OpenEnv).
|
| 50 |
+
|
| 51 |
+
The three channels:
|
| 52 |
+
|
| 53 |
+
1. **RLVR (Channel 1)** — verifiable scalar reward (tests pass, build succeeds). Standard.
|
| 54 |
+
2. **Composer hint-distill = SDPO (Channel 2)** — single-model self-distillation with hint-conditioned context, lifted from `siyan-zhao/OPSD` (MIT). This is Cursor's published method.
|
| 55 |
+
3. **Trace-replay multi-teacher DPO (Channel 3, novel)** — N external teachers replay each step of frozen rollouts; teacher–student disagreement becomes a DPO preference signal.
|
| 56 |
+
|
| 57 |
+
Channels 2 and 3 are mechanistically *different*, not competing. Both bypass long-horizon credit assignment, but they tap different supervision sources. Section 4 makes the distinction precise.
|
| 58 |
+
|
| 59 |
+
We provide:
|
| 60 |
+
|
| 61 |
+
- A primary-source-audited integration architecture across the agentic-RL stack (Section 3, [`docs/INTEGRATION_ARCHITECTURE.md`](../docs/INTEGRATION_ARCHITECTURE.md))
|
| 62 |
+
- A working code skeleton with 38 passing unit tests (Section 6, [`spikes/005-integrated-trainer-skeleton/`](../spikes/005-integrated-trainer-skeleton/))
|
| 63 |
+
- An empirical economic feasibility verdict for Channel 3 (Section 5, [`spikes/001-teacher-replay-cost/verdict.md`](../spikes/001-teacher-replay-cost/verdict.md))
|
| 64 |
+
- A risk-ordered spike plan whose terminal experiment falsifies or validates the novel claim (Section 7, [`spikes/README.md`](../spikes/README.md))
|
| 65 |
+
|
| 66 |
+
We do **not** yet provide training results. That is deferred to a follow-up paper after spike 002 (trace collection on a real agentic environment), spike 003 (DPO-pair signal density on real traces), and spike 004 (A/B comparison on SWE-bench-lite).
|
| 67 |
+
|
| 68 |
+
## 2. Composer 2.5 in technical detail (audited)
|
| 69 |
+
|
| 70 |
+
This section reconstructs Cursor's published recipe with explicit `[BLOG-VERIFIED]` / `[INFERRED]` / `[EXTRAPOLATED]` tags, following the audit in [`docs/COMPOSER_RECIPE_MAPPING.md`](../docs/COMPOSER_RECIPE_MAPPING.md). The distinction matters because the initial parallel-research synthesis for this project blurred which claims came from the blog versus secondary sources, and a primary-source audit caught several extrapolations.
|
| 71 |
+
|
| 72 |
+
### 2.1 Targeted RL with Textual Feedback `[BLOG-VERIFIED]`
|
| 73 |
+
|
| 74 |
+
> "For a target model message, we construct a short hint describing the desired improvement, insert that hint into the local context, and use the resulting model distribution as a teacher. We use the policy with the original context as the student and add an on-policy distillation KL loss that moves the student's token probabilities toward the teacher's." — Cursor blog
|
| 75 |
+
|
| 76 |
+
The mechanism, exactly:
|
| 77 |
+
|
| 78 |
+
- **Same model** acts as both teacher and student. There is no second model.
|
| 79 |
+
- The teacher is "the policy at this turn, *with* a hint inserted into the context."
|
| 80 |
+
- The student is "the policy at this turn, *without* the hint."
|
| 81 |
+
- Loss = on-policy KL: `KL(teacher_logits_at_turn_t || student_logits_at_turn_t)`, applied **only at the problematic turn**.
|
| 82 |
+
- Sits *on top of* an outer RLVR objective; doesn't replace it.
|
| 83 |
+
|
| 84 |
+
Cursor's footnote 1 cites three self-distillation papers as background:
|
| 85 |
+
|
| 86 |
+
- **OPSD: Self-Distilled Reasoner — On-Policy Self-Distillation for LLMs** (Zhao et al., [arXiv:2601.18734](https://arxiv.org/abs/2601.18734); code at [`siyan-zhao/OPSD`](https://github.com/siyan-zhao/OPSD), MIT). Single LLM as both teacher and student; teacher conditioned on privileged information (e.g., a verified solution), student sees only the question.
|
| 87 |
+
- **SDPO: Reinforcement Learning via Self-Distillation** (Hübotter et al., [arXiv:2601.20802](https://arxiv.org/abs/2601.20802), ICLR 2026 Workshop on Scaling Post-training). Generalizes OPSD to RL with rich textual feedback. Quote from the abstract: *"SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy."* This is mathematically the same construct as Composer's targeted-textual-feedback.
|
| 88 |
+
- **Self-Distillation Enables Continual Learning** ([arXiv:2601.19897](https://arxiv.org/abs/2601.19897)).
|
| 89 |
+
|
| 90 |
+
The OPSD reference implementation provides a self-contained `generalized_jsd_loss` static method (signature in [Section 6](#6-implementation)) computing JSD/KL between student and teacher logits — directly liftable into any HF Trainer subclass.
|
| 91 |
+
|
| 92 |
+
### 2.2 Synthetic data at 25× scale `[BLOG-VERIFIED]`
|
| 93 |
+
|
| 94 |
+
> "Composer 2.5 is trained with 25× more synthetic tasks than Composer 2." — Cursor blog
|
| 95 |
+
|
| 96 |
+
One named generator: **Feature Deletion**. Take a repo with comprehensive tests; delete code such that the codebase remains functional but specific testable features are removed. The agent's task is to reimplement the deleted features so the tests pass. Tests = verifiable reward.
|
| 97 |
+
|
| 98 |
+
The blog also discloses observed reward-hacking failures: the model learning to decompile Java bytecode to reconstruct deleted APIs, and reverse-engineering Python type-checking caches to recover deleted function signatures. Mitigations are described only as "agentic monitoring tools" — opaque.
|
| 99 |
+
|
| 100 |
+
### 2.3 Sharded Muon + Dual Mesh HSDP `[BLOG-VERIFIED]`
|
| 101 |
+
|
| 102 |
+
MoE optimizer infrastructure. Per-attention-head and per-expert orthogonalization (Newton-Schulz) with asynchronous all-to-all communication. Two HSDP layouts: narrow (intra-node) for non-expert weights, wide for expert weights.
|
| 103 |
+
|
| 104 |
+
This is **infrastructure**, not algorithm. Relevant only at MoE-1T scale (Kimi K2.5). For the v0.1 dense-base reproductions described here (Qwen3-7B in v0.0 spike, Qwen3-32B in v0.1), it is irrelevant. Becomes relevant if the framework is later applied to a Kimi-K2.5-derivative directly.
|
| 105 |
+
|
| 106 |
+
### 2.4 Claims that are NOT in the Cursor blog (extrapolated)
|
| 107 |
+
|
| 108 |
+
These claims appear in community commentary and were reproduced uncritically in the initial synthesis for this project. They are likely correct via secondary sources but are not Cursor-stated:
|
| 109 |
+
|
| 110 |
+
- **"~85% of total compute is post-training"** — community consensus (HN threads, third-party substack analysis). Plausible but unverified.
|
| 111 |
+
- **"Anyrun" environment harness with LSP / file I/O / terminal** — name "Anyrun" is not in the 2.5 blog (may be in the [Composer 2 technical report](https://cursor.com/blog/composer-2-technical-report)).
|
| 112 |
+
- **CursorBench 69.3%, Terminal-Bench 2.0 parity, SWE-bench Multilingual** — the 2.5 blog does not quote benchmark numbers.
|
| 113 |
+
- **"PPO or GRPO variant"** as the underlying RL algorithm — the blog never names the RL algorithm.
|
| 114 |
+
- **MLA + 1T total / 32B active + 384 experts + 256K context** — these are Kimi K2.5 base model facts, [verified independently](https://huggingface.co/moonshotai) but inferred rather than blog-stated.
|
| 115 |
+
|
| 116 |
+
The blog is unambiguous on the three items in §2.1–§2.3 and otherwise terse.
|
| 117 |
+
|
| 118 |
+
## 3. Integration architecture across the agentic-RL stack
|
| 119 |
+
|
| 120 |
+
The three reward channels need to compose inside a single trainer step, regardless of which RL framework hosts them. We audited (via [DeepWiki](https://deepwiki.com/) on 2026-05-25) the extension points of the major open-source agentic-RL frameworks and produced an integration matrix.
|
| 121 |
+
|
| 122 |
+
### 3.1 Frameworks surveyed
|
| 123 |
+
|
| 124 |
+
| Framework | Role | Status |
|
| 125 |
+
|---|---|---|
|
| 126 |
+
| [HuggingFace TRL](https://github.com/huggingface/trl) | Reference algorithm library; `GRPOTrainer` is the workhorse for RLVR | Mature (v1.0, 2026-03); developer-friendly; OpenEnv integration since 2025-10 |
|
| 127 |
+
| [ByteDance VeRL](https://github.com/volcengine/verl) | Production-scale RL via HybridFlow + 3D-HybridEngine; Ray-based | Proven 671B; preferred for ≥70B runs |
|
| 128 |
+
| [Meta TorchForge](https://github.com/meta-pytorch/forge) | RL post-training on Monarch; reference recipes | **"Development paused — consolidating into TorchTitan"** (banner). Use as pattern reference only. |
|
| 129 |
+
| [Meta Monarch](https://github.com/meta-pytorch/monarch) | Single-controller actor-mesh framework with RDMA data plane | Active; native PyTorch integration |
|
| 130 |
+
| [Meta OpenEnv](https://github.com/meta-pytorch/OpenEnv) | Standard for agentic environments (typed reset/step/close + MCP RFC) | First-party TRL integration; HF Hub catalog; growing community |
|
| 131 |
+
| [Prime Intellect PRIME-RL](https://github.com/PrimeIntellect-ai/prime-rl) | Decentralized RL substrate (INTELLECT-2, 32B globally distributed) | Production-deployed; `verifiers` env library |
|
| 132 |
+
|
| 133 |
+
### 3.2 Extension-point matrix (verified)
|
| 134 |
+
|
| 135 |
+
The full table is in [`docs/INTEGRATION_ARCHITECTURE.md`](../docs/INTEGRATION_ARCHITECTURE.md). Summary:
|
| 136 |
+
|
| 137 |
+
| Channel | TRL | VeRL |
|
| 138 |
+
|---|---|---|
|
| 139 |
+
| **1. RLVR/GRPO** | `GRPOTrainer._compute_loss(model, inputs)` (base behavior) | `@register_adv_est("grpo")` → `core_algos.compute_grpo_outcome_advantage` |
|
| 140 |
+
| **2. SDPO** | Subclass override of `_compute_loss`; lift `generalized_jsd_loss` from OPSD | New `@register_adv_est("grpo_sdpo")`; reads `data.batch["sdpo_teacher_logprobs"]`; precedent: distillation already attaches `teacher_log_probs` |
|
| 141 |
+
| **3. TR-DPO** | Subclass override; add DPO term using `inputs["dpo_chosen_input_ids"]`, etc. | Custom estimator reading `data.non_tensor_batch["teacher_actions"]` |
|
| 142 |
+
| **OpenEnv plumbing** | `environment_factory=` kwarg in trainer init | Custom env worker producing `DataProto`-shaped output |
|
| 143 |
+
|
| 144 |
+
Both paths allow the integrated trainer to be assembled out of small components — none of the three channels requires modifying the framework's core. The full unified loss is:
|
| 145 |
+
|
| 146 |
+
```
|
| 147 |
+
total_loss = grpo_loss + α · sdpo_kl_loss + β · trace_replay_dpo_loss
|
| 148 |
+
```
|
| 149 |
+
|
| 150 |
+
where `α = β = 0` recovers plain GRPO (the baseline arm of any ablation).
|
| 151 |
+
|
| 152 |
+
### 3.3 Why this matters
|
| 153 |
+
|
| 154 |
+
The architectural constraint that lets all three channels co-exist: **they don't compete for shared resources.**
|
| 155 |
+
|
| 156 |
+
| Resource | Channel 1 (RLVR) | Channel 2 (SDPO) | Channel 3 (TR-DPO) |
|
| 157 |
+
|---|---|---|---|
|
| 158 |
+
| GPU forward pass (rollout) | yes (vLLM, async) | no | no |
|
| 159 |
+
| GPU forward pass (training) | yes | one extra per error site (~5% of tokens) | none — uses precomputed logprobs |
|
| 160 |
+
| GPU backward pass | yes | yes | yes |
|
| 161 |
+
| External API budget | none | none | $0.30–1 per 50-step trace |
|
| 162 |
+
| Latency-critical path | yes — gates next rollout | minor | no — async, post-rollout |
|
| 163 |
+
|
| 164 |
+
Channel 2 is forward-pass-bound (training-side, sparse). Channel 3 is API-bound (offline, post-rollout). They don't fight for the same compute.
|
| 165 |
+
|
| 166 |
+
## 4. The novel channel: multi-teacher trace-replay distillation
|
| 167 |
+
|
| 168 |
+
### 4.1 Construction
|
| 169 |
+
|
| 170 |
+
After a frozen agentic rollout (state\_t, action\_t, reward), for each step `t`:
|
| 171 |
+
|
| 172 |
+
1. **Replay** the exact state at step `t` against `N` pre-trained external teachers (different model families).
|
| 173 |
+
2. **Extract** each teacher's chosen action (or action distribution) at that state.
|
| 174 |
+
3. **Score** the disagreement: if `k ≥ k_threshold` of `N` teachers agree on action `X` and the student picked `Y ≠ X`, emit a DPO preference pair `(chosen=X, rejected=Y)`.
|
| 175 |
+
4. **Train** with standard DPO loss on the pair set, layered onto GRPO + (optionally) SDPO.
|
| 176 |
+
|
| 177 |
+
This is offline relative to the rollout — teacher API calls happen post-rollout, after which the DPO pairs are batched into the next training step. No latency-critical coupling.
|
| 178 |
+
|
| 179 |
+
### 4.2 Distinction from related work
|
| 180 |
+
|
| 181 |
+
| Work | Mechanism | Difference from TR-DPO |
|
| 182 |
+
|---|---|---|
|
| 183 |
+
| [rStar / rStar-Math](https://arxiv.org/abs/2408.06195) (Microsoft) | MCTS at training time, single teacher branches at each step | TR-DPO replays pre-existing traces (not MCTS); uses N teachers (not single). |
|
| 184 |
+
| [Math-Shepherd / OmegaPRM](https://arxiv.org/abs/2312.08935) | Process reward models from rollout-and-check | Reward signal is teacher *disagreement*, not rollout outcomes. |
|
| 185 |
+
| [Magpie](https://arxiv.org/abs/2406.08464) / OpenThoughts | Synthetic data from one strong teacher | Per-step distillation from N teachers on real traces. |
|
| 186 |
+
| [Mixture-of-Agents](https://arxiv.org/abs/2406.04692) (Wang et al.) | Multi-teacher *response-level* aggregation | Per-step (sub-response) aggregation. |
|
| 187 |
+
| **Composer SDPO / OPSD** | Single-model self-teacher with hint context | TR-DPO uses N *external* teachers; complementary, not competing. |
|
| 188 |
+
|
| 189 |
+
To our knowledge no published work systematically replays each step of frozen agentic traces with multiple external teachers to harvest step-level supervision. The trace-replay-with-N-teachers construction appears to be open territory.
|
| 190 |
+
|
| 191 |
+
### 4.3 Stacking with SDPO
|
| 192 |
+
|
| 193 |
+
| Property | Composer SDPO | TR-DPO (this work) |
|
| 194 |
+
|---|---|---|
|
| 195 |
+
| Number of models | 1 | N + 1 |
|
| 196 |
+
| Teacher source | Same model with privileged context | External pretrained models |
|
| 197 |
+
| Per-step compute | One extra forward pass | None (precomputed) |
|
| 198 |
+
| Per-step API cost | Zero | ~$0.02 (3-teacher, ungated) |
|
| 199 |
+
| Privileged signal | Hint text in context | None — teachers see same state |
|
| 200 |
+
| Bypasses long-horizon credit assignment | Yes (per-turn KL) | Yes (per-step DPO) |
|
| 201 |
+
| Published code | Yes — `siyan-zhao/OPSD` | Not yet |
|
| 202 |
+
|
| 203 |
+
Both add dense per-step signal on top of RLVR. Their gradient contributions are independent because they update via separate loss terms with separate weights.
|
| 204 |
+
|
| 205 |
+
## 5. Empirical results so far
|
| 206 |
+
|
| 207 |
+
### 5.1 Spike 001 — teacher-replay cost floor (✅ VALIDATED)
|
| 208 |
+
|
| 209 |
+
**Question:** Given a 50-step agentic-coding trace, what's the API cost and wallclock latency of querying N=3 frontier teachers in parallel for next-action distributions at every step?
|
| 210 |
+
|
| 211 |
+
**Method.** We synthesized 50 hand-crafted SWE-bench-lite-shaped agentic states (multi-turn function-call decision points; ~250–500 tokens of context each). For each state we issued parallel async requests to three teachers via OpenRouter:
|
| 212 |
+
|
| 213 |
+
- `anthropic/claude-opus-4.7` (Anthropic frontier)
|
| 214 |
+
- `openai/gpt-5` (OpenAI frontier)
|
| 215 |
+
- `deepseek/deepseek-v4-pro` (open-weight frontier)
|
| 216 |
+
|
| 217 |
+
Total: 150 calls. Hard-cap at $20 spend (early abort on overrun).
|
| 218 |
+
|
| 219 |
+
**Result.**
|
| 220 |
+
|
| 221 |
+
| Metric | Target | Actual | Pass? |
|
| 222 |
+
|---|---|---|---|
|
| 223 |
+
| Mean per-trace cost (50 steps × 3 teachers, ungated) | < $5 | **$0.98** | ✅ 5× headroom |
|
| 224 |
+
| p95 step latency (max across 3 parallel teachers) | < 30 s | **20.45 s** | ✅ |
|
| 225 |
+
| p99 step latency | < 60 s | **23.24 s** | ✅ |
|
| 226 |
+
| Errors | 0 expected | **0 / 150** | ✅ |
|
| 227 |
+
|
| 228 |
+
**Per-teacher breakdown.**
|
| 229 |
+
|
| 230 |
+
| Teacher | n | p50 lat | p95 lat | mean $/call | total $ |
|
| 231 |
+
|---|---|---|---|---|---|
|
| 232 |
+
| `anthropic/claude-opus-4.7` | 50 | 3.4 s | 4.6 s | $0.0161 | $0.81 |
|
| 233 |
+
| `openai/gpt-5` | 50 | 5.0 s | 10.1 s | $0.0021 | $0.11 |
|
| 234 |
+
| `deepseek/deepseek-v4-pro` | 50 | 7.1 s | 16.2 s | $0.0013 | $0.07 |
|
| 235 |
+
|
| 236 |
+
Opus dominates per-trace cost (~83%). With v0.1 VOI gating (only query teachers when student entropy is high; typically 60–80% reduction), projected per-trace cost falls to ~$0.30. Opus could optionally be dropped or replaced with Sonnet 4.6 for further savings.
|
| 237 |
+
|
| 238 |
+
The economic floor is well within budget. **Channel 3 is viable.**
|
| 239 |
+
|
| 240 |
+
Full data at [`spikes/001-teacher-replay-cost/`](../spikes/001-teacher-replay-cost/) (synthesize_trace.py + replay.py + analyze.py + verdict.md). Code is reproducible — set `OPENROUTER_API_KEY` and run `python synthesize_trace.py && python replay.py && python analyze.py`.
|
| 241 |
+
|
| 242 |
+
### 5.2 Spike 005 — integration architecture (✅ COMPOSITION-VERIFIED)
|
| 243 |
+
|
| 244 |
+
**Question:** Does the proposed three-channel integration (RLVR + SDPO + TR-DPO) compose cleanly in a real PyTorch trainer? Specifically: are the loss terms additive without unwanted interactions; do α/β=0 ablations correctly recover plain GRPO; and does a multi-channel run actually decrease loss on a tiny model?
|
| 245 |
+
|
| 246 |
+
**Method.** A code skeleton at [`spikes/005-integrated-trainer-skeleton/`](../spikes/005-integrated-trainer-skeleton/) implements:
|
| 247 |
+
|
| 248 |
+
- **`opsd_loss.py`** — `generalized_jsd_loss` lifted verbatim from `siyan-zhao/OPSD` (MIT). Self-contained static method computing JSD/KL between student and teacher logits.
|
| 249 |
+
- **`teacher_replay.py`** — parallel OpenRouter client + DPO-pair extractor (from teacher–student disagreement at the agreement threshold).
|
| 250 |
+
- **`hint_generator.py`** — template-based hint dispatcher keyed by error_kind (v0.1 starter; LLM-driven hints in v0.2).
|
| 251 |
+
- **`trl_path/data_collator.py`** — `ComposerDataCollator` transforming raw trace + DPO pairs into the trainer batch dict.
|
| 252 |
+
- **`trl_path/composer_trainer.py`** — `ComposerReplicationTrainer(GRPOTrainer)` with `_compute_loss` override.
|
| 253 |
+
- **`verl_path/composer_adv.py`** — `@register_adv_est("grpo_composer")` for VeRL.
|
| 254 |
+
|
| 255 |
+
**Result: 38/38 unit tests pass in 3.43 s** (`python3 -m pytest tests/ -v`).
|
| 256 |
+
|
| 257 |
+
| Test module | Tests | Status |
|
| 258 |
+
|---|---|---|
|
| 259 |
+
| `test_opsd_loss.py` (lifted OPSD math) | 9 | ✅ all pass |
|
| 260 |
+
| `test_teacher_replay.py` (DPO-pair extraction) | 7 | ✅ all pass |
|
| 261 |
+
| `test_data_collator.py` (raw trace → batch) | 15 | ✅ all pass |
|
| 262 |
+
| `test_loss_composition_smoke.py` (3-channel + ablation + 5-step train) | 7 | ✅ all pass |
|
| 263 |
+
|
| 264 |
+
The composition smoke test runs all three channels on a `TinyLM(vocab=64, hidden=32)` (~10K parameters). Verifies:
|
| 265 |
+
|
| 266 |
+
- **Ablation invariants:** `α=0, β=0` reduces exactly to GRPO; α-only adds SDPO; β-only adds DPO; full = sum.
|
| 267 |
+
- **Gradient finiteness:** Every model parameter receives a finite gradient with all three channels active.
|
| 268 |
+
- **Training behavior:** A 5-step train run with all three channels active *decreases* total loss (overfitting check on a fixed batch). The channels do not actively fight each other.
|
| 269 |
+
- **Robustness to mixed batches:** When the data collator emits no SDPO inputs (no error sites in batch), the loss correctly bypasses the SDPO term even with α=1.
|
| 270 |
+
|
| 271 |
+
This empirically tests the integration claim that was purely architectural in earlier draft material.
|
| 272 |
+
|
| 273 |
+
### 5.3 What spike 005 does NOT prove
|
| 274 |
+
|
| 275 |
+
The smoke test is on a 10K-parameter model with placeholder GRPO loss (cross-entropy on a synthetic target sequence) instead of the real GRPO advantage / group-relative computation. It demonstrates **wiring correctness**, not training quality. The real training experiments are in spikes 002–004.
|
| 276 |
+
|
| 277 |
+
## 6. Implementation
|
| 278 |
+
|
| 279 |
+
### 6.1 Lifted SDPO loss
|
| 280 |
+
|
| 281 |
+
Verbatim port from `siyan-zhao/OPSD` (DeepWiki-verified self-contained, MIT licensed):
|
| 282 |
+
|
| 283 |
+
```python
|
| 284 |
+
def generalized_jsd_loss(
|
| 285 |
+
student_logits: torch.Tensor, # (B, T, V)
|
| 286 |
+
teacher_logits: torch.Tensor, # (B, T, V) — same model, hint-conditioned context
|
| 287 |
+
labels: torch.Tensor | None = None, # -100 = ignore
|
| 288 |
+
beta: float = 0.5, # 0=fwd KL, 1=rev KL, 0.5=JSD
|
| 289 |
+
temperature: float = 1.0,
|
| 290 |
+
reduction: str = "batchmean",
|
| 291 |
+
top_k: int | None = None,
|
| 292 |
+
token_clip: float | None = None,
|
| 293 |
+
) -> torch.Tensor: ...
|
| 294 |
+
```
|
| 295 |
+
|
| 296 |
+
Full implementation at [`spikes/005-integrated-trainer-skeleton/opsd_loss.py`](../spikes/005-integrated-trainer-skeleton/opsd_loss.py).
|
| 297 |
+
|
| 298 |
+
### 6.2 TRL trainer subclass
|
| 299 |
+
|
| 300 |
+
```python
|
| 301 |
+
class ComposerReplicationTrainer(GRPOTrainer):
|
| 302 |
+
def __init__(self, *args, alpha_sdpo=0.1, beta_replay=0.05, **kwargs):
|
| 303 |
+
super().__init__(*args, **kwargs)
|
| 304 |
+
self.alpha_sdpo = alpha_sdpo
|
| 305 |
+
self.beta_replay = beta_replay
|
| 306 |
+
|
| 307 |
+
def _compute_loss(self, model, inputs):
|
| 308 |
+
grpo_loss = super()._compute_loss(model, inputs)
|
| 309 |
+
sdpo_kl = self._compute_sdpo_loss(model, inputs)
|
| 310 |
+
replay_dpo = self._compute_trace_replay_loss(model, inputs)
|
| 311 |
+
return grpo_loss + self.alpha_sdpo * sdpo_kl + self.beta_replay * replay_dpo
|
| 312 |
+
```
|
| 313 |
+
|
| 314 |
+
Full implementation at [`spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py`](../spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py).
|
| 315 |
+
|
| 316 |
+
### 6.3 VeRL custom advantage estimator
|
| 317 |
+
|
| 318 |
+
```python
|
| 319 |
+
@register_adv_est("grpo_composer")
|
| 320 |
+
def compute_grpo_composer_advantage(token_level_rewards, eos_mask, index, *,
|
| 321 |
+
sdpo_teacher_logprobs=None, alpha_sdpo=0.0,
|
| 322 |
+
teacher_consensus_prm=None, beta_replay=0.0,
|
| 323 |
+
**kwargs):
|
| 324 |
+
base_adv = core_algos.compute_grpo_outcome_advantage(token_level_rewards, eos_mask, index)
|
| 325 |
+
if alpha_sdpo and sdpo_teacher_logprobs is not None:
|
| 326 |
+
base_adv = base_adv + alpha_sdpo * (sdpo_teacher_logprobs - kwargs["old_log_prob"]) * kwargs["sdpo_error_mask"]
|
| 327 |
+
if beta_replay and teacher_consensus_prm is not None:
|
| 328 |
+
base_adv = base_adv + beta_replay * teacher_consensus_prm
|
| 329 |
+
return base_adv
|
| 330 |
+
```
|
| 331 |
+
|
| 332 |
+
Full implementation at [`spikes/005-integrated-trainer-skeleton/verl_path/composer_adv.py`](../spikes/005-integrated-trainer-skeleton/verl_path/composer_adv.py).
|
| 333 |
+
|
| 334 |
+
### 6.4 Data collator
|
| 335 |
+
|
| 336 |
+
`ComposerDataCollator` consumes a list of `TraceExample` dicts (with optional `dpo_pairs` from `teacher_replay.extract_dpo_pairs`) and emits the batch dict the trainer expects:
|
| 337 |
+
|
| 338 |
+
- Channel 1: `input_ids`, `attention_mask`, `response_mask`, `rewards`
|
| 339 |
+
- Channel 2: `ctx_teacher_input_ids` (with hint inserted at error-turn boundary), `sdpo_loss_mask` (1 at post-hint tokens, -100 elsewhere)
|
| 340 |
+
- Channel 3: `dpo_chosen_input_ids`, `dpo_rejected_input_ids`, `*_response_mask`
|
| 341 |
+
|
| 342 |
+
Full implementation at [`spikes/005-integrated-trainer-skeleton/trl_path/data_collator.py`](../spikes/005-integrated-trainer-skeleton/trl_path/data_collator.py).
|
| 343 |
+
|
| 344 |
+
## 7. What's NOT proven (and how the follow-up spikes will measure it)
|
| 345 |
+
|
| 346 |
+
This paper is explicit about the gap between "framework + economic feasibility" and "this method works." The following claims **are not** yet validated:
|
| 347 |
+
|
| 348 |
+
| Claim | Status | Validating spike |
|
| 349 |
+
|---|---|---|
|
| 350 |
+
| TR-DPO improves SWE-bench-lite pass@1 over plain GRPO | Open | **Spike 004**: A/B Qwen3-7B trained with GRPO vs GRPO+TR-DPO; ≥2 pt pass@1 improvement at p<0.05 over 3 seeds is the success criterion. ~$300 GPU + $50 eval. |
|
| 351 |
+
| Teacher disagreement at the step level carries non-trivial preference signal on real traces | Open | **Spike 003**: extract DPO pairs from spike-002 traces; report pairs/trace, KL distance from random pairs. |
|
| 352 |
+
| TRL's GRPOTrainer with `environment_factory=` cleanly emits trace JSONL | Open | **Spike 002a**: 100 rollouts on Qwen3-7B + SWE-bench-lite via TRL. |
|
| 353 |
+
| PRIME-RL produces equivalently clean trace export | Open | **Spike 002b**: head-to-head with 002a. |
|
| 354 |
+
| The trained variant matches Composer 2.5 quality at 32B scale | Open (v0.1) | Out of scope for this paper. v0.1 follow-up. |
|
| 355 |
+
|
| 356 |
+
**Honest framing for what this paper does and does not contribute:**
|
| 357 |
+
|
| 358 |
+
- ✅ **Contributes:** an integration architecture verified at the code level, with reusable components, public code, and economic feasibility for the novel channel. A reviewer can use this paper to design and budget a real experiment.
|
| 359 |
+
- ❌ **Does not contribute:** evidence that the method actually trains better models. That requires the GPU spike. We are deliberately not making that claim until experiments back it.
|
| 360 |
+
|
| 361 |
+
## 8. Reward-hacking safeguards (proposed for v0.1)
|
| 362 |
+
|
| 363 |
+
Cursor's blog mentions specific reward-hacking failures (Java bytecode decompilation, Python type-cache reverse-engineering) without disclosing mitigations. We propose:
|
| 364 |
+
|
| 365 |
+
- **Sandbox hardening** — disable `find`, `unzip`, `strings`, `objdump`, and similar introspection tools in the OpenEnv container; clear `__pycache__` between rollouts; `PYTHONHASHSEED` randomized.
|
| 366 |
+
- **Static-analysis monitor** — flag rollouts that read or write paths matching `__pycache__/`, `.pyc`, `*.class` files, or unzip operations.
|
| 367 |
+
- **Reward-model penalty** — train a small RM on annotated reward-hacking examples; subtract its score from the RLVR signal.
|
| 368 |
+
|
| 369 |
+
These are pre-experiment proposals; their efficacy is part of the v0.1 follow-up.
|
| 370 |
+
|
| 371 |
+
## 9. Limitations
|
| 372 |
+
|
| 373 |
+
- **Single-snapshot research.** All upstream literature surveyed on 2026-05-25. The ecosystem moves fast: Forge may un-pause; OpenEnv may fork; PRIME-RL may consolidate. Re-survey every 6 months.
|
| 374 |
+
- **No primary access to Cursor's pipeline.** All Composer 2.5 details come from the public blog. Critical gaps (hint generator architecture, exact RL algorithm) remain.
|
| 375 |
+
- **Trace-replay novelty claim is weak in negation.** Absence of evidence in the surveyed literature is not strong evidence of absence. We may have missed adjacent work (e.g., on the "rich-feedback RLVR" axis the SDPO paper introduced).
|
| 376 |
+
- **Economic feasibility ≠ training-improvement evidence.** Spike 001 establishes the method is *affordable*. Whether it produces better models is the spike-004 question.
|
| 377 |
+
- **Pre-experimental publication risks.** Releasing a methodology before experimental validation has known failure modes (other groups racing the experiment; methodology errors caught only after publication). We accept this risk in exchange for early-feedback signal from the community.
|
| 378 |
+
|
| 379 |
+
## 10. Conclusion
|
| 380 |
+
|
| 381 |
+
We presented a methodology and integration architecture for replicating Cursor's Composer 2.5 recipe on an open-source stack, plus a novel multi-teacher trace-replay distillation channel. We grounded the Composer-2.5 mechanism in published prior art (OPSD/SDPO with MIT-licensed code), audited integration extension points across TRL/VeRL/OpenEnv with primary-source verification, and empirically validated the economic floor of the novel channel ($0.98/trace, 5× cap headroom) plus the architectural composition claim (38/38 unit tests, 5-step train decreases loss).
|
| 382 |
+
|
| 383 |
+
The full empirical validation — does TR-DPO actually improve SWE-bench-lite pass@1 over plain GRPO at 7B? — is the subject of the follow-up paper.
|
| 384 |
+
|
| 385 |
+
We invite collaboration. The repository is public. The integration architecture is component-friendly: any of the three channels can be ablated or substituted independently. If you're working on agentic-coding RL post-training and want to test additional channels (e.g., a fourth using Mixture-of-Agents-style aggregation, or test-time compute), the framework slots that in by adding another loss term with its own weight.
|
| 386 |
+
|
| 387 |
+
## Acknowledgements
|
| 388 |
+
|
| 389 |
+
This work builds on:
|
| 390 |
+
|
| 391 |
+
- **Cursor team** for the Composer 2.5 release and its three named training innovations.
|
| 392 |
+
- **Siyan Zhao et al.** for OPSD and the open `siyan-zhao/OPSD` reference implementation (MIT).
|
| 393 |
+
- **Hübotter et al.** for SDPO, which formalizes the Cursor mechanism.
|
| 394 |
+
- **HuggingFace TRL team** for `GRPOTrainer` and the OpenEnv integration.
|
| 395 |
+
- **ByteDance VeRL team** for the HybridFlow / 3D-HybridEngine architecture.
|
| 396 |
+
- **Meta PyTorch team** for Monarch + OpenEnv.
|
| 397 |
+
- **Prime Intellect team** for PRIME-RL and the INTELLECT-2 decentralized run.
|
| 398 |
+
- The five LLM-family research notes in [`research/`](../research/) (Gemini 3.1 Pro, DeepSeek V4 Pro, GPT-5, Claude Sonnet 4.6, Kimi K2-Thinking) for cross-family verification of the framework synthesis.
|
| 399 |
+
|
| 400 |
+
## Citation
|
| 401 |
+
|
| 402 |
+
If you build on this framework, cite as:
|
| 403 |
+
|
| 404 |
+
```bibtex
|
| 405 |
+
@misc{composer-replication-framework-2026,
|
| 406 |
+
author = {Codeseys},
|
| 407 |
+
title = {Composer 2.5 Replication Framework: Methodology and Integration Architecture for Open Replication of Cursor's Agentic Coding Recipe},
|
| 408 |
+
year = {2026},
|
| 409 |
+
publisher = {HuggingFace},
|
| 410 |
+
howpublished = {\url{https://huggingface.co/Codeseys/composer-replication-framework}},
|
| 411 |
+
note = {Pre-experimental v0.0 draft. Methodology, integration architecture, and economic-feasibility result. Empirical validation in follow-up paper.}
|
| 412 |
+
}
|
| 413 |
+
```
|
| 414 |
+
|
| 415 |
+
Underlying primary sources:
|
| 416 |
+
|
| 417 |
+
```bibtex
|
| 418 |
+
@article{cursor2026composer25,
|
| 419 |
+
title = {Introducing Composer 2.5},
|
| 420 |
+
author = {{Cursor Team}},
|
| 421 |
+
year = {2026},
|
| 422 |
+
url = {https://cursor.com/blog/composer-2-5}
|
| 423 |
+
}
|
| 424 |
+
|
| 425 |
+
@article{zhao2026opsd,
|
| 426 |
+
title = {Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models},
|
| 427 |
+
author = {Zhao, Siyan and Xie, Zhihui and Liu, Mengchen and Huang, Jing and Pang, Guan and Chen, Feiyu and Grover, Aditya},
|
| 428 |
+
year = {2026},
|
| 429 |
+
journal = {arXiv preprint arXiv:2601.18734}
|
| 430 |
+
}
|
| 431 |
+
|
| 432 |
+
@article{hubotter2026sdpo,
|
| 433 |
+
title = {Reinforcement Learning via Self-Distillation},
|
| 434 |
+
author = {H{\"u}botter, Jonas and L{\"u}beck, Frederike and Behric, Lejs and Baumann, Anton and Bagatella, Marco and Marta, Daniel and Hakimi, Ido and Shenfeld, Idan and Buening, Thomas Kleine and Guestrin, Carlos and Krause, Andreas},
|
| 435 |
+
year = {2026},
|
| 436 |
+
journal = {arXiv preprint arXiv:2601.20802},
|
| 437 |
+
note = {ICLR 2026 Scaling Post-training Workshop}
|
| 438 |
+
}
|
| 439 |
+
```
|
| 440 |
+
|
| 441 |
+
## Repository
|
| 442 |
+
|
| 443 |
+
Methodology document, audited research notes, integration architecture, working code skeleton, spike plan, and all supporting artifacts:
|
| 444 |
+
|
| 445 |
+
**🤗 https://huggingface.co/Codeseys/composer-replication-framework**
|
| 446 |
+
|
| 447 |
+
Discussion: open a [Discussion](https://huggingface.co/Codeseys/composer-replication-framework/discussions) on the repo for technical questions, corrections, or collaboration interest.
|
| 448 |
+
|
| 449 |
+
---
|
| 450 |
+
|
| 451 |
+
*Last revised 2026-05-25 (Wave 4: data collator + loss composition smoke).*
|
| 452 |
+
*This is a living document. v0.1 will incorporate spike 002–004 results.*
|
|
@@ -0,0 +1,35 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Publications
|
| 2 |
+
|
| 3 |
+
> **Pre-experimental release materials, drafted 2026-05-25, not yet posted publicly.**
|
| 4 |
+
> Use [`RELEASE_CHECKLIST.md`](RELEASE_CHECKLIST.md) to coordinate the publication wave when ready to ship.
|
| 5 |
+
|
| 6 |
+
| Artifact | What | Where it goes |
|
| 7 |
+
|---|---|---|
|
| 8 |
+
| [`PAPER_v0.md`](PAPER_v0.md) | Longform methodology paper (~6,500 words) — central document | arXiv (eventually) or just as the canonical writeup on the repo |
|
| 9 |
+
| [`BLOG_POST.md`](BLOG_POST.md) | Blog post (~2,400 words) in HuggingFace Blog markdown format | HuggingFace blog PR + personal blog / Substack / Medium |
|
| 10 |
+
| [`HF_DISCUSSION_POST.md`](HF_DISCUSSION_POST.md) | Repo Community-tab discussion announcing the release | This repo's [Discussions tab](https://huggingface.co/Codeseys/composer-replication-framework/discussions) |
|
| 11 |
+
| [`TWITTER_THREAD.md`](TWITTER_THREAD.md) | 13-tweet thread, 5-tweet short version, LinkedIn variant | X / Twitter / LinkedIn |
|
| 12 |
+
| [`RELEASE_CHECKLIST.md`](RELEASE_CHECKLIST.md) | Pre-flight checklist + sequencing recommendation + risk register | Internal coordination |
|
| 13 |
+
| [`/CITATION.cff`](../CITATION.cff) | Citation File Format — HF/GitHub renders a "Cite this repository" UI from this | Repo root |
|
| 14 |
+
| [`/CITATION.bib`](../CITATION.bib) | BibTeX equivalent | Repo root |
|
| 15 |
+
|
| 16 |
+
## What this collection is and isn't
|
| 17 |
+
|
| 18 |
+
**It is:** a complete, self-consistent draft of a pre-experimental release announcing the methodology, integration architecture, OPSD/SDPO framing, the novel TR-DPO channel, and the spike-001/spike-005 results. Every claim is either upstream-citation-backed or empirically validated by the spikes.
|
| 19 |
+
|
| 20 |
+
**It isn't:** post-experimental. There are no training results yet. Spike 002–004 (~$500 GPU + a few weeks of wallclock) are the gate to a v0.1 release that adds empirical training validation.
|
| 21 |
+
|
| 22 |
+
## Honest framing reused throughout
|
| 23 |
+
|
| 24 |
+
All four publication-facing documents (`PAPER_v0.md`, `BLOG_POST.md`, `HF_DISCUSSION_POST.md`, `TWITTER_THREAD.md`) include explicit "what I'm NOT claiming" sections. That framing is the publication's defense against overclaim — the work being released is methodology, integration architecture, and economic feasibility for the novel channel, not "this method works."
|
| 25 |
+
|
| 26 |
+
If anything in those documents reads as if it claims more than that, edit before posting.
|
| 27 |
+
|
| 28 |
+
## Sequencing TL;DR
|
| 29 |
+
|
| 30 |
+
1. HF Discussion post (lowest stakes; pre-announces the methodology)
|
| 31 |
+
2. Blog post (anchor narrative)
|
| 32 |
+
3. X / LinkedIn (after blog URL exists)
|
| 33 |
+
4. arXiv (defer until v0.1 with empirical results — see `RELEASE_CHECKLIST.md`)
|
| 34 |
+
|
| 35 |
+
Three-day gap between (1) and (2) lets early-feedback iterations land before the bigger announcement.
|
|
@@ -0,0 +1,118 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Publication Release Checklist
|
| 2 |
+
|
| 3 |
+
> **Last updated:** 2026-05-25
|
| 4 |
+
> **Current state:** all materials drafted; nothing posted publicly yet.
|
| 5 |
+
> Use this checklist to coordinate the publication wave when ready to ship.
|
| 6 |
+
|
| 7 |
+
## What's drafted
|
| 8 |
+
|
| 9 |
+
| Artifact | Path | Status | Word count (approx) |
|
| 10 |
+
|---|---|---|---|
|
| 11 |
+
| Longform methodology paper | [`publications/PAPER_v0.md`](PAPER_v0.md) | ✅ DRAFTED | ~6,500 |
|
| 12 |
+
| Blog post (HF Blog format) | [`publications/BLOG_POST.md`](BLOG_POST.md) | ✅ DRAFTED | ~2,400 |
|
| 13 |
+
| HF Discussion thread (repo Community tab) | [`publications/HF_DISCUSSION_POST.md`](HF_DISCUSSION_POST.md) | ✅ DRAFTED | ~700 |
|
| 14 |
+
| Twitter / X thread (13-tweet + 5-tweet + LinkedIn variants) | [`publications/TWITTER_THREAD.md`](TWITTER_THREAD.md) | ✅ DRAFTED | ~1,200 |
|
| 15 |
+
| `CITATION.cff` (HF/GitHub Citation Format) | [`/CITATION.cff`](../CITATION.cff) | ✅ DRAFTED | n/a |
|
| 16 |
+
| `CITATION.bib` (BibTeX) | [`/CITATION.bib`](../CITATION.bib) | ✅ DRAFTED | n/a |
|
| 17 |
+
| Repo README (model card with frontmatter) | [`/README.md`](../README.md) | ✅ Already published (v3 with wave 4 status) | ~1,000 |
|
| 18 |
+
|
| 19 |
+
All draft materials are in `publications/` and **not yet posted**. Nothing is gated by review; everything is a self-publish decision. Ready to ship.
|
| 20 |
+
|
| 21 |
+
## Pre-flight check before shipping any of these
|
| 22 |
+
|
| 23 |
+
These items should be confirmed before posting any of the public-facing materials. Most are already done from earlier waves but listing here for completeness:
|
| 24 |
+
|
| 25 |
+
- [x] HF repo is public (`Codeseys/composer-replication-framework`)
|
| 26 |
+
- [x] All linked URLs resolve (cross-checked during drafts)
|
| 27 |
+
- [x] Test suite passes (`38/38` as of wave 4)
|
| 28 |
+
- [x] Spike 001 is reproducible (deterministic states + recorded results)
|
| 29 |
+
- [x] Cursor blog is correctly summarized (audit notice in `research/01-composer-2.5.md`)
|
| 30 |
+
- [x] Upstream papers cited correctly (OPSD, SDPO, Cursor blog with arXiv IDs verified)
|
| 31 |
+
- [x] License is MIT and consistent across `LICENSE` + `README.md` frontmatter + `CITATION.cff`
|
| 32 |
+
- [ ] **`CITATION.cff` author block updated with real name/ORCID** if desired (currently just "Codeseys")
|
| 33 |
+
- [ ] **Choose final author identity** for the byline (Codeseys handle? real name? affiliation?)
|
| 34 |
+
- [ ] **HF Discussion title / tags chosen** — suggested in `HF_DISCUSSION_POST.md`
|
| 35 |
+
- [ ] **Blog thumbnail prepared** — placeholder path in `BLOG_POST.md` frontmatter (`/blog/assets/composer-replication-framework/thumbnail.png`); needs a real image
|
| 36 |
+
- [ ] **arXiv submission decided** — see § "arXiv submission" below
|
| 37 |
+
|
| 38 |
+
## Sequencing recommendation
|
| 39 |
+
|
| 40 |
+
If publishing all materials, this order minimizes risk and maximizes signal:
|
| 41 |
+
|
| 42 |
+
1. **HF Discussion post first** (lowest-stakes — repo Community tab; anyone landing on the repo will see it; it pre-announces the methodology paper).
|
| 43 |
+
2. **Blog post / personal site second** (anchor narrative, ~2,400 words, easy to share).
|
| 44 |
+
3. **X / LinkedIn third** (after the blog post URL exists to anchor the thread).
|
| 45 |
+
4. **arXiv submission last** (if doing this — needs more polish; see below).
|
| 46 |
+
|
| 47 |
+
Three-day gap between (1) and (2) is reasonable to let the discussion post collect any early feedback that should be incorporated into the blog.
|
| 48 |
+
|
| 49 |
+
## Distribution / amplification ideas
|
| 50 |
+
|
| 51 |
+
- Cross-post the blog to:
|
| 52 |
+
- HuggingFace blog (PR against `huggingface/blog` repo). Their submission process is documented at https://huggingface.co/docs/hub/en/blog
|
| 53 |
+
- Personal blog / Substack / Medium
|
| 54 |
+
- Post the discussion in:
|
| 55 |
+
- r/LocalLLaMA (will be eaten by their algorithm but worth one shot)
|
| 56 |
+
- r/MachineLearning if you tag `[R]` and frame as "novel methodology, no results yet — looking for feedback"
|
| 57 |
+
- HackerNews "Show HN: …" — pre-experimental disclosure should be in the title
|
| 58 |
+
- LessWrong / Alignment Forum if you frame the reward-hacking section as the lead
|
| 59 |
+
- Tag in the Twitter thread:
|
| 60 |
+
- `@cursor_ai` (Cursor team)
|
| 61 |
+
- `@huggingface` (TRL team)
|
| 62 |
+
- `@volcanoengine` (VeRL team)
|
| 63 |
+
- `@MoonshotAI` (Kimi K2.5)
|
| 64 |
+
- `@PrimeIntellect`
|
| 65 |
+
|
| 66 |
+
## arXiv submission (decide later)
|
| 67 |
+
|
| 68 |
+
The methodology paper is currently in markdown. Pros and cons of a formal arXiv release:
|
| 69 |
+
|
| 70 |
+
**Pros**
|
| 71 |
+
- Citable DOI; appears in Google Scholar / Semantic Scholar
|
| 72 |
+
- Reaches a non-HF research audience
|
| 73 |
+
- Forces a higher polish bar, which catches errors
|
| 74 |
+
|
| 75 |
+
**Cons**
|
| 76 |
+
- Needs LaTeX conversion (~1 day of formatting work)
|
| 77 |
+
- The "no experimental results yet" framing is unusual for arXiv; reviewers may dismiss
|
| 78 |
+
- Once posted, it's permanent — corrections live as v2/v3 markers
|
| 79 |
+
|
| 80 |
+
**Recommendation:** post the HF blog and discussion first; decide on arXiv only after spike 002–004 produce results. Then make it a v0.1 paper *with* experimental backing. The current methodology paper becomes Section 2–4 of that future paper, with new sections 5+ for the empirical results.
|
| 81 |
+
|
| 82 |
+
If you do submit to arXiv now anyway: cs.LG primary, cs.AI cross-list. Title same as `PAPER_v0.md`. Abstract from the paper. Frame in the comments section as "pre-experimental methodology release; experimental validation in follow-up."
|
| 83 |
+
|
| 84 |
+
## Embargo / coordination notes
|
| 85 |
+
|
| 86 |
+
- **Cursor team coordination:** not strictly required (their blog is public, their cited papers are public, no proprietary info), but a polite heads-up tweet on day-of release is reasonable since the post heavily engages their work. `@cursor_ai` tag on tweet 1 of the X thread.
|
| 87 |
+
- **OPSD authors coordination:** Siyan Zhao et al. — also not required (MIT code, public paper) but tagging the lead author on the X thread is a polite signal of citation. Their handles: try `@siyan_zhao` (verify before tagging).
|
| 88 |
+
- **SDPO authors coordination:** same — Hübotter et al. lead author handles unverified, skip tagging if not findable.
|
| 89 |
+
|
| 90 |
+
## Risk register
|
| 91 |
+
|
| 92 |
+
| Risk | Likelihood | Mitigation |
|
| 93 |
+
|---|---|---|
|
| 94 |
+
| Someone runs spike 004 first and beats us to publication | Medium | Acknowledged. Trade-off accepted. The integration architecture is independently citable. |
|
| 95 |
+
| Methodology error caught after publication | Medium | Drafts have been audited (DeepWiki for code, primary-source-read for Cursor blog). 38 unit tests catch wiring bugs. The "what's NOT proven" section in the paper is explicit about open claims. |
|
| 96 |
+
| Hostile read claiming we overclaim novelty | Low | The paper explicitly compares to rStar / Math-Shepherd / Magpie / MoA and concedes "absence of evidence is not evidence of absence" in §9. |
|
| 97 |
+
| Cursor team objects to characterization | Low | Everything cited from their public blog with explicit `[BLOG-VERIFIED]` tags. SDPO/OPSD framing is supported by their own footnote. |
|
| 98 |
+
| Repo gets a flood of PRs / discussion noise | Low | Welcome the noise. Maintain `CONTRIBUTING.md` (TBD) when traffic justifies. |
|
| 99 |
+
|
| 100 |
+
## Post-publication tracking (if you ship)
|
| 101 |
+
|
| 102 |
+
Things to monitor in the first 2 weeks after publication:
|
| 103 |
+
|
| 104 |
+
- HF repo: stars, forks, downloads (reachable via API)
|
| 105 |
+
- HF Discussions tab: new threads, especially anything flagging methodology errors
|
| 106 |
+
- X thread: replies from people working on TRL / VeRL / OpenEnv (especially extension-point critiques)
|
| 107 |
+
- Citations / mentions in adjacent posts (set up Google Scholar Alert)
|
| 108 |
+
- arXiv mentions (if any related work cites pre-print or blog)
|
| 109 |
+
|
| 110 |
+
If a methodology error surfaces, the response protocol:
|
| 111 |
+
1. Acknowledge in the Discussion thread within 24 hours.
|
| 112 |
+
2. Patch the affected file in the repo with a clear commit message.
|
| 113 |
+
3. Add an "Errata" section to `PAPER_v0.md` documenting what was wrong and what changed.
|
| 114 |
+
4. Don't try to silently rewrite history.
|
| 115 |
+
|
| 116 |
+
---
|
| 117 |
+
|
| 118 |
+
*Drafts ready. Ship when you decide. The repo is in a clean state to support any subset of the publication wave above.*
|
|
@@ -0,0 +1,189 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# X / Twitter announcement thread (draft)
|
| 2 |
+
|
| 3 |
+
> **Posting suggestion:** start the thread anchored on the HF repo URL so the algorithm can pick up engagement. Each tweet ≤ 280 chars. Numbered for clarity. Visual at tweet 1: a screenshot of the spike-001 verdict table or the integration-matrix.
|
| 4 |
+
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
**1/13**
|
| 8 |
+
new release: open replication framework for Cursor's Composer 2.5 — the post-trained Kimi K2.5 that runs at 5–10× the cost-efficiency of Opus 4.6 / GPT-5
|
| 9 |
+
|
| 10 |
+
with a novel multi-teacher trace-replay channel + 38 passing tests
|
| 11 |
+
|
| 12 |
+
🤗 huggingface.co/Codeseys/composer-replication-framework
|
| 13 |
+
|
| 14 |
+
---
|
| 15 |
+
|
| 16 |
+
**2/13**
|
| 17 |
+
the central technique Cursor calls "Targeted RL with Textual Feedback" — the bit that makes long-horizon agentic RL work — turns out to be **mathematically the same as published SDPO/OPSD**
|
| 18 |
+
|
| 19 |
+
Cursor cites the papers in footnote 1. there's already MIT-licensed code for it.
|
| 20 |
+
|
| 21 |
+
---
|
| 22 |
+
|
| 23 |
+
**3/13**
|
| 24 |
+
mechanism: when a 100K-token rollout has a localized error, generate a hint correcting the error, run forward pass *with* hint = teacher logits, run *without* hint = student logits, KL-distill student → teacher *only at that turn*
|
| 25 |
+
|
| 26 |
+
same model is both teacher and student
|
| 27 |
+
|
| 28 |
+
---
|
| 29 |
+
|
| 30 |
+
**4/13**
|
| 31 |
+
this sidesteps the credit-assignment nightmare: don't punish 100 good steps for 1 bad step
|
| 32 |
+
|
| 33 |
+
OPSD reference code (Zhao et al., MIT): github.com/siyan-zhao/OPSD
|
| 34 |
+
SDPO paper (Hübotter et al., ICLR 2026 Workshop): arxiv.org/abs/2601.20802
|
| 35 |
+
|
| 36 |
+
---
|
| 37 |
+
|
| 38 |
+
**5/13**
|
| 39 |
+
the novel addition: **multi-teacher trace-replay distillation (TR-DPO)**
|
| 40 |
+
|
| 41 |
+
after a frozen rollout, replay each step against N external teachers (Opus, GPT-5, DeepSeek V4 Pro)
|
| 42 |
+
extract DPO pairs from teacher disagreement with student
|
| 43 |
+
add as a third reward channel
|
| 44 |
+
|
| 45 |
+
---
|
| 46 |
+
|
| 47 |
+
**6/13**
|
| 48 |
+
SDPO uses 1 model with privileged context. TR-DPO uses N models with no privileged context.
|
| 49 |
+
|
| 50 |
+
they're complementary, not competing. both bypass long-horizon credit assignment but tap different supervision sources.
|
| 51 |
+
|
| 52 |
+
unified loss: grpo + α·sdpo + β·trace_replay
|
| 53 |
+
|
| 54 |
+
---
|
| 55 |
+
|
| 56 |
+
**7/13**
|
| 57 |
+
spike 001 (kill-switch) — does N-teacher replay break the budget?
|
| 58 |
+
|
| 59 |
+
150 real OpenRouter calls, 0 errors:
|
| 60 |
+
✅ $0.98 per 50-step trace (vs $5 cap)
|
| 61 |
+
✅ p95 step latency 20.5s (vs 30s cap)
|
| 62 |
+
✅ p99 latency 23.2s (vs 60s cap)
|
| 63 |
+
|
| 64 |
+
with VOI gating in v0.1: ~$0.30/trace projected
|
| 65 |
+
|
| 66 |
+
---
|
| 67 |
+
|
| 68 |
+
**8/13**
|
| 69 |
+
spike 005 — does the 3-channel integration actually compose?
|
| 70 |
+
|
| 71 |
+
```
|
| 72 |
+
$ pytest tests/ -v
|
| 73 |
+
============================== 38 passed in 3.43s ==============================
|
| 74 |
+
```
|
| 75 |
+
|
| 76 |
+
includes a 5-step gradient run on a tiny custom model with all 3 channels active. loss decreases. they don't fight.
|
| 77 |
+
|
| 78 |
+
---
|
| 79 |
+
|
| 80 |
+
**9/13**
|
| 81 |
+
DeepWiki-verified extension points:
|
| 82 |
+
|
| 83 |
+
TRL → subclass `GRPOTrainer._compute_loss`
|
| 84 |
+
VeRL → `@register_adv_est("grpo_composer")` + DataProto fields
|
| 85 |
+
OPSD → lift `generalized_jsd_loss` static method directly
|
| 86 |
+
|
| 87 |
+
both paths shipped in spikes/005/.
|
| 88 |
+
|
| 89 |
+
---
|
| 90 |
+
|
| 91 |
+
**10/13**
|
| 92 |
+
what I'm NOT claiming yet:
|
| 93 |
+
|
| 94 |
+
❌ trace-replay actually improves training
|
| 95 |
+
❌ TRL+OpenEnv produces clean traces at scale
|
| 96 |
+
❌ this matches Composer 2.5 quality
|
| 97 |
+
|
| 98 |
+
those need spikes 002–004 (~$500 GPU budget + a couple weeks). this release is for early feedback before I burn GPU.
|
| 99 |
+
|
| 100 |
+
---
|
| 101 |
+
|
| 102 |
+
**11/13**
|
| 103 |
+
biggest meta-lesson: **read primary sources yourself**
|
| 104 |
+
|
| 105 |
+
initial parallel-research subagent summarized Cursor's blog correctly but missed footnote 1 (the SDPO citation) entirely + added several extrapolations not in the blog
|
| 106 |
+
|
| 107 |
+
going from "implement from scratch" → "lift MIT code" was that re-read
|
| 108 |
+
|
| 109 |
+
---
|
| 110 |
+
|
| 111 |
+
**12/13**
|
| 112 |
+
specifically asking for:
|
| 113 |
+
- critical reads of the integration architecture
|
| 114 |
+
- pointers to adjacent published work I missed on multi-teacher trace replay
|
| 115 |
+
- reward-hacking proposals for the Feature Deletion env
|
| 116 |
+
- collaborators with a small GPU budget who want to run spike 002–004
|
| 117 |
+
|
| 118 |
+
---
|
| 119 |
+
|
| 120 |
+
**13/13**
|
| 121 |
+
all artifacts public, MIT licensed:
|
| 122 |
+
|
| 123 |
+
🤗 huggingface.co/Codeseys/composer-replication-framework
|
| 124 |
+
|
| 125 |
+
methodology paper, blog audit, integration architecture, working code skeleton with 38 tests, full spike plan
|
| 126 |
+
|
| 127 |
+
discussions tab open. would love feedback before I burn GPU.
|
| 128 |
+
|
| 129 |
+
---
|
| 130 |
+
|
| 131 |
+
## Alternative shorter version (5 tweets, for low-bandwidth post)
|
| 132 |
+
|
| 133 |
+
**1/5**
|
| 134 |
+
released: open replication framework for Cursor's Composer 2.5 with a novel multi-teacher trace-replay channel + 38 passing unit tests
|
| 135 |
+
|
| 136 |
+
🤗 huggingface.co/Codeseys/composer-replication-framework
|
| 137 |
+
|
| 138 |
+
pre-experimental — methodology and economic feasibility, no training results yet
|
| 139 |
+
|
| 140 |
+
**2/5**
|
| 141 |
+
key insight: Cursor's "Targeted RL with Textual Feedback" = published SDPO/OPSD with MIT reference code already available
|
| 142 |
+
|
| 143 |
+
cited in their blog's footnote 1, missed by my initial subagent research, only caught when I read the blog directly
|
| 144 |
+
|
| 145 |
+
**3/5**
|
| 146 |
+
novel addition: TR-DPO — replay frozen agentic traces with N external teachers, extract DPO pairs from teacher disagreement, layer on top of GRPO + SDPO
|
| 147 |
+
|
| 148 |
+
economic feasibility verified: $0.98 per 50-step trace, ~$0.30/trace with VOI gating in v0.1
|
| 149 |
+
|
| 150 |
+
**4/5**
|
| 151 |
+
integration architecture verified across TRL + VeRL + OpenEnv via DeepWiki primary-source audits
|
| 152 |
+
|
| 153 |
+
three reward channels compose cleanly via additive loss with independent α/β weights, no resource conflicts
|
| 154 |
+
|
| 155 |
+
5-step train run on tiny model decreases loss with all 3 channels active
|
| 156 |
+
|
| 157 |
+
**5/5**
|
| 158 |
+
what I'm NOT claiming: training results. those gate on spike 002–004 (~$500 GPU)
|
| 159 |
+
|
| 160 |
+
asking for: critical reads, adjacent-work pointers, collaboration interest
|
| 161 |
+
|
| 162 |
+
repo: huggingface.co/Codeseys/composer-replication-framework
|
| 163 |
+
license: MIT
|
| 164 |
+
|
| 165 |
+
---
|
| 166 |
+
|
| 167 |
+
## LinkedIn / longer-form variant (1 post)
|
| 168 |
+
|
| 169 |
+
> Excited to release **a pre-experimental methodology paper + working code skeleton** for an open replication of Cursor's Composer 2.5 — the post-trained Kimi K2.5 model that achieves frontier agentic-coding performance at 5–10× lower serving cost than peers.
|
| 170 |
+
>
|
| 171 |
+
> Three contributions:
|
| 172 |
+
>
|
| 173 |
+
> 1. **Audit of Cursor's recipe.** The headline technique they call "Targeted RL with Textual Feedback" turns out to be mathematically equivalent to published SDPO (Hübotter et al., ICLR 2026 Workshop) with MIT-licensed reference code at `siyan-zhao/OPSD`. Cursor cites both papers in their blog's footnote 1.
|
| 174 |
+
>
|
| 175 |
+
> 2. **Novel reward channel.** Multi-teacher trace-replay distillation: replay frozen agentic rollouts against N external teachers, extract DPO pairs from teacher disagreement. Stacks on top of RLVR + SDPO without resource conflicts.
|
| 176 |
+
>
|
| 177 |
+
> 3. **Verified integration architecture.** DeepWiki audits of TRL, VeRL, and OPSD give exact extension points. 38 unit tests pass including a 5-step gradient run on a tiny custom model — the integration claim is empirically tested, not just architectural.
|
| 178 |
+
>
|
| 179 |
+
> What I'm explicitly *not* claiming: training results. Those gate on spike 002–004 (~$500 GPU budget + a few weeks of wallclock). Releasing pre-experimentally because the integration architecture is independently useful and early feedback may catch design errors.
|
| 180 |
+
>
|
| 181 |
+
> Repository (MIT license): https://huggingface.co/Codeseys/composer-replication-framework
|
| 182 |
+
>
|
| 183 |
+
> Looking for: critical reads of the integration architecture, pointers to adjacent published work, collaboration interest from teams with GPU budget.
|
| 184 |
+
>
|
| 185 |
+
> #LLM #ReinforcementLearning #AgenticCoding #OpenSource
|
| 186 |
+
|
| 187 |
+
---
|
| 188 |
+
|
| 189 |
+
*All three variants are drafts — pick the one that fits the platform's vibe. The 13-tweet thread is best for X engagement; the 5-tweet version for low-effort posting; the LinkedIn version for professional-network posting.*
|