| # X / Twitter announcement thread (draft) | |
| > **Posting suggestion:** start the thread anchored on the HF repo URL so the algorithm can pick up engagement. Keep each tweet ≤ 280 chars. Tweets are numbered for clarity. Visual for tweet 1: a screenshot of the spike-001 verdict table or the integration matrix. | |
| --- | |
| **1/13** | |
| new release: open replication framework for Cursor's Composer 2.5 — the post-trained Kimi K2.5 that runs at 5–10× the cost-efficiency of Opus 4.6 / GPT-5 | |
| with a novel multi-teacher trace-replay channel + 38 passing tests | |
| 🤗 huggingface.co/Codeseys/composer-replication-framework | |
| --- | |
| **2/13** | |
| the central technique Cursor calls "Targeted RL with Textual Feedback" — the bit that makes long-horizon agentic RL work — turns out to be **mathematically the same as published SDPO/OPSD** | |
| Cursor cites the papers in footnote 1. there's already MIT-licensed code for it. | |
| --- | |
| **3/13** | |
| mechanism: when a 100K-token rollout has a localized error, generate a hint correcting the error, run forward pass *with* hint = teacher logits, run *without* hint = student logits, KL-distill student → teacher *only at that turn* | |
| same model is both teacher and student | |
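
Reference sketch for the mechanism in this tweet (not tweet text). It is a toy illustration under stated assumptions: logits are plain Python lists, the divergence is a plain forward KL(teacher ‖ student) rather than the `generalized_jsd_loss` that OPSD ships, and the names `turn_masked_kl`, `with_hint`, `without_hint`, and `error_turn` are hypothetical. The point is only that tokens outside the flagged turn contribute nothing to the loss.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) for a single token position."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))

def turn_masked_kl(teacher, student, turn_mask):
    """Mean per-token KL over the positions flagged in turn_mask only."""
    kept = [token_kl(t, s) for t, s, keep in zip(teacher, student, turn_mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0

# Toy rollout: 3 token positions, vocabulary of 2 (illustrative numbers only).
with_hint = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]     # forward pass with the hint ("teacher")
without_hint = [[2.0, 0.0], [3.0, 0.0], [0.0, 5.0]]  # forward pass without it ("student")
error_turn = [False, True, False]                    # only position 1 is in the flagged turn

loss = turn_masked_kl(with_hint, without_hint, error_turn)
```

Position 2 also differs between the two passes, but it lies outside the flagged turn and is ignored, which is the "don't punish good steps" property the thread describes.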
| --- | |
| **4/13** | |
| this sidesteps the credit-assignment nightmare: don't punish 100 good steps for 1 bad step | |
| OPSD reference code (Zhao et al., MIT): github.com/siyan-zhao/OPSD | |
| SDPO paper (Hübotter et al., ICLR 2026 Workshop): arxiv.org/abs/2601.20802 | |
| --- | |
| **5/13** | |
| the novel addition: **multi-teacher trace-replay distillation (TR-DPO)** | |
| after a frozen rollout, replay each step against N external teachers (Opus, GPT-5, DeepSeek V4 Pro) | |
| extract DPO pairs from teacher disagreement with student | |
| add as a third reward channel | |
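
Reference sketch for the pair-extraction step (not tweet text). The thread only says pairs come from "teacher disagreement with student"; the concrete rule below is an assumption for illustration: emit a pair when at least `min_votes` teachers agree on one action and that action differs from the student's. `PreferencePair`, `extract_pairs`, and `min_votes` are hypothetical names, and actions are compared as plain strings.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class PreferencePair:
    step: int      # index of the step in the frozen trace
    chosen: str    # action the teachers agreed on
    rejected: str  # action the student actually took

def extract_pairs(student_actions, teacher_actions, min_votes=2):
    """Build DPO-style pairs from steps where teachers agree against the student."""
    pairs = []
    for step, (student, teachers) in enumerate(zip(student_actions, teacher_actions)):
        if not teachers:
            continue  # no teacher was replayed at this step
        action, votes = Counter(teachers).most_common(1)[0]
        if votes >= min_votes and action != student:
            pairs.append(PreferencePair(step=step, chosen=action, rejected=student))
    return pairs

# Toy 3-step trace replayed against 3 teachers (illustrative strings only).
student_actions = ["ls", "rm x", "pytest"]
teacher_actions = [
    ["ls", "ls", "ls"],                   # teachers agree with the student: no pair
    ["git status", "git status", "cat"],  # 2 of 3 teachers agree against the student: pair
    ["pytest", "make", "tox"],            # no teacher consensus: no pair
]
pairs = extract_pairs(student_actions, teacher_actions)
```

A stricter or softer consensus rule changes only the `min_votes` threshold; the resulting pairs would feed a standard DPO loss.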
| --- | |
| **6/13** | |
| SDPO uses 1 model with privileged context. TR-DPO uses N models with no privileged context. | |
| they're complementary, not competing. both bypass long-horizon credit assignment but tap different supervision sources. | |
| unified loss: grpo + α·sdpo + β·trace_replay | |
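
Reference sketch of the unified loss in this tweet (not tweet text). It assumes each channel has already been reduced to a scalar loss; `ChannelWeights` and `unified_loss` are hypothetical names and the numbers are illustrative. Setting both weights to zero recovers the GRPO term alone, which makes each channel easy to ablate.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChannelWeights:
    alpha: float = 0.0  # weight on the SDPO self-distillation term
    beta: float = 0.0   # weight on the trace-replay term

def unified_loss(grpo, sdpo, trace_replay, weights):
    """Additive combination: grpo + alpha * sdpo + beta * trace_replay."""
    return grpo + weights.alpha * sdpo + weights.beta * trace_replay

# Same per-channel losses, with and without the two extra channels.
baseline = unified_loss(1.2, 0.4, 0.7, ChannelWeights())
combined = unified_loss(1.2, 0.4, 0.7, ChannelWeights(alpha=0.5, beta=0.25))
```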
| --- | |
| **7/13** | |
| spike 001 (kill-switch) — does N-teacher replay break the budget? | |
| 150 real OpenRouter calls, 0 errors: | |
| ✅ $0.98 per 50-step trace (vs $5 cap) | |
| ✅ p95 step latency 20.5s (vs 30s cap) | |
| ✅ p99 latency 23.2s (vs 60s cap) | |
| with VOI gating in v0.1: ~$0.30/trace projected | |
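
Back-of-envelope check on the spike 001 numbers (not tweet text). Two assumptions that the thread does not state: the 150 calls are spread evenly over the 50-step trace, and gated cost scales linearly with the number of teacher calls kept.

```python
# Figures quoted in the thread.
cost_per_trace = 0.98        # USD per 50-step trace, ungated
total_calls = 150            # OpenRouter calls in the spike
steps_per_trace = 50
projected_gated_cost = 0.30  # USD per trace with VOI gating (projection)

# Derived quantities (under the assumptions named above).
calls_per_step = total_calls / steps_per_trace         # teacher calls per step
cost_per_call = cost_per_trace / total_calls           # average USD per teacher call
kept_fraction = projected_gated_cost / cost_per_trace  # share of calls the gate would keep

print(f"{calls_per_step:.0f} calls/step, ${cost_per_call:.4f}/call, "
      f"{kept_fraction:.0%} of calls kept after gating")
```

Under these assumptions the projection corresponds to replaying roughly a third of the teacher calls.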
| --- | |
| **8/13** | |
| spike 005 — does the 3-channel integration actually compose? | |
| ``` | |
| $ pytest tests/ -v | |
| ============================== 38 passed in 3.43s ============================== | |
| ``` | |
| includes a 5-step gradient run on a tiny custom model with all 3 channels active. loss decreases. they don't fight. | |
| --- | |
| **9/13** | |
| DeepWiki-verified extension points: | |
| TRL → subclass `GRPOTrainer._compute_loss` | |
| VeRL → `@register_adv_est("grpo_composer")` + DataProto fields | |
| OPSD → lift `generalized_jsd_loss` static method directly | |
| both trainer paths (TRL and VeRL) shipped in spikes/005/. | |
| --- | |
| **10/13** | |
| what I'm NOT claiming yet: | |
| ❌ trace-replay actually improves training | |
| ❌ TRL+OpenEnv produces clean traces at scale | |
| ❌ this matches Composer 2.5 quality | |
| those need spikes 002–004 (~$500 GPU budget + a couple weeks). this release is for early feedback before I burn GPU. | |
| --- | |
| **11/13** | |
| biggest meta-lesson: **read primary sources yourself** | |
| initial parallel-research subagent summarized Cursor's blog correctly but missed footnote 1 (the SDPO citation) entirely + added several extrapolations not in the blog | |
| that re-read is what moved the plan from "implement from scratch" → "lift MIT code" | |
| --- | |
| **12/13** | |
| specifically asking for: | |
| - critical reads of the integration architecture | |
| - pointers to adjacent published work I missed on multi-teacher trace replay | |
| - reward-hacking proposals for the Feature Deletion env | |
| - collaborators with a small GPU budget who want to run spikes 002–004 | |
| --- | |
| **13/13** | |
| all artifacts public, MIT licensed: | |
| 🤗 huggingface.co/Codeseys/composer-replication-framework | |
| methodology paper, blog audit, integration architecture, working code skeleton with 38 tests, full spike plan | |
| discussions tab open. would love feedback before I burn GPU. | |
| --- | |
| ## Alternative shorter version (5 tweets, for low-bandwidth post) | |
| **1/5** | |
| released: open replication framework for Cursor's Composer 2.5 with a novel multi-teacher trace-replay channel + 38 passing unit tests | |
| 🤗 huggingface.co/Codeseys/composer-replication-framework | |
| pre-experimental — methodology and economic feasibility, no training results yet | |
| **2/5** | |
| key insight: Cursor's "Targeted RL with Textual Feedback" = published SDPO/OPSD with MIT reference code already available | |
| cited in their blog's footnote 1, missed by my initial subagent research, only caught when I read the blog directly | |
| **3/5** | |
| novel addition: TR-DPO — replay frozen agentic traces with N external teachers, extract DPO pairs from teacher disagreement, layer on top of GRPO + SDPO | |
| economic feasibility verified: $0.98 per 50-step trace, ~$0.30/trace with VOI gating in v0.1 | |
| **4/5** | |
| integration architecture verified across TRL + VeRL + OpenEnv via DeepWiki primary-source audits | |
| three reward channels compose cleanly via additive loss with independent α/β weights, no resource conflicts | |
| a 5-step training run on a tiny model shows decreasing loss with all 3 channels active | |
| **5/5** | |
| what I'm NOT claiming: training results. those gate on spikes 002–004 (~$500 GPU) | |
| asking for: critical reads, adjacent-work pointers, collaboration interest | |
| repo: huggingface.co/Codeseys/composer-replication-framework | |
| license: MIT | |
| --- | |
| ## LinkedIn / longer-form variant (1 post) | |
| > Excited to release **a pre-experimental methodology paper + working code skeleton** for an open replication of Cursor's Composer 2.5 — the post-trained Kimi K2.5 model that achieves frontier agentic-coding performance at 5–10× lower serving cost than peers. | |
| > | |
| > Three contributions: | |
| > | |
| > 1. **Audit of Cursor's recipe.** The headline technique they call "Targeted RL with Textual Feedback" turns out to be mathematically equivalent to published SDPO (Hübotter et al., ICLR 2026 Workshop) and OPSD (Zhao et al.), with MIT-licensed reference code at `siyan-zhao/OPSD`. Cursor cites both papers in their blog's footnote 1. | |
| > | |
| > 2. **Novel reward channel.** Multi-teacher trace-replay distillation: replay frozen agentic rollouts against N external teachers, extract DPO pairs from teacher disagreement with the student. Stacks on top of GRPO + SDPO without resource conflicts. | |
| > | |
| > 3. **Verified integration architecture.** DeepWiki audits of TRL, VeRL, and OPSD give exact extension points. 38 unit tests pass including a 5-step gradient run on a tiny custom model — the integration claim is empirically tested, not just architectural. | |
| > | |
| > What I'm explicitly *not* claiming: training results. Those gate on spikes 002–004 (~$500 GPU budget + a few weeks of wallclock). Releasing pre-experimentally because the integration architecture is independently useful and early feedback may catch design errors. | |
| > | |
| > Repository (MIT license): https://huggingface.co/Codeseys/composer-replication-framework | |
| > | |
| > Looking for: critical reads of the integration architecture, pointers to adjacent published work, collaboration interest from teams with GPU budget. | |
| > | |
| > #LLM #ReinforcementLearning #AgenticCoding #OpenSource | |
| --- | |
| *All three variants are drafts — pick the one that fits the platform's vibe. The 13-tweet thread is best for X engagement; the 5-tweet version for low-effort posting; the LinkedIn version for professional-network posting.* | |