# X / Twitter announcement thread (draft)

> **Posting suggestion:** start the thread anchored on the HF repo URL so the algorithm can pick up engagement. Each tweet ≤ 280 chars. Numbered for clarity. Visual at tweet 1: a screenshot of the spike-001 verdict table or the integration matrix.

---

**1/13**

new release: open replication framework for Cursor's Composer 2.5 (the post-trained Kimi K2.5 that runs at 5–10× the cost-efficiency of Opus 4.6 / GPT-5)

adds a novel multi-teacher trace-replay channel + 38 passing tests

🤗 huggingface.co/Codeseys/composer-replication-framework

---

**2/13**

the central technique Cursor calls "Targeted RL with Textual Feedback" (the bit that makes long-horizon agentic RL work) turns out to be **mathematically the same as published SDPO/OPSD**

Cursor cites the papers in footnote 1. there's already MIT-licensed code for it.

---

**3/13**

mechanism: when a 100K-token rollout has a localized error, generate a hint correcting the error, run forward pass *with* hint = teacher logits, run *without* hint = student logits, KL-distill student → teacher *only at that turn*

same model is both teacher and student

---

**4/13**

this sidesteps the credit-assignment nightmare: don't punish 100 good steps for 1 bad step

OPSD reference code (Zhao et al., MIT): github.com/siyan-zhao/OPSD

SDPO paper (Hübotter et al., ICLR 2026 Workshop): arxiv.org/abs/2601.20802

---

**5/13**

the novel addition: **multi-teacher trace-replay distillation (TR-DPO)**

after a frozen rollout, replay each step against N external teachers (Opus, GPT-5, DeepSeek V4 Pro)

extract DPO pairs from teacher disagreement with student

add as a third reward channel

---

**6/13**

SDPO uses 1 model with privileged context. TR-DPO uses N models with no privileged context.

they're complementary, not competing. both bypass long-horizon credit assignment but tap different supervision sources.
unified loss: grpo + α·sdpo + β·trace_replay

---

**7/13**

spike 001 (kill-switch): does N-teacher replay break the budget?

150 real OpenRouter calls, 0 errors:

✅ $0.98 per 50-step trace (vs $5 cap)
✅ p95 step latency 20.5s (vs 30s cap)
✅ p99 latency 23.2s (vs 60s cap)

with VOI gating in v0.1: ~$0.30/trace projected

---

**8/13**

spike 005: does the 3-channel integration actually compose?

```
$ pytest tests/ -v
============================== 38 passed in 3.43s ==============================
```

includes a 5-step gradient run on a tiny custom model with all 3 channels active. loss decreases. they don't fight.

---

**9/13**

DeepWiki-verified extension points:

TRL → override `GRPOTrainer._compute_loss` in a subclass
VeRL → `@register_adv_est("grpo_composer")` + DataProto fields
OPSD → lift `generalized_jsd_loss` static method directly

both paths shipped in spikes/005/.

---

**10/13**

what I'm NOT claiming yet:

❌ trace-replay actually improves training
❌ TRL+OpenEnv produces clean traces at scale
❌ this matches Composer 2.5 quality

those need spikes 002–004 (~$500 GPU budget + a couple weeks).

this release is for early feedback before I burn GPU.

---

**11/13**

biggest meta-lesson: **read primary sources yourself**

initial parallel-research subagent summarized Cursor's blog correctly but missed footnote 1 (the SDPO citation) entirely + added several extrapolations not in the blog

re-reading the blog myself is what took the plan from "implement from scratch" → "lift MIT code"

---

**12/13**

specifically asking for:

- critical reads of the integration architecture
- pointers to adjacent published work I missed on multi-teacher trace replay
- reward-hacking proposals for the Feature Deletion env
- collaborators with a small GPU budget who want to run spikes 002–004

---

**13/13**

all artifacts public, MIT licensed:

🤗 huggingface.co/Codeseys/composer-replication-framework

methodology paper, blog audit, integration architecture, working code skeleton with 38 tests, full spike plan

discussions tab open.
would love feedback before I burn GPU.

---

## Alternative shorter version (5 tweets, for low-bandwidth post)

**1/5**

released: open replication framework for Cursor's Composer 2.5 with a novel multi-teacher trace-replay channel + 38 passing unit tests

🤗 huggingface.co/Codeseys/composer-replication-framework

pre-experimental: methodology and economic feasibility, no training results yet

**2/5**

key insight: Cursor's "Targeted RL with Textual Feedback" = published SDPO/OPSD with MIT reference code already available

cited in their blog's footnote 1, missed by my initial subagent research, only caught when I read the blog directly

**3/5**

novel addition: TR-DPO. replay frozen agentic traces with N external teachers, extract DPO pairs from teacher disagreement, layer on top of GRPO + SDPO

economic feasibility verified: $0.98 per 50-step trace, ~$0.30/trace with VOI gating in v0.1

**4/5**

integration architecture verified across TRL + VeRL + OpenEnv via DeepWiki primary-source audits

three reward channels compose cleanly via additive loss with independent α/β weights, no resource conflicts

5-step train run on tiny model decreases loss with all 3 channels active

**5/5**

what I'm NOT claiming: training results. those gate on spikes 002–004 (~$500 GPU)

asking for: critical reads, adjacent-work pointers, collaboration interest

repo: huggingface.co/Codeseys/composer-replication-framework

license: MIT

---

## LinkedIn / longer-form variant (1 post)

> Excited to release **a pre-experimental methodology paper + working code skeleton** for an open replication of Cursor's Composer 2.5, the post-trained Kimi K2.5 model that achieves frontier agentic-coding performance at 5–10× lower serving cost than peers.
>
> Three contributions:
>
> 1. **Audit of Cursor's recipe.** The headline technique they call "Targeted RL with Textual Feedback" turns out to be mathematically equivalent to published SDPO (Hübotter et al., ICLR 2026 Workshop) and OPSD (Zhao et al.), with MIT-licensed reference code at `siyan-zhao/OPSD`. Cursor cites both papers in their blog's footnote 1.
>
> 2. **Novel reward channel.** Multi-teacher trace-replay distillation: replay frozen agentic rollouts against N external teachers, extract DPO pairs from teacher disagreement. Stacks on top of RLVR + SDPO without resource conflicts.
>
> 3. **Verified integration architecture.** DeepWiki audits of TRL, VeRL, and OPSD give exact extension points. 38 unit tests pass, including a 5-step gradient run on a tiny custom model, so the integration claim is empirically tested, not just architectural.
>
> What I'm explicitly *not* claiming: training results. Those gate on spikes 002–004 (~$500 GPU budget + a few weeks of wallclock). Releasing pre-experimentally because the integration architecture is independently useful and early feedback may catch design errors.
>
> Repository (MIT license): https://huggingface.co/Codeseys/composer-replication-framework
>
> Looking for: critical reads of the integration architecture, pointers to adjacent published work, collaboration interest from teams with GPU budget.
>
> #LLM #ReinforcementLearning #AgenticCoding #OpenSource

---

*All three variants are drafts. Pick the one that fits the platform's vibe: the 13-tweet thread is best for X engagement, the 5-tweet version for low-effort posting, and the LinkedIn version for professional-network posting.*
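
For replies asking what the unified loss from tweet 6/13 (grpo + α·sdpo + β·trace_replay) means concretely, here is a minimal plain-Python sketch. It assumes each channel has already been reduced to a scalar loss. The names `ChannelLosses` and `unified_loss` and the default weights are hypothetical illustrations, not taken from the repository.

```python
from dataclasses import dataclass


@dataclass
class ChannelLosses:
    """Scalar losses from the three channels (hypothetical container)."""
    grpo: float          # policy-gradient loss from the verifiable reward
    sdpo: float          # self-distillation loss at the hinted turn
    trace_replay: float  # DPO-style loss from multi-teacher replay


def unified_loss(losses: ChannelLosses, alpha: float = 0.5, beta: float = 0.5) -> float:
    """Additive combination: grpo + alpha * sdpo + beta * trace_replay.

    alpha and beta are illustrative defaults, not tuned values.
    """
    if alpha < 0 or beta < 0:
        raise ValueError("channel weights must be non-negative")
    return losses.grpo + alpha * losses.sdpo + beta * losses.trace_replay


# Made-up numbers, purely to show the arithmetic.
example = ChannelLosses(grpo=1.0, sdpo=0.5, trace_replay=0.25)
total = unified_loss(example, alpha=0.5, beta=2.0)
print(total)
```

Setting `alpha=0` or `beta=0` drops the corresponding channel, which is one way the independent α/β weights could be used to ablate each channel separately.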