X / Twitter announcement thread (draft)
Posting suggestion: start the thread anchored on the HF repo URL so the algorithm can pick up engagement. Each tweet ≤ 280 chars. Numbered for clarity. Visual at tweet 1: a screenshot of the spike-001 verdict table or the integration-matrix.
1/13 new release: open replication framework for Cursor's Composer 2.5 — the post-trained Kimi K2.5 that runs at 5–10× the cost-efficiency of Opus 4.6 / GPT-5
with a novel multi-teacher trace-replay channel + 38 passing tests
🤗 huggingface.co/Codeseys/composer-replication-framework
2/13 the central technique Cursor calls "Targeted RL with Textual Feedback" — the bit that makes long-horizon agentic RL work — turns out to be mathematically the same as published SDPO/OPSD
Cursor cites the papers in footnote 1. there's already MIT-licensed code for it.
3/13 mechanism: when a 100K-token rollout has a localized error, generate a hint correcting the error, run forward pass with hint = teacher logits, run without hint = student logits, KL-distill student → teacher only at that turn
same model is both teacher and student
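The mechanism in tweet 3 can be sketched in plain Python. This is a minimal illustration, not the OPSD or Cursor implementation: the function names are hypothetical, the logits are toy lists rather than model outputs, and the choice of KL(teacher || student) is an assumption (the OPSD reference code uses a generalized JSD loss). The point it shows is the masking: only tokens in the corrected turn contribute to the loss.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def targeted_distill_loss(student_logits, teacher_logits, turn_mask):
    """Mean KL(teacher || student) over positions where turn_mask is 1.

    student_logits: per-position logits from the pass WITHOUT the hint
    teacher_logits: per-position logits from the pass WITH the hint
    turn_mask:      1 for tokens in the corrected turn, 0 elsewhere
    (all three names are hypothetical, chosen for this sketch)
    """
    total, count = 0.0, 0
    for s, t, keep in zip(student_logits, teacher_logits, turn_mask):
        if keep:
            total += kl_divergence(softmax(t), softmax(s))
            count += 1
    return total / count if count else 0.0

# Toy example: 3 positions, vocabulary of 3; only position 1 is in the bad turn.
student = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
teacher = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
loss = targeted_distill_loss(student, teacher, turn_mask=[0, 1, 0])
```

Positions outside the mask add nothing to the loss, which is how the good steps in a long rollout avoid being penalized.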
4/13 this sidesteps the credit-assignment nightmare: don't punish 100 good steps for 1 bad step
OPSD reference code (Zhao et al., MIT): github.com/siyan-zhao/OPSD
SDPO paper (Hübotter et al., ICLR 2026 Workshop): arxiv.org/abs/2601.20802
5/13 the novel addition: multi-teacher trace-replay distillation (TR-DPO)
after a frozen rollout, replay each step against N external teachers (Opus, GPT-5, DeepSeek V4 Pro)
extract DPO pairs from teacher disagreement with student
add as a third reward channel
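One way to picture the pair extraction in tweet 5 is the sketch below. The thread does not specify the disagreement rule, so everything here is an assumption made for illustration: the `PreferencePair` type, the `extract_pairs` function, the majority-vote threshold, and the exact string comparison of actions (real agent actions would likely need normalization first).

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    """One DPO training pair taken from a single step of a frozen trace."""
    step: int
    prompt: str
    chosen: str    # action most teachers proposed
    rejected: str  # action the student actually took

def extract_pairs(trace, teacher_actions, min_votes=2):
    """Emit a pair wherever enough teachers agree on an action the student did not take.

    trace:           list of (prompt, student_action), one entry per step
    teacher_actions: list of lists; teacher proposals for the same steps
    min_votes:       assumed agreement threshold (hypothetical parameter)
    """
    pairs = []
    for step, ((prompt, student_action), proposals) in enumerate(zip(trace, teacher_actions)):
        if not proposals:
            continue
        action, votes = Counter(proposals).most_common(1)[0]
        if votes >= min_votes and action != student_action:
            pairs.append(PreferencePair(step, prompt, action, student_action))
    return pairs

# Toy 3-step trace replayed against 3 teachers.
trace = [("open file", "cat a.py"), ("fix bug", "rm a.py"), ("run tests", "pytest")]
teachers = [
    ["cat a.py", "cat a.py", "cat a.py"],   # teachers agree with the student
    ["edit a.py", "edit a.py", "rm a.py"],  # two teachers disagree with the student
    ["pytest", "pytest -v", "tox"],         # no teacher consensus
]
pairs = extract_pairs(trace, teachers)
```

In this toy trace only step 1 yields a pair: the teachers either agree with the student (step 0) or fail to reach the assumed consensus threshold (step 2).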
6/13 SDPO uses 1 model with privileged context. TR-DPO uses N models with no privileged context.
they're complementary, not competing. both bypass long-horizon credit assignment but tap different supervision sources.
unified loss: grpo + α·sdpo + β·trace_replay
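The unified loss is a plain weighted sum, which makes per-channel ablations straightforward. The sketch below assumes each channel has already been reduced to a scalar; the function name and the weight values in the example are hypothetical, since the thread does not state what α and β are set to.

```python
def composer_loss(grpo, sdpo, trace_replay, *, alpha, beta):
    """Additive three-channel loss: grpo + alpha * sdpo + beta * trace_replay."""
    return grpo + alpha * sdpo + beta * trace_replay

# Placeholder per-channel losses and weights, for illustration only.
full = composer_loss(1.0, 2.0, 4.0, alpha=0.5, beta=0.25)
grpo_only = composer_loss(1.0, 2.0, 4.0, alpha=0.0, beta=0.0)
```

Setting α or β to zero switches the corresponding channel off, so plain GRPO is recovered as a special case.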
7/13 spike 001 (kill-switch) — does N-teacher replay break the budget?
150 real OpenRouter calls, 0 errors:
✅ $0.98 per 50-step trace (vs $5 cap)
✅ p95 step latency 20.5s (vs 30s cap)
✅ p99 latency 23.2s (vs 60s cap)
with VOI gating in v0.1: ~$0.30/trace projected
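A kill-switch check of this kind could look like the sketch below. The three caps are the ones quoted in tweet 7; the function names, the nearest-rank percentile method, and the input data are assumptions made for illustration, not the spike's actual harness or measurements.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n) in sorted order."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def kill_switch(call_costs_usd, step_latencies_s,
                cost_cap=5.0, p95_cap=30.0, p99_cap=60.0):
    """Compare one replayed trace against the budget caps.

    call_costs_usd:   cost of every teacher call made for the trace
    step_latencies_s: wall-clock latency of each replayed step
    """
    report = {
        "cost_per_trace": sum(call_costs_usd),
        "p95_latency": percentile(step_latencies_s, 95),
        "p99_latency": percentile(step_latencies_s, 99),
    }
    report["passed"] = (
        report["cost_per_trace"] <= cost_cap
        and report["p95_latency"] <= p95_cap
        and report["p99_latency"] <= p99_cap
    )
    return report

# Synthetic inputs: 150 calls (50 steps x 3 teachers) and 50 step latencies.
synthetic_costs = [0.0065] * 150
synthetic_latencies = [5.0 + 0.3 * i for i in range(50)]
report = kill_switch(synthetic_costs, synthetic_latencies)
```

The synthetic cost of $0.0065 per call is just $0.98 divided by 150 calls, rounded; real per-call costs would vary by teacher and by step length.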
8/13 spike 005 — does the 3-channel integration actually compose?
$ pytest tests/ -v
============================== 38 passed in 3.43s ==============================
includes a 5-step gradient run on a tiny custom model with all 3 channels active. loss decreases. they don't fight.
9/13 DeepWiki-verified extension points:
TRL → subclass GRPOTrainer._compute_loss
VeRL → @register_adv_est("grpo_composer") + DataProto fields
OPSD → lift generalized_jsd_loss static method directly
both trainer paths (TRL and VeRL) shipped in spikes/005/.
10/13 what I'm NOT claiming yet:
❌ trace-replay actually improves training
❌ TRL+OpenEnv produces clean traces at scale
❌ this matches Composer 2.5 quality
those need spikes 002–004 (~$500 GPU budget + a couple weeks). this release is for early feedback before I burn GPU.
11/13 biggest meta-lesson: read primary sources yourself
initial parallel-research subagent summarized Cursor's blog correctly but missed footnote 1 (the SDPO citation) entirely + added several extrapolations not in the blog
that re-read is what moved the plan from "implement from scratch" → "lift MIT code"
12/13 specifically asking for:
- critical reads of the integration architecture
- pointers to adjacent published work I missed on multi-teacher trace replay
- reward-hacking proposals for the Feature Deletion env
- collaborators with a small GPU budget who want to run spikes 002–004
13/13 all artifacts public, MIT licensed:
🤗 huggingface.co/Codeseys/composer-replication-framework
methodology paper, blog audit, integration architecture, working code skeleton with 38 tests, full spike plan
discussions tab open. would love feedback before I burn GPU.
Alternative shorter version (5 tweets, for low-bandwidth post)
1/5 released: open replication framework for Cursor's Composer 2.5 with a novel multi-teacher trace-replay channel + 38 passing unit tests
🤗 huggingface.co/Codeseys/composer-replication-framework
pre-experimental — methodology and economic feasibility, no training results yet
2/5 key insight: Cursor's "Targeted RL with Textual Feedback" = published SDPO/OPSD with MIT reference code already available
cited in their blog's footnote 1, missed by my initial subagent research, only caught when I read the blog directly
3/5 novel addition: TR-DPO — replay frozen agentic traces with N external teachers, extract DPO pairs from teacher disagreement, layer on top of GRPO + SDPO
economic feasibility verified: $0.98 per 50-step trace, ~$0.30/trace with VOI gating in v0.1
4/5 integration architecture verified across TRL + VeRL + OpenEnv via DeepWiki primary-source audits
three reward channels compose cleanly via additive loss with independent α/β weights, no resource conflicts
5-step train run on tiny model decreases loss with all 3 channels active
5/5 what I'm NOT claiming: training results. those gate on spikes 002–004 (~$500 GPU)
asking for: critical reads, adjacent-work pointers, collaboration interest
repo: huggingface.co/Codeseys/composer-replication-framework
license: MIT
LinkedIn / longer-form variant (1 post)
Excited to release a pre-experimental methodology paper + working code skeleton for an open replication of Cursor's Composer 2.5 — the post-trained Kimi K2.5 model that achieves frontier agentic-coding performance at 5–10× lower serving cost than peers.
Three contributions:
1. Audit of Cursor's recipe. The headline technique they call "Targeted RL with Textual Feedback" turns out to be mathematically equivalent to published SDPO (Hübotter et al., ICLR 2026 Workshop) with MIT-licensed reference code at siyan-zhao/OPSD. Cursor cites both papers in their blog's footnote 1.
2. Novel reward channel. Multi-teacher trace-replay distillation: replay frozen agentic rollouts against N external teachers, extract DPO pairs from teacher disagreement. Stacks on top of RLVR + SDPO without resource conflicts.
3. Verified integration architecture. DeepWiki audits of TRL, VeRL, and OPSD give exact extension points. 38 unit tests pass, including a 5-step gradient run on a tiny custom model: the integration claim is empirically tested, not just architectural.
What I'm explicitly not claiming: training results. Those gate on spikes 002–004 (~$500 GPU budget + a few weeks of wall-clock time). Releasing pre-experimentally because the integration architecture is independently useful and early feedback may catch design errors.
Repository (MIT license): https://huggingface.co/Codeseys/composer-replication-framework
Looking for: critical reads of the integration architecture, pointers to adjacent published work, collaboration interest from teams with GPU budget.
#LLM #ReinforcementLearning #AgenticCoding #OpenSource
All three variants are drafts — pick the one that fits the platform's vibe. The 13-tweet thread is best for X engagement; the 5-tweet version for low-effort posting; the LinkedIn version for professional-network posting.