# X / Twitter announcement thread (draft)

> **Posting suggestion:** anchor tweet 1 on the HF repo URL so the algorithm can pick up engagement on the link. Keep each tweet ≤ 280 chars, numbered for clarity. Visual for tweet 1: a screenshot of the spike-001 verdict table or the integration matrix.

---

**1/13**
new release: open replication framework for Cursor's Composer 2.5 — the post-trained Kimi K2.5 that runs at 5–10× the cost-efficiency of Opus 4.6 / GPT-5

with a novel multi-teacher trace-replay channel + 38 passing tests

🤗 huggingface.co/Codeseys/composer-replication-framework

---

**2/13**
the central technique Cursor calls "Targeted RL with Textual Feedback" — the bit that makes long-horizon agentic RL work — turns out to be **mathematically the same as published SDPO/OPSD**

Cursor cites the papers in footnote 1. there's already MIT-licensed code for it.

---

**3/13**
mechanism: when a 100K-token rollout has a localized error, generate a hint that corrects it. forward pass *with* hint = teacher logits, *without* hint = student logits. KL-distill the student toward the teacher *only at that turn*

same model is both teacher and student
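
> **Reply-ready sketch (not part of the tweet):** if someone asks what "only at that turn" means, a toy version may help. This is a minimal illustration, not Cursor's or OPSD's implementation. It assumes plain forward KL(teacher ‖ student), whereas the OPSD code uses a generalized JSD loss, and `turn_masked_distill_loss`, `turn_mask`, and the toy logits are hypothetical names and values.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def turn_masked_distill_loss(teacher_logits, student_logits, turn_mask):
    """Mean KL(teacher || student) over positions where turn_mask is 1.

    teacher_logits: per-position logits from the forward pass WITH the hint.
    student_logits: per-position logits from the forward pass WITHOUT the hint.
    turn_mask:      1 for positions inside the flagged turn, 0 elsewhere.
    """
    total, count = 0.0, 0
    for t_row, s_row, keep in zip(teacher_logits, student_logits, turn_mask):
        if keep:
            total += kl(softmax(t_row), softmax(s_row))
            count += 1
    return total / count if count else 0.0

# Toy rollout: 3 positions, vocabulary of 3 tokens (assumed values).
teacher = [[5.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 1.0, 1.0]]
student = [[0.0, 5.0, 0.0], [0.0, 2.0, 0.0], [3.0, 0.0, 0.0]]

# Only position 0 belongs to the flagged turn, so only it is penalized.
print(turn_masked_distill_loss(teacher, student, [1, 0, 0]))
```

Positions outside the mask contribute nothing to the loss, which is how the good steps get left alone.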

---

**4/13**
this sidesteps the credit-assignment nightmare: don't punish 100 good steps for 1 bad step

OPSD reference code (Zhao et al., MIT): github.com/siyan-zhao/OPSD
SDPO paper (Hübotter et al., ICLR 2026 Workshop): arxiv.org/abs/2601.20802

---

**5/13**
the novel addition: **multi-teacher trace-replay distillation (TR-DPO)**

freeze a finished rollout, then replay each step against N external teachers (Opus, GPT-5, DeepSeek V4 Pro)
extract DPO pairs from teacher disagreement with student
add as a third reward channel
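
> **Reply-ready sketch (not part of the tweet):** one possible reading of "DPO pairs from teacher disagreement", under loud assumptions. The majority-vote rule, the `min_agree` threshold, and the names `PreferencePair` and `extract_pairs` are all hypothetical; the framework's actual pairing rule may differ.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class PreferencePair:
    step: int        # index of the step in the frozen trace
    chosen: str      # action most teachers proposed
    rejected: str    # action the student actually took

def extract_pairs(student_actions, teacher_actions, min_agree=2):
    """Emit a pair at each step where enough teachers agree on an action
    that differs from the student's. Assumed rule, for illustration only."""
    pairs = []
    for step, (student, proposals) in enumerate(zip(student_actions, teacher_actions)):
        action, votes = Counter(proposals).most_common(1)[0]
        if votes >= min_agree and action != student:
            pairs.append(PreferencePair(step, chosen=action, rejected=student))
    return pairs

# Toy 3-step trace replayed against 3 teachers (made-up actions).
student_actions = ["ls", "rm -rf build", "pytest"]
teacher_actions = [
    ["ls", "ls", "ls"],                              # teachers agree with student
    ["make clean", "make clean", "rm -rf build"],    # 2 of 3 disagree with student
    ["pytest -x", "tox", "pytest"],                  # no teacher consensus
]
print(extract_pairs(student_actions, teacher_actions))
```

Steps where the teachers agree with the student, or cannot agree among themselves, yield no pair in this sketch.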

---

**6/13**
SDPO uses 1 model with privileged context. TR-DPO uses N models with no privileged context.

they're complementary, not competing. both bypass long-horizon credit assignment but tap different supervision sources.

unified loss: grpo + α·sdpo + β·trace_replay
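
> **Reply-ready sketch (not part of the tweet):** the unified loss written out as code, in case a reply asks how the channels combine. The α/β defaults below are placeholder values, and `ChannelWeights` / `unified_loss` are hypothetical names rather than the repo's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChannelWeights:
    alpha: float = 0.5   # weight on the SDPO channel (placeholder value)
    beta: float = 0.1    # weight on the trace-replay channel (placeholder value)

def unified_loss(grpo, sdpo, trace_replay, w=ChannelWeights()):
    """Return (total, per-channel contributions) for
    grpo + alpha * sdpo + beta * trace_replay."""
    parts = {
        "grpo": grpo,
        "sdpo": w.alpha * sdpo,
        "trace_replay": w.beta * trace_replay,
    }
    return sum(parts.values()), parts

# Made-up scalar losses for one training step.
total, parts = unified_loss(1.0, 2.0, 10.0)
print(total, parts)
```

Setting both weights to zero recovers plain GRPO, so each extra channel can be ablated independently.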

---

**7/13**
spike 001 (kill-switch) — does N-teacher replay break the budget?

150 real OpenRouter calls, 0 errors:
✅ $0.98 per 50-step trace (vs $5 cap)
✅ p95 step latency 20.5s (vs 30s cap)
✅ p99 latency 23.2s (vs 60s cap)

with VOI gating in v0.1: ~$0.30/trace projected
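
> **Reply-ready sketch (not part of the tweet):** how a go/no-go check against the three caps above could be computed. The caps ($5, 30s, 60s) come from the tweet; the nearest-rank percentile choice, the function names, and the latency samples are assumptions, not the spike's actual script or data.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def kill_switch_verdict(cost_per_trace, latencies,
                        cost_cap=5.0, p95_cap=30.0, p99_cap=60.0):
    """Compare measured cost and step latencies (seconds) against the caps."""
    checks = {
        "cost": cost_per_trace <= cost_cap,
        "p95": percentile(latencies, 95) <= p95_cap,
        "p99": percentile(latencies, 99) <= p99_cap,
    }
    checks["go"] = all(checks.values())
    return checks

# Made-up latency samples; only the $0.98 cost figure is taken from the tweet.
sample_latencies = [10.0] * 95 + [25.0] * 4 + [70.0]
print(kill_switch_verdict(0.98, sample_latencies))
```

A single slow outlier does not trip the switch here because the caps apply to p95 and p99 rather than the maximum.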

---

**8/13**
spike 005 — does the 3-channel integration actually compose?

```
$ pytest tests/ -v
============================== 38 passed in 3.43s ==============================
```

includes a 5-step gradient run on a tiny custom model with all 3 channels active. loss decreases. they don't fight.

---

**9/13**
DeepWiki-verified extension points:

TRL → subclass `GRPOTrainer._compute_loss`
VeRL → `@register_adv_est("grpo_composer")` + DataProto fields
OPSD → lift `generalized_jsd_loss` static method directly

both trainer paths (TRL + VeRL) shipped in spikes/005/.

---

**10/13**
what I'm NOT claiming yet:

❌ trace-replay actually improves training
❌ TRL+OpenEnv produces clean traces at scale
❌ this matches Composer 2.5 quality

those need spikes 002–004 (~$500 GPU budget + a couple weeks). this release is for early feedback before I burn GPU.

---

**11/13**
biggest meta-lesson: **read primary sources yourself**

my first research subagent got the gist of Cursor's blog but missed footnote 1 (the SDPO citation) and added claims the blog never makes

re-reading it myself turned "implement from scratch" into "lift MIT code"

---

**12/13**
specifically asking for:
- critical reads of the integration architecture
- pointers to adjacent published work I missed on multi-teacher trace replay
- reward-hacking proposals for the Feature Deletion env
- collaborators with a small GPU budget who want to run spikes 002–004

---

**13/13**
all artifacts public, MIT licensed:

🤗 huggingface.co/Codeseys/composer-replication-framework

methodology paper, blog audit, integration architecture, working code skeleton with 38 tests, full spike plan

discussions tab open. would love feedback before I burn GPU.

---

## Alternative shorter version (5 tweets, for a low-bandwidth post)

**1/5**
released: open replication framework for Cursor's Composer 2.5 with a novel multi-teacher trace-replay channel + 38 passing unit tests

🤗 huggingface.co/Codeseys/composer-replication-framework

pre-experimental — methodology and economic feasibility, no training results yet

**2/5**
key insight: Cursor's "Targeted RL with Textual Feedback" = published SDPO/OPSD with MIT reference code already available

cited in their blog's footnote 1, missed by my initial subagent research, only caught when I read the blog directly

**3/5**
novel addition: TR-DPO — replay frozen agentic traces with N external teachers, extract DPO pairs from teacher disagreement, layer on top of GRPO + SDPO

economic feasibility verified: $0.98 per 50-step trace, ~$0.30/trace with VOI gating in v0.1

**4/5**
integration architecture verified across TRL + VeRL + OpenEnv via DeepWiki primary-source audits

three reward channels compose cleanly via additive loss with independent α/β weights, no resource conflicts

loss decreases over a 5-step training run on a tiny model with all 3 channels active

**5/5**
what I'm NOT claiming: training results. those gate on spikes 002–004 (~$500 GPU)

asking for: critical reads, adjacent-work pointers, collaboration interest

repo: huggingface.co/Codeseys/composer-replication-framework
license: MIT

---

## LinkedIn / longer-form variant (1 post)

> Excited to release **a pre-experimental methodology paper + working code skeleton** for an open replication of Cursor's Composer 2.5 — the post-trained Kimi K2.5 model that achieves frontier agentic-coding performance at 5–10× lower serving cost than peers.
>
> Three contributions:
>
> 1. **Audit of Cursor's recipe.** The headline technique they call "Targeted RL with Textual Feedback" turns out to be mathematically equivalent to published SDPO (Hübotter et al., ICLR 2026 Workshop) and OPSD (Zhao et al.), with MIT-licensed reference code at `siyan-zhao/OPSD`. Cursor cites both papers in their blog's footnote 1.
>
> 2. **Novel reward channel.** Multi-teacher trace-replay distillation: replay frozen agentic rollouts against N external teachers, extract DPO pairs from teacher disagreement with the student. Stacks on top of GRPO + SDPO without resource conflicts.
>
> 3. **Verified integration architecture.** DeepWiki audits of TRL, VeRL, and OPSD give exact extension points. 38 unit tests pass including a 5-step gradient run on a tiny custom model — the integration claim is empirically tested, not just architectural.
>
> What I'm explicitly *not* claiming: training results. Those gate on spikes 002–004 (~$500 GPU budget + a couple of weeks of wallclock). Releasing pre-experimentally because the integration architecture is independently useful and early feedback may catch design errors.
>
> Repository (MIT license): https://huggingface.co/Codeseys/composer-replication-framework
>
> Looking for: critical reads of the integration architecture, pointers to adjacent published work, collaboration interest from teams with GPU budget.
>
> #LLM #ReinforcementLearning #AgenticCoding #OpenSource

---

*All three variants are drafts — pick the one that fits the platform's vibe. The 13-tweet thread is best for X engagement; the 5-tweet version for low-effort posting; the LinkedIn version for professional-network posting.*