# X / Twitter announcement thread (draft)
> **Posting suggestion:** open the thread with a tweet anchored on the HF repo URL so the algorithm can pick up engagement. Each tweet ≤ 280 chars, numbered for clarity. Visual for tweet 1: a screenshot of the spike-001 verdict table or the integration matrix.
---
**1/13**
new release: open replication framework for Cursor's Composer 2.5 — the post-trained Kimi K2.5 that runs at 5–10× the cost-efficiency of Opus 4.6 / GPT-5
with a novel multi-teacher trace-replay channel + 38 passing tests
🤗 huggingface.co/Codeseys/composer-replication-framework
---
**2/13**
the central technique Cursor calls "Targeted RL with Textual Feedback" — the bit that makes long-horizon agentic RL work — turns out to be **mathematically the same as published SDPO/OPSD**
Cursor cites the papers in footnote 1. there's already MIT-licensed code for it.
---
**3/13**
mechanism: when a 100K-token rollout has a localized error, generate a hint that corrects it. forward pass *with* hint = teacher logits, *without* hint = student logits. KL-distill the student toward the teacher *only at that turn*
same model is both teacher and student
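> **Reference sketch (not part of the tweet):** a minimal pure-Python illustration of the turn-masked self-distillation described in this tweet. It assumes a forward KL(teacher ‖ student) per token and a 0/1 mask over the positions of the corrected turn. The function names, the toy logits, and the KL direction are assumptions for illustration; the SDPO/OPSD reference code may use a different divergence.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))

def targeted_distill_loss(teacher_logits, student_logits, turn_mask):
    """Mean KL over positions where turn_mask is 1; other positions are ignored.

    teacher_logits: per-position logits from the forward pass WITH the hint
    student_logits: per-position logits from the forward pass WITHOUT the hint
    (In a real trainer the teacher logits would be treated as constants.)
    """
    terms = [token_kl(t, s)
             for t, s, m in zip(teacher_logits, student_logits, turn_mask) if m]
    return sum(terms) / len(terms) if terms else 0.0

# Hypothetical 3-position rollout with a 3-token vocabulary; only position 1 is flagged.
teacher = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 1.0, 1.0]]
student = [[2.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.0, 0.0, 5.0]]
mask = [0, 1, 0]
loss = targeted_distill_loss(teacher, student, mask)
```

> Position 2 differs sharply between the two passes but contributes nothing, because the mask confines the loss to the flagged turn.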
---
**4/13**
this sidesteps the credit-assignment nightmare: don't punish 100 good steps for 1 bad step
OPSD reference code (Zhao et al., MIT): github.com/siyan-zhao/OPSD
SDPO paper (Hübotter et al., ICLR 2026 Workshop): arxiv.org/abs/2601.20802
---
**5/13**
the novel addition: **multi-teacher trace-replay distillation (TR-DPO)**
after a frozen rollout, replay each step against N external teachers (Opus, GPT-5, DeepSeek V4 Pro)
extract DPO pairs from teacher disagreement with student
add as a third reward channel
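> **Reference sketch (not part of the tweet):** one way the pair extraction in this tweet could look, assuming each frozen step records the student's action plus one action per teacher. The `min_agree` threshold, the dictionary keys, and the exact-string comparison are all hypothetical; the tweet does not say how disagreement is scored.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    step: int
    context: str
    chosen: str    # action the teachers preferred
    rejected: str  # action the student actually took

def extract_pairs(trace, min_agree=2):
    """Emit a pair wherever >= min_agree teachers agree on an action the student did not take."""
    pairs = []
    for index, step in enumerate(trace):
        votes = Counter(step["teacher_actions"])
        if not votes:
            continue  # no teacher responses recorded for this step
        action, count = votes.most_common(1)[0]
        if count >= min_agree and action != step["student_action"]:
            pairs.append(PreferencePair(index, step["context"], action, step["student_action"]))
    return pairs

# Made-up 3-step trace with 3 teachers per step.
trace = [
    {"context": "step 0", "student_action": "read_file",
     "teacher_actions": ["read_file", "read_file", "grep"]},
    {"context": "step 1", "student_action": "delete_tests",
     "teacher_actions": ["run_tests", "run_tests", "edit_file"]},
    {"context": "step 2", "student_action": "edit_file",
     "teacher_actions": ["grep", "run_tests", "read_file"]},
]
pairs = extract_pairs(trace)
```

> Only step 1 yields a pair here: at step 0 the teacher majority matches the student, and at step 2 the teachers do not agree with each other, so there is no clear preferred action.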
---
**6/13**
SDPO uses 1 model with privileged context. TR-DPO uses N models with no privileged context.
they're complementary, not competing. both bypass long-horizon credit assignment but tap different supervision sources.
unified loss: grpo + α·sdpo + β·trace_replay
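> **Reference sketch (not part of the tweet):** written out, the combined objective in this tweet would look roughly as follows. The symbol names and the non-negativity of the weights are assumptions, and the trace-replay term is shown as the standard DPO objective (temperature $\tau$, frozen reference policy $\pi_{\mathrm{ref}}$), which may differ from what the repo implements.

```latex
\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{GRPO}}(\theta)
  + \alpha\,\mathcal{L}_{\mathrm{SDPO}}(\theta)
  + \beta\,\mathcal{L}_{\mathrm{TR}}(\theta),
  \qquad \alpha, \beta \ge 0

\mathcal{L}_{\mathrm{TR}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
  \log \sigma\!\left(
    \tau \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \tau \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)\right]
```

> Here $y_w$ is the teacher-preferred action and $y_l$ the student's action at the same step. Setting $\beta = 0$ recovers GRPO + SDPO, and setting $\alpha = \beta = 0$ recovers plain GRPO.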
---
**7/13**
spike 001 (kill-switch) — does N-teacher replay break the budget?
150 real OpenRouter calls, 0 errors:
✅ $0.98 per 50-step trace (vs $5 cap)
✅ p95 step latency 20.5s (vs 30s cap)
✅ p99 latency 23.2s (vs 60s cap)
with VOI gating in v0.1: ~$0.30/trace projected
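> **Reference sketch (not part of the tweet):** a small illustration of how a kill-switch check against the three caps in this tweet could be computed. The nearest-rank percentile definition, the function names, and the sample numbers are assumptions; the figures below are made up and are not the spike-001 measurements.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def kill_switch(call_costs, step_latencies, cost_cap=5.0, p95_cap=30.0, p99_cap=60.0):
    """Return (passed, details) for one replayed trace against the three caps."""
    details = {
        "cost_usd": sum(call_costs),
        "p95_s": percentile(step_latencies, 95),
        "p99_s": percentile(step_latencies, 99),
    }
    passed = (details["cost_usd"] <= cost_cap
              and details["p95_s"] <= p95_cap
              and details["p99_s"] <= p99_cap)
    return passed, details

# Hypothetical trace: 50 steps x 3 teachers = 150 calls at a flat $0.01 each.
costs = [0.01] * 150
latencies = [8.0] * 45 + [20.0] * 4 + [45.0]
ok, report = kill_switch(costs, latencies)
```

> With 50 latency samples, nearest-rank p95 is the 48th sorted value and p99 is the 50th, so a single slow step shows up in p99 but not in p95.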
---
**8/13**
spike 005 — does the 3-channel integration actually compose?
```
$ pytest tests/ -v
============================== 38 passed in 3.43s ==============================
```
includes a 5-step gradient run on a tiny custom model with all 3 channels active. loss decreases. they don't fight.
---
**9/13**
DeepWiki-verified extension points:
TRL → subclass `GRPOTrainer._compute_loss`
VeRL → `@register_adv_est("grpo_composer")` + DataProto fields
OPSD → lift `generalized_jsd_loss` static method directly
both trainer paths (TRL and VeRL) shipped in spikes/005/.
---
**10/13**
what I'm NOT claiming yet:
❌ trace-replay actually improves training
❌ TRL+OpenEnv produces clean traces at scale
❌ this matches Composer 2.5 quality
those need spikes 002–004 (~$500 GPU budget + a couple weeks). this release is for early feedback before I burn GPU.
---
**11/13**
biggest meta-lesson: **read primary sources yourself**
initial parallel-research subagent summarized Cursor's blog correctly but missed footnote 1 (the SDPO citation) + added extrapolations not in the blog
that re-read turned "implement from scratch" into "lift MIT code"
---
**12/13**
specifically asking for:
- critical reads of the integration architecture
- pointers to adjacent published work I missed on multi-teacher trace replay
- reward-hacking proposals for the Feature Deletion env
- collaborators with a small GPU budget to run spikes 002–004
---
**13/13**
all artifacts public, MIT licensed:
🤗 huggingface.co/Codeseys/composer-replication-framework
methodology paper, blog audit, integration architecture, working code skeleton with 38 tests, full spike plan
discussions tab open. would love feedback before I burn GPU.
---
## Alternative shorter version (5 tweets, for low-bandwidth post)
**1/5**
released: open replication framework for Cursor's Composer 2.5 with a novel multi-teacher trace-replay channel + 38 passing unit tests
🤗 huggingface.co/Codeseys/composer-replication-framework
pre-experimental — methodology and economic feasibility, no training results yet
**2/5**
key insight: Cursor's "Targeted RL with Textual Feedback" = published SDPO/OPSD with MIT reference code already available
cited in their blog's footnote 1, missed by my initial subagent research, only caught when I read the blog directly
**3/5**
novel addition (TR-DPO): replay frozen agentic traces with N external teachers, extract DPO pairs from teacher-student disagreement, layer on top of GRPO + SDPO
economic feasibility measured: $0.98 per 50-step trace. projected with VOI gating in v0.1: ~$0.30/trace
**4/5**
integration architecture verified across TRL + VeRL + OPSD via DeepWiki primary-source audits
three reward channels compose cleanly via additive loss with independent α/β weights, no resource conflicts
5-step train run on tiny model decreases loss with all 3 channels active
**5/5**
what I'm NOT claiming: training results. those gate on spikes 002–004 (~$500 GPU)
asking for: critical reads, adjacent-work pointers, collaboration interest
repo: huggingface.co/Codeseys/composer-replication-framework
license: MIT
---
## LinkedIn / longer-form variant (1 post)
> Excited to release **a pre-experimental methodology paper + working code skeleton** for an open replication of Cursor's Composer 2.5 — the post-trained Kimi K2.5 model that achieves frontier agentic-coding performance at 5–10× lower serving cost than peers.
>
> Three contributions:
>
> 1. **Audit of Cursor's recipe.** The headline technique they call "Targeted RL with Textual Feedback" turns out to be mathematically equivalent to published SDPO (Hübotter et al., ICLR 2026 Workshop) and OPSD (Zhao et al.), with MIT-licensed reference code at `siyan-zhao/OPSD`. Cursor cites both papers in their blog's footnote 1.
>
> 2. **Novel reward channel.** Multi-teacher trace-replay distillation: replay frozen agentic rollouts against N external teachers, extract DPO pairs from teacher-student disagreement. Stacks on top of RLVR + SDPO without resource conflicts.
>
> 3. **Verified integration architecture.** DeepWiki audits of TRL, VeRL, and OPSD give exact extension points. 38 unit tests pass including a 5-step gradient run on a tiny custom model — the integration claim is empirically tested, not just architectural.
>
> What I'm explicitly *not* claiming: training results. Those gate on spikes 002–004 (~$500 GPU budget + a few weeks of wall-clock time). Releasing pre-experimentally because the integration architecture is independently useful and early feedback may catch design errors.
>
> Repository (MIT license): https://huggingface.co/Codeseys/composer-replication-framework
>
> Looking for: critical reads of the integration architecture, pointers to adjacent published work, collaboration interest from teams with GPU budget.
>
> #LLM #ReinforcementLearning #AgenticCoding #OpenSource
---
*All three variants are drafts — pick the one that fits the platform's vibe. The 13-tweet thread is best for X engagement; the 5-tweet version for low-effort posting; the LinkedIn version for professional-network posting.*