X / Twitter announcement thread (draft)

Posting suggestion: anchor tweet 1 on the HF repo URL so the algorithm can pick up engagement. Keep each tweet ≤ 280 chars; tweets are numbered for clarity. Visual for tweet 1: a screenshot of the spike-001 verdict table or the integration matrix.

1/13 new release: open replication framework for Cursor's Composer 2.5 — the post-trained Kimi K2.5 that runs at 5–10× the cost-efficiency of Opus 4.6 / GPT-5

with a novel multi-teacher trace-replay channel + 38 passing tests

🤗 huggingface.co/Codeseys/composer-replication-framework

2/13 the central technique Cursor calls "Targeted RL with Textual Feedback" — the bit that makes long-horizon agentic RL work — turns out to be mathematically the same as published SDPO/OPSD

Cursor cites the papers in footnote 1. there's already MIT-licensed code for it.

3/13 mechanism: when a 100K-token rollout has a localized error, generate a hint correcting the error, run forward pass with hint = teacher logits, run without hint = student logits, KL-distill student → teacher only at that turn

same model is both teacher and student
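Reference sketch for replies to tweet 3: the "distill only at the flagged turn" step could look roughly like the following. This is a minimal pure-Python illustration, not the OPSD reference code. The function names, the `turn_ids` layout, and the choice of forward KL(teacher || student) are assumptions made for the example; the reference implementation uses a generalized JSD loss.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def targeted_distill_loss(student_logits, teacher_logits, turn_ids, error_turn):
    """Mean per-token KL(teacher || student), restricted to the flagged turn.

    student_logits: per-token logits from the forward pass WITHOUT the hint
    teacher_logits: per-token logits from the same model WITH the hint
    turn_ids:       turn index of each token position (hypothetical layout)
    error_turn:     the turn the corrective hint targets
    """
    losses = []
    for s, t, turn in zip(student_logits, teacher_logits, turn_ids):
        if turn != error_turn:
            continue  # tokens outside the flagged turn contribute nothing
        losses.append(kl_divergence(softmax(t), softmax(s)))
    return sum(losses) / len(losses) if losses else 0.0
```

Tokens outside `error_turn` are skipped entirely, which is the sense in which the rest of the rollout is not penalized for one localized mistake.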

4/13 this sidesteps the credit-assignment nightmare: don't punish 100 good steps for 1 bad step

OPSD reference code (Zhao et al., MIT): github.com/siyan-zhao/OPSD
SDPO paper (Hübotter et al., ICLR 2026 Workshop): arxiv.org/abs/2601.20802

5/13 the novel addition: multi-teacher trace-replay distillation (TR-DPO)

after a frozen rollout:
replay each step against N external teachers (Opus, GPT-5, DeepSeek V4 Pro)
extract DPO pairs from teacher disagreement with student
add as a third reward channel
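Reference sketch for replies to tweet 5: one way to turn teacher disagreement into DPO pairs is a simple majority vote per step. Everything here is illustrative: `PreferencePair`, `extract_pairs`, the `min_agree` threshold, and exact string matching on actions are assumptions for the example, not the shipped extraction logic.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class PreferencePair:
    step: int
    prompt: str    # context the student saw at this step
    chosen: str    # action most teachers proposed
    rejected: str  # action the student actually took

def extract_pairs(trace, teacher_actions, min_agree=2):
    """Build preference pairs from a frozen trace.

    trace:           list of (context, student_action), one entry per step
    teacher_actions: list of per-step lists, one proposed action per teacher
    min_agree:       minimum number of teachers that must back the same action
    """
    pairs = []
    for step, ((context, student_action), proposals) in enumerate(
        zip(trace, teacher_actions)
    ):
        if not proposals:
            continue
        action, votes = Counter(proposals).most_common(1)[0]
        # Keep the step only when enough teachers agree with each other
        # AND their shared action differs from the student's.
        if votes >= min_agree and action != student_action:
            pairs.append(PreferencePair(step, context, action, student_action))
    return pairs
```

Steps where the teachers split with no majority yield no pair under this rule, so noisy disagreement among teachers does not become a training signal.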

6/13 SDPO uses 1 model with privileged context. TR-DPO uses N models with no privileged context.

they're complementary, not competing. both bypass long-horizon credit assignment but tap different supervision sources.

unified loss: grpo + α·sdpo + β·trace_replay
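Reference sketch for replies to tweet 6: the additive loss can be written out as below. The weighted sum follows the tweet; using the standard DPO objective for the trace-replay term, the default weights, and the function names are assumptions for illustration.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta_dpo=0.1):
    """Standard DPO loss for one pair: -log sigmoid(beta_dpo * margin).

    Inputs are sequence log-probabilities under the policy and a frozen
    reference model for the chosen and rejected actions.
    """
    margin = (policy_chosen_lp - ref_chosen_lp) - (policy_rejected_lp - ref_rejected_lp)
    x = beta_dpo * margin
    # -log(sigmoid(x)) == softplus(-x), computed without overflow.
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

def unified_loss(grpo, sdpo, trace_replay, alpha=0.5, beta=0.5):
    """grpo + alpha * sdpo + beta * trace_replay, as in the tweet."""
    return grpo + alpha * sdpo + beta * trace_replay
```

Because the channels only meet in this sum, setting `alpha` or `beta` to 0 switches a channel off without touching the other two.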

7/13 spike 001 (kill-switch) — does N-teacher replay break the budget?

150 real OpenRouter calls, 0 errors:
✅ $0.98 per 50-step trace (vs $5 cap)
✅ p95 step latency 20.5s (vs 30s cap)
✅ p99 latency 23.2s (vs 60s cap)

with VOI gating in v0.1: ~$0.30/trace projected
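Reference sketch for replies to tweet 7: the kill-switch check reduces to comparing three numbers against their caps. The caps below ($5, 30s, 60s) come from the tweet; the nearest-rank percentile method, the function names, and the dict layout are assumptions, and the spike itself may compute these differently.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def budget_verdict(call_costs, step_latencies,
                   cost_cap=5.00, p95_cap=30.0, p99_cap=60.0):
    """Check one replayed trace against the cost and latency caps.

    call_costs:     dollar cost of every teacher call made for the trace
    step_latencies: wall-clock seconds per replayed step
    """
    total = sum(call_costs)
    p95 = percentile(step_latencies, 95)
    p99 = percentile(step_latencies, 99)
    return {
        "cost_per_trace": total,
        "p95": p95,
        "p99": p99,
        "pass": total <= cost_cap and p95 <= p95_cap and p99 <= p99_cap,
    }
```

Any single cap being exceeded flips `pass` to `False`, which matches the kill-switch framing: the channel is dropped if replay is too expensive or too slow on any one axis.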

8/13 spike 005 — does the 3-channel integration actually compose?

$ pytest tests/ -v
============================== 38 passed in 3.43s ==============================

includes a 5-step gradient run on a tiny custom model with all 3 channels active. loss decreases. they don't fight.

9/13 DeepWiki-verified extension points:

TRL → subclass GRPOTrainer._compute_loss
VeRL → @register_adv_est("grpo_composer") + DataProto fields
OPSD → lift generalized_jsd_loss static method directly

both trainer paths (TRL, VeRL) shipped in spikes/005/.

10/13 what I'm NOT claiming yet:

❌ trace-replay actually improves training
❌ TRL+OpenEnv produces clean traces at scale
❌ this matches Composer 2.5 quality

those need spikes 002–004 (~$500 GPU budget + a couple weeks). this release is for early feedback before I burn GPU.

11/13 biggest meta-lesson: read primary sources yourself

initial parallel-research subagent summarized Cursor's blog correctly but missed footnote 1 (the SDPO citation) entirely + added several extrapolations not in the blog

re-reading the blog myself is what moved the plan from "implement from scratch" → "lift MIT code"

12/13 specifically asking for:

critical reads of the integration architecture
pointers to adjacent published work I missed on multi-teacher trace replay
reward-hacking proposals for the Feature Deletion env
collaborators with a small GPU budget who want to run spikes 002–004

13/13 all artifacts public, MIT licensed:

🤗 huggingface.co/Codeseys/composer-replication-framework

methodology paper, blog audit, integration architecture, working code skeleton with 38 tests, full spike plan

discussions tab open. would love feedback before I burn GPU.

Alternative shorter version (5 tweets, for low-bandwidth post)

1/5 released: open replication framework for Cursor's Composer 2.5 with a novel multi-teacher trace-replay channel + 38 passing unit tests

🤗 huggingface.co/Codeseys/composer-replication-framework

pre-experimental — methodology and economic feasibility, no training results yet

2/5 key insight: Cursor's "Targeted RL with Textual Feedback" = published SDPO/OPSD with MIT reference code already available

cited in their blog's footnote 1, missed by my initial subagent research, only caught when I read the blog directly

3/5 novel addition: TR-DPO — replay frozen agentic traces with N external teachers, extract DPO pairs from teacher disagreement, layer on top of GRPO + SDPO

economic feasibility verified: $0.98 per 50-step trace, ~$0.30/trace with VOI gating in v0.1

4/5 integration architecture verified across TRL + VeRL + OpenEnv via DeepWiki primary-source audits

three reward channels compose cleanly via additive loss with independent α/β weights, no resource conflicts

5-step train run on tiny model decreases loss with all 3 channels active

5/5 what I'm NOT claiming: training results. those gate on spikes 002–004 (~$500 GPU)

asking for: critical reads, adjacent-work pointers, collaboration interest

repo: huggingface.co/Codeseys/composer-replication-framework
license: MIT

LinkedIn / longer-form variant (1 post)

Excited to release a pre-experimental methodology paper + working code skeleton for an open replication of Cursor's Composer 2.5 — the post-trained Kimi K2.5 model that achieves frontier agentic-coding performance at 5–10× lower serving cost than peers.

Three contributions:

1. Audit of Cursor's recipe. The headline technique they call "Targeted RL with Textual Feedback" turns out to be mathematically equivalent to published SDPO (Hübotter et al., ICLR 2026 Workshop) and OPSD (Zhao et al.), with MIT-licensed reference code at siyan-zhao/OPSD. Cursor cites both papers in their blog's footnote 1.

2. Novel reward channel. Multi-teacher trace-replay distillation: replay frozen agentic rollouts against N external teachers, extract DPO pairs from teacher disagreement. Stacks on top of RLVR + SDPO without resource conflicts.

3. Verified integration architecture. DeepWiki audits of TRL, VeRL, and OPSD give exact extension points. 38 unit tests pass, including a 5-step gradient run on a tiny custom model, so the integration claim is empirically tested, not just architectural.

What I'm explicitly not claiming: training results. Those gate on spikes 002–004 (~$500 GPU budget + a few weeks of wall-clock time). Releasing pre-experimentally because the integration architecture is independently useful and early feedback may catch design errors.

Repository (MIT license): https://huggingface.co/Codeseys/composer-replication-framework

Looking for: critical reads of the integration architecture, pointers to adjacent published work, collaboration interest from teams with GPU budget.

#LLM #ReinforcementLearning #AgenticCoding #OpenSource

All three variants are drafts — pick the one that fits the platform's vibe. The 13-tweet thread is best for X engagement; the 5-tweet version for low-effort posting; the LinkedIn version for professional-network posting.