Reinforcement Learning
Transformers
English
post-training
distillation
agentic-coding
composer-2.5
cursor
kimi-k2
grpo
dapo
diloco
openenv
trl
verl
research
methodology
Instructions to use Codeseys/composer-replication-framework with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Codeseys/composer-replication-framework with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Codeseys/composer-replication-framework", dtype="auto") - Notebooks
- Google Colab
- Kaggle
| # ADR-002 — Trace source for Spike 007 (real LLM-application traces) | |
| **Status**: Accepted | |
| **Date**: 2026-05-26 | |
| **Wave**: Phase 4 (deep work loop) | |
| ## Context | |
| Spike 007 closes V5 of the vision validation: "real LLM-application traces." | |
| Spike 001 used 50 hand-crafted synthetic states for the cost-floor measurement. | |
| The framework's brief explicitly said *real traces*, so we owe Spike 007 a | |
| primary-sourced ingestion path that converts a real, public, multi-turn agent | |
| trace format into our existing `TraceState` TypedDict. | |
| Existing schema (verified from `spikes/005-integrated-trainer-skeleton/teacher_replay.py`): | |
| ```python | |
| class TraceState(TypedDict): | |
| state_id: str # unique within the trace | |
| messages: list[dict] # OpenAI-style conversation up to + incl this step | |
| student_action: str # what the student did at this step | |
| ``` | |
| (Earlier deep-work-loop notes called this `TraceExample` — that was a brain | |
| glitch; the actual type is `TraceState` and there is no `TraceExample`.) | |
| ## Options considered | |
| | Option | Schema | Acquisition | Signal density | License | | |
| |---|---|---|---|---| | |
| | (a) Claude Code session JSONL | Documented + 4 reverse-engineered schemas | **1,015 local sessions** zero-cost | per-step `tool_use` blocks = ideal teacher-correction sites | User-owned local files; framework MIT | | |
| | (b) Cline VS Code extension | No stable export schema | Would need custom extraction | Unknown until extracted | Apache 2.0 (extension), trace data user-owned | | |
| | (c) OpenHands trajectories | Documented (v0/v1 in flux) | Need to run OpenHands or download leaderboard submissions | Strong | MIT | | |
| | (d) Aider chat history | Markdown chat (lossy for tool calls) | Local only if user runs Aider | Weak — collapses tool structure | Apache 2.0 | | |
| | (e) SWE-bench leaderboard trajs | Heterogeneous, free-format | Public download | Strong but uneven | Per-submission (mostly permissive) | | |
| | (f) SWE-smith-trajectories (HF) | Messages-only, structure collapsed | HF dataset download | Strong but lossy | MIT | | |
| Source: `docs/research/TRACE_SOURCE_RECONNAISSANCE.md` (2026-05-26 subagent recon). | |
| ## Decision | |
| **Option (a) — Claude Code session JSONL** at `~/.claude/projects/<encoded>/<sessionid>.jsonl`. | |
| Wins on every axis we care about for Spike 007: | |
| 1. **Acquisition cost: zero.** 1,015 real sessions already on this machine | |
| from the user's daily Claude Code use. No download, no consent | |
| negotiation, no rate limiting, no schema change risk during ingestion | |
| development. | |
| 2. **Schema stability: empirically validated.** The subagent ran a programmatic | |
| audit on 8 real sessions; record types are stable across all of them. | |
| Anthropic publishes user-facing docs for the format; four independent | |
| community projects (claude-code-cli-tools, claudeflow, etc.) ship | |
| working parsers including one with a JSON Schema validated against | |
| ~50,000 real messages. | |
| 3. **Signal density: maximal.** Every `tool_use` block is a candidate | |
| teacher-correction site. The 5 pre-selected sessions in the recon doc | |
| contain 6,762 tool_use messages (range 125 → 2,830 per session). That's | |
| 100× the density of Spike 001's 50 synthetic states. | |
| 4. **License: clean.** The trace files are user-owned files on the user's | |
| own machine. We don't redistribute them with the framework. The | |
| *ingester* code we write is MIT and ships in the framework. Anyone | |
| running the framework who wants real-trace ingestion uses their own | |
| local Claude Code sessions. | |
| ## Consequences | |
| ### Accepted | |
| - Spike 007 implements `TraceIngester.ingest(path: Path) -> Iterator[TraceState]` | |
| for the Claude Code JSONL format. | |
| - The TraceIngester ships as part of the package (Wave 10 packaging) under | |
| `composer_replication.ingestion.claude_code`. | |
| - The recon doc's 5 pre-selected real sessions become the **smoke fixture** | |
| for Spike 007's tests. We pin to a known set of session IDs so the test | |
| is deterministic locally; CI users substitute their own. | |
| - `ingestion/` directory pattern is established now to support adding | |
| ingesters for OpenHands and SWE-smith later if Spike 007 reveals | |
| signal-density gaps. | |
| ### Open questions resolved by ADR-002 | |
| 1. **Granularity** — One `TraceState` per assistant turn (not per `tool_use`). | |
| A single assistant turn often emits multiple `tool_use` blocks for one | |
| reasoning step; treating each tool_use as a separate state would | |
| over-fragment the conversation. Discussion in TRACE_SOURCE_RECONNAISSANCE | |
| §5. | |
| 2. **`student_action` mapping** — The literal text of the assistant turn | |
| (concatenated `text` blocks of the Claude message) becomes | |
| `student_action`. The teacher-replay channel asks N teachers to produce | |
| their version of "what should the assistant do here?" given the | |
| `messages` history; we then DPO-compare teacher consensus vs literal | |
| student text. | |
| 3. **Thinking blocks** — Strip `thinking` blocks from the message history | |
| passed to teachers (teachers don't have access to Claude's reasoning | |
| trace). KEEP them in the `student_action` for the student's own | |
| reproduction loop, since that's the actual generation we'd be RL-training. | |
| 4. **System prompt** — Inject a synthetic system prompt at message[0] of | |
| each `TraceState` describing "you are a coding agent" so teachers | |
| without their own coding-agent system prompt have a fair playing field. | |
| 5. **Subagent traces** — Skip them in v0.1; only ingest top-level sessions. | |
| Subagent traces have a different structure (parent task ID etc.) that | |
| would complicate the v0.1 ingester. | |
| ### Recon-flagged risk (not blocking) | |
| - Anthropic doesn't publish a versioned schema. The TraceIngester pins to | |
| known record-types as of 2026-05-26 and gracefully degrades on unknown | |
| types. If Anthropic ships a breaking change to the JSONL format, we'd | |
| need to bump a `schema_version` constant in the ingester. Acceptable | |
| ongoing maintenance burden. | |
| ### Risk added 2026-05-26 by cross-model review (NOT BLOCKING but TO DOCUMENT) | |
| - **Circularity / data-leakage in the teacher-replay channel.** Claude | |
| Code traces are produced by Claude. Our default teacher pool | |
| (`DEFAULT_TEACHERS`) includes `anthropic/claude-opus-4.7`. Training a | |
| student on Claude's outputs while Claude is one of the teachers | |
| voting on what the student should do produces a biased disagreement | |
| signal: Claude's vote is correlated with the trace's existing | |
| `student_action` (which Claude originally produced). This biases the | |
| multi-teacher consensus toward the existing answer. | |
| - **Mitigation**: when ingesting Claude Code traces, the user should | |
| drop Claude from the teacher pool and use a non-Claude consensus | |
| (Opus 4.7 → GPT-5 + DeepSeek V4-Pro, or any non-Claude pair). | |
| Documented here; not yet enforced in code. | |
| - **Open question for v0.2**: should `ClaudeCodeIngester` automatically | |
| annotate the source-model field on each trace and `replay_trace` | |
| automatically exclude same-family teachers? Defer the design until | |
| the post-replication phase reveals whether the bias is observable. | |
| ### Future ingesters | |
| Open the door for two more ingesters in v0.2: | |
| - `composer_replication.ingestion.openhands` — for users who run OpenHands | |
| - `composer_replication.ingestion.swe_smith` — for users who download the HF dataset | |
| Both follow the same `Iterator[TraceState]` contract. | |
| ## Source | |
| `docs/research/TRACE_SOURCE_RECONNAISSANCE.md` (subagent recon, primary-sourced | |
| including direct inspection of the user's local sessions, 2026-05-26). | |