Reinforcement Learning
Transformers
English
post-training
distillation
agentic-coding
composer-2.5
cursor
kimi-k2
grpo
dapo
diloco
openenv
trl
verl
research
methodology
Instructions to use Codeseys/composer-replication-framework with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Codeseys/composer-replication-framework with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Codeseys/composer-replication-framework", dtype="auto") - Notebooks
- Google Colab
- Kaggle
File size: 7,531 Bytes
ac4bfb4 f16fa23 ac4bfb4 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 | # ADR-002 — Trace source for Spike 007 (real LLM-application traces)
**Status**: Accepted
**Date**: 2026-05-26
**Wave**: Phase 4 (deep work loop)
## Context
Spike 007 closes V5 of the vision validation: "real LLM-application traces."
Spike 001 used 50 hand-crafted synthetic states for the cost-floor measurement.
The framework's brief explicitly said *real traces*, so we owe Spike 007 a
primary-sourced ingestion path that converts a real, public, multi-turn agent
trace format into our existing `TraceState` TypedDict.
Existing schema (verified from `spikes/005-integrated-trainer-skeleton/teacher_replay.py`):
```python
class TraceState(TypedDict):
state_id: str # unique within the trace
messages: list[dict] # OpenAI-style conversation up to + incl this step
student_action: str # what the student did at this step
```
(Earlier deep-work-loop notes called this `TraceExample` — that was a brain
glitch; the actual type is `TraceState` and there is no `TraceExample`.)
## Options considered
| Option | Schema | Acquisition | Signal density | License |
|---|---|---|---|---|
| (a) Claude Code session JSONL | Documented + 4 reverse-engineered schemas | **1,015 local sessions** zero-cost | per-step `tool_use` blocks = ideal teacher-correction sites | User-owned local files; framework MIT |
| (b) Cline VS Code extension | No stable export schema | Would need custom extraction | Unknown until extracted | Apache 2.0 (extension), trace data user-owned |
| (c) OpenHands trajectories | Documented (v0/v1 in flux) | Need to run OpenHands or download leaderboard submissions | Strong | MIT |
| (d) Aider chat history | Markdown chat (lossy for tool calls) | Local only if user runs Aider | Weak — collapses tool structure | Apache 2.0 |
| (e) SWE-bench leaderboard trajs | Heterogeneous, free-format | Public download | Strong but uneven | Per-submission (mostly permissive) |
| (f) SWE-smith-trajectories (HF) | Messages-only, structure collapsed | HF dataset download | Strong but lossy | MIT |
Source: `docs/research/TRACE_SOURCE_RECONNAISSANCE.md` (2026-05-26 subagent recon).
## Decision
**Option (a) — Claude Code session JSONL** at `~/.claude/projects/<encoded>/<sessionid>.jsonl`.
Wins on every axis we care about for Spike 007:
1. **Acquisition cost: zero.** 1,015 real sessions already on this machine
from the user's daily Claude Code use. No download, no consent
negotiation, no rate limiting, no schema change risk during ingestion
development.
2. **Schema stability: empirically validated.** The subagent ran a programmatic
audit on 8 real sessions; record types are stable across all of them.
Anthropic publishes user-facing docs for the format; four independent
community projects (claude-code-cli-tools, claudeflow, etc.) ship
working parsers including one with a JSON Schema validated against
~50,000 real messages.
3. **Signal density: maximal.** Every `tool_use` block is a candidate
teacher-correction site. The 5 pre-selected sessions in the recon doc
contain 6,762 tool_use messages (range 125 → 2,830 per session). That's
100× the density of Spike 001's 50 synthetic states.
4. **License: clean.** The trace files are user-owned files on the user's
own machine. We don't redistribute them with the framework. The
*ingester* code we write is MIT and ships in the framework. Anyone
running the framework who wants real-trace ingestion uses their own
local Claude Code sessions.
## Consequences
### Accepted
- Spike 007 implements `TraceIngester.ingest(path: Path) -> Iterator[TraceState]`
for the Claude Code JSONL format.
- The TraceIngester ships as part of the package (Wave 10 packaging) under
`composer_replication.ingestion.claude_code`.
- The recon doc's 5 pre-selected real sessions become the **smoke fixture**
for Spike 007's tests. We pin to a known set of session IDs so the test
is deterministic locally; CI users substitute their own.
- `ingestion/` directory pattern is established now to support adding
ingesters for OpenHands and SWE-smith later if Spike 007 reveals
signal-density gaps.
### Open questions resolved by ADR-002
1. **Granularity** — One `TraceState` per assistant turn (not per `tool_use`).
A single assistant turn often emits multiple `tool_use` blocks for one
reasoning step; treating each tool_use as a separate state would
over-fragment the conversation. Discussion in TRACE_SOURCE_RECONNAISSANCE
§5.
2. **`student_action` mapping** — The literal text of the assistant turn
(concatenated `text` blocks of the Claude message) becomes
`student_action`. The teacher-replay channel asks N teachers to produce
their version of "what should the assistant do here?" given the
`messages` history; we then DPO-compare teacher consensus vs literal
student text.
3. **Thinking blocks** — Strip `thinking` blocks from the message history
passed to teachers (teachers don't have access to Claude's reasoning
trace). KEEP them in the `student_action` for the student's own
reproduction loop, since that's the actual generation we'd be RL-training.
4. **System prompt** — Inject a synthetic system prompt at message[0] of
each `TraceState` describing "you are a coding agent" so teachers
without their own coding-agent system prompt have a fair playing field.
5. **Subagent traces** — Skip them in v0.1; only ingest top-level sessions.
Subagent traces have a different structure (parent task ID etc.) that
would complicate the v0.1 ingester.
### Recon-flagged risk (not blocking)
- Anthropic doesn't publish a versioned schema. The TraceIngester pins to
known record-types as of 2026-05-26 and gracefully degrades on unknown
types. If Anthropic ships a breaking change to the JSONL format, we'd
need to bump a `schema_version` constant in the ingester. Acceptable
ongoing maintenance burden.
### Risk added 2026-05-26 by cross-model review (NOT BLOCKING but TO DOCUMENT)
- **Circularity / data-leakage in the teacher-replay channel.** Claude
Code traces are produced by Claude. Our default teacher pool
(`DEFAULT_TEACHERS`) includes `anthropic/claude-opus-4.7`. Training a
student on Claude's outputs while Claude is one of the teachers
voting on what the student should do produces a biased disagreement
signal: Claude's vote is correlated with the trace's existing
`student_action` (which Claude originally produced). This biases the
multi-teacher consensus toward the existing answer.
- **Mitigation**: when ingesting Claude Code traces, the user should
drop Claude from the teacher pool and use a non-Claude consensus
(Opus 4.7 → GPT-5 + DeepSeek V4-Pro, or any non-Claude pair).
Documented here; not yet enforced in code.
- **Open question for v0.2**: should `ClaudeCodeIngester` automatically
annotate the source-model field on each trace and `replay_trace`
automatically exclude same-family teachers? Defer the design until
the post-replication phase reveals whether the bias is observable.
### Future ingesters
Open the door for two more ingesters in v0.2:
- `composer_replication.ingestion.openhands` — for users who run OpenHands
- `composer_replication.ingestion.swe_smith` — for users who download the HF dataset
Both follow the same `Iterator[TraceState]` contract.
## Source
`docs/research/TRACE_SOURCE_RECONNAISSANCE.md` (subagent recon, primary-sourced
including direct inspection of the user's local sessions, 2026-05-26).
|