# ADR-002 — Trace source for Spike 007 (real LLM-application traces) **Status**: Accepted **Date**: 2026-05-26 **Wave**: Phase 4 (deep work loop) ## Context Spike 007 closes V5 of the vision validation: "real LLM-application traces." Spike 001 used 50 hand-crafted synthetic states for the cost-floor measurement. The framework's brief explicitly said *real traces*, so we owe Spike 007 a primary-sourced ingestion path that converts a real, public, multi-turn agent trace format into our existing `TraceState` TypedDict. Existing schema (verified from `spikes/005-integrated-trainer-skeleton/teacher_replay.py`): ```python class TraceState(TypedDict): state_id: str # unique within the trace messages: list[dict] # OpenAI-style conversation up to + incl this step student_action: str # what the student did at this step ``` (Earlier deep-work-loop notes called this `TraceExample` — that was a brain glitch; the actual type is `TraceState` and there is no `TraceExample`.) ## Options considered | Option | Schema | Acquisition | Signal density | License | |---|---|---|---|---| | (a) Claude Code session JSONL | Documented + 4 reverse-engineered schemas | **1,015 local sessions** zero-cost | per-step `tool_use` blocks = ideal teacher-correction sites | User-owned local files; framework MIT | | (b) Cline VS Code extension | No stable export schema | Would need custom extraction | Unknown until extracted | Apache 2.0 (extension), trace data user-owned | | (c) OpenHands trajectories | Documented (v0/v1 in flux) | Need to run OpenHands or download leaderboard submissions | Strong | MIT | | (d) Aider chat history | Markdown chat (lossy for tool calls) | Local only if user runs Aider | Weak — collapses tool structure | Apache 2.0 | | (e) SWE-bench leaderboard trajs | Heterogeneous, free-format | Public download | Strong but uneven | Per-submission (mostly permissive) | | (f) SWE-smith-trajectories (HF) | Messages-only, structure collapsed | HF dataset download | Strong but lossy | MIT | Source: `docs/research/TRACE_SOURCE_RECONNAISSANCE.md` (2026-05-26 subagent recon). ## Decision **Option (a) — Claude Code session JSONL** at `~/.claude/projects//.jsonl`. Wins on every axis we care about for Spike 007: 1. **Acquisition cost: zero.** 1,015 real sessions already on this machine from the user's daily Claude Code use. No download, no consent negotiation, no rate limiting, no schema change risk during ingestion development. 2. **Schema stability: empirically validated.** The subagent ran a programmatic audit on 8 real sessions; record types are stable across all of them. Anthropic publishes user-facing docs for the format; four independent community projects (claude-code-cli-tools, claudeflow, etc.) ship working parsers including one with a JSON Schema validated against ~50,000 real messages. 3. **Signal density: maximal.** Every `tool_use` block is a candidate teacher-correction site. The 5 pre-selected sessions in the recon doc contain 6,762 tool_use messages (range 125 → 2,830 per session). That's 100× the density of Spike 001's 50 synthetic states. 4. **License: clean.** The trace files are user-owned files on the user's own machine. We don't redistribute them with the framework. The *ingester* code we write is MIT and ships in the framework. Anyone running the framework who wants real-trace ingestion uses their own local Claude Code sessions. ## Consequences ### Accepted - Spike 007 implements `TraceIngester.ingest(path: Path) -> Iterator[TraceState]` for the Claude Code JSONL format. - The TraceIngester ships as part of the package (Wave 10 packaging) under `composer_replication.ingestion.claude_code`. - The recon doc's 5 pre-selected real sessions become the **smoke fixture** for Spike 007's tests. We pin to a known set of session IDs so the test is deterministic locally; CI users substitute their own. - `ingestion/` directory pattern is established now to support adding ingesters for OpenHands and SWE-smith later if Spike 007 reveals signal-density gaps. ### Open questions resolved by ADR-002 1. **Granularity** — One `TraceState` per assistant turn (not per `tool_use`). A single assistant turn often emits multiple `tool_use` blocks for one reasoning step; treating each tool_use as a separate state would over-fragment the conversation. Discussion in TRACE_SOURCE_RECONNAISSANCE §5. 2. **`student_action` mapping** — The literal text of the assistant turn (concatenated `text` blocks of the Claude message) becomes `student_action`. The teacher-replay channel asks N teachers to produce their version of "what should the assistant do here?" given the `messages` history; we then DPO-compare teacher consensus vs literal student text. 3. **Thinking blocks** — Strip `thinking` blocks from the message history passed to teachers (teachers don't have access to Claude's reasoning trace). KEEP them in the `student_action` for the student's own reproduction loop, since that's the actual generation we'd be RL-training. 4. **System prompt** — Inject a synthetic system prompt at message[0] of each `TraceState` describing "you are a coding agent" so teachers without their own coding-agent system prompt have a fair playing field. 5. **Subagent traces** — Skip them in v0.1; only ingest top-level sessions. Subagent traces have a different structure (parent task ID etc.) that would complicate the v0.1 ingester. ### Recon-flagged risk (not blocking) - Anthropic doesn't publish a versioned schema. The TraceIngester pins to known record-types as of 2026-05-26 and gracefully degrades on unknown types. If Anthropic ships a breaking change to the JSONL format, we'd need to bump a `schema_version` constant in the ingester. Acceptable ongoing maintenance burden. ### Risk added 2026-05-26 by cross-model review (NOT BLOCKING but TO DOCUMENT) - **Circularity / data-leakage in the teacher-replay channel.** Claude Code traces are produced by Claude. Our default teacher pool (`DEFAULT_TEACHERS`) includes `anthropic/claude-opus-4.7`. Training a student on Claude's outputs while Claude is one of the teachers voting on what the student should do produces a biased disagreement signal: Claude's vote is correlated with the trace's existing `student_action` (which Claude originally produced). This biases the multi-teacher consensus toward the existing answer. - **Mitigation**: when ingesting Claude Code traces, the user should drop Claude from the teacher pool and use a non-Claude consensus (Opus 4.7 → GPT-5 + DeepSeek V4-Pro, or any non-Claude pair). Documented here; not yet enforced in code. - **Open question for v0.2**: should `ClaudeCodeIngester` automatically annotate the source-model field on each trace and `replay_trace` automatically exclude same-family teachers? Defer the design until the post-replication phase reveals whether the bias is observable. ### Future ingesters Open the door for two more ingesters in v0.2: - `composer_replication.ingestion.openhands` — for users who run OpenHands - `composer_replication.ingestion.swe_smith` — for users who download the HF dataset Both follow the same `Iterator[TraceState]` contract. ## Source `docs/research/TRACE_SOURCE_RECONNAISSANCE.md` (subagent recon, primary-sourced including direct inspection of the user's local sessions, 2026-05-26).