composer-replication-framework / docs /adrs /ADR-002-trace-source.md
Codeseys's picture
Wave 11: cross-model adversarial review + honest down-revision
f16fa23

ADR-002 — Trace source for Spike 007 (real LLM-application traces)

Status: Accepted Date: 2026-05-26 Wave: Phase 4 (deep work loop)

Context

Spike 007 closes V5 of the vision validation: "real LLM-application traces." Spike 001 used 50 hand-crafted synthetic states for the cost-floor measurement. The framework's brief explicitly said real traces, so we owe Spike 007 a primary-sourced ingestion path that converts a real, public, multi-turn agent trace format into our existing TraceState TypedDict.

Existing schema (verified from spikes/005-integrated-trainer-skeleton/teacher_replay.py):

class TraceState(TypedDict):
    state_id: str           # unique within the trace
    messages: list[dict]    # OpenAI-style conversation up to + incl this step
    student_action: str     # what the student did at this step

(Earlier deep-work-loop notes called this TraceExample — that was a brain glitch; the actual type is TraceState and there is no TraceExample.)

Options considered

Option Schema Acquisition Signal density License
(a) Claude Code session JSONL Documented + 4 reverse-engineered schemas 1,015 local sessions zero-cost per-step tool_use blocks = ideal teacher-correction sites User-owned local files; framework MIT
(b) Cline VS Code extension No stable export schema Would need custom extraction Unknown until extracted Apache 2.0 (extension), trace data user-owned
(c) OpenHands trajectories Documented (v0/v1 in flux) Need to run OpenHands or download leaderboard submissions Strong MIT
(d) Aider chat history Markdown chat (lossy for tool calls) Local only if user runs Aider Weak — collapses tool structure Apache 2.0
(e) SWE-bench leaderboard trajs Heterogeneous, free-format Public download Strong but uneven Per-submission (mostly permissive)
(f) SWE-smith-trajectories (HF) Messages-only, structure collapsed HF dataset download Strong but lossy MIT

Source: docs/research/TRACE_SOURCE_RECONNAISSANCE.md (2026-05-26 subagent recon).

Decision

Option (a) — Claude Code session JSONL at ~/.claude/projects/<encoded>/<sessionid>.jsonl.

Wins on every axis we care about for Spike 007:

  1. Acquisition cost: zero. 1,015 real sessions already on this machine from the user's daily Claude Code use. No download, no consent negotiation, no rate limiting, no schema change risk during ingestion development.

  2. Schema stability: empirically validated. The subagent ran a programmatic audit on 8 real sessions; record types are stable across all of them. Anthropic publishes user-facing docs for the format; four independent community projects (claude-code-cli-tools, claudeflow, etc.) ship working parsers including one with a JSON Schema validated against ~50,000 real messages.

  3. Signal density: maximal. Every tool_use block is a candidate teacher-correction site. The 5 pre-selected sessions in the recon doc contain 6,762 tool_use messages (range 125 → 2,830 per session). That's 100× the density of Spike 001's 50 synthetic states.

  4. License: clean. The trace files are user-owned files on the user's own machine. We don't redistribute them with the framework. The ingester code we write is MIT and ships in the framework. Anyone running the framework who wants real-trace ingestion uses their own local Claude Code sessions.

Consequences

Accepted

  • Spike 007 implements TraceIngester.ingest(path: Path) -> Iterator[TraceState] for the Claude Code JSONL format.
  • The TraceIngester ships as part of the package (Wave 10 packaging) under composer_replication.ingestion.claude_code.
  • The recon doc's 5 pre-selected real sessions become the smoke fixture for Spike 007's tests. We pin to a known set of session IDs so the test is deterministic locally; CI users substitute their own.
  • ingestion/ directory pattern is established now to support adding ingesters for OpenHands and SWE-smith later if Spike 007 reveals signal-density gaps.

Open questions resolved by ADR-002

  1. Granularity — One TraceState per assistant turn (not per tool_use). A single assistant turn often emits multiple tool_use blocks for one reasoning step; treating each tool_use as a separate state would over-fragment the conversation. Discussion in TRACE_SOURCE_RECONNAISSANCE §5.

  2. student_action mapping — The literal text of the assistant turn (concatenated text blocks of the Claude message) becomes student_action. The teacher-replay channel asks N teachers to produce their version of "what should the assistant do here?" given the messages history; we then DPO-compare teacher consensus vs literal student text.

  3. Thinking blocks — Strip thinking blocks from the message history passed to teachers (teachers don't have access to Claude's reasoning trace). KEEP them in the student_action for the student's own reproduction loop, since that's the actual generation we'd be RL-training.

  4. System prompt — Inject a synthetic system prompt at message[0] of each TraceState describing "you are a coding agent" so teachers without their own coding-agent system prompt have a fair playing field.

  5. Subagent traces — Skip them in v0.1; only ingest top-level sessions. Subagent traces have a different structure (parent task ID etc.) that would complicate the v0.1 ingester.

Recon-flagged risk (not blocking)

  • Anthropic doesn't publish a versioned schema. The TraceIngester pins to known record-types as of 2026-05-26 and gracefully degrades on unknown types. If Anthropic ships a breaking change to the JSONL format, we'd need to bump a schema_version constant in the ingester. Acceptable ongoing maintenance burden.

Risk added 2026-05-26 by cross-model review (NOT BLOCKING but TO DOCUMENT)

  • Circularity / data-leakage in the teacher-replay channel. Claude Code traces are produced by Claude. Our default teacher pool (DEFAULT_TEACHERS) includes anthropic/claude-opus-4.7. Training a student on Claude's outputs while Claude is one of the teachers voting on what the student should do produces a biased disagreement signal: Claude's vote is correlated with the trace's existing student_action (which Claude originally produced). This biases the multi-teacher consensus toward the existing answer.
    • Mitigation: when ingesting Claude Code traces, the user should drop Claude from the teacher pool and use a non-Claude consensus (Opus 4.7 → GPT-5 + DeepSeek V4-Pro, or any non-Claude pair). Documented here; not yet enforced in code.
    • Open question for v0.2: should ClaudeCodeIngester automatically annotate the source-model field on each trace and replay_trace automatically exclude same-family teachers? Defer the design until the post-replication phase reveals whether the bias is observable.

Future ingesters

Open the door for two more ingesters in v0.2:

  • composer_replication.ingestion.openhands — for users who run OpenHands
  • composer_replication.ingestion.swe_smith — for users who download the HF dataset

Both follow the same Iterator[TraceState] contract.

Source

docs/research/TRACE_SOURCE_RECONNAISSANCE.md (subagent recon, primary-sourced including direct inspection of the user's local sessions, 2026-05-26).