File size: 7,531 Bytes
ac4bfb4
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
f16fa23
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
ac4bfb4
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
# ADR-002 — Trace source for Spike 007 (real LLM-application traces)

**Status**: Accepted
**Date**: 2026-05-26
**Wave**: Phase 4 (deep work loop)

## Context

Spike 007 closes V5 of the vision validation: "real LLM-application traces."
Spike 001 used 50 hand-crafted synthetic states for the cost-floor measurement.
The framework's brief explicitly said *real traces*, so we owe Spike 007 a
primary-sourced ingestion path that converts a real, public, multi-turn agent
trace format into our existing `TraceState` TypedDict.

Existing schema (verified from `spikes/005-integrated-trainer-skeleton/teacher_replay.py`):

```python
class TraceState(TypedDict):
    state_id: str           # unique within the trace
    messages: list[dict]    # OpenAI-style conversation up to + incl this step
    student_action: str     # what the student did at this step
```

(Earlier deep-work-loop notes called this `TraceExample` — that was a brain
glitch; the actual type is `TraceState` and there is no `TraceExample`.)

## Options considered

| Option | Schema | Acquisition | Signal density | License |
|---|---|---|---|---|
| (a) Claude Code session JSONL | Documented + 4 reverse-engineered schemas | **1,015 local sessions** zero-cost | per-step `tool_use` blocks = ideal teacher-correction sites | User-owned local files; framework MIT |
| (b) Cline VS Code extension | No stable export schema | Would need custom extraction | Unknown until extracted | Apache 2.0 (extension), trace data user-owned |
| (c) OpenHands trajectories | Documented (v0/v1 in flux) | Need to run OpenHands or download leaderboard submissions | Strong | MIT |
| (d) Aider chat history | Markdown chat (lossy for tool calls) | Local only if user runs Aider | Weak — collapses tool structure | Apache 2.0 |
| (e) SWE-bench leaderboard trajs | Heterogeneous, free-format | Public download | Strong but uneven | Per-submission (mostly permissive) |
| (f) SWE-smith-trajectories (HF) | Messages-only, structure collapsed | HF dataset download | Strong but lossy | MIT |

Source: `docs/research/TRACE_SOURCE_RECONNAISSANCE.md` (2026-05-26 subagent recon).

## Decision

**Option (a) — Claude Code session JSONL** at `~/.claude/projects/<encoded>/<sessionid>.jsonl`.

Wins on every axis we care about for Spike 007:

1. **Acquisition cost: zero.** 1,015 real sessions already on this machine
   from the user's daily Claude Code use. No download, no consent
   negotiation, no rate limiting, no schema change risk during ingestion
   development.

2. **Schema stability: empirically validated.** The subagent ran a programmatic
   audit on 8 real sessions; record types are stable across all of them.
   Anthropic publishes user-facing docs for the format; four independent
   community projects (claude-code-cli-tools, claudeflow, etc.) ship
   working parsers including one with a JSON Schema validated against
   ~50,000 real messages.

3. **Signal density: maximal.** Every `tool_use` block is a candidate
   teacher-correction site. The 5 pre-selected sessions in the recon doc
   contain 6,762 tool_use messages (range 125 → 2,830 per session). That's
   100× the density of Spike 001's 50 synthetic states.

4. **License: clean.** The trace files are user-owned files on the user's
   own machine. We don't redistribute them with the framework. The
   *ingester* code we write is MIT and ships in the framework. Anyone
   running the framework who wants real-trace ingestion uses their own
   local Claude Code sessions.

## Consequences

### Accepted

- Spike 007 implements `TraceIngester.ingest(path: Path) -> Iterator[TraceState]`
  for the Claude Code JSONL format.
- The TraceIngester ships as part of the package (Wave 10 packaging) under
  `composer_replication.ingestion.claude_code`.
- The recon doc's 5 pre-selected real sessions become the **smoke fixture**
  for Spike 007's tests. We pin to a known set of session IDs so the test
  is deterministic locally; CI users substitute their own.
- `ingestion/` directory pattern is established now to support adding
  ingesters for OpenHands and SWE-smith later if Spike 007 reveals
  signal-density gaps.

### Open questions resolved by ADR-002

1. **Granularity** — One `TraceState` per assistant turn (not per `tool_use`).
   A single assistant turn often emits multiple `tool_use` blocks for one
   reasoning step; treating each tool_use as a separate state would
   over-fragment the conversation. Discussion in TRACE_SOURCE_RECONNAISSANCE
   §5.

2. **`student_action` mapping** — The literal text of the assistant turn
   (concatenated `text` blocks of the Claude message) becomes
   `student_action`. The teacher-replay channel asks N teachers to produce
   their version of "what should the assistant do here?" given the
   `messages` history; we then DPO-compare teacher consensus vs literal
   student text.

3. **Thinking blocks** — Strip `thinking` blocks from the message history
   passed to teachers (teachers don't have access to Claude's reasoning
   trace). KEEP them in the `student_action` for the student's own
   reproduction loop, since that's the actual generation we'd be RL-training.

4. **System prompt** — Inject a synthetic system prompt at message[0] of
   each `TraceState` describing "you are a coding agent" so teachers
   without their own coding-agent system prompt have a fair playing field.

5. **Subagent traces** — Skip them in v0.1; only ingest top-level sessions.
   Subagent traces have a different structure (parent task ID etc.) that
   would complicate the v0.1 ingester.

### Recon-flagged risk (not blocking)

- Anthropic doesn't publish a versioned schema. The TraceIngester pins to
  known record-types as of 2026-05-26 and gracefully degrades on unknown
  types. If Anthropic ships a breaking change to the JSONL format, we'd
  need to bump a `schema_version` constant in the ingester. Acceptable
  ongoing maintenance burden.

### Risk added 2026-05-26 by cross-model review (NOT BLOCKING but TO DOCUMENT)

- **Circularity / data-leakage in the teacher-replay channel.** Claude
  Code traces are produced by Claude. Our default teacher pool
  (`DEFAULT_TEACHERS`) includes `anthropic/claude-opus-4.7`. Training a
  student on Claude's outputs while Claude is one of the teachers
  voting on what the student should do produces a biased disagreement
  signal: Claude's vote is correlated with the trace's existing
  `student_action` (which Claude originally produced). This biases the
  multi-teacher consensus toward the existing answer.
  - **Mitigation**: when ingesting Claude Code traces, the user should
    drop Claude from the teacher pool and use a non-Claude consensus
    (Opus 4.7 → GPT-5 + DeepSeek V4-Pro, or any non-Claude pair).
    Documented here; not yet enforced in code.
  - **Open question for v0.2**: should `ClaudeCodeIngester` automatically
    annotate the source-model field on each trace and `replay_trace`
    automatically exclude same-family teachers? Defer the design until
    the post-replication phase reveals whether the bias is observable.

### Future ingesters

Open the door for two more ingesters in v0.2:
- `composer_replication.ingestion.openhands` — for users who run OpenHands
- `composer_replication.ingestion.swe_smith` — for users who download the HF dataset

Both follow the same `Iterator[TraceState]` contract.

## Source

`docs/research/TRACE_SOURCE_RECONNAISSANCE.md` (subagent recon, primary-sourced
including direct inspection of the user's local sessions, 2026-05-26).