# TRACE_SOURCE_RECONNAISSANCE.md

Spike 007 trace-source audit, feeding ADR-002.

Status: **DECIDED**. Recommendation: **(a) Claude Code session JSONL** (`~/.claude/projects//.jsonl`).

---

## 0. TL;DR

Of the six candidates audited, Claude Code session JSONL wins on every axis except "official Anthropic-published schema" (no such doc exists). That single weakness is offset by a community-maintained, reverse-engineered JSON Schema validated against ~50,000 messages from real sessions, plus three independent third-party schema specs. The user has **1,015 .jsonl sessions on this machine** today; the eight largest sampled span 550 → 17,315 lines and contain **6,762 multi-turn `tool_use` messages**. Acquisition cost is zero. Licensing is clean: the JSONL files are local user-owned data; the proprietary Claude Code binary is not redistributed by us.

The runners-up each lose on at least one of the four axes: OpenHands (well-documented but acquisition is non-trivial), SWE-bench trajectory submissions (heterogeneous schemas across submitters), Aider markdown chat history (lossy / unparseable for tool calls), and Cline (no public stable export format).

---

## 1. Context: TraceExample dataclass field reality

**Important correction to the parent task description.** The task brief said "TraceExample dataclass with fields state_text, action_taken, hint_text (optional), reward (float), teacher_id (str)". Reading the actual file at `/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/teacher_replay.py` shows the existing types are different: there is no `TraceExample` class.
The closest existing types are two `TypedDict`s used by `replay_trace()` and `extract_dpo_pairs()`:

```python
from typing import TypedDict


class TraceState(TypedDict):
    state_id: str          # unique within the trace
    messages: list[dict]   # conversation up to and including this step's user prompt
    student_action: str    # what the student actually did at this step


class DPOPair(TypedDict):
    state_id: str
    state_messages: list[dict]
    chosen: str            # teacher-consensus action
    rejected: str          # student action
    n_teachers_agreeing: int
```

The mapping sketch in §6 below targets `TraceState` (the *input* to teacher replay), since that is the type a `TraceIngester` is upstream of. If Spike 007 also wants a unified `TraceExample` per the brief, the natural shape is `TraceState` ∪ `{teacher_id: str | None, reward: float | None, hint_text: str | None}`; this is flagged for ADR-002 to settle.

---

## 2. Candidate audit summary

Scoring legend: `+` good, `~` mixed, `-` bad, on each of the four required axes.

| # | Candidate | Schema documented | Real ≥5 multi-turn traces | Hint-receptive signal density | License OK | Verdict |
|---|---|---|---|---|---|---|
| **a** | **Claude Code JSONL** (`~/.claude/projects/`) | `~` Anthropic publishes high-level format note; community schemas are detailed and validated | **+** 1,015 local sessions, 5+ trivially | **+** Per-step `assistant.message.content[].tool_use` blocks → discrete actions, ideal teacher-correction sites | **+** User-owned local files; framework MIT | **CHOSEN** |
| b | Cline VS Code extension | `-` No published stable export schema | `~` Requires running Cline + manual export | `~` Plausible if exported but unverified | `~` Cline source Apache-2.0 but trace format isn't a stable contract | reject |
| c | OpenHands trajectories | **+** Well-documented (events/, base_state.json, Pydantic Event models) | `-` Need to *run* OpenHands or download eval traces; not zero-cost | **+** ActionEvent/ObservationEvent split is conceptually ideal | **+** OpenHands MIT-licensed | strong runner-up |
| d | Aider chat history | `~` Format is "markdown, level-4 headings for user input"; fragile | `~` Available if Aider was used | `-` Tool calls are flattened into prose; recovering structured actions is lossy | `+` Aider Apache-2.0 | reject |
| e | SWE-bench / Lite leaderboard `trajs/` | `-` Each submitter chooses a free-form text format (md/json/yaml) | **+** ~hundreds of submissions on github.com/swe-bench/experiments | `~` Heterogeneous; structured ones (e.g. mini-swe-agent `.traj.json`) are good, others are essentially logs | **+** Public submissions with usage rights for research | reject as primary; usable as future cross-validation set |
| f | SWE-smith-trajectories on HF | **+** Standard OpenAI messages format, documented per dataset card | **+** 5,017 trajectories, 76,002 rows, public | **+** Single-attempt per-instance SWE-agent runs | **+** Apache-2.0 dataset license | strong runner-up; **complement, not replacement** |

The (f) row was discovered during audit (the parent task allowed "any other public source you find that is better"). It's a strong candidate but answers a *different* question: SWE-bench trajectories give us reproducible benchmark traces; Claude Code JSONL gives us *the user's actual workflow*. For Spike 007's purpose (verify the teacher-replay path works on a real, signal-dense trace at zero acquisition cost), (a) is the right primary; (f) is queued for a later cross-validation phase.

---

## 3. Chosen format spec: Claude Code session JSONL

### 3.1 Location and naming

- **Root**: `~/.claude/projects/` (overridable via `CLAUDE_CONFIG_DIR`). Source: ("Transcripts are stored as JSONL at `~/.claude/projects//.jsonl`").
- **Project-key encoding**: working-directory absolute path with `/` and `\` and `:` replaced by `-`, with a leading `-`. (Hidden directories with a leading dot become double dashes.) Source: §"Project key encoding".
- **File**: `.jsonl`.
  Subagent transcripts are `agent-.jsonl`; a `SessionReader` should *skip* files starting with `agent-` when listing main sessions. Source: same `claude_skills` doc, §"Subagent File Location".
- **Encoding**: UTF-8, newline-delimited JSON. One JSON object per line. No `[`/`]` wrapping.

Local cleanup default 30 days, configurable via `cleanupPeriodDays` in `~/.claude/settings.json`. Source: ("Local caching: Claude Code clients store session transcripts locally in plaintext under `~/.claude/projects/` for 30 days by default to enable session resumption.")

### 3.2 Common record fields

Every record (both user and assistant types) carries:

| field | type | meaning |
|---|---|---|
| `parentUuid` | `string \| null` | UUID of the parent record (null on the first record) |
| `uuid` | `string` | This record's UUID |
| `sessionId` | `string` | UUID of the session (matches filename) |
| `timestamp` | `string` (ISO-8601) | Wall-clock time of the record |
| `cwd` | `string` | Absolute working directory |
| `version` | `string` | Claude Code version (e.g. `"2.1.143"`) |
| `gitBranch` | `string` | Empty string `""` when not in a git repo |
| `isSidechain` | `boolean` | True for sub-agent (Task tool) chains |
| `userType` | `string` | `"external"` or similar |
| `type` | `string` | Discriminator, see §3.3 |
| `entrypoint` | `string` | e.g. `"sdk-cli"` |

Sources for these fields:

- §"Type Definitions" → `BaseMessageEntry`
- §"Top-Level Record Fields"
- (machine-validated against ~50,000 messages from 480 real sessions)
- Direct inspection (this doc): `head` of `~/.claude/projects/-mnt-e-CS-HF-eidolon/c6967343-51a3-4b1b-9472-a569e96114b1.jsonl` confirms presence of every field above.
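
The listing rule from §3.1 and the field table above can be spot-checked on any machine with a few lines of Python. The following is a minimal sketch, not part of the audited tooling: the helper names (`list_main_sessions`, `missing_common_fields`, `audit_session`) are hypothetical, it assumes sessions sit exactly one directory below the root, and it treats every field in the table as required, which the community schemas may not guarantee.

```python
import json
from pathlib import Path

# Field names copied from the table above. Treating all of them as
# required is an assumption of this sketch, not a documented guarantee.
COMMON_FIELDS = (
    "parentUuid", "uuid", "sessionId", "timestamp", "cwd", "version",
    "gitBranch", "isSidechain", "userType", "type", "entrypoint",
)


def list_main_sessions(root: Path) -> list[Path]:
    """Return main-session transcripts under root, skipping agent-* files."""
    return sorted(
        p for p in root.glob("*/*.jsonl") if not p.name.startswith("agent-")
    )


def missing_common_fields(record: dict) -> list[str]:
    """Names of common fields that are absent from one record."""
    return [name for name in COMMON_FIELDS if name not in record]


def audit_session(path: Path) -> dict[str, int]:
    """Count, per field, how many user/assistant records lack it."""
    gaps: dict[str, int] = {}
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # e.g. a truncated final line
            if not isinstance(record, dict):
                continue
            if record.get("type") not in ("user", "assistant"):
                continue  # only conversational records are checked here
            for name in missing_common_fields(record):
                gaps[name] = gaps.get(name, 0) + 1
    return gaps
```

Running `audit_session` over the output of `list_main_sessions(Path.home() / ".claude" / "projects")` gives an empty dict for a session where every user/assistant record carries all eleven fields; any non-empty result is a cue to loosen the ingester rather than a sign of corruption.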
### 3.3 Record types (`type` discriminator)

| `type` | Role |
|---|---|
| `user` | Both human prompts AND tool results (distinguished by `message.content[].type`) |
| `assistant` | Model output: text, `thinking`, and `tool_use` blocks |
| `system` | Hook summaries, stop notices |
| `summary` | Context-compaction markers |
| `attachment` | Hook stdout/stderr, e.g. `SessionStart` hook output |
| `queue-operation` | Prompt enqueue/dequeue events |
| `file-history-snapshot` | File-state tracking for undo |
| `last-prompt` | Bookkeeping for resume |

Source: §"Entry Types"; corroborated by direct `Counter` inspection of one local session showing `attachment, assistant, user, last-prompt, queue-operation` types in expected proportions.

### 3.4 The two record types we care about

#### Assistant record carrying a tool call (the "student action")

Real example, redacted from `~/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-doc-adapter-skeleton/39df59f0-674c-413a-b333-cdac0cea9db7.jsonl`:

```json
{
  "type": "assistant",
  "uuid": "24a16a51-3133-4ba5-9d23-472864286154",
  "parentUuid": "1b11c3b3-832b-4473-a944-b61a1f3f2594",
  "sessionId": "39df59f0-…",
  "timestamp": "2026-05-16T04:52:21.947Z",
  "message": {
    "role": "assistant",
    "model": "claude-opus-4-7",
    "content": [
      {
        "type": "tool_use",
        "id": "toolu_bdrk_012HC2dggmSgtVAtWWzwikZq",
        "name": "Bash",
        "input": {
          "command": "ov mail check --agent builder-doc-adapter-skeleton 2>&1 | head -200",
          "description": "Check builder agent inbox"
        }
      }
    ],
    "stop_reason": "tool_use",
    "usage": { "input_tokens": 6, "cache_creation_input_tokens": 48287, "output_tokens": 1021, ... }
  }
}
```

The student's *action* at this step = the JSON of `message.content[i]` where `content[i].type == "tool_use"` (or, if multiple tool_use blocks, the array of them; or if pure-text reply, the `content[i].text` of the `text` block).
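
To make the "discrete action" idea concrete, here is a small sketch that pulls the `tool_use` blocks out of an assistant record shaped like the example above and compares the tool name against a teacher's proposal. The helper names (`extract_tool_calls`, `same_tool`) and the teacher call are hypothetical illustrations; the record literal is a trimmed copy of the example, not a full transcript line.

```python
def extract_tool_calls(record: dict) -> list[dict]:
    """Return the tool_use blocks of one assistant record as id/name/input dicts."""
    if record.get("type") != "assistant":
        return []
    message = record.get("message")
    content = message.get("content") if isinstance(message, dict) else None
    if not isinstance(content, list):
        return []  # e.g. plain-string content carries no tool calls
    return [
        {"id": block.get("id"), "name": block.get("name"), "input": block.get("input")}
        for block in content
        if isinstance(block, dict) and block.get("type") == "tool_use"
    ]


def same_tool(student_call: dict, teacher_call: dict) -> bool:
    """True when both calls name the same tool (inputs are not compared)."""
    return student_call.get("name") == teacher_call.get("name")


# Trimmed copy of the assistant record shown above.
example_record = {
    "type": "assistant",
    "message": {
        "role": "assistant",
        "content": [
            {
                "type": "tool_use",
                "id": "toolu_bdrk_012HC2dggmSgtVAtWWzwikZq",
                "name": "Bash",
                "input": {
                    "command": "ov mail check --agent builder-doc-adapter-skeleton 2>&1 | head -200",
                    "description": "Check builder agent inbox",
                },
            }
        ],
    },
}

calls = extract_tool_calls(example_record)
print(calls[0]["name"])  # Bash

# Hypothetical teacher proposal, invented for illustration only.
teacher_call = {"name": "Read", "input": {"file_path": "README.md"}}
print(same_tool(calls[0], teacher_call))  # False
```

Keeping the `id` alongside `name` and `input` is a design choice of this sketch: the id is what a later step would use to join a call to its result, while `name` alone is enough for a coarse agreement check.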
#### User record carrying a tool result (the "observation")

```json
{
  "type": "user",
  "uuid": "b9f9414b-…",
  "parentUuid": "24a16a51-…",            // matches the assistant uuid above
  "sessionId": "39df59f0-…",
  "timestamp": "2026-05-16T04:52:23.229Z",
  "message": {
    "role": "user",
    "content": [
      {
        "tool_use_id": "toolu_bdrk_012HC2dggmSgtVAtWWzwikZq",
        "type": "tool_result",
        "content": " No new messages",
        "is_error": false
      }
    ]
  },
  "toolUseResult": {                     // duplicate, structured form
    "stdout": " No new messages",
    "stderr": "",
    "interrupted": false,
    "isImage": false,
    "noOutputExpected": false
  },
  "sourceToolAssistantUUID": "24a16a51-…" // back-pointer to the assistant uuid
}
```

User records carrying actual human prompts have `message.content` as a list with `{"type":"text","text":"..."}` blocks (or, in older logs, `message.content` as a plain string).

### 3.5 Schema stability

- **Anthropic's official documentation** acknowledges the location and "each line is a JSON object for a message, tool use, or metadata entry" but does **not** publish a versioned schema.
- **Practical stability**: moru-ai/agent-schemas tracked v2.0.76 → v2.1.1; only one new field of note (`toolUseResult`). Schema pins `additionalProperties: true` for forward compatibility. This level of stability is sufficient for Spike 007 (a research spike, not a long-lived product API).
- **Mitigation**: pin to a specific Claude Code `version` field range and version-gate the ingester (e.g. accept `2.1.x`, warn on others).

### 3.6 Licensing

- The Claude Code binary is **proprietary** (Anthropic Commercial Terms of Service, ).
- The session JSONL files are **local user data** generated on the user's machine during ordinary use. Anthropic's data-usage doc explicitly calls them "local caching … session transcripts locally in plaintext"; they belong to the user.
- Our framework is MIT-licensed and we are **not redistributing the Claude Code binary or any third-party trace files**.
  We are reading the user's own local logs (analogous to processing one's own `.bash_history`).
- We MUST NOT publish raw trace files in our repo without the user's consent (PII risk: cwd, gitBranch, file contents). The framework should ship only the *ingester*, plus a tiny synthetic-fixture trace for unit tests.

---

## 4. Acquiring the 5 real example traces

**Zero acquisition cost.** All five live on this machine right now. Discovery command (used during this audit):

```bash
find ~/.claude/projects -name "*.jsonl" 2>/dev/null
# → 1015 files
```

Five concrete pre-selected sessions, each multi-turn (≥ 100 tool_use messages), each from a distinct project, each ≥ 50 KB:

| # | Tool-use msgs | User msgs | Asst msgs | Total lines | Path |
|---|---|---|---|---|---|
| 1 | 2,830 | 3,199 | 4,325 | 17,315 | `/home/codeseys/.claude/projects/-mnt-e-CS-HF-eidolon/c6967343-51a3-4b1b-9472-a569e96114b1.jsonl` |
| 2 | 1,350 | 1,407 | 2,016 | 7,673 | `/home/codeseys/.claude/projects/-mnt-e-CS-github-agent-manager/c42b68ea-d410-455e-bc71-92ec6c4adce9.jsonl` |
| 3 | 984 | 1,032 | 1,549 | 5,783 | `/home/codeseys/.claude/projects/-mnt-e-CS-HF-streaming-speech-to-speech/73c9925c-d5e5-48fc-a97b-a58687c2fb3c.jsonl` |
| 4 | 717 | 759 | 1,142 | 4,036 | `/home/codeseys/.claude/projects/-mnt-e-CS-github/6ac8e20f-98ec-4279-9957-e68862a90c5e.jsonl` |
| 5 | 125 | 126 | 197 | 629 | `/home/codeseys/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-iteration-checkpoint/e4a34e2b-40c6-49ce-b253-912a43224aae.jsonl` |

(All five were inspected programmatically during this audit; the counts above are real, not estimates.)

For users on other machines: `find ~/.claude/projects -name '*.jsonl' -size +50k | head` will surface candidates. For repository CI we will commit a small (~5 KB) **synthetic** fixture conforming to the schema, never any of the user's real traces.

---

## 5. Decision-relevant tradeoffs vs runners-up

### Why we are NOT picking OpenHands trajectories (c)

- **Pro**: cleanest schema we audited: Pydantic `Event` / `ActionEvent` / `ObservationEvent` models, source: , source code: . Tool-call structure is *more* normalized than Claude Code's (explicit Action/Observation typing).
- **Con**: zero-acquisition is false here. Persistence dir defaults to `workspace/conversations/` and only exists if the user has *run OpenHands locally*. Public eval trajectories are spread across the eval/ folder rather than a clean public bucket.
- **Decisive**: Spike 001's economic floor was measured on 50 synthetic states. Spike 007's purpose is to verify ingestion + replay on real traces *that already exist*. (a) gives that today; (c) requires standing up OpenHands first, plus the storage format split between v0 (per-event JSON files) and v1 (timestamped files) per , which is a flux risk.
- **Future use**: if the framework ever ships "trace ingester adapters" plural, OpenHands is the second adapter to write, since its event-typed model is conceptually superior.

### Why we are NOT picking SWE-bench leaderboard trajectories (e)

- **Pro**: hundreds of submissions on , with required `trajs/` folders.
- **Con**: leaderboard rules say "The reasoning trace can be represented with **any text based file format (e.g. md, json, yaml)**" (source: README). Each submitter picks their own. Building a generic ingester is a per-submission engineering project, not a single adapter. SWE-agent uses one shape (`{"action", "observation", "response"}` arrays, confirmed via ); mini-swe-agent uses `.traj.json` with OpenAI messages format ().
- **Decisive**: heterogeneous schema = fragile ingester = wrong choice for *first* spike.

### Why we are NOT picking Aider (d)

- The `chat_history_file` is **markdown** (`.aider.chat.history.md`), per . Source code at shows it's literally `f.write(text)` of formatted prose with `####` for user input.
- **Decisive**: tool calls in Aider are *applied as edits*, not preserved as discrete structured actions in the markdown log. Reconstructing "the action the student took at step k" is lossy. The `.aider.llm.history` log is closer to what we want but is opt-in and not always present.

### Why we are NOT picking Cline (b)

- No public commitment to a stable export schema. Cline's storage is internal to the VS Code extension (workspace state DB + per-task JSON in extension storage). Searching for "Cline trace export schema" yields no Anthropic-style spec doc. Workable in principle, but reverse-engineering an extension's storage is not the right ground for a 1-week spike.

### Why we are NOT picking SWE-smith-trajectories (f)

- This is the **strongest external dataset** we found and **should be Spike 007's stretch goal / Spike 008's primary**: 5,017 fine-tuning trajectories from SWE-agent + Claude 3.7 Sonnet, 4.22 GB on HuggingFace, OpenAI messages format. Source: .
- **Why not first**: the messages-only format collapses tool calls and tool results into the OpenAI chat-completions wire format with text-encoded tool blocks. That works for SFT but is *less* signal-dense for the teacher-correction spike than Claude Code's `tool_use` blocks, because in Claude Code's format the tool call's `name` and `input` fields are structurally separated, making "did the teacher pick a different tool?" a one-line check.

---

## 6. TraceIngester sketch

> **Realised in v0.1 (Wave 17 update):** The realised ingester ships at
> `composer_replication/ingestion/claude_code.py` exporting
> `ClaudeCodeIngester`, with the spike at
> `spikes/007-real-trace-ingestion/claude_code_ingester.py`.
> The public production surface is:
>
> ```python
> from pathlib import Path
> from composer_replication.ingestion.claude_code import ClaudeCodeIngester
>
> ingester = ClaudeCodeIngester(skip_sidechain=True, strip_thinking=True)
> for trace_state in ingester.ingest(Path("~/.claude/projects/.../session.jsonl").expanduser()):
>     # trace_state matches the TraceState TypedDict from §1
>     ...
> stats = ingester.last_stats  # IngestionStats: turn counts, skip reasons
> ```
>
> The shipped `ClaudeCodeIngester` differs from the pre-spike sketch
> below in:
> - Class name: `ClaudeCodeIngester` (not `TraceIngester`)
> - Module path: `composer_replication.ingestion.claude_code` (not
>   `spikes/007-trace-ingester/trace_ingester.py`)
> - The constructor takes config kwargs (`system_prompt`,
>   `skip_sidechain`, `strip_thinking`, `max_history_tokens`); paths
>   are passed to `.ingest(Path)` per call instead of being held by the
>   ingester
> - The yielded type is `TraceState` (matches §1)
>
> The pre-spike sketch below is preserved as historical proposal context.

Drop-in adapter for spike-005's `replay_trace()`. Targets `TraceState` (the actual existing TypedDict; see §1).

```python
# spikes/007-trace-ingester/trace_ingester.py
from __future__ import annotations

import json
from collections.abc import Iterator
from pathlib import Path

# Re-use the existing TypedDicts from spike-005:
# from spikes.005_integrated_trainer_skeleton.teacher_replay import TraceState

# A "step" in the trace is each assistant record (tool_use or text-only). The
# state visible to the model at that step = all messages strictly before it,
# in OpenAI/Anthropic chat format. The student_action = the tool_use
# payload(s), or the concatenated text when there is no tool_use block.


def _record_to_chat_message(rec: dict, *, strip_thinking: bool = True) -> dict | None:
    """Turn one Claude Code JSONL record into an OpenAI/Anthropic chat-message
    dict, or return None for non-conversational records (queue-operation,
    attachment, file-history-snapshot, system, last-prompt, summary)."""
    t = rec.get("type")
    if t not in ("user", "assistant"):
        return None
    msg = rec.get("message")
    if not isinstance(msg, dict):
        return None
    role = msg.get("role")
    content = msg.get("content")
    if role not in ("user", "assistant") or content is None:
        return None
    # Strip thinking blocks: they are not portable across teacher models and
    # should not influence the teacher's decision at replay time.
    if strip_thinking and isinstance(content, list):
        content = [c for c in content
                   if not (isinstance(c, dict) and c.get("type") == "thinking")]
        if not content:
            return None  # thinking-only record: nothing left to replay
    return {"role": role, "content": content}


def _serialize_action(content_blocks: list[dict]) -> str:
    """Canonicalize the student's action at a step.

    For tool_use steps: JSON-encode the (name, input) pairs.
    For text-only steps: return the concatenated text.
    """
    tool_uses = [b for b in content_blocks
                 if isinstance(b, dict) and b.get("type") == "tool_use"]
    if tool_uses:
        return json.dumps(
            [{"name": tu.get("name"), "input": tu.get("input")} for tu in tool_uses],
            sort_keys=True,
        )
    texts = [b.get("text", "") for b in content_blocks
             if isinstance(b, dict) and b.get("type") == "text"]
    return "\n".join(t for t in texts if t)


class TraceIngester:
    """Reads a Claude Code session JSONL and yields TraceState records.

    One TraceState is emitted per assistant record. The `messages` field is
    the full prior conversation (user/assistant messages; no system prompt,
    see Notes) up to but not including the current assistant turn;
    `student_action` is the canonicalized serialization of that turn's
    content blocks.
    """

    def __init__(self, *, skip_thinking: bool = True, min_action_chars: int = 1) -> None:
        self.skip_thinking = skip_thinking
        self.min_action_chars = min_action_chars

    def ingest(self, path: str | Path) -> Iterator[dict]:  # yields TraceState
        path = Path(path)
        prior_messages: list[dict] = []
        session_id_for_state = path.stem  # filename = session UUID
        with path.open("r", encoding="utf-8") as f:
            for line_idx, line in enumerate(f):
                line = line.strip()
                if not line:
                    continue
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    continue  # tolerate truncated last-line writes
                if not isinstance(rec, dict):
                    continue  # a JSON array or scalar is not a record
                # Honor the constructor flag instead of always stripping.
                chat_msg = _record_to_chat_message(rec, strip_thinking=self.skip_thinking)
                if chat_msg is None:
                    continue
                if chat_msg["role"] == "assistant":
                    # Emit a TraceState representing "before this turn".
                    blocks = chat_msg["content"] if isinstance(chat_msg["content"], list) else []
                    student_action = _serialize_action(blocks)
                    if len(student_action) >= self.min_action_chars:
                        yield {
                            "state_id": f"{session_id_for_state}:{rec.get('uuid', line_idx)}",
                            "messages": list(prior_messages),  # snapshot
                            "student_action": student_action,
                        }
                # Append to history regardless (so subsequent turns see it).
                prior_messages.append(chat_msg)
```

Notes:

- We skip `thinking` blocks because (1) they're Anthropic-specific and (2) feeding them to other-vendor teachers (GPT/DeepSeek) leaks reasoning the teacher should produce on its own. This matches the philosophy used in spike-005's `_normalize_action`.
- We do NOT inject a system prompt: Claude Code's initial system prompt is not in the JSONL (it's set at SDK init and visible only via `attachment` records). Downstream callers may want to prepend a synthetic system message for teacher fairness. Open question for ADR-002.
- `state_id = f"{sessionId}:{recordUuid}"` is globally unique and stable across re-ingest.
- Failures (unparseable lines, missing fields) are tolerated silently. A counters-based sibling method `ingest_with_stats(path)` is a small follow-up.
### 6.1 Smoke-test plan (for Spike 007 itself)

```python
ingester = TraceIngester()
states = list(ingester.ingest("/home/codeseys/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-iteration-checkpoint/e4a34e2b-40c6-49ce-b253-912a43224aae.jsonl"))
# Expect roughly 197 states (the assistant-message count from §4).
# Then teacher-replay on the first 5 states, confirm cost is in the
# spike-001 ballpark ($0.05–$0.20 for 5 states × 3 teachers).
```

Spike 001 baseline to beat: $0.98/trace mean (50-state synthetic), $0.30/trace projected with VOI gating. On real states a ~5–20× cost increase is plausible due to longer message histories (10k+ tokens vs synthetic ~300 tokens), so a relevant **economic check** for Spike 007 is: if the first 5 states cost > $5 (i.e. > $1/state), the VOI gate from Spike 001 is *required* before scaling. Flag this finding in the spike write-up.

---

## 7. Open questions for ADR-002

1. Do we promote `TraceState` to a top-level `TraceExample` dataclass, with optional `teacher_id`, `reward`, `hint_text`? Or keep `TraceState` as ingester output and `DPOPair` as trainer input, treating the brief's "TraceExample" as conceptual?
2. Should `TraceIngester.ingest()` emit one record per **assistant turn** (current sketch) or per **assistant `tool_use` block** within a turn? Some Claude Code records have multiple tool_use blocks in one assistant message.
3. Synthetic system prompt at replay time: yes/no? If yes, what content?
4. Trace-version pinning: hard-fail or warn when `version` field falls outside a known-tested range?
5. Subagent transcripts (`agent-*.jsonl`): include or skip? They are denser per-turn but their parent context is the orchestrator, not the user, which changes the teacher-replay semantics.

---

## 8. References (primary sources only)

Anthropic / Claude Code official:

- Session storage location and "JSONL, one JSON per line"
- "local caching … session transcripts locally in plaintext under `~/.claude/projects/` for 30 days by default"
- Commercial Terms vs Consumer Terms applicability
- Proprietary license

Community schemas (reverse-engineered from real session data):

- JSON Schema Draft 2020-12, validated against ~50,000 messages from 480 sessions
- §"Claude Code Session Log Format": entry types and TypeScript discriminated union
- Top-level fields, project-key encoding, subagent file location
- Directory structure, plan-mode `slug` field
- TypeScript type definitions from session logs

Runners-up reference points:

- OpenHands events: , , ,
- SWE-bench experiments:
- SWE-smith trajectories on HF:
- mini-swe-agent traj.json:
- swe-traj-complete (SWE-agent format example):
- Aider history file format: , ,

Internal references:

- `spikes/005-integrated-trainer-skeleton/teacher_replay.py`: `TraceState`, `DPOPair`, `replay_trace`, `extract_dpo_pairs` (read in full during this audit; see §1 for actual field list)
- Spike 001 economic floor: $0.98/trace mean ungated, $0.30/trace projected with VOI gating