# TRACE_SOURCE_RECONNAISSANCE.md
Spike 007 trace-source audit, feeding ADR-002.
Status: **DECIDED** — recommend **(a) Claude Code session JSONL** (`~/.claude/projects/<encoded>/<sessionId>.jsonl`).
---
## 0. TL;DR
Of the six candidates audited, Claude Code session JSONL wins on every axis except "official Anthropic-published schema" (no such doc exists), and for that single weakness there is now a community-maintained reverse-engineered JSON Schema validated against ~50,000 messages from real sessions, plus three independent third-party schema specs. The user has **1,015 .jsonl sessions on this machine** today; the eight largest sampled span 550 → 17,315 lines and contain **6,762 multi-turn `tool_use` messages**. Acquisition cost is zero. Licensing is clean: the JSONL files are local user-owned data; the proprietary Claude Code binary is not redistributed by us.
The runners-up each lose on at least one of the four axes: OpenHands (well documented, but acquisition is non-trivial), SWE-bench trajectory submissions (heterogeneous schemas across submitters), Aider markdown chat history (lossy / unparseable for tool calls), and Cline (no public stable export format).
---
## 1. Context: TraceExample dataclass field reality
**Important correction to the parent task description.** The task brief said "TraceExample dataclass with fields state_text, action_taken, hint_text (optional), reward (float), teacher_id (str)". Reading the actual file at
`/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/teacher_replay.py` shows the existing types are different — there is no `TraceExample` class. The closest existing types are two `TypedDict`s used by `replay_trace()` and `extract_dpo_pairs()`:
```python
from typing import TypedDict


class TraceState(TypedDict):
    state_id: str         # unique within the trace
    messages: list[dict]  # conversation up to and including this step's user prompt
    student_action: str   # what the student actually did at this step


class DPOPair(TypedDict):
    state_id: str
    state_messages: list[dict]
    chosen: str    # teacher-consensus action
    rejected: str  # student action
    n_teachers_agreeing: int
```
The mapping sketch in §6 below targets `TraceState` (the *input* to teacher replay), since that is the type a `TraceIngester` is upstream of. If Spike 007 also wants a unified `TraceExample` per the brief, the natural shape is `TraceState` ∪ `{teacher_id: str | None, reward: float | None, hint_text: str | None}` — flagged for ADR-002 to settle.
---
## 2. Candidate audit summary
Scoring legend: `+` good, `~` mixed, `-` bad, on each of the four required axes.
| # | Candidate | Schema documented | Real ≥5 multi-turn traces | Hint-receptive signal density | License OK | Verdict |
|---|---|---|---|---|---|---|
| **a** | **Claude Code JSONL** (`~/.claude/projects/`) | `~` Anthropic publishes high-level format note; community schemas are detailed and validated | **+** 1,015 local sessions, 5+ trivially | **+** Per-step `assistant.message.content[].tool_use` blocks → discrete actions, ideal teacher-correction sites | **+** User-owned local files; framework MIT | **CHOSEN** |
| b | Cline VS Code extension | `-` No published stable export schema | `~` Requires running Cline + manual export | `~` Plausible if exported but unverified | `~` Cline source Apache-2.0 but trace format isn't a stable contract | reject |
| c | OpenHands trajectories | **+** Well-documented (events/, base_state.json, Pydantic Event models) | `-` Need to *run* OpenHands or download eval traces — not zero-cost | **+** ActionEvent/ObservationEvent split is conceptually ideal | **+** OpenHands MIT-licensed | strong runner-up |
| d | Aider chat history | `~` Format is "markdown, level-4 headings for user input" — fragile | `~` Available if Aider was used | `-` Tool calls are flattened into prose; recovering structured actions is lossy | `+` Aider Apache-2.0 | reject |
| e | SWE-bench / Lite leaderboard `trajs/` | `-` Each submitter chooses a free-form text format (md/json/yaml) | **+** ~hundreds of submissions on github.com/swe-bench/experiments | `~` Heterogeneous; structured ones (e.g. mini-swe-agent `.traj.json`) are good, others are essentially logs | **+** Public submissions with usage rights for research | reject as primary; usable as future cross-validation set |
| f | SWE-smith-trajectories on HF | **+** Standard OpenAI messages format, documented per dataset card | **+** 5,017 trajectories, 76,002 rows, public | **+** Single-attempt per-instance SWE-agent runs | **+** Apache-2.0 dataset license | strong runner-up; **complement, not replacement** |
The (f) row was discovered during audit (the parent task allowed "any other public source you find that is better"). It's a strong candidate but answers a *different* question: SWE-smith trajectories give us reproducible benchmark traces; Claude Code JSONL gives us *the user's actual workflow*. For Spike 007's purpose (verify the teacher-replay path works on a real, signal-dense trace at zero acquisition cost), (a) is the right primary; (f) is queued for a later cross-validation phase.
---
## 3. Chosen format spec — Claude Code session JSONL
### 3.1 Location and naming
- **Root**: `~/.claude/projects/` (overridable via `CLAUDE_CONFIG_DIR`).
Source: <https://code.claude.com/docs/en/sessions> ("Transcripts are stored as JSONL at `~/.claude/projects/<encoded-cwd>/<sessionId>.jsonl`").
- **Project-key encoding**: working-directory absolute path with `/` and `\` and `:` replaced by `-`, with a leading `-`. (Hidden directories with a leading dot become double dashes.)
Source: <https://github.com/jamie-bitflight/claude_skills/blob/main/plugins/agentskill-kaizen/skills/transcript-analysis/references/session-log-schema.md> §"Project key encoding".
- **File**: `<sessionId>.jsonl`. Subagent transcripts are `agent-<agentId>.jsonl`; a `SessionReader` should *skip* files starting with `agent-` when listing main sessions.
Source: same `claude_skills` doc, §"Subagent File Location".
- **Encoding**: UTF-8, newline-delimited JSON. One JSON object per line. No `[`/`]` wrapping. Local cleanup default 30 days, configurable via `cleanupPeriodDays` in `~/.claude/settings.json`.
Source: <https://code.claude.com/docs/en/data-usage> ("Local caching: Claude Code clients store session transcripts locally in plaintext under `~/.claude/projects/` for 30 days by default to enable session resumption.")
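
The naming rules above can be sketched in a few lines. This is a minimal illustration, not Claude Code's actual implementation: `encode_project_key` and `is_main_session` are hypothetical helper names, and the substitution rules are an assumption inferred from the community doc and from the directory names seen on this machine (other special characters may be handled differently).

```python
import re


def encode_project_key(cwd: str) -> str:
    """Approximate the project-key encoding described above (assumed rules)."""
    # A dot that starts a path component (a hidden directory) becomes a dash,
    # so "/.overstory" turns into "--overstory" after the next substitution.
    key = re.sub(r"(?<=[/\\])\.", "-", cwd)
    # Path separators and the Windows drive colon become dashes; for a POSIX
    # absolute path the root "/" supplies the leading "-".
    return re.sub(r"[/\\:]", "-", key)


def is_main_session(filename: str) -> bool:
    """True for `<sessionId>.jsonl`, False for `agent-<agentId>.jsonl`."""
    return filename.endswith(".jsonl") and not filename.startswith("agent-")


print(encode_project_key("/mnt/e/CS/HF/eidolon"))  # -mnt-e-CS-HF-eidolon
```

A session lister would then filter `Path(project_dir).glob("*.jsonl")` through `is_main_session` to leave subagent transcripts out.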
### 3.2 Common record fields
Every record (both user and assistant types) carries:
| field | type | meaning |
|---|---|---|
| `parentUuid` | `string \| null` | UUID of the parent record (null on the first record) |
| `uuid` | `string` | This record's UUID |
| `sessionId` | `string` | UUID of the session (matches filename) |
| `timestamp` | `string` (ISO-8601) | Wall-clock time of the record |
| `cwd` | `string` | Absolute working directory |
| `version` | `string` | Claude Code version (e.g. `"2.1.143"`) |
| `gitBranch` | `string` | Empty string `""` when not in a git repo |
| `isSidechain` | `boolean` | True for sub-agent (Task tool) chains |
| `userType` | `string` | `"external"` or similar |
| `type` | `string` | Discriminator — see §3.3 |
| `entrypoint` | `string` | e.g. `"sdk-cli"` |
Sources for these fields:
- <https://github.com/KyleAMathews/claude-code-ui/blob/main/spec.md> §"Type Definitions" → `BaseMessageEntry`
- <https://github.com/jamie-bitflight/claude_skills/blob/main/plugins/agentskill-kaizen/skills/transcript-analysis/references/session-log-schema.md> §"Top-Level Record Fields"
- <https://github.com/moru-ai/agent-schemas/blob/main/claude-code/v2.1.1/session.schema.json> (machine-validated against ~50,000 messages from 480 real sessions)
- Direct inspection (this doc): `head` of `~/.claude/projects/-mnt-e-CS-HF-eidolon/c6967343-51a3-4b1b-9472-a569e96114b1.jsonl` confirms presence of every field above.
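
A quick way to repeat the direct-inspection check on another session is to test each record for the fields in the table. The sketch below is illustrative only: `COMMON_FIELDS` and `missing_common_fields` are hypothetical names, and the field list is copied from the table above, not from an official schema.

```python
# Field names taken from the table in this section (not an official schema).
COMMON_FIELDS = (
    "parentUuid", "uuid", "sessionId", "timestamp", "cwd", "version",
    "gitBranch", "isSidechain", "userType", "type", "entrypoint",
)


def missing_common_fields(record: dict) -> list[str]:
    """Return the table's field names that are absent from `record`.

    A key whose value is null (e.g. `parentUuid` on the first record)
    still counts as present.
    """
    return [name for name in COMMON_FIELDS if name not in record]


example = {name: "" for name in COMMON_FIELDS}
example["parentUuid"] = None  # first record in a session
print(missing_common_fields(example))  # []
```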
### 3.3 Record types (`type` discriminator)
| `type` | Role |
|---|---|
| `user` | Both human prompts AND tool results (distinguished by `message.content[].type`) |
| `assistant` | Model output: text, `thinking`, and `tool_use` blocks |
| `system` | Hook summaries, stop notices |
| `summary` | Context-compaction markers |
| `attachment` | Hook stdout/stderr, e.g. `SessionStart` hook output |
| `queue-operation` | Prompt enqueue/dequeue events |
| `file-history-snapshot` | File-state tracking for undo |
| `last-prompt` | Bookkeeping for resume |
Source: <https://github.com/KyleAMathews/claude-code-ui/blob/main/spec.md> §"Entry Types"; corroborated by direct `Counter` inspection of one local session showing `attachment, assistant, user, last-prompt, queue-operation` types in expected proportions.
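
The `Counter` inspection mentioned above could look roughly like the following. The exact script used during the audit is not reproduced here; `count_record_types` is a hypothetical helper and the `<unparseable>` / `<missing>` buckets are an illustrative choice.

```python
import json
from collections import Counter
from collections.abc import Iterable


def count_record_types(lines: Iterable[str]) -> Counter:
    """Tally the `type` discriminator over the lines of one session JSONL."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            counts["<unparseable>"] += 1  # e.g. a truncated final line
            continue
        kind = rec.get("type") if isinstance(rec, dict) else None
        counts[kind if isinstance(kind, str) else "<missing>"] += 1
    return counts


sample = [
    '{"type": "user"}',
    '{"type": "assistant"}',
    '{"type": "assistant"}',
    '{"type": "queue-operation"}',
    '{truncated',
]
print(count_record_types(sample).most_common())
```

For a real session, pass an open file handle instead of `sample`, e.g. `with open(path, encoding="utf-8") as f: count_record_types(f)`.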
### 3.4 The two record types we care about
#### Assistant record carrying a tool call (the "student action")
Real example, redacted from `~/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-doc-adapter-skeleton/39df59f0-674c-413a-b333-cdac0cea9db7.jsonl`:
```json
{
  "type": "assistant",
  "uuid": "24a16a51-3133-4ba5-9d23-472864286154",
  "parentUuid": "1b11c3b3-832b-4473-a944-b61a1f3f2594",
  "sessionId": "39df59f0-…",
  "timestamp": "2026-05-16T04:52:21.947Z",
  "message": {
    "role": "assistant",
    "model": "claude-opus-4-7",
    "content": [
      {
        "type": "tool_use",
        "id": "toolu_bdrk_012HC2dggmSgtVAtWWzwikZq",
        "name": "Bash",
        "input": {
          "command": "ov mail check --agent builder-doc-adapter-skeleton 2>&1 | head -200",
          "description": "Check builder agent inbox"
        }
      }
    ],
    "stop_reason": "tool_use",
    "usage": { "input_tokens": 6, "cache_creation_input_tokens": 48287, "output_tokens": 1021, ... }
  }
}
```
The student's *action* at this step = the JSON of `message.content[i]` where `content[i].type == "tool_use"` (or, if multiple tool_use blocks, the array of them; or if pure-text reply, the `content[i].text` of the `text` block).
#### User record carrying a tool result (the "observation")
```json
{
  "type": "user",
  "uuid": "b9f9414b-…",
  "parentUuid": "24a16a51-…",              // matches the assistant uuid above
  "sessionId": "39df59f0-…",
  "timestamp": "2026-05-16T04:52:23.229Z",
  "message": {
    "role": "user",
    "content": [
      {
        "tool_use_id": "toolu_bdrk_012HC2dggmSgtVAtWWzwikZq",
        "type": "tool_result",
        "content": " No new messages",
        "is_error": false
      }
    ]
  },
  "toolUseResult": {                        // duplicate, structured form
    "stdout": " No new messages",
    "stderr": "",
    "interrupted": false,
    "isImage": false,
    "noOutputExpected": false
  },
  "sourceToolAssistantUUID": "24a16a51-…"   // back-pointer to the assistant uuid
}
```
User records carrying actual human prompts have `message.content` as a list with `{"type":"text","text":"..."}` blocks (or, in older logs, `message.content` as a plain string).
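
Because `tool_result.tool_use_id` echoes `tool_use.id`, an action can be joined to its observation without relying on record order. The sketch below assumes records have already been parsed into dicts shaped like the two examples above; `is_tool_result_record` and `pair_tool_calls` are hypothetical helpers, not part of any shipped module.

```python
from typing import Optional


def _blocks(rec: dict) -> list[dict]:
    """Return a record's content blocks, or [] for string or missing content."""
    message = rec.get("message")
    content = message.get("content") if isinstance(message, dict) else None
    if not isinstance(content, list):
        return []  # older logs store human prompts as a plain string
    return [b for b in content if isinstance(b, dict)]


def is_tool_result_record(rec: dict) -> bool:
    """True when a `user` record carries tool output rather than a human prompt."""
    return rec.get("type") == "user" and any(
        b.get("type") == "tool_result" for b in _blocks(rec)
    )


def pair_tool_calls(records: list[dict]) -> list[tuple[dict, Optional[dict]]]:
    """Match each tool_use block to its tool_result block via tool_use_id."""
    results = {
        b.get("tool_use_id"): b
        for rec in records if rec.get("type") == "user"
        for b in _blocks(rec) if b.get("type") == "tool_result"
    }
    # A tool_use with no recorded result (e.g. an interrupted session) pairs with None.
    return [
        (b, results.get(b.get("id")))
        for rec in records if rec.get("type") == "assistant"
        for b in _blocks(rec) if b.get("type") == "tool_use"
    ]
```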
### 3.5 Schema stability
- **Anthropic's official documentation** acknowledges the location and "each line is a JSON object for a message, tool use, or metadata entry" but does **not** publish a versioned schema.
- **Practical stability**: moru-ai/agent-schemas tracked v2.0.76 → v2.1.1; only one new field of note (`toolUseResult`). Schema pins `additionalProperties: true` for forward compatibility. This level of stability is sufficient for Spike 007 (a research spike, not a long-lived product API).
- **Mitigation**: pin to a specific Claude Code `version` field range and version-gate the ingester (e.g. accept `2.1.x`, warn on others).
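
The version gate in the mitigation bullet might be as small as the following. This is a sketch under assumptions: `ACCEPTED_PREFIXES` and `check_version` are hypothetical names, and the accepted range (`2.1.x`) is just the example from the bullet, not a tested compatibility matrix.

```python
import warnings

# Assumption: only the 2.1.x line is treated as tested, per the example above.
ACCEPTED_PREFIXES = ("2.1.",)


def check_version(record: dict, accepted: tuple[str, ...] = ACCEPTED_PREFIXES) -> bool:
    """Return True if the record's `version` is in the tested range; warn otherwise."""
    version = record.get("version")
    ok = isinstance(version, str) and version.startswith(accepted)
    if not ok:
        warnings.warn(
            f"Claude Code version {version!r} is outside the tested range {accepted}",
            stacklevel=2,
        )
    return ok


print(check_version({"version": "2.1.143"}))  # True
```

Whether an out-of-range version should warn (as here) or hard-fail is left open; see question 4 in §7.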
### 3.6 Licensing
- The Claude Code binary is **proprietary** (Anthropic Commercial Terms of Service, <https://github.com/anthropics/claude-code/blob/1e95326e12183286fc6cbd828c8a86a0d8e03c62/LICENSE.md>).
- The session JSONL files are **local user data** generated on the user's machine during ordinary use. Anthropic's data-usage doc explicitly calls them "local caching … session transcripts locally in plaintext" — they belong to the user.
- Our framework is MIT-licensed and we are **not redistributing the Claude Code binary or any third-party trace files**. We are reading the user's own local logs (analogous to processing one's own `.bash_history`).
- We MUST NOT publish raw trace files in our repo without the user's consent (PII risk: cwd, gitBranch, file contents). The framework should ship only the *ingester*, plus a tiny synthetic-fixture trace for unit tests.
---
## 4. Acquiring the 5 real example traces
**Zero acquisition cost.** All five live on this machine right now.
Discovery command (used during this audit):
```bash
find ~/.claude/projects -name "*.jsonl" 2>/dev/null
# → 1015 files
```
Five concrete pre-selected sessions, each multi-turn (≥ 100 tool_use messages), each from a distinct project, each ≥ 50 KB:
| # | Tool-use msgs | User msgs | Asst msgs | Total lines | Path |
|---|---|---|---|---|---|
| 1 | 2,830 | 3,199 | 4,325 | 17,315 | `/home/codeseys/.claude/projects/-mnt-e-CS-HF-eidolon/c6967343-51a3-4b1b-9472-a569e96114b1.jsonl` |
| 2 | 1,350 | 1,407 | 2,016 | 7,673 | `/home/codeseys/.claude/projects/-mnt-e-CS-github-agent-manager/c42b68ea-d410-455e-bc71-92ec6c4adce9.jsonl` |
| 3 | 984 | 1,032 | 1,549 | 5,783 | `/home/codeseys/.claude/projects/-mnt-e-CS-HF-streaming-speech-to-speech/73c9925c-d5e5-48fc-a97b-a58687c2fb3c.jsonl` |
| 4 | 717 | 759 | 1,142 | 4,036 | `/home/codeseys/.claude/projects/-mnt-e-CS-github/6ac8e20f-98ec-4279-9957-e68862a90c5e.jsonl` |
| 5 | 125 | 126 | 197 | 629 | `/home/codeseys/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-iteration-checkpoint/e4a34e2b-40c6-49ce-b253-912a43224aae.jsonl` |
(All five inspected programmatically during this audit — counts above are real, not estimates.)
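
The script used for the counts is not reproduced in this doc. A minimal sketch of how the four columns could be recomputed is shown below; `session_stats` is a hypothetical helper, and it assumes "tool-use msgs" means assistant records containing at least one `tool_use` block, which may differ from the audit's exact counting rule.

```python
import json
from collections.abc import Iterable


def session_stats(lines: Iterable[str]) -> dict[str, int]:
    """Recompute the table columns for one session (assumed counting rules)."""
    stats = {"tool_use_msgs": 0, "user_msgs": 0, "asst_msgs": 0, "total_lines": 0}
    for line in lines:
        stats["total_lines"] += 1
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # blank or truncated line: counted as a line, nothing else
        if not isinstance(rec, dict):
            continue
        if rec.get("type") == "user":
            stats["user_msgs"] += 1
        elif rec.get("type") == "assistant":
            stats["asst_msgs"] += 1
            message = rec.get("message")
            content = message.get("content") if isinstance(message, dict) else None
            if isinstance(content, list) and any(
                isinstance(b, dict) and b.get("type") == "tool_use" for b in content
            ):
                stats["tool_use_msgs"] += 1
    return stats
```

Usage: `with open(path, encoding="utf-8") as f: print(session_stats(f))`.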
For users on other machines: `find ~/.claude/projects -name '*.jsonl' -size +50k | head` will surface candidates. For repository CI we will commit a small (~5 KB) **synthetic** fixture conforming to the schema, never any of the user's real traces.
---
## 5. Decision-relevant tradeoffs vs runners-up
### Why we are NOT picking OpenHands trajectories (c)
- **Pro**: cleanest schema we audited — Pydantic `Event` / `ActionEvent` / `ObservationEvent` models, source: <https://docs.openhands.dev/sdk/arch/events>, source code: <https://github.com/OpenHands/OpenHands/blob/3ec999e8/openhands/events/serialization/event.py>. Tool-call structure is *more* normalized than Claude Code's (explicit Action/Observation typing).
- **Con**: acquisition is not zero-cost here. The persistence dir defaults to `workspace/conversations/` and exists only if the user has *run OpenHands locally*. Public eval trajectories are spread across the eval/ folder rather than a clean public bucket.
- **Decisive**: Spike 001's economic floor was measured on 50 synthetic states; Spike 007's purpose is to verify ingestion + replay on real traces *that already exist*. (a) provides those today, whereas (c) requires standing up OpenHands first. The OpenHands storage format is also split between v0 (per-event JSON files) and v1 (timestamped files) per <https://github.com/All-Hands-AI/OpenHands/issues/8701>, which is a flux risk.
- **Future use**: if the framework ever ships "trace ingester adapters" plural, OpenHands is the second adapter to write — its event-typed model is conceptually superior.
### Why we are NOT picking SWE-bench leaderboard trajectories (e)
- **Pro**: hundreds of submissions on <https://github.com/swe-bench/experiments>, with required `trajs/` folders.
- **Con**: leaderboard rules say "The reasoning trace can be represented with **any text based file format (e.g. md, json, yaml)**" (source: <https://github.com/swe-bench/experiments> README). Each submitter picks their own. Building a generic ingester is a per-submission engineering project, not a single adapter. SWE-agent uses one shape (`{"action", "observation", "response"}` arrays — confirmed via <https://huggingface.co/datasets/JetBrains-Research/swe-traj-complete>); mini-swe-agent uses `.traj.json` with OpenAI messages format (<https://huggingface.co/datasets/tarsur385/swebench-verified-trajectories>).
- **Decisive**: heterogeneous schema = fragile ingester = wrong choice for *first* spike.
### Why we are NOT picking Aider (d)
- The `chat_history_file` is **markdown** (`.aider.chat.history.md`), per <https://aider.chat/docs/config/dotenv.html>. Source code at <https://github.com/Aider-AI/aider/blob/bdb4d9ff/aider/history.py> shows it's literally `f.write(text)` of formatted prose with `####` for user input.
- **Decisive**: tool calls in Aider are *applied as edits*, not preserved as discrete structured actions in the markdown log. Reconstructing "the action the student took at step k" is lossy. The `.aider.llm.history` log is closer to what we want but is opt-in and not always present.
### Why we are NOT picking Cline (b)
- No public commitment to a stable export schema. Cline's storage is internal to the VS Code extension (workspace state DB + per-task JSON in extension storage). Searching for "Cline trace export schema" yields no Anthropic-style spec doc. Workable in principle, but reverse-engineering an extension's storage is not the right ground for a 1-week spike.
### Why we are NOT picking SWE-smith-trajectories (f)
- This is the **strongest external dataset** we found and **should be Spike 007's stretch goal / Spike 008's primary**: 5,017 fine-tuning trajectories from SWE-agent + Claude 3.7 Sonnet, 4.22 GB on HuggingFace, OpenAI messages format. Source: <https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories>.
- **Why not first**: the messages-only format collapses tool calls and tool results into the OpenAI chat-completions wire format with text-encoded tool blocks. That works for SFT but is *less* signal-dense for the teacher-correction spike than Claude Code's `tool_use` blocks: in Claude Code's format each tool call's `name` and `input` fields are structurally separated, making "did the teacher pick a different tool?" a one-line check.
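
To make the "one-line check" concrete, here is a sketch that compares tool names between two actions. It assumes both actions are serialized as a JSON list of `{"name", "input"}` objects (the canonical form proposed by `_serialize_action` in §6); `tool_names` and `picked_different_tool` are hypothetical helpers.

```python
import json


def tool_names(action: str) -> list[str]:
    """Extract tool names from a canonical action string; [] for text-only actions."""
    try:
        blocks = json.loads(action)
    except json.JSONDecodeError:
        return []  # plain-text reply, not a tool call
    if not isinstance(blocks, list):
        return []
    return [b.get("name") for b in blocks if isinstance(b, dict)]


def picked_different_tool(student_action: str, teacher_action: str) -> bool:
    """The one-line check: do the two actions name different tools?"""
    return tool_names(student_action) != tool_names(teacher_action)


student = json.dumps([{"name": "Bash", "input": {"command": "cat README.md"}}])
teacher = json.dumps([{"name": "Read", "input": {"file_path": "README.md"}}])
print(picked_different_tool(student, teacher))  # True
```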
---
## 6. TraceIngester sketch
> **Realised in v0.1 (Wave 17 update):** The realised ingester ships at
> `composer_replication/ingestion/claude_code.py` exporting
> `ClaudeCodeIngester`, with the spike at
> `spikes/007-real-trace-ingestion/claude_code_ingester.py`. The
> public production surface is:
>
> ```python
> from pathlib import Path
> from composer_replication.ingestion.claude_code import ClaudeCodeIngester
>
> ingester = ClaudeCodeIngester(skip_sidechain=True, strip_thinking=True)
> for trace_state in ingester.ingest(Path("~/.claude/projects/.../session.jsonl").expanduser()):
>     # trace_state matches the TraceState TypedDict from §1
>     ...
> stats = ingester.last_stats # IngestionStats — turn counts, skip reasons
> ```
>
> The shipped `ClaudeCodeIngester` differs from the pre-spike sketch
> below in:
> - Class name: `ClaudeCodeIngester` (not `TraceIngester`)
> - Module path: `composer_replication.ingestion.claude_code` (not
> `spikes/007-trace-ingester/trace_ingester.py`)
> - The constructor takes config kwargs (`system_prompt`,
> `skip_sidechain`, `strip_thinking`, `max_history_tokens`); paths
> are passed to `.ingest(Path)` per call instead of being held by the
> ingester
> - The yielded type is `TraceState` (matches §1)
>
> The pre-spike sketch below is preserved as historical proposal context.
Drop-in adapter for spike-005's `replay_trace()`. Targets `TraceState` (the actual existing TypedDict; see §1).
```python
# spikes/007-trace-ingester/trace_ingester.py
from __future__ import annotations

import json
from collections.abc import Iterator
from pathlib import Path
from typing import Any

# Re-use the existing TypedDicts from spike-005:
# from spikes.005_integrated_trainer_skeleton.teacher_replay import TraceState

# A "step" in the trace is each assistant record, typically one that ends in
# tool_use. The state visible to the model at that step = all messages strictly
# before it, in OpenAI/Anthropic chat format. The student_action = the tool_use
# payload(s), or the reply text for a text-only turn.


def _record_to_chat_message(
    rec: dict[str, Any], *, skip_thinking: bool = True
) -> dict[str, Any] | None:
    """Turn one Claude Code JSONL record into an OpenAI/Anthropic chat-message
    dict, or return None for non-conversational records (queue-operation,
    attachment, file-history-snapshot, system, last-prompt, summary)."""
    t = rec.get("type")
    if t not in ("user", "assistant"):
        return None
    msg = rec.get("message")
    if not isinstance(msg, dict):
        return None
    role = msg.get("role")
    content = msg.get("content")
    if role not in ("user", "assistant") or content is None:
        return None
    # Strip thinking blocks: they are not portable across teacher models and
    # should not influence the teacher's decision at replay time.
    if skip_thinking and isinstance(content, list):
        content = [c for c in content
                   if not (isinstance(c, dict) and c.get("type") == "thinking")]
    return {"role": role, "content": content}


def _serialize_action(content_blocks: list[dict]) -> str:
    """Canonicalize the student's action at a step.

    For tool_use steps: JSON-encode the (name, input) pairs.
    For text-only steps: return the concatenated text.
    """
    tool_uses = [b for b in content_blocks
                 if isinstance(b, dict) and b.get("type") == "tool_use"]
    if tool_uses:
        return json.dumps(
            [{"name": tu.get("name"), "input": tu.get("input")} for tu in tool_uses],
            sort_keys=True,
        )
    texts = [b.get("text", "") for b in content_blocks
             if isinstance(b, dict) and b.get("type") == "text"]
    return "\n".join(t for t in texts if t)


class TraceIngester:
    """Reads a Claude Code session JSONL and yields TraceState records.

    One TraceState is emitted per assistant record. The `messages` field is the
    full prior conversation (user and assistant messages; no system prompt is
    injected) up to but not including the current assistant turn;
    `student_action` is the canonicalized serialization of that turn's content
    blocks.
    """

    def __init__(self, *, skip_thinking: bool = True, min_action_chars: int = 1) -> None:
        self.skip_thinking = skip_thinking
        self.min_action_chars = min_action_chars

    def ingest(self, path: str | Path) -> Iterator[dict]:  # yields TraceState
        path = Path(path)
        prior_messages: list[dict] = []
        session_id_for_state = path.stem  # filename = session UUID
        with path.open("r", encoding="utf-8") as f:
            for line_idx, line in enumerate(f):
                line = line.strip()
                if not line:
                    continue
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    continue  # tolerate truncated last-line writes
                if not isinstance(rec, dict):
                    continue  # every record is expected to be a JSON object
                chat_msg = _record_to_chat_message(rec, skip_thinking=self.skip_thinking)
                if chat_msg is None:
                    continue
                if chat_msg["role"] == "assistant":
                    # Emit a TraceState representing "before this turn".
                    blocks = chat_msg["content"] if isinstance(chat_msg["content"], list) else []
                    student_action = _serialize_action(blocks)
                    if len(student_action) >= self.min_action_chars:
                        yield {
                            "state_id": f"{session_id_for_state}:{rec.get('uuid', line_idx)}",
                            "messages": list(prior_messages),  # snapshot
                            "student_action": student_action,
                        }
                # Append to history regardless (so subsequent turns see it).
                prior_messages.append(chat_msg)
```
Notes:
- We skip `thinking` blocks because (1) they're Anthropic-specific and (2) feeding them to other-vendor teachers (GPT/DeepSeek) leaks reasoning the teacher should produce on its own. This matches the philosophy used in spike-005's `_normalize_action`.
- We do NOT inject a system prompt — Claude Code's initial system prompt is not in the JSONL (it's set at SDK init and visible only via `attachment` records). Downstream callers may want to prepend a synthetic system message for teacher fairness. Open question for ADR-002.
- `state_id = f"{sessionId}:{recordUuid}"` is globally unique and stable across re-ingest.
- Failures (unparseable lines, missing fields) are tolerated silently. A counters-based sibling method `ingest_with_stats(path)` is a small follow-up.
### 6.1 Smoke-test plan (for Spike 007 itself)
```python
ingester = TraceIngester()
states = list(ingester.ingest("/home/codeseys/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-iteration-checkpoint/e4a34e2b-40c6-49ce-b253-912a43224aae.jsonl"))
# Expect roughly 197 states (matches asst-message count counted in §4).
# Then teacher-replay on the first 5 states, confirm cost is in the
# spike-001 ballpark ($0.05–$0.20 for 5 states × 3 teachers).
```
Spike 001 baseline to beat: $0.98/trace mean (50-state synthetic), $0.30/trace projected with VOI gating. On real states a ~5–20× cost increase is plausible due to longer message histories (10k+ tokens vs synthetic ~300 tokens), so a relevant **economic check** for Spike 007 is: if the first 5 states cost > $5 (i.e. > $1/state), the VOI gate from Spike 001 is *required* before scaling. Flag this finding in the spike write-up.
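
The arithmetic behind that check can be worked through explicitly. The sketch assumes the $0.98 Spike 001 mean is spread evenly over the 50 synthetic states of a trace (an assumption; the per-state breakdown is not given in this doc). All variable names are illustrative.

```python
SYNTHETIC_COST_PER_TRACE = 0.98  # USD, Spike 001 ungated mean
SYNTHETIC_STATES_PER_TRACE = 50  # assumption: cost is spread evenly over 50 states
N_STATES = 5                     # states replayed in the smoke test
THRESHOLD_USD = 5.00             # trip-wire from the economic check

per_state = SYNTHETIC_COST_PER_TRACE / SYNTHETIC_STATES_PER_TRACE
baseline_5 = per_state * N_STATES                  # synthetic-rate cost of 5 states
low, high = baseline_5 * 5, baseline_5 * 20        # plausible 5x to 20x increase
threshold_multiplier = THRESHOLD_USD / baseline_5  # blow-up needed to hit $5

print(f"per state:         ${per_state:.4f}")
print(f"5-state baseline:  ${baseline_5:.3f}")
print(f"5x to 20x range:   ${low:.2f} to ${high:.2f}")
print(f"$5 threshold is {threshold_multiplier:.0f}x the synthetic rate")
```

Under these assumptions the 5-state baseline is about $0.10, the plausible range is roughly $0.49 to $1.96, and the $5 trip-wire corresponds to about 51× the synthetic per-state cost, so crossing it would mean real traces are far more expensive than the 5× to 20× estimate.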
---
## 7. Open questions for ADR-002
1. Do we promote `TraceState` to a top-level `TraceExample` dataclass, with optional `teacher_id`, `reward`, `hint_text`? Or keep `TraceState` as ingester output and `DPOPair` as trainer input, treating the brief's "TraceExample" as conceptual?
2. Should `TraceIngester.ingest()` emit one record per **assistant turn** (current sketch) or per **assistant `tool_use` block** within a turn? Some Claude Code records have multiple tool_use blocks in one assistant message.
3. Synthetic system prompt at replay time — yes/no? If yes, what content?
4. Trace-version pinning: hard-fail or warn when `version` field falls outside a known-tested range?
5. Subagent transcripts (`agent-*.jsonl`) — include or skip? They are denser per-turn but their parent context is the orchestrator, not the user, which changes the teacher-replay semantics.
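
For question 2, the per-block alternative could be derived from per-turn output without re-reading the file. The sketch below is one possible shape, not a decision: `split_per_tool_use` is a hypothetical helper, the `#<index>` suffix on `state_id` is an invented convention, and it assumes `student_action` is the JSON list of `{"name", "input"}` objects produced by the §6 sketch.

```python
import json


def split_per_tool_use(state: dict) -> list[dict]:
    """Expand one per-turn TraceState into one state per tool_use block.

    Text-only turns and single-tool turns are returned unchanged.
    """
    try:
        blocks = json.loads(state["student_action"])
    except json.JSONDecodeError:
        return [state]  # text-only action
    if not isinstance(blocks, list) or len(blocks) <= 1:
        return [state]
    # Every split state shares the same prior messages: blocks within one
    # assistant turn were emitted before any of their results were observed.
    return [
        {
            **state,
            "state_id": f"{state['state_id']}#{i}",
            "student_action": json.dumps([block], sort_keys=True),
        }
        for i, block in enumerate(blocks)
    ]
```

One consequence worth weighing in ADR-002: per-block emission multiplies teacher-replay calls for turns with parallel tool calls while the input state stays identical across them.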
---
## 8. References (primary sources only)
Anthropic / Claude Code official:
- <https://code.claude.com/docs/en/sessions> — session storage location and "JSONL, one JSON per line"
- <https://code.claude.com/docs/en/data-usage> — "local caching … session transcripts locally in plaintext under `~/.claude/projects/` for 30 days by default"
- <https://code.claude.com/docs/en/legal-and-compliance> — Commercial Terms vs Consumer Terms applicability
- <https://github.com/anthropics/claude-code/blob/1e95326e12183286fc6cbd828c8a86a0d8e03c62/LICENSE.md> — proprietary license
Community schemas (reverse-engineered from real session data):
- <https://github.com/moru-ai/agent-schemas/blob/main/claude-code/v2.1.1/session.schema.json> — JSON Schema Draft 2020-12, validated against ~50,000 messages from 480 sessions
- <https://github.com/KyleAMathews/claude-code-ui/blob/main/spec.md> §"Claude Code Session Log Format" — Entry types and TypeScript discriminated union
- <https://github.com/jamie-bitflight/claude_skills/blob/main/plugins/agentskill-kaizen/skills/transcript-analysis/references/session-log-schema.md> — top-level fields, project-key encoding, subagent file location
- <https://github.com/dagster-io/erk/blob/master/docs/learned/sessions/layout.md> — directory structure, plan-mode `slug` field
- <https://github.com/pedropaulovc/claude-code-types> — TypeScript type definitions from session logs
Runners-up reference points:
- OpenHands events: <https://docs.openhands.dev/sdk/arch/events>, <https://docs.openhands.dev/sdk/guides/convo-persistence>, <https://github.com/OpenHands/OpenHands/blob/3ec999e8/openhands/events/serialization/event.py>, <https://github.com/All-Hands-AI/OpenHands/issues/8701>
- SWE-bench experiments: <https://github.com/swe-bench/experiments>
- SWE-smith trajectories on HF: <https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories>
- mini-swe-agent traj.json: <https://huggingface.co/datasets/tarsur385/swebench-verified-trajectories>
- swe-traj-complete (SWE-agent format example): <https://huggingface.co/datasets/JetBrains-Research/swe-traj-complete>
- Aider history file format: <https://aider.chat/docs/config/dotenv.html>, <https://github.com/Aider-AI/aider/blob/bdb4d9ff/aider/history.py>, <https://github.com/paul-gauthier/aider/blob/main/aider/io.py>
Internal references:
- `spikes/005-integrated-trainer-skeleton/teacher_replay.py` — `TraceState`, `DPOPair`, `replay_trace`, `extract_dpo_pairs` (read in full during this audit; see §1 for actual field list)
- Spike 001 economic floor: $0.98/trace mean ungated, $0.30/trace projected with VOI gating