	# TRACE_SOURCE_RECONNAISSANCE.md

	Spike 007 trace-source audit, feeding ADR-002.

	Status: DECIDED — recommend (a) Claude Code session JSONL (`~/.claude/projects/<encoded>/<sessionId>.jsonl`).

	---

	## 0. TL;DR

	Of the six candidates audited, Claude Code session JSONL wins on every axis except "official Anthropic-published schema" (no such doc exists), and for that single weakness there is now a community-maintained reverse-engineered JSON Schema validated against ~50,000 messages from real sessions, plus three independent third-party schema specs. The user has 1,015 .jsonl sessions on this machine today; the eight largest sampled span 550 → 17,315 lines and contain 6,762 multi-turn `tool_use` messages. Acquisition cost is zero. Licensing is clean: the JSONL files are local user-owned data; the proprietary Claude Code binary is not redistributed by us.

	Four of the other candidates each lose on at least one of the four axes: OpenHands (well-documented, but acquisition is non-trivial), SWE-bench trajectory submissions (heterogeneous schemas across submitters), Aider markdown chat history (lossy and unparseable for tool calls), and Cline (no public stable export format). The fifth, SWE-smith-trajectories (§2, row f), is a complement rather than a replacement.

	---

	## 1. Context: TraceExample dataclass field reality

	Important correction to the parent task description. The task brief said "TraceExample dataclass with fields state_text, action_taken, hint_text (optional), reward (float), teacher_id (str)". Reading the actual file at
	`/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/teacher_replay.py` shows the existing types are different — there is no `TraceExample` class. The closest existing types are two `TypedDict`s used by `replay_trace()` and `extract_dpo_pairs()`:

	```python
	from typing import TypedDict


	class TraceState(TypedDict):
	    state_id: str  # unique within the trace
	    messages: list[dict]  # conversation up to and including this step's user prompt
	    student_action: str  # what the student actually did at this step


	class DPOPair(TypedDict):
	    state_id: str
	    state_messages: list[dict]
	    chosen: str  # teacher-consensus action
	    rejected: str  # student action
	    n_teachers_agreeing: int
	```

	The mapping sketch in §6 below targets `TraceState` (the input to teacher replay), since that is the type a `TraceIngester` is upstream of. If Spike 007 also wants a unified `TraceExample` per the brief, the natural shape is `TraceState` ∪ `{teacher_id: str | None, reward: float | None, hint_text: str | None}`; this is flagged for ADR-002 to settle.
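
One way to write that `TraceState` ∪ optional-fields shape is a `TypedDict` subclass. This is only a sketch: `TraceExample` and the helper `to_trace_example` do not exist in the repo, and whether the three extra fields should be `None`-able or simply omitted is exactly what ADR-002 has to decide.

```python
from typing import Optional, TypedDict


class TraceState(TypedDict):
    state_id: str
    messages: list[dict]
    student_action: str


class TraceExample(TraceState):
    # Hypothetical extension: all three fields stay None until a teacher
    # has replayed the state.
    teacher_id: Optional[str]
    reward: Optional[float]
    hint_text: Optional[str]


def to_trace_example(state: TraceState) -> TraceExample:
    """Lift an ingested TraceState into a TraceExample with no teacher data yet."""
    return TraceExample(**state, teacher_id=None, reward=None, hint_text=None)
```

Subclassing keeps every `TraceExample` usable wherever a `TraceState` is expected, so `replay_trace()` would not need to change.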

	---

	## 2. Candidate audit summary

	Scoring legend: `+` good, `~` mixed, `-` bad, on each of the four required axes.

	| # | Candidate | Schema documented | Real ≥5 multi-turn traces | Hint-receptive signal density | License OK | Verdict |
	|---|---|---|---|---|---|---|
	| a | Claude Code JSONL (`~/.claude/projects/`) | `~` Anthropic publishes high-level format note; community schemas are detailed and validated | `+` 1,015 local sessions, 5+ trivially | `+` Per-step `assistant.message.content[].tool_use` blocks → discrete actions, ideal teacher-correction sites | `+` User-owned local files; framework MIT | CHOSEN |
	| b | Cline VS Code extension | `-` No published stable export schema | `~` Requires running Cline + manual export | `~` Plausible if exported but unverified | `~` Cline source Apache-2.0 but trace format isn't a stable contract | reject |
	| c | OpenHands trajectories | `+` Well-documented (events/, base_state.json, Pydantic Event models) | `-` Need to run OpenHands or download eval traces; not zero-cost | `+` ActionEvent/ObservationEvent split is conceptually ideal | `+` OpenHands MIT-licensed | strong runner-up |
	| d | Aider chat history | `~` Format is "markdown, level-4 headings for user input"; fragile | `~` Available if Aider was used | `-` Tool calls are flattened into prose; recovering structured actions is lossy | `+` Aider Apache-2.0 | reject |
	| e | SWE-bench / Lite leaderboard `trajs/` | `-` Each submitter chooses a free-form text format (md/json/yaml) | `+` ~hundreds of submissions on github.com/swe-bench/experiments | `~` Heterogeneous; structured ones (e.g. mini-swe-agent `.traj.json`) are good, others are essentially logs | `+` Public submissions with usage rights for research | reject as primary; usable as future cross-validation set |
	| f | SWE-smith-trajectories on HF | `+` Standard OpenAI messages format, documented per dataset card | `+` 5,017 trajectories, 76,002 rows, public | `+` Single-attempt per-instance SWE-agent runs | `+` Apache-2.0 dataset license | strong runner-up; complement, not replacement |

	The (f) row was discovered during audit (the parent task allowed "any other public source you find that is better"). It's a strong candidate but answers a different question: SWE-bench trajectories give us reproducible benchmark traces; Claude Code JSONL gives us the user's actual workflow. For Spike 007's purpose (verify the teacher-replay path works on a real, signal-dense trace at zero acquisition cost), (a) is the right primary; (f) is queued for a later cross-validation phase.

	---

	## 3. Chosen format spec — Claude Code session JSONL

	### 3.1 Location and naming

	- Root: `~/.claude/projects/` (overridable via `CLAUDE_CONFIG_DIR`).
	Source: <https://code.claude.com/docs/en/sessions> ("Transcripts are stored as JSONL at `~/.claude/projects/<encoded-cwd>/<sessionId>.jsonl`").
	- Project-key encoding: working-directory absolute path with `/` and `\` and `:` replaced by `-`, with a leading `-`. (Hidden directories with a leading dot become double dashes.)
	Source: <https://github.com/jamie-bitflight/claude_skills/blob/main/plugins/agentskill-kaizen/skills/transcript-analysis/references/session-log-schema.md> §"Project key encoding".
	- File: `<sessionId>.jsonl`. Subagent transcripts are `agent-<agentId>.jsonl`; a `SessionReader` should skip files starting with `agent-` when listing main sessions.
	Source: same `claude_skills` doc, §"Subagent File Location".
	- Encoding: UTF-8, newline-delimited JSON. One JSON object per line. No `[`/`]` wrapping. Local cleanup default 30 days, configurable via `cleanupPeriodDays` in `~/.claude/settings.json`.
	Source: <https://code.claude.com/docs/en/data-usage> ("Local caching: Claude Code clients store session transcripts locally in plaintext under `~/.claude/projects/` for 30 days by default to enable session resumption.")
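
The naming rules above can be sketched as two small helpers. Both names (`encode_project_key`, `list_main_sessions`) are hypothetical, and the hidden-directory handling is an assumption inferred from the reverse-engineered description and from the directory names seen on this machine, not from an official spec.

```python
import re
from pathlib import Path


def encode_project_key(cwd: str) -> str:
    """Encode an absolute working directory as a project directory name."""
    # Assumption: a dot that starts a path component is replaced like a
    # separator, which is what turns "/.hidden" into "--hidden".
    hidden_fixed = re.sub(r"(?<=[/\\:])\.", "-", cwd)
    return re.sub(r"[/\\:]", "-", hidden_fixed)


def list_main_sessions(project_dir: Path) -> list[Path]:
    """Return main-session transcripts, skipping agent-*.jsonl subagent files."""
    return sorted(
        p for p in project_dir.glob("*.jsonl") if not p.name.startswith("agent-")
    )
```

For example, `encode_project_key("/mnt/e/CS/HF/eidolon")` gives `-mnt-e-CS-HF-eidolon`, matching the directory name cited in §3.2.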

	### 3.2 Common record fields

	Every record (both user and assistant types) carries:

	| field | type | meaning |
	|---|---|---|
	| `parentUuid` | `string \| null` | UUID of the parent record (null on the first record) |
	| `uuid` | `string` | This record's UUID |
	| `sessionId` | `string` | UUID of the session (matches filename) |
	| `timestamp` | `string` (ISO-8601) | Wall-clock time of the record |
	| `cwd` | `string` | Absolute working directory |
	| `version` | `string` | Claude Code version (e.g. `"2.1.143"`) |
	| `gitBranch` | `string` | Empty string `""` when not in a git repo |
	| `isSidechain` | `boolean` | True for sub-agent (Task tool) chains |
	| `userType` | `string` | `"external"` or similar |
	| `type` | `string` | Discriminator; see §3.3 |
	| `entrypoint` | `string` | e.g. `"sdk-cli"` |

	Sources for these fields:
	- <https://github.com/KyleAMathews/claude-code-ui/blob/main/spec.md> §"Type Definitions" → `BaseMessageEntry`
	- <https://github.com/jamie-bitflight/claude_skills/blob/main/plugins/agentskill-kaizen/skills/transcript-analysis/references/session-log-schema.md> §"Top-Level Record Fields"
	- <https://github.com/moru-ai/agent-schemas/blob/main/claude-code/v2.1.1/session.schema.json> (machine-validated against ~50,000 messages from 480 real sessions)
	- Direct inspection (this doc): `head` of `~/.claude/projects/-mnt-e-CS-HF-eidolon/c6967343-51a3-4b1b-9472-a569e96114b1.jsonl` confirms presence of every field above.

	### 3.3 Record types (`type` discriminator)

	| `type` | Role |
	|---|---|
	| `user` | Both human prompts AND tool results (distinguished by `message.content[].type`) |
	| `assistant` | Model output: text, `thinking`, and `tool_use` blocks |
	| `system` | Hook summaries, stop notices |
	| `summary` | Context-compaction markers |
	| `attachment` | Hook stdout/stderr, e.g. `SessionStart` hook output |
	| `queue-operation` | Prompt enqueue/dequeue events |
	| `file-history-snapshot` | File-state tracking for undo |
	| `last-prompt` | Bookkeeping for resume |

	Source: <https://github.com/KyleAMathews/claude-code-ui/blob/main/spec.md> §"Entry Types"; corroborated by direct `Counter` inspection of one local session showing `attachment, assistant, user, last-prompt, queue-operation` types in expected proportions.
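
The `Counter` inspection mentioned above can be reproduced with a few lines. This is a sketch of that kind of check, not the exact script used during the audit; the function name and the `<unparseable>` / `<missing>` bucket labels are made up for illustration.

```python
import json
from collections import Counter
from collections.abc import Iterable


def count_record_types(lines: Iterable[str]) -> Counter:
    """Tally the `type` discriminator over the lines of one session JSONL."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            counts["<unparseable>"] += 1  # e.g. a truncated last line
            continue
        if isinstance(rec, dict) and "type" in rec:
            counts[rec["type"]] += 1
        else:
            counts["<missing>"] += 1
    return counts
```

Passing an open file handle works directly, e.g. `count_record_types(path.open(encoding="utf-8"))`.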

	### 3.4 The two record types we care about

	#### Assistant record carrying a tool call (the "student action")

	Real example, redacted from `~/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-doc-adapter-skeleton/39df59f0-674c-413a-b333-cdac0cea9db7.jsonl`:

	```json
	{
	  "type": "assistant",
	  "uuid": "24a16a51-3133-4ba5-9d23-472864286154",
	  "parentUuid": "1b11c3b3-832b-4473-a944-b61a1f3f2594",
	  "sessionId": "39df59f0-…",
	  "timestamp": "2026-05-16T04:52:21.947Z",
	  "message": {
	    "role": "assistant",
	    "model": "claude-opus-4-7",
	    "content": [
	      {
	        "type": "tool_use",
	        "id": "toolu_bdrk_012HC2dggmSgtVAtWWzwikZq",
	        "name": "Bash",
	        "input": {
	          "command": "ov mail check --agent builder-doc-adapter-skeleton 2>&1 | head -200",
	          "description": "Check builder agent inbox"
	        }
	      }
	    ],
	    "stop_reason": "tool_use",
	    "usage": { "input_tokens": 6, "cache_creation_input_tokens": 48287, "output_tokens": 1021, ... }
	  }
	}
	```

	The student's action at this step = the JSON of `message.content[i]` where `content[i].type == "tool_use"` (or, if multiple tool_use blocks, the array of them; or if pure-text reply, the `content[i].text` of the `text` block).

	#### User record carrying a tool result (the "observation")

	```json
	{
	  "type": "user",
	  "uuid": "b9f9414b-…",
	  "parentUuid": "24a16a51-…", // matches the assistant uuid above
	  "sessionId": "39df59f0-…",
	  "timestamp": "2026-05-16T04:52:23.229Z",
	  "message": {
	    "role": "user",
	    "content": [
	      {
	        "tool_use_id": "toolu_bdrk_012HC2dggmSgtVAtWWzwikZq",
	        "type": "tool_result",
	        "content": " No new messages",
	        "is_error": false
	      }
	    ]
	  },
	  "toolUseResult": { // duplicate, structured form
	    "stdout": " No new messages",
	    "stderr": "",
	    "interrupted": false,
	    "isImage": false,
	    "noOutputExpected": false
	  },
	  "sourceToolAssistantUUID": "24a16a51-…" // back-pointer to the assistant uuid
	}
	```

	User records carrying actual human prompts have `message.content` as a list with `{"type":"text","text":"..."}` blocks (or, in older logs, `message.content` as a plain string).
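
The action/observation link in the two examples above runs through `tool_use.id` and `tool_result.tool_use_id`. A minimal sketch of that join, assuming the records have already been parsed into dicts and appear in file order (the function name and output shape are illustrative, not part of any shipped module):

```python
from typing import Any


def pair_tool_calls(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Join each tool_use block to the tool_result that answers it."""
    calls: dict[str, dict[str, Any]] = {}
    for rec in records:
        message = rec.get("message")
        content = message.get("content") if isinstance(message, dict) else None
        if not isinstance(content, list):
            continue  # plain-string prompts carry no tool blocks
        for block in content:
            if not isinstance(block, dict):
                continue
            if block.get("type") == "tool_use" and block.get("id"):
                calls[block["id"]] = {
                    "name": block.get("name"),
                    "input": block.get("input"),
                    "result": None,  # stays None if no result was logged
                    "is_error": None,
                }
            elif block.get("type") == "tool_result":
                call = calls.get(block.get("tool_use_id"))
                if call is not None:
                    call["result"] = block.get("content")
                    call["is_error"] = block.get("is_error")
    return list(calls.values())
```

Calls whose result never arrives (for example an interrupted session) keep `result=None`, which makes them easy to filter before teacher replay.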

	### 3.5 Schema stability

	- Anthropic's official documentation acknowledges the location and "each line is a JSON object for a message, tool use, or metadata entry" but does not publish a versioned schema.
	- Practical stability: moru-ai/agent-schemas tracked v2.0.76 → v2.1.1; only one new field of note (`toolUseResult`). Schema pins `additionalProperties: true` for forward compatibility. This level of stability is sufficient for Spike 007 (a research spike, not a long-lived product API).
	- Mitigation: pin to a specific Claude Code `version` field range and version-gate the ingester (e.g. accept `2.1.x`, warn on others).
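
The version gate in the mitigation bullet could look like the sketch below. The function name and the default `("2.1.",)` prefix tuple are assumptions for illustration; whether an out-of-range version should warn or hard-fail is left open in §7.

```python
import warnings


def check_version(version: object, accepted_prefixes: tuple[str, ...] = ("2.1.",)) -> bool:
    """Return True if the record's `version` field is in the tested range.

    Warns (rather than raising) on anything else, so ingestion can continue.
    """
    ok = isinstance(version, str) and version.startswith(accepted_prefixes)
    if not ok:
        warnings.warn(
            f"untested Claude Code version {version!r}; "
            f"tested prefixes: {accepted_prefixes}",
            stacklevel=2,
        )
    return ok
```

Prefix matching keeps the gate trivial to widen later: adding `"2.2."` to the tuple is a one-line change once that range has been tested.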

	### 3.6 Licensing

	- The Claude Code binary is proprietary (Anthropic Commercial Terms of Service, <https://github.com/anthropics/claude-code/blob/1e95326e12183286fc6cbd828c8a86a0d8e03c62/LICENSE.md>).
	- The session JSONL files are local user data generated on the user's machine during ordinary use. Anthropic's data-usage doc explicitly calls them "local caching … session transcripts locally in plaintext" — they belong to the user.
	- Our framework is MIT-licensed and we are not redistributing the Claude Code binary or any third-party trace files. We are reading the user's own local logs (analogous to processing one's own `.bash_history`).
	- We MUST NOT publish raw trace files in our repo without the user's consent (PII risk: cwd, gitBranch, file contents). The framework should ship only the ingester, plus a tiny synthetic-fixture trace for unit tests.

	---

	## 4. Acquiring the 5 real example traces

	Zero acquisition cost. All five live on this machine right now.

	Discovery command (used during this audit):

	```bash
	find ~/.claude/projects -name "*.jsonl" 2>/dev/null
	# → 1015 files
	```

	Five concrete pre-selected sessions, each multi-turn (≥ 100 tool_use messages), each from a distinct project, each ≥ 50 KB:

	| # | Tool-use msgs | User msgs | Asst msgs | Total lines | Path |
	|---|---|---|---|---|---|
	| 1 | 2,830 | 3,199 | 4,325 | 17,315 | `/home/codeseys/.claude/projects/-mnt-e-CS-HF-eidolon/c6967343-51a3-4b1b-9472-a569e96114b1.jsonl` |
	| 2 | 1,350 | 1,407 | 2,016 | 7,673 | `/home/codeseys/.claude/projects/-mnt-e-CS-github-agent-manager/c42b68ea-d410-455e-bc71-92ec6c4adce9.jsonl` |
	| 3 | 984 | 1,032 | 1,549 | 5,783 | `/home/codeseys/.claude/projects/-mnt-e-CS-HF-streaming-speech-to-speech/73c9925c-d5e5-48fc-a97b-a58687c2fb3c.jsonl` |
	| 4 | 717 | 759 | 1,142 | 4,036 | `/home/codeseys/.claude/projects/-mnt-e-CS-github/6ac8e20f-98ec-4279-9957-e68862a90c5e.jsonl` |
	| 5 | 125 | 126 | 197 | 629 | `/home/codeseys/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-iteration-checkpoint/e4a34e2b-40c6-49ce-b253-912a43224aae.jsonl` |

	(All five inspected programmatically during this audit — counts above are real, not estimates.)

	For users on other machines: `find ~/.claude/projects -name '*.jsonl' -size +50k | head` will surface candidates. For repository CI we will commit a small (~5 KB) synthetic fixture conforming to the schema, never any of the user's real traces.
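
The selection criteria used for the five sessions (at least 50 KB, at least 100 tool-use messages) can also be checked in Python. This sketch assumes a "tool-use message" means an assistant record with at least one `tool_use` block, which may not be exactly how the table's counts were produced; both function names are hypothetical.

```python
import json
from pathlib import Path


def count_tool_use_messages(path: Path) -> int:
    """Count assistant records that contain at least one tool_use block."""
    count = 0
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # blank or truncated line
            if not isinstance(rec, dict) or rec.get("type") != "assistant":
                continue
            message = rec.get("message")
            content = message.get("content") if isinstance(message, dict) else None
            if isinstance(content, list) and any(
                isinstance(b, dict) and b.get("type") == "tool_use" for b in content
            ):
                count += 1
    return count


def is_candidate(path: Path, *, min_bytes: int = 50 * 1024, min_tool_use: int = 100) -> bool:
    """Apply the size and tool-use thresholds; the cheap size check runs first."""
    return path.stat().st_size >= min_bytes and count_tool_use_messages(path) >= min_tool_use
```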

	---

	## 5. Decision-relevant tradeoffs vs runners-up

	### Why we are NOT picking OpenHands trajectories (c)
	- Pro: cleanest schema we audited — Pydantic `Event` / `ActionEvent` / `ObservationEvent` models, source: <https://docs.openhands.dev/sdk/arch/events>, source code: <https://github.com/OpenHands/OpenHands/blob/3ec999e8/openhands/events/serialization/event.py>. Tool-call structure is more normalized than Claude Code's (explicit Action/Observation typing).
	- Con: zero-acquisition is false here. Persistence dir defaults to `workspace/conversations/` and only exists if the user has run OpenHands locally. Public eval trajectories are spread across the eval/ folder rather than a clean public bucket.
	- Decisive: Spike 001's economic floor was measured on 50 synthetic states. Spike 007's purpose is to verify ingestion + replay on real traces that already exist. (a) gives that today; (c) requires standing up OpenHands first, plus the storage format split between v0 (per-event JSON files) and v1 (timestamped files) per <https://github.com/All-Hands-AI/OpenHands/issues/8701>, which is a flux risk.
	- Future use: if the framework ever ships "trace ingester adapters" plural, OpenHands is the second adapter to write — its event-typed model is conceptually superior.

	### Why we are NOT picking SWE-bench leaderboard trajectories (e)
	- Pro: hundreds of submissions on <https://github.com/swe-bench/experiments>, with required `trajs/` folders.
	- Con: leaderboard rules say "The reasoning trace can be represented with any text based file format (e.g. md, json, yaml)" (source: <https://github.com/swe-bench/experiments> README). Each submitter picks their own. Building a generic ingester is a per-submission engineering project, not a single adapter. SWE-agent uses one shape (`{"action", "observation", "response"}` arrays — confirmed via <https://huggingface.co/datasets/JetBrains-Research/swe-traj-complete>); mini-swe-agent uses `.traj.json` with OpenAI messages format (<https://huggingface.co/datasets/tarsur385/swebench-verified-trajectories>).
	- Decisive: heterogeneous schema = fragile ingester = wrong choice for first spike.

	### Why we are NOT picking Aider (d)
	- The `chat_history_file` is markdown (`.aider.chat.history.md`), per <https://aider.chat/docs/config/dotenv.html>. Source code at <https://github.com/Aider-AI/aider/blob/bdb4d9ff/aider/history.py> shows it's literally `f.write(text)` of formatted prose with `####` for user input.
	- Decisive: tool calls in Aider are applied as edits, not preserved as discrete structured actions in the markdown log. Reconstructing "the action the student took at step k" is lossy. The `.aider.llm.history` log is closer to what we want but is opt-in and not always present.

	### Why we are NOT picking Cline (b)
	- No public commitment to a stable export schema. Cline's storage is internal to the VS Code extension (workspace state DB + per-task JSON in extension storage). Searching for "Cline trace export schema" yields no Anthropic-style spec doc. Workable in principle, but reverse-engineering an extension's storage is not the right ground for a 1-week spike.

	### Why we are NOT picking SWE-smith-trajectories (f)
	- This is the strongest external dataset we found and should be Spike 007's stretch goal / Spike 008's primary: 5,017 fine-tuning trajectories from SWE-agent + Claude 3.7 Sonnet, 4.22 GB on HuggingFace, OpenAI messages format. Source: <https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories>.
	- Why not first: the messages-only format collapses tool calls and tool results into the OpenAI chat-completions wire format with text-encoded tool blocks. That works for SFT but is less signal-dense for the teacher-correction spike than Claude Code's `tool_use` blocks: there, each tool call's `name` and `input` are separate structured fields, which makes "did the teacher pick a different tool?" a one-line check.
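
To make the "one-line check" concrete: assuming both student and teacher actions are serialized as a JSON list of `{"name", "input"}` objects (the canonical form the §6 sketch produces), comparing tool choice reduces to comparing name lists. The helper names here are illustrative only.

```python
import json


def tool_names(action: str) -> list[str]:
    """Tool names in a serialized action; empty for text-only actions."""
    try:
        parsed = json.loads(action)
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    return [
        b["name"] for b in parsed
        if isinstance(b, dict) and isinstance(b.get("name"), str)
    ]


def picked_different_tool(student_action: str, teacher_action: str) -> bool:
    # The one-line check: same tools in the same order, arguments ignored.
    return tool_names(student_action) != tool_names(teacher_action)
```

With text-encoded tool blocks the same question would first require parsing the tool call back out of prose, which is where the signal loss comes from.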

	---

	## 6. TraceIngester sketch

	> Realised in v0.1 (Wave 17 update): The realised ingester ships at
	> `composer_replication/ingestion/claude_code.py` exporting
	> `ClaudeCodeIngester`, with the spike at
	> `spikes/007-real-trace-ingestion/claude_code_ingester.py`. The
	> public production surface is:
	>
	> ```python
	> from pathlib import Path
	> from composer_replication.ingestion.claude_code import ClaudeCodeIngester
	>
	> ingester = ClaudeCodeIngester(skip_sidechain=True, strip_thinking=True)
	> for trace_state in ingester.ingest(Path("~/.claude/projects/.../session.jsonl").expanduser()):
	>     # trace_state matches the TraceState TypedDict from §1
	>     ...
	> stats = ingester.last_stats # IngestionStats — turn counts, skip reasons
	> ```
	>
	> The shipped `ClaudeCodeIngester` differs from the pre-spike sketch
	> below in:
	> - Class name: `ClaudeCodeIngester` (not `TraceIngester`)
	> - Module path: `composer_replication.ingestion.claude_code` (not
	> `spikes/007-trace-ingester/trace_ingester.py`)
	> - The constructor takes config kwargs (`system_prompt`,
	> `skip_sidechain`, `strip_thinking`, `max_history_tokens`); paths
	> are passed to `.ingest(Path)` per call instead of being held by the
	> ingester
	> - The yielded type is `TraceState` (matches §1)
	>
	> The pre-spike sketch below is preserved as historical proposal context.

	Drop-in adapter for spike-005's `replay_trace()`. Targets `TraceState` (the actual existing TypedDict; see §1).

	```python
	# spikes/007-trace-ingester/trace_ingester.py
	from __future__ import annotations

	import json
	from collections.abc import Iterator
	from pathlib import Path

	# Re-use the existing TypedDicts from spike-005:
	# from spikes.005_integrated_trainer_skeleton.teacher_replay import TraceState

	# A "step" in the trace is each assistant record (tool_use or text-only). The
	# state visible to the model at that step = all messages strictly before it,
	# in OpenAI/Anthropic chat format. The student_action = the tool_use payload(s),
	# or the concatenated text for a text-only reply.


	def _record_to_chat_message(rec: dict, *, strip_thinking: bool = True) -> dict | None:
	    """Turn one Claude Code JSONL record into an OpenAI/Anthropic chat-message
	    dict, or return None for non-conversational records (queue-operation,
	    attachment, file-history-snapshot, system, last-prompt, summary)."""
	    if not isinstance(rec, dict) or rec.get("type") not in ("user", "assistant"):
	        return None
	    msg = rec.get("message")
	    if not isinstance(msg, dict):
	        return None
	    role = msg.get("role")
	    content = msg.get("content")
	    if role not in ("user", "assistant") or content is None:
	        return None
	    # Strip thinking blocks: they are not portable across teacher models and
	    # should not influence the teacher's decision at replay time.
	    if strip_thinking and isinstance(content, list):
	        content = [c for c in content
	                   if not (isinstance(c, dict) and c.get("type") == "thinking")]
	    return {"role": role, "content": content}


	def _serialize_action(content_blocks: list[dict]) -> str:
	    """Canonicalize the student's action at a step.

	    For tool_use steps: JSON-encode the (name, input) pairs.
	    For text-only steps: return the concatenated text.
	    """
	    tool_uses = [b for b in content_blocks
	                 if isinstance(b, dict) and b.get("type") == "tool_use"]
	    if tool_uses:
	        return json.dumps(
	            [{"name": tu.get("name"), "input": tu.get("input")} for tu in tool_uses],
	            sort_keys=True,
	        )
	    texts = [b.get("text", "") for b in content_blocks
	             if isinstance(b, dict) and b.get("type") == "text"]
	    return "\n".join(t for t in texts if t)


	class TraceIngester:
	    """Reads a Claude Code session JSONL and yields TraceState records.

	    One TraceState is emitted per assistant record. The `messages` field is the
	    full prior conversation (user and assistant messages; no system prompt is
	    injected) up to but not including the current assistant turn;
	    `student_action` is the canonicalized serialization of that turn's content
	    blocks.
	    """

	    def __init__(self, *, skip_thinking: bool = True, min_action_chars: int = 1) -> None:
	        self.skip_thinking = skip_thinking
	        self.min_action_chars = min_action_chars

	    def ingest(self, path: str | Path) -> Iterator[dict]:  # yields TraceState
	        path = Path(path)
	        prior_messages: list[dict] = []
	        session_id_for_state = path.stem  # filename = session UUID

	        with path.open("r", encoding="utf-8") as f:
	            for line_idx, line in enumerate(f):
	                line = line.strip()
	                if not line:
	                    continue
	                try:
	                    rec = json.loads(line)
	                except json.JSONDecodeError:
	                    continue  # tolerate truncated last-line writes

	                chat_msg = _record_to_chat_message(rec, strip_thinking=self.skip_thinking)
	                if chat_msg is None:
	                    continue

	                if chat_msg["role"] == "assistant":
	                    # Emit a TraceState representing "before this turn".
	                    blocks = chat_msg["content"] if isinstance(chat_msg["content"], list) else []
	                    student_action = _serialize_action(blocks)
	                    if len(student_action) >= self.min_action_chars:
	                        yield {
	                            "state_id": f"{session_id_for_state}:{rec.get('uuid', line_idx)}",
	                            "messages": list(prior_messages),  # snapshot
	                            "student_action": student_action,
	                        }
	                # Append to history regardless (so subsequent turns see it).
	                prior_messages.append(chat_msg)
	```

	Notes:
	- We skip `thinking` blocks because (1) they're Anthropic-specific and (2) feeding them to other-vendor teachers (GPT/DeepSeek) leaks reasoning the teacher should produce on its own. This matches the philosophy used in spike-005's `_normalize_action`.
	- We do NOT inject a system prompt — Claude Code's initial system prompt is not in the JSONL (it's set at SDK init and visible only via `attachment` records). Downstream callers may want to prepend a synthetic system message for teacher fairness. Open question for ADR-002.
	- `state_id = f"{sessionId}:{recordUuid}"` is globally unique and stable across re-ingest.
	- Failures (unparseable lines, missing fields) are tolerated silently. A counters-based sibling method `ingest_with_stats(path)` is a small follow-up.

	### 6.1 Smoke-test plan (for Spike 007 itself)

	```python
	ingester = TraceIngester()
	states = list(ingester.ingest("/home/codeseys/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-iteration-checkpoint/e4a34e2b-40c6-49ce-b253-912a43224aae.jsonl"))
	# Expect roughly 197 states (matches asst-message count counted in §4).
	# Then teacher-replay on the first 5 states, confirm cost is in the
	# spike-001 ballpark ($0.05–$0.20 for 5 states × 3 teachers).
	```

	Spike 001 baseline to beat: $0.98/trace mean (50-state synthetic), $0.30/trace projected with VOI gating. On real states a ~5–20× cost increase is plausible due to longer message histories (10k+ tokens vs synthetic ~300 tokens), so a relevant economic check for Spike 007 is: if the first 5 states cost > $5 (i.e. > $1/state), the VOI gate from Spike 001 is required before scaling. Flag this finding in the spike write-up.
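
The arithmetic behind that check, as a sketch. It assumes the $0.98 figure is the cost of one 50-state synthetic trace with all teachers included, so the per-state baseline is $0.98 / 50; if Spike 001 measured something different, the constants need adjusting.

```python
# Constants from the text; the per-state interpretation is an assumption.
SYNTHETIC_COST_PER_TRACE_USD = 0.98  # Spike 001 mean, ungated
STATES_PER_SYNTHETIC_TRACE = 50
GATE_THRESHOLD_USD = 5.00            # budget for the first 5 real states


def projected_cost(n_states: int, multiplier: float) -> float:
    """Projected replay cost for n_states real states at a given cost multiplier."""
    per_state = SYNTHETIC_COST_PER_TRACE_USD / STATES_PER_SYNTHETIC_TRACE
    return n_states * per_state * multiplier


low = projected_cost(5, 5)    # about $0.49 at the 5x end of the range
high = projected_cost(5, 20)  # about $1.96 at the 20x end of the range

# Multiplier at which 5 states would hit the $5 threshold (roughly 51x).
breakeven_multiplier = GATE_THRESHOLD_USD / projected_cost(5, 1)
```

Under these assumptions the whole 5–20× range stays below the $5 threshold, so tripping it would mean real states cost roughly 51× the synthetic per-state baseline, well outside the expected range.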

	---

	## 7. Open questions for ADR-002

	1. Do we promote `TraceState` to a top-level `TraceExample` dataclass, with optional `teacher_id`, `reward`, `hint_text`? Or keep `TraceState` as ingester output and `DPOPair` as trainer input, treating the brief's "TraceExample" as conceptual?
	2. Should `TraceIngester.ingest()` emit one record per assistant turn (current sketch) or per assistant `tool_use` block within a turn? Some Claude Code records have multiple tool_use blocks in one assistant message.
	3. Synthetic system prompt at replay time — yes/no? If yes, what content?
	4. Trace-version pinning: hard-fail or warn when `version` field falls outside a known-tested range?
	5. Subagent transcripts (`agent-*.jsonl`) — include or skip? They are denser per-turn but their parent context is the orchestrator, not the user, which changes the teacher-replay semantics.
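
For question 2, the per-block alternative would look roughly like the sketch below: one `(state_id, student_action)` pair per `tool_use` block instead of one per assistant turn. The function name and the `#<index>` suffix on `state_id` are hypothetical; nothing in the repo uses this scheme.

```python
import json


def split_actions_per_block(state_id: str, content_blocks: list[dict]) -> list[tuple[str, str]]:
    """Return one (state_id, student_action) pair per tool_use block in a turn."""
    tool_uses = [
        b for b in content_blocks
        if isinstance(b, dict) and b.get("type") == "tool_use"
    ]
    pairs = []
    for i, block in enumerate(tool_uses):
        # Same canonical form as the per-turn sketch, but a one-element list.
        action = json.dumps(
            [{"name": block.get("name"), "input": block.get("input")}],
            sort_keys=True,
        )
        pairs.append((f"{state_id}#{i}", action))
    return pairs
```

The trade-off is that all blocks from one turn would share the same prior `messages`, so a teacher replaying the second block would not see that the first call had already been issued.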

	---

	## 8. References (primary sources only)

	Anthropic / Claude Code official:
	- <https://code.claude.com/docs/en/sessions> — session storage location and "JSONL, one JSON per line"
	- <https://code.claude.com/docs/en/data-usage> — "local caching … session transcripts locally in plaintext under `~/.claude/projects/` for 30 days by default"
	- <https://code.claude.com/docs/en/legal-and-compliance> — Commercial Terms vs Consumer Terms applicability
	- <https://github.com/anthropics/claude-code/blob/1e95326e12183286fc6cbd828c8a86a0d8e03c62/LICENSE.md> — proprietary license

	Community schemas (reverse-engineered from real session data):
	- <https://github.com/moru-ai/agent-schemas/blob/main/claude-code/v2.1.1/session.schema.json> — JSON Schema Draft 2020-12, validated against ~50,000 messages from 480 sessions
	- <https://github.com/KyleAMathews/claude-code-ui/blob/main/spec.md> §"Claude Code Session Log Format" — Entry types and TypeScript discriminated union
	- <https://github.com/jamie-bitflight/claude_skills/blob/main/plugins/agentskill-kaizen/skills/transcript-analysis/references/session-log-schema.md> — top-level fields, project-key encoding, subagent file location
	- <https://github.com/dagster-io/erk/blob/master/docs/learned/sessions/layout.md> — directory structure, plan-mode `slug` field
	- <https://github.com/pedropaulovc/claude-code-types> — TypeScript type definitions from session logs

	Runners-up reference points:
	- OpenHands events: <https://docs.openhands.dev/sdk/arch/events>, <https://docs.openhands.dev/sdk/guides/convo-persistence>, <https://github.com/OpenHands/OpenHands/blob/3ec999e8/openhands/events/serialization/event.py>, <https://github.com/All-Hands-AI/OpenHands/issues/8701>
	- SWE-bench experiments: <https://github.com/swe-bench/experiments>
	- SWE-smith trajectories on HF: <https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories>
	- mini-swe-agent traj.json: <https://huggingface.co/datasets/tarsur385/swebench-verified-trajectories>
	- swe-traj-complete (SWE-agent format example): <https://huggingface.co/datasets/JetBrains-Research/swe-traj-complete>
	- Aider history file format: <https://aider.chat/docs/config/dotenv.html>, <https://github.com/Aider-AI/aider/blob/bdb4d9ff/aider/history.py>, <https://github.com/paul-gauthier/aider/blob/main/aider/io.py>

	Internal references:
	- `spikes/005-integrated-trainer-skeleton/teacher_replay.py` — `TraceState`, `DPOPair`, `replay_trace`, `extract_dpo_pairs` (read in full during this audit; see §1 for actual field list)
	- Spike 001 economic floor: $0.98/trace mean ungated, $0.30/trace projected with VOI gating