TRACE_SOURCE_RECONNAISSANCE.md

Spike 007 trace-source audit, feeding ADR-002.

Status: DECIDED — recommend (a) Claude Code session JSONL (~/.claude/projects/<encoded>/<sessionId>.jsonl).


0. TL;DR

Of the six candidates audited, Claude Code session JSONL wins on every axis except "official Anthropic-published schema" (no such doc exists), and for that single weakness there is now a community-maintained reverse-engineered JSON Schema validated against ~50,000 messages from real sessions, plus three independent third-party schema specs. The user has 1,015 .jsonl sessions on this machine today; the eight largest sampled span 550 → 17,315 lines and contain 6,762 multi-turn tool_use messages. Acquisition cost is zero. Licensing is clean: the JSONL files are local user-owned data; the proprietary Claude Code binary is not redistributed by us.

The runners-up — OpenHands (well-documented but acquisition is non-trivial), SWE-bench trajectory submissions (heterogeneous schemas across submitters), Aider markdown chat history (lossy / unparseable for tool calls), and Cline (no public stable export format) — each lose on at least one of the four axes.


1. Context: TraceExample dataclass field reality

Important correction to the parent task description. The task brief said "TraceExample dataclass with fields state_text, action_taken, hint_text (optional), reward (float), teacher_id (str)". Reading the actual file at /mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/teacher_replay.py shows the existing types are different — there is no TraceExample class. The closest existing types are two TypedDicts used by replay_trace() and extract_dpo_pairs():

class TraceState(TypedDict):
    state_id: str           # unique within the trace
    messages: list[dict]    # conversation up to and including this step's user prompt
    student_action: str     # what the student actually did at this step

class DPOPair(TypedDict):
    state_id: str
    state_messages: list[dict]
    chosen: str       # teacher-consensus action
    rejected: str     # student action
    n_teachers_agreeing: int

The mapping sketch in §6 below targets TraceState (the input to teacher replay), since that is the type a TraceIngester produces. If Spike 007 also wants a unified TraceExample per the brief, the natural shape is TraceState extended with three optional fields, {teacher_id: str | None, reward: float | None, hint_text: str | None}. This is flagged for ADR-002 to settle.
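
One way the extended shape could look, as a minimal sketch. TraceExample and to_trace_example are hypothetical names that do not exist in teacher_replay.py; TraceState is re-declared here only so the block is self-contained.

```python
from typing import Optional, TypedDict


class TraceState(TypedDict):
    state_id: str
    messages: list[dict]
    student_action: str


class TraceExample(TraceState):
    # Hypothetical extension pending ADR-002; all three fields are optional
    # in the sense that None means "not yet assigned".
    teacher_id: Optional[str]
    reward: Optional[float]
    hint_text: Optional[str]


def to_trace_example(state: TraceState) -> TraceExample:
    """Lift an ingested TraceState into a TraceExample with unset teacher fields."""
    return TraceExample(**state, teacher_id=None, reward=None, hint_text=None)
```

Keeping TraceExample a subtype of TraceState would let replay_trace() accept either without changes, which is one argument for this shape over a separate dataclass.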


2. Candidate audit summary

Scoring legend: + good, ~ mixed, - bad, on each of the four required axes.

| # | Candidate | Schema documented | Real ≥5 multi-turn traces | Hint-receptive signal density | License OK | Verdict |
|---|---|---|---|---|---|---|
| a | Claude Code JSONL (~/.claude/projects/) | ~ Anthropic publishes high-level format note; community schemas are detailed and validated | + 1,015 local sessions, 5+ trivially | + Per-step assistant.message.content[].tool_use blocks → discrete actions, ideal teacher-correction sites | + User-owned local files; framework MIT | CHOSEN |
| b | Cline VS Code extension | - No published stable export schema | ~ Requires running Cline + manual export | ~ Plausible if exported but unverified | ~ Cline source Apache-2.0 but trace format isn't a stable contract | reject |
| c | OpenHands trajectories | + Well-documented (events/, base_state.json, Pydantic Event models) | - Need to run OpenHands or download eval traces; not zero-cost | + ActionEvent/ObservationEvent split is conceptually ideal | + OpenHands MIT-licensed | strong runner-up |
| d | Aider chat history | ~ Format is "markdown, level-4 headings for user input"; fragile | ~ Available if Aider was used | - Tool calls are flattened into prose; recovering structured actions is lossy | + Aider Apache-2.0 | reject |
| e | SWE-bench / Lite leaderboard trajs/ | - Each submitter chooses a free-form text format (md/json/yaml) | + ~hundreds of submissions on github.com/swe-bench/experiments | ~ Heterogeneous; structured ones (e.g. mini-swe-agent .traj.json) are good, others are essentially logs | + Public submissions with usage rights for research | reject as primary; usable as future cross-validation set |
| f | SWE-smith-trajectories on HF | + Standard OpenAI messages format, documented per dataset card | + 5,017 trajectories, 76,002 rows, public | + Single-attempt per-instance SWE-agent runs | + Apache-2.0 dataset license | strong runner-up; complement, not replacement |

The (f) row was discovered during audit (the parent task allowed "any other public source you find that is better"). It's a strong candidate but answers a different question: SWE-bench trajectories give us reproducible benchmark traces; Claude Code JSONL gives us the user's actual workflow. For Spike 007's purpose (verify the teacher-replay path works on a real, signal-dense trace at zero acquisition cost), (a) is the right primary; (f) is queued for a later cross-validation phase.


3. Chosen format spec — Claude Code session JSONL

3.1 Location and naming

  • Root: ~/.claude/projects/ (overridable via CLAUDE_CONFIG_DIR). Source: https://code.claude.com/docs/en/sessions ("Transcripts are stored as JSONL at ~/.claude/projects/<encoded-cwd>/<sessionId>.jsonl").
  • Project-key encoding: working-directory absolute path with / and \ and : replaced by -, with a leading -. (Hidden directories with a leading dot become double dashes.) Source: https://github.com/jamie-bitflight/claude_skills/blob/main/plugins/agentskill-kaizen/skills/transcript-analysis/references/session-log-schema.md §"Project key encoding".
  • File: <sessionId>.jsonl. Subagent transcripts are agent-<agentId>.jsonl; a SessionReader should skip files starting with agent- when listing main sessions. Source: same claude_skills doc, §"Subagent File Location".
  • Encoding: UTF-8, newline-delimited JSON. One JSON object per line. No [/] wrapping. Local cleanup default 30 days, configurable via cleanupPeriodDays in ~/.claude/settings.json. Source: https://code.claude.com/docs/en/data-usage ("Local caching: Claude Code clients store session transcripts locally in plaintext under ~/.claude/projects/ for 30 days by default to enable session resumption.")
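
The naming rules above can be sketched as two small helpers. This is a minimal sketch of the encoding as the cited community doc describes it, not Claude Code's own implementation; encode_project_key and list_main_sessions are hypothetical names, and the handling of characters other than /, \, : and a component-leading dot is left unspecified.

```python
import re
from pathlib import Path


def encode_project_key(cwd: str) -> str:
    """Encode an absolute working directory following the rules listed above."""
    # A dot that starts a path component (a hidden directory) becomes "-",
    # so "/.foo" ends up as "--foo" after the separator pass below.
    key = re.sub(r"(?<=[/\\:])\.", "-", cwd)
    # Every "/", "\" and ":" becomes "-"; the leading "/" of a POSIX path
    # is what produces the leading "-".
    return re.sub(r"[/\\:]", "-", key)


def list_main_sessions(projects_root: Path) -> list[Path]:
    """Return main-session transcripts, skipping agent-*.jsonl subagent files."""
    return sorted(
        p for p in projects_root.rglob("*.jsonl")
        if not p.name.startswith("agent-")
    )


print(encode_project_key("/mnt/e/CS/HF/eidolon"))  # -mnt-e-CS-HF-eidolon
```

Under these rules a hidden directory such as /mnt/e/CS/github/VIGOR/.overstory/worktrees encodes to -mnt-e-CS-github-VIGOR--overstory-worktrees, which is consistent with the directory names seen in §3.4 and §4.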

3.2 Common record fields

Every record (both user and assistant types) carries:

| field | type | meaning |
|---|---|---|
| parentUuid | string \| null | UUID of the parent record (null on the first record) |
| uuid | string | This record's UUID |
| sessionId | string | UUID of the session (matches filename) |
| timestamp | string (ISO-8601) | Wall-clock time of the record |
| cwd | string | Absolute working directory |
| version | string | Claude Code version (e.g. "2.1.143") |
| gitBranch | string | Empty string "" when not in a git repo |
| isSidechain | boolean | True for sub-agent (Task tool) chains |
| userType | string | "external" or similar |
| type | string | Discriminator; see §3.3 |
| entrypoint | string | e.g. "sdk-cli" |

3.3 Record types (type discriminator)

| type | Role |
|---|---|
| user | Both human prompts AND tool results (distinguished by message.content[].type) |
| assistant | Model output: text, thinking, and tool_use blocks |
| system | Hook summaries, stop notices |
| summary | Context-compaction markers |
| attachment | Hook stdout/stderr, e.g. SessionStart hook output |
| queue-operation | Prompt enqueue/dequeue events |
| file-history-snapshot | File-state tracking for undo |
| last-prompt | Bookkeeping for resume |

Source: https://github.com/KyleAMathews/claude-code-ui/blob/main/spec.md §"Entry Types"; corroborated by direct Counter inspection of one local session showing attachment, assistant, user, last-prompt, queue-operation types in expected proportions.
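
The Counter inspection mentioned above can be reproduced with a few lines. This is a sketch of one way to do it, not the exact script used during the audit; count_record_types and the "<...>" bucket labels are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def count_record_types(path: Path) -> Counter:
    """Tally the `type` discriminator across all records in one session file."""
    counts: Counter = Counter()
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # blank lines carry no record
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                counts["<unparseable>"] += 1
                continue
            if isinstance(rec, dict):
                counts[str(rec.get("type", "<missing>"))] += 1
            else:
                counts["<non-object>"] += 1
    return counts
```

Calling count_record_types(path).most_common() on a local session gives the per-type proportions; any type outside the table above is a signal that the schema has drifted.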

3.4 The two record types we care about

Assistant record carrying a tool call (the "student action")

Real example, redacted from ~/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-doc-adapter-skeleton/39df59f0-674c-413a-b333-cdac0cea9db7.jsonl:

{
  "type": "assistant",
  "uuid": "24a16a51-3133-4ba5-9d23-472864286154",
  "parentUuid": "1b11c3b3-832b-4473-a944-b61a1f3f2594",
  "sessionId": "39df59f0-…",
  "timestamp": "2026-05-16T04:52:21.947Z",
  "message": {
    "role": "assistant",
    "model": "claude-opus-4-7",
    "content": [
      {
        "type": "tool_use",
        "id": "toolu_bdrk_012HC2dggmSgtVAtWWzwikZq",
        "name": "Bash",
        "input": {
          "command": "ov mail check --agent builder-doc-adapter-skeleton 2>&1 | head -200",
          "description": "Check builder agent inbox"
        }
      }
    ],
    "stop_reason": "tool_use",
    "usage": { "input_tokens": 6, "cache_creation_input_tokens": 48287, "output_tokens": 1021, ... }
  }
}

The student's action at this step = the JSON of message.content[i] where content[i].type == "tool_use" (or, if multiple tool_use blocks, the array of them; or if pure-text reply, the content[i].text of the text block).

User record carrying a tool result (the "observation")

{
  "type": "user",
  "uuid": "b9f9414b-…",
  "parentUuid": "24a16a51-…",            // matches the assistant uuid above
  "sessionId": "39df59f0-…",
  "timestamp": "2026-05-16T04:52:23.229Z",
  "message": {
    "role": "user",
    "content": [
      {
        "tool_use_id": "toolu_bdrk_012HC2dggmSgtVAtWWzwikZq",
        "type": "tool_result",
        "content": "  No new messages",
        "is_error": false
      }
    ]
  },
  "toolUseResult": {                       // duplicate, structured form
    "stdout": "  No new messages",
    "stderr": "",
    "interrupted": false,
    "isImage": false,
    "noOutputExpected": false
  },
  "sourceToolAssistantUUID": "24a16a51-…"  // back-pointer to the assistant uuid
}

User records carrying actual human prompts have message.content as a list with {"type":"text","text":"..."} blocks (or, in older logs, message.content as a plain string).
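
Because type == "user" covers both human prompts and tool results, an ingester has to look inside message.content to tell them apart. The sketch below shows one way to do that and to join each tool_use block to its tool_result via tool_use_id, based only on the two example records above; classify_user_record and pair_tool_calls are hypothetical helpers.

```python
def classify_user_record(rec: dict) -> str:
    """Return "tool_result", "human_prompt", or "other" for a type == "user" record."""
    message = rec.get("message")
    content = message.get("content") if isinstance(message, dict) else None
    if isinstance(content, str):
        return "human_prompt"  # older logs store the prompt as a plain string
    if isinstance(content, list):
        kinds = {b.get("type") for b in content if isinstance(b, dict)}
        if "tool_result" in kinds:
            return "tool_result"
        if "text" in kinds:
            return "human_prompt"
    return "other"


def pair_tool_calls(records: list[dict]) -> dict[str, dict]:
    """Map tool_use id -> {"call": tool_use block, "result": tool_result block or None}."""
    pairs: dict[str, dict] = {}
    for rec in records:
        message = rec.get("message")
        content = message.get("content") if isinstance(message, dict) else None
        if not isinstance(content, list):
            continue
        for block in content:
            if not isinstance(block, dict):
                continue
            if block.get("type") == "tool_use" and "id" in block:
                pairs[block["id"]] = {"call": block, "result": None}
            elif block.get("type") == "tool_result":
                entry = pairs.get(block.get("tool_use_id"))
                if entry is not None:
                    entry["result"] = block
    return pairs
```

A call whose "result" stays None after a full pass is a tool call that was interrupted or whose result record was truncated, which is worth counting before replay.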

3.5 Schema stability

  • Anthropic's official documentation acknowledges the location and "each line is a JSON object for a message, tool use, or metadata entry" but does not publish a versioned schema.
  • Practical stability: moru-ai/agent-schemas tracked v2.0.76 → v2.1.1; only one new field of note (toolUseResult). Schema pins additionalProperties: true for forward compatibility. This level of stability is sufficient for Spike 007 (a research spike, not a long-lived product API).
  • Mitigation: pin to a specific Claude Code version field range and version-gate the ingester (e.g. accept 2.1.x, warn on others).
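
The version gate in the mitigation bullet could be as small as the following. It is a sketch assuming the "accept 2.1.x, warn on others" policy; TESTED_PREFIX, parse_version and check_version are hypothetical names, and whether to warn or hard-fail remains an open question (§7).

```python
import warnings
from typing import Optional

TESTED_PREFIX = (2, 1)  # assumption: only 2.1.x has been exercised


def parse_version(version: str) -> Optional[tuple[int, ...]]:
    """Parse "2.1.143" into (2, 1, 143); return None if any part is not an integer."""
    try:
        return tuple(int(part) for part in version.split("."))
    except ValueError:
        return None


def check_version(rec: dict) -> bool:
    """Return True when rec["version"] is 2.1.x; otherwise warn and return False."""
    raw = rec.get("version")
    parsed = parse_version(raw) if isinstance(raw, str) else None
    ok = parsed is not None and parsed[:2] == TESTED_PREFIX
    if not ok:
        warnings.warn(f"untested Claude Code version: {raw!r}", stacklevel=2)
    return ok
```

Returning a boolean rather than raising keeps the choice between skipping the record and aborting the ingest with the caller.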

3.6 Licensing

  • The Claude Code binary is proprietary (Anthropic Commercial Terms of Service, https://github.com/anthropics/claude-code/blob/1e95326e12183286fc6cbd828c8a86a0d8e03c62/LICENSE.md).
  • The session JSONL files are local user data generated on the user's machine during ordinary use. Anthropic's data-usage doc explicitly calls them "local caching … session transcripts locally in plaintext" — they belong to the user.
  • Our framework is MIT-licensed and we are not redistributing the Claude Code binary or any third-party trace files. We are reading the user's own local logs (analogous to processing one's own .bash_history).
  • We MUST NOT publish raw trace files in our repo without the user's consent (PII risk: cwd, gitBranch, file contents). The framework should ship only the ingester, plus a tiny synthetic-fixture trace for unit tests.

4. Acquiring the 5 real example traces

Zero acquisition cost. All five live on this machine right now.

Discovery command (used during this audit):

find ~/.claude/projects -name "*.jsonl" 2>/dev/null
# → 1015 files

Five concrete pre-selected sessions, each multi-turn (≥ 100 tool_use messages), each from a distinct project, each ≥ 50 KB:

| # | Tool-use msgs | User msgs | Asst msgs | Total lines | Path |
|---|---|---|---|---|---|
| 1 | 2,830 | 3,199 | 4,325 | 17,315 | /home/codeseys/.claude/projects/-mnt-e-CS-HF-eidolon/c6967343-51a3-4b1b-9472-a569e96114b1.jsonl |
| 2 | 1,350 | 1,407 | 2,016 | 7,673 | /home/codeseys/.claude/projects/-mnt-e-CS-github-agent-manager/c42b68ea-d410-455e-bc71-92ec6c4adce9.jsonl |
| 3 | 984 | 1,032 | 1,549 | 5,783 | /home/codeseys/.claude/projects/-mnt-e-CS-HF-streaming-speech-to-speech/73c9925c-d5e5-48fc-a97b-a58687c2fb3c.jsonl |
| 4 | 717 | 759 | 1,142 | 4,036 | /home/codeseys/.claude/projects/-mnt-e-CS-github/6ac8e20f-98ec-4279-9957-e68862a90c5e.jsonl |
| 5 | 125 | 126 | 197 | 629 | /home/codeseys/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-iteration-checkpoint/e4a34e2b-40c6-49ce-b253-912a43224aae.jsonl |

(All five inspected programmatically during this audit — counts above are real, not estimates.)

For users on other machines: find ~/.claude/projects -name '*.jsonl' -size +50k | head will surface candidates. For repository CI we will commit a small (~5 KB) synthetic fixture conforming to the schema, never any of the user's real traces.


5. Decision-relevant tradeoffs vs runners-up

Why we are NOT picking OpenHands trajectories (c)

  • Pro: cleanest schema we audited — Pydantic Event / ActionEvent / ObservationEvent models, source: https://docs.openhands.dev/sdk/arch/events, source code: https://github.com/OpenHands/OpenHands/blob/3ec999e8/openhands/events/serialization/event.py. Tool-call structure is more normalized than Claude Code's (explicit Action/Observation typing).
  • Con: zero-acquisition is false here. Persistence dir defaults to workspace/conversations/ and only exists if the user has run OpenHands locally. Public eval trajectories are spread across the eval/ folder rather than a clean public bucket.
  • Decisive: Spike 001's economic floor was measured on 50 synthetic states. Spike 007's purpose is to verify ingestion + replay on real traces that already exist. (a) gives that today; (c) requires standing up OpenHands first, plus the storage format split between v0 (per-event JSON files) and v1 (timestamped files) per https://github.com/All-Hands-AI/OpenHands/issues/8701, which is a flux risk.
  • Future use: if the framework ever ships "trace ingester adapters" plural, OpenHands is the second adapter to write — its event-typed model is conceptually superior.

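Why we are NOT picking SWE-bench leaderboard trajectories (e)

  • Each submitter chooses a free-form text format (md/json/yaml), so schemas are heterogeneous across submissions (see §2, row e). Structured ones such as mini-swe-agent .traj.json are good; others are essentially logs. Rejected as primary, but usable as a future cross-validation set.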

Why we are NOT picking Aider (d)

  • The chat_history_file is markdown (.aider.chat.history.md), per https://aider.chat/docs/config/dotenv.html. Source code at https://github.com/Aider-AI/aider/blob/bdb4d9ff/aider/history.py shows it's literally f.write(text) of formatted prose with #### for user input.
  • Decisive: tool calls in Aider are applied as edits, not preserved as discrete structured actions in the markdown log. Reconstructing "the action the student took at step k" is lossy. The .aider.llm.history log is closer to what we want but is opt-in and not always present.

Why we are NOT picking Cline (b)

  • No public commitment to a stable export schema. Cline's storage is internal to the VS Code extension (workspace state DB + per-task JSON in extension storage). Searching for "Cline trace export schema" yields no Anthropic-style spec doc. Workable in principle, but reverse-engineering an extension's storage is not the right ground for a 1-week spike.

Why we are NOT picking SWE-smith-trajectories (f)

  • This is the strongest external dataset we found and should be Spike 007's stretch goal / Spike 008's primary: 5,017 fine-tuning trajectories from SWE-agent + Claude 3.7 Sonnet, 4.22 GB on HuggingFace, OpenAI messages format. Source: https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories.
  • Why not first: the messages-only format collapses tool calls and tool results into the OpenAI chat-completions wire format with text-encoded tool blocks. That works for SFT but is less signal-dense for the teacher-correction spike than Claude Code's tool_use blocks: in Claude Code's format each tool call's name and input fields are structurally separated, which makes "did the teacher pick a different tool?" a one-line check.
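
To make the "one-line check" concrete, here is a sketch that compares two tool_use blocks shaped like the example in §3.4. It assumes the teacher's proposed action has been parsed into the same block shape, which is not something the replay path is confirmed to do; both function names are hypothetical.

```python
def picked_different_tool(student_block: dict, teacher_block: dict) -> bool:
    """True when the teacher chose a different tool than the student."""
    return student_block.get("name") != teacher_block.get("name")


def same_tool_different_input(student_block: dict, teacher_block: dict) -> bool:
    """True when both chose the same tool but with different arguments."""
    return (
        student_block.get("name") == teacher_block.get("name")
        and student_block.get("input") != teacher_block.get("input")
    )
```

With text-encoded tool blocks, the same comparison would first require parsing the tool name and arguments back out of the message text.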

6. TraceIngester sketch

Realised in v0.1 (Wave 17 update): The realised ingester ships at composer_replication/ingestion/claude_code.py exporting ClaudeCodeIngester, with the spike at spikes/007-real-trace-ingestion/claude_code_ingester.py. The public production surface is:

from pathlib import Path
from composer_replication.ingestion.claude_code import ClaudeCodeIngester

ingester = ClaudeCodeIngester(skip_sidechain=True, strip_thinking=True)
for trace_state in ingester.ingest(Path("~/.claude/projects/.../session.jsonl").expanduser()):
    # trace_state matches the TraceState TypedDict from §1
    ...
stats = ingester.last_stats  # IngestionStats — turn counts, skip reasons

The shipped ClaudeCodeIngester differs from the pre-spike sketch below in:

  • Class name: ClaudeCodeIngester (not TraceIngester)
  • Module path: composer_replication.ingestion.claude_code (not spikes/007-trace-ingester/trace_ingester.py)
  • The constructor takes config kwargs (system_prompt, skip_sidechain, strip_thinking, max_history_tokens); paths are passed to .ingest(Path) per call instead of being held by the ingester
  • The yielded type is TraceState (matches §1)

The pre-spike sketch below is preserved as historical proposal context.

Drop-in adapter for spike-005's replay_trace(). Targets TraceState (the actual existing TypedDict; see §1).

# spikes/007-trace-ingester/trace_ingester.py
from __future__ import annotations
import json
from collections.abc import Iterator
from pathlib import Path
from typing import Any

# Re-use the existing TypedDicts from spike-005:
#   from spikes.005_integrated_trainer_skeleton.teacher_replay import TraceState

# A "step" in the trace is each assistant record that ends in tool_use. The
# state visible to the model at that step = all messages strictly before it,
# in OpenAI/Anthropic chat format. The student_action = the tool_use payload(s).

def _record_to_chat_message(rec: dict, *, strip_thinking: bool = True) -> dict | None:
    """Turn one Claude Code JSONL record into an OpenAI/Anthropic chat-message
    dict, or return None for non-conversational records (queue-operation,
    attachment, file-history-snapshot, system, last-prompt, summary)."""
    t = rec.get("type")
    if t not in ("user", "assistant"):
        return None
    msg = rec.get("message")
    if not isinstance(msg, dict):
        return None
    role = msg.get("role")
    content = msg.get("content")
    if role not in ("user", "assistant") or content is None:
        return None
    # Strip thinking blocks: they are not portable across teacher models and
    # should not influence the teacher's decision at replay time.
    if strip_thinking and isinstance(content, list):
        content = [c for c in content
                   if not (isinstance(c, dict) and c.get("type") == "thinking")]
    return {"role": role, "content": content}


def _serialize_action(content_blocks: list[dict]) -> str:
    """Canonicalize the student's action at a step.

    For tool_use steps: JSON-encode the (name, input) pairs.
    For text-only steps: return the concatenated text.
    """
    tool_uses = [b for b in content_blocks if isinstance(b, dict) and b.get("type") == "tool_use"]
    if tool_uses:
        return json.dumps(
            [{"name": tu.get("name"), "input": tu.get("input")} for tu in tool_uses],
            sort_keys=True,
        )
    texts = [b.get("text", "") for b in content_blocks if isinstance(b, dict) and b.get("type") == "text"]
    return "\n".join(t for t in texts if t)


class TraceIngester:
    """Reads a Claude Code session JSONL and yields TraceState records.

    One TraceState is emitted per assistant record whose serialized action has
    at least `min_action_chars` characters. The `messages` field is the full
    prior conversation (user/assistant records only; no system prompt is
    injected) up to but not including the current assistant turn;
    `student_action` is the canonicalized serialization of that turn's
    content blocks.
    """

    def __init__(self, *, skip_thinking: bool = True, min_action_chars: int = 1) -> None:
        self.skip_thinking = skip_thinking
        self.min_action_chars = min_action_chars

    def ingest(self, path: str | Path) -> Iterator[dict]:  # yields TraceState
        path = Path(path)
        prior_messages: list[dict] = []
        session_id_for_state = path.stem  # filename = session UUID

        with path.open("r", encoding="utf-8") as f:
            for line_idx, line in enumerate(f):
                line = line.strip()
                if not line:
                    continue
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    continue  # tolerate truncated last-line writes
                if not isinstance(rec, dict):
                    continue  # a valid JSON line that is not an object

                # Honor the constructor flag (previously stored but never used).
                chat_msg = _record_to_chat_message(rec, strip_thinking=self.skip_thinking)
                if chat_msg is None:
                    continue

                if chat_msg["role"] == "assistant":
                    # Emit a TraceState representing "before this turn".
                    blocks = chat_msg["content"] if isinstance(chat_msg["content"], list) else []
                    student_action = _serialize_action(blocks)
                    if len(student_action) >= self.min_action_chars:
                        yield {
                            "state_id": f"{session_id_for_state}:{rec.get('uuid', line_idx)}",
                            "messages": list(prior_messages),    # snapshot
                            "student_action": student_action,
                        }
                # Append to history regardless (so subsequent turns see it).
                prior_messages.append(chat_msg)

Notes:

  • We skip thinking blocks because (1) they're Anthropic-specific and (2) feeding them to other-vendor teachers (GPT/DeepSeek) leaks reasoning the teacher should produce on its own. This matches the philosophy used in spike-005's _normalize_action.
  • We do NOT inject a system prompt — Claude Code's initial system prompt is not in the JSONL (it's set at SDK init and visible only via attachment records). Downstream callers may want to prepend a synthetic system message for teacher fairness. Open question for ADR-002.
  • state_id = f"{sessionId}:{recordUuid}" is globally unique and stable across re-ingest.
  • Failures (unparseable lines, missing fields) are tolerated silently. A counters-based sibling method ingest_with_stats(path) is a small follow-up.
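
A counters-based variant along the lines of the last note might look like this. It is a sketch only: ScanStats and scan_with_stats are hypothetical names, it works on an iterable of lines rather than a path to stay easy to test, and it makes no claim about the fields of the shipped IngestionStats.

```python
import json
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ScanStats:
    blank: int = 0
    unparseable: int = 0
    non_conversational: int = 0
    kept: int = 0


def scan_with_stats(lines: Iterable[str]) -> tuple[list[dict], ScanStats]:
    """Parse JSONL lines, keep user/assistant records, and count every skip."""
    stats = ScanStats()
    kept: list[dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            stats.blank += 1
            continue
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            stats.unparseable += 1
            continue
        if not isinstance(rec, dict) or rec.get("type") not in ("user", "assistant"):
            stats.non_conversational += 1
            continue
        stats.kept += 1
        kept.append(rec)
    return kept, stats
```

Surfacing the skip counts makes silent tolerance auditable: a session where unparseable is more than one or two lines probably indicates a format change rather than a truncated final write.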

6.1 Smoke-test plan (for Spike 007 itself)

ingester = TraceIngester()
states = list(ingester.ingest("/home/codeseys/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-iteration-checkpoint/e4a34e2b-40c6-49ce-b253-912a43224aae.jsonl"))
# Expect roughly 197 states (matches asst-message count counted in §4).
# Then teacher-replay on the first 5 states, confirm cost is in the
# spike-001 ballpark ($0.05–$0.20 for 5 states × 3 teachers).

Spike 001 baseline to beat: $0.98/trace mean (50-state synthetic), $0.30/trace projected with VOI gating. On real states a ~5–20× cost increase is plausible due to longer message histories (10k+ tokens vs synthetic ~300 tokens), so a relevant economic check for Spike 007 is: if the first 5 states cost > $5 (i.e. > $1/state), the VOI gate from Spike 001 is required before scaling. Flag this finding in the spike write-up.
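
The arithmetic behind that threshold, worked through as a sketch. It assumes the $0.98/trace figure is spread evenly over the 50 synthetic states, which is an interpretation of "50-state synthetic" rather than a number reported by Spike 001; voi_gate_required is a hypothetical helper.

```python
SPIKE_001_COST_PER_TRACE = 0.98    # USD, mean, ungated (from the text)
SYNTHETIC_STATES_PER_TRACE = 50    # assumption: cost is spread evenly over states
SMOKE_TEST_STATES = 5
GATE_THRESHOLD_PER_STATE = 1.00    # USD; equivalent to "> $5 for 5 states"

per_state = SPIKE_001_COST_PER_TRACE / SYNTHETIC_STATES_PER_TRACE   # about $0.0196
synthetic_5_states = per_state * SMOKE_TEST_STATES                  # about $0.098
real_low = 5 * synthetic_5_states                                   # about $0.49 at 5x
real_high = 20 * synthetic_5_states                                 # about $1.96 at 20x


def voi_gate_required(total_cost_usd: float, n_states: int) -> bool:
    """True when the measured per-state cost exceeds the $1/state threshold."""
    return total_cost_usd / n_states > GATE_THRESHOLD_PER_STATE
```

Under this reading the synthetic 5-state cost (about $0.10) sits inside the $0.05–$0.20 ballpark from §6.1, and even a 20× increase (about $1.96) stays below the $5 trigger, so exceeding it would point to something beyond longer histories.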


7. Open questions for ADR-002

  1. Do we promote TraceState to a top-level TraceExample dataclass, with optional teacher_id, reward, hint_text? Or keep TraceState as ingester output and DPOPair as trainer input, treating the brief's "TraceExample" as conceptual?
  2. Should TraceIngester.ingest() emit one record per assistant turn (current sketch) or per assistant tool_use block within a turn? Some Claude Code records have multiple tool_use blocks in one assistant message.
  3. Synthetic system prompt at replay time — yes/no? If yes, what content?
  4. Trace-version pinning: hard-fail or warn when version field falls outside a known-tested range?
  5. Subagent transcripts (agent-*.jsonl) — include or skip? They are denser per-turn but their parent context is the orchestrator, not the user, which changes the teacher-replay semantics.
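
Question 2 can be made concrete with a sketch that serializes one assistant turn either way. It reuses the (name, input) canonicalization from the §6 sketch; actions_for_turn and its per_block flag are hypothetical and not part of either the sketch or the shipped ingester.

```python
import json


def actions_for_turn(content_blocks: list[dict], *, per_block: bool) -> list[str]:
    """Serialize a turn's tool_use blocks as one action or as one action per block."""
    tool_uses = [
        {"name": b.get("name"), "input": b.get("input")}
        for b in content_blocks
        if isinstance(b, dict) and b.get("type") == "tool_use"
    ]
    if not tool_uses:
        return []  # text-only turns are out of scope for this comparison
    if per_block:
        # One action per tool_use block: finer-grained correction sites.
        return [json.dumps([tu], sort_keys=True) for tu in tool_uses]
    # One action per turn: the teacher must match the whole set of calls.
    return [json.dumps(tool_uses, sort_keys=True)]
```

Per-block emission multiplies the number of replay states for parallel tool calls, so it interacts directly with the cost check in §6.1.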

8. References (primary sources only)

Internal references:

  • spikes/005-integrated-trainer-skeleton/teacher_replay.py: TraceState, DPOPair, replay_trace, extract_dpo_pairs (read in full during this audit; see §1 for actual field list)
  • Spike 001 economic floor: $0.98/trace mean ungated, $0.30/trace projected with VOI gating