Wave 7: Phase 2-4 of deep work loop — backlog, parallel research, three ADRs

Imports the gap-closer wave from VISION_VALIDATION.md as a structured backlog,
records the deep work loop log, and locks three architecture decisions backed
by primary-source research from three subagent recons.

Backlog items (CPU-only, no GPU budget):
- Spike 006 — real HF model smoke (Wave 7 next)
- Spike 007 — real trace ingestion (Wave 8)
- Spike 008 — Streaming DiLoCo smoke (Wave 9)
- Wave 10 — packaging (pyproject.toml + examples/)
- Spike 002a-mini — Modal-gated GPU smoke (Phase 10)

Research (docs/research/, 1168 lines total):
- MODAL_RECONNAISSANCE.md — Modal pricing + setup, primary-sourced from
modal.com/pricing and modal.com/docs. Verdict: Modal L4 is $0.08-0.13 per
smoke run but loses to local 5090 by 10x on iteration cycle (3-5min vs
25-40s). Modal becomes correct for parallel sweeps, 7B+ models, multi-node
training, or CI repro — none of which apply to gap-closer wave.
- DILOCO_RECONNAISSANCE.md — audited 5 candidates. meta-pytorch/torchft wins:
BSD-3, 312 commits, HEAD 2026-04-03, library-not-research-code, prebuilt
wheels, single-process unit-testable via MagicMock(Manager) pattern. Their
DiLoCo class IS the Streaming generalization (vanilla = single fragment).
Sign-convention mismatch flagged for explicit test in Spike 008.
- TRACE_SOURCE_RECONNAISSANCE.md — corrected the name of the existing type (it is
the TraceState TypedDict, not a TraceExample dataclass). Recommended Claude Code session
JSONL: 1015 local sessions on this machine, zero acquisition cost, 6762
tool_use messages across 5 pre-selected sessions, schema validated by 4
independent community projects + JSON Schema validated against ~50000 real
messages.

ADRs (docs/adrs/):
- ADR-001 — GPU venue: local 5090. Modal stashed for parallel sweeps and 7B+
workloads. Migration path documented.
- ADR-002 — Trace source: Claude Code session JSONL. Pattern opens door for
OpenHands and SWE-smith ingesters in v0.2.
- ADR-003 — DiLoCo impl: torchft.local_sgd.DiLoCo with shared-buffer mock
allreduce for Spike 008 single-process test. Sign-convention mismatch
to be caught by an explicit unit test in Spike 008.

Deep work loop log: docs/DEEP_WORK_LOOP_LOG.md tracks all 12 phases with
status. Phases 1-4 complete; phase 5 (planning waves 7-10) is next.

Tests still 38/38 green; no code changes in this wave.

Refs: docs/VISION_VALIDATION.md gaps V2/V4/V5/V8.

Files changed (8)

BACKLOG.md +92 -0
docs/DEEP_WORK_LOOP_LOG.md +47 -0
docs/adrs/ADR-001-gpu-venue.md +102 -0
docs/adrs/ADR-002-trace-source.md +131 -0
docs/adrs/ADR-003-diloco-impl.md +100 -0
docs/research/DILOCO_RECONNAISSANCE.md +357 -0
docs/research/MODAL_RECONNAISSANCE.md +408 -0
docs/research/TRACE_SOURCE_RECONNAISSANCE.md +403 -0

BACKLOG.md ADDED

	@@ -0,0 +1,92 @@

+# Backlog — Composer 2.5 Replication Framework
+Imported from `docs/VISION_VALIDATION.md` § 6 (gaps) + § 9 (gap-closers) at 2026-05-26.
+## Active items (CPU-only, no GPU budget)
+### Spike 006 — Real HF model smoke (Wave 7)
+**Closes**: V8 ("any HF model") — currently we run only mock 4-layer toy LM through `composer_total_loss`.
+**Goal**: prove the 3-channel loss (`grpo + α·sdpo_kl + β·trace_replay_dpo`) survives a real `transformers` model + tokenizer with finite gradients and a decreasing loss across N steps.
+**Acceptance**:
+1. `AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")` loads on CPU.
+2. Real tokenizer `apply_chat_template` produces `input_ids` shape that flows through `composer_total_loss(model, batch)` without mock shapes.
+3. 5 backward steps run on CPU without `nan` / `inf` / shape mismatch.
+4. Loss is monotone non-increasing across 5 steps (trend; allow noise).
+5. New tests added under `spikes/006-real-hf-model-smoke/tests/` pass alongside existing 38.
+**Estimate**: half a day, CPU only.
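Criterion 4 above leaves "trend; allow noise" open. One way to make it binary is sketched below, under assumptions: each loss may exceed the best loss seen so far by a small relative budget, and the final loss must not exceed the first. The helper name `loss_trend_ok` and the 5% default tolerance are hypothetical and are not part of the spike.

```python
import math
from typing import Sequence


def loss_trend_ok(losses: Sequence[float], rel_tol: float = 0.05) -> bool:
    """Return True if losses are finite and trend downward, tolerating noise.

    Hypothetical helper: `rel_tol` is an assumed noise budget, not a value
    taken from the backlog.
    """
    if len(losses) < 2:
        return False
    if not all(math.isfinite(x) for x in losses):
        return False  # nan / inf already fails criterion 3
    best = losses[0]
    for x in losses[1:]:
        # A step may sit above the best loss so far by at most rel_tol.
        if x > best + rel_tol * abs(best):
            return False
        best = min(best, x)
    # Overall trend: the last loss must not exceed the first.
    return losses[-1] <= losses[0]
```

A check like this keeps the acceptance criterion binary while still tolerating a single noisy step, which a strict monotonicity test would reject.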
+### Spike 007 — Real trace ingestion (Wave 8)
+**Closes**: V5 ("real LLM-application traces") — Spike 001 used 50 hand-crafted states. Brief said "real traces."
+**Goal**: pick ONE real agent-session log format with stable, public schema, write a `TraceIngester` that converts it to our `TraceExample` dataclass, run end-to-end through the data collator + a trimmed cost-floor measurement on 5 real states.
+**Acceptance**:
+1. ADR-002 picks the trace source (Claude Code JSONL / Cline / OpenHands / Aider / SWE-Bench-Lite trajectories).
+2. `TraceIngester.ingest(path: Path) -> Iterator[TraceExample]` is implemented + has unit tests with a fixture log file.
+3. End-to-end smoke: real trace → ingester → collator → 1-step `composer_total_loss` runs without error.
+4. Cost-floor measurement: 5 real states × 3 teachers, p95 latency + cost report appended to `spikes/007-*/verdict.md`.
+**Estimate**: 1 day + ~$2 OpenRouter.
+### Spike 008 — Streaming DiLoCo smoke (Wave 9)
+**Closes**: V2 (DiLoCo "deferred to v0.2" — drift from original brief).
+**Goal**: bolt outer-loop pseudo-gradient sync onto the loss composition test using two `nn.Module` replicas on the same node. No real distributed training (CPU multiprocessing or single-process).
+**Acceptance**:
+1. ADR-003 picks the DiLoCo variant (vanilla DiLoCo from arXiv:2311.08105 / Streaming DiLoCo from arXiv:2501.18512 / Async-DiLoCo).
+2. `outer_optimizer.py` implements pseudo-gradient = (θ_local − θ_initial), Nesterov-momentum outer step.
+3. Smoke test: 2 replicas × 4 inner steps × 2 outer rounds on the toy model from Spike 005, both replicas converge toward the same solution within tolerance.
+4. 38 existing tests still pass (no regression).
+**Estimate**: 2 days, CPU.
+### Wave 10 — Packaging
+**Closes**: V4 ("skeleton not framework").
+**Goal**: turn the assemblage of spike directories into an installable Python package with a clear quickstart.
+**Acceptance**:
+1. `pyproject.toml` at repo root, package name `composer_replication`.
+2. `composer_replication/` dir with `__init__.py` re-exporting `composer_total_loss`, `OPSDLoss`, `TeacherReplayBuffer`, `compose_loss`, `TraceIngester`, etc.
+3. `examples/qwen3_05b_quickstart/` with end-to-end script that loads model, runs 10 training steps, prints loss curve.
+4. README quickstart updated to `pip install -e .` + `python examples/qwen3_05b_quickstart/run.py`.
+5. `pip install -e .` succeeds and quickstart runs end-to-end on CPU.
+**Estimate**: half a day, CPU.
+## Modal-gated (if budget allows after gap-closers)
+### Spike 002a-mini — Real GPU smoke (Phase 10)
+**Closes**: the "did we ever run gradients on GPU" ambiguity — currently everything is CPU-only.
+**Goal**: dispatch a 30-min A10G smoke on Modal that runs Spike 006 unchanged on GPU, verifies bf16 numerics, captures memory + step-time.
+**Acceptance**:
+1. ADR-001 says Modal is the right choice for this workload + estimate is < $5.
+2. Modal app builds, runs `composer_total_loss` for 50 steps on Qwen2.5-0.5B-Instruct.
+3. Loss curve + memory profile saved to `spikes/002a-mini/` and pulled to local.
+4. No new shape / dtype bug surfaced vs CPU run.
+**Estimate**: $1–3, 30 min wall-clock.
+## Deferred (post-loop, GPU-gated)
+- Spike 002a/002b — full trace collection on A100 ($30–50)
+- Spike 003 — DPO-pair signal density study
+- Spike 004 — A/B SWE-bench-lite with α=0/β=0 vs α>0/β>0
+- Publication wave — author identity, thumbnail, X tags, post sequence
+## Process notes
+- Acceptance criteria are explicit and binary. Don't claim "done" unless every box ticks.
+- Each spike has its own `spikes/00N-name/` dir + `verdict.md` recording acceptance + delta from estimate.
+- Re-audit BACKLOG.md at end of each wave; archive completed items with their final SHAs.

docs/DEEP_WORK_LOOP_LOG.md ADDED

	@@ -0,0 +1,47 @@

+# Deep Work Loop Log — Composer 2.5 Replication Framework
+Started: 2026-05-26
+Operator: Codeseys (Hermes Agent autonomous loop)
+Skill: `deep-work-loop` v1.0.0
+## Vision
+> Take any HuggingFace model → further RL train it using:
+> 1. RLVR (tests-pass reward),
+> 2. SDPO/hint-distillation (Composer 2.5's "targeted RL with textual feedback"),
+> 3. multi-teacher trace-replay DPO,
+> integrated against TRL/VeRL/OpenEnv with DiLoCo-style outer loop sync.
+>
+> Output: a published, reproducible framework — the "Composer 2.5 replication" the open ecosystem is missing.
+## Starting state
+- HEAD: `040eff8` (Wave 6: vision validation self-audit, 5/10 scorecard)
+- Tests: 38/38 green in `spikes/005-integrated-trainer-skeleton/`
+- Working tree: clean
+## Phase ledger
+| Phase | Description | Status | Started | Done |
+|---|---|---|---|---|
+| 1 | commit-state | ✅ | 2026-05-26 | 2026-05-26 |
+| 2 | backlog-audit (BACKLOG.md from VISION_VALIDATION) | ✅ | 2026-05-26 | 2026-05-26 |
+| 3 | parallel-research (3 subagents) | 🟡 | 2026-05-26 |  |
+| 4 | architect with ADRs (ADR-001..003) | ⏳ |  |  |
+| 5 | plan in waves (W7–W10) | ⏳ |  |  |
+| 6 | execute W7 — Spike 006 (real HF model smoke) | ⏳ |  |  |
+| 7 | execute W8 — Spike 007 (real trace ingestion) | ⏳ |  |  |
+| 8 | execute W9 — Spike 008 (DiLoCo smoke) | ⏳ |  |  |
+| 9 | execute W10 — packaging | ⏳ |  |  |
+| 10 | (Modal-gated) Spike 002a-mini real GPU smoke | ⏳ |  |  |
+| 11 | cross-model-final-review | ⏳ |  |  |
+| 12 | update scorecard + push | ⏳ |  |  |
+## Constraints
+- Verify ALL claims against primary sources (Wave 2 lesson — subagent synthesis is not evidence).
+- Tests must pass before commit.
+- Memory L1 is at 99% — write to L2 wiki + L3 fact_store, not L1.
+- Modal budget: $20 hard cap for this loop. Anything more goes to user for approval.
+- No `upload_file` mixing with `git push` — `git push hf master:main` only.
+- Commit messages via `-F /tmp/<wave>-commit-msg.txt`.

docs/adrs/ADR-001-gpu-venue.md ADDED

	@@ -0,0 +1,102 @@

+# ADR-001 — GPU venue for Spike 002a-mini smoke
+**Status**: Accepted
+**Date**: 2026-05-26
+**Wave**: Phase 4 (deep work loop)
+## Context
+Spike 002a-mini is the optional Phase-10 gate in the deep work loop: take the
+real-HF-model loss-composition smoke (Spike 006) and run it on GPU to confirm
+bf16 numerics, capture memory + step-time, and rule out CPU-only blind spots
+before publishing the framework.
+The user has:
+- a 5090 (32 GB VRAM, Blackwell) on the local box (this WSL host)
+- a configured Modal account (~/.modal.toml present, modal CLI installed)
+The workload:
+- `Qwen/Qwen2.5-0.5B-Instruct` (~1 GB bf16 weights)
+- ~50 forward+backward steps through the 3-channel loss
+- single GPU, no distributed training, no FSDP
+## Options considered
+### Option A — Local 5090
+- Free, no rate limit, no cold start.
+- Iteration loop: code change → run → fix → run is **~25-40 s wall-clock per cycle**.
+- 32 GB VRAM > 24 GB needed for this workload.
+- WSL CUDA path is the same one we use for eidolon training already; toolchain proven.
+- Reuses local HF cache (~/.cache/huggingface), no re-download per run.
+### Option B — Modal L4 ($0.000222/sec ≈ $0.799/hr)
+- $0.08-0.13 per smoke run (3-7 min wall-clock incl cold start).
+- Iteration loop: code change → modal-run dispatch → image build (cached) → cold
+  start → run → modal volume get → fix is **~3-5 min per cycle even on a cache hit**.
+- Persistent volume saves model re-download across runs.
+- Decoupled from local environment state.
+- Extensively documented gotchas in `mlops/modal-llm-training` skill (M1-M9).
+### Option C — Modal A100-40GB
+- ~3× cost of L4 for 0.5B workload that doesn't need the capacity. Ruled out.
+## Decision
+**Option A — local 5090.** The 5090 wins on the dimensions that matter most
+for a 0.5B-parameter verification smoke:
+| Dimension | 5090 (local) | Modal L4 |
+|---|---|---|
+| Iteration cycle | 25-40 s | 3-5 min (10× slower) |
+| $ / smoke run | $0 | $0.10 |
+| VRAM headroom | 32 GB > 24 GB needed | 24 GB ≈ 24 GB needed |
+| State decoupling | Same machine as dev | Decoupled (advantage Modal) |
+| Toolchain risk | Already proven | New for this workload |
+The "decoupled state" advantage of Modal is real but doesn't outweigh the 10×
+iteration penalty for what is fundamentally a verification step. We're not
+running production training; we're checking that a GPU run agrees with the
+CPU run we just did.
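The arithmetic behind the rate and iteration-cycle figures can be checked with a few lines. This is a sketch that uses only numbers quoted in this ADR; it counts GPU time only, so any CPU or memory charges on top are out of scope. The function names are hypothetical.

```python
L4_RATE_PER_SEC = 0.000222  # USD per GPU-second, as quoted for Option B


def gpu_cost_usd(seconds: float, rate_per_sec: float = L4_RATE_PER_SEC) -> float:
    """GPU-time cost only; excludes any CPU or memory charges."""
    return seconds * rate_per_sec


def slowdown(remote_s: float, local_s: float) -> float:
    """How many times longer one remote cycle takes than one local cycle."""
    return remote_s / local_s


hourly = gpu_cost_usd(3600)
best_case = slowdown(3 * 60, 40)   # fastest Modal cycle vs slowest local cycle
worst_case = slowdown(5 * 60, 25)  # slowest Modal cycle vs fastest local cycle

print(f"L4 hourly: ${hourly:.3f}")
print(f"Modal vs local cycle: {best_case:.1f}x to {worst_case:.1f}x slower")
```

With the quoted cycle times the slowdown works out to between 4.5× and 12×, so the 10× in the table sits toward the upper end of that range.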
+## Consequences
+### Accepted
+- Spike 002a-mini becomes a **local 5090** smoke, not a Modal job.
+- The `mlops/modal-llm-training` skill's L4 pattern (modal_app.py skeleton in
+  `docs/research/MODAL_RECONNAISSANCE.md`) is **stashed for future use** — it's
+  the right pattern when we DO need cloud GPU.
+- `docs/research/MODAL_RECONNAISSANCE.md` stays in the repo as the design
+  document for the Modal path; the file documents *why* we didn't use Modal
+  for this smoke and *when* Modal becomes correct.
+### Modal becomes the right choice when
+1. **Parallel parameter sweeps** — N independent runs across α, β, lr, etc.
+   that need to fan out faster than wall-clock-sequential on a single 5090.
+2. **Scaling to ≥7B base models** — 5090's 32 GB starts to bind on 7B + LoRA
+   + activation memory at seq 4096+. A100-40 or H100 becomes necessary.
+3. **Multi-node training** — DiLoCo-style outer-loop across 2+ physical
+   nodes for the eventual full RL run.
+4. **CI / reproducibility** — a future contributor wants to repro our results
+   without owning a 5090.
+These are all **post-replication** workloads. The deep work loop's gap-closer
+phase (W7-W10) doesn't need any of them.
+### Trade-offs explicitly accepted
+- We carry one local-environment dependency (WSL CUDA + the 5090 driver) that
+  Modal would have absorbed. Mitigated by: the same dependency is already
+  exercised by eidolon training, so the marginal risk is zero.
+- We don't get an audit-friendly "Modal app run with persistent receipts"
+  artifact. Mitigated by: capturing `nvidia-smi` snapshots + step-time CSV
+  into `spikes/006-real-hf-model-smoke/results/` as our local audit trail.
+## Source
+`docs/research/MODAL_RECONNAISSANCE.md` (subagent recon, primary-sourced from
+modal.com/pricing and modal.com/docs, 2026-05-26).

docs/adrs/ADR-002-trace-source.md ADDED

	@@ -0,0 +1,131 @@

+# ADR-002 — Trace source for Spike 007 (real LLM-application traces)
+**Status**: Accepted
+**Date**: 2026-05-26
+**Wave**: Phase 4 (deep work loop)
+## Context
+Spike 007 closes V5 of the vision validation: "real LLM-application traces."
+Spike 001 used 50 hand-crafted synthetic states for the cost-floor measurement.
+The framework's brief explicitly said *real traces*, so we owe Spike 007 a
+primary-sourced ingestion path that converts a real, public, multi-turn agent
+trace format into our existing `TraceState` TypedDict.
+Existing schema (verified from `spikes/005-integrated-trainer-skeleton/teacher_replay.py`):
+```python
+class TraceState(TypedDict):
+    state_id: str           # unique within the trace
+    messages: list[dict]    # OpenAI-style conversation up to + incl this step
+    student_action: str     # what the student did at this step
+```
+(Earlier deep-work-loop notes called this `TraceExample`. That name was a mistake:
+the actual type is `TraceState`, and there is no `TraceExample`.)
+## Options considered
+| Option | Schema | Acquisition | Signal density | License |
+|---|---|---|---|---|
+| (a) Claude Code session JSONL | Documented + 4 reverse-engineered schemas | **1,015 local sessions** zero-cost | per-step `tool_use` blocks = ideal teacher-correction sites | User-owned local files; framework MIT |
+| (b) Cline VS Code extension | No stable export schema | Would need custom extraction | Unknown until extracted | Apache 2.0 (extension), trace data user-owned |
+| (c) OpenHands trajectories | Documented (v0/v1 in flux) | Need to run OpenHands or download leaderboard submissions | Strong | MIT |
+| (d) Aider chat history | Markdown chat (lossy for tool calls) | Local only if user runs Aider | Weak — collapses tool structure | Apache 2.0 |
+| (e) SWE-bench leaderboard trajs | Heterogeneous, free-format | Public download | Strong but uneven | Per-submission (mostly permissive) |
+| (f) SWE-smith-trajectories (HF) | Messages-only, structure collapsed | HF dataset download | Strong but lossy | MIT |
+Source: `docs/research/TRACE_SOURCE_RECONNAISSANCE.md` (2026-05-26 subagent recon).
+## Decision
+**Option (a) — Claude Code session JSONL** at `~/.claude/projects/<encoded>/<sessionid>.jsonl`.
+Wins on every axis we care about for Spike 007:
+1. **Acquisition cost: zero.** 1,015 real sessions already on this machine
+   from the user's daily Claude Code use. No download, no consent
+   negotiation, no rate limiting, no schema change risk during ingestion
+   development.
+2. **Schema stability: empirically validated.** The subagent ran a programmatic
+   audit on 8 real sessions; record types are stable across all of them.
+   Anthropic publishes user-facing docs for the format; four independent
+   community projects (claude-code-cli-tools, claudeflow, etc.) ship
+   working parsers including one with a JSON Schema validated against
+   ~50,000 real messages.
+3. **Signal density: maximal.** Every `tool_use` block is a candidate
+   teacher-correction site. The 5 pre-selected sessions in the recon doc
+   contain 6,762 tool_use messages (range 125 → 2,830 per session). That is
+   over 100× as many candidate sites as Spike 001's 50 synthetic states.
+4. **License: clean.** The trace files are user-owned files on the user's
+   own machine. We don't redistribute them with the framework. The
+   *ingester* code we write is MIT and ships in the framework. Anyone
+   running the framework who wants real-trace ingestion uses their own
+   local Claude Code sessions.
+## Consequences
+### Accepted
+- Spike 007 implements `TraceIngester.ingest(path: Path) -> Iterator[TraceState]`
+  for the Claude Code JSONL format.
+- The TraceIngester ships as part of the package (Wave 10 packaging) under
+  `composer_replication.ingestion.claude_code`.
+- The recon doc's 5 pre-selected real sessions become the **smoke fixture**
+  for Spike 007's tests. We pin to a known set of session IDs so the test
+  is deterministic locally; CI users substitute their own.
+- `ingestion/` directory pattern is established now to support adding
+  ingesters for OpenHands and SWE-smith later if Spike 007 reveals
+  signal-density gaps.
+### Open questions resolved by ADR-002
+1. **Granularity** — One `TraceState` per assistant turn (not per `tool_use`).
+   A single assistant turn often emits multiple `tool_use` blocks for one
+   reasoning step; treating each tool_use as a separate state would
+   over-fragment the conversation. Discussion in TRACE_SOURCE_RECONNAISSANCE
+   §5.
+2. **`student_action` mapping** — The literal text of the assistant turn
+   (concatenated `text` blocks of the Claude message) becomes
+   `student_action`. The teacher-replay channel asks N teachers to produce
+   their version of "what should the assistant do here?" given the
+   `messages` history; we then DPO-compare teacher consensus vs literal
+   student text.
+3. **Thinking blocks** — Strip `thinking` blocks from the message history
+   passed to teachers (teachers don't have access to Claude's reasoning
+   trace). KEEP them in the `student_action` for the student's own
+   reproduction loop, since that's the actual generation we'd be RL-training.
+4. **System prompt** — Inject a synthetic system prompt at message[0] of
+   each `TraceState` describing "you are a coding agent" so teachers
+   without their own coding-agent system prompt have a fair playing field.
+5. **Subagent traces** — Skip them in v0.1; only ingest top-level sessions.
+   Subagent traces have a different structure (parent task ID etc.) that
+   would complicate the v0.1 ingester.
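The five decisions above can be sketched as a small line-based ingester. This is a minimal sketch and not the Spike 007 implementation. The record shape (`type`, `message.content` as a string or a list of typed blocks, an `isSidechain` flag on subagent records) is an assumption about the Claude Code JSONL format. The function name `ingest_lines` and the system prompt wording are hypothetical. Tool calls and tool results are dropped from the history here to keep the example short.

```python
import json
from typing import Iterable, Iterator, TypedDict


class TraceState(TypedDict):
    state_id: str
    messages: list[dict]
    student_action: str


SYSTEM_PROMPT = "You are a coding agent."  # placeholder wording (assumption)


def _blocks(content) -> list[dict]:
    """Normalise message content to a list of typed blocks."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return list(content or [])


def _join(blocks: list[dict], kinds: tuple[str, ...]) -> str:
    """Concatenate the text of blocks whose type is in `kinds`."""
    parts = []
    for b in blocks:
        if b.get("type") in kinds:
            parts.append(b.get("text") or b.get("thinking") or "")
    return "\n".join(p for p in parts if p)


def ingest_lines(lines: Iterable[str], session_id: str = "session") -> Iterator[TraceState]:
    # Decision 4: a synthetic system prompt leads every history.
    history: list[dict] = [{"role": "system", "content": SYSTEM_PROMPT}]
    turn = 0
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        rec = json.loads(raw)
        if rec.get("type") not in ("user", "assistant"):
            continue  # unknown record types are skipped, not fatal
        if rec.get("isSidechain"):
            continue  # decision 5: assumed marker for subagent traces
        blocks = _blocks((rec.get("message") or {}).get("content"))
        visible = _join(blocks, ("text",))  # decision 3: no thinking in history
        if rec["type"] == "assistant":
            # Decisions 1-3: one state per assistant turn; thinking kept here.
            action = _join(blocks, ("thinking", "text"))
            if action:
                yield TraceState(
                    state_id=f"{session_id}:{turn}",
                    messages=list(history),  # history before this turn
                    student_action=action,
                )
                turn += 1
        if visible:
            history.append({"role": rec["type"], "content": visible})
```

Taking an iterable of lines rather than a path keeps the sketch testable with an in-memory fixture; a `Path`-based `ingest` could wrap it by opening the file and passing the handle through.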
+### Recon-flagged risk (not blocking)
+- Anthropic doesn't publish a versioned schema. The TraceIngester pins to
+  known record-types as of 2026-05-26 and gracefully degrades on unknown
+  types. If Anthropic ships a breaking change to the JSONL format, we'd
+  need to bump a `schema_version` constant in the ingester. Acceptable
+  ongoing maintenance burden.
+### Future ingesters
+Open the door for two more ingesters in v0.2:
+- `composer_replication.ingestion.openhands` — for users who run OpenHands
+- `composer_replication.ingestion.swe_smith` — for users who download the HF dataset
+Both follow the same `Iterator[TraceState]` contract.
+## Source
+`docs/research/TRACE_SOURCE_RECONNAISSANCE.md` (subagent recon, primary-sourced
+including direct inspection of the user's local sessions, 2026-05-26).

docs/adrs/ADR-003-diloco-impl.md ADDED

	@@ -0,0 +1,100 @@

+# ADR-003 — DiLoCo implementation choice for Spike 008
+**Status**: Accepted
+**Date**: 2026-05-26
+**Wave**: Phase 4 (deep work loop)
+## Context
+Spike 008 closes V2 of the vision validation: DiLoCo was tagged "deferred to
+v0.2" in Wave 5 but the original brief said "combine with DiLoCo." We want a
+real working integration, not a hand-rolled toy.
+The integration target: take pseudo-gradient `δ = θ_local − θ_initial` after N
+inner steps, apply Nesterov-momentum outer step across replicas. We need a
+PyTorch-compatible reference implementation that runs in single-process for
+unit tests AND scales out on torch.distributed when we eventually run real
+multi-replica training.
+## Options considered
+| Repo | License | Last commit | Maturity | Streaming variant? | Single-process testable? |
+|---|---|---|---|---|---|
+| `meta-pytorch/torchft` | BSD-3 | 2026-04-03 | Library (PyPI, prebuilt wheels, real test suite, Meta-maintained) | Yes (DiLoCo class IS the Streaming generalization; vanilla = single fragment) | Yes (verified via `MagicMock(Manager)` + `_DummyWork` pattern in their own tests) |
+| `OpenDiLoCo` (PrimeIntellect) | Apache 2.0 | 2024 | README says "no longer maintained"; replaced by `prime` | Partial | Hivemind dependency complicates testing |
+| `prime` / INTELLECT-1 (PrimeIntellect) | Apache 2.0 | 2025 | Production framework (`ElasticDeviceMesh` etc.) | Yes | Heavy harness; not single-process friendly |
+| `diloco_simple` | **No LICENSE file** | 2024-05-31 | 8 commits ever; pedagogical | No | NCCL-locked |
+| DeepMind original (Douillard et al. arXiv:2311.08105) | — | — | No public reference impl | — | — |
+Source: `docs/research/DILOCO_RECONNAISSANCE.md` (subagent recon, 2026-05-26).
+## Decision
+**`meta-pytorch/torchft` — `torchft.local_sgd.DiLoCo`** (BSD-3, active,
+single-process testable).
+Rationale:
+1. **Library, not research code.** Proper packaging on PyPI with prebuilt
+   wheels (`pip install torchft-nightly`), real test suite, version history,
+   maintained by Meta. The other live candidates are research codebases that
+   break on torch version bumps.
+2. **Streaming DiLoCo is the generalization.** The `DiLoCo` class accepts
+   `model_fragments` + `fragment_sync_delay` + `fragment_update_alpha`. Set
+   `model_fragments=[model]` (single fragment, full-model sync) for vanilla
+   DiLoCo. Add fragments + per-fragment delays for Streaming. We don't have
+   to choose at the API level — both modes are one parameter apart.
+3. **Single-process unit-testable.** torchft's own tests use
+   `MagicMock(Manager)` + `_DummyWork` to bypass NCCL. We can do the same:
+   shared-buffer mock allreduce that does real averaging across two
+   in-process replicas. Verified working pattern in the recon doc.
+4. **Pseudo-gradient computation is in `_save_grads` (line 324) and
+   `perform_sync` (line 423).** Direct extension point — we can subclass or
+   monkey-patch these to compose with our Composer trainer.
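The shared-buffer mock allreduce from rationale item 3 can be illustrated without importing torchft. The sketch below uses plain Python lists in place of tensors and a bare `MagicMock()` in place of `MagicMock(Manager)`. `SharedBuffer` and `make_mock_manager` are hypothetical names, and the call shape of `allreduce` here is an assumption, not torchft's signature. In a real test the averaged values would be written back in place and a dummy work handle returned; that part depends on torchft internals and is left out.

```python
from unittest.mock import MagicMock


class SharedBuffer:
    """Collects one contribution per replica, then hands back the mean."""

    def __init__(self, world_size: int) -> None:
        self.world_size = world_size
        self.pending: list[list[float]] = []

    def contribute(self, values: list[float]) -> None:
        self.pending.append(list(values))

    def mean(self) -> list[float]:
        assert len(self.pending) == self.world_size, "not all replicas reported"
        return [sum(col) / self.world_size for col in zip(*self.pending)]


def make_mock_manager(buffer: SharedBuffer) -> MagicMock:
    """Stand-in for a manager object; only `allreduce` is given behaviour."""
    manager = MagicMock()
    manager.allreduce.side_effect = buffer.contribute
    return manager


buffer = SharedBuffer(world_size=2)
replica_a = make_mock_manager(buffer)
replica_b = make_mock_manager(buffer)

# Each replica "all-reduces" its pseudo-gradient into the shared buffer.
replica_a.allreduce([1.0, -2.0])
replica_b.allreduce([3.0, 0.0])

averaged = buffer.mean()
print(averaged)  # [2.0, -1.0]
```

Because both mocks write into the same buffer, the test exercises real averaging across two in-process replicas while never touching NCCL.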
+### Risks accepted (with mitigations)
+| Risk | Mitigation |
+|---|---|
+| **Sign convention mismatch** — torchft computes `θ_initial − θ_local` (negation of our spec) | Explicit unit test: assert outer step direction matches DiLoCo paper sign. Document the convention in our `outer_optimizer.py`. |
+| **Wheel brittleness for nightly** | Pin a specific dated nightly slug in our pyproject.toml; bump deliberately. |
+| **`torch>=2.7` requirement** | Confirm our existing eidolon venv has it. Already does (verified). |
+| **`fragment_sync_delay > 0` requires CUDA streams** | Spike 008 uses `fragment_sync_delay=0` (vanilla DiLoCo) for the smoke. Streaming with non-zero delay deferred to v0.2 (post-replication). |
+## Consequences
+### Accepted
+- Spike 008 imports `torchft.local_sgd.DiLoCo` and runs the recon doc's
+  ready-to-paste pytest pattern as the smoke:
+  - 2 replicas, 4 inner steps, 2 outer rounds on a TinyMLP
+  - shared-buffer mock allreduce (no NCCL)
+  - assertions: replica equality after sync, params actually moved, Nesterov
+    state populated, sync count matches expected
+- `composer_replication.diloco` package wraps `torchft.local_sgd.DiLoCo`
+  with our trainer's hooks. We DO NOT fork torchft — we depend on it as a
+  versioned wheel.
+- Our integration is "vanilla DiLoCo" (single fragment, full-model sync) for
+  v0.1. Streaming DiLoCo is a configuration-flag away in v0.2.
+- The sign-convention mismatch is **made explicit** in our wrapper code with
+  a unit test that catches a sign flip if torchft ever inverts it.
+### Rejected paths
+- **Roll our own DiLoCo.** Tempting (the algorithm is short) but the test
+  surface for distributed-correctness is large; reusing a Meta-maintained
+  library cuts the audit burden.
+- **`diloco_simple`.** Disqualified by the license absence alone.
+- **`prime` / INTELLECT-1.** Right tool for production multi-node runs,
+  wrong tool for a single-process unit test.
+## Source
+`docs/research/DILOCO_RECONNAISSANCE.md` (subagent recon, primary-sourced
+from torchft repo cloned + read locally, 2026-05-26).

docs/research/DILOCO_RECONNAISSANCE.md ADDED

	@@ -0,0 +1,357 @@

+# DiLoCo Reference Implementation Reconnaissance
+**Date:** 2026-05-25
+**Purpose:** Pick ONE PyTorch reference implementation of (Streaming) DiLoCo to bolt onto
+the composer-replication-framework outer-loop optimizer. Feeds ADR-003.
+**Bias:** simple + working > fancy + theoretically-better. Library > research codebase.
+---
+## TL;DR — Recommendation
+**Use `meta-pytorch/torchft`'s `torchft.local_sgd.DiLoCo` context manager.**
+It is a maintained library (not a research codebase), BSD-3 licensed, supports both
+vanilla DiLoCo and Streaming DiLoCo through one class, and — critically — is unit-testable
+in a single process by passing a `MagicMock(Manager)` whose `allreduce` returns a `_DummyWork`.
+Their own `torchft/local_sgd_test.py` already demonstrates the exact pattern Spike 008 needs.
+The Streaming DiLoCo paper (Douillard et al. 2025, arXiv:2501.18512) has no separate community
+implementation; torchft *is* the reference implementation as of mid-2026. PrimeIntellect's
+repos are too minimal (`diloco_simple`, no LICENSE, NCCL-locked, no Streaming),
+deprecated (`OpenDiLoCo`, hivemind-based, "no longer maintained" per its own README), or a
+full production stack rather than a library (`prime`).
+---
+## Candidates Audited (primary sources only)
+### A1. PrimeIntellect-ai/diloco_simple
+- URL: https://github.com/PrimeIntellect-ai/diloco_simple
+- License: **NONE** (no LICENSE file in repo — confirmed via `git clone` + `ls`).
+  All-rights-reserved by default under copyright law. **Cannot legally vendor or fork.**
+- Last commit: **2024-05-31** (`be38ec4 add weight decay`).
+- Activity: 8 commits total, ever. Two main authors. Effectively abandoned.
+- Shape: single 180-LOC research script (`pure_torch_diloco.py`), pedagogical demo.
+- Streaming DiLoCo? **No.** Vanilla DiLoCo only.
+- Distributed: **Hard-coded NCCL via `torchrun`** + `init_process_group(backend="nccl")`.
+  Pulls in `wandb`, `transformers`, HuggingFace `datasets`, `cyclopts`, and trains a
+  full LlamaForCausalLM on C4. Not a library — a benchmark script.
+- Verdict: **REJECT.** No license, no Streaming, no library API, NCCL-only, deps on
+  HF + wandb just to run. Useful as an *algorithm reference*, not as code to depend on.
+### A2. PrimeIntellect-ai/OpenDiLoCo
+- URL: https://github.com/PrimeIntellect-ai/OpenDiloco
+- License: present (Apache-2.0 typical, not re-verified — moot, see below).
+- Status: **Officially deprecated.** README first paragraph:
+  > "**Important Notice**: OpenDiLoCo is no longer maintained. For our production-ready
+  > distributed training solution, please check out `prime`."
+- Built on: `hivemind` (DHT-based decentralized training). Multi-machine only.
+- Streaming DiLoCo? No.
+- Verdict: **REJECT.** Deprecated by its authors. Hivemind dependency would force us to
+  set up DHT initial peers just to run a unit test.
+### A3. PrimeIntellect-ai/prime (a.k.a. INTELLECT-1 framework)
+- URL: https://github.com/PrimeIntellect-ai/prime — note: the GitHub org now uses this
+  repo for their CLI/SDK; the original training framework was rebranded.
+- The actual INTELLECT-1 training code uses an `ElasticDeviceMesh` abstraction and is
+  a full distributed training stack, not an algorithm library.
+- Verdict: **REJECT.** Production framework, not a drop-in library. Coupling a 1.5k-LOC
+  fault-tolerant elastic mesh into our test framework is the opposite of "simple + working".
+### A4. DeepMind reference implementation (Douillard et al., arXiv:2311.08105)
+- **No public reference implementation exists.** The DiLoCo paper is algorithm-only.
+  Confirmed: paper has no associated GitHub link in arXiv abstract or PDF; HuggingFace
+  papers page links no code. DeepMind has not open-sourced their internal trainer.
+- Verdict: **N/A — does not exist.**
+### A5. meta-pytorch/torchft  ← **CHOSEN**
+- URL: https://github.com/meta-pytorch/torchft
+- License: **BSD 3-Clause** (verified: `head -5 LICENSE` → "BSD 3-Clause License").
+- Last commit on main: **2026-04-03** (HEAD `7eb7087 Add torchcomms ProcessGroup shim
+  for fault-tolerant reconfiguration`).
+- Activity: 312 commits, multiple Meta contributors, recent commits across 2025 and 2026,
+  active CI, nightly PyPI builds at https://pypi.org/project/torchft-nightly/.
+- Shape: **library**, not a research codebase. `torchft/` is a proper Python package with
+  `local_sgd.py`, `manager.py`, `process_group.py`, `local_sgd_test.py` (real pytest unit
+  tests), pyproject.toml, BSD-3.
+- Streaming DiLoCo? **Yes** — the `DiLoCo` class is itself a Streaming DiLoCo
+  generalization (`fragment_sync_delay`, `fragment_update_alpha`); pass a single-element
+  `model_fragments=[model]` for vanilla DiLoCo.
+- Source comment confirms: `"""... DiLoCo paper: https://arxiv.org/pdf/2311.08105 /
+  Streaming DiLoCo paper: https://arxiv.org/pdf/2501.18512 """`
+---
+## Deep Dive: torchft (the chosen one)
+### (1) Repo metadata
+| Field | Value |
+|---|---|
+| URL | https://github.com/meta-pytorch/torchft |
+| License | BSD 3-Clause |
+| HEAD commit | `7eb7087` (2026-04-03) |
+| Total commits on main | 312 |
+| Activity level | **Active** — commits in 2025 + 2026, Meta-maintained, PyPI nightly builds |
+| Distribution | `pip install torchft-nightly` (prebuilt wheels) **OR** install from source (requires Rust + protobuf-compiler + maturin — only because of the Lighthouse/process-group Rust ext, not the algorithm code) |
+| Python | `requires-python = ">=3.8"`; `torch>=2.7` per `pyproject.toml` |
+### (2) Exact API / extension point
+The integration target is `torchft/local_sgd.py`. Two relevant classes:
+```python
+# Public class — drop-in context manager
+class DiLoCo:
+    def __init__(
+        self,
+        manager: Manager,                                  # we mock this
+        model_fragments: List[nn.Module],                  # [model] for vanilla DiLoCo
+        inner_optimizer: optim.Optimizer,
+        outer_optimizer: optim.Optimizer | list[optim.Optimizer],
+        sync_every: int,                                   # N inner steps
+        backup_device: Optional[torch.device] = None,
+        pin_memory: bool = True,
+        use_bucketization: bool = False,
+        bucket_cap_mb: Optional[int] = None,
+        should_quantize: bool = False,
+        fragment_sync_delay: int = 0,                      # τ in Streaming DiLoCo paper
+        fragment_update_alpha: float = 0.0,
+    ) -> None: ...
+```
+The **pseudo-gradient** is computed in `_StreamingDiLoCoFragment._save_grads()`
+(`torchft/local_sgd.py` line 324):
+```python
+def _save_grads(self) -> None:
+    """Saves pseudo-gradients of the parameters"""
+    with torch.no_grad():
+        for name, p in self._model_fragment.named_parameters():
+            local_param = p.to_local() if isinstance(p, DTensor) else p
+            pseudogradient = self.original_parameters[name].to(p.device) - local_param
+            self._grads[name] = pseudogradient
+```
+Note the **sign**: `original − local` (i.e. `θ_initial − θ_local`). This value is later
+copied into `p.grad` via `_set_grads`. Assuming the parameters have been reset to
+`θ_initial` before the outer step, a plain SGD update `p ← p − lr · grad` becomes
+`p ← θ_initial − lr · (θ_initial − θ_local)`, i.e. a step *toward* `θ_local`. Our spec
+says δ = θ_local − θ_initial; torchft uses the negation. Either convention works as
+long as the sign handed to the outer optimizer is consistent: torchft pairs a positive
+`outer_lr` (e.g. 0.7) with SGD, which subtracts the grad, so the math nets out. **Be
+careful when unit-testing the sign in Spike 008.**
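The sign bookkeeping above can be checked with scalars before touching torchft. The following is a minimal sketch, assuming plain SGD without momentum and a parameter that sits at `θ_initial` when the outer step runs; `outer_sgd_step` and the numeric values are hypothetical and are not torchft code.

```python
def outer_sgd_step(theta_initial: float, grad: float, lr: float) -> float:
    """Plain SGD update p <- p - lr * grad, starting from theta_initial."""
    return theta_initial - lr * grad


theta_initial, theta_local, outer_lr = 1.0, 2.0, 0.7

# torchft convention: pseudo-gradient = theta_initial - theta_local
torchft_grad = theta_initial - theta_local
after_torchft = outer_sgd_step(theta_initial, torchft_grad, outer_lr)

# Spec convention: delta = theta_local - theta_initial, negated before the optimizer
spec_delta = theta_local - theta_initial
after_spec = outer_sgd_step(theta_initial, -spec_delta, outer_lr)

# Bug case: feeding the un-negated delta to SGD moves AWAY from theta_local
after_wrong = outer_sgd_step(theta_initial, spec_delta, outer_lr)

print(after_torchft, after_spec, after_wrong)
```

Both conventions land on the same point between `θ_initial` and `θ_local`, while the un-negated delta moves the parameter in the opposite direction. That last case is the failure a direction test in Spike 008 would need to catch.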
+The **outer Nesterov step** is in `_StreamingDiLoCoFragment.perform_sync()` (line 423):
+```python
+if should_commit:
+    self._set_grads()                  # write pseudogradient into p.grad
+    self._outer_optimizer.step()       # Nesterov SGD step (user-provided)
+    self.save_parameters()
+    self._merge_parameters()
+self._outer_optimizer.zero_grad()
+```
+The Nesterov-ness lives in the user-provided outer optimizer, e.g.:
+```python
+outer_optimizer = torch.optim.SGD(model.parameters(), lr=0.7, momentum=0.9, nesterov=True)
+```
+This matches the DiLoCo paper exactly (Douillard §3 specifies Nesterov momentum outer).
+The cross-replica all-reduce happens in `_average_grads()` (called from `prepare_sync`)
+via `self._manager.allreduce(...)` — which is the seam we mock for single-process tests.
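To make the outer update concrete, here is a scalar sketch of one Nesterov SGD step applied to a pseudo-gradient. It assumes the update rule of `torch.optim.SGD` with `nesterov=True`, `dampening=0` and no weight decay; `nesterov_sgd_step` is a hypothetical helper written for illustration, not part of torchft or PyTorch.

```python
from typing import Optional, Tuple


def nesterov_sgd_step(
    p: float, grad: float, buf: Optional[float], lr: float, momentum: float
) -> Tuple[float, float]:
    """One scalar SGD step with Nesterov momentum (dampening assumed to be 0)."""
    # The momentum buffer starts as the first gradient, then accumulates.
    buf = grad if buf is None else momentum * buf + grad
    # Nesterov look-ahead: gradient plus momentum times the updated buffer.
    update = grad + momentum * buf
    return p - lr * update, buf


theta_initial, theta_local_avg = 1.0, 2.0
pseudo_grad = theta_initial - theta_local_avg        # torchft sign convention

p1, buf1 = nesterov_sgd_step(theta_initial, pseudo_grad, None, lr=0.7, momentum=0.9)
print(p1, buf1)
```

With these example values the first outer step overshoots `θ_local` (the result is about 2.33), because the look-ahead term adds `momentum × buf` on top of the raw pseudo-gradient. A hand-computed expectation in Spike 008 should account for that rather than assume a pure interpolation.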
+### (3) torch.distributed dependency for testing?
+**No, not for unit tests.** The `Manager` is mockable. From `torchft/local_sgd_test.py`:
+```python
+import torch
+from unittest.mock import create_autospec, MagicMock
+from torchft.manager import Manager
+from torchft.work import _DummyWork
+def create_manager() -> MagicMock:
+    manager = create_autospec(Manager)
+    manager.errored.return_value = None
+    def mock_allreduce(tensor: torch.Tensor, should_quantize: bool = False):
+        return _DummyWork(tensor)        # returns the same tensor unchanged
+    manager.allreduce.side_effect = mock_allreduce
+    return manager
+```
+This bypasses NCCL/Gloo entirely. `_DummyWork` just wraps the tensor and returns it as
+the "all-reduced" result, so a single-process test with `world_size=1` works directly,
+and a 2-replica test is achieved by running two `DiLoCo` instances with two model
+copies in the same process and a `mock_allreduce` that *averages* the two tensors
+manually before returning. (Their `test_bucketization_correctness` does exactly this.)
+For real distributed runs torchft uses Gloo or NCCL via `torchft.process_group`
+(reconfigurable PGs that wrap `torch.distributed`). We do not need this for Spike 008.
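The mocking seam described in this section can be illustrated without torch or torchft installed. The sketch below is an assumption-laden stand-in: `ToyManager` and `make_mock_manager` are hypothetical names, tensors are replaced by lists of floats, and the mock returns the averaged values directly rather than a work handle such as `_DummyWork`.

```python
from typing import List
from unittest.mock import create_autospec


class ToyManager:
    """Hypothetical stand-in for a manager that exposes allreduce."""

    def allreduce(self, values: List[float]) -> List[float]:
        raise NotImplementedError("a real manager would talk to a process group")


def make_mock_manager(peer_values: List[float]):
    """Build a mock whose allreduce averages the caller's values with a fixed peer."""
    mgr = create_autospec(ToyManager, instance=True)

    def avg_allreduce(values: List[float]) -> List[float]:
        # Element-wise mean of "this replica" and the precomputed peer replica.
        return [(a + b) / 2.0 for a, b in zip(values, peer_values)]

    mgr.allreduce.side_effect = avg_allreduce
    return mgr


mgr = make_mock_manager(peer_values=[3.0, 5.0])
averaged = mgr.allreduce([1.0, 1.0])
print(averaged)
```

Passing the peer's values in up front keeps the mock independent of call ordering between replicas, which is one possible design for a single-process test.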
+### (4) Library, research codebase, or paper-companion?
+**Library.** Strong evidence:
+- Proper Python package layout (`torchft/__init__.py`, modules per concern).
+- Real unit tests (`*_test.py` per module) — not "run this script" demos.
+- BSD-3-Clause LICENSE (vs. diloco_simple having none, signaling "personal demo").
+- Nightly PyPI distribution (`torchft-nightly`) with prebuilt wheels.
+- Documentation site at https://pytorch.org/torchft.
+- `meta-pytorch` org — Meta-internally maintained; lives next to `torchtitan`.
+- README explicitly: *"torchft is designed to provide the primitives required to
+  implement fault tolerance in any application/train script"* — i.e. a building block.
+Only friction: installing **from source** needs Rust (pyo3 + maturin) and
+protobuf-compiler. This is for the Rust Lighthouse/process-group extension which we
+**do not need** for Spike 008's mock-based tests. Two clean options:
+- (a) `pip install torchft-nightly` — uses prebuilt wheel, no Rust toolchain needed.
+- (b) Vendor `torchft/local_sgd.py` + the few helpers (`work.py::_DummyWork`,
+  type stubs for `Manager`) into our repo under BSD-3 attribution. ~700 LOC total.
+### (5) Minimum viable test pattern for Spike 008
+Goal: **2 replicas × 4 inner steps × 2 outer rounds on a tiny model**, single-process, no NCCL.
+```python
+# spikes/008-diloco-outer-loop/tests/test_diloco_two_replicas.py
+"""
+Spike 008: prove the DiLoCo outer-loop math is correct under our framework.
+Runs entirely in a single process, no torch.distributed required.
+"""
+import copy
+import torch
+import torch.nn as nn
+import torch.optim as optim
+from unittest.mock import create_autospec, MagicMock
+from torchft.local_sgd import DiLoCo
+from torchft.manager import Manager
+from torchft.work import _DummyWork
+class TinyMLP(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
+    def forward(self, x): return self.net(x)
+def _make_avg_manager(replica_buffer):
+    """Manager whose allreduce averages tensors across replicas via shared buffer."""
+    mgr = create_autospec(Manager)
+    mgr._use_async_quorum = False
+    mgr.errored.return_value = None
+    mgr.should_commit.return_value = True
+    mgr.current_step.return_value = 0
+    def avg_allreduce(tensor, should_quantize=False):
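+        # Cross-replica average through the shared buffer.
+        # CAVEAT (untested sketch): each replica's sync runs to completion inside its own
+        # optimizer hook, so replica A issues all of its allreduce calls before replica B
+        # issues any. Pairing calls by arrival order, as below, would then average two of
+        # A's own tensors and never hand A the mean. Spike 008 should key the buffer by
+        # (replica, call index) or feed each mock the peer's precomputed tensor.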
+        replica_buffer.append(tensor.clone())
+        if len(replica_buffer) == 2:
+            mean = (replica_buffer[0] + replica_buffer[1]) / 2.0
+            tensor.copy_(mean)
+            replica_buffer.clear()
+        return _DummyWork(tensor)
+    mgr.allreduce.side_effect = avg_allreduce
+    return mgr
+def test_diloco_two_replicas_four_inner_two_outer():
+    torch.manual_seed(0)
+    model_a = TinyMLP()
+    model_b = copy.deepcopy(model_a)              # identical init = same θ_initial
+    # Inner optimizers (one per replica)
+    inner_a = optim.AdamW(model_a.parameters(), lr=1e-3)
+    inner_b = optim.AdamW(model_b.parameters(), lr=1e-3)
+    # Outer Nesterov (one per replica, same hyperparams)
+    outer_a = optim.SGD(model_a.parameters(), lr=0.7, momentum=0.9, nesterov=True)
+    outer_b = optim.SGD(model_b.parameters(), lr=0.7, momentum=0.9, nesterov=True)
+    # Shared buffer — both DiLoCo wrappers funnel through one "process group" of size 2
+    buf = []
+    mgr_a = _make_avg_manager(buf)
+    mgr_b = _make_avg_manager(buf)
+    SYNC_EVERY = 4   # 4 inner steps per outer round
+    OUTER_ROUNDS = 2
+    with DiLoCo(mgr_a, [model_a], inner_a, outer_a, sync_every=SYNC_EVERY) as dla, \
+         DiLoCo(mgr_b, [model_b], inner_b, outer_b, sync_every=SYNC_EVERY) as dlb:
+        # Snapshot θ_initial
+        theta_initial_a = {n: p.detach().clone() for n, p in model_a.named_parameters()}
+        for outer_round in range(OUTER_ROUNDS):
+            for inner_step in range(SYNC_EVERY):
+                # Replicas see DIFFERENT data — that is the whole point of DiLoCo
+                x_a = torch.randn(8, 4) + 0.1 * outer_round
+                x_b = torch.randn(8, 4) - 0.1 * outer_round
+                y_a, y_b = torch.randn(8, 2), torch.randn(8, 2)
+                inner_a.zero_grad(); inner_b.zero_grad()
+                ((model_a(x_a) - y_a) ** 2).mean().backward()
+                ((model_b(x_b) - y_b) ** 2).mean().backward()
+                inner_a.step()   # Inner step. Sync fires automatically inside post-hook
+                inner_b.step()   # at step % SYNC_EVERY == 0.
+        # Assertions:
+        # 1. Both replicas now hold IDENTICAL parameters (they were averaged via mock allreduce).
+        for (na, pa), (nb, pb) in zip(model_a.named_parameters(), model_b.named_parameters()):
+            torch.testing.assert_close(pa, pb, msg=f"Replicas diverged at {na}")
+        # 2. Parameters changed from θ_initial (outer optimizer actually stepped).
+        any_change = any(
+            not torch.equal(p, theta_initial_a[n]) for n, p in model_a.named_parameters()
+        )
+        assert any_change, "outer optimizer did not move the parameters"
+        # 3. The outer optimizer holds Nesterov momentum state for every parameter
+        #    (proves the SGD(nesterov=True) actually ran).
+        n_params = len(list(model_a.parameters()))
+        assert len(outer_a.state_dict()["state"]) == n_params
+        # 4. Sync fired once per outer round per replica.
+        assert mgr_a.start_quorum.call_count == OUTER_ROUNDS
+        assert mgr_b.start_quorum.call_count == OUTER_ROUNDS
+```
+**Why this works:**
+- `DiLoCo` registers a post-step hook on `inner_optimizer` (see `__enter__`). The
+  hook increments `_local_step` and triggers `prepare_sync` / `perform_sync` on every
+  `sync_every` boundary — fully automatic, our test only calls `inner.step()`.
+- `_DummyWork.wait()` is a no-op. `_average_grads` calls `manager.allreduce(...)`
+  which our `avg_allreduce` mocks to do real cross-replica averaging through `buf`.
+- `manager.should_commit.return_value = True` lets the outer optimizer fire on each
+  outer round; setting it to `False` lets us also test rollback semantics.
+- All single-process, so it runs under plain pytest. Place the test in
+  `spikes/008-diloco-outer-loop/tests/` (the path used in the listing above), following
+  the layout of `spikes/005-integrated-trainer-skeleton/tests/`.
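The hook-driven trigger described in the first bullet can be sketched in plain Python. This is an illustration only: `ToyOptimizer` and `SyncEvery` are hypothetical classes, the modulus check is an assumption about how the boundary is detected, and a real wrapper would call `prepare_sync` / `perform_sync` where the counter is bumped.

```python
from typing import Callable, List


class ToyOptimizer:
    """Hypothetical optimizer that only supports post-step hooks."""

    def __init__(self) -> None:
        self._post_hooks: List[Callable[[], None]] = []

    def register_step_post_hook(self, hook: Callable[[], None]) -> None:
        self._post_hooks.append(hook)

    def step(self) -> None:
        # A real optimizer would update parameters here, then run its hooks.
        for hook in self._post_hooks:
            hook()


class SyncEvery:
    """Counts inner steps and fires a sync on every sync_every boundary."""

    def __init__(self, optimizer: ToyOptimizer, sync_every: int) -> None:
        self.sync_every = sync_every
        self.local_step = 0
        self.syncs = 0
        optimizer.register_step_post_hook(self._on_step)

    def _on_step(self) -> None:
        self.local_step += 1
        if self.local_step % self.sync_every == 0:
            self.syncs += 1  # placeholder for the outer-loop sync


opt = ToyOptimizer()
wrapper = SyncEvery(opt, sync_every=4)
for _ in range(8):
    opt.step()
print(wrapper.local_step, wrapper.syncs)
```

The caller only ever invokes `step()`; the sync count follows from the hook, which is the property the Spike 008 test relies on.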
+**Install for this spike:** `pip install torchft-nightly` in the eidolon venv. If the
+nightly wheel proves brittle, fallback: vendor `local_sgd.py` + `work.py` + a
+minimal `manager.py` stub (≈800 LOC) into `framework/diloco/_vendored/` with BSD-3
+attribution.
+---
+## Risks & Mitigations
+| Risk | Likelihood | Mitigation |
+|---|---|---|
+| `torchft-nightly` wheel breaks against torch 2.x | Med | Pin to a specific nightly hash; or vendor `local_sgd.py` directly under BSD-3. |
+| `torchft.manager.Manager` import pulls in Rust ext at import time | Low | The class is importable as a type; `MagicMock` replaces it. If import touches Rust, we vendor. Verified: the import in `local_sgd.py` is `from torchft.manager import Manager` — only used as a type annotation in our test path. |
+| Sign convention of pseudogradient causes our outer optimizer to move the wrong way | Med | Assertion 2 in the test pattern above only checks "params moved from initial", not the direction. A second test should compare the direction against a hand-computed expected value. |
+| `fragment_sync_delay > 0` (true Streaming) requires CUDA streams | Med | Spike 008 starts with `fragment_sync_delay=0` (= vanilla DiLoCo). Streaming variant deferred to Spike 009 once basic loop works. |
+| Requires `torch>=2.7` per pyproject | Low | Framework already on torch 2.x; check exact pin. If <2.7, we vendor. |
+---
+## Decision (for ADR-003)
+Adopt **`torchft.local_sgd.DiLoCo`** as the reference DiLoCo / Streaming DiLoCo
+implementation. Integrate via `pip install torchft-nightly` for Spike 008. If
+brittleness emerges, vendor `local_sgd.py` (BSD-3) into `framework/diloco/_vendored/`.
+For the framework's outer-loop optimizer abstraction (the actual ADR-003 question):
+mirror torchft's `DiLoCo(manager, [model_fragments], inner_opt, outer_opt, sync_every)`
+constructor shape so that swapping our wrapper for the upstream class is a one-line
+change. Compute pseudogradient as `θ_local − θ_initial` (our convention) and negate
+when handing to the outer optimizer, OR follow torchft's `θ_initial − θ_local`
+convention end-to-end. **Pick one and document it loudly.**

docs/research/MODAL_RECONNAISSANCE.md ADDED

	@@ -0,0 +1,408 @@

+# Modal Reconnaissance — Composer 2.5 Replication GPU Smoke
+**Audience:** trainer integrator running a one-shot 30-minute verification smoke for the Composer 2.5 Replication Framework (`spikes/005-integrated-trainer-skeleton/`).
+**Workload:** `Qwen/Qwen2.5-0.5B-Instruct` (≈ 1 GB fp16 weights), ~50 forward+backward steps, custom 3-channel loss = `GRPO + α·SDPO-KL + β·trace-replay-DPO`. Batch size ≤ 4 sequences ≤ 2048 tokens. **Goal: prove the loss runs end-to-end and capture mem + step time.** This is *not* training — it's a smoke.
+**Cap:** $5. **Local hardware:** RTX 5090, 32 GB VRAM, Modal CLI already configured (`~/.modal.toml`).
+**Bottom line up front:** *Run it locally on the 5090.* Modal is the wrong tool for this specific job. The skeleton + price math below is for future scale-out, not the smoke.
+---
+## 1. Recommended Modal GPU type & estimated cost
+### 1.1 Pricing table (from primary source)
+All values copied verbatim from <https://modal.com/pricing> (fetched for this report). Modal bills per **second** of compute, not per minute or hour.
+| GPU            | Modal `gpu=` string  | $ / sec      | $ / hour | VRAM   | Verdict for this smoke |
+|----------------|----------------------|--------------|----------|--------|------------------------|
+| Nvidia T4      | `"T4"`               | 0.000164     | 0.590    | 16 GB  | Too small for safe headroom on 3 fwd passes |
+| **Nvidia L4**  | `"L4"`               | **0.000222** | **0.799**| 24 GB  | ✅ **Recommended** — cheapest GPU that fits comfortably |
+| Nvidia A10     | `"A10"`              | 0.000306     | 1.102    | 24 GB  | Acceptable; ~38% pricier than L4 for marginal speedup at sub-1B |
+| Nvidia L40S    | `"L40S"`             | 0.000542     | 1.951    | 48 GB  | Overkill — Modal's default rec, but unjustified at 0.5B |
+| Nvidia A100-40GB| `"A100-40GB"`       | 0.000583     | 2.099    | 40 GB  | Overkill |
+| Nvidia A100-80GB| `"A100-80GB"`       | 0.000694     | 2.498    | 80 GB  | Overkill |
+| Nvidia H100    | `"H100!"`            | 0.001097     | 3.949    | 80 GB  | Wasteful |
+(`H100!` suffix = pin to H100, opt out of Modal's automatic H200 upgrade. See <https://modal.com/docs/guide/gpu#automatic-upgrades-to-h200s>.)
+**Auxiliary costs** (also primary, same page):
+- CPU: $0.0000131 / physical-core / sec → ~$0.047 / core-hour. Min 0.125 cores per container.
+- RAM: $0.00000222 / GiB / sec → ~$0.008 / GiB-hour.
+- Volumes: $0.09 / GiB / month (first 1 TiB / mo free on the workspace).
+- Starter plan: **$30 / month free credits** — your smoke is free if you haven't burned the budget elsewhere.
+### 1.2 Why L4, not A10 or A100-40GB
+The `mlops/modal-llm-training` skill defaults to L4/A10 for "small smokes", and that holds here. The framing: *Qwen2.5-0.5B in fp16 is ~1 GB of weights. The 3-channel loss does ≥3 forward passes per step (student-grad, teacher no-grad for SDPO, chosen+rejected for DPO). Even with the teacher forward held in memory you are nowhere near 24 GB.*
+Concrete VRAM math for the workload (back of envelope, batch=2, seq=1024, bf16):
+- Weights: ~1.0 GB
+- Optimizer state (AdamW, fp32 m+v): ~4 GB (8 bytes × 0.5B params)
+- Gradients (bf16): ~1 GB
+- Activations for student fwd at B=2,T=1024: ~1–2 GB
+- Teacher fwd (no grad, no act save): ~0.3 GB
+- DPO chosen+rejected fwds (with grad): ~2–3 GB
+- HF transformers overhead, KV scratch, framework: ~2 GB
+- **Subtotal: ~11–14 GB** — comfortably inside 24 GB on L4.
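The back-of-envelope budget above can be reproduced as a small script. This is a sketch that copies the per-item estimates from the list as (low, high) pairs in GB; the dictionary keys are hypothetical labels, and the 0.5B parameter count is the rounded figure used in the text.

```python
# (low, high) estimates in GB, copied from the list above
budget_gb = {
    "weights": (1.0, 1.0),
    "adamw_state_fp32": (4.0, 4.0),
    "gradients_bf16": (1.0, 1.0),
    "student_activations": (1.0, 2.0),
    "teacher_forward": (0.3, 0.3),
    "dpo_forwards": (2.0, 3.0),
    "framework_overhead": (2.0, 2.0),
}

low_gb = sum(low for low, _ in budget_gb.values())
high_gb = sum(high for _, high in budget_gb.values())

# AdamW keeps two fp32 tensors (m and v) per parameter: 8 bytes per parameter.
adamw_state_gb = 0.5e9 * 8 / 1e9

L4_VRAM_GB = 24
print(f"estimated range: {low_gb:.1f} to {high_gb:.1f} GB of {L4_VRAM_GB} GB")
```

Summing the items gives a range that rounds to the ~11–14 GB subtotal, leaving roughly 10 GB of headroom on a 24 GB card.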
+**A10 is also fine** but costs 38% more for ~30–50% extra throughput on a workload where the GPU is already step-time-bound by Python overhead (see §3). Pay the L4 rate.
+**A100-40GB is wrong.** You're paying 2.6× the L4 rate for memory you don't use and FLOPS that, on a 0.5B model with bs=2, you can't saturate. The Modal docs explicitly warn against this: *"Before you jump for the most powerful (and so most expensive) GPU, make sure you understand where the bottlenecks are…"* (<https://modal.com/docs/guide/gpu#b200-gpus>).
+**T4 declined** because: (a) only 16 GB VRAM — tight given 3 fwd passes; (b) old Turing arch lacks bf16 hardware; you'd be on fp16/fp32, which trips up `transformers` flash-attention paths and adds debug surface area on a smoke that's already debugging custom loss code.
+### 1.3 Cost projection for the actual smoke
+Assume a 30-min wall-clock budget that breaks down realistically as:
+- Container cold-start + image pull: 30–90 s on the first run. Modal's container infra boots in ~1 s, but an image with torch+transformers needs a one-time pull. See <https://modal.com/docs/guide/cold-start>.
+- HF model download (Qwen 2.5-0.5B = ~1 GB shards) on first run: 15–45 s — **should be cached on a Modal Volume after run 1**.
+- Setup inside fn (CUDA init, model.from_pretrained, optimizer build): 20–40 s.
+- 50 training steps × ~2–4 s/step (3-channel loss, bs=2, seq=1024 on L4): **100–200 s**.
+- Logging, save, exit: 5 s.
+**Realistic total: ~3–7 minutes of GPU-billed time per run.**
+Cost per run on L4:
+- Lower bound (3 min): 180 s × $0.000222 = **$0.040**
+- Upper bound (7 min): 420 s × $0.000222 = **$0.093**
+- Plus CPU/RAM overhead (4 cores × 16 GB RAM): ~420 s × (4 × $0.0000131 + 16 × $0.00000222) = ~$0.037
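+**Per-run all-in: $0.08 – $0.13 on L4.** The lower figure conservatively charges the full 7-minute CPU/RAM overhead against the 3-minute GPU cost; with only 3 minutes of overhead it drops to about $0.06. You can run the smoke ~50× before nudging the $5 cap. Comfortable.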
+For comparison, the same scenario with the same ~$0.037 CPU/RAM overhead costs ~$0.09 – $0.17 per run on A10 and ~$0.14 – $0.28 on A100-40GB. Still all under cap, but L4 is the rational pick.
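The per-run arithmetic for the L4 can be reproduced with a few lines. This is a sketch using the per-second rates quoted in §1.1; the function names and the 4-core / 16 GiB container shape are taken from this report's assumptions, not from any Modal API.

```python
L4_PER_SEC = 0.000222          # $ per GPU-second, from the pricing table
CPU_PER_CORE_SEC = 0.0000131   # $ per physical core per second
RAM_PER_GIB_SEC = 0.00000222   # $ per GiB per second


def gpu_cost(seconds: float, rate: float = L4_PER_SEC) -> float:
    """GPU charge for a run of the given billed duration."""
    return seconds * rate


def cpu_ram_cost(seconds: float, cores: int = 4, ram_gib: int = 16) -> float:
    """CPU + RAM charge for the container shape assumed in this report."""
    return seconds * (cores * CPU_PER_CORE_SEC + ram_gib * RAM_PER_GIB_SEC)


low_gpu = gpu_cost(180)        # 3-minute run
high_gpu = gpu_cost(420)       # 7-minute run
overhead = cpu_ram_cost(420)   # CPU/RAM for the 7-minute case

print(f"GPU: ${low_gpu:.3f} to ${high_gpu:.3f}, CPU/RAM: ${overhead:.3f}")
print(f"upper bound all-in: ${high_gpu + overhead:.2f}")
```

Swapping `rate` for another row of the pricing table gives the equivalent figure for a different GPU.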
+### 1.4 Region & preemption multipliers (DON'T trip on these)
+From the pricing-page footer:
+- **Region selection: 1.5–1.75× base price.** Don't pin to a region unless you must.
+- **Non-preemptible execution: 3× base price.** Default is preemptible; leave it. A 30-min smoke that gets preempted is fine; just retry. Opting into non-preemptible mode would push L4 to ~$2.40/hr (3 × $0.799) and is unjustified.
+---
+## 2. Minimal `modal_app.py` skeleton
+This is the actual file to drop into the repo, e.g. at `spikes/005-integrated-trainer-skeleton/modal_app.py`. It is intentionally one file, with no abstraction, sized for the smoke. Image pins are conservative — match what the user is running locally to avoid version drift between local debugging and Modal runs.
+```python
+"""modal_app.py — GPU smoke for the Composer 2.5 Replication Framework.
+Goal: run ~50 forward+backward steps of the 3-channel loss
+(GRPO + SDPO-KL + trace-replay-DPO) against Qwen/Qwen2.5-0.5B-Instruct,
+capture peak VRAM and per-step latency, and exit. Single L4, single container.
+Run:    modal run modal_app.py
+Logs:   the function's print() output streams to your terminal.
+"""
+from __future__ import annotations
+import modal
+# ---------------------------------------------------------------------------
+# 1) App + image
+# ---------------------------------------------------------------------------
+# Pin torch to a CUDA build that matches Modal's L4 driver (CUDA 12.x).
+# Pin transformers/peft/trl to a known-good combination — the trainer skeleton
+# was developed against transformers >= 4.45 and trl >= 0.12 for GRPOTrainer.
+# If you bump any of these, re-verify GRPOTrainer._compute_loss is still the
+# correct override hook (DeepWiki audit anchor: huggingface/trl).
+image = (
+    modal.Image.debian_slim(python_version="3.11")
+    .apt_install("git")
+    .pip_install(
+        "torch==2.4.1",                  # CUDA 12.1 wheel from PyPI default index
+        "transformers==4.46.3",
+        "accelerate==1.1.1",
+        "peft==0.14.0",
+        "trl==0.12.2",
+        "datasets==3.1.0",
+        "huggingface_hub==0.26.5",
+        "hf_transfer",                   # required because HF_HUB_ENABLE_HF_TRANSFER=1 below
+    )
+    .env({
+        # Force HF to use the mounted Volume for model + dataset cache.
+        "HF_HOME": "/cache/hf",
+        "TRANSFORMERS_CACHE": "/cache/hf",
+        "HF_HUB_ENABLE_HF_TRANSFER": "1",  # parallel download for the model
+        # Make Python flush prints immediately so we see step times live.
+        "PYTHONUNBUFFERED": "1",
+        # Reproducibility for the smoke.
+        "TOKENIZERS_PARALLELISM": "false",
+    })
+)
+# ---------------------------------------------------------------------------
+# 2) Persistent volume for HF cache (so model isn't re-downloaded each run)
+# ---------------------------------------------------------------------------
+# 1 GB of Qwen weights persists here. First run pays the download cost,
+# every subsequent run reuses the volume. Below 1 TiB / mo: free.
+hf_cache = modal.Volume.from_name("hf-cache-composer-smoke", create_if_missing=True)
+# ---------------------------------------------------------------------------
+# 3) App + secrets
+# ---------------------------------------------------------------------------
+app = modal.App("composer-replication-smoke")
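+# Only needed for gated models; Qwen2.5-0.5B is open. Because the secret is attached to
+# the function below, a secret named "huggingface-token" must already exist in the
+# workspace. If it does not, drop it from `secrets=[...]`.
+hf_secret = modal.Secret.from_name("huggingface-token", required_keys=[])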
+# ---------------------------------------------------------------------------
+# 4) The smoke function
+# ---------------------------------------------------------------------------
+@app.function(
+    image=image,
+    gpu="L4",                       # see §1: cheapest 24 GB option that fits
+    cpu=4.0,                        # 4 cores is plenty for tokenization on a sub-1B
+    memory=16 * 1024,               # 16 GiB RAM is plenty
+    volumes={"/cache": hf_cache},
+    timeout=60 * 30,                # hard 30-min cap matches the smoke spec
+    secrets=[hf_secret],
+    # NB: keep preemptible (default). Don't pay 3× to pin.
+    # NB: don't pin region — the 1.5–1.75× tax is unjustified for a smoke.
+)
+def smoke():
+    import time
+    import torch
+    from transformers import AutoModelForCausalLM, AutoTokenizer
+    MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
+    print(f"[smoke] torch={torch.__version__} cuda={torch.version.cuda} "
+          f"device={torch.cuda.get_device_name(0)} "
+          f"vram={torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB")
+    # -------------------------------------------------------------------
+    # Load tokenizer + model. bf16 — L4 supports it (Ada Lovelace).
+    # -------------------------------------------------------------------
+    t0 = time.perf_counter()
+    tok = AutoTokenizer.from_pretrained(MODEL_ID, cache_dir="/cache/hf")
+    model = AutoModelForCausalLM.from_pretrained(
+        MODEL_ID,
+        cache_dir="/cache/hf",
+        torch_dtype=torch.bfloat16,
+        device_map="cuda:0",
+    )
+    model.train()
+    print(f"[smoke] model load: {time.perf_counter()-t0:.1f}s "
+          f"params={sum(p.numel() for p in model.parameters())/1e6:.1f}M")
+    # -------------------------------------------------------------------
+    # 50-step verification loop.
+    #
+    # NOTE: this stub uses a synthetic batch — a single forward+backward
+    # against an LM-head loss — *not* the full 3-channel loss. The point
+    # is to (a) verify the Modal harness, (b) measure the per-step time
+    # of a vanilla AutoModelForCausalLM step on this GPU as a baseline.
+    #
+    # Replace the body of the for-loop with the actual ComposerReplicationTrainer
+    # `_compute_loss` call once data_collator outputs are stubbed/mocked.
+    # See: spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py
+    # -------------------------------------------------------------------
+    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)
+    # Synthetic batch: bs=2, seq=1024 — matches the realistic smoke shape.
+    B, T = 2, 1024
+    input_ids = torch.randint(0, tok.vocab_size, (B, T), device="cuda:0")
+    labels = input_ids.clone()
+    torch.cuda.reset_peak_memory_stats()
+    step_times = []
+    for step in range(50):
+        t = time.perf_counter()
+        out = model(input_ids=input_ids, labels=labels)
+        out.loss.backward()
+        optimizer.step()
+        optimizer.zero_grad(set_to_none=True)
+        torch.cuda.synchronize()
+        dt = time.perf_counter() - t
+        step_times.append(dt)
+        if step % 10 == 0:
+            print(f"[smoke] step {step:>3d} loss={out.loss.item():.4f} "
+                  f"dt={dt*1000:.1f}ms peak_vram={torch.cuda.max_memory_allocated()/1e9:.2f}GB")
+    # -------------------------------------------------------------------
+    # Final report.
+    # -------------------------------------------------------------------
+    median_ms = sorted(step_times)[len(step_times)//2] * 1000
+    p95_ms = sorted(step_times)[int(len(step_times)*0.95)] * 1000
+    peak_gb = torch.cuda.max_memory_allocated() / 1e9
+    print(f"\n[smoke] DONE. median_step={median_ms:.1f}ms p95={p95_ms:.1f}ms "
+          f"peak_vram={peak_gb:.2f}GB total_time={sum(step_times):.1f}s")
+    # Persist cache for the next run.
+    hf_cache.commit()
+@app.local_entrypoint()
+def main():
+    smoke.remote()
+```
+### 2.1 What's deliberately *not* in the skeleton
+- **No `flash-attn` install.** The `flash-attn` wheel build is a notorious time sink on Modal images (compiles against the CUDA toolkit). For a 0.5B smoke, SDPA (PyTorch's built-in scaled-dot-product attention) is fine and is on by default in transformers ≥ 4.45.
+- **No `bitsandbytes`, no `unsloth`, no `xformers`.** All add build complexity. None give you anything on a smoke.
+- **No DeepSpeed, no FSDP, no `accelerate launch`.** This is single-GPU; `accelerate` is in the image only because `trl` imports it. We don't invoke it.
+- **No web endpoint, no `@app.cls`, no `enter` method.** A `@app.function()` with no warm-up is correct for a one-shot smoke. `enter`/lifecycle methods are for serving and amortizing model load across many calls — not relevant when you call once.
+- **No `min_containers` or `buffer_containers`.** Those are warm-pool knobs for serving — they cost money. Default scale-from-zero is right.
+- **No `Image.from_registry`.** `debian_slim` + `pip_install` is faster than pulling a CUDA base image when you don't need a custom CUDA toolkit.
+### 2.2 What you do need to add when you wire the real loss
+Replace the synthetic `for step in range(50)` body with:
+```python
+from data_collator import ComposerDataCollator        # spike 005 path
+from trl_path.composer_trainer import ComposerReplicationTrainer
+# ...
+# Build a small fixed dataset of (prompt, response, hint, dpo_pair) tuples
+# inline in the smoke (10–20 examples). Don't pull a real RL rollout — the
+# point is to verify the loss path, not the rollout path.
+```
+The smoke does **not** need a real rollout/sampling phase. Stub `inputs` with the keys `_compute_sdpo_loss` and `_compute_trace_replay_loss` consume (`ctx_teacher_input_ids`, `dpo_chosen_input_ids`, `dpo_chosen_response_mask`, `dpo_chosen_ref_logprobs`, `sdpo_loss_mask`, …) using fixed tensors. That's the real verification — does the 3-channel loss compute and back-propagate without shape errors. The trainer skeleton's logging will tell you per-channel values.
+---
+## 3. Gotchas that bite *this specific workload*
+The Modal docs and the `mlops/modal-llm-training` skill cover ~30 lessons aimed at 7B–30B training. Most of them don't apply here. The ones that do:
+### 3.1 The teacher forward in SDPO doubles your effective batch memory — but only briefly
+`ComposerReplicationTrainer._compute_sdpo_loss` does this (composer_trainer.py L138–143):
+```python
+student_logits = model(input_ids=inputs["input_ids"]).logits      # with grad
+with torch.no_grad():
+    teacher_logits = model(input_ids=inputs["ctx_teacher_input_ids"]).logits
+```
+Two issues:
+1. **Both logits tensors are held simultaneously** in `_compute_sdpo_loss` — they're handed to `generalized_jsd_loss` which keeps them alive for the JSD math. For Qwen 2.5-0.5B (vocab = 151,936), one logits tensor at B=2,T=1024 in bf16 is `2 * 1024 * 151936 * 2 bytes ≈ 622 MB`. Two of them = ~1.2 GB. **Negligible on a 24 GB L4** but worth noting because logits are surprisingly fat for the Qwen vocab.
+2. **Use the `top_k` arg in `generalized_jsd_loss`** if you ever want to scale this up. The docstring (`opsd_loss.py` L54) explicitly recommends it: *"top_k: restrict KL to top-k tokens of the teacher distribution. Saves compute on large vocabularies (Qwen3 vocab = 152K)."* On the smoke, leave it `None` to verify the unrestricted path; flip it on for real training.
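The logits footprint quoted in item 1 follows directly from the tensor shape. A minimal sketch, assuming 2 bytes per element for bf16 and the vocabulary size stated above; `logits_bytes` is a hypothetical helper.

```python
def logits_bytes(batch: int, seq_len: int, vocab_size: int, bytes_per_element: int = 2) -> int:
    """Size in bytes of a dense [batch, seq_len, vocab_size] logits tensor."""
    return batch * seq_len * vocab_size * bytes_per_element


one_tensor = logits_bytes(batch=2, seq_len=1024, vocab_size=151_936)
both_tensors = 2 * one_tensor   # student + teacher held at the same time

print(f"one: {one_tensor / 1e6:.0f} MB, both: {both_tensors / 1e9:.2f} GB")
```

Because the size scales linearly with batch and sequence length, the same helper shows how quickly the logits grow if the smoke shape is increased.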
+### 3.2 The DPO channel does TWO more grad'd forwards per step
+`_compute_trace_replay_loss` (composer_trainer.py L191–198) calls `_sequence_logprobs(model, dpo_chosen_…)` and `_sequence_logprobs(model, dpo_rejected_…)`. Both are with-grad. So each training step is:
+| Forward | Grad? | Notes |
+|---------|-------|-------|
+| `super()._compute_loss` (GRPO) | yes | parent's standard fwd |
+| Student in SDPO | yes | only when alpha_sdpo ≠ 0 |
+| Teacher in SDPO | no | hint-conditioned context |
+| DPO chosen | yes | only when beta_replay ≠ 0 |
+| DPO rejected | yes | only when beta_replay ≠ 0 |
+**That's up to 4 grad'd forwards before the backward.** PyTorch will hold activations for all of them in the autograd graph until `.backward()` runs. For the smoke this is fine (0.5B × 4 act tapes ~ 4 GB at B=2,T=1024) but for any real training run on a larger model: **enable gradient checkpointing or run the SDPO/DPO channels in alternating steps** rather than every step.
+For the smoke specifically: **set `alpha_sdpo=0.1` and `beta_replay=0.05`** (the trainer defaults) and verify activation memory peaks below 16 GB. If it doesn't, there's a bug in the data collator producing too-long sequences.
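The table above can be encoded as a small helper to sanity-check how many forwards a given configuration triggers. This is a sketch: `forwards_per_step` is hypothetical, and it assumes the teacher forward is skipped together with the student forward when `alpha_sdpo` is zero, which the table implies but does not state.

```python
from typing import List, Tuple


def forwards_per_step(alpha_sdpo: float, beta_replay: float) -> List[Tuple[str, bool]]:
    """Return (forward name, needs_grad) pairs for one training step."""
    forwards = [("grpo", True)]                      # parent's standard forward
    if alpha_sdpo != 0:
        forwards.append(("sdpo_student", True))
        forwards.append(("sdpo_teacher", False))     # runs under no_grad
    if beta_replay != 0:
        forwards.append(("dpo_chosen", True))
        forwards.append(("dpo_rejected", True))
    return forwards


defaults = forwards_per_step(alpha_sdpo=0.1, beta_replay=0.05)
grad_forwards = sum(1 for _, needs_grad in defaults if needs_grad)
print(len(defaults), grad_forwards)
```

With the trainer defaults this yields five forwards, four of which hold activations until the backward pass.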
+### 3.3 `requires_grad=True` on the zero-tensor short-circuit is a footgun
+In `composer_trainer.py` L136 and L155, when SDPO is short-circuited:
+```python
+return torch.tensor(0.0, device=_device_of(model), requires_grad=True)
+```
+This is **not in the autograd graph**: it's a leaf tensor with `requires_grad=True` but no parent op. When you sum it into `total = grpo_loss + alpha * sdpo_kl + beta * replay_dpo`, the `0.0` contributes nothing to the model's gradients and doesn't break things. The trap is a step where ALL three channels short-circuit (e.g., a smoke step with no error sites and no DPO pairs): because the placeholders are leaves that require grad, `total.backward()` runs without error but does not populate any model parameter's `.grad`, so the step silently trains nothing. (If the placeholder were built *without* `requires_grad=True`, the same step would instead raise `RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn`.) **The smoke will hit this** if your synthetic batch lacks `ctx_teacher_input_ids` and `dpo_chosen_input_ids`.
+Fix in the smoke: ensure the synthetic batch includes at least one `ctx_teacher_input_ids` row (it can be a copy of `input_ids` to keep things trivial) so SDPO doesn't short-circuit on every step.
+### 3.4 `torch.cuda.synchronize()` before timing reads
+If you don't `torch.cuda.synchronize()` before reading `time.perf_counter()` you'll measure CPU dispatch time, not GPU step time. The skeleton above includes it. The Modal runtime doesn't change this — same rule as local.
+### 3.5 The HF cache Volume must be `commit()`ed
+From <https://modal.com/docs/guide/volumes>: *Volume writes are not persisted across runs unless you call `.commit()` (or use `volume.batch_upload`).* The skeleton calls `hf_cache.commit()` at the end. If you forget, run 2 will re-download the model. This is the only "Modal-flavored" gotcha that bites a smoke.
+### 3.6 What does NOT bite
+These are the lessons from `mlops/modal-llm-training` that are **not relevant to a 0.5B smoke** — don't waste mental cycles on them:
+- ❌ FSDP / DeepSpeed sharding setup. Single GPU.
+- ❌ `accelerate launch` / multi-process distributed. Single GPU.
+- ❌ Flash-attention version pinning vs torch version. SDPA is fine for 0.5B.
+- ❌ Tensor parallelism / sequence parallelism. Single GPU.
+- ❌ Multi-node clusters. Single node.
+- ❌ Memory snapshotting (`enable_memory_snapshot=True`). It's a 30-min one-shot. The cold-start penalty is ~30 s on a smoke that runs for 5 min — 10% overhead, not worth the snapshot setup time.
+- ❌ Region pinning for data locality. The whole input is `from_pretrained`, served by HF — Modal's default region is fine.
+- ❌ Custom CUDA install (`Image.from_registry("nvidia/cuda:…")`). The pre-built torch wheel ships its own CUDA.
+---
+## 4. Decision rule: Modal vs the local 5090
+### 4.1 The numbers
+**Local 5090** (32 GB VRAM, Blackwell, ~1.6 PFLOPS bf16):
+- Step time for Qwen-0.5B at B=2, T=1024, 3-channel loss (≈4 grad'd fwds + bwd): expect **~150–400 ms per step** based on parameter-count + Blackwell's bf16 throughput. Call it 300 ms.
+- 50 steps: **~15 seconds of pure compute**.
+- Plus model load (one-time, from local HF cache): ~5 seconds.
+- Plus data collator setup: ~3 seconds.
+- **Wall clock: ~25–40 seconds.**
+- **Cost: $0** (electricity ignored — the 5090 draws ~600 W under load × 40 s = 6.7 Wh ≈ $0.001).
+**Modal L4** (24 GB VRAM, Ada Lovelace, ~0.12 PFLOPS bf16):
+- Step time for the same workload on L4: **~1.5–4 s per step.** (L4 is roughly 13× lower bf16 throughput than 5090, but the workload at B=2 won't saturate the 5090, so realistic gap is ~5–10×.) Call it 2 s.
+- 50 steps: **~100 seconds of pure compute**.
+- Plus container cold start, image pull, model download (cached after run 1), CUDA init: **30–90 s on first run, 20–40 s afterward**.
+- **Wall clock: ~3–5 minutes per run (worst case 7 min on a cold first run).**
+- **Cost: $0.08–$0.13 per run.**
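The itemized estimates above can be summed as a quick sanity check. This is a minimal sketch that uses only the point estimates quoted in this section (300 ms and 2 s per step, 50 steps); the helper name `wall_clock_s` is illustrative and not part of the framework.

```python
def wall_clock_s(step_s: float, steps: int, overhead_s: float) -> float:
    """Pure compute plus fixed overhead, in seconds."""
    return step_s * steps + overhead_s


STEPS = 50

# Local 5090: ~300 ms/step, ~5 s model load plus ~3 s collator setup.
local = wall_clock_s(0.3, STEPS, 5 + 3)

# Modal L4: ~2 s/step, 20 s overhead on a warm run, up to 90 s on a cold first run.
modal_warm_lo = wall_clock_s(2.0, STEPS, 20)
modal_cold_hi = wall_clock_s(2.0, STEPS, 90)

# Energy for a 40 s local run at ~600 W, converted from joules to watt-hours.
energy_wh = 600 * 40 / 3600

print(f"local ~ {local:.0f} s, Modal ~ {modal_warm_lo:.0f}-{modal_cold_hi:.0f} s")
print(f"energy ~ {energy_wh:.1f} Wh")
```

The itemized components sum to 23 s locally and 120 to 190 s on Modal. The wall-clock ranges quoted above sit higher than these sums, so they should be read as including overhead that is not itemized here.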
+### 4.2 The decision rule
+> **For this specific 30-min smoke: run on the 5090. Do not use Modal.**
+Reasoning:
+1. **Latency:** the 5090 finishes the smoke in ~30 s. Modal's L4 needs ~5 minutes including cold start. That's a **10× iteration penalty** on a workload where the entire point is iterate-and-fix-the-shape-error cycles. At ~30 s per local run, every minute spent waiting on Modal is time for two more local runs, and every Modal round-trip costs about ten.
+2. **Memory headroom:** the 5090's 32 GB is **larger** than the L4's 24 GB. There is no memory motivation to leave the local box.
+3. **Network friction:** every Modal run requires `modal run`, syncing local code, waiting for the image, and watching logs. Local is `python local_smoke.py` (§4.3), or just import-and-run in a notebook.
+4. **Cost asymmetry vs. iteration cost:** $0.10/run is not the issue. The issue is **30 minutes of attention spent on Modal infra is 30 minutes not spent debugging the loss**.
+5. **The framework hasn't been verified to run end-to-end yet.** The first hundred bugs you'll find are local Python issues — wrong tensor shapes, missing keys in the collator, the `requires_grad=True` zero-tensor footgun (§3.3), TRL version mismatches. Debugging those over a Modal round-trip is masochism.
+**When Modal becomes correct:**
+| Scenario | Modal? | Why |
+|----------|--------|-----|
+| 30-min smoke on 0.5B (this task) | **No** | 5090 wins on every dimension |
+| Sweep alpha_sdpo, beta_replay across 8 configs in parallel | **Yes** | 8× Modal containers in parallel beats 8 sequential runs on one 5090 |
+| Scale to Qwen2.5-7B (real training) | **Yes** | 7B needs >32 GB for grad+optimizer, so 5090 is out; you want A100-80GB or H100 |
+| Scale to multi-node (40B+) | **Yes (with caveats)** | Modal multi-node is in beta — see <https://modal.com/docs/guide/multi-node-training> |
+| 24/7 inference of trained model | **Maybe** | Depends on QPS; Modal serverless wins for spiky, loses for steady |
+### 4.3 Recommended workflow
+1. **Write the smoke as `local_smoke.py`** that runs on the 5090. Same body as `modal_app.py`'s `smoke()` function, minus the `@app.function` decorator. Iterate there until 50 steps run cleanly.
+2. **Then** drop the body into `modal_app.py` (the skeleton in §2). The Modal version's value is to verify "does it run on cloud Linux without local dotfile interference" and to baseline L4 step-time vs the 5090. That's a one-shot validation, not a development loop.
+3. **For the real training run** (an actual training run, not a smoke), start with A100-40GB on Modal (or H100 if you've got the credits). The L4's ~2 s step time is fine for a 50-step smoke, but over 10,000 steps it comes to 2 s × 10,000 = 20,000 s ≈ 5.5 hours, which is painful for a real run.
+---
+## 5. References
+All claims in this document are sourced from:
+- **Pricing**: <https://modal.com/pricing> (canonical; updated regularly by Modal — re-fetch if cost-sensitive). Per-second numbers in §1.1 captured from this page at report-write time.
+- **GPU naming**: <https://modal.com/docs/guide/gpu> — confirms `gpu="L4"`, `gpu="A10"` (not `"A10G"`), `gpu="A100-40GB"`, `gpu="H100!"` syntax.
+- **Cold starts**: <https://modal.com/docs/guide/cold-start> — "Containers boot in about one second" + the warm-up period is image pull + global imports + `enter` methods.
+- **Volumes**: <https://modal.com/docs/guide/volumes> — `commit()` semantics for HF cache persistence.
+- **Region/preemption multipliers**: pricing page footer + <https://modal.com/docs/guide/preemption>.
+- **Multi-node beta**: <https://modal.com/docs/guide/multi-node-training>.
+- **Examples (for `Image.pip_install` patterns)**: <https://github.com/modal-labs/modal-examples> — see `06_gpu_and_ml/llm-finetuning/` for similar 0.5B/3B finetune patterns.
+- **TRL `GRPOTrainer._compute_loss` extension point**: verified in `composer_trainer.py` header comment ("DeepWiki audit of huggingface/trl, 2026-05-25"). Confirmed `super()._compute_loss(model, inputs)` works as the framework's parent-call.
+- **Local trainer code reviewed**:
+  - `/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py`
+  - `/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/opsd_loss.py`
+---
+## 6. TL;DR
+1. **GPU: L4. Cost: ~$0.10/run. Total budget burn: ~50× re-runs before the $5 cap.** Don't pay for A10, A100, or H100 on a 0.5B smoke.
+2. **Skeleton: §2** — `gpu="L4"`, 4 cores, 16 GB RAM, 30-min timeout, persistent HF cache Volume, default preemption, no region pin.
+3. **Workload-specific gotchas: §3** — 3-channel loss does up to 4 grad'd forwards/step (memory headroom check), the zero-tensor `requires_grad=True` short-circuit can break `backward()`, and `volume.commit()` is mandatory.
+4. **Decision: run on the 5090, not Modal.** 5090 finishes the smoke in ~30 s vs Modal's ~5 min including cold start, with $0 marginal cost and 10× faster iteration. Reserve Modal for parameter sweeps and 7B+ training.

docs/research/TRACE_SOURCE_RECONNAISSANCE.md ADDED

	@@ -0,0 +1,403 @@

+# TRACE_SOURCE_RECONNAISSANCE.md
+Spike 007 trace-source audit, feeding ADR-002.
+Status: **DECIDED** — recommend **(a) Claude Code session JSONL** (`~/.claude/projects/<encoded>/<sessionId>.jsonl`).
+---
+## 0. TL;DR
+Of the six candidates audited, Claude Code session JSONL wins on every axis except "official Anthropic-published schema" (no such doc exists), and for that single weakness there is now a community-maintained reverse-engineered JSON Schema validated against ~50,000 messages from real sessions, plus three independent third-party schema specs. The user has **1,015 .jsonl sessions on this machine** today; the eight largest sampled span 550 → 17,315 lines and contain **6,762 multi-turn `tool_use` messages**. Acquisition cost is zero. Licensing is clean: the JSONL files are local user-owned data; the proprietary Claude Code binary is not redistributed by us.
+The runners-up — OpenHands (well-documented but acquisition is non-trivial), SWE-bench trajectory submissions (heterogeneous schemas across submitters), Aider markdown chat history (lossy / unparseable for tool calls), and Cline (no public stable export format) — each lose on at least one of the four axes.
+---
+## 1. Context: TraceExample dataclass field reality
+**Important correction to the parent task description.** The task brief said "TraceExample dataclass with fields state_text, action_taken, hint_text (optional), reward (float), teacher_id (str)". Reading the actual file at
+`/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/teacher_replay.py` shows the existing types are different — there is no `TraceExample` class. The closest existing types are two `TypedDict`s used by `replay_trace()` and `extract_dpo_pairs()`:
+```python
+class TraceState(TypedDict):
+    state_id: str           # unique within the trace
+    messages: list[dict]    # conversation up to and including this step's user prompt
+    student_action: str     # what the student actually did at this step
+class DPOPair(TypedDict):
+    state_id: str
+    state_messages: list[dict]
+    chosen: str       # teacher-consensus action
+    rejected: str     # student action
+    n_teachers_agreeing: int
+```
+The mapping sketch in §6 below targets `TraceState` (the *input* to teacher replay), since that is the type a `TraceIngester` is upstream of. If Spike 007 also wants a unified `TraceExample` per the brief, the natural shape is `TraceState` ∪ `{teacher_id: str | None, reward: float | None, hint_text: str | None}` — flagged for ADR-002 to settle.
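As a concrete reading of the `TraceState` ∪ optional-fields shape above, here is one possible sketch. `TraceExample` does not exist in the codebase; its name and the use of `TypedDict` inheritance with `total=False` are assumptions pending ADR-002, and `TraceState` is re-declared only to keep the snippet self-contained.

```python
from typing import Optional, TypedDict


class TraceState(TypedDict):
    state_id: str
    messages: list[dict]
    student_action: str


class TraceExample(TraceState, total=False):
    # Optional extras from the task brief. total=False makes these keys
    # omittable, while the inherited TraceState keys stay required.
    teacher_id: Optional[str]
    reward: Optional[float]
    hint_text: Optional[str]


# An ingester-produced example: no teacher has replayed it yet.
example: TraceExample = {
    "state_id": "session:uuid",
    "messages": [{"role": "user", "content": "hi"}],
    "student_action": "hello",
}
```

With this shape, anything that accepts a `TraceState` also accepts a `TraceExample`, so the ingester output could flow into `replay_trace()` unchanged if ADR-002 settles on it.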
+---
+## 2. Candidate audit summary
+Scoring legend: `+` good, `~` mixed, `-` bad, on each of the four required axes.
+| # | Candidate | Schema documented | Real ≥5 multi-turn traces | Hint-receptive signal density | License OK | Verdict |
+|---|---|---|---|---|---|---|
+| **a** | **Claude Code JSONL** (`~/.claude/projects/`) | `~` Anthropic publishes high-level format note; community schemas are detailed and validated | **+** 1,015 local sessions, 5+ trivially | **+** Per-step `assistant.message.content[].tool_use` blocks → discrete actions, ideal teacher-correction sites | **+** User-owned local files; framework MIT | **CHOSEN** |
+| b | Cline VS Code extension | `-` No published stable export schema | `~` Requires running Cline + manual export | `~` Plausible if exported but unverified | `~` Cline source Apache-2.0 but trace format isn't a stable contract | reject |
+| c | OpenHands trajectories | **+** Well-documented (events/, base_state.json, Pydantic Event models) | `-` Need to *run* OpenHands or download eval traces — not zero-cost | **+** ActionEvent/ObservationEvent split is conceptually ideal | **+** OpenHands MIT-licensed | strong runner-up |
+| d | Aider chat history | `~` Format is "markdown, level-4 headings for user input" — fragile | `~` Available if Aider was used | `-` Tool calls are flattened into prose; recovering structured actions is lossy | `+` Aider Apache-2.0 | reject |
+| e | SWE-bench / Lite leaderboard `trajs/` | `-` Each submitter chooses a free-form text format (md/json/yaml) | **+** ~hundreds of submissions on github.com/swe-bench/experiments | `~` Heterogeneous; structured ones (e.g. mini-swe-agent `.traj.json`) are good, others are essentially logs | **+** Public submissions with usage rights for research | reject as primary; usable as future cross-validation set |
+| f | SWE-smith-trajectories on HF | **+** Standard OpenAI messages format, documented per dataset card | **+** 5,017 trajectories, 76,002 rows, public | **+** Single-attempt per-instance SWE-agent runs | **+** Apache-2.0 dataset license | strong runner-up; **complement, not replacement** |
+The (f) row was discovered during audit (the parent task allowed "any other public source you find that is better"). It's a strong candidate but answers a *different* question: SWE-smith trajectories give us reproducible benchmark traces; Claude Code JSONL gives us *the user's actual workflow*. For Spike 007's purpose (verify the teacher-replay path works on a real, signal-dense trace at zero acquisition cost), (a) is the right primary; (f) is queued for a later cross-validation phase.
+---
+## 3. Chosen format spec — Claude Code session JSONL
+### 3.1 Location and naming
+- **Root**: `~/.claude/projects/` (overridable via `CLAUDE_CONFIG_DIR`).
+  Source: <https://code.claude.com/docs/en/sessions> ("Transcripts are stored as JSONL at `~/.claude/projects/<encoded-cwd>/<sessionId>.jsonl`").
+- **Project-key encoding**: working-directory absolute path with `/` and `\` and `:` replaced by `-`, with a leading `-`. (Hidden directories with a leading dot become double dashes.)
+  Source: <https://github.com/jamie-bitflight/claude_skills/blob/main/plugins/agentskill-kaizen/skills/transcript-analysis/references/session-log-schema.md> §"Project key encoding".
+- **File**: `<sessionId>.jsonl`. Subagent transcripts are `agent-<agentId>.jsonl`; a `SessionReader` should *skip* files starting with `agent-` when listing main sessions.
+  Source: same `claude_skills` doc, §"Subagent File Location".
+- **Encoding**: UTF-8, newline-delimited JSON. One JSON object per line. No `[`/`]` wrapping. Local cleanup default 30 days, configurable via `cleanupPeriodDays` in `~/.claude/settings.json`.
+  Source: <https://code.claude.com/docs/en/data-usage> ("Local caching: Claude Code clients store session transcripts locally in plaintext under `~/.claude/projects/` for 30 days by default to enable session resumption.")
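The naming rules above can be sketched as two small helpers. This follows only the encoding as stated in this section (separators and a component's leading dot become `-`); the function names are hypothetical, and the real Claude Code encoder may handle other characters differently.

```python
import re


def encode_project_key(cwd: str) -> str:
    """Encode an absolute working directory as a ~/.claude/projects/ key.

    Assumed rule, taken from the description above: a leading dot on a path
    component becomes '-', then '/', backslash and ':' each become '-'.
    """
    hidden_fixed = re.sub(r"(?<=[/\\])\.", "-", cwd)
    return re.sub(r"[/\\:]", "-", hidden_fixed)


def is_main_session(filename: str) -> bool:
    """True for <sessionId>.jsonl, False for agent-<agentId>.jsonl subagent files."""
    return filename.endswith(".jsonl") and not filename.startswith("agent-")


print(encode_project_key("/mnt/e/CS/HF/eidolon"))
```

Because hyphens already present in directory names are kept as-is, the encoding is lossy: a project key cannot be decoded back to a unique path. The `cwd` field inside each record is the reliable source for the original directory.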
+### 3.2 Common record fields
+Every record (both user and assistant types) carries:
+| field | type | meaning |
+|---|---|---|
+| `parentUuid` | `string \| null` | UUID of the parent record (null on the first record) |
+| `uuid` | `string` | This record's UUID |
+| `sessionId` | `string` | UUID of the session (matches filename) |
+| `timestamp` | `string` (ISO-8601) | Wall-clock time of the record |
+| `cwd` | `string` | Absolute working directory |
+| `version` | `string` | Claude Code version (e.g. `"2.1.143"`) |
+| `gitBranch` | `string` | Empty string `""` when not in a git repo |
+| `isSidechain` | `boolean` | True for sub-agent (Task tool) chains |
+| `userType` | `string` | `"external"` or similar |
+| `type` | `string` | Discriminator — see §3.3 |
+| `entrypoint` | `string` | e.g. `"sdk-cli"` |
+Sources for these fields:
+- <https://github.com/KyleAMathews/claude-code-ui/blob/main/spec.md> §"Type Definitions" → `BaseMessageEntry`
+- <https://github.com/jamie-bitflight/claude_skills/blob/main/plugins/agentskill-kaizen/skills/transcript-analysis/references/session-log-schema.md> §"Top-Level Record Fields"
+- <https://github.com/moru-ai/agent-schemas/blob/main/claude-code/v2.1.1/session.schema.json> (machine-validated against ~50,000 messages from 480 real sessions)
+- Direct inspection (this doc): `head` of `~/.claude/projects/-mnt-e-CS-HF-eidolon/c6967343-51a3-4b1b-9472-a569e96114b1.jsonl` confirms presence of every field above.
+### 3.3 Record types (`type` discriminator)
+| `type` | Role |
+|---|---|
+| `user` | Both human prompts AND tool results (distinguished by `message.content[].type`) |
+| `assistant` | Model output: text, `thinking`, and `tool_use` blocks |
+| `system` | Hook summaries, stop notices |
+| `summary` | Context-compaction markers |
+| `attachment` | Hook stdout/stderr, e.g. `SessionStart` hook output |
+| `queue-operation` | Prompt enqueue/dequeue events |
+| `file-history-snapshot` | File-state tracking for undo |
+| `last-prompt` | Bookkeeping for resume |
+Source: <https://github.com/KyleAMathews/claude-code-ui/blob/main/spec.md> §"Entry Types"; corroborated by direct `Counter` inspection of one local session showing `attachment, assistant, user, last-prompt, queue-operation` types in expected proportions.
+### 3.4 The two record types we care about
+#### Assistant record carrying a tool call (the "student action")
+Real example, redacted from `~/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-doc-adapter-skeleton/39df59f0-674c-413a-b333-cdac0cea9db7.jsonl`:
+```json
+{
+  "type": "assistant",
+  "uuid": "24a16a51-3133-4ba5-9d23-472864286154",
+  "parentUuid": "1b11c3b3-832b-4473-a944-b61a1f3f2594",
+  "sessionId": "39df59f0-…",
+  "timestamp": "2026-05-16T04:52:21.947Z",
+  "message": {
+    "role": "assistant",
+    "model": "claude-opus-4-7",
+    "content": [
+      {
+        "type": "tool_use",
+        "id": "toolu_bdrk_012HC2dggmSgtVAtWWzwikZq",
+        "name": "Bash",
+        "input": {
+          "command": "ov mail check --agent builder-doc-adapter-skeleton 2>&1 | head -200",
+          "description": "Check builder agent inbox"
+        }
+      }
+    ],
+    "stop_reason": "tool_use",
+    "usage": { "input_tokens": 6, "cache_creation_input_tokens": 48287, "output_tokens": 1021, ... }
+  }
+}
+```
+The student's *action* at this step = the JSON of `message.content[i]` where `content[i].type == "tool_use"` (or, if multiple tool_use blocks, the array of them; or if pure-text reply, the `content[i].text` of the `text` block).
+#### User record carrying a tool result (the "observation")
+```json
+{
+  "type": "user",
+  "uuid": "b9f9414b-…",
+  "parentUuid": "24a16a51-…",            // matches the assistant uuid above
+  "sessionId": "39df59f0-…",
+  "timestamp": "2026-05-16T04:52:23.229Z",
+  "message": {
+    "role": "user",
+    "content": [
+      {
+        "tool_use_id": "toolu_bdrk_012HC2dggmSgtVAtWWzwikZq",
+        "type": "tool_result",
+        "content": "  No new messages",
+        "is_error": false
+      }
+    ]
+  },
+  "toolUseResult": {                       // duplicate, structured form
+    "stdout": "  No new messages",
+    "stderr": "",
+    "interrupted": false,
+    "isImage": false,
+    "noOutputExpected": false
+  },
+  "sourceToolAssistantUUID": "24a16a51-…"  // back-pointer to the assistant uuid
+}
+```
+User records carrying actual human prompts have `message.content` as a list with `{"type":"text","text":"..."}` blocks (or, in older logs, `message.content` as a plain string).
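The two record shapes above are linked by `tool_use.id` and `tool_result.tool_use_id`. A minimal sketch of pairing each action with its observation, assuming records have already been parsed into dicts; `pair_tool_calls` and `_blocks` are hypothetical helper names, not part of spike-005.

```python
from typing import Optional


def _blocks(rec: dict, block_type: str) -> list[dict]:
    """Content blocks of one type; tolerates plain-string or missing content."""
    msg = rec.get("message")
    content = msg.get("content") if isinstance(msg, dict) else None
    if not isinstance(content, list):
        return []
    return [b for b in content if isinstance(b, dict) and b.get("type") == block_type]


def pair_tool_calls(records: list[dict]) -> list[tuple[dict, Optional[dict]]]:
    """Match every assistant tool_use block to its tool_result block, or None."""
    # Index all tool results by the id of the call they answer.
    results = {
        b.get("tool_use_id"): b
        for rec in records
        if rec.get("type") == "user"
        for b in _blocks(rec, "tool_result")
    }
    # Walk the tool calls in file order and look up each result.
    return [
        (b, results.get(b.get("id")))
        for rec in records
        if rec.get("type") == "assistant"
        for b in _blocks(rec, "tool_use")
    ]
```

A `None` observation marks a call whose result never landed in the log, for example an interrupted session.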
+### 3.5 Schema stability
+- **Anthropic's official documentation** acknowledges the location and "each line is a JSON object for a message, tool use, or metadata entry" but does **not** publish a versioned schema.
+- **Practical stability**: moru-ai/agent-schemas tracked v2.0.76 → v2.1.1; only one new field of note (`toolUseResult`). Schema pins `additionalProperties: true` for forward compatibility. This level of stability is sufficient for Spike 007 (a research spike, not a long-lived product API).
+- **Mitigation**: pin to a specific Claude Code `version` field range and version-gate the ingester (e.g. accept `2.1.x`, warn on others).
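The version gate in the mitigation above could look like the following. This is a sketch under the assumption that `2.1.x` is the tested range; `check_version` and `TESTED_PREFIX` are hypothetical names, and whether to warn or hard-fail is still open (see §7).

```python
import warnings

TESTED_PREFIX = (2, 1)  # assumption: the "2.1.x" range named above


def check_version(version: str, *, strict: bool = False) -> bool:
    """Return True if `version` is in the tested range; otherwise warn or raise."""
    try:
        parts = tuple(int(p) for p in version.split(".")[:2])
    except ValueError:
        parts = ()  # empty or non-numeric version strings never match
    ok = parts == TESTED_PREFIX
    if not ok:
        msg = f"untested Claude Code version {version!r}; expected 2.1.x"
        if strict:
            raise ValueError(msg)
        warnings.warn(msg, stacklevel=2)
    return ok
```

The ingester could call this once per record with `rec.get("version", "")`, or once per file on the first record that carries a version.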
+### 3.6 Licensing
+- The Claude Code binary is **proprietary** (Anthropic Commercial Terms of Service, <https://github.com/anthropics/claude-code/blob/1e95326e12183286fc6cbd828c8a86a0d8e03c62/LICENSE.md>).
+- The session JSONL files are **local user data** generated on the user's machine during ordinary use. Anthropic's data-usage doc explicitly calls them "local caching … session transcripts locally in plaintext" — they belong to the user.
+- Our framework is MIT-licensed and we are **not redistributing the Claude Code binary or any third-party trace files**. We are reading the user's own local logs (analogous to processing one's own `.bash_history`).
+- We MUST NOT publish raw trace files in our repo without the user's consent (PII risk: cwd, gitBranch, file contents). The framework should ship only the *ingester*, plus a tiny synthetic-fixture trace for unit tests.
+---
+## 4. Acquiring the 5 real example traces
+**Zero acquisition cost.** All five live on this machine right now.
+Discovery command (used during this audit):
+```bash
+find ~/.claude/projects -name "*.jsonl" 2>/dev/null | wc -l
+# → 1015 files
+```
+Five concrete pre-selected sessions, each multi-turn (≥ 100 tool_use messages), each from a distinct project, each ≥ 50 KB:
+| # | Tool-use msgs | User msgs | Asst msgs | Total lines | Path |
+|---|---|---|---|---|---|
+| 1 | 2,830 | 3,199 | 4,325 | 17,315 | `/home/codeseys/.claude/projects/-mnt-e-CS-HF-eidolon/c6967343-51a3-4b1b-9472-a569e96114b1.jsonl` |
+| 2 | 1,350 | 1,407 | 2,016 | 7,673 | `/home/codeseys/.claude/projects/-mnt-e-CS-github-agent-manager/c42b68ea-d410-455e-bc71-92ec6c4adce9.jsonl` |
+| 3 | 984 | 1,032 | 1,549 | 5,783 | `/home/codeseys/.claude/projects/-mnt-e-CS-HF-streaming-speech-to-speech/73c9925c-d5e5-48fc-a97b-a58687c2fb3c.jsonl` |
+| 4 | 717 | 759 | 1,142 | 4,036 | `/home/codeseys/.claude/projects/-mnt-e-CS-github/6ac8e20f-98ec-4279-9957-e68862a90c5e.jsonl` |
+| 5 | 125 | 126 | 197 | 629 | `/home/codeseys/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-iteration-checkpoint/e4a34e2b-40c6-49ce-b253-912a43224aae.jsonl` |
+(All five inspected programmatically during this audit — counts above are real, not estimates.)
+For users on other machines: `find ~/.claude/projects -name '*.jsonl' -size +50k | head` will surface candidates. For repository CI we will commit a small (~5 KB) **synthetic** fixture conforming to the schema, never any of the user's real traces.
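Counts like those in the table can be reproduced with a short script. The sketch below uses one plausible set of definitions (non-blank lines, records by `type`, and assistant records containing at least one `tool_use` block); the exact definitions used during the audit are not recorded here, so small differences are possible. `session_counts` is a hypothetical helper.

```python
import json
from collections import Counter
from pathlib import Path


def session_counts(path) -> Counter:
    """Count lines, user/assistant records, and tool_use-bearing records."""
    counts: Counter = Counter()
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            counts["lines"] += 1
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                counts["unparseable"] += 1
                continue
            if not isinstance(rec, dict):
                continue
            rtype = rec.get("type")
            if rtype in ("user", "assistant"):
                counts[rtype] += 1
            msg = rec.get("message")
            content = msg.get("content") if isinstance(msg, dict) else None
            has_tool_use = isinstance(content, list) and any(
                isinstance(b, dict) and b.get("type") == "tool_use" for b in content
            )
            if rtype == "assistant" and has_tool_use:
                counts["tool_use"] += 1
    return counts
```

Running it over each candidate path and sorting by `counts["tool_use"]` gives a ranking comparable to the table.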
+---
+## 5. Decision-relevant tradeoffs vs runners-up
+### Why we are NOT picking OpenHands trajectories (c)
+- **Pro**: cleanest schema we audited — Pydantic `Event` / `ActionEvent` / `ObservationEvent` models, source: <https://docs.openhands.dev/sdk/arch/events>, source code: <https://github.com/OpenHands/OpenHands/blob/3ec999e8/openhands/events/serialization/event.py>. Tool-call structure is *more* normalized than Claude Code's (explicit Action/Observation typing).
+- **Con**: zero-acquisition is false here. Persistence dir defaults to `workspace/conversations/` and only exists if the user has *run OpenHands locally*. Public eval trajectories are spread across the eval/ folder rather than a clean public bucket.
+- **Decisive**: Spike 001's economic floor was measured on 50 synthetic states; Spike 007's purpose is to verify ingestion + replay on real traces *that already exist*. (a) gives that today. (c) requires standing up OpenHands first, and its storage format is split between v0 (per-event JSON files) and v1 (timestamped files) per <https://github.com/All-Hands-AI/OpenHands/issues/8701>, which is a flux risk.
+- **Future use**: if the framework ever ships "trace ingester adapters" plural, OpenHands is the second adapter to write — its event-typed model is conceptually superior.
+### Why we are NOT picking SWE-bench leaderboard trajectories (e)
+- **Pro**: hundreds of submissions on <https://github.com/swe-bench/experiments>, with required `trajs/` folders.
+- **Con**: leaderboard rules say "The reasoning trace can be represented with **any text based file format (e.g. md, json, yaml)**" (source: <https://github.com/swe-bench/experiments> README). Each submitter picks their own. Building a generic ingester is a per-submission engineering project, not a single adapter. SWE-agent uses one shape (`{"action", "observation", "response"}` arrays — confirmed via <https://huggingface.co/datasets/JetBrains-Research/swe-traj-complete>); mini-swe-agent uses `.traj.json` with OpenAI messages format (<https://huggingface.co/datasets/tarsur385/swebench-verified-trajectories>).
+- **Decisive**: heterogeneous schema = fragile ingester = wrong choice for *first* spike.
+### Why we are NOT picking Aider (d)
+- The `chat_history_file` is **markdown** (`.aider.chat.history.md`), per <https://aider.chat/docs/config/dotenv.html>. Source code at <https://github.com/Aider-AI/aider/blob/bdb4d9ff/aider/history.py> shows it's literally `f.write(text)` of formatted prose with `####` for user input.
+- **Decisive**: tool calls in Aider are *applied as edits*, not preserved as discrete structured actions in the markdown log. Reconstructing "the action the student took at step k" is lossy. The `.aider.llm.history` log is closer to what we want but is opt-in and not always present.
+### Why we are NOT picking Cline (b)
+- No public commitment to a stable export schema. Cline's storage is internal to the VS Code extension (workspace state DB + per-task JSON in extension storage). Searching for "Cline trace export schema" yields no Anthropic-style spec doc. Workable in principle, but reverse-engineering an extension's storage is not the right ground for a 1-week spike.
+### Why we are NOT picking SWE-smith-trajectories (f)
+- This is the **strongest external dataset** we found and **should be Spike 007's stretch goal, or the primary source for the later cross-validation phase (§2)**: 5,017 fine-tuning trajectories from SWE-agent + Claude 3.7 Sonnet, 4.22 GB on HuggingFace, OpenAI messages format. Source: <https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories>.
+- **Why not first**: the messages-only format collapses tool calls and tool results into the OpenAI chat-completions wire format with text-encoded tool blocks. That works for SFT but is *less* signal-dense for the teacher-correction spike than Claude Code's `tool_use` blocks: in Claude Code's format each tool call's `name` and `input` are structurally separate fields, which makes "did the teacher pick a different tool?" a one-line check.
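To illustrate the "did the teacher pick a different tool?" check: a minimal sketch, assuming both student and teacher actions are serialized as a JSON list of `{"name", "input"}` objects (the encoding used by the ingester sketch in §6). The teacher-side format is an assumption, and both function names are hypothetical.

```python
import json


def tool_names(action: str) -> list:
    """Tool names from a JSON-encoded action; [] for plain-text actions."""
    try:
        calls = json.loads(action)
    except json.JSONDecodeError:
        return []
    if not isinstance(calls, list):
        return []
    return [c.get("name") for c in calls if isinstance(c, dict)]


def picked_different_tool(student_action: str, teacher_action: str) -> bool:
    """The one-line check: compare the ordered tool names of both actions."""
    return tool_names(student_action) != tool_names(teacher_action)


student = json.dumps([{"name": "Bash", "input": {"command": "ls"}}])
teacher = json.dumps([{"name": "Read", "input": {"file_path": "README.md"}}])
print(picked_different_tool(student, teacher))
```

With text-encoded tool blocks, the same comparison would first need a parser for each submitter's text convention.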
+---
+## 6. TraceIngester sketch
+Drop-in adapter for spike-005's `replay_trace()`. Targets `TraceState` (the actual existing TypedDict; see §1).
+```python
+# spikes/007-trace-ingester/trace_ingester.py
+from __future__ import annotations
+import json
+from collections.abc import Iterator
+from pathlib import Path
+from typing import Any
+# Re-use the existing TypedDicts from spike-005:
+#   from spikes.005_integrated_trainer_skeleton.teacher_replay import TraceState
+# A "step" in the trace is each assistant record with a non-empty action
+# (tool_use payloads, or text for text-only turns). The state visible to the
+# model at that step = all messages strictly before it, in OpenAI/Anthropic
+# chat format. The student_action = the tool_use payload(s) or the text.
+def _record_to_chat_message(rec: dict, *, skip_thinking: bool = True) -> dict | None:
+    """Turn one Claude Code JSONL record into an OpenAI/Anthropic chat-message
+    dict, or return None for non-conversational records (queue-operation,
+    attachment, file-history-snapshot, system, last-prompt, summary) and for
+    records whose content is empty once thinking blocks are stripped."""
+    t = rec.get("type")
+    if t not in ("user", "assistant"):
+        return None
+    msg = rec.get("message")
+    if not isinstance(msg, dict):
+        return None
+    role = msg.get("role")
+    content = msg.get("content")
+    if role not in ("user", "assistant") or content is None:
+        return None
+    # Strip thinking blocks: they are not portable across teacher models and
+    # should not influence the teacher's decision at replay time.
+    if skip_thinking and isinstance(content, list):
+        content = [c for c in content
+                   if not (isinstance(c, dict) and c.get("type") == "thinking")]
+        if not content:
+            return None  # nothing left to replay; keep empty turns out of history
+    return {"role": role, "content": content}
+def _serialize_action(content_blocks: list[dict]) -> str:
+    """Canonicalize the student's action at a step.
+    For tool_use steps: JSON-encode the (name, input) pairs.
+    For text-only steps: return the concatenated text.
+    """
+    tool_uses = [b for b in content_blocks if isinstance(b, dict) and b.get("type") == "tool_use"]
+    if tool_uses:
+        return json.dumps(
+            [{"name": tu.get("name"), "input": tu.get("input")} for tu in tool_uses],
+            sort_keys=True,
+        )
+    texts = [b.get("text", "") for b in content_blocks if isinstance(b, dict) and b.get("type") == "text"]
+    return "\n".join(t for t in texts if t)
+class TraceIngester:
+    """Reads a Claude Code session JSONL and yields TraceState records.
+    One TraceState is emitted per assistant record whose serialized action has
+    at least `min_action_chars` characters. The `messages` field is the prior
+    conversation (user/assistant records only; no system prompt is injected)
+    up to but not including the current assistant turn; `student_action` is
+    the canonicalized serialization of that turn's content blocks.
+    """
+    def __init__(self, *, skip_thinking: bool = True, min_action_chars: int = 1) -> None:
+        self.skip_thinking = skip_thinking
+        self.min_action_chars = min_action_chars
+    def ingest(self, path: str | Path) -> Iterator[dict]:  # yields TraceState
+        path = Path(path)
+        prior_messages: list[dict] = []
+        session_id_for_state = path.stem  # filename = session UUID
+        with path.open("r", encoding="utf-8") as f:
+            for line_idx, line in enumerate(f):
+                line = line.strip()
+                if not line:
+                    continue
+                try:
+                    rec = json.loads(line)
+                except json.JSONDecodeError:
+                    continue  # tolerate truncated last-line writes
+                if not isinstance(rec, dict):
+                    continue  # valid JSON, but not a record object
+                # Honor the constructor flag instead of always stripping.
+                chat_msg = _record_to_chat_message(rec, skip_thinking=self.skip_thinking)
+                if chat_msg is None:
+                    continue
+                if chat_msg["role"] == "assistant":
+                    # Emit a TraceState representing "before this turn".
+                    blocks = chat_msg["content"] if isinstance(chat_msg["content"], list) else []
+                    student_action = _serialize_action(blocks)
+                    if len(student_action) >= self.min_action_chars:
+                        yield {
+                            "state_id": f"{session_id_for_state}:{rec.get('uuid', line_idx)}",
+                            "messages": list(prior_messages),    # snapshot
+                            "student_action": student_action,
+                        }
+                # Append to history regardless (so subsequent turns see it).
+                prior_messages.append(chat_msg)
+```
+Notes:
+- We skip `thinking` blocks because (1) they're Anthropic-specific and (2) feeding them to other-vendor teachers (GPT/DeepSeek) leaks reasoning the teacher should produce on its own. This matches the philosophy used in spike-005's `_normalize_action`.
+- We do NOT inject a system prompt — Claude Code's initial system prompt is not in the JSONL (it's set at SDK init and visible only via `attachment` records). Downstream callers may want to prepend a synthetic system message for teacher fairness. Open question for ADR-002.
+- `state_id = f"{sessionId}:{recordUuid}"` is globally unique and stable across re-ingest.
+- Failures (unparseable lines, missing fields) are tolerated silently. A counters-based sibling method `ingest_with_stats(path)` is a small follow-up.
+### 6.1 Smoke-test plan (for Spike 007 itself)
+```python
+ingester = TraceIngester()
+states = list(ingester.ingest("/home/codeseys/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-iteration-checkpoint/e4a34e2b-40c6-49ce-b253-912a43224aae.jsonl"))
+# Expect roughly 197 states (matches asst-message count counted in §4).
+# Then teacher-replay on the first 5 states, confirm cost is in the
+# spike-001 ballpark ($0.05–$0.20 for 5 states × 3 teachers).
+```
+Spike 001 baseline to beat: $0.98/trace mean (50-state synthetic), $0.30/trace projected with VOI gating. On real states a ~5–20× cost increase is plausible due to longer message histories (10k+ tokens vs synthetic ~300 tokens), so a relevant **economic check** for Spike 007 is: if the first 5 states cost > $5 (i.e. > $1/state), the VOI gate from Spike 001 is *required* before scaling. Flag this finding in the spike write-up.
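The economic check above can be worked through numerically. This sketch assumes the $0.98/trace baseline covers all 50 synthetic states and that cost scales linearly with the number of states; both are simplifications, and the variable and function names are illustrative.

```python
BASELINE_PER_TRACE = 0.98  # Spike 001 mean, ungated, 50-state synthetic trace
STATES_PER_TRACE = 50
N_STATES = 5
GATE_THRESHOLD = 5.00      # dollars for the first 5 states, i.e. > $1/state

# Synthetic-equivalent cost per state and for a 5-state replay.
per_state = BASELINE_PER_TRACE / STATES_PER_TRACE
baseline_5 = per_state * N_STATES

# Apply the 5-20x multiplier expected for longer real-trace histories.
projected_lo = baseline_5 * 5
projected_hi = baseline_5 * 20


def voi_gate_required(observed_cost_5_states: float) -> bool:
    """True when the observed 5-state cost crosses the threshold stated above."""
    return observed_cost_5_states > GATE_THRESHOLD


print(f"per state: ${per_state:.4f}, 5 states: ${baseline_5:.3f}")
print(f"projected on real traces: ${projected_lo:.2f}-${projected_hi:.2f}")
```

Under these assumptions the synthetic-equivalent cost for 5 states is about $0.10, and the 5–20× multiplier puts the real-trace estimate at roughly $0.49 to $1.96, still below the $5 threshold. An observed cost above $5 would therefore mean the real multiplier is well beyond the expected range.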
+---
+## 7. Open questions for ADR-002
+1. Do we promote `TraceState` to a top-level `TraceExample` dataclass, with optional `teacher_id`, `reward`, `hint_text`? Or keep `TraceState` as ingester output and `DPOPair` as trainer input, treating the brief's "TraceExample" as conceptual?
+2. Should `TraceIngester.ingest()` emit one record per **assistant turn** (current sketch) or per **assistant `tool_use` block** within a turn? Some Claude Code records have multiple tool_use blocks in one assistant message.
+3. Synthetic system prompt at replay time — yes/no? If yes, what content?
+4. Trace-version pinning: hard-fail or warn when `version` field falls outside a known-tested range?
+5. Subagent transcripts (`agent-*.jsonl`) — include or skip? They are denser per-turn but their parent context is the orchestrator, not the user, which changes the teacher-replay semantics.
+---
+## 8. References (primary sources only)
+Anthropic / Claude Code official:
+- <https://code.claude.com/docs/en/sessions> — session storage location and "JSONL, one JSON per line"
+- <https://code.claude.com/docs/en/data-usage> — "local caching … session transcripts locally in plaintext under `~/.claude/projects/` for 30 days by default"
+- <https://code.claude.com/docs/en/legal-and-compliance> — Commercial Terms vs Consumer Terms applicability
+- <https://github.com/anthropics/claude-code/blob/1e95326e12183286fc6cbd828c8a86a0d8e03c62/LICENSE.md> — proprietary license
+Community schemas (reverse-engineered from real session data):
+- <https://github.com/moru-ai/agent-schemas/blob/main/claude-code/v2.1.1/session.schema.json> — JSON Schema Draft 2020-12, validated against ~50,000 messages from 480 sessions
+- <https://github.com/KyleAMathews/claude-code-ui/blob/main/spec.md> §"Claude Code Session Log Format" — Entry types and TypeScript discriminated union
+- <https://github.com/jamie-bitflight/claude_skills/blob/main/plugins/agentskill-kaizen/skills/transcript-analysis/references/session-log-schema.md> — top-level fields, project-key encoding, subagent file location
+- <https://github.com/dagster-io/erk/blob/master/docs/learned/sessions/layout.md> — directory structure, plan-mode `slug` field
+- <https://github.com/pedropaulovc/claude-code-types> — TypeScript type definitions from session logs
+Runners-up reference points:
+- OpenHands events: <https://docs.openhands.dev/sdk/arch/events>, <https://docs.openhands.dev/sdk/guides/convo-persistence>, <https://github.com/OpenHands/OpenHands/blob/3ec999e8/openhands/events/serialization/event.py>, <https://github.com/All-Hands-AI/OpenHands/issues/8701>
+- SWE-bench experiments: <https://github.com/swe-bench/experiments>
+- SWE-smith trajectories on HF: <https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories>
+- mini-swe-agent traj.json: <https://huggingface.co/datasets/tarsur385/swebench-verified-trajectories>
+- swe-traj-complete (SWE-agent format example): <https://huggingface.co/datasets/JetBrains-Research/swe-traj-complete>
+- Aider history file format: <https://aider.chat/docs/config/dotenv.html>, <https://github.com/Aider-AI/aider/blob/bdb4d9ff/aider/history.py>, <https://github.com/paul-gauthier/aider/blob/main/aider/io.py>
+Internal references:
+- `spikes/005-integrated-trainer-skeleton/teacher_replay.py` — `TraceState`, `DPOPair`, `replay_trace`, `extract_dpo_pairs` (read in full during this audit; see §1 for actual field list)
+- Spike 001 economic floor: $0.98/trace mean ungated, $0.30/trace projected with VOI gating