Codeseys committed · Commit ac4bfb4 · 1 Parent(s): 040eff8

Wave 7: Phase 2-4 of deep work loop — backlog, parallel research, three ADRs

Imports the gap-closer wave from VISION_VALIDATION.md as a structured backlog,
records the deep work loop log, and locks three architecture decisions backed
by primary-source research from three subagent recons.

Backlog items (CPU-only, no GPU budget):
- Spike 006 — real HF model smoke (Wave 7 next)
- Spike 007 — real trace ingestion (Wave 8)
- Spike 008 — Streaming DiLoCo smoke (Wave 9)
- Wave 10 — packaging (pyproject.toml + examples/)
- Spike 002a-mini — Modal-gated GPU smoke (Phase 10)

Research (docs/research/, 1168 lines total):
- MODAL_RECONNAISSANCE.md — Modal pricing + setup, primary-sourced from
modal.com/pricing and modal.com/docs. Verdict: Modal L4 is $0.08-0.13 per
smoke run but loses to local 5090 by 10x on iteration cycle (3-5min vs
25-40s). Modal becomes correct for parallel sweeps, 7B+ models, multi-node
training, or CI repro — none of which apply to gap-closer wave.
- DILOCO_RECONNAISSANCE.md — audited 5 candidates. meta-pytorch/torchft wins:
BSD-3, 312 commits, HEAD 2026-04-03, library-not-research-code, prebuilt
wheels, single-process unit-testable via MagicMock(Manager) pattern. Their
DiLoCo class IS the Streaming generalization (vanilla = single fragment).
Sign-convention mismatch flagged for explicit test in Spike 008.
- TRACE_SOURCE_RECONNAISSANCE.md — corrected the existing dataclass (it's
TraceState TypedDict, not TraceExample). Recommended Claude Code session
JSONL: 1015 local sessions on this machine, zero acquisition cost, 6762
tool_use messages across 5 pre-selected sessions, schema validated by 4
independent community projects + JSON Schema validated against ~50000 real
messages.

ADRs (docs/adrs/):
- ADR-001 — GPU venue: local 5090. Modal stashed for parallel sweeps and 7B+
workloads. Migration path documented.
- ADR-002 — Trace source: Claude Code session JSONL. Pattern opens door for
OpenHands and SWE-smith ingesters in v0.2.
- ADR-003 — DiLoCo impl: torchft.local_sgd.DiLoCo with shared-buffer mock
allreduce for Spike 008 single-process test. Sign-convention mismatch
caught with explicit unit test.

Deep work loop log: docs/DEEP_WORK_LOOP_LOG.md tracks all 12 phases with
status. Phases 1-4 complete; phase 5 (planning waves 7-10) is next.

Tests still 38/38 green; no code changes in this wave.

Refs: docs/VISION_VALIDATION.md gaps V2/V4/V5/V8.

BACKLOG.md ADDED
@@ -0,0 +1,92 @@
1
+ # Backlog — Composer 2.5 Replication Framework
2
+
3
+ Imported from `docs/VISION_VALIDATION.md` § 6 (gaps) + § 9 (gap-closers) at 2026-05-26.
4
+
5
+ ## Active items (CPU-only, no GPU budget)
6
+
7
+ ### Spike 006 — Real HF model smoke (Wave 7)
8
+
9
+ **Closes**: V8 ("any HF model") — currently we run only mock 4-layer toy LM through `composer_total_loss`.
10
+
11
+ **Goal**: prove the 3-channel loss (`grpo + α·sdpo_kl + β·trace_replay_dpo`) survives a real `transformers` model + tokenizer with finite gradients and a decreasing loss across N steps.
12
+
13
+ **Acceptance**:
14
+ 1. `AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")` loads on CPU.
15
+ 2. Real tokenizer `apply_chat_template` produces `input_ids` shape that flows through `composer_total_loss(model, batch)` without mock shapes.
16
+ 3. 5 backward steps run on CPU without `nan` / `inf` / shape mismatch.
17
+ 4. Loss trends downward across the 5 steps (strict monotonicity not required; allow noise).
18
+ 5. New tests added under `spikes/006-real-hf-model-smoke/tests/` pass alongside existing 38.
19
+
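Acceptance items 3 and 4 can be checked mechanically once the per-step losses are collected. The following is a minimal sketch, assuming the losses arrive as plain floats. `losses_are_finite` and `loss_trends_down` are hypothetical helper names, and the least-squares slope is only one possible reading of "trend; allow noise", not a definition taken from the spike.

```python
import math


def losses_are_finite(losses):
    """True when no recorded loss is NaN or +/-inf."""
    return all(math.isfinite(x) for x in losses)


def loss_trends_down(losses):
    """True when the least-squares slope of loss vs. step index is negative.

    A single noisy up-tick does not fail the check; only the overall
    trend across the run matters.
    """
    n = len(losses)
    if n < 2:
        raise ValueError("need at least two steps to measure a trend")
    mean_step = (n - 1) / 2
    mean_loss = sum(losses) / n
    covariance = sum((i - mean_step) * (y - mean_loss) for i, y in enumerate(losses))
    variance = sum((i - mean_step) ** 2 for i in range(n))
    return covariance / variance < 0


# Hypothetical 5-step run with one noisy up-tick at step 3 (made-up numbers).
example = [2.31, 2.12, 2.05, 2.09, 1.88]
print(losses_are_finite(example), loss_trends_down(example))
```

A slope test tolerates the up-tick at step 3, whereas a strict step-by-step comparison would reject the same run.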
20
+ **Estimate**: half a day, CPU only.
21
+
22
+ ### Spike 007 — Real trace ingestion (Wave 8)
23
+
24
+ **Closes**: V5 ("real LLM-application traces") — Spike 001 used 50 hand-crafted states. Brief said "real traces."
25
+
26
+ **Goal**: pick ONE real agent-session log format with a stable, public schema; write a `TraceIngester` that converts it to our `TraceState` TypedDict; and run it end-to-end through the data collator + a trimmed cost-floor measurement on 5 real states.
27
+
28
+ **Acceptance**:
29
+ 1. ADR-002 picks the trace source (Claude Code JSONL / Cline / OpenHands / Aider / SWE-Bench-Lite trajectories).
30
+ 2. `TraceIngester.ingest(path: Path) -> Iterator[TraceState]` is implemented + has unit tests with a fixture log file.
31
+ 3. End-to-end smoke: real trace → ingester → collator → 1-step `composer_total_loss` runs without error.
32
+ 4. Cost-floor measurement: 5 real states × 3 teachers, p95 latency + cost report appended to `spikes/007-*/verdict.md`.
33
+
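The cost-floor report in item 4 reduces to a percentile and a sum over 15 teacher calls (5 states × 3 teachers). This is a sketch under the assumption that each call is logged as a `(latency_seconds, cost_usd)` pair. The nearest-rank percentile and the helper names are illustrative choices, not part of the spike.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("no measurements")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-indexed rank
    return ordered[max(rank, 1) - 1]


def summarize(calls):
    """calls: iterable of (latency_seconds, cost_usd) pairs, one per teacher call."""
    calls = list(calls)
    latencies = [latency for latency, _ in calls]
    return {
        "n_calls": len(calls),
        "p95_latency_s": nearest_rank_percentile(latencies, 95),
        "total_cost_usd": sum(cost for _, cost in calls),
    }
```

With only 15 samples the nearest-rank p95 is simply the slowest call, so the report is worth reading alongside the raw latencies.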
34
+ **Estimate**: 1 day + ~$2 OpenRouter.
35
+
36
+ ### Spike 008 — Streaming DiLoCo smoke (Wave 9)
37
+
38
+ **Closes**: V2 (DiLoCo "deferred to v0.2" — drift from original brief).
39
+
40
+ **Goal**: bolt outer-loop pseudo-gradient sync onto the loss composition test using two `nn.Module` replicas on the same node. No real distributed training (CPU multiprocessing or single-process).
41
+
42
+ **Acceptance**:
43
+ 1. ADR-003 picks the DiLoCo variant (vanilla DiLoCo from arXiv:2311.08105 / Streaming DiLoCo from arXiv:2501.18512 / Async-DiLoCo).
44
+ 2. `outer_optimizer.py` implements pseudo-gradient = (θ_local − θ_initial), Nesterov-momentum outer step.
45
+ 3. Smoke test: 2 replicas × 4 inner steps × 2 outer rounds on the toy model from Spike 005, both replicas converge toward the same solution within tolerance.
46
+ 4. 38 existing tests still pass (no regression).
47
+
48
+ **Estimate**: 2 days, CPU.
49
+
50
+ ### Wave 10 — Packaging
51
+
52
+ **Closes**: V4 ("skeleton not framework").
53
+
54
+ **Goal**: turn the assemblage of spike directories into an installable Python package with a clear quickstart.
55
+
56
+ **Acceptance**:
57
+ 1. `pyproject.toml` at repo root, package name `composer_replication`.
58
+ 2. `composer_replication/` dir with `__init__.py` re-exporting `composer_total_loss`, `OPSDLoss`, `TeacherReplayBuffer`, `compose_loss`, `TraceIngester`, etc.
59
+ 3. `examples/qwen3_05b_quickstart/` with end-to-end script that loads model, runs 10 training steps, prints loss curve.
60
+ 4. README quickstart updated to `pip install -e .` + `python examples/qwen3_05b_quickstart/run.py`.
61
+ 5. `pip install -e .` succeeds and quickstart runs end-to-end on CPU.
62
+
63
+ **Estimate**: half a day, CPU.
64
+
65
+ ## Modal-gated (if budget allows after gap-closers)
66
+
67
+ ### Spike 002a-mini — Real GPU smoke (Phase 10)
68
+
69
+ **Closes**: the "did we ever run gradients on GPU" ambiguity — currently everything is CPU-only.
70
+
71
+ **Goal**: dispatch a 30-min A10G smoke on Modal that runs Spike 006 unchanged on GPU, verifies bf16 numerics, captures memory + step-time.
72
+
73
+ **Acceptance**:
74
+ 1. ADR-001 says Modal is the right choice for this workload + estimate is < $5.
75
+ 2. Modal app builds, runs `composer_total_loss` for 50 steps on Qwen2.5-0.5B-Instruct.
76
+ 3. Loss curve + memory profile saved to `spikes/002a-mini/` and pulled to local.
77
+ 4. No new shape / dtype bug surfaced vs CPU run.
78
+
79
+ **Estimate**: $1–3, 30 min wall-clock.
80
+
81
+ ## Deferred (post-loop, GPU-gated)
82
+
83
+ - Spike 002a/002b — full trace collection on A100 ($30–50)
84
+ - Spike 003 — DPO-pair signal density study
85
+ - Spike 004 — A/B SWE-bench-lite with α=0/β=0 vs α>0/β>0
86
+ - Publication wave — author identity, thumbnail, X tags, post sequence
87
+
88
+ ## Process notes
89
+
90
+ - Acceptance criteria are explicit and binary. Don't claim "done" unless every box ticks.
91
+ - Each spike has its own `spikes/00N-name/` dir + `verdict.md` recording acceptance + delta from estimate.
92
+ - Re-audit BACKLOG.md at end of each wave; archive completed items with their final SHAs.
docs/DEEP_WORK_LOOP_LOG.md ADDED
@@ -0,0 +1,47 @@
1
+ # Deep Work Loop Log — Composer 2.5 Replication Framework
2
+
3
+ Started: 2026-05-26
4
+ Operator: Codeseys (Hermes Agent autonomous loop)
5
+ Skill: `deep-work-loop` v1.0.0
6
+
7
+ ## Vision
8
+
9
+ > Take any HuggingFace model → further RL train it using:
10
+ > 1. RLVR (tests-pass reward),
11
+ > 2. SDPO/hint-distillation (Composer 2.5's "targeted RL with textual feedback"),
12
+ > 3. multi-teacher trace-replay DPO,
13
+ > integrated against TRL/VeRL/OpenEnv with DiLoCo-style outer loop sync.
14
+ >
15
+ > Output: a published, reproducible framework — the "Composer 2.5 replication" the open ecosystem is missing.
16
+
17
+ ## Starting state
18
+
19
+ - HEAD: `040eff8` (Wave 6: vision validation self-audit, 5/10 scorecard)
20
+ - Tests: 38/38 green in `spikes/005-integrated-trainer-skeleton/`
21
+ - Working tree: clean
22
+
23
+ ## Phase ledger
24
+
25
+ | Phase | Description | Status | Started | Done |
26
+ |---|---|---|---|---|
27
+ | 1 | commit-state | ✅ | 2026-05-26 | 2026-05-26 |
28
+ | 2 | backlog-audit (BACKLOG.md from VISION_VALIDATION) | ✅ | 2026-05-26 | 2026-05-26 |
29
+ | 3 | parallel-research (3 subagents) | 🟡 | 2026-05-26 | |
30
+ | 4 | architect with ADRs (ADR-001..003) | ⏳ | | |
31
+ | 5 | plan in waves (W7–W10) | ⏳ | | |
32
+ | 6 | execute W7 — Spike 006 (real HF model smoke) | ⏳ | | |
33
+ | 7 | execute W8 — Spike 007 (real trace ingestion) | ⏳ | | |
34
+ | 8 | execute W9 — Spike 008 (DiLoCo smoke) | ⏳ | | |
35
+ | 9 | execute W10 — packaging | ⏳ | | |
36
+ | 10 | (Modal-gated) Spike 002a-mini real GPU smoke | ⏳ | | |
37
+ | 11 | cross-model-final-review | ⏳ | | |
38
+ | 12 | update scorecard + push | ⏳ | | |
39
+
40
+ ## Constraints
41
+
42
+ - Verify ALL claims against primary sources (Wave 2 lesson — subagent synthesis is not evidence).
43
+ - Tests must pass before commit.
44
+ - Memory L1 is at 99% — write to L2 wiki + L3 fact_store, not L1.
45
+ - Modal budget: $20 hard cap for this loop. Anything more goes to user for approval.
46
+ - No `upload_file` mixing with `git push` — `git push hf master:main` only.
47
+ - Commit messages via `-F /tmp/<wave>-commit-msg.txt`.
docs/adrs/ADR-001-gpu-venue.md ADDED
@@ -0,0 +1,102 @@
1
+ # ADR-001 — GPU venue for Spike 002a-mini smoke
2
+
3
+ **Status**: Accepted
4
+ **Date**: 2026-05-26
5
+ **Wave**: Phase 4 (deep work loop)
6
+
7
+ ## Context
8
+
9
+ Spike 002a-mini is the optional Phase-10 gate in the deep work loop: take the
10
+ real-HF-model loss-composition smoke (Spike 006) and run it on GPU to confirm
11
+ bf16 numerics, capture memory + step-time, and rule out CPU-only blind spots
12
+ before publishing the framework.
13
+
14
+ The user has:
15
+ - a 5090 (32 GB VRAM, Blackwell) on the local box (this WSL host)
16
+ - a configured Modal account (~/.modal.toml present, modal CLI installed)
17
+
18
+ The workload:
19
+ - `Qwen/Qwen2.5-0.5B-Instruct` (~1 GB bf16 weights)
20
+ - ~50 forward+backward steps through the 3-channel loss
21
+ - single GPU, no distributed training, no FSDP
22
+
23
+ ## Options considered
24
+
25
+ ### Option A — Local 5090
26
+
27
+ - Free, no rate limit, no cold start.
28
+ - Iteration loop: code change → run → fix → run is **~25-40 s wall-clock per cycle**.
29
+ - 32 GB VRAM > 24 GB needed for this workload.
30
+ - WSL CUDA path is the same one we use for eidolon training already; toolchain proven.
31
+ - Reuses local HF cache (~/.cache/huggingface), no re-download per run.
32
+
33
+ ### Option B — Modal L4 ($0.000222/sec ≈ $0.799/hr)
34
+
35
+ - $0.08-0.13 per smoke run (3-7 min wall-clock incl cold start).
36
+ - Iteration loop: code change → modal-run dispatch → image build (cached) → cold
37
+ start → run → modal volume get → fix is **~3-5 min per cycle even on a cache hit**.
38
+ - Persistent volume saves model re-download across runs.
39
+ - Decoupled from local environment state.
40
+ - Extensively documented gotchas in `mlops/modal-llm-training` skill (M1-M9).
41
+
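The rate conversion in the Option B heading and the cycle-time gap between Options A and B are plain arithmetic. The sketch below covers GPU seconds only, using the per-second L4 rate quoted above. It does not model any other billed resource, so it is not expected to reproduce the per-run dollar range exactly. Function names are illustrative.

```python
L4_USD_PER_SECOND = 0.000222  # per-second L4 rate quoted in Option B


def gpu_cost_usd(seconds, rate=L4_USD_PER_SECOND):
    """GPU-time cost only; other billed resources are not modeled."""
    return seconds * rate


def slowdown_range(fast_cycle_s, slow_cycle_s):
    """Best- and worst-case ratio between two (min, max) cycle-time ranges."""
    fast_lo, fast_hi = fast_cycle_s
    slow_lo, slow_hi = slow_cycle_s
    return slow_lo / fast_hi, slow_hi / fast_lo


hourly = gpu_cost_usd(3600)                    # about $0.799 per hour
ratios = slowdown_range((25, 40), (180, 300))  # local 25-40 s vs Modal 3-5 min
print(f"${hourly:.3f}/hr, Modal cycle {ratios[0]:.1f}x to {ratios[1]:.1f}x slower")
```

On these ranges the slowdown spans 4.5× to 12×, so a "10×" figure sits toward the upper end of the spread.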
42
+ ### Option C — Modal A100-40GB
43
+
44
+ - ~3× cost of L4 for 0.5B workload that doesn't need the capacity. Ruled out.
45
+
46
+ ## Decision
47
+
48
+ **Option A — local 5090.** The 5090 dominates Modal L4 on every dimension that
49
+ matters for a 0.5B-parameter verification smoke:
50
+
51
+ | Dimension | 5090 (local) | Modal L4 |
52
+ |---|---|---|
53
+ | Iteration cycle | 25-40 s | 3-5 min (10× slower) |
54
+ | $ / smoke run | $0 | $0.10 |
55
+ | VRAM headroom | 32 GB > 24 GB needed | 24 GB ≈ 24 GB needed |
56
+ | State decoupling | Same machine as dev | Decoupled (advantage Modal) |
57
+ | Toolchain risk | Already proven | New for this workload |
58
+
59
+ The "decoupled state" advantage of Modal is real but doesn't outweigh the 10×
60
+ iteration penalty for what is fundamentally a verification step. We're not
61
+ running production training; we're checking that a GPU run agrees with the
62
+ CPU run we just did.
63
+
64
+ ## Consequences
65
+
66
+ ### Accepted
67
+
68
+ - Spike 002a-mini becomes a **local 5090** smoke, not a Modal job.
69
+ - The `mlops/modal-llm-training` skill's L4 pattern (modal_app.py skeleton in
70
+ `docs/research/MODAL_RECONNAISSANCE.md`) is **stashed for future use** — it's
71
+ the right pattern when we DO need cloud GPU.
72
+ - `docs/research/MODAL_RECONNAISSANCE.md` stays in the repo as the design
73
+ document for the Modal path; the file documents *why* we didn't use Modal
74
+ for this smoke and *when* Modal becomes correct.
75
+
76
+ ### Modal becomes the right choice when
77
+
78
+ 1. **Parallel parameter sweeps** — N independent runs across α, β, lr, etc.
79
+ that need to fan out faster than wall-clock-sequential on a single 5090.
80
+ 2. **Scaling to ≥7B base models** — 5090's 32 GB starts to bind on 7B + LoRA
81
+ + activation memory at seq 4096+. A100-40 or H100 becomes necessary.
82
+ 3. **Multi-node training** — DiLoCo-style outer-loop across 2+ physical
83
+ nodes for the eventual full RL run.
84
+ 4. **CI / reproducibility** — a future contributor wants to repro our results
85
+ without owning a 5090.
86
+
87
+ These are all **post-replication** workloads. The deep work loop's gap-closer
88
+ phase (W7-W10) doesn't need any of them.
89
+
90
+ ### Trade-offs explicitly accepted
91
+
92
+ - We carry one local-environment dependency (WSL CUDA + the 5090 driver) that
93
+ Modal would have absorbed. Mitigated by: the same dependency is already
94
+ exercised by eidolon training, so the marginal risk is zero.
95
+ - We don't get an audit-friendly "Modal app run with persistent receipts"
96
+ artifact. Mitigated by: capturing `nvidia-smi` snapshots + step-time CSV
97
+ into `spikes/006-real-hf-model-smoke/results/` as our local audit trail.
98
+
99
+ ## Source
100
+
101
+ `docs/research/MODAL_RECONNAISSANCE.md` (subagent recon, primary-sourced from
102
+ modal.com/pricing and modal.com/docs, 2026-05-26).
docs/adrs/ADR-002-trace-source.md ADDED
@@ -0,0 +1,131 @@
1
+ # ADR-002 — Trace source for Spike 007 (real LLM-application traces)
2
+
3
+ **Status**: Accepted
4
+ **Date**: 2026-05-26
5
+ **Wave**: Phase 4 (deep work loop)
6
+
7
+ ## Context
8
+
9
+ Spike 007 closes V5 of the vision validation: "real LLM-application traces."
10
+ Spike 001 used 50 hand-crafted synthetic states for the cost-floor measurement.
11
+ The framework's brief explicitly said *real traces*, so we owe Spike 007 a
12
+ primary-sourced ingestion path that converts a real, public, multi-turn agent
13
+ trace format into our existing `TraceState` TypedDict.
14
+
15
+ Existing schema (verified from `spikes/005-integrated-trainer-skeleton/teacher_replay.py`):
16
+
17
+ ```python
18
+ class TraceState(TypedDict):
19
+ state_id: str # unique within the trace
20
+ messages: list[dict] # OpenAI-style conversation up to + incl this step
21
+ student_action: str # what the student did at this step
22
+ ```
23
+
24
+ (Earlier deep-work-loop notes called this `TraceExample`; that was a naming
25
+ error. The actual type is `TraceState` and there is no `TraceExample`.)
26
+
27
+ ## Options considered
28
+
29
+ | Option | Schema | Acquisition | Signal density | License |
30
+ |---|---|---|---|---|
31
+ | (a) Claude Code session JSONL | Documented + 4 reverse-engineered schemas | **1,015 local sessions** zero-cost | per-step `tool_use` blocks = ideal teacher-correction sites | User-owned local files; framework MIT |
32
+ | (b) Cline VS Code extension | No stable export schema | Would need custom extraction | Unknown until extracted | Apache 2.0 (extension), trace data user-owned |
33
+ | (c) OpenHands trajectories | Documented (v0/v1 in flux) | Need to run OpenHands or download leaderboard submissions | Strong | MIT |
34
+ | (d) Aider chat history | Markdown chat (lossy for tool calls) | Local only if user runs Aider | Weak — collapses tool structure | Apache 2.0 |
35
+ | (e) SWE-bench leaderboard trajs | Heterogeneous, free-format | Public download | Strong but uneven | Per-submission (mostly permissive) |
36
+ | (f) SWE-smith-trajectories (HF) | Messages-only, structure collapsed | HF dataset download | Strong but lossy | MIT |
37
+
38
+ Source: `docs/research/TRACE_SOURCE_RECONNAISSANCE.md` (2026-05-26 subagent recon).
39
+
40
+ ## Decision
41
+
42
+ **Option (a) — Claude Code session JSONL** at `~/.claude/projects/<encoded>/<sessionid>.jsonl`.
43
+
44
+ Wins on every axis we care about for Spike 007:
45
+
46
+ 1. **Acquisition cost: zero.** 1,015 real sessions already on this machine
47
+ from the user's daily Claude Code use. No download, no consent
48
+ negotiation, no rate limiting, no schema change risk during ingestion
49
+ development.
50
+
51
+ 2. **Schema stability: empirically validated.** The subagent ran a programmatic
52
+ audit on 8 real sessions; record types are stable across all of them.
53
+ Anthropic publishes user-facing docs for the format; four independent
54
+ community projects (claude-code-cli-tools, claudeflow, etc.) ship
55
+ working parsers including one with a JSON Schema validated against
56
+ ~50,000 real messages.
57
+
58
+ 3. **Signal density: maximal.** Every `tool_use` block is a candidate
59
+ teacher-correction site. The 5 pre-selected sessions in the recon doc
60
+ contain 6,762 tool_use messages (range 125 → 2,830 per session). That's
61
+ roughly 135× as many candidate sites as Spike 001's 50 synthetic states.
62
+
63
+ 4. **License: clean.** The trace files are user-owned files on the user's
64
+ own machine. We don't redistribute them with the framework. The
65
+ *ingester* code we write is MIT and ships in the framework. Anyone
66
+ running the framework who wants real-trace ingestion uses their own
67
+ local Claude Code sessions.
68
+
69
+ ## Consequences
70
+
71
+ ### Accepted
72
+
73
+ - Spike 007 implements `TraceIngester.ingest(path: Path) -> Iterator[TraceState]`
74
+ for the Claude Code JSONL format.
75
+ - The TraceIngester ships as part of the package (Wave 10 packaging) under
76
+ `composer_replication.ingestion.claude_code`.
77
+ - The recon doc's 5 pre-selected real sessions become the **smoke fixture**
78
+ for Spike 007's tests. We pin to a known set of session IDs so the test
79
+ is deterministic locally; CI users substitute their own.
80
+ - `ingestion/` directory pattern is established now to support adding
81
+ ingesters for OpenHands and SWE-smith later if Spike 007 reveals
82
+ signal-density gaps.
83
+
84
+ ### Open questions resolved by ADR-002
85
+
86
+ 1. **Granularity** — One `TraceState` per assistant turn (not per `tool_use`).
87
+ A single assistant turn often emits multiple `tool_use` blocks for one
88
+ reasoning step; treating each tool_use as a separate state would
89
+ over-fragment the conversation. Discussion in TRACE_SOURCE_RECONNAISSANCE
90
+ §5.
91
+
92
+ 2. **`student_action` mapping** — The literal text of the assistant turn
93
+ (concatenated `text` blocks of the Claude message) becomes
94
+ `student_action`. The teacher-replay channel asks N teachers to produce
95
+ their version of "what should the assistant do here?" given the
96
+ `messages` history; we then DPO-compare teacher consensus vs literal
97
+ student text.
98
+
99
+ 3. **Thinking blocks** — Strip `thinking` blocks from the message history
100
+ passed to teachers (teachers don't have access to Claude's reasoning
101
+ trace). KEEP them in the `student_action` for the student's own
102
+ reproduction loop, since that's the actual generation we'd be RL-training.
103
+
104
+ 4. **System prompt** — Inject a synthetic system prompt at message[0] of
105
+ each `TraceState` describing "you are a coding agent" so teachers
106
+ without their own coding-agent system prompt have a level playing field.
107
+
108
+ 5. **Subagent traces** — Skip them in v0.1; only ingest top-level sessions.
109
+ Subagent traces have a different structure (parent task ID etc.) that
110
+ would complicate the v0.1 ingester.
111
+
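The five resolutions above can be sketched as a small ingester. This is a minimal sketch, not the Spike 007 implementation. The record layout is an assumption about the JSONL format that must be verified against real sessions: a top-level `type` of `user` or `assistant`, a `message` object with `role` and `content`, content blocks typed `text`, `thinking`, or `tool_use`, and an `isSidechain` flag marking subagent records. `SYSTEM_PROMPT`, `KNOWN_TYPES`, `ingest_lines`, and the `state_id` scheme are hypothetical. Here `messages` holds the history before the assistant turn, which is one reading of the schema comment, and `tool_use` / `tool_result` blocks are dropped from the flattened text for brevity.

```python
import json
from typing import Iterable, Iterator, TypedDict


class TraceState(TypedDict):
    state_id: str
    messages: list[dict]
    student_action: str


SYSTEM_PROMPT = {"role": "system", "content": "You are a coding agent."}  # synthetic (item 4)
KNOWN_TYPES = {"user", "assistant"}  # assumed record types; anything else is skipped


def _block_text(content, keep_thinking):
    """Flatten message content (a string or a list of blocks) to plain text."""
    if isinstance(content, str):
        return content
    parts = []
    for block in content:
        kind = block.get("type")
        if kind == "text":
            parts.append(block.get("text", ""))
        elif kind == "thinking" and keep_thinking:
            parts.append(block.get("thinking", ""))
    return "\n".join(p for p in parts if p)


def ingest_lines(lines: Iterable[str], session_id: str = "session") -> Iterator[TraceState]:
    """Yield one TraceState per assistant turn (item 1)."""
    history = [SYSTEM_PROMPT]
    turn = 0
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("type") not in KNOWN_TYPES:
            continue  # degrade gracefully on unknown record types
        if record.get("isSidechain"):
            continue  # skip subagent traces in v0.1 (item 5)
        message = record.get("message", {})
        content = message.get("content", "")
        role = message.get("role", record["type"])
        if role == "assistant":
            yield TraceState(
                state_id=f"{session_id}:{turn}",
                messages=list(history),
                # student_action keeps thinking text (item 3)
                student_action=_block_text(content, keep_thinking=True),
            )
            turn += 1
        # history shown to teachers never carries thinking text (item 3)
        history.append({"role": role, "content": _block_text(content, keep_thinking=False)})
```

Taking an iterable of lines rather than a path keeps the sketch testable without a fixture file; a `Path`-based `ingest` could wrap it by opening the file and passing the handle through.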
112
+ ### Recon-flagged risk (not blocking)
113
+
114
+ - Anthropic doesn't publish a versioned schema. The TraceIngester pins to
115
+ known record-types as of 2026-05-26 and gracefully degrades on unknown
116
+ types. If Anthropic ships a breaking change to the JSONL format, we'd
117
+ need to bump a `schema_version` constant in the ingester. Acceptable
118
+ ongoing maintenance burden.
119
+
120
+ ### Future ingesters
121
+
122
+ Open the door for two more ingesters in v0.2:
123
+ - `composer_replication.ingestion.openhands` — for users who run OpenHands
124
+ - `composer_replication.ingestion.swe_smith` — for users who download the HF dataset
125
+
126
+ Both follow the same `Iterator[TraceState]` contract.
127
+
128
+ ## Source
129
+
130
+ `docs/research/TRACE_SOURCE_RECONNAISSANCE.md` (subagent recon, primary-sourced
131
+ including direct inspection of the user's local sessions, 2026-05-26).
docs/adrs/ADR-003-diloco-impl.md ADDED
@@ -0,0 +1,100 @@
1
+ # ADR-003 — DiLoCo implementation choice for Spike 008
2
+
3
+ **Status**: Accepted
4
+ **Date**: 2026-05-26
5
+ **Wave**: Phase 4 (deep work loop)
6
+
7
+ ## Context
8
+
9
+ Spike 008 closes V2 of the vision validation: DiLoCo was tagged "deferred to
10
+ v0.2" in Wave 5 but the original brief said "combine with DiLoCo." We want a
11
+ real working integration, not a hand-rolled toy.
12
+
13
+ The integration target: take pseudo-gradient `δ = θ_local − θ_initial` after N
14
+ inner steps, apply Nesterov-momentum outer step across replicas. We need a
15
+ PyTorch-compatible reference implementation that runs in single-process for
16
+ unit tests AND scales out on torch.distributed when we eventually run real
17
+ multi-replica training.
18
+
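The integration target can be written out on plain Python lists to make the sign question concrete. This is a minimal sketch, not torchft's code. `outer_step` is a hypothetical function, and the default `lr` and `mu` values are illustrative. The Nesterov update follows the common SGD-with-momentum formulation (buffer = μ·buffer + g; step = g + μ·buffer). The pseudo-gradient is taken here as `θ_initial − θ_local`, so that an ordinary descent step moves the global parameters toward the replicas; with the opposite sign, the same descent step would move away from them.

```python
def outer_step(theta_initial, replica_params, momentum_buf, lr=0.7, mu=0.9):
    """One outer round: average pseudo-gradients, then a Nesterov SGD step.

    The pseudo-gradient per replica is theta_initial - theta_local, averaged
    across replicas (the role an allreduce plays in a distributed run).
    Returns (new_theta, new_momentum_buf).
    """
    n = len(replica_params)
    avg_grad = [
        sum(t0 - local[i] for local in replica_params) / n
        for i, t0 in enumerate(theta_initial)
    ]
    new_buf = [mu * b + g for b, g in zip(momentum_buf, avg_grad)]
    update = [g + mu * b for g, b in zip(avg_grad, new_buf)]
    new_theta = [t0 - lr * u for t0, u in zip(theta_initial, update)]
    return new_theta, new_buf


theta0 = [1.0, 2.0]
replicas = [[0.0, 2.0], [0.5, 1.0]]  # parameters after the inner steps
theta1, buf = outer_step(theta0, replicas, [0.0, 0.0], lr=1.0, mu=0.0)
print(theta1)  # with lr=1 and mu=0 the outer step lands on the replica mean
```

The `lr=1, mu=0` case doubles as a sign test: if the result is not the replica mean, the pseudo-gradient and the step direction disagree.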
19
+ ## Options considered
20
+
21
+ | Repo | License | Last commit | Maturity | Streaming variant? | Single-process testable? |
22
+ |---|---|---|---|---|---|
23
+ | `meta-pytorch/torchft` | BSD-3 | 2026-04-03 | Library (PyPI, prebuilt wheels, real test suite, Meta-maintained) | Yes (DiLoCo class IS the Streaming generalization; vanilla = single fragment) | Yes (verified via `MagicMock(Manager)` + `_DummyWork` pattern in their own tests) |
24
+ | `OpenDiLoCo` (PrimeIntellect) | Apache 2.0 | 2024 | README says "no longer maintained"; replaced by `prime` | Partial | Hivemind dependency complicates testing |
25
+ | `prime` / INTELLECT-1 (PrimeIntellect) | Apache 2.0 | 2025 | Production framework (`ElasticDeviceMesh` etc.) | Yes | Heavy harness; not single-process friendly |
26
+ | `diloco_simple` | **No LICENSE file** | 2024-05-31 | 8 commits ever; pedagogical | No | NCCL-locked |
27
+ | DeepMind original (Douillard et al. arXiv:2311.08105) | — | — | No public reference impl | — | — |
28
+
29
+ Source: `docs/research/DILOCO_RECONNAISSANCE.md` (subagent recon, 2026-05-26).
30
+
31
+ ## Decision
32
+
33
+ **`meta-pytorch/torchft` — `torchft.local_sgd.DiLoCo`** (BSD-3, active,
34
+ single-process testable).
35
+
36
+ Rationale:
37
+
38
+ 1. **Library, not research code.** Proper packaging on PyPI with prebuilt
39
+ wheels (`pip install torchft-nightly`), real test suite, version history,
40
+ maintained by Meta. The other live candidates are research codebases that
41
+ break on torch version bumps.
42
+
43
+ 2. **Streaming DiLoCo is the generalization.** The `DiLoCo` class accepts
44
+ `model_fragments` + `fragment_sync_delay` + `fragment_update_alpha`. Set
45
+ `model_fragments=[model]` (single fragment, full-model sync) for vanilla
46
+ DiLoCo. Add fragments + per-fragment delays for Streaming. We don't have
47
+ to choose at the API level — both modes are one parameter apart.
48
+
49
+ 3. **Single-process unit-testable.** torchft's own tests use
50
+ `MagicMock(Manager)` + `_DummyWork` to bypass NCCL. We can do the same:
51
+ shared-buffer mock allreduce that does real averaging across two
52
+ in-process replicas. Verified working pattern in the recon doc.
53
+
54
+ 4. **Pseudo-gradient computation is in `_save_grads` (line 324) and
55
+ `perform_sync` (line 423).** Direct extension point — we can subclass or
56
+ monkey-patch these to compose with our Composer trainer.
57
+
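Rationale 3 can be illustrated without torchft installed. The sketch below is a stand-in, not torchft's API: it uses a bare `MagicMock()` rather than `MagicMock(Manager)`, plain lists rather than tensors, and a hypothetical `DummyWork` class in place of torchft's `_DummyWork`. The shared buffer collects each replica's list and, once every replica has contributed, overwrites all of them in place with the element-wise mean.

```python
from unittest.mock import MagicMock


class DummyWork:
    """Stand-in for an async work handle; wait() is a no-op."""

    def wait(self):
        return True


def make_mock_managers(num_replicas):
    """Return one mock manager per replica, all backed by one shared buffer."""
    pending = []  # lists deposited in the current allreduce round

    def allreduce(values):
        pending.append(values)
        if len(pending) == num_replicas:
            width = len(pending[0])
            mean = [sum(v[i] for v in pending) / num_replicas for i in range(width)]
            for deposited in pending:
                deposited[:] = mean  # overwrite in place, as an allreduce would
            pending.clear()
        return DummyWork()

    managers = []
    for _ in range(num_replicas):
        manager = MagicMock()
        manager.allreduce.side_effect = allreduce
        managers.append(manager)
    return managers


grads_a, grads_b = [1.0, 3.0], [3.0, 5.0]
manager_a, manager_b = make_mock_managers(2)
manager_a.allreduce(grads_a).wait()
manager_b.allreduce(grads_b).wait()
print(grads_a, grads_b)  # both lists now hold the element-wise mean
```

Values are averaged only when the last replica calls `allreduce`, so assertions on replica equality belong after every replica has synced.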
58
+ ### Risks accepted (with mitigations)
59
+
60
+ | Risk | Mitigation |
61
+ |---|---|
62
+ | **Sign convention mismatch** — torchft computes `θ_initial − θ_local` (negation of our spec) | Explicit unit test: assert outer step direction matches DiLoCo paper sign. Document the convention in our `outer_optimizer.py`. |
63
+ | **Wheel brittleness for nightly** | Pin a specific dated nightly slug in our pyproject.toml; bump deliberately. |
64
+ | **`torch>=2.7` requirement** | Confirm our existing eidolon venv has it. Already does (verified). |
65
+ | **`fragment_sync_delay > 0` requires CUDA streams** | Spike 008 uses `fragment_sync_delay=0` (vanilla DiLoCo) for the smoke. Streaming with non-zero delay deferred to v0.2 (post-replication). |
66
+
67
+ ## Consequences
68
+
69
+ ### Accepted
70
+
71
+ - Spike 008 imports `torchft.local_sgd.DiLoCo` and runs the recon doc's
72
+ ready-to-paste pytest pattern as the smoke:
73
+ - 2 replicas, 4 inner steps, 2 outer rounds on a TinyMLP
74
+ - shared-buffer mock allreduce (no NCCL)
75
+ - assertions: replica equality after sync, params actually moved, Nesterov
76
+ state populated, sync count matches expected
77
+
78
+ - `composer_replication.diloco` package wraps `torchft.local_sgd.DiLoCo`
79
+ with our trainer's hooks. We DO NOT fork torchft — we depend on it as a
80
+ versioned wheel.
81
+
82
+ - Our integration is "vanilla DiLoCo" (single fragment, full-model sync) for
83
+ v0.1. Streaming DiLoCo is a configuration flag away in v0.2.
84
+
85
+ - The sign-convention mismatch is **made explicit** in our wrapper code with
86
+ a unit test that catches a sign flip if torchft ever inverts it.
87
+
88
+ ### Rejected paths
89
+
90
+ - **Roll our own DiLoCo.** Tempting (the algorithm is short) but the test
91
+ surface for distributed-correctness is large; reusing a Meta-maintained
92
+ library cuts the audit burden.
93
+ - **`diloco_simple`.** Disqualified by the license absence alone.
94
+ - **`prime` / INTELLECT-1.** Right tool for production multi-node runs,
95
+ wrong tool for a single-process unit test.
96
+
97
+ ## Source
98
+
99
+ `docs/research/DILOCO_RECONNAISSANCE.md` (subagent recon, primary-sourced
100
+ from torchft repo cloned + read locally, 2026-05-26).
docs/research/DILOCO_RECONNAISSANCE.md ADDED
@@ -0,0 +1,357 @@
1
+ # DiLoCo Reference Implementation Reconnaissance
2
+
3
+ **Date:** 2026-05-25
4
+ **Purpose:** Pick ONE PyTorch reference implementation of (Streaming) DiLoCo to bolt onto
5
+ the composer-replication-framework outer-loop optimizer. Feeds ADR-003.
6
+
7
+ **Bias:** simple + working > fancy + theoretically-better. Library > research codebase.
8
+
9
+ ---
10
+
11
+ ## TL;DR — Recommendation
12
+
13
+ **Use `meta-pytorch/torchft`'s `torchft.local_sgd.DiLoCo` context manager.**
14
+
15
+ It is a maintained library (not a research codebase), BSD-3 licensed, supports both
16
+ vanilla DiLoCo and Streaming DiLoCo through one class, and — critically — is unit-testable
17
+ in a single process by passing a `MagicMock(Manager)` whose `allreduce` returns a `_DummyWork`.
18
+ Their own `torchft/local_sgd_test.py` already demonstrates the exact pattern Spike 008 needs.
19
+
20
+ The Streaming DiLoCo paper (Liu et al. 2025, arXiv:2501.18512) has no separate community
21
+ implementation — torchft *is* the reference implementation as of mid-2026. PrimeIntellect's
22
+ two repos are either too minimal (`diloco_simple`, no LICENSE, NCCL-locked, no Streaming) or
23
+ deprecated (`OpenDiLoCo`, hivemind-based, "no longer maintained" per its own README).
24
+
25
+ ---
26
+
27
+ ## Candidates Audited (primary sources only)
28
+
29
+ ### A1. PrimeIntellect-ai/diloco_simple
30
+ - URL: https://github.com/PrimeIntellect-ai/diloco_simple
31
+ - License: **NONE** (no LICENSE file in repo — confirmed via `git clone` + `ls`).
32
+ All-rights-reserved by default under copyright law. **Cannot legally vendor or fork.**
33
+ - Last commit: **2024-05-31** (`be38ec4 add weight decay`).
34
+ - Activity: 8 commits total, ever. Two main authors. Effectively abandoned.
35
+ - Shape: single 180-LOC research script (`pure_torch_diloco.py`), pedagogical demo.
36
+ - Streaming DiLoCo? **No.** Vanilla DiLoCo only.
37
+ - Distributed: **Hard-coded NCCL via `torchrun`** + `init_process_group(backend="nccl")`.
38
+ Pulls in `wandb`, `transformers`, HuggingFace `datasets`, `cyclopts`, and trains a
39
+ full LlamaForCausalLM on C4. Not a library — a benchmark script.
40
+ - Verdict: **REJECT.** No license, no Streaming, no library API, NCCL-only, deps on
41
+ HF + wandb just to run. Useful as an *algorithm reference*, not as code to depend on.
42
+
43
+ ### A2. PrimeIntellect-ai/OpenDiLoCo
44
+ - URL: https://github.com/PrimeIntellect-ai/OpenDiloco
45
+ - License: present (Apache-2.0 typical, not re-verified — moot, see below).
46
+ - Status: **Officially deprecated.** README first paragraph:
47
+ > "**Important Notice**: OpenDiLoCo is no longer maintained. For our production-ready
48
+ > distributed training solution, please check out `prime`."
49
+ - Built on: `hivemind` (DHT-based decentralized training). Multi-machine only.
50
+ - Streaming DiLoCo? No.
51
+ - Verdict: **REJECT.** Deprecated by its authors. Hivemind dependency would force us to
52
+ set up DHT initial peers just to run a unit test.
53
+
54
+ ### A3. PrimeIntellect-ai/prime (a.k.a. INTELLECT-1 framework)
55
+ - URL: https://github.com/PrimeIntellect-ai/prime — note: the GitHub org now uses this
56
+ repo for their CLI/SDK; the original training framework was rebranded.
57
+ - The actual INTELLECT-1 training code uses an `ElasticDeviceMesh` abstraction and is
58
+ a full distributed training stack, not an algorithm library.
59
+ - Verdict: **REJECT.** Production framework, not a drop-in library. Coupling a 1.5k-LOC
60
+ fault-tolerant elastic mesh into our test framework is the opposite of "simple + working".
61
+
62
+ ### A4. DeepMind reference implementation (Douillard et al., arXiv:2311.08105)
63
+ - **No public reference implementation exists.** The DiLoCo paper is algorithm-only.
64
+ Confirmed: paper has no associated GitHub link in arXiv abstract or PDF; HuggingFace
65
+ papers page links no code. DeepMind has not open-sourced their internal trainer.
66
+ - Verdict: **N/A — does not exist.**
67
+
68
+ ### A5. meta-pytorch/torchft ← **CHOSEN**
69
+ - URL: https://github.com/meta-pytorch/torchft
70
+ - License: **BSD 3-Clause** (verified: `head -5 LICENSE` → "BSD 3-Clause License").
71
+ - Last commit on main: **2026-04-03** (HEAD `7eb7087 Add torchcomms ProcessGroup shim
72
+ for fault-tolerant reconfiguration`).
73
+ - Activity: 312 commits, multiple Meta contributors, recent commits across 2025 and 2026,
74
+ active CI, nightly PyPI builds at https://pypi.org/project/torchft-nightly/.
75
+ - Shape: **library**, not a research codebase. `torchft/` is a proper Python package with
76
+ `local_sgd.py`, `manager.py`, `process_group.py`, `local_sgd_test.py` (real pytest unit
77
+ tests), pyproject.toml, BSD-3.
78
+ - Streaming DiLoCo? **Yes** — the `DiLoCo` class is itself a Streaming DiLoCo
79
+ generalization (`fragment_sync_delay`, `fragment_update_alpha`); pass a single-element
80
+ `model_fragments=[model]` for vanilla DiLoCo.
81
+ - Source comment confirms: `"""... DiLoCo paper: https://arxiv.org/pdf/2311.08105 /
82
+ Streaming DiLoCo paper: https://arxiv.org/pdf/2501.18512 """`
83
+
84
+ ---
85
+
86
+ ## Deep Dive: torchft (the chosen one)
87
+
88
+ ### (1) Repo metadata
89
+ | Field | Value |
90
+ |---|---|
91
+ | URL | https://github.com/meta-pytorch/torchft |
92
+ | License | BSD 3-Clause |
93
+ | HEAD commit | `7eb7087` (2026-04-03) |
94
+ | Total commits on main | 312 |
95
+ | Activity level | **Active** — commits in 2025 + 2026, Meta-maintained, PyPI nightly builds |
96
+ | Distribution | `pip install torchft-nightly` (prebuilt wheels) **OR** install from source (requires Rust + protobuf-compiler + maturin — only because of the Lighthouse/process-group Rust ext, not the algorithm code) |
97
+ | Python | `requires-python = ">=3.8"`; `torch>=2.7` per `pyproject.toml` |
98
+
99
+ ### (2) Exact API / extension point
100
+
101
+ The integration target is `torchft/local_sgd.py`. Two relevant classes:
102
+
103
+ ```python
+ # Public class: drop-in context manager
+ class DiLoCo:
+     def __init__(
+         self,
+         manager: Manager,  # we mock this
+         model_fragments: List[nn.Module],  # [model] for vanilla DiLoCo
+         inner_optimizer: optim.Optimizer,
+         outer_optimizer: optim.Optimizer | list[optim.Optimizer],
+         sync_every: int,  # N inner steps
+         backup_device: Optional[torch.device] = None,
+         pin_memory: bool = True,
+         use_bucketization: bool = False,
+         bucket_cap_mb: Optional[int] = None,
+         should_quantize: bool = False,
+         fragment_sync_delay: int = 0,  # τ in Streaming DiLoCo paper
+         fragment_update_alpha: float = 0.0,
+     ) -> None: ...
+ ```
122
+
123
+ The **pseudo-gradient** is computed in `_StreamingDiLoCoFragment._save_grads()`
124
+ (`torchft/local_sgd.py` line 324):
125
+
126
+ ```python
+ def _save_grads(self) -> None:
+     """Saves pseudo-gradients of the parameters"""
+     with torch.no_grad():
+         for name, p in self._model_fragment.named_parameters():
+             local_param = p.to_local() if isinstance(p, DTensor) else p
+             pseudogradient = self.original_parameters[name].to(p.device) - local_param
+             self._grads[name] = pseudogradient
+ ```
135
+
136
+ Note the **sign**: `original − local` (i.e. `θ_initial − θ_local`). When this is later
137
+ copied into `p.grad` via `_set_grads`, an SGD step `p ← p − lr · grad` (with `p` reset to `θ_initial`) becomes
138
+ `p ← θ_initial − lr · (θ_initial − θ_local)` = a step *toward* `θ_local`. Our spec
139
+ says δ = θ_local − θ_initial; torchft uses the negation. Either convention works as
140
+ long as the outer optimizer's lr sign is consistent — torchft uses positive `outer_lr`
141
+ (e.g. 0.7) and SGD which subtracts the grad, so the math nets out. **Be careful when
142
+ unit-testing the sign in Spike 008.**
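The sign bookkeeping above can be checked with plain arithmetic before any torchft code is involved. The following is a minimal sketch in pure Python, assuming a momentum-free outer SGD step on a single scalar and that the parameter sits at `θ_initial` when the outer step runs. The helper name `outer_sgd_step` and the `convention` labels are hypothetical, chosen only for this illustration.

```python
def outer_sgd_step(theta_initial, theta_local, lr, convention="torchft"):
    """One momentum-free outer SGD step on a single scalar parameter.

    convention="torchft": pseudo-gradient = theta_initial - theta_local, used as-is.
    convention="spec":    delta = theta_local - theta_initial, negated before the step.
    """
    if convention == "torchft":
        grad = theta_initial - theta_local
    elif convention == "spec":
        delta = theta_local - theta_initial
        grad = -delta  # SGD subtracts the gradient, so delta must be negated
    else:
        raise ValueError(f"unknown convention: {convention}")
    return theta_initial - lr * grad


theta_initial, theta_local = 1.0, 2.0
new_torchft = outer_sgd_step(theta_initial, theta_local, lr=0.7, convention="torchft")
new_spec = outer_sgd_step(theta_initial, theta_local, lr=0.7, convention="spec")

# The failure mode to test for: feeding delta to SGD without negating it
# moves the parameter away from theta_local instead of toward it.
wrong = theta_initial - 0.7 * (theta_local - theta_initial)
```

Both conventions land 70% of the way from `θ_initial` to `θ_local`, while the un-negated variant moves in the opposite direction. A direction test in Spike 008 could assert exactly that property.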
143
+
144
+ The **outer Nesterov step** is in `_StreamingDiLoCoFragment.perform_sync()` (line 423):
145
+
146
+ ```python
147
+ if should_commit:
148
+ self._set_grads() # write pseudogradient into p.grad
149
+ self._outer_optimizer.step() # Nesterov SGD step (user-provided)
150
+ self.save_parameters()
151
+ self._merge_parameters()
152
+ self._outer_optimizer.zero_grad()
153
+ ```
154
+
155
+ The Nesterov-ness lives in the user-provided outer optimizer, e.g.:
156
+ ```python
157
+ outer_optimizer = torch.optim.SGD(model.parameters(), lr=0.7, momentum=0.9, nesterov=True)
158
+ ```
159
+ This matches the DiLoCo paper exactly (Douillard §3 specifies Nesterov momentum outer).
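For a hand-computed expected value in a sign or direction test, the update applied by `torch.optim.SGD(momentum=μ, nesterov=True)` can be written out in a few lines. This is a sketch in pure Python on a scalar, assuming zero dampening and zero weight decay. `nesterov_sgd_step` is an illustrative name, not a PyTorch or torchft function.

```python
def nesterov_sgd_step(param, grad, buf, lr=0.7, momentum=0.9):
    """Return (new_param, new_buf) for one Nesterov-momentum SGD step.

    buf is None on the first step, mirroring a freshly created optimizer
    whose momentum buffer is initialised to the first gradient.
    """
    buf = grad if buf is None else momentum * buf + grad
    update = grad + momentum * buf  # Nesterov look-ahead term
    return param - lr * update, buf


# Two outer rounds in which the replicas drifted by +1.0 each round, so the
# torchft-style pseudo-gradient (theta_initial - theta_local) is -1.0.
p, buf = 0.0, None
p, buf = nesterov_sgd_step(p, -1.0, buf)
p_after_one = p
p, buf = nesterov_sgd_step(p, -1.0, buf)
p_after_two = p
```

With `lr=0.7` and `momentum=0.9` the first outer step already moves the parameter by 1.33, more than the raw drift of 1.0, so a direction test should compare against this formula rather than against `θ_local` itself.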
160
+
161
+ The cross-replica all-reduce happens in `_average_grads()` (called from `prepare_sync`)
162
+ via `self._manager.allreduce(...)` — which is the seam we mock for single-process tests.
163
+
164
+ ### (3) torch.distributed dependency for testing?
165
+
166
+ **No, not for unit tests.** The `Manager` is mockable. From `torchft/local_sgd_test.py`:
167
+
168
+ ```python
+ from unittest.mock import create_autospec, MagicMock
+ from torchft.manager import Manager
+ from torchft.work import _DummyWork
+
+ def create_manager() -> MagicMock:
+     manager = create_autospec(Manager)
+     manager.errored.return_value = None
+     def mock_allreduce(tensor: torch.Tensor, should_quantize: bool = False):
+         return _DummyWork(tensor)  # returns the same tensor unchanged
+     manager.allreduce.side_effect = mock_allreduce
+     return manager
+ ```
181
+
182
+ This bypasses NCCL/Gloo entirely. `_DummyWork` just wraps the tensor and returns it as
183
+ the "all-reduced" result, so a single-process test with `world_size=1` works directly,
184
+ and a 2-replica test is achieved by running two `DiLoCo` instances with two model
185
+ copies in the same process and a `mock_allreduce` that *averages* the two tensors
186
+ manually before returning. (Their `test_bucketization_correctness` does exactly this.)
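The averaging variant described above can be illustrated without torch or torchft. This is a sketch only: plain lists of floats stand in for tensors, the mock returns the list itself rather than a work handle, and it assumes the peer replica's values are already known when `allreduce` is called. `make_averaging_manager` and `peer_values` are hypothetical names.

```python
from unittest.mock import MagicMock


def make_averaging_manager(peer_values):
    """Mock manager whose allreduce averages the caller's buffer with a peer's.

    peer_values is a list of floats standing in for the other replica's tensor.
    The caller's list is overwritten in place, like an in-place allreduce.
    """
    manager = MagicMock()
    manager.errored.return_value = None

    def mock_allreduce(values, should_quantize=False):
        for i, peer in enumerate(peer_values):
            values[i] = (values[i] + peer) / 2.0
        return values

    manager.allreduce.side_effect = mock_allreduce
    return manager


grads_a = [1.0, 3.0]
grads_b = [3.0, 5.0]
# Each manager gets a snapshot of the other replica's values taken up front.
manager_a = make_averaging_manager(peer_values=list(grads_b))
manager_b = make_averaging_manager(peer_values=list(grads_a))
manager_a.allreduce(grads_a)
manager_b.allreduce(grads_b)
```

After both calls each replica holds the element-wise mean `[2.0, 4.0]`. In a real test the difficulty is the ordering assumption: both replicas' pseudo-gradients must exist before either one is averaged.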
187
+
188
+ For real distributed runs torchft uses Gloo or NCCL via `torchft.process_group`
189
+ (reconfigurable PGs that wrap `torch.distributed`). We do not need this for Spike 008.
190
+
191
+ ### (4) Library, research codebase, or paper-companion?
192
+
193
+ **Library.** Strong evidence:
194
+ - Proper Python package layout (`torchft/__init__.py`, modules per concern).
195
+ - Real unit tests (`*_test.py` per module) — not "run this script" demos.
196
+ - BSD-3-Clause LICENSE (vs. diloco_simple having none, signaling "personal demo").
197
+ - Nightly PyPI distribution (`torchft-nightly`) with prebuilt wheels.
198
+ - Documentation site at https://pytorch.org/torchft.
199
+ - `meta-pytorch` org — Meta-internally maintained; lives next to `torchtitan`.
200
+ - README explicitly: *"torchft is designed to provide the primitives required to
201
+ implement fault tolerance in any application/train script"* — i.e. a building block.
202
+
203
+ Only friction: installing **from source** needs Rust (pyo3 + maturin) and
204
+ protobuf-compiler. This is for the Rust Lighthouse/process-group extension which we
205
+ **do not need** for Spike 008's mock-based tests. Two clean options:
206
+ - (a) `pip install torchft-nightly` — uses prebuilt wheel, no Rust toolchain needed.
207
+ - (b) Vendor `torchft/local_sgd.py` + the few helpers (`work.py::_DummyWork`,
208
+ type stubs for `Manager`) into our repo under BSD-3 attribution. ~700 LOC total.
209
+
210
+ ### (5) Minimum viable test pattern for Spike 008
211
+
212
+ Goal: **2 replicas × 4 inner steps × 2 outer rounds on a tiny model**, single-process, no NCCL.
213
+
214
+ ```python
+ # spikes/008-diloco-outer-loop/tests/test_diloco_two_replicas.py
+ """
+ Spike 008: prove the DiLoCo outer-loop math is correct under our framework.
+ Runs entirely in a single process, no torch.distributed required.
+ """
+ import copy
+ import torch
+ import torch.nn as nn
+ import torch.optim as optim
+ from unittest.mock import create_autospec, MagicMock
+
+ from torchft.local_sgd import DiLoCo
+ from torchft.manager import Manager
+ from torchft.work import _DummyWork
+
+
+ class TinyMLP(nn.Module):
+     def __init__(self):
+         super().__init__()
+         self.net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
+     def forward(self, x): return self.net(x)
+
+
+ def _make_avg_manager(replica_buffer):
+     """Manager whose allreduce averages tensors across replicas via shared buffer."""
+     # Caveat: calls are paired in arrival order, and only the second tensor of
+     # each pair is overwritten with the mean; the first is returned unchanged.
+     mgr = create_autospec(Manager)
+     mgr._use_async_quorum = False
+     mgr.errored.return_value = None
+     mgr.should_commit.return_value = True
+     mgr.current_step.return_value = 0
+     def avg_allreduce(tensor, should_quantize=False):
+         # Cross-replica average: stash and average against the other replica's tensor
+         replica_buffer.append(tensor.clone())
+         if len(replica_buffer) == 2:
+             mean = (replica_buffer[0] + replica_buffer[1]) / 2.0
+             tensor.copy_(mean)
+             replica_buffer.clear()
+         return _DummyWork(tensor)
+     mgr.allreduce.side_effect = avg_allreduce
+     return mgr
+
+
+ def test_diloco_two_replicas_four_inner_two_outer():
+     torch.manual_seed(0)
+     model_a = TinyMLP()
+     model_b = copy.deepcopy(model_a)  # identical init = same θ_initial
+
+     # Inner optimizers (one per replica)
+     inner_a = optim.AdamW(model_a.parameters(), lr=1e-3)
+     inner_b = optim.AdamW(model_b.parameters(), lr=1e-3)
+     # Outer Nesterov (one per replica, same hyperparams)
+     outer_a = optim.SGD(model_a.parameters(), lr=0.7, momentum=0.9, nesterov=True)
+     outer_b = optim.SGD(model_b.parameters(), lr=0.7, momentum=0.9, nesterov=True)
+
+     # Shared buffer: both DiLoCo wrappers funnel through one "process group" of size 2
+     buf = []
+     mgr_a = _make_avg_manager(buf)
+     mgr_b = _make_avg_manager(buf)
+
+     SYNC_EVERY = 4  # 4 inner steps per outer round
+     OUTER_ROUNDS = 2
+
+     with DiLoCo(mgr_a, [model_a], inner_a, outer_a, sync_every=SYNC_EVERY) as dla, \
+          DiLoCo(mgr_b, [model_b], inner_b, outer_b, sync_every=SYNC_EVERY) as dlb:
+         # Snapshot θ_initial
+         theta_initial_a = {n: p.detach().clone() for n, p in model_a.named_parameters()}
+
+         for outer_round in range(OUTER_ROUNDS):
+             for inner_step in range(SYNC_EVERY):
+                 # Replicas see DIFFERENT data, which is the whole point of DiLoCo
+                 x_a = torch.randn(8, 4) + 0.1 * outer_round
+                 x_b = torch.randn(8, 4) - 0.1 * outer_round
+                 y_a, y_b = torch.randn(8, 2), torch.randn(8, 2)
+
+                 inner_a.zero_grad(); inner_b.zero_grad()
+                 ((model_a(x_a) - y_a) ** 2).mean().backward()
+                 ((model_b(x_b) - y_b) ** 2).mean().backward()
+                 inner_a.step()  # Inner step. Sync fires automatically inside post-hook
+                 inner_b.step()  # at step % SYNC_EVERY == 0.
+
+     # Assertions:
+     # 1. Both replicas now hold IDENTICAL parameters (they were averaged via mock allreduce).
+     for (na, pa), (nb, pb) in zip(model_a.named_parameters(), model_b.named_parameters()):
+         torch.testing.assert_close(pa, pb, msg=f"Replicas diverged at {na}")
+
+     # 2. Parameters changed from θ_initial (outer optimizer actually stepped).
+     any_change = any(
+         not torch.equal(p, theta_initial_a[n]) for n, p in model_a.named_parameters()
+     )
+     assert any_change, "outer optimizer did not move the parameters"
+
+     # 3. The outer optimizer holds Nesterov momentum state for every parameter
+     #    (proves the SGD(nesterov=True) actually ran).
+     n_params = len(list(model_a.parameters()))
+     assert len(outer_a.state_dict()["state"]) == n_params
+
+     # 4. Sync fired once per outer round per replica.
+     assert mgr_a.start_quorum.call_count == OUTER_ROUNDS
+     assert mgr_b.start_quorum.call_count == OUTER_ROUNDS
+ ```
315
+
316
+ **Why this works:**
317
+ - `DiLoCo` registers a post-step hook on `inner_optimizer` (see `__enter__`). The
318
+ hook increments `_local_step` and triggers `prepare_sync` / `perform_sync` on every
319
+ `sync_every` boundary — fully automatic, our test only calls `inner.step()`.
320
+ - `_DummyWork.wait()` is a no-op. `_average_grads` calls `manager.allreduce(...)`
321
+ which our `avg_allreduce` mocks to do real cross-replica averaging through `buf`.
322
+ - `manager.should_commit.return_value = True` lets the outer optimizer fire on each
323
+ outer round; setting it to `False` lets us also test rollback semantics.
324
+ - All single-process — pytest plays nicely. Add to
325
+ `spikes/005-integrated-trainer-skeleton/tests/` style or new `spikes/008/tests/`.
326
+
327
+ **Install for this spike:** `pip install torchft-nightly` in the eidolon venv. If the
328
+ nightly wheel proves brittle, fallback: vendor `local_sgd.py` + `work.py` + a
329
+ minimal `manager.py` stub (≈800 LOC) into `framework/diloco/_vendored/` with BSD-3
330
+ attribution.
331
+
332
+ ---
333
+
334
+ ## Risks & Mitigations
335
+
336
+ | Risk | Likelihood | Mitigation |
337
+ |---|---|---|
338
+ | `torchft-nightly` wheel breaks against torch 2.x | Med | Pin to a specific nightly hash; or vendor `local_sgd.py` directly under BSD-3. |
339
+ | `torchft.manager.Manager` import pulls in Rust ext at import time | Low | The class is importable as a type; `MagicMock` replaces it. If import touches Rust, we vendor. Verified: the import in `local_sgd.py` is `from torchft.manager import Manager` — only used as a type annotation in our test path. |
340
+ | Sign convention of pseudogradient causes our outer optimizer to move the wrong way | Med | Assertion 2 in the test pattern above only checks "params moved from initial", not the direction. A second test should compare the direction against a hand-computed expected value. |
341
+ | `fragment_sync_delay > 0` (true Streaming) requires CUDA streams | Med | Spike 008 starts with `fragment_sync_delay=0` (= vanilla DiLoCo). Streaming variant deferred to Spike 009 once basic loop works. |
342
+ | Requires `torch>=2.7` per pyproject | Low | Framework already on torch 2.x; check exact pin. If <2.7, we vendor. |
343
+
344
+ ---
345
+
346
+ ## Decision (for ADR-003)
347
+
348
+ Adopt **`torchft.local_sgd.DiLoCo`** as the reference DiLoCo / Streaming DiLoCo
349
+ implementation. Integrate via `pip install torchft-nightly` for Spike 008. If
350
+ brittleness emerges, vendor `local_sgd.py` (BSD-3) into `framework/diloco/_vendored/`.
351
+
352
+ For the framework's outer-loop optimizer abstraction (the actual ADR-003 question):
353
+ mirror torchft's `DiLoCo(manager, [model_fragments], inner_opt, outer_opt, sync_every)`
354
+ constructor shape so that swapping our wrapper for the upstream class is a one-line
355
+ change. Compute pseudogradient as `θ_local − θ_initial` (our convention) and negate
356
+ when handing to the outer optimizer, OR follow torchft's `θ_initial − θ_local`
357
+ convention end-to-end. **Pick one and document it loudly.**
docs/research/MODAL_RECONNAISSANCE.md ADDED
@@ -0,0 +1,408 @@
1
+ # Modal Reconnaissance — Composer 2.5 Replication GPU Smoke
2
+
3
+ **Audience:** trainer integrator running a one-shot 30-minute verification smoke for the Composer 2.5 Replication Framework (`spikes/005-integrated-trainer-skeleton/`).
4
+ **Workload:** `Qwen/Qwen2.5-0.5B-Instruct` (≈ 1 GB fp16 weights), ~50 forward+backward steps, custom 3-channel loss = `GRPO + α·SDPO-KL + β·trace-replay-DPO`. Batch size ≤ 4 sequences ≤ 2048 tokens. **Goal: prove the loss runs end-to-end and capture mem + step time.** This is *not* training — it's a smoke.
5
+ **Cap:** $5. **Local hardware:** RTX 5090, 32 GB VRAM, Modal CLI already configured (`~/.modal.toml`).
6
+ **Bottom line up front:** *Run it locally on the 5090.* Modal is the wrong tool for this specific job. The skeleton + price math below is for future scale-out, not the smoke.
7
+
8
+ ---
9
+
10
+ ## 1. Recommended Modal GPU type & estimated cost
11
+
12
+ ### 1.1 Pricing table (from primary source)
13
+
14
+ All values copied verbatim from <https://modal.com/pricing> (fetched for this report). Modal bills per **second** of compute, not per minute or hour.
15
+
16
+ | GPU | Modal `gpu=` string | $ / sec | $ / hour | VRAM | Verdict for this smoke |
17
+ |----------------|----------------------|--------------|----------|--------|------------------------|
18
+ | Nvidia T4 | `"T4"` | 0.000164 | 0.590 | 16 GB | Too small for safe headroom on 3 fwd passes |
19
+ | **Nvidia L4** | `"L4"` | **0.000222** | **0.799**| 24 GB | ✅ **Recommended** — cheapest GPU that fits comfortably |
20
+ | Nvidia A10 | `"A10"` | 0.000306 | 1.102 | 24 GB | Acceptable; ~38% pricier than L4 for marginal speedup at sub-1B |
21
+ | Nvidia L40S | `"L40S"` | 0.000542 | 1.951 | 48 GB | Overkill — Modal's default rec, but unjustified at 0.5B |
22
+ | Nvidia A100-40GB| `"A100-40GB"` | 0.000583 | 2.099 | 40 GB | Overkill |
23
+ | Nvidia A100-80GB| `"A100-80GB"` | 0.000694 | 2.498 | 80 GB | Overkill |
24
+ | Nvidia H100 | `"H100!"` | 0.001097 | 3.949 | 80 GB | Wasteful |
25
+
26
+ (`H100!` suffix = pin to H100, opt out of Modal's automatic H200 upgrade. See <https://modal.com/docs/guide/gpu#automatic-upgrades-to-h200s>.)
27
+
28
+ **Auxiliary costs** (also primary, same page):
29
+ - CPU: $0.0000131 / physical-core / sec → ~$0.047 / core-hour. Min 0.125 cores per container.
30
+ - RAM: $0.00000222 / GiB / sec → ~$0.008 / GiB-hour.
31
+ - Volumes: $0.09 / GiB / month (first 1 TiB / mo free on the workspace).
32
+ - Starter plan: **$30 / month free credits** — your smoke is free if you haven't burned the budget elsewhere.
33
+
34
+ ### 1.2 Why L4, not A10 or A100-40GB
35
+
36
+ The `mlops/modal-llm-training` skill defaults to L4/A10 for "small smokes", and that holds here. The reasoning: *Qwen2.5-0.5B in fp16 is ~1 GB of weights. The 3-channel loss does ≥3 forward passes per step (student-grad, teacher no-grad for SDPO, chosen+rejected for DPO). Even with the teacher forward held in memory you are nowhere near 24 GB.*
37
+
38
+ Concrete VRAM math for the workload (back of envelope, batch=2, seq=1024, bf16):
39
+ - Weights: ~1.0 GB
40
+ - Optimizer state (AdamW, fp32 m+v): ~4 GB (8 bytes × 0.5B params)
41
+ - Gradients (bf16): ~1 GB
42
+ - Activations for student fwd at B=2,T=1024: ~1–2 GB
43
+ - Teacher fwd (no grad, no act save): ~0.3 GB
44
+ - DPO chosen+rejected fwds (with grad): ~2–3 GB
45
+ - HF transformers overhead, KV scratch, framework: ~2 GB
46
+ - **Subtotal: ~11–14 GB** — comfortably inside 24 GB on L4.
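The budget above can be reproduced as a short calculation. This is a sketch, not a measurement: the first three rows follow from the rounded 0.5B parameter count, and the remaining rows simply restate the back-of-envelope ranges from the list. The dictionary layout and variable names are illustrative.

```python
GB = 1e9
N_PARAMS = 0.5e9  # Qwen2.5-0.5B, rounded as in the estimate above

# (low, high) estimates in GB for each line item of the VRAM budget.
budget_gb = {
    "weights (bf16, 2 B/param)": (N_PARAMS * 2 / GB,) * 2,
    "AdamW m+v (fp32, 8 B/param)": (N_PARAMS * 8 / GB,) * 2,
    "gradients (bf16, 2 B/param)": (N_PARAMS * 2 / GB,) * 2,
    "student activations": (1.0, 2.0),
    "teacher forward": (0.3, 0.3),
    "DPO chosen+rejected": (2.0, 3.0),
    "framework overhead": (2.0, 2.0),
}

low = sum(lo_gb for lo_gb, _ in budget_gb.values())
high = sum(hi_gb for _, hi_gb in budget_gb.values())
print(f"estimated peak: {low:.1f}-{high:.1f} GB of 24 GB on L4")
```

Swapping in a different parameter count or batch shape makes it easy to see when the estimate stops fitting in 24 GB.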
47
+
48
+ **A10 is also fine** but costs 38% more for ~30–50% extra throughput on a workload where the GPU is already step-time-bound by Python overhead (see §3). Pay the L4 rate.
49
+
50
+ **A100-40GB is wrong.** You're paying 2.6× the L4 rate for memory you don't use and FLOPS that, on a 0.5B model with bs=2, you can't saturate. The Modal docs explicitly warn against this: *"Before you jump for the most powerful (and so most expensive) GPU, make sure you understand where the bottlenecks are…"* (<https://modal.com/docs/guide/gpu#b200-gpus>).
51
+
52
+ **T4 declined** because: (a) only 16 GB VRAM — tight given 3 fwd passes; (b) old Turing arch lacks bf16 hardware; you'd be on fp16/fp32, which trips up `transformers` flash-attention paths and adds debug surface area on a smoke that's already debugging custom loss code.
53
+
54
+ ### 1.3 Cost projection for the actual smoke
55
+
56
+ Assume a 30-min wall-clock budget that breaks down realistically as:
57
+ - Container cold-start + image pull: 30–90 s (first run — Modal's container infra warm-boots in ~1 s but your image with torch+transformers takes one-time pull). See <https://modal.com/docs/guide/cold-start>.
58
+ - HF model download (Qwen 2.5-0.5B = ~1 GB shards) on first run: 15–45 s — **should be cached on a Modal Volume after run 1**.
59
+ - Setup inside fn (CUDA init, model.from_pretrained, optimizer build): 20–40 s.
60
+ - 50 training steps × ~2–4 s/step (3-channel loss, bs=2, seq=1024 on L4): **100–200 s**.
61
+ - Logging, save, exit: 5 s.
62
+
63
+ **Realistic total: ~3–7 minutes of GPU-billed time per run.**
64
+
65
+ Cost per run on L4:
66
+ - Lower bound (3 min): 180 s × $0.000222 = **$0.040**
67
+ - Upper bound (7 min): 420 s × $0.000222 = **$0.093**
68
+ - Plus CPU/RAM overhead (4 cores × 16 GB RAM): ~420 s × (4 × $0.0000131 + 16 × $0.00000222) = ~$0.037
69
+
70
+ **Per-run all-in: $0.08 – $0.13 on L4.** You can run the smoke ~50× before nudging the $5 cap. Comfortable.
71
+
72
+ For comparison, A10 same scenario: ~$0.09 – $0.17 per run. A100-40GB: ~$0.14 – $0.28. Still all under cap, but L4 is the rational pick.
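The per-run arithmetic can be parameterised so other GPU types or durations are easy to compare. This is a minimal sketch using the per-second rates quoted in §1.1; the function name `run_cost` and the 4-core / 16 GiB defaults are assumptions taken from the scenario above, not a Modal API.

```python
# Per-second rates copied from the pricing table and auxiliary-cost list above.
GPU_RATE = {"L4": 0.000222, "A10": 0.000306, "A100-40GB": 0.000583}
CPU_RATE_PER_CORE = 0.0000131
RAM_RATE_PER_GIB = 0.00000222


def run_cost(gpu, seconds, cores=4, ram_gib=16):
    """Return (gpu_cost, cpu_ram_cost) in dollars for `seconds` of billed time."""
    gpu_cost = seconds * GPU_RATE[gpu]
    cpu_ram_cost = seconds * (cores * CPU_RATE_PER_CORE + ram_gib * RAM_RATE_PER_GIB)
    return gpu_cost, cpu_ram_cost


l4_gpu_3min, _ = run_cost("L4", 180)
l4_gpu_7min, l4_overhead_7min = run_cost("L4", 420)

for gpu in GPU_RATE:
    gpu_cost, overhead = run_cost(gpu, 420)
    print(f"{gpu:>10}: 7-min run = ${gpu_cost + overhead:.3f}")
```

Only the GPU term changes between GPU types; the CPU and RAM overhead stays at roughly $0.037 for a 7-minute run regardless of which GPU is attached.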
73
+
74
+ ### 1.4 Region & preemption multipliers (DON'T trip on these)
75
+
76
+ From the pricing-page footer:
77
+ - **Region selection: 1.5–1.75× base price.** Don't pin to a region unless you must.
78
+ - **Non-preemptible execution: 3× base price.** Default is preemptible, so leave it. A 30-min smoke that gets preempted is fine; just retry. Opting into non-preemptible mode would push L4 to ~$2.40/hr and is unjustified.
79
+
80
+ ---
81
+
82
+ ## 2. Minimal `modal_app.py` skeleton
83
+
84
+ This is the actual file to drop into the repo, e.g. at `spikes/005-integrated-trainer-skeleton/modal_app.py`. It is intentionally one file, with no abstraction, sized for the smoke. Image pins are conservative — match what the user is running locally to avoid version drift between local debugging and Modal runs.
85
+
86
+ ```python
+ """modal_app.py: GPU smoke for the Composer 2.5 Replication Framework.
+
+ Goal: run ~50 forward+backward steps of the 3-channel loss
+ (GRPO + SDPO-KL + trace-replay-DPO) against Qwen/Qwen2.5-0.5B-Instruct,
+ capture peak VRAM and per-step latency, and exit. Single L4, single container.
+
+ Run:  modal run modal_app.py
+ Logs: the function's print() output streams to your terminal.
+ """
+
+ from __future__ import annotations
+
+ import modal
+
+ # ---------------------------------------------------------------------------
+ # 1) App + image
+ # ---------------------------------------------------------------------------
+ # Pin torch to a CUDA build that matches Modal's L4 driver (CUDA 12.x).
+ # Pin transformers/peft/trl to a known-good combination: the trainer skeleton
+ # was developed against transformers >= 4.45 and trl >= 0.12 for GRPOTrainer.
+ # If you bump any of these, re-verify GRPOTrainer._compute_loss is still the
+ # correct override hook (DeepWiki audit anchor: huggingface/trl).
+ image = (
+     modal.Image.debian_slim(python_version="3.11")
+     .apt_install("git")
+     .pip_install(
+         "torch==2.4.1",  # CUDA 12.1 wheel from PyPI default index
+         "transformers==4.46.3",
+         "accelerate==1.1.1",
+         "peft==0.14.0",
+         "trl==0.12.2",
+         "datasets==3.1.0",
+         "huggingface_hub==0.26.5",
+         "hf_transfer",  # needed because HF_HUB_ENABLE_HF_TRANSFER=1 below (unpinned)
+     )
+     .env({
+         # Force HF to use the mounted Volume for model + dataset cache.
+         "HF_HOME": "/cache/hf",
+         "TRANSFORMERS_CACHE": "/cache/hf",
+         "HF_HUB_ENABLE_HF_TRANSFER": "1",  # parallel download for the model
+         # Make Python flush prints immediately so we see step times live.
+         "PYTHONUNBUFFERED": "1",
+         # Reproducibility for the smoke.
+         "TOKENIZERS_PARALLELISM": "false",
+     })
+ )
+
+ # ---------------------------------------------------------------------------
+ # 2) Persistent volume for HF cache (so model isn't re-downloaded each run)
+ # ---------------------------------------------------------------------------
+ # 1 GB of Qwen weights persists here. First run pays the download cost,
+ # every subsequent run reuses the volume. Below 1 TiB / mo: free.
+ hf_cache = modal.Volume.from_name("hf-cache-composer-smoke", create_if_missing=True)
+
+ # ---------------------------------------------------------------------------
+ # 3) App + secrets
+ # ---------------------------------------------------------------------------
+ app = modal.App("composer-replication-smoke")
+
+ # Optional: only needed if you switch to a gated model. Qwen2.5-0.5B is open.
+ hf_secret = modal.Secret.from_name("huggingface-token", required_keys=[])  # no-op safety
+
+ # ---------------------------------------------------------------------------
+ # 4) The smoke function
+ # ---------------------------------------------------------------------------
+ @app.function(
+     image=image,
+     gpu="L4",  # see §1: cheapest 24 GB option that fits
+     cpu=4.0,  # 4 cores is plenty for tokenization on a sub-1B
+     memory=16 * 1024,  # 16 GiB RAM is plenty
+     volumes={"/cache": hf_cache},
+     timeout=60 * 30,  # hard 30-min cap matches the smoke spec
+     secrets=[hf_secret],
+     # NB: keep preemptible (default). Don't pay 3× to pin.
+     # NB: don't pin region; the 1.5–1.75× tax is unjustified for a smoke.
+ )
+ def smoke():
+     import time
+     import torch
+     from transformers import AutoModelForCausalLM, AutoTokenizer
+
+     MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
+
+     print(f"[smoke] torch={torch.__version__} cuda={torch.version.cuda} "
+           f"device={torch.cuda.get_device_name(0)} "
+           f"vram={torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB")
+
+     # -------------------------------------------------------------------
+     # Load tokenizer + model. bf16: L4 supports it (Ada Lovelace).
+     # -------------------------------------------------------------------
+     t0 = time.perf_counter()
+     tok = AutoTokenizer.from_pretrained(MODEL_ID, cache_dir="/cache/hf")
+     model = AutoModelForCausalLM.from_pretrained(
+         MODEL_ID,
+         cache_dir="/cache/hf",
+         torch_dtype=torch.bfloat16,
+         device_map="cuda:0",
+     )
+     model.train()
+     print(f"[smoke] model load: {time.perf_counter()-t0:.1f}s "
+           f"params={sum(p.numel() for p in model.parameters())/1e6:.1f}M")
+
+     # -------------------------------------------------------------------
+     # 50-step verification loop.
+     #
+     # NOTE: this stub uses a synthetic batch (a single forward+backward
+     # against an LM-head loss), *not* the full 3-channel loss. The point
+     # is to (a) verify the Modal harness, (b) measure the per-step time
+     # of a vanilla AutoModelForCausalLM step on this GPU as a baseline.
+     #
+     # Replace the body of the for-loop with the actual ComposerReplicationTrainer
+     # `_compute_loss` call once data_collator outputs are stubbed/mocked.
+     # See: spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py
+     # -------------------------------------------------------------------
+     optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)
+
+     # Synthetic batch: bs=2, seq=1024, matching the realistic smoke shape.
+     B, T = 2, 1024
+     input_ids = torch.randint(0, tok.vocab_size, (B, T), device="cuda:0")
+     labels = input_ids.clone()
+
+     torch.cuda.reset_peak_memory_stats()
+     step_times = []
+     for step in range(50):
+         t = time.perf_counter()
+         out = model(input_ids=input_ids, labels=labels)
+         out.loss.backward()
+         optimizer.step()
+         optimizer.zero_grad(set_to_none=True)
+         torch.cuda.synchronize()
+         dt = time.perf_counter() - t
+         step_times.append(dt)
+         if step % 10 == 0:
+             print(f"[smoke] step {step:>3d} loss={out.loss.item():.4f} "
+                   f"dt={dt*1000:.1f}ms peak_vram={torch.cuda.max_memory_allocated()/1e9:.2f}GB")
+
+     # -------------------------------------------------------------------
+     # Final report.
+     # -------------------------------------------------------------------
+     median_ms = sorted(step_times)[len(step_times)//2] * 1000
+     p95_ms = sorted(step_times)[int(len(step_times)*0.95)] * 1000
+     peak_gb = torch.cuda.max_memory_allocated() / 1e9
+     print(f"\n[smoke] DONE. median_step={median_ms:.1f}ms p95={p95_ms:.1f}ms "
+           f"peak_vram={peak_gb:.2f}GB total_time={sum(step_times):.1f}s")
+
+     # Persist cache for the next run.
+     hf_cache.commit()
+
+
+ @app.local_entrypoint()
+ def main():
+     smoke.remote()
+ ```
239
+
240
+ ### 2.1 What's deliberately *not* in the skeleton
241
+
242
+ - **No `flash-attn` install.** The `flash-attn` wheel build is a notorious time sink on Modal images (compiles against the CUDA toolkit). For a 0.5B smoke, SDPA (PyTorch's built-in scaled-dot-product attention) is fine and is on by default in transformers ≥ 4.45.
243
+ - **No `bitsandbytes`, no `unsloth`, no `xformers`.** All add build complexity. None give you anything on a smoke.
244
+ - **No DeepSpeed, no FSDP, no `accelerate launch`.** This is single-GPU; `accelerate` is in the image only because `trl` imports it. We don't invoke it.
245
+ - **No web endpoint, no `@app.cls`, no `enter` method.** A `@app.function()` with no warm-up is correct for a one-shot smoke. `enter`/lifecycle methods are for serving and amortizing model load across many calls — not relevant when you call once.
246
+ - **No `min_containers` or `buffer_containers`.** Those are warm-pool knobs for serving — they cost money. Default scale-from-zero is right.
247
+ - **No `Image.from_registry`.** `debian_slim` + `pip_install` is faster than pulling a CUDA base image when you don't need a custom CUDA toolkit.
248
+
249
+ ### 2.2 What you do need to add when you wire the real loss
250
+
251
+ Replace the synthetic `for step in range(50)` body with:
252
+
253
+ ```python
254
+ from data_collator import ComposerDataCollator # spike 005 path
255
+ from trl_path.composer_trainer import ComposerReplicationTrainer
256
+ # ...
257
+ # Build a small fixed dataset of (prompt, response, hint, dpo_pair) tuples
258
+ # inline in the smoke (10–20 examples). Don't pull a real RL rollout — the
259
+ # point is to verify the loss path, not the rollout path.
260
+ ```
261
+
262
+ The smoke does **not** need a real rollout/sampling phase. Stub `inputs` with the keys `_compute_sdpo_loss` and `_compute_trace_replay_loss` consume (`ctx_teacher_input_ids`, `dpo_chosen_input_ids`, `dpo_chosen_response_mask`, `dpo_chosen_ref_logprobs`, `sdpo_loss_mask`, …) using fixed tensors. That's the real verification — does the 3-channel loss compute and back-propagate without shape errors. The trainer skeleton's logging will tell you per-channel values.
263
+
264
+ ---
265
+
266
+ ## 3. Gotchas that bite *this specific workload*
267
+
268
+ The Modal docs and the `mlops/modal-llm-training` skill cover ~30 lessons aimed at 7B–30B training. Most of them don't apply here. The ones that do:
269
+
270
+ ### 3.1 The teacher forward in SDPO doubles your effective batch memory — but only briefly
271
+
272
+ `ComposerReplicationTrainer._compute_sdpo_loss` does this (composer_trainer.py L138–143):
273
+
274
+ ```python
275
+ student_logits = model(input_ids=inputs["input_ids"]).logits # with grad
276
+ with torch.no_grad():
277
+     teacher_logits = model(input_ids=inputs["ctx_teacher_input_ids"]).logits
278
+ ```
279
+
280
+ Two issues:
281
+
282
+ 1. **Both logits tensors are held simultaneously** in `_compute_sdpo_loss` — they're handed to `generalized_jsd_loss` which keeps them alive for the JSD math. For Qwen 2.5-0.5B (vocab = 151,936), one logits tensor at B=2,T=1024 in bf16 is `2 * 1024 * 151936 * 2 bytes ≈ 622 MB`. Two of them = ~1.2 GB. **Negligible on a 24 GB L4** but worth noting because logits are surprisingly fat for the Qwen vocab.
283
+ 2. **Use the `top_k` arg in `generalized_jsd_loss`** if you ever want to scale this up. The docstring (`opsd_loss.py` L54) explicitly recommends it: *"top_k: restrict KL to top-k tokens of the teacher distribution. Saves compute on large vocabularies (Qwen3 vocab = 152K)."* On the smoke, leave it `None` to verify the unrestricted path; flip it on for real training.
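The logits arithmetic in item 1 can be checked with a few lines of Python. This is a minimal sketch: `logits_bytes` is a hypothetical helper introduced here for illustration (it is not part of `composer_trainer.py`), and it assumes a dense `[B, T, vocab]` tensor at 2 bytes per element (bf16).

```python
def logits_bytes(batch: int, seq_len: int, vocab: int, bytes_per_elem: int = 2) -> int:
    """Size in bytes of one dense [batch, seq_len, vocab] logits tensor."""
    return batch * seq_len * vocab * bytes_per_elem


if __name__ == "__main__":
    one = logits_bytes(batch=2, seq_len=1024, vocab=151_936)  # 622329856 bytes
    print(f"one logits tensor:  {one / 1e6:.0f} MB")
    print(f"student + teacher: {2 * one / 1e9:.2f} GB")
```

Scaling `batch` or `seq_len` in this helper gives a quick feel for when the two held logits tensors stop being negligible.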
284
+
285
+ ### 3.2 The DPO channel does TWO more grad'd forwards per step
286
+
287
+ `_compute_trace_replay_loss` (composer_trainer.py L191–198) calls `_sequence_logprobs(model, dpo_chosen_…)` and `_sequence_logprobs(model, dpo_rejected_…)`. Both are with-grad. So each training step is:
288
+
289
+ | Forward | Grad? | Notes |
290
+ |---------|-------|-------|
291
+ | `super()._compute_loss` (GRPO) | yes | parent's standard fwd |
292
+ | Student in SDPO | yes | only when alpha_sdpo ≠ 0 |
293
+ | Teacher in SDPO | no | hint-conditioned context |
294
+ | DPO chosen | yes | only when beta_replay ≠ 0 |
295
+ | DPO rejected | yes | only when beta_replay ≠ 0 |
296
+
297
+ **That's up to 4 grad'd forwards before the backward.** PyTorch will hold activations for all of them in the autograd graph until `.backward()` runs. For the smoke this is fine (0.5B × 4 act tapes ~ 4 GB at B=2,T=1024) but for any real training run on a larger model: **enable gradient checkpointing or run the SDPO/DPO channels in alternating steps** rather than every step.
298
+
299
+ For the smoke specifically: **set `alpha_sdpo=0.1` and `beta_replay=0.05`** (the trainer defaults) and verify activation memory peaks below 16 GB. If it doesn't, there's a bug in the data collator producing too-long sequences.
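The forward-pass table above reduces to a small counting rule. The sketch below is an illustration only: `forwards_per_step` is a hypothetical helper, and it assumes the channels are gated exactly as the table's notes state (SDPO runs only when `alpha_sdpo` is nonzero, DPO only when `beta_replay` is nonzero).

```python
def forwards_per_step(alpha_sdpo: float, beta_replay: float) -> dict:
    """Count forward passes per training step, split by whether autograd records them."""
    with_grad = 1  # GRPO forward from the parent trainer always runs
    no_grad = 0
    if alpha_sdpo != 0:
        with_grad += 1  # SDPO student forward
        no_grad += 1    # SDPO teacher forward (hint-conditioned context)
    if beta_replay != 0:
        with_grad += 2  # DPO chosen + DPO rejected
    return {"with_grad": with_grad, "no_grad": no_grad}


if __name__ == "__main__":
    # Trainer defaults quoted above: alpha_sdpo=0.1, beta_replay=0.05
    print(forwards_per_step(0.1, 0.05))
```

Only the `with_grad` count matters for activation memory, since those are the passes whose activations stay alive until `.backward()`.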
300
+
301
+ ### 3.3 `requires_grad=True` on the zero-tensor short-circuit is a footgun
302
+
303
+ In `composer_trainer.py` L136 and L155, when SDPO is short-circuited:
304
+
305
+ ```python
306
+ return torch.tensor(0.0, device=_device_of(model), requires_grad=True)
307
+ ```
308
+
309
+ This is a leaf tensor with `requires_grad=True` but no parent op, so it is **not connected to any model parameter** in the autograd graph. When you sum it into `total = grpo_loss + alpha * sdpo_kl + beta * replay_dpo`, the `0.0` contributes nothing to the parameter gradients and doesn't break things. The catch is the step where ALL three channels short-circuit (e.g., a smoke step with no error sites and no DPO pairs): because the leaf requires grad, `total` still has a `grad_fn`, so `total.backward()` succeeds without raising and leaves every model parameter with no gradient. The familiar `RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn` fires only when a short-circuit tensor is built *without* `requires_grad=True`, so here the failure is silent. **The smoke will hit the short-circuit path** if your synthetic batch lacks `ctx_teacher_input_ids` and `dpo_chosen_input_ids`.
310
+
311
+ Fix in the smoke: ensure the synthetic batch includes at least one `ctx_teacher_input_ids` row (it can be a copy of `input_ids` to keep things trivial) so SDPO doesn't short-circuit on every step.
312
+
313
+ ### 3.4 `torch.cuda.synchronize()` before timing reads
314
+
315
+ If you don't `torch.cuda.synchronize()` before reading `time.perf_counter()` you'll measure CPU dispatch time, not GPU step time. The skeleton above includes it. The Modal runtime doesn't change this — same rule as local.
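The timing rule can be sketched as a small wrapper. This is an illustrative helper, not code from the skeleton: `timed_step` and its `sync` parameter are hypothetical names, and the default `sync` is a no-op so the example runs on a CPU-only machine. On a GPU you would pass `sync=torch.cuda.synchronize`.

```python
import time
from typing import Callable


def timed_step(step_fn: Callable[[], object],
               sync: Callable[[], None] = lambda: None) -> float:
    """Return wall-clock seconds for one call of step_fn.

    `sync` is called before starting and before stopping the clock so that
    asynchronously queued work is included in the measurement.
    """
    sync()                       # drain work queued before this step
    start = time.perf_counter()
    step_fn()
    sync()                       # wait for this step's work to finish
    return time.perf_counter() - start


if __name__ == "__main__":
    # CPU-only stand-in for a training step.
    elapsed = timed_step(lambda: sum(range(100_000)))
    print(f"step took {elapsed * 1e3:.2f} ms")
```

Without the second `sync()` call, the clock stops as soon as the kernels are dispatched, which is the CPU dispatch time described above.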
316
+
317
+ ### 3.5 The HF cache Volume must be `commit()`ed
318
+
319
+ From <https://modal.com/docs/guide/volumes>: *Volume writes are not persisted across runs unless you call `.commit()` (or use `volume.batch_upload`).* The skeleton calls `hf_cache.commit()` at the end. If you forget, run 2 will re-download the model. This is the only "Modal-flavored" gotcha that bites a smoke.
320
+
321
+ ### 3.6 What does NOT bite
322
+
323
+ These are the lessons from `mlops/modal-llm-training` that are **not relevant to a 0.5B smoke** — don't waste mental cycles on them:
324
+
325
+ - ❌ FSDP / DeepSpeed sharding setup. Single GPU.
326
+ - ❌ `accelerate launch` / multi-process distributed. Single GPU.
327
+ - ❌ Flash-attention version pinning vs torch version. SDPA is fine for 0.5B.
328
+ - ❌ Tensor parallelism / sequence parallelism. Single GPU.
329
+ - ❌ Multi-node clusters. Single node.
330
+ - ❌ Memory snapshotting (`enable_memory_snapshot=True`). It's a 30-min one-shot. The cold-start penalty is ~30 s on a smoke that runs for 5 min — 10% overhead, not worth the snapshot setup time.
331
+ - ❌ Region pinning for data locality. The whole input is `from_pretrained`, served by HF — Modal's default region is fine.
332
+ - ❌ Custom CUDA install (`Image.from_registry("nvidia/cuda:…")`). The pre-built torch wheel ships its own CUDA.
333
+
334
+ ---
335
+
336
+ ## 4. Decision rule: Modal vs the local 5090
337
+
338
+ ### 4.1 The numbers
339
+
340
+ **Local 5090** (32 GB VRAM, Blackwell, ~1.6 PFLOPS bf16):
341
+ - Step time for Qwen-0.5B at B=2, T=1024, 3-channel loss (≈4 grad'd fwds + bwd): expect **~150–400 ms per step** based on parameter-count + Blackwell's bf16 throughput. Call it 300 ms.
342
+ - 50 steps: **~15 seconds of pure compute**.
343
+ - Plus model load (one-time, from local HF cache): ~5 seconds.
344
+ - Plus data collator setup: ~3 seconds.
345
+ - **Wall clock: ~25–40 seconds.**
346
+ - **Cost: $0** (electricity ignored — the 5090 draws ~600 W under load × 40 s = 6.7 Wh ≈ $0.001).
347
+
348
+ **Modal L4** (24 GB VRAM, Ada Lovelace, ~0.12 PFLOPS bf16):
349
+ - Step time for the same workload on L4: **~1.5–4 s per step.** (L4 is roughly 13× lower bf16 throughput than 5090, but the workload at B=2 won't saturate the 5090, so realistic gap is ~5–10×.) Call it 2 s.
350
+ - 50 steps: **~100 seconds of pure compute**.
351
+ - Plus container cold start, image pull, model download (cached after run 1), CUDA init: **30–90 s on first run, 20–40 s afterward**.
352
+ - **Wall clock: ~3–5 minutes per run (worst case 7 min on a cold first run).**
353
+ - **Cost: $0.08–$0.13 per run.**
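The estimates above are simple sums, reproduced here as a sketch so the inputs are easy to change. The function names are hypothetical; the step times and overheads are the figures quoted in the two lists, not measurements.

```python
def wall_clock_s(steps: int, step_s: float, overhead_s: float) -> float:
    """Compute time plus fixed overhead, in seconds."""
    return steps * step_s + overhead_s


def energy_wh(watts: float, seconds: float) -> float:
    """Energy drawn at constant power, in watt-hours."""
    return watts * seconds / 3600.0


if __name__ == "__main__":
    local = wall_clock_s(50, 0.3, 5 + 3)   # model load + collator setup
    modal_warm = (wall_clock_s(50, 2.0, 20), wall_clock_s(50, 2.0, 40))
    modal_cold = (wall_clock_s(50, 2.0, 30), wall_clock_s(50, 2.0, 90))
    print(f"local 5090:       {local:.0f} s")
    print(f"Modal L4 (warm):  {modal_warm[0]:.0f}-{modal_warm[1]:.0f} s")
    print(f"Modal L4 (cold):  {modal_cold[0]:.0f}-{modal_cold[1]:.0f} s")
    print(f"5090 energy:      {energy_wh(600, 40):.1f} Wh")
```

The itemized terms alone come to roughly 23 s locally and about 2 to 3 minutes on Modal, so the quoted wall-clock ranges presumably include slack (code sync, log streaming) that the lists do not itemize.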
354
+
355
+ ### 4.2 The decision rule
356
+
357
+ > **For this specific 30-min smoke: run on the 5090. Do not use Modal.**
358
+
359
+ Reasoning:
360
+
361
+ 1. **Latency:** the 5090 finishes the smoke in ~30 s. Modal's L4 needs ~5 minutes including cold start. That's a **10× iteration penalty** on a workload where the entire point is iterate-and-fix-the-shape-error cycles. Every minute spent waiting for Modal is a minute in which the smoke could have run about twice locally.
362
+ 2. **Memory headroom:** the 5090's 32 GB is **larger** than the L4's 24 GB. There is no memory motivation to leave the local box.
363
+ 3. **Network friction:** every Modal run requires `modal run`, syncing local code, waiting for the image, and watching logs. Local is `python local_smoke.py` (the local script described in §4.3), or just import-and-run in a notebook.
364
+ 4. **Cost asymmetry vs. iteration cost:** $0.10/run is not the issue. The issue is **30 minutes of attention spent on Modal infra is 30 minutes not spent debugging the loss**.
365
+ 5. **The framework hasn't been verified to run end-to-end yet.** The first hundred bugs you'll find are local Python issues — wrong tensor shapes, missing keys in the collator, the `requires_grad=True` zero-tensor footgun (§3.3), TRL version mismatches. Debugging those over a Modal round-trip is masochism.
366
+
367
+ **When Modal becomes correct:**
368
+
369
+ | Scenario | Modal? | Why |
370
+ |----------|--------|-----|
371
+ | 30-min smoke on 0.5B (this task) | **No** | 5090 wins on every dimension |
372
+ | Sweep alpha_sdpo, beta_replay across 8 configs in parallel | **Yes** | 8× Modal containers in parallel beats 8 sequential runs on one 5090 |
373
+ | Scale to Qwen2.5-7B (real training) | **Yes** | 7B needs >32 GB for grad+optimizer, so 5090 is out; you want A100-80GB or H100 |
374
+ | Scale to multi-node (40B+) | **Yes (with caveats)** | Modal multi-node is in beta — see <https://modal.com/docs/guide/multi-node-training> |
375
+ | 24/7 inference of trained model | **Maybe** | Depends on QPS; Modal serverless wins for spiky, loses for steady |
376
+
377
+ ### 4.3 Recommended workflow
378
+
379
+ 1. **Write the smoke as `local_smoke.py`** that runs on the 5090. Same body as `modal_app.py`'s `smoke()` function, minus the `@app.function` decorator. Iterate there until 50 steps run cleanly.
380
+ 2. **Then** drop the body into `modal_app.py` (the skeleton in §2). The Modal version's value is to verify "does it run on cloud Linux without local dotfile interference" and to baseline L4 step-time vs the 5090. That's a one-shot validation, not a development loop.
381
+ 3. **For the real training run** (when it's an actual training run, not a smoke), start with A100-40GB on Modal (or H100 if you've got the credits) — the L4 step-time of ~2 s would translate to 2 s × 10,000 steps = ~5.5 hours which is fine for a smoke but painful for a real run.
382
+
383
+ ---
384
+
385
+ ## 5. References
386
+
387
+ All claims in this document are sourced from:
388
+
389
+ - **Pricing**: <https://modal.com/pricing> (canonical; updated regularly by Modal — re-fetch if cost-sensitive). Per-second numbers in §1.1 captured from this page at report-write time.
390
+ - **GPU naming**: <https://modal.com/docs/guide/gpu> — confirms `gpu="L4"`, `gpu="A10"` (not `"A10G"`), `gpu="A100-40GB"`, `gpu="H100!"` syntax.
391
+ - **Cold starts**: <https://modal.com/docs/guide/cold-start> — "Containers boot in about one second" + the warm-up period is image pull + global imports + `enter` methods.
392
+ - **Volumes**: <https://modal.com/docs/guide/volumes> — `commit()` semantics for HF cache persistence.
393
+ - **Region/preemption multipliers**: pricing page footer + <https://modal.com/docs/guide/preemption>.
394
+ - **Multi-node beta**: <https://modal.com/docs/guide/multi-node-training>.
395
+ - **Examples (for `Image.pip_install` patterns)**: <https://github.com/modal-labs/modal-examples> — see `06_gpu_and_ml/llm-finetuning/` for similar 0.5B/3B finetune patterns.
396
+ - **TRL `GRPOTrainer._compute_loss` extension point**: verified in `composer_trainer.py` header comment ("DeepWiki audit of huggingface/trl, 2026-05-25"). Confirmed `super()._compute_loss(model, inputs)` works as the framework's parent-call.
397
+ - **Local trainer code reviewed**:
398
+ - `/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py`
399
+ - `/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/opsd_loss.py`
400
+
401
+ ---
402
+
403
+ ## 6. TL;DR
404
+
405
+ 1. **GPU: L4. Cost: ~$0.10/run. Total budget burn: ~50× re-runs before the $5 cap.** Don't pay for A10G, A100, or H100 on a 0.5B smoke.
406
+ 2. **Skeleton: §2** — `gpu="L4"`, 4 cores, 16 GB RAM, 30-min timeout, persistent HF cache Volume, default preemption, no region pin.
407
+ 3. **Workload-specific gotchas: §3** — 3-channel loss does up to 4 grad'd forwards/step (memory headroom check), the zero-tensor `requires_grad=True` short-circuit can break `backward()`, and `volume.commit()` is mandatory.
408
+ 4. **Decision: run on the 5090, not Modal.** 5090 finishes the smoke in ~30 s vs Modal's ~5 min including cold start, with $0 marginal cost and 10× faster iteration. Reserve Modal for parameter sweeps and 7B+ training.
docs/research/TRACE_SOURCE_RECONNAISSANCE.md ADDED
@@ -0,0 +1,403 @@
1
+ # TRACE_SOURCE_RECONNAISSANCE.md
2
+
3
+ Spike 007 trace-source audit, feeding ADR-002.
4
+
5
+ Status: **DECIDED** — recommend **(a) Claude Code session JSONL** (`~/.claude/projects/<encoded>/<sessionId>.jsonl`).
6
+
7
+ ---
8
+
9
+ ## 0. TL;DR
10
+
11
+ Of the six candidates audited, Claude Code session JSONL wins on every axis except "official Anthropic-published schema" (no such doc exists), and for that single weakness there is now a community-maintained reverse-engineered JSON Schema validated against ~50,000 messages from real sessions, plus three independent third-party schema specs. The user has **1,015 .jsonl sessions on this machine** today; the eight largest sampled span 550 → 17,315 lines and contain **6,762 multi-turn `tool_use` messages**. Acquisition cost is zero. Licensing is clean: the JSONL files are local user-owned data; the proprietary Claude Code binary is not redistributed by us.
12
+
13
+ The runners-up — OpenHands (well-documented but acquisition is non-trivial), SWE-bench trajectory submissions (heterogeneous schemas across submitters), Aider markdown chat history (lossy / unparseable for tool calls), and Cline (no public stable export format) — each lose on at least one of the four axes.
14
+
15
+ ---
16
+
17
+ ## 1. Context: TraceExample dataclass field reality
18
+
19
+ **Important correction to the parent task description.** The task brief said "TraceExample dataclass with fields state_text, action_taken, hint_text (optional), reward (float), teacher_id (str)". Reading the actual file at
20
+ `/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/teacher_replay.py` shows the existing types are different — there is no `TraceExample` class. The closest existing types are two `TypedDict`s used by `replay_trace()` and `extract_dpo_pairs()`:
21
+
22
+ ```python
23
+ class TraceState(TypedDict):
+     state_id: str         # unique within the trace
+     messages: list[dict]  # conversation up to and including this step's user prompt
+     student_action: str   # what the student actually did at this step
+
+ class DPOPair(TypedDict):
+     state_id: str
+     state_messages: list[dict]
+     chosen: str    # teacher-consensus action
+     rejected: str  # student action
+     n_teachers_agreeing: int
34
+ ```
35
+
36
+ The mapping sketch in §6 below targets `TraceState` (the *input* to teacher replay), since that is the type a `TraceIngester` is upstream of. If Spike 007 also wants a unified `TraceExample` per the brief, the natural shape is `TraceState` ∪ `{teacher_id: str | None, reward: float | None, hint_text: str | None}` — flagged for ADR-002 to settle.
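The "`TraceState` ∪ optional fields" shape mentioned above could look like the following. This is a sketch for discussion, not a decision: `TraceExample` and `to_trace_example` are hypothetical, the `TraceState` definition is repeated only to keep the block self-contained, and ADR-002 may settle on a different shape.

```python
from typing import Optional, TypedDict


class TraceState(TypedDict):
    # Mirrors the existing TypedDict in teacher_replay.py.
    state_id: str
    messages: list[dict]
    student_action: str


class TraceExample(TraceState):
    # Hypothetical extension: the three extra fields named in the task brief.
    teacher_id: Optional[str]
    reward: Optional[float]
    hint_text: Optional[str]


def to_trace_example(state: TraceState,
                     teacher_id: Optional[str] = None,
                     reward: Optional[float] = None,
                     hint_text: Optional[str] = None) -> TraceExample:
    """Widen a TraceState into a TraceExample, defaulting the extras to None."""
    return TraceExample(**state, teacher_id=teacher_id,
                        reward=reward, hint_text=hint_text)


if __name__ == "__main__":
    state: TraceState = {"state_id": "s0", "messages": [], "student_action": "ls"}
    print(to_trace_example(state, reward=1.0))
```

Inheriting from `TraceState` keeps every `TraceExample` usable wherever `replay_trace()` expects a `TraceState`.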
37
+
38
+ ---
39
+
40
+ ## 2. Candidate audit summary
41
+
42
+ Scoring legend: `+` good, `~` mixed, `-` bad, on each of the four required axes.
43
+
44
+ | # | Candidate | Schema documented | Real ≥5 multi-turn traces | Hint-receptive signal density | License OK | Verdict |
45
+ |---|---|---|---|---|---|---|
46
+ | **a** | **Claude Code JSONL** (`~/.claude/projects/`) | `~` Anthropic publishes high-level format note; community schemas are detailed and validated | **+** 1,015 local sessions, 5+ trivially | **+** Per-step `assistant.message.content[].tool_use` blocks → discrete actions, ideal teacher-correction sites | **+** User-owned local files; framework MIT | **CHOSEN** |
47
+ | b | Cline VS Code extension | `-` No published stable export schema | `~` Requires running Cline + manual export | `~` Plausible if exported but unverified | `~` Cline source Apache-2.0 but trace format isn't a stable contract | reject |
48
+ | c | OpenHands trajectories | **+** Well-documented (events/, base_state.json, Pydantic Event models) | `-` Need to *run* OpenHands or download eval traces — not zero-cost | **+** ActionEvent/ObservationEvent split is conceptually ideal | **+** OpenHands MIT-licensed | strong runner-up |
49
+ | d | Aider chat history | `~` Format is "markdown, level-4 headings for user input" — fragile | `~` Available if Aider was used | `-` Tool calls are flattened into prose; recovering structured actions is lossy | `+` Aider Apache-2.0 | reject |
50
+ | e | SWE-bench / Lite leaderboard `trajs/` | `-` Each submitter chooses a free-form text format (md/json/yaml) | **+** ~hundreds of submissions on github.com/swe-bench/experiments | `~` Heterogeneous; structured ones (e.g. mini-swe-agent `.traj.json`) are good, others are essentially logs | **+** Public submissions with usage rights for research | reject as primary; usable as future cross-validation set |
51
+ | f | SWE-smith-trajectories on HF | **+** Standard OpenAI messages format, documented per dataset card | **+** 5,017 trajectories, 76,002 rows, public | **+** Single-attempt per-instance SWE-agent runs | **+** Apache-2.0 dataset license | strong runner-up; **complement, not replacement** |
52
+
53
+ The (f) row was discovered during audit (the parent task allowed "any other public source you find that is better"). It's a strong candidate but answers a *different* question: SWE-bench trajectories give us reproducible benchmark traces; Claude Code JSONL gives us *the user's actual workflow*. For Spike 007's purpose (verify the teacher-replay path works on a real, signal-dense trace at zero acquisition cost), (a) is the right primary; (f) is queued for a later cross-validation phase.
54
+
55
+ ---
56
+
57
+ ## 3. Chosen format spec — Claude Code session JSONL
58
+
59
+ ### 3.1 Location and naming
60
+
61
+ - **Root**: `~/.claude/projects/` (overridable via `CLAUDE_CONFIG_DIR`).
62
+ Source: <https://code.claude.com/docs/en/sessions> ("Transcripts are stored as JSONL at `~/.claude/projects/<encoded-cwd>/<sessionId>.jsonl`").
63
+ - **Project-key encoding**: working-directory absolute path with `/` and `\` and `:` replaced by `-`, with a leading `-`. (Hidden directories with a leading dot become double dashes.)
64
+ Source: <https://github.com/jamie-bitflight/claude_skills/blob/main/plugins/agentskill-kaizen/skills/transcript-analysis/references/session-log-schema.md> §"Project key encoding".
65
+ - **File**: `<sessionId>.jsonl`. Subagent transcripts are `agent-<agentId>.jsonl`; a `SessionReader` should *skip* files starting with `agent-` when listing main sessions.
66
+ Source: same `claude_skills` doc, §"Subagent File Location".
67
+ - **Encoding**: UTF-8, newline-delimited JSON. One JSON object per line. No `[`/`]` wrapping. Local cleanup default 30 days, configurable via `cleanupPeriodDays` in `~/.claude/settings.json`.
68
+ Source: <https://code.claude.com/docs/en/data-usage> ("Local caching: Claude Code clients store session transcripts locally in plaintext under `~/.claude/projects/` for 30 days by default to enable session resumption.")
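The naming rules above can be sketched in a few lines. Treat this as an approximation of the community-documented encoding, not an official algorithm: `encode_project_key` and `is_main_session` are hypothetical helpers, and the real client may normalize characters beyond the ones listed here.

```python
import re
from pathlib import Path


def encode_project_key(cwd: str) -> str:
    """Approximate the project-directory name for an absolute working directory."""
    key = re.sub(r"[/\\:]", "-", cwd)  # path separators and drive colons become "-"
    return key.replace("-.", "--")     # a hidden directory's leading dot becomes a second "-"


def is_main_session(path: Path) -> bool:
    """True for <sessionId>.jsonl files, False for agent-<agentId>.jsonl subagent files."""
    return path.suffix == ".jsonl" and not path.name.startswith("agent-")


if __name__ == "__main__":
    print(encode_project_key("/mnt/e/CS/HF/eidolon"))
    print(is_main_session(Path("agent-1234.jsonl")))
```

For a POSIX path the leading `-` falls out of replacing the root `/`, which matches the directory names seen under `~/.claude/projects/`.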
69
+
70
+ ### 3.2 Common record fields
71
+
72
+ Every record (both user and assistant types) carries:
73
+
74
+ | field | type | meaning |
75
+ |---|---|---|
76
+ | `parentUuid` | `string \| null` | UUID of the parent record (null on the first record) |
77
+ | `uuid` | `string` | This record's UUID |
78
+ | `sessionId` | `string` | UUID of the session (matches filename) |
79
+ | `timestamp` | `string` (ISO-8601) | Wall-clock time of the record |
80
+ | `cwd` | `string` | Absolute working directory |
81
+ | `version` | `string` | Claude Code version (e.g. `"2.1.143"`) |
82
+ | `gitBranch` | `string` | Empty string `""` when not in a git repo |
83
+ | `isSidechain` | `boolean` | True for sub-agent (Task tool) chains |
84
+ | `userType` | `string` | `"external"` or similar |
85
+ | `type` | `string` | Discriminator — see §3.3 |
86
+ | `entrypoint` | `string` | e.g. `"sdk-cli"` |
87
+
88
+ Sources for these fields:
89
+ - <https://github.com/KyleAMathews/claude-code-ui/blob/main/spec.md> §"Type Definitions" → `BaseMessageEntry`
90
+ - <https://github.com/jamie-bitflight/claude_skills/blob/main/plugins/agentskill-kaizen/skills/transcript-analysis/references/session-log-schema.md> §"Top-Level Record Fields"
91
+ - <https://github.com/moru-ai/agent-schemas/blob/main/claude-code/v2.1.1/session.schema.json> (machine-validated against ~50,000 messages from 480 real sessions)
92
+ - Direct inspection (this doc): `head` of `~/.claude/projects/-mnt-e-CS-HF-eidolon/c6967343-51a3-4b1b-9472-a569e96114b1.jsonl` confirms presence of every field above.
93
+
94
+ ### 3.3 Record types (`type` discriminator)
95
+
96
+ | `type` | Role |
97
+ |---|---|
98
+ | `user` | Both human prompts AND tool results (distinguished by `message.content[].type`) |
99
+ | `assistant` | Model output: text, `thinking`, and `tool_use` blocks |
100
+ | `system` | Hook summaries, stop notices |
101
+ | `summary` | Context-compaction markers |
102
+ | `attachment` | Hook stdout/stderr, e.g. `SessionStart` hook output |
103
+ | `queue-operation` | Prompt enqueue/dequeue events |
104
+ | `file-history-snapshot` | File-state tracking for undo |
105
+ | `last-prompt` | Bookkeeping for resume |
106
+
107
+ Source: <https://github.com/KyleAMathews/claude-code-ui/blob/main/spec.md> §"Entry Types"; corroborated by direct `Counter` inspection of one local session showing `attachment, assistant, user, last-prompt, queue-operation` types in expected proportions.
108
+
109
+ ### 3.4 The two record types we care about
110
+
111
+ #### Assistant record carrying a tool call (the "student action")
112
+
113
+ Real example, redacted from `~/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-doc-adapter-skeleton/39df59f0-674c-413a-b333-cdac0cea9db7.jsonl`:
114
+
115
+ ```json
116
+ {
+   "type": "assistant",
+   "uuid": "24a16a51-3133-4ba5-9d23-472864286154",
+   "parentUuid": "1b11c3b3-832b-4473-a944-b61a1f3f2594",
+   "sessionId": "39df59f0-…",
+   "timestamp": "2026-05-16T04:52:21.947Z",
+   "message": {
+     "role": "assistant",
+     "model": "claude-opus-4-7",
+     "content": [
+       {
+         "type": "tool_use",
+         "id": "toolu_bdrk_012HC2dggmSgtVAtWWzwikZq",
+         "name": "Bash",
+         "input": {
+           "command": "ov mail check --agent builder-doc-adapter-skeleton 2>&1 | head -200",
+           "description": "Check builder agent inbox"
+         }
+       }
+     ],
+     "stop_reason": "tool_use",
+     "usage": { "input_tokens": 6, "cache_creation_input_tokens": 48287, "output_tokens": 1021, ... }
+   }
+ }
140
+ ```
141
+
142
+ The student's *action* at this step = the JSON of `message.content[i]` where `content[i].type == "tool_use"` (or, if multiple tool_use blocks, the array of them; or if pure-text reply, the `content[i].text` of the `text` block).
143
+
144
+ #### User record carrying a tool result (the "observation")
145
+
146
+ ```json
147
+ {
+   "type": "user",
+   "uuid": "b9f9414b-…",
+   "parentUuid": "24a16a51-…", // matches the assistant uuid above
+   "sessionId": "39df59f0-…",
+   "timestamp": "2026-05-16T04:52:23.229Z",
+   "message": {
+     "role": "user",
+     "content": [
+       {
+         "tool_use_id": "toolu_bdrk_012HC2dggmSgtVAtWWzwikZq",
+         "type": "tool_result",
+         "content": " No new messages",
+         "is_error": false
+       }
+     ]
+   },
+   "toolUseResult": { // duplicate, structured form
+     "stdout": " No new messages",
+     "stderr": "",
+     "interrupted": false,
+     "isImage": false,
+     "noOutputExpected": false
+   },
+   "sourceToolAssistantUUID": "24a16a51-…" // back-pointer to the assistant uuid
+ }
173
+ ```
174
+
175
+ User records carrying actual human prompts have `message.content` as a list with `{"type":"text","text":"..."}` blocks (or, in older logs, `message.content` as a plain string).
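The two record shapes above link through `tool_use.id` and `tool_result.tool_use_id`. A minimal sketch of joining them follows; `pair_tool_calls` and `_blocks` are hypothetical helpers, and the sketch assumes `message.content` is a list of typed blocks as in the examples (older string-valued content is simply skipped).

```python
from typing import Optional


def _blocks(rec: dict, block_type: str) -> list:
    """Return the content blocks of the given type from one JSONL record."""
    msg = rec.get("message")
    content = msg.get("content") if isinstance(msg, dict) else None
    if not isinstance(content, list):
        return []
    return [b for b in content if isinstance(b, dict) and b.get("type") == block_type]


def pair_tool_calls(records: list) -> list:
    """Pair each assistant tool_use block with its tool_result block (or None)."""
    results: dict = {}
    for rec in records:
        if rec.get("type") == "user":
            for block in _blocks(rec, "tool_result"):
                results[block.get("tool_use_id")] = block
    pairs = []
    for rec in records:
        if rec.get("type") == "assistant":
            for block in _blocks(rec, "tool_use"):
                result: Optional[dict] = results.get(block.get("id"))
                pairs.append((block, result))
    return pairs
```

A `tool_use` whose result is `None` marks a call that was interrupted or whose observation was not logged, which an ingester may want to drop.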
176
+
177
+ ### 3.5 Schema stability
178
+
179
+ - **Anthropic's official documentation** acknowledges the location and "each line is a JSON object for a message, tool use, or metadata entry" but does **not** publish a versioned schema.
180
+ - **Practical stability**: moru-ai/agent-schemas tracked v2.0.76 → v2.1.1; only one new field of note (`toolUseResult`). Schema pins `additionalProperties: true` for forward compatibility. This level of stability is sufficient for Spike 007 (a research spike, not a long-lived product API).
181
+ - **Mitigation**: pin to a specific Claude Code `version` field range and version-gate the ingester (e.g. accept `2.1.x`, warn on others).
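The version gate in the mitigation bullet could be as small as the sketch below. `check_version` and the `accepted_prefix` default are assumptions made for illustration; the accepted range is whatever the spike actually validates against.

```python
import warnings


def check_version(rec: dict, accepted_prefix: str = "2.1.") -> bool:
    """Return True if the record's Claude Code version is in the accepted range.

    Records outside the range are not rejected; the caller just gets a warning.
    """
    version = rec.get("version")
    ok = isinstance(version, str) and version.startswith(accepted_prefix)
    if not ok:
        warnings.warn(
            f"Claude Code version {version!r} is outside the tested "
            f"range {accepted_prefix}x; the schema may differ",
            stacklevel=2,
        )
    return ok


if __name__ == "__main__":
    print(check_version({"version": "2.1.143"}))
```

Warning instead of raising keeps the ingester usable on newer logs while still flagging that the schema assumptions are untested there.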
182
+
183
+ ### 3.6 Licensing
184
+
185
+ - The Claude Code binary is **proprietary** (Anthropic Commercial Terms of Service, <https://github.com/anthropics/claude-code/blob/1e95326e12183286fc6cbd828c8a86a0d8e03c62/LICENSE.md>).
186
+ - The session JSONL files are **local user data** generated on the user's machine during ordinary use. Anthropic's data-usage doc explicitly calls them "local caching … session transcripts locally in plaintext" — they belong to the user.
187
+ - Our framework is MIT-licensed and we are **not redistributing the Claude Code binary or any third-party trace files**. We are reading the user's own local logs (analogous to processing one's own `.bash_history`).
188
+ - We MUST NOT publish raw trace files in our repo without the user's consent (PII risk: cwd, gitBranch, file contents). The framework should ship only the *ingester*, plus a tiny synthetic-fixture trace for unit tests.
189
+
190
+ ---
191
+
192
+ ## 4. Acquiring the 5 real example traces
193
+
194
+ **Zero acquisition cost.** All five live on this machine right now.
195
+
196
+ Discovery command (used during this audit):
197
+
198
+ ```bash
199
+ find ~/.claude/projects -name "*.jsonl" 2>/dev/null
200
+ # → 1015 files
201
+ ```
202
+
203
+ Five concrete pre-selected sessions, each multi-turn (≥ 100 tool_use messages), each from a distinct project, each ≥ 50 KB:
204
+
205
+ | # | Tool-use msgs | User msgs | Asst msgs | Total lines | Path |
206
+ |---|---|---|---|---|---|
207
+ | 1 | 2,830 | 3,199 | 4,325 | 17,315 | `/home/codeseys/.claude/projects/-mnt-e-CS-HF-eidolon/c6967343-51a3-4b1b-9472-a569e96114b1.jsonl` |
208
+ | 2 | 1,350 | 1,407 | 2,016 | 7,673 | `/home/codeseys/.claude/projects/-mnt-e-CS-github-agent-manager/c42b68ea-d410-455e-bc71-92ec6c4adce9.jsonl` |
209
+ | 3 | 984 | 1,032 | 1,549 | 5,783 | `/home/codeseys/.claude/projects/-mnt-e-CS-HF-streaming-speech-to-speech/73c9925c-d5e5-48fc-a97b-a58687c2fb3c.jsonl` |
210
+ | 4 | 717 | 759 | 1,142 | 4,036 | `/home/codeseys/.claude/projects/-mnt-e-CS-github/6ac8e20f-98ec-4279-9957-e68862a90c5e.jsonl` |
211
+ | 5 | 125 | 126 | 197 | 629 | `/home/codeseys/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-iteration-checkpoint/e4a34e2b-40c6-49ce-b253-912a43224aae.jsonl` |
212
+
213
+ (All five inspected programmatically during this audit — counts above are real, not estimates.)
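A count like the table above can be reproduced with a short script. This is a sketch under stated assumptions: the audit's exact counting rule is not given here, so "tool-use message" is taken to mean an assistant record with at least one `tool_use` block, and `count_records` / `count_session` are hypothetical names.

```python
import json
from collections import Counter
from collections.abc import Iterable
from pathlib import Path


def count_records(lines: Iterable[str]) -> Counter:
    """Tally non-empty lines, user/assistant records, and tool_use-bearing records."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        counts["lines"] += 1
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            counts["unparseable"] += 1
            continue
        if not isinstance(rec, dict):
            continue
        rec_type = rec.get("type")
        if rec_type in ("user", "assistant"):
            counts[rec_type] += 1
        msg = rec.get("message")
        content = msg.get("content") if isinstance(msg, dict) else None
        if rec_type == "assistant" and isinstance(content, list) and any(
            isinstance(b, dict) and b.get("type") == "tool_use" for b in content
        ):
            counts["tool_use"] += 1
    return counts


def count_session(path: Path) -> Counter:
    """Stream one session file through count_records without loading it whole."""
    with path.open(encoding="utf-8") as fh:
        return count_records(fh)
```

Streaming line by line matters for the larger sessions, which run to more than 17,000 lines.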
214
+
215
+ For users on other machines: `find ~/.claude/projects -name '*.jsonl' -size +50k | head` will surface candidates. For repository CI we will commit a small (~5 KB) **synthetic** fixture conforming to the schema, never any of the user's real traces.
216
+
217
+ ---
218
+
219
+ ## 5. Decision-relevant tradeoffs vs runners-up
220
+
221
+ ### Why we are NOT picking OpenHands trajectories (c)
222
+ - **Pro**: cleanest schema we audited — Pydantic `Event` / `ActionEvent` / `ObservationEvent` models, source: <https://docs.openhands.dev/sdk/arch/events>, source code: <https://github.com/OpenHands/OpenHands/blob/3ec999e8/openhands/events/serialization/event.py>. Tool-call structure is *more* normalized than Claude Code's (explicit Action/Observation typing).
223
+ - **Con**: zero-acquisition is false here. Persistence dir defaults to `workspace/conversations/` and only exists if the user has *run OpenHands locally*. Public eval trajectories are spread across the eval/ folder rather than a clean public bucket.
224
+ - **Decisive**: Spike 001's economic floor was measured on 50 synthetic states; Spike 007's purpose is to verify ingestion + replay on real traces *that already exist*. (a) gives that today, whereas (c) requires standing up OpenHands first. OpenHands' storage format is also split between v0 (per-event JSON files) and v1 (timestamped files) per <https://github.com/All-Hands-AI/OpenHands/issues/8701>, which is a flux risk.
225
+ - **Future use**: if the framework ever ships "trace ingester adapters" plural, OpenHands is the second adapter to write — its event-typed model is conceptually superior.
226
+
227
+ ### Why we are NOT picking SWE-bench leaderboard trajectories (e)
228
+ - **Pro**: hundreds of submissions on <https://github.com/swe-bench/experiments>, with required `trajs/` folders.
229
+ - **Con**: leaderboard rules say "The reasoning trace can be represented with **any text based file format (e.g. md, json, yaml)**" (source: <https://github.com/swe-bench/experiments> README). Each submitter picks their own. Building a generic ingester is a per-submission engineering project, not a single adapter. SWE-agent uses one shape (`{"action", "observation", "response"}` arrays — confirmed via <https://huggingface.co/datasets/JetBrains-Research/swe-traj-complete>); mini-swe-agent uses `.traj.json` with OpenAI messages format (<https://huggingface.co/datasets/tarsur385/swebench-verified-trajectories>).
230
+ - **Decisive**: heterogeneous schema = fragile ingester = wrong choice for *first* spike.
231
+
232
+ ### Why we are NOT picking Aider (d)
233
+ - The `chat_history_file` is **markdown** (`.aider.chat.history.md`), per <https://aider.chat/docs/config/dotenv.html>. Source code at <https://github.com/Aider-AI/aider/blob/bdb4d9ff/aider/history.py> shows it's literally `f.write(text)` of formatted prose with `####` for user input.
234
+ - **Decisive**: tool calls in Aider are *applied as edits*, not preserved as discrete structured actions in the markdown log. Reconstructing "the action the student took at step k" is lossy. The `.aider.llm.history` log is closer to what we want but is opt-in and not always present.
235
+
236
+ ### Why we are NOT picking Cline (b)
237
+ - No public commitment to a stable export schema. Cline's storage is internal to the VS Code extension (workspace state DB + per-task JSON in extension storage). Searching for "Cline trace export schema" yields no Anthropic-style spec doc. Workable in principle, but reverse-engineering an extension's storage is not the right ground for a 1-week spike.
238
+
239
+ ### Why we are NOT picking SWE-smith-trajectories (f)
240
+ - This is the **strongest external dataset** we found and **should be Spike 007's stretch goal / Spike 008's primary**: 5,017 fine-tuning trajectories from SWE-agent + Claude 3.7 Sonnet, 4.22 GB on HuggingFace, OpenAI messages format. Source: <https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories>.
241
+ - **Why not first**: the messages-only format collapses tool calls and tool results into the OpenAI chat-completions wire format with text-encoded tool blocks. That works for SFT but is *less* signal-dense for the teacher-correction spike than Claude Code's `tool_use` blocks: Claude Code keeps each call's `name` and `input` as structurally separate fields, which makes "did the teacher pick a different tool?" a one-line check.
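The "different tool" check can be sketched as follows. It assumes both actions are serialized the way `_serialize_action` in §6 does it (a JSON list of `{"name", "input"}` objects for tool calls, plain text otherwise); `tool_names` and `picked_different_tool` are hypothetical helpers.

```python
import json


def tool_names(action: str) -> list:
    """Extract tool names from a serialized action; text-only actions yield []."""
    try:
        payload = json.loads(action)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, list):
        return []
    return [item.get("name") for item in payload if isinstance(item, dict)]


def picked_different_tool(student_action: str, teacher_action: str) -> bool:
    """True when the teacher's tool sequence differs from the student's."""
    return tool_names(student_action) != tool_names(teacher_action)


if __name__ == "__main__":
    student = json.dumps([{"name": "Bash", "input": {"command": "ls"}}])
    teacher = json.dumps([{"name": "Read", "input": {"file_path": "a.py"}}])
    print(picked_different_tool(student, teacher))
```

With text-encoded tool blocks the same check would need a per-dataset parser first, which is the signal-density gap described above.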
242
+
243
+ ---
244
+
245
+ ## 6. TraceIngester sketch
246
+
247
+ Drop-in adapter for spike-005's `replay_trace()`. Targets `TraceState` (the actual existing TypedDict; see §1).
248
+
249
+ ```python
250
+ # spikes/007-trace-ingester/trace_ingester.py
+ from __future__ import annotations
+ import json
+ from collections.abc import Iterator
+ from pathlib import Path
+ from typing import Any
+
+ # Re-use the existing TypedDicts from spike-005:
+ # from spikes.005_integrated_trainer_skeleton.teacher_replay import TraceState
+
+ # A "step" in the trace is one assistant record. The state visible to the
+ # model at that step = all user/assistant messages strictly before it, in
+ # OpenAI/Anthropic chat format. The student_action = the tool_use payload(s),
+ # or the concatenated text for a text-only turn.
+
+ def _record_to_chat_message(rec: dict[str, Any], *, skip_thinking: bool = True) -> dict[str, Any] | None:
+     """Turn one Claude Code JSONL record into an OpenAI/Anthropic chat-message
+     dict, or return None for non-conversational records (queue-operation,
+     attachment, file-history-snapshot, system, last-prompt, summary)."""
+     t = rec.get("type")
+     if t not in ("user", "assistant"):
+         return None
+     msg = rec.get("message")
+     if not isinstance(msg, dict):
+         return None
+     role = msg.get("role")
+     content = msg.get("content")
+     if role not in ("user", "assistant") or content is None:
+         return None
+     # Strip thinking blocks: they are not portable across teacher models and
+     # should not influence the teacher's decision at replay time.
+     if skip_thinking and isinstance(content, list):
+         content = [c for c in content
+                    if not (isinstance(c, dict) and c.get("type") == "thinking")]
+     # Drop records left empty after stripping (e.g. thinking-only records), so
+     # the replayed history never contains an empty-content message.
+     if not content:
+         return None
+     return {"role": role, "content": content}
+
+
+ def _serialize_action(content_blocks: list[dict[str, Any]]) -> str:
+     """Canonicalize the student's action at a step.
+
+     For tool_use steps: JSON-encode the (name, input) pairs.
+     For text-only steps: return the concatenated text.
+     """
+     tool_uses = [b for b in content_blocks if isinstance(b, dict) and b.get("type") == "tool_use"]
+     if tool_uses:
+         return json.dumps(
+             [{"name": tu.get("name"), "input": tu.get("input")} for tu in tool_uses],
+             sort_keys=True,
+         )
+     texts = [b.get("text", "") for b in content_blocks if isinstance(b, dict) and b.get("type") == "text"]
+     return "\n".join(t for t in texts if t)
+
+
+ class TraceIngester:
+     """Reads a Claude Code session JSONL and yields TraceState records.
+
+     One TraceState is emitted per assistant record with a non-empty action.
+     The `messages` field is the full prior conversation (user/assistant
+     records only; system records are dropped) up to but not including the
+     current assistant turn; `student_action` is the canonicalized
+     serialization of that turn's content blocks.
+     """
+
+     def __init__(self, *, skip_thinking: bool = True, min_action_chars: int = 1) -> None:
+         self.skip_thinking = skip_thinking
+         self.min_action_chars = min_action_chars
+
+     def ingest(self, path: str | Path) -> Iterator[dict[str, Any]]:  # yields TraceState
+         path = Path(path)
+         prior_messages: list[dict[str, Any]] = []
+         session_id_for_state = path.stem  # filename = session UUID
+
+         with path.open("r", encoding="utf-8") as f:
+             for line_idx, line in enumerate(f):
+                 line = line.strip()
+                 if not line:
+                     continue
+                 try:
+                     rec = json.loads(line)
+                 except json.JSONDecodeError:
+                     continue  # tolerate truncated last-line writes
+                 if not isinstance(rec, dict):
+                     continue  # valid JSON but not a record object
+
+                 # Honor the constructor flag (it was previously stored but unused).
+                 chat_msg = _record_to_chat_message(rec, skip_thinking=self.skip_thinking)
+                 if chat_msg is None:
+                     continue
+
+                 if chat_msg["role"] == "assistant":
+                     # Emit a TraceState representing "before this turn".
+                     blocks = chat_msg["content"] if isinstance(chat_msg["content"], list) else []
+                     student_action = _serialize_action(blocks)
+                     if len(student_action) >= self.min_action_chars:
+                         yield {
+                             "state_id": f"{session_id_for_state}:{rec.get('uuid', line_idx)}",
+                             "messages": list(prior_messages),  # snapshot
+                             "student_action": student_action,
+                         }
+                 # Append to history regardless (so subsequent turns see it).
+                 prior_messages.append(chat_msg)
346
+ ```
347
+
348
+ Notes:
349
+ - We skip `thinking` blocks because (1) they're Anthropic-specific and (2) feeding them to other-vendor teachers (GPT/DeepSeek) leaks reasoning the teacher should produce on its own. This matches the philosophy used in spike-005's `_normalize_action`.
350
+ - We do NOT inject a system prompt — Claude Code's initial system prompt is not in the JSONL (it's set at SDK init and visible only via `attachment` records). Downstream callers may want to prepend a synthetic system message for teacher fairness. Open question for ADR-002.
351
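+ - `state_id = f"{sessionId}:{recordUuid}"` is globally unique and stable across re-ingest as long as the record carries a `uuid`. The sketch takes `sessionId` from the filename stem and falls back to the line index when `uuid` is missing; that fallback is unique only within one file and shifts if earlier lines change.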
352
+ - Failures (unparseable lines, missing fields) are tolerated silently. A counters-based sibling method `ingest_with_stats(path)` is a small follow-up.
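As a rough illustration of the counters idea in the last note, the sketch below tallies what a silent ingest would otherwise hide. It is not the proposed `ingest_with_stats(path)` method itself: the function name `scan_lines` and the counter keys are hypothetical, and it works on any iterable of lines so it can be tried without a real session file.

```python
import json
from collections import Counter
from collections.abc import Iterable


def scan_lines(lines: Iterable[str]) -> Counter:
    """Count blank, unparseable, non-object, and per-type records in JSONL lines."""
    stats: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            stats["blank"] += 1
            continue
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            stats["unparseable"] += 1  # e.g. a truncated last-line write
            continue
        if not isinstance(rec, dict):
            stats["not_an_object"] += 1
            continue
        # Bucket by the record's top-level "type" field (None if absent).
        stats[f"type:{rec.get('type')}"] += 1
    return stats
```

For a real session, pass an open file handle (`with open(path, encoding="utf-8") as f: scan_lines(f)`); a non-zero `unparseable` count anywhere other than the final line would be worth flagging in the spike write-up.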
353
+
354
+ ### 6.1 Smoke-test plan (for Spike 007 itself)
355
+
356
+ ```python
357
+ ingester = TraceIngester()
+ states = list(ingester.ingest("/home/codeseys/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-iteration-checkpoint/e4a34e2b-40c6-49ce-b253-912a43224aae.jsonl"))
+ # Expect roughly 197 states (matches the assistant-message count from §4).
+ # Then teacher-replay on the first 5 states, confirm cost is in the
+ # spike-001 ballpark ($0.05–$0.20 for 5 states × 3 teachers).
362
+ ```
363
+
364
+ Spike 001 baseline to beat: $0.98/trace mean (50-state synthetic), $0.30/trace projected with VOI gating. On real states a ~5–20× cost increase is plausible due to longer message histories (10k+ tokens vs synthetic ~300 tokens), so a relevant **economic check** for Spike 007 is: if the first 5 states cost > $5 (i.e. > $1/state), the VOI gate from Spike 001 is *required* before scaling. Flag this finding in the spike write-up.
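The economic check above reduces to a per-state threshold. A minimal sketch, assuming the $1/state limit stated in the paragraph; the function name `voi_gate_required` and its parameters are hypothetical:

```python
def voi_gate_required(total_cost_usd: float, n_states: int,
                      per_state_limit_usd: float = 1.0) -> bool:
    """Return True when the mean replay cost per state exceeds the limit."""
    if n_states <= 0:
        raise ValueError("n_states must be positive")
    return total_cost_usd / n_states > per_state_limit_usd


# Five states costing $5.01 in total is just over $1/state, so the gate is needed;
# the $0.20 upper end of the expected ballpark is well under the limit.
needs_gate = voi_gate_required(5.01, 5)
within_ballpark = voi_gate_required(0.20, 5)
```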
365
+
366
+ ---
367
+
368
+ ## 7. Open questions for ADR-002
369
+
370
+ 1. Do we promote `TraceState` to a top-level `TraceExample` dataclass, with optional `teacher_id`, `reward`, `hint_text`? Or keep `TraceState` as ingester output and `DPOPair` as trainer input, treating the brief's "TraceExample" as conceptual?
371
+ 2. Should `TraceIngester.ingest()` emit one record per **assistant turn** (current sketch) or per **assistant `tool_use` block** within a turn? Some Claude Code records have multiple tool_use blocks in one assistant message.
372
+ 3. Synthetic system prompt at replay time — yes/no? If yes, what content?
373
+ 4. Trace-version pinning: hard-fail or warn when `version` field falls outside a known-tested range?
374
+ 5. Subagent transcripts (`agent-*.jsonl`) — include or skip? They are denser per-turn but their parent context is the orchestrator, not the user, which changes the teacher-replay semantics.
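For question 4, both behaviours can sit behind one flag, which may make the decision easier to defer. The sketch below assumes the `version` field is a dotted numeric string; the tested range is a placeholder rather than a validated one, and `check_trace_version` is a hypothetical name.

```python
import warnings

# Placeholder bounds (inclusive); replace with the range actually tested.
TESTED_RANGE = ((2, 0, 0), (2, 1, 1))


def check_trace_version(version, *, strict: bool = False, tested=TESTED_RANGE) -> bool:
    """Return True if `version` is inside the tested range.

    Outside the range (or unparseable): raise ValueError when strict,
    otherwise emit a warning and return False.
    """
    lo, hi = tested
    try:
        parsed = tuple(int(p) for p in version.split("."))
    except (AttributeError, ValueError):
        parsed = None  # missing field or non-numeric component
    ok = parsed is not None and lo <= parsed <= hi
    if not ok:
        msg = f"trace version {version!r} is outside tested range {lo}..{hi}"
        if strict:
            raise ValueError(msg)
        warnings.warn(msg, stacklevel=2)
    return ok
```

A spike would likely run with `strict=False` and log the warning, while a CI repro run could set `strict=True`.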
375
+
376
+ ---
377
+
378
+ ## 8. References (primary sources only)
379
+
380
+ Anthropic / Claude Code official:
381
+ - <https://code.claude.com/docs/en/sessions> — session storage location and "JSONL, one JSON per line"
382
+ - <https://code.claude.com/docs/en/data-usage> — "local caching … session transcripts locally in plaintext under `~/.claude/projects/` for 30 days by default"
383
+ - <https://code.claude.com/docs/en/legal-and-compliance> — Commercial Terms vs Consumer Terms applicability
384
+ - <https://github.com/anthropics/claude-code/blob/1e95326e12183286fc6cbd828c8a86a0d8e03c62/LICENSE.md> — proprietary license
385
+
386
+ Community schemas (reverse-engineered from real session data):
387
+ - <https://github.com/moru-ai/agent-schemas/blob/main/claude-code/v2.1.1/session.schema.json> — JSON Schema Draft 2020-12, validated against ~50,000 messages from 480 sessions
388
+ - <https://github.com/KyleAMathews/claude-code-ui/blob/main/spec.md> §"Claude Code Session Log Format" — Entry types and TypeScript discriminated union
389
+ - <https://github.com/jamie-bitflight/claude_skills/blob/main/plugins/agentskill-kaizen/skills/transcript-analysis/references/session-log-schema.md> — top-level fields, project-key encoding, subagent file location
390
+ - <https://github.com/dagster-io/erk/blob/master/docs/learned/sessions/layout.md> — directory structure, plan-mode `slug` field
391
+ - <https://github.com/pedropaulovc/claude-code-types> — TypeScript type definitions from session logs
392
+
393
+ Runners-up reference points:
394
+ - OpenHands events: <https://docs.openhands.dev/sdk/arch/events>, <https://docs.openhands.dev/sdk/guides/convo-persistence>, <https://github.com/OpenHands/OpenHands/blob/3ec999e8/openhands/events/serialization/event.py>, <https://github.com/All-Hands-AI/OpenHands/issues/8701>
395
+ - SWE-bench experiments: <https://github.com/swe-bench/experiments>
396
+ - SWE-smith trajectories on HF: <https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories>
397
+ - mini-swe-agent traj.json: <https://huggingface.co/datasets/tarsur385/swebench-verified-trajectories>
398
+ - swe-traj-complete (SWE-agent format example): <https://huggingface.co/datasets/JetBrains-Research/swe-traj-complete>
399
+ - Aider history file format: <https://aider.chat/docs/config/dotenv.html>, <https://github.com/Aider-AI/aider/blob/bdb4d9ff/aider/history.py>, <https://github.com/paul-gauthier/aider/blob/main/aider/io.py>
400
+
401
+ Internal references:
402
+ - `spikes/005-integrated-trainer-skeleton/teacher_replay.py` — `TraceState`, `DPOPair`, `replay_trace`, `extract_dpo_pairs` (read in full during this audit; see §1 for actual field list)
403
+ - Spike 001 economic floor: $0.98/trace mean ungated, $0.30/trace projected with VOI gating