Wave 7: Phase 2-4 of deep work loop — backlog, parallel research, three ADRs
Imports the gap-closer wave from VISION_VALIDATION.md as a structured backlog,
records the deep work loop log, and locks three architecture decisions backed
by primary-source research from three subagent recons.
Backlog items (CPU-only, no GPU budget):
- Spike 006 — real HF model smoke (Wave 7 next)
- Spike 007 — real trace ingestion (Wave 8)
- Spike 008 — Streaming DiLoCo smoke (Wave 9)
- Wave 10 — packaging (pyproject.toml + examples/)
- Spike 002a-mini — Modal-gated GPU smoke (Phase 10)
Research (docs/research/, 1168 lines total):
- MODAL_RECONNAISSANCE.md — Modal pricing + setup, primary-sourced from
modal.com/pricing and modal.com/docs. Verdict: Modal L4 is $0.08-0.13 per
smoke run but loses to local 5090 by 10x on iteration cycle (3-5min vs
25-40s). Modal becomes correct for parallel sweeps, 7B+ models, multi-node
training, or CI repro — none of which apply to gap-closer wave.
- DILOCO_RECONNAISSANCE.md — audited 5 candidates. meta-pytorch/torchft wins:
BSD-3, 312 commits, HEAD 2026-04-03, library-not-research-code, prebuilt
wheels, single-process unit-testable via MagicMock(Manager) pattern. Their
DiLoCo class IS the Streaming generalization (vanilla = single fragment).
Sign-convention mismatch flagged for explicit test in Spike 008.
- TRACE_SOURCE_RECONNAISSANCE.md — corrected the existing dataclass (it's
TraceState TypedDict, not TraceExample). Recommended Claude Code session
JSONL: 1015 local sessions on this machine, zero acquisition cost, 6762
tool_use messages across 5 pre-selected sessions, schema validated by 4
independent community projects + JSON Schema validated against ~50000 real
messages.
ADRs (docs/adrs/):
- ADR-001 — GPU venue: local 5090. Modal stashed for parallel sweeps and 7B+
workloads. Migration path documented.
- ADR-002 — Trace source: Claude Code session JSONL. Pattern opens door for
OpenHands and SWE-smith ingesters in v0.2.
- ADR-003 — DiLoCo impl: torchft.local_sgd.DiLoCo with shared-buffer mock
allreduce for Spike 008 single-process test. Sign-convention mismatch
caught with explicit unit test.
Deep work loop log: docs/DEEP_WORK_LOOP_LOG.md tracks all 12 phases with
status. Phases 1-4 complete; phase 5 (planning waves 7-10) is next.
Tests still 38/38 green; no code changes in this wave.
Refs: docs/VISION_VALIDATION.md gaps V2/V4/V5/V8.
- BACKLOG.md +92 -0
- docs/DEEP_WORK_LOOP_LOG.md +47 -0
- docs/adrs/ADR-001-gpu-venue.md +102 -0
- docs/adrs/ADR-002-trace-source.md +131 -0
- docs/adrs/ADR-003-diloco-impl.md +100 -0
- docs/research/DILOCO_RECONNAISSANCE.md +357 -0
- docs/research/MODAL_RECONNAISSANCE.md +408 -0
- docs/research/TRACE_SOURCE_RECONNAISSANCE.md +403 -0
@@ -0,0 +1,92 @@
# Backlog — Composer 2.5 Replication Framework

Imported from `docs/VISION_VALIDATION.md` § 6 (gaps) + § 9 (gap-closers) at 2026-05-26.

## Active items (CPU-only, no GPU budget)

### Spike 006 — Real HF model smoke (Wave 7)

**Closes**: V8 ("any HF model") — currently we run only mock 4-layer toy LM through `composer_total_loss`.

**Goal**: prove the 3-channel loss (`grpo + α·sdpo_kl + β·trace_replay_dpo`) survives a real `transformers` model + tokenizer with finite gradients and a decreasing loss across N steps.

**Acceptance**:

1. `AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")` loads on CPU.
2. Real tokenizer `apply_chat_template` produces `input_ids` shape that flows through `composer_total_loss(model, batch)` without mock shapes.
3. 5 backward steps run on CPU without `nan` / `inf` / shape mismatch.
4. Loss is monotone non-increasing across 5 steps (trend; allow noise).
5. New tests added under `spikes/006-real-hf-model-smoke/tests/` pass alongside existing 38.

**Estimate**: half a day, CPU only.
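
Acceptance items 3 and 4 above can be sketched on plain floats before any model is involved. This is a minimal illustration, not the repository's `composer_total_loss`: the helper names `combine_channels` and `trend_non_increasing` are hypothetical, and reading "trend; allow noise" as a least-squares slope check is an assumption.

```python
import math


def combine_channels(grpo, sdpo_kl, trace_replay_dpo, alpha, beta):
    """Scalar stand-in for grpo + alpha*sdpo_kl + beta*trace_replay_dpo (hypothetical helper)."""
    total = grpo + alpha * sdpo_kl + beta * trace_replay_dpo
    if not math.isfinite(total):
        # Mirrors acceptance item 3: no nan / inf may reach the optimizer.
        raise ValueError(f"non-finite loss: {total!r}")
    return total


def trend_non_increasing(losses, tol=0.0):
    """True if the least-squares slope of the loss curve is <= tol.

    Assumed reading of "monotone non-increasing (trend; allow noise)":
    single steps may go up as long as the fitted line does not.
    """
    n = len(losses)
    if n < 2:
        return True
    x_mean = (n - 1) / 2
    y_mean = sum(losses) / n
    cov = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(losses))
    var = sum((i - x_mean) ** 2 for i in range(n))
    return cov / var <= tol


if __name__ == "__main__":
    curve = [2.0, 1.8, 1.9, 1.5, 1.4]  # one noisy uptick at step 3
    print(trend_non_increasing(curve))
```

A slope check tolerates an isolated uptick, which a strict pairwise comparison would reject.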

### Spike 007 — Real trace ingestion (Wave 8)

**Closes**: V5 ("real LLM-application traces") — Spike 001 used 50 hand-crafted states. Brief said "real traces."

**Goal**: pick ONE real agent-session log format with stable, public schema, write a `TraceIngester` that converts it to our `TraceState` TypedDict, run end-to-end through the data collator + a trimmed cost-floor measurement on 5 real states.

**Acceptance**:

1. ADR-002 picks the trace source (Claude Code JSONL / Cline / OpenHands / Aider / SWE-Bench-Lite trajectories).
2. `TraceIngester.ingest(path: Path) -> Iterator[TraceState]` is implemented + has unit tests with a fixture log file.
3. End-to-end smoke: real trace → ingester → collator → 1-step `composer_total_loss` runs without error.
4. Cost-floor measurement: 5 real states × 3 teachers, p95 latency + cost report appended to `spikes/007-*/verdict.md`.

**Estimate**: 1 day + ~$2 OpenRouter.

### Spike 008 — Streaming DiLoCo smoke (Wave 9)

**Closes**: V2 (DiLoCo "deferred to v0.2" — drift from original brief).

**Goal**: bolt outer-loop pseudo-gradient sync onto the loss composition test using two `nn.Module` replicas on the same node. No real distributed training (CPU multiprocessing or single-process).

**Acceptance**:

1. ADR-003 picks the DiLoCo variant (vanilla DiLoCo from arXiv:2311.08105 / Streaming DiLoCo from Google DeepMind / Async-DiLoCo).
2. `outer_optimizer.py` implements pseudo-gradient = (θ_local − θ_initial), Nesterov-momentum outer step.
3. Smoke test: 2 replicas × 4 inner steps × 2 outer rounds on the toy model from Spike 005, both replicas converge toward the same solution within tolerance.
4. 38 existing tests still pass (no regression).

**Estimate**: 2 days, CPU.
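
Acceptance item 2 for Spike 008 can be sketched on plain Python lists. This is a minimal sketch, not the planned `outer_optimizer.py`: the function name `outer_step` is hypothetical, it uses the backlog's convention δ = θ_local − θ_initial, and it assumes the PyTorch-style Nesterov formulation (velocity updated first, then a look-ahead term added).

```python
def outer_step(theta_initial, replica_thetas, velocity, lr, momentum):
    """One outer round: average pseudo-gradients, then a Nesterov-momentum step.

    theta_initial  : parameters every replica started the round from
    replica_thetas : one parameter list per replica after its inner steps
    velocity       : outer momentum buffer carried between rounds
    """
    n = len(replica_thetas)
    # Mean over replicas of (theta_local - theta_initial), per coordinate.
    delta = [
        sum(r[i] for r in replica_thetas) / n - t0
        for i, t0 in enumerate(theta_initial)
    ]
    # Velocity update, then the Nesterov look-ahead combination.
    new_velocity = [momentum * v + d for v, d in zip(velocity, delta)]
    update = [d + momentum * v for d, v in zip(delta, new_velocity)]
    # delta already points from the start toward the replicas, so it is added.
    new_theta = [t0 + lr * u for t0, u in zip(theta_initial, update)]
    return new_theta, new_velocity


if __name__ == "__main__":
    theta0 = [0.0, 1.0]
    replicas = [[1.0, 1.0], [3.0, 3.0]]
    theta, vel = outer_step(theta0, replicas, [0.0, 0.0], lr=1.0, momentum=0.0)
    print(theta, vel)
```

With `lr=1.0` and `momentum=0.0` the step reduces to plain parameter averaging across replicas, which gives the smoke test a convenient reference point.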

### Wave 10 — Packaging

**Closes**: V4 ("skeleton not framework").

**Goal**: turn the assemblage of spike directories into an installable Python package with a clear quickstart.

**Acceptance**:

1. `pyproject.toml` at repo root, package name `composer_replication`.
2. `composer_replication/` dir with `__init__.py` re-exporting `composer_total_loss`, `OPSDLoss`, `TeacherReplayBuffer`, `compose_loss`, `TraceIngester`, etc.
3. `examples/qwen3_05b_quickstart/` with end-to-end script that loads model, runs 10 training steps, prints loss curve.
4. README quickstart updated to `pip install -e .` + `python examples/qwen3_05b_quickstart/run.py`.
5. `pip install -e .` succeeds and quickstart runs end-to-end on CPU.

**Estimate**: half a day, CPU.

## Modal-gated (if budget allows after gap-closers)

### Spike 002a-mini — Real GPU smoke (Phase 10)

**Closes**: the "did we ever run gradients on GPU" ambiguity — currently everything is CPU-only.

**Goal**: dispatch a 30-min A10G smoke on Modal that runs Spike 006 unchanged on GPU, verifies bf16 numerics, captures memory + step-time.

**Acceptance**:

1. ADR-001 says Modal is the right choice for this workload + estimate is < $5.
2. Modal app builds, runs `composer_total_loss` for 50 steps on Qwen2.5-0.5B-Instruct.
3. Loss curve + memory profile saved to `spikes/002a-mini/` and pulled to local.
4. No new shape / dtype bug surfaced vs CPU run.

**Estimate**: $1–3, 30 min wall-clock.

## Deferred (post-loop, GPU-gated)

- Spike 002a/002b — full trace collection on A100 ($30–50)
- Spike 003 — DPO-pair signal density study
- Spike 004 — A/B SWE-bench-lite with α=0/β=0 vs α>0/β>0
- Publication wave — author identity, thumbnail, X tags, post sequence

## Process notes

- Acceptance criteria are explicit and binary. Don't claim "done" unless every box ticks.
- Each spike has its own `spikes/00N-name/` dir + `verdict.md` recording acceptance + delta from estimate.
- Re-audit BACKLOG.md at end of each wave; archive completed items with their final SHAs.
@@ -0,0 +1,47 @@
# Deep Work Loop Log — Composer 2.5 Replication Framework

Started: 2026-05-26
Operator: Codeseys (Hermes Agent autonomous loop)
Skill: `deep-work-loop` v1.0.0

## Vision

> Take any HuggingFace model → further RL train it using:
> 1. RLVR (tests-pass reward),
> 2. SDPO/hint-distillation (Composer 2.5's "targeted RL with textual feedback"),
> 3. multi-teacher trace-replay DPO,
> integrated against TRL/VeRL/OpenEnv with DiLoCo-style outer loop sync.
>
> Output: a published, reproducible framework — the "Composer 2.5 replication" the open ecosystem is missing.

## Starting state

- HEAD: `040eff8` (Wave 6: vision validation self-audit, 5/10 scorecard)
- Tests: 38/38 green in `spikes/005-integrated-trainer-skeleton/`
- Working tree: clean

## Phase ledger

| Phase | Description | Status | Started | Done |
|---|---|---|---|---|
| 1 | commit-state | ✅ | 2026-05-26 | 2026-05-26 |
| 2 | backlog-audit (BACKLOG.md from VISION_VALIDATION) | ✅ | 2026-05-26 | 2026-05-26 |
| 3 | parallel-research (3 subagents) | 🟡 | 2026-05-26 | |
| 4 | architect with ADRs (ADR-001..003) | ⏳ | | |
| 5 | plan in waves (W7–W10) | ⏳ | | |
| 6 | execute W7 — Spike 006 (real HF model smoke) | ⏳ | | |
| 7 | execute W8 — Spike 007 (real trace ingestion) | ⏳ | | |
| 8 | execute W9 — Spike 008 (DiLoCo smoke) | ⏳ | | |
| 9 | execute W10 — packaging | ⏳ | | |
| 10 | (Modal-gated) Spike 002a-mini real GPU smoke | ⏳ | | |
| 11 | cross-model-final-review | ⏳ | | |
| 12 | update scorecard + push | ⏳ | | |

## Constraints

- Verify ALL claims against primary sources (Wave 2 lesson — subagent synthesis is not evidence).
- Tests must pass before commit.
- Memory L1 is at 99% — write to L2 wiki + L3 fact_store, not L1.
- Modal budget: $20 hard cap for this loop. Anything more goes to user for approval.
- No `upload_file` mixing with `git push` — `git push hf master:main` only.
- Commit messages via `-F /tmp/<wave>-commit-msg.txt`.
@@ -0,0 +1,102 @@
# ADR-001 — GPU venue for Spike 002a-mini smoke

**Status**: Accepted
**Date**: 2026-05-26
**Wave**: Phase 4 (deep work loop)

## Context

Spike 002a-mini is the optional Phase-10 gate in the deep work loop: take the
real-HF-model loss-composition smoke (Spike 006) and run it on GPU to confirm
bf16 numerics, capture memory + step-time, and rule out CPU-only blind spots
before publishing the framework.

The user has:
- a 5090 (32 GB VRAM, Blackwell) on the local box (this WSL host)
- a configured Modal account (~/.modal.toml present, modal CLI installed)

The workload:
- `Qwen/Qwen2.5-0.5B-Instruct` (~1 GB bf16 weights)
- ~50 forward+backward steps through the 3-channel loss
- single GPU, no distributed training, no FSDP

## Options considered

### Option A — Local 5090

- Free, no rate limit, no cold start.
- Iteration loop: code change → run → fix → run is **~25-40 s wall-clock per cycle**.
- 32 GB VRAM ≫ 24 GB needed for this workload.
- WSL CUDA path is the same one we use for eidolon training already; toolchain proven.
- Reuses local HF cache (~/.cache/huggingface), no re-download per run.

### Option B — Modal L4 ($0.000222/sec ≈ $0.799/hr)

- $0.08-0.13 per smoke run (3-7 min wall-clock incl cold start).
- Iteration loop: code change → modal-run dispatch → image build (cached) → cold
  start → run → modal volume get → fix is **~3-5 min per cycle even on a cache hit**.
- Persistent volume saves model re-download across runs.
- Decoupled from local environment state.
- Extensively documented gotchas in `mlops/modal-llm-training` skill (M1-M9).
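
The per-second rate quoted for Option B converts to wall-clock cost with simple arithmetic. The sketch below counts GPU-seconds only; any CPU, memory, or image-build charges are outside it, so its output is a lower bound and need not match the per-run estimate in the bullets. The helper name `gpu_cost` is hypothetical.

```python
# USD per GPU-second for a Modal L4, as quoted in the Option B heading.
L4_RATE_PER_SEC = 0.000222


def gpu_cost(seconds, rate=L4_RATE_PER_SEC):
    """GPU-time cost in USD for a run of the given wall-clock length."""
    return seconds * rate


if __name__ == "__main__":
    print(f"hourly: ${gpu_cost(3600):.3f}")
    for minutes in (3, 5, 7):
        print(f"{minutes} min run: ${gpu_cost(minutes * 60):.3f}")
```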

### Option C — Modal A100-40GB

- ~3× cost of L4 for 0.5B workload that doesn't need the capacity. Ruled out.

## Decision

**Option A — local 5090.** The 5090 dominates Modal L4 on every dimension that
matters for a sub-1B-parameter (0.5B) verification smoke:

| Dimension | 5090 (local) | Modal L4 |
|---|---|---|
| Iteration cycle | 25-40 s | 3-5 min (10× slower) |
| $ / smoke run | $0 | $0.10 |
| VRAM headroom | 32 GB > 24 GB needed | 24 GB ≈ 24 GB needed |
| State decoupling | Same machine as dev | Decoupled (advantage Modal) |
| Toolchain risk | Already proven | New for this workload |

The "decoupled state" advantage of Modal is real but doesn't outweigh the 10×
iteration penalty for what is fundamentally a verification step. We're not
running production training; we're checking that a GPU run agrees with the
CPU run we just did.

## Consequences

### Accepted

- Spike 002a-mini becomes a **local 5090** smoke, not a Modal job.
- The `mlops/modal-llm-training` skill's L4 pattern (modal_app.py skeleton in
  `docs/research/MODAL_RECONNAISSANCE.md`) is **stashed for future use** — it's
  the right pattern when we DO need cloud GPU.
- `docs/research/MODAL_RECONNAISSANCE.md` stays in the repo as the design
  document for the Modal path; the file documents *why* we didn't use Modal
  for this smoke and *when* Modal becomes correct.

### Modal becomes the right choice when

1. **Parallel parameter sweeps** — N independent runs across α, β, lr, etc.
   that need to fan out faster than wall-clock-sequential on a single 5090.
2. **Scaling to ≥7B base models** — 5090's 32 GB starts to bind on 7B + LoRA
   + activation memory at seq 4096+. A100-40 or H100 becomes necessary.
3. **Multi-node training** — DiLoCo-style outer-loop across 2+ physical
   nodes for the eventual full RL run.
4. **CI / reproducibility** — a future contributor wants to repro our results
   without owning a 5090.

These are all **post-replication** workloads. The deep work loop's gap-closer
phase (W7-W10) doesn't need any of them.

### Trade-offs explicitly accepted

- We carry one local-environment dependency (WSL CUDA + the 5090 driver) that
  Modal would have absorbed. Mitigated by: the same dependency is already
  exercised by eidolon training, so the marginal risk is zero.
- We don't get an audit-friendly "Modal app run with persistent receipts"
  artifact. Mitigated by: capturing `nvidia-smi` snapshots + step-time CSV
  into `spikes/006-real-hf-model-smoke/results/` as our local audit trail.

## Source

`docs/research/MODAL_RECONNAISSANCE.md` (subagent recon, primary-sourced from
modal.com/pricing and modal.com/docs, 2026-05-26).
@@ -0,0 +1,131 @@
# ADR-002 — Trace source for Spike 007 (real LLM-application traces)

**Status**: Accepted
**Date**: 2026-05-26
**Wave**: Phase 4 (deep work loop)

## Context

Spike 007 closes V5 of the vision validation: "real LLM-application traces."
Spike 001 used 50 hand-crafted synthetic states for the cost-floor measurement.
The framework's brief explicitly said *real traces*, so we owe Spike 007 a
primary-sourced ingestion path that converts a real, public, multi-turn agent
trace format into our existing `TraceState` TypedDict.

Existing schema (verified from `spikes/005-integrated-trainer-skeleton/teacher_replay.py`):

```python
class TraceState(TypedDict):
    state_id: str # unique within the trace
    messages: list[dict] # OpenAI-style conversation up to + incl this step
    student_action: str # what the student did at this step
```

(Earlier deep-work-loop notes called this `TraceExample` — that was a brain
glitch; the actual type is `TraceState` and there is no `TraceExample`.)

## Options considered

| Option | Schema | Acquisition | Signal density | License |
|---|---|---|---|---|
| (a) Claude Code session JSONL | Documented + 4 reverse-engineered schemas | **1,015 local sessions** zero-cost | per-step `tool_use` blocks = ideal teacher-correction sites | User-owned local files; framework MIT |
| (b) Cline VS Code extension | No stable export schema | Would need custom extraction | Unknown until extracted | Apache 2.0 (extension), trace data user-owned |
| (c) OpenHands trajectories | Documented (v0/v1 in flux) | Need to run OpenHands or download leaderboard submissions | Strong | MIT |
| (d) Aider chat history | Markdown chat (lossy for tool calls) | Local only if user runs Aider | Weak — collapses tool structure | Apache 2.0 |
| (e) SWE-bench leaderboard trajs | Heterogeneous, free-format | Public download | Strong but uneven | Per-submission (mostly permissive) |
| (f) SWE-smith-trajectories (HF) | Messages-only, structure collapsed | HF dataset download | Strong but lossy | MIT |

Source: `docs/research/TRACE_SOURCE_RECONNAISSANCE.md` (2026-05-26 subagent recon).

## Decision

**Option (a) — Claude Code session JSONL** at `~/.claude/projects/<encoded>/<sessionid>.jsonl`.

Wins on every axis we care about for Spike 007:

1. **Acquisition cost: zero.** 1,015 real sessions already on this machine
   from the user's daily Claude Code use. No download, no consent
   negotiation, no rate limiting, no schema change risk during ingestion
   development.

2. **Schema stability: empirically validated.** The subagent ran a programmatic
   audit on 8 real sessions; record types are stable across all of them.
   Anthropic publishes user-facing docs for the format; four independent
   community projects (claude-code-cli-tools, claudeflow, etc.) ship
   working parsers including one with a JSON Schema validated against
   ~50,000 real messages.

3. **Signal density: maximal.** Every `tool_use` block is a candidate
   teacher-correction site. The 5 pre-selected sessions in the recon doc
   contain 6,762 tool_use messages (range 125 → 2,830 per session). That's
   100× the density of Spike 001's 50 synthetic states.

4. **License: clean.** The trace files are user-owned files on the user's
   own machine. We don't redistribute them with the framework. The
   *ingester* code we write is MIT and ships in the framework. Anyone
   running the framework who wants real-trace ingestion uses their own
   local Claude Code sessions.

## Consequences

### Accepted

- Spike 007 implements `TraceIngester.ingest(path: Path) -> Iterator[TraceState]`
  for the Claude Code JSONL format.
- The TraceIngester ships as part of the package (Wave 10 packaging) under
  `composer_replication.ingestion.claude_code`.
- The recon doc's 5 pre-selected real sessions become the **smoke fixture**
  for Spike 007's tests. We pin to a known set of session IDs so the test
  is deterministic locally; CI users substitute their own.
- `ingestion/` directory pattern is established now to support adding
  ingesters for OpenHands and SWE-smith later if Spike 007 reveals
  signal-density gaps.

### Open questions resolved by ADR-002

1. **Granularity** — One `TraceState` per assistant turn (not per `tool_use`).
   A single assistant turn often emits multiple `tool_use` blocks for one
   reasoning step; treating each tool_use as a separate state would
   over-fragment the conversation. Discussion in TRACE_SOURCE_RECONNAISSANCE
   §5.

2. **`student_action` mapping** — The literal text of the assistant turn
   (concatenated `text` blocks of the Claude message) becomes
   `student_action`. The teacher-replay channel asks N teachers to produce
   their version of "what should the assistant do here?" given the
   `messages` history; we then DPO-compare teacher consensus vs literal
   student text.

3. **Thinking blocks** — Strip `thinking` blocks from the message history
   passed to teachers (teachers don't have access to Claude's reasoning
   trace). KEEP them in the `student_action` for the student's own
   reproduction loop, since that's the actual generation we'd be RL-training.

4. **System prompt** — Inject a synthetic system prompt at message[0] of
   each `TraceState` describing "you are a coding agent" so teachers
   without their own coding-agent system prompt have a fair playing field.

5. **Subagent traces** — Skip them in v0.1; only ingest top-level sessions.
   Subagent traces have a different structure (parent task ID etc.) that
   would complicate the v0.1 ingester.
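
The five resolutions above can be sketched as a small generator. This is an illustration under stated assumptions, not the Spike 007 ingester: the record fields (`type`, `isSidechain`, `message.content`, block keys `text` and `thinking`) are assumed from the general shape of Claude Code session logs rather than taken from this ADR, the system prompt wording is hypothetical, `messages` is taken to end just before the student's reply, and `tool_use` / `tool_result` blocks are dropped for brevity.

```python
import json
from pathlib import Path
from typing import Iterator, TypedDict


class TraceState(TypedDict):
    state_id: str
    messages: list[dict]
    student_action: str


# Hypothetical wording; the ADR only says the prompt describes a coding agent.
SYSTEM_PROMPT = {"role": "system", "content": "You are a coding agent."}


def _blocks(content) -> list[dict]:
    """Normalize message content (plain string or block list) to typed blocks."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return [b for b in (content or []) if isinstance(b, dict)]


def _join(blocks: list[dict], kinds: tuple[str, ...]) -> str:
    """Concatenate the text carried by blocks whose type is in `kinds`."""
    parts = [b.get("text") or b.get("thinking") or ""
             for b in blocks if b.get("type") in kinds]
    return "\n".join(p for p in parts if p)


def ingest(path: Path) -> Iterator[TraceState]:
    history: list[dict] = [SYSTEM_PROMPT]            # resolution 4
    with path.open(encoding="utf-8") as fh:
        for n, line in enumerate(fh):
            if not line.strip():
                continue
            rec = json.loads(line)
            if rec.get("isSidechain"):               # resolution 5 (assumed flag)
                continue
            if rec.get("type") not in ("user", "assistant"):
                continue                             # unknown record types are skipped
            blocks = _blocks(rec.get("message", {}).get("content"))
            visible = _join(blocks, ("text",))       # resolution 3: no thinking in history
            if rec["type"] == "assistant":
                action = _join(blocks, ("thinking", "text"))  # resolutions 2 and 3
                if action:
                    yield TraceState(                # resolution 1: one state per turn
                        state_id=f"{path.stem}:{n}",
                        messages=list(history),
                        student_action=action,
                    )
            if visible:
                history.append({"role": rec["type"], "content": visible})
```

Copying `history` at each yield keeps earlier states stable while later turns extend the conversation.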

### Recon-flagged risk (not blocking)

- Anthropic doesn't publish a versioned schema. The TraceIngester pins to
  known record-types as of 2026-05-26 and gracefully degrades on unknown
  types. If Anthropic ships a breaking change to the JSONL format, we'd
  need to bump a `schema_version` constant in the ingester. Acceptable
  ongoing maintenance burden.

### Future ingesters

Open the door for two more ingesters in v0.2:
- `composer_replication.ingestion.openhands` — for users who run OpenHands
- `composer_replication.ingestion.swe_smith` — for users who download the HF dataset

Both follow the same `Iterator[TraceState]` contract.

## Source

`docs/research/TRACE_SOURCE_RECONNAISSANCE.md` (subagent recon, primary-sourced
including direct inspection of the user's local sessions, 2026-05-26).
@@ -0,0 +1,100 @@
# ADR-003 — DiLoCo implementation choice for Spike 008

**Status**: Accepted
**Date**: 2026-05-26
**Wave**: Phase 4 (deep work loop)

## Context

Spike 008 closes V2 of the vision validation: DiLoCo was tagged "deferred to
v0.2" in Wave 5 but the original brief said "combine with DiLoCo." We want a
real working integration, not a hand-rolled toy.

The integration target: take pseudo-gradient `δ = θ_local − θ_initial` after N
inner steps, apply Nesterov-momentum outer step across replicas. We need a
PyTorch-compatible reference implementation that runs in single-process for
unit tests AND scales out on torch.distributed when we eventually run real
multi-replica training.

## Options considered

| Repo | License | Last commit | Maturity | Streaming variant? | Single-process testable? |
|---|---|---|---|---|---|
| `meta-pytorch/torchft` | BSD-3 | 2026-04-03 | Library (PyPI, prebuilt wheels, real test suite, Meta-maintained) | Yes (DiLoCo class IS the Streaming generalization; vanilla = single fragment) | Yes (verified via `MagicMock(Manager)` + `_DummyWork` pattern in their own tests) |
| `OpenDiLoCo` (PrimeIntellect) | Apache 2.0 | 2024 | README says "no longer maintained"; replaced by `prime` | Partial | Hivemind dependency complicates testing |
| `prime` / INTELLECT-1 (PrimeIntellect) | Apache 2.0 | 2025 | Production framework (`ElasticDeviceMesh` etc.) | Yes | Heavy harness; not single-process friendly |
| `diloco_simple` | **No LICENSE file** | 2024-05-31 | 8 commits ever; pedagogical | No | NCCL-locked |
| DeepMind original (Douillard et al. arXiv:2311.08105) | — | — | No public reference impl | — | — |

Source: `docs/research/DILOCO_RECONNAISSANCE.md` (subagent recon, 2026-05-26).

## Decision

**`meta-pytorch/torchft` — `torchft.local_sgd.DiLoCo`** (BSD-3, active,
single-process testable).

Rationale:

1. **Library, not research code.** Proper packaging on PyPI with prebuilt
   wheels (`pip install torchft-nightly`), real test suite, version history,
   maintained by Meta. The other live candidates are research codebases that
   break on torch version bumps.

2. **Streaming DiLoCo is the generalization.** The `DiLoCo` class accepts
   `model_fragments` + `fragment_sync_delay` + `fragment_update_alpha`. Set
   `model_fragments=[model]` (single fragment, full-model sync) for vanilla
   DiLoCo. Add fragments + per-fragment delays for Streaming. We don't have
   to choose at the API level — both modes are one parameter apart.

3. **Single-process unit-testable.** torchft's own tests use
   `MagicMock(Manager)` + `_DummyWork` to bypass NCCL. We can do the same:
   shared-buffer mock allreduce that does real averaging across two
   in-process replicas. Verified working pattern in the recon doc.

4. **Pseudo-gradient computation is in `_save_grads` (line 324) and
   `perform_sync` (line 423).** Direct extension point — we can subclass or
   monkey-patch these to compose with our Composer trainer.

### Risks accepted (with mitigations)

| Risk | Mitigation |
|---|---|
| **Sign convention mismatch** — torchft computes `θ_initial − θ_local` (negation of our spec) | Explicit unit test: assert outer step direction matches DiLoCo paper sign. Document the convention in our `outer_optimizer.py`. |
| **Wheel brittleness for nightly** | Pin a specific dated nightly slug in our pyproject.toml; bump deliberately. |
| **`torch>=2.7` requirement** | Confirm our existing eidolon venv has it. Already does (verified). |
| **`fragment_sync_delay > 0` requires CUDA streams** | Spike 008 uses `fragment_sync_delay=0` (vanilla DiLoCo) for the smoke. Streaming with non-zero delay deferred to v0.2 (post-replication). |
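
The sign-convention risk in the table can be pinned down with a momentum-free check on plain lists. This is a minimal sketch, not torchft code or the planned wrapper: all three function names are hypothetical, and it only shows that the two conventions agree when each pseudo-gradient is paired with the matching update sign.

```python
def step_local_minus_initial(theta_initial, theta_local, lr):
    """Our spec: delta = theta_local - theta_initial, ADDED to the start point."""
    delta = [l - i for l, i in zip(theta_local, theta_initial)]
    return [i + lr * d for i, d in zip(theta_initial, delta)]


def step_initial_minus_local(theta_initial, theta_local, lr):
    """Negated convention: theta_initial - theta_local, SUBTRACTED like a gradient."""
    grad = [i - l for l, i in zip(theta_local, theta_initial)]
    return [i - lr * g for i, g in zip(theta_initial, grad)]


def moved_toward(theta_initial, theta_new, theta_local):
    """True if no coordinate of theta_new is farther from theta_local than the start."""
    return all(abs(n - l) <= abs(i - l)
               for i, n, l in zip(theta_initial, theta_new, theta_local))


if __name__ == "__main__":
    t0, local = [0.0, 2.0], [1.0, 1.0]
    a = step_local_minus_initial(t0, local, lr=0.5)
    b = step_initial_minus_local(t0, local, lr=0.5)
    print(a, b, moved_toward(t0, a, local))
```

Mixing the two (adding the negated quantity) would move the parameters away from the replicas, which is the failure the explicit unit test is meant to catch.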

## Consequences

### Accepted

- Spike 008 imports `torchft.local_sgd.DiLoCo` and runs the recon doc's
  ready-to-paste pytest pattern as the smoke:
  - 2 replicas, 4 inner steps, 2 outer rounds on a TinyMLP
  - shared-buffer mock allreduce (no NCCL)
  - assertions: replica equality after sync, params actually moved, Nesterov
    state populated, sync count matches expected

- `composer_replication.diloco` package wraps `torchft.local_sgd.DiLoCo`
  with our trainer's hooks. We DO NOT fork torchft — we depend on it as a
  versioned wheel.

- Our integration is "vanilla DiLoCo" (single fragment, full-model sync) for
  v0.1. Streaming DiLoCo is a configuration-flag away in v0.2.

- The sign-convention mismatch is **made explicit** in our wrapper code with
  a unit test that catches a sign flip if torchft ever inverts it.
|
| 87 |
+
|
| 88 |
+
### Rejected paths
|
| 89 |
+
|
| 90 |
+
- **Roll our own DiLoCo.** Tempting (the algorithm is short) but the test
|
| 91 |
+
surface for distributed-correctness is large; reusing a Meta-maintained
|
| 92 |
+
library cuts the audit burden.
|
| 93 |
+
- **`diloco_simple`.** Disqualified by the license absence alone.
|
| 94 |
+
- **`prime` / INTELLECT-1.** Right tool for production multi-node runs,
|
| 95 |
+
wrong tool for a single-process unit test.
|
| 96 |
+
|
| 97 |
+
## Source
|
| 98 |
+
|
| 99 |
+
`docs/research/DILOCO_RECONNAISSANCE.md` (subagent recon, primary-sourced
|
| 100 |
+
from torchft repo cloned + read locally, 2026-05-26).
|
|
@@ -0,0 +1,357 @@
|
| 1 |
+
# DiLoCo Reference Implementation Reconnaissance
|
| 2 |
+
|
| 3 |
+
**Date:** 2026-05-25
|
| 4 |
+
**Purpose:** Pick ONE PyTorch reference implementation of (Streaming) DiLoCo to bolt onto
|
| 5 |
+
the composer-replication-framework outer-loop optimizer. Feeds ADR-003.
|
| 6 |
+
|
| 7 |
+
**Bias:** simple + working > fancy + theoretically-better. Library > research codebase.
|
| 8 |
+
|
| 9 |
+
---
|
| 10 |
+
|
| 11 |
+
## TL;DR — Recommendation
|
| 12 |
+
|
| 13 |
+
**Use `meta-pytorch/torchft`'s `torchft.local_sgd.DiLoCo` context manager.**
|
| 14 |
+
|
| 15 |
+
It is a maintained library (not a research codebase), BSD-3 licensed, supports both
|
| 16 |
+
vanilla DiLoCo and Streaming DiLoCo through one class, and — critically — is unit-testable
|
| 17 |
+
in a single process by passing a `MagicMock(Manager)` whose `allreduce` returns a `_DummyWork`.
|
| 18 |
+
Their own `torchft/local_sgd_test.py` already demonstrates the exact pattern Spike 008 needs.
|
| 19 |
+
|
| 20 |
+
The Streaming DiLoCo paper (Liu et al. 2025, arXiv:2501.18512) has no separate community
|
| 21 |
+
implementation — torchft *is* the reference implementation as of mid-2026. PrimeIntellect's
|
| 22 |
+
two repos are either too minimal (`diloco_simple`, no LICENSE, NCCL-locked, no Streaming) or
|
| 23 |
+
deprecated (`OpenDiLoCo`, hivemind-based, "no longer maintained" per its own README).
|
| 24 |
+
|
| 25 |
+
---
|
| 26 |
+
|
| 27 |
+
## Candidates Audited (primary sources only)
|
| 28 |
+
|
| 29 |
+
### A1. PrimeIntellect-ai/diloco_simple
|
| 30 |
+
- URL: https://github.com/PrimeIntellect-ai/diloco_simple
|
| 31 |
+
- License: **NONE** (no LICENSE file in repo — confirmed via `git clone` + `ls`).
|
| 32 |
+
All-rights-reserved by default under copyright law. **Cannot legally vendor or fork.**
|
| 33 |
+
- Last commit: **2024-05-31** (`be38ec4 add weight decay`).
|
| 34 |
+
- Activity: 8 commits total, ever. Two main authors. Effectively abandoned.
|
| 35 |
+
- Shape: single 180-LOC research script (`pure_torch_diloco.py`), pedagogical demo.
|
| 36 |
+
- Streaming DiLoCo? **No.** Vanilla DiLoCo only.
|
| 37 |
+
- Distributed: **Hard-coded NCCL via `torchrun`** + `init_process_group(backend="nccl")`.
|
| 38 |
+
Pulls in `wandb`, `transformers`, HuggingFace `datasets`, `cyclopts`, and trains a
|
| 39 |
+
full LlamaForCausalLM on C4. Not a library — a benchmark script.
|
| 40 |
+
- Verdict: **REJECT.** No license, no Streaming, no library API, NCCL-only, deps on
|
| 41 |
+
HF + wandb just to run. Useful as an *algorithm reference*, not as code to depend on.
|
| 42 |
+
|
| 43 |
+
### A2. PrimeIntellect-ai/OpenDiLoCo
|
| 44 |
+
- URL: https://github.com/PrimeIntellect-ai/OpenDiloco
|
| 45 |
+
- License: present (Apache-2.0 typical, not re-verified — moot, see below).
|
| 46 |
+
- Status: **Officially deprecated.** README first paragraph:
|
| 47 |
+
> "**Important Notice**: OpenDiLoCo is no longer maintained. For our production-ready
|
| 48 |
+
> distributed training solution, please check out `prime`."
|
| 49 |
+
- Built on: `hivemind` (DHT-based decentralized training). Multi-machine only.
|
| 50 |
+
- Streaming DiLoCo? No.
|
| 51 |
+
- Verdict: **REJECT.** Deprecated by its authors. Hivemind dependency would force us to
|
| 52 |
+
set up DHT initial peers just to run a unit test.
|
| 53 |
+
|
| 54 |
+
### A3. PrimeIntellect-ai/prime (a.k.a. INTELLECT-1 framework)
|
| 55 |
+
- URL: https://github.com/PrimeIntellect-ai/prime — note: the GitHub org now uses this
|
| 56 |
+
repo for their CLI/SDK; the original training framework was rebranded.
|
| 57 |
+
- The actual INTELLECT-1 training code uses an `ElasticDeviceMesh` abstraction and is
|
| 58 |
+
a full distributed training stack, not an algorithm library.
|
| 59 |
+
- Verdict: **REJECT.** Production framework, not a drop-in library. Coupling a 1.5k-LOC
|
| 60 |
+
fault-tolerant elastic mesh into our test framework is the opposite of "simple + working".
|
| 61 |
+
|
| 62 |
+
### A4. DeepMind reference implementation (Douillard et al., arXiv:2311.08105)
|
| 63 |
+
- **No public reference implementation exists.** The DiLoCo paper is algorithm-only.
|
| 64 |
+
Confirmed: paper has no associated GitHub link in arXiv abstract or PDF; HuggingFace
|
| 65 |
+
papers page links no code. DeepMind has not open-sourced their internal trainer.
|
| 66 |
+
- Verdict: **N/A — does not exist.**
|
| 67 |
+
|
| 68 |
+
### A5. meta-pytorch/torchft ← **CHOSEN**
|
| 69 |
+
- URL: https://github.com/meta-pytorch/torchft
|
| 70 |
+
- License: **BSD 3-Clause** (verified: `head -5 LICENSE` → "BSD 3-Clause License").
|
| 71 |
+
- Last commit on main: **2026-04-03** (HEAD `7eb7087 Add torchcomms ProcessGroup shim
|
| 72 |
+
for fault-tolerant reconfiguration`).
|
| 73 |
+
- Activity: 312 commits, multiple Meta contributors, recent commits across 2025 and 2026,
|
| 74 |
+
active CI, nightly PyPI builds at https://pypi.org/project/torchft-nightly/.
|
| 75 |
+
- Shape: **library**, not a research codebase. `torchft/` is a proper Python package with
|
| 76 |
+
`local_sgd.py`, `manager.py`, `process_group.py`, `local_sgd_test.py` (real pytest unit
|
| 77 |
+
tests), pyproject.toml, BSD-3.
|
| 78 |
+
- Streaming DiLoCo? **Yes** — the `DiLoCo` class is itself a Streaming DiLoCo
|
| 79 |
+
generalization (`fragment_sync_delay`, `fragment_update_alpha`); pass a single-element
|
| 80 |
+
`model_fragments=[model]` for vanilla DiLoCo.
|
| 81 |
+
- Source comment confirms: `"""... DiLoCo paper: https://arxiv.org/pdf/2311.08105 /
|
| 82 |
+
Streaming DiLoCo paper: https://arxiv.org/pdf/2501.18512 """`
|
| 83 |
+
|
| 84 |
+
---
|
| 85 |
+
|
| 86 |
+
## Deep Dive: torchft (the chosen one)
|
| 87 |
+
|
| 88 |
+
### (1) Repo metadata
|
| 89 |
+
| Field | Value |
|---|---|
| URL | https://github.com/meta-pytorch/torchft |
| License | BSD 3-Clause |
| HEAD commit | `7eb7087` (2026-04-03) |
| Total commits on main | 312 |
| Activity level | **Active**: commits in 2025 + 2026, Meta-maintained, PyPI nightly builds |
| Distribution | `pip install torchft-nightly` (prebuilt wheels) **OR** install from source (requires Rust + protobuf-compiler + maturin; needed only for the Lighthouse/process-group Rust ext, not the algorithm code) |
| Python | `requires-python = ">=3.8"`; `torch>=2.7` per `pyproject.toml` |
|
| 98 |
+
|
| 99 |
+
### (2) Exact API / extension point
|
| 100 |
+
|
| 101 |
+
The integration target is `torchft/local_sgd.py`. Two relevant classes:
|
| 102 |
+
|
| 103 |
+
```python
# Public class: drop-in context manager
class DiLoCo:
    def __init__(
        self,
        manager: Manager,                 # we mock this
        model_fragments: List[nn.Module],  # [model] for vanilla DiLoCo
        inner_optimizer: optim.Optimizer,
        outer_optimizer: optim.Optimizer | list[optim.Optimizer],
        sync_every: int,                  # N inner steps
        backup_device: Optional[torch.device] = None,
        pin_memory: bool = True,
        use_bucketization: bool = False,
        bucket_cap_mb: Optional[int] = None,
        should_quantize: bool = False,
        fragment_sync_delay: int = 0,     # τ in Streaming DiLoCo paper
        fragment_update_alpha: float = 0.0,
    ) -> None: ...
```
|
| 122 |
+
|
| 123 |
+
The **pseudo-gradient** is computed in `_StreamingDiLoCoFragment._save_grads()`
|
| 124 |
+
(`torchft/local_sgd.py` line 324):
|
| 125 |
+
|
| 126 |
+
```python
def _save_grads(self) -> None:
    """Saves pseudo-gradients of the parameters"""
    with torch.no_grad():
        for name, p in self._model_fragment.named_parameters():
            local_param = p.to_local() if isinstance(p, DTensor) else p
            pseudogradient = self.original_parameters[name].to(p.device) - local_param
            self._grads[name] = pseudogradient
```
|
| 135 |
+
|
| 136 |
+
Note the **sign**: `original − local` (i.e. `θ_initial − θ_local`). When this is later
|
| 137 |
+
copied into `p.grad` via `_set_grads`, an SGD step `p ← p − lr · grad` becomes
|
| 138 |
+
`p ← θ_initial − lr · (θ_initial − θ_local)` = a step *toward* `θ_local`. Our spec
|
| 139 |
+
says δ = θ_local − θ_initial; torchft uses the negation. Either convention works as
|
| 140 |
+
long as the outer optimizer's lr sign is consistent — torchft uses positive `outer_lr`
|
| 141 |
+
(e.g. 0.7) and SGD which subtracts the grad, so the math nets out. **Be careful when
|
| 142 |
+
unit-testing the sign in Spike 008.**
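
The sign argument above can be checked by hand before touching torchft. The following is a minimal sketch in plain Python, not torchft code: `nesterov_sgd_step` and `moves_toward` are hypothetical helpers, the parameter values are made up, and the update rule is assumed to follow the Nesterov form used by `torch.optim.SGD` with zero dampening and no weight decay. It feeds both sign conventions to the same outer step and checks which one moves the parameters toward `θ_local`.

```python
def nesterov_sgd_step(params, grads, bufs, lr, momentum):
    """One SGD step with Nesterov momentum on plain floats.

    Assumed rule (dampening 0, no weight decay):
        buf  = grad                      on the first step
        buf  = momentum * buf + grad     afterwards
        p   -= lr * (grad + momentum * buf)
    """
    new_params, new_bufs = [], []
    for p, g, buf in zip(params, grads, bufs):
        buf = g if buf is None else momentum * buf + g
        update = g + momentum * buf
        new_params.append(p - lr * update)
        new_bufs.append(buf)
    return new_params, new_bufs


def moves_toward(start, end, target):
    """True if every coordinate moved from start in the direction of target."""
    return all((e - s) * (t - s) > 0 for s, e, t in zip(start, end, target))


# Illustrative values only.
theta_initial = [1.0, -2.0]
theta_local = [1.5, -2.5]  # where the inner optimizer ended up

# torchft convention: theta_initial - theta_local
torchft_grad = [i - l for i, l in zip(theta_initial, theta_local)]
# Our spec's delta: theta_local - theta_initial (the negation)
spec_delta = [l - i for i, l in zip(theta_initial, theta_local)]

theta_ok, _ = nesterov_sgd_step(
    theta_initial, torchft_grad, [None, None], lr=0.7, momentum=0.9
)
theta_flipped, _ = nesterov_sgd_step(
    theta_initial, spec_delta, [None, None], lr=0.7, momentum=0.9
)

print(moves_toward(theta_initial, theta_ok, theta_local))
print(moves_toward(theta_initial, theta_flipped, theta_local))
```

A direction check of this shape is stricter than "parameters changed": handing the un-negated `θ_local − θ_initial` to a subtracting optimizer also changes the parameters, only away from `θ_local`.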
|
| 143 |
+
|
| 144 |
+
The **outer Nesterov step** is in `_StreamingDiLoCoFragment.perform_sync()` (line 423):
|
| 145 |
+
|
| 146 |
+
```python
if should_commit:
    self._set_grads()              # write pseudogradient into p.grad
    self._outer_optimizer.step()   # Nesterov SGD step (user-provided)
    self.save_parameters()
    self._merge_parameters()
self._outer_optimizer.zero_grad()
```
|
| 154 |
+
|
| 155 |
+
The Nesterov-ness lives in the user-provided outer optimizer, e.g.:
|
| 156 |
+
```python
|
| 157 |
+
outer_optimizer = torch.optim.SGD(model.parameters(), lr=0.7, momentum=0.9, nesterov=True)
|
| 158 |
+
```
|
| 159 |
+
This matches the DiLoCo paper exactly (Douillard §3 specifies Nesterov momentum outer).
|
| 160 |
+
|
| 161 |
+
The cross-replica all-reduce happens in `_average_grads()` (called from `prepare_sync`)
|
| 162 |
+
via `self._manager.allreduce(...)` — which is the seam we mock for single-process tests.
|
| 163 |
+
|
| 164 |
+
### (3) torch.distributed dependency for testing?
|
| 165 |
+
|
| 166 |
+
**No, not for unit tests.** The `Manager` is mockable. From `torchft/local_sgd_test.py`:
|
| 167 |
+
|
| 168 |
+
```python
from unittest.mock import create_autospec, MagicMock

import torch

from torchft.manager import Manager
from torchft.work import _DummyWork


def create_manager() -> MagicMock:
    manager = create_autospec(Manager)
    manager.errored.return_value = None

    def mock_allreduce(tensor: torch.Tensor, should_quantize: bool = False):
        return _DummyWork(tensor)  # returns the same tensor unchanged

    manager.allreduce.side_effect = mock_allreduce
    return manager
```
|
| 181 |
+
|
| 182 |
+
This bypasses NCCL/Gloo entirely. `_DummyWork` just wraps the tensor and returns it as
|
| 183 |
+
the "all-reduced" result, so a single-process test with `world_size=1` works directly,
|
| 184 |
+
and a 2-replica test is achieved by running two `DiLoCo` instances with two model
|
| 185 |
+
copies in the same process and a `mock_allreduce` that *averages* the two tensors
|
| 186 |
+
manually before returning. (Their `test_bucketization_correctness` does exactly this.)
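
What an averaging mock has to compute can be pinned down without torch. The sketch below is an assumption-laden illustration, not torchft's test code: `PairingAverager` is a hypothetical helper, and plain lists stand in for tensors. It pairs contributions by call index rather than by arrival order, so the average stays correct even if one replica issues all of its allreduce calls before the other starts.

```python
from collections import defaultdict


class PairingAverager:
    """Hypothetical helper: element-wise mean of contributions that share a call index."""

    def __init__(self, world_size):
        self.world_size = world_size
        # call index -> {replica id: list of values}
        self._slots = defaultdict(dict)

    def contribute(self, replica_id, call_index, values):
        """Record one replica's values for its call_index-th allreduce."""
        self._slots[call_index][replica_id] = list(values)

    def result(self, call_index):
        """Return the element-wise mean once every replica has contributed."""
        parts = self._slots[call_index]
        if len(parts) != self.world_size:
            raise RuntimeError(
                f"call {call_index}: have {len(parts)} of {self.world_size} replicas"
            )
        return [sum(col) / self.world_size for col in zip(*parts.values())]


avg = PairingAverager(world_size=2)
# Replica 0 issues both of its calls first, then replica 1 does.
avg.contribute(0, 0, [1.0, 3.0])
avg.contribute(0, 1, [10.0])
avg.contribute(1, 0, [3.0, 5.0])
avg.contribute(1, 1, [20.0])

print(avg.result(0))
print(avg.result(1))
```

Keying by call index is one design choice among several; what matters is that a tensor is only ever averaged with the other replica's tensor for the same parameter or bucket.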
|
| 187 |
+
|
| 188 |
+
For real distributed runs torchft uses Gloo or NCCL via `torchft.process_group`
|
| 189 |
+
(reconfigurable PGs that wrap `torch.distributed`). We do not need this for Spike 008.
|
| 190 |
+
|
| 191 |
+
### (4) Library, research codebase, or paper-companion?
|
| 192 |
+
|
| 193 |
+
**Library.** Strong evidence:
|
| 194 |
+
- Proper Python package layout (`torchft/__init__.py`, modules per concern).
|
| 195 |
+
- Real unit tests (`*_test.py` per module) — not "run this script" demos.
|
| 196 |
+
- BSD-3-Clause LICENSE (vs. diloco_simple having none, signaling "personal demo").
|
| 197 |
+
- Nightly PyPI distribution (`torchft-nightly`) with prebuilt wheels.
|
| 198 |
+
- Documentation site at https://pytorch.org/torchft.
|
| 199 |
+
- `meta-pytorch` org — Meta-internally maintained; lives next to `torchtitan`.
|
| 200 |
+
- README explicitly: *"torchft is designed to provide the primitives required to
|
| 201 |
+
implement fault tolerance in any application/train script"* — i.e. a building block.
|
| 202 |
+
|
| 203 |
+
Only friction: installing **from source** needs Rust (pyo3 + maturin) and
|
| 204 |
+
protobuf-compiler. This is for the Rust Lighthouse/process-group extension which we
|
| 205 |
+
**do not need** for Spike 008's mock-based tests. Two clean options:
|
| 206 |
+
- (a) `pip install torchft-nightly` — uses prebuilt wheel, no Rust toolchain needed.
|
| 207 |
+
- (b) Vendor `torchft/local_sgd.py` + the few helpers (`work.py::_DummyWork`,
|
| 208 |
+
type stubs for `Manager`) into our repo under BSD-3 attribution. ~700 LOC total.
|
| 209 |
+
|
| 210 |
+
### (5) Minimum viable test pattern for Spike 008
|
| 211 |
+
|
| 212 |
+
Goal: **2 replicas × 4 inner steps × 2 outer rounds on a tiny model**, single-process, no NCCL.
|
| 213 |
+
|
| 214 |
+
```python
# spikes/008-diloco-outer-loop/tests/test_diloco_two_replicas.py
"""
Spike 008: prove the DiLoCo outer-loop math is correct under our framework.
Runs entirely in a single process, no torch.distributed required.
"""
import copy
import torch
import torch.nn as nn
import torch.optim as optim
from unittest.mock import create_autospec

from torchft.local_sgd import DiLoCo
from torchft.manager import Manager
from torchft.work import _DummyWork


class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

    def forward(self, x):
        return self.net(x)


def _make_avg_manager(replica_buffer):
    """Manager whose allreduce averages tensors across replicas via shared buffer."""
    mgr = create_autospec(Manager)
    mgr._use_async_quorum = False
    mgr.errored.return_value = None
    mgr.should_commit.return_value = True
    mgr.current_step.return_value = 0

    def avg_allreduce(tensor, should_quantize=False):
        # Cross-replica average: stash and average against the other replica's tensor.
        # NOTE: this pairs consecutive calls, so it assumes the two replicas'
        # allreduce calls alternate one-for-one. If one replica issues all of its
        # calls first, tensors from the same replica get paired instead.
        replica_buffer.append(tensor.clone())
        if len(replica_buffer) == 2:
            mean = (replica_buffer[0] + replica_buffer[1]) / 2.0
            tensor.copy_(mean)
            replica_buffer.clear()
        return _DummyWork(tensor)

    mgr.allreduce.side_effect = avg_allreduce
    return mgr


def test_diloco_two_replicas_four_inner_two_outer():
    torch.manual_seed(0)
    model_a = TinyMLP()
    model_b = copy.deepcopy(model_a)  # identical init = same θ_initial

    # Inner optimizers (one per replica)
    inner_a = optim.AdamW(model_a.parameters(), lr=1e-3)
    inner_b = optim.AdamW(model_b.parameters(), lr=1e-3)
    # Outer Nesterov (one per replica, same hyperparams)
    outer_a = optim.SGD(model_a.parameters(), lr=0.7, momentum=0.9, nesterov=True)
    outer_b = optim.SGD(model_b.parameters(), lr=0.7, momentum=0.9, nesterov=True)

    # Shared buffer: both DiLoCo wrappers funnel through one "process group" of size 2
    buf = []
    mgr_a = _make_avg_manager(buf)
    mgr_b = _make_avg_manager(buf)

    SYNC_EVERY = 4  # 4 inner steps per outer round
    OUTER_ROUNDS = 2

    with DiLoCo(mgr_a, [model_a], inner_a, outer_a, sync_every=SYNC_EVERY) as dla, \
         DiLoCo(mgr_b, [model_b], inner_b, outer_b, sync_every=SYNC_EVERY) as dlb:
        # Snapshot θ_initial
        theta_initial_a = {n: p.detach().clone() for n, p in model_a.named_parameters()}

        for outer_round in range(OUTER_ROUNDS):
            for inner_step in range(SYNC_EVERY):
                # Replicas see DIFFERENT data: that is the whole point of DiLoCo
                x_a = torch.randn(8, 4) + 0.1 * outer_round
                x_b = torch.randn(8, 4) - 0.1 * outer_round
                y_a, y_b = torch.randn(8, 2), torch.randn(8, 2)

                inner_a.zero_grad(); inner_b.zero_grad()
                ((model_a(x_a) - y_a) ** 2).mean().backward()
                ((model_b(x_b) - y_b) ** 2).mean().backward()
                inner_a.step()  # Inner step. Sync fires automatically inside post-hook
                inner_b.step()  # at step % SYNC_EVERY == 0.

    # Assertions:
    # 1. Both replicas now hold IDENTICAL parameters (they were averaged via mock allreduce).
    for (na, pa), (nb, pb) in zip(model_a.named_parameters(), model_b.named_parameters()):
        torch.testing.assert_close(pa, pb, msg=f"Replicas diverged at {na}")

    # 2. Parameters changed from θ_initial (outer optimizer actually stepped).
    any_change = any(
        not torch.equal(p, theta_initial_a[n]) for n, p in model_a.named_parameters()
    )
    assert any_change, "outer optimizer did not move the parameters"

    # 3. The outer optimizer holds Nesterov momentum state for every parameter
    #    (proves the SGD(nesterov=True) actually ran).
    n_params = len(list(model_a.parameters()))
    assert len(outer_a.state_dict()["state"]) == n_params

    # 4. Sync fired once per outer round per replica.
    assert mgr_a.start_quorum.call_count == OUTER_ROUNDS
    assert mgr_b.start_quorum.call_count == OUTER_ROUNDS
```
|
| 315 |
+
|
| 316 |
+
**Why this works:**
|
| 317 |
+
- `DiLoCo` registers a post-step hook on `inner_optimizer` (see `__enter__`). The
|
| 318 |
+
hook increments `_local_step` and triggers `prepare_sync` / `perform_sync` on every
|
| 319 |
+
`sync_every` boundary — fully automatic, our test only calls `inner.step()`.
|
| 320 |
+
- `_DummyWork.wait()` is a no-op. `_average_grads` calls `manager.allreduce(...)`
|
| 321 |
+
which our `avg_allreduce` mocks to do real cross-replica averaging through `buf`.
|
| 322 |
+
- `manager.should_commit.return_value = True` lets the outer optimizer fire on each
|
| 323 |
+
outer round; setting it to `False` lets us also test rollback semantics.
|
| 324 |
+
- All single-process, so it runs under plain pytest. Place it under `spikes/008-diloco-outer-loop/tests/`, following the layout of `spikes/005-integrated-trainer-skeleton/tests/`.
|
| 326 |
+
|
| 327 |
+
**Install for this spike:** `pip install torchft-nightly` in the eidolon venv. If the
|
| 328 |
+
nightly wheel proves brittle, fallback: vendor `local_sgd.py` + `work.py` + a
|
| 329 |
+
minimal `manager.py` stub (≈800 LOC) into `framework/diloco/_vendored/` with BSD-3
|
| 330 |
+
attribution.
|
| 331 |
+
|
| 332 |
+
---
|
| 333 |
+
|
| 334 |
+
## Risks & Mitigations
|
| 335 |
+
|
| 336 |
+
| Risk | Likelihood | Mitigation |
|---|---|---|
| `torchft-nightly` wheel breaks against torch 2.x | Med | Pin to a specific nightly hash; or vendor `local_sgd.py` directly under BSD-3. |
| `torchft.manager.Manager` import pulls in Rust ext at import time | Low | The class is importable as a type; `MagicMock` replaces it. If import touches Rust, we vendor. Verified: the import in `local_sgd.py` is `from torchft.manager import Manager`, which is only used as a type annotation in our test path. |
| Sign convention of pseudogradient causes our outer optimizer to move the wrong way | Med | Assertion 2 in the test pattern above only checks "params moved from initial", not which way. A second test should compare the direction against a hand-computed expected value. |
| `fragment_sync_delay > 0` (true Streaming) requires CUDA streams | Med | Spike 008 starts with `fragment_sync_delay=0` (= vanilla DiLoCo). Streaming variant deferred to Spike 009 once basic loop works. |
| Requires `torch>=2.7` per pyproject | Low | Framework already on torch 2.x; check exact pin. If <2.7, we vendor. |
|
| 343 |
+
|
| 344 |
+
---
|
| 345 |
+
|
| 346 |
+
## Decision (for ADR-003)
|
| 347 |
+
|
| 348 |
+
Adopt **`torchft.local_sgd.DiLoCo`** as the reference DiLoCo / Streaming DiLoCo
|
| 349 |
+
implementation. Integrate via `pip install torchft-nightly` for Spike 008. If
|
| 350 |
+
brittleness emerges, vendor `local_sgd.py` (BSD-3) into `framework/diloco/_vendored/`.
|
| 351 |
+
|
| 352 |
+
For the framework's outer-loop optimizer abstraction (the actual ADR-003 question):
|
| 353 |
+
mirror torchft's `DiLoCo(manager, [model_fragments], inner_opt, outer_opt, sync_every)`
|
| 354 |
+
constructor shape so that swapping our wrapper for the upstream class is a one-line
|
| 355 |
+
change. Compute pseudogradient as `θ_local − θ_initial` (our convention) and negate
|
| 356 |
+
when handing to the outer optimizer, OR follow torchft's `θ_initial − θ_local`
|
| 357 |
+
convention end-to-end. **Pick one and document it loudly.**
|
|
@@ -0,0 +1,408 @@
|
| 1 |
+
# Modal Reconnaissance — Composer 2.5 Replication GPU Smoke
|
| 2 |
+
|
| 3 |
+
**Audience:** trainer integrator running a one-shot 30-minute verification smoke for the Composer 2.5 Replication Framework (`spikes/005-integrated-trainer-skeleton/`).
|
| 4 |
+
**Workload:** `Qwen/Qwen2.5-0.5B-Instruct` (≈ 1 GB fp16 weights), ~50 forward+backward steps, custom 3-channel loss = `GRPO + α·SDPO-KL + β·trace-replay-DPO`. Batch size ≤ 4 sequences ≤ 2048 tokens. **Goal: prove the loss runs end-to-end and capture mem + step time.** This is *not* training — it's a smoke.
|
| 5 |
+
**Cap:** $5. **Local hardware:** RTX 5090, 32 GB VRAM, Modal CLI already configured (`~/.modal.toml`).
|
| 6 |
+
**Bottom line up front:** *Run it locally on the 5090.* Modal is the wrong tool for this specific job. The skeleton + price math below is for future scale-out, not the smoke.
|
| 7 |
+
|
| 8 |
+
---
|
| 9 |
+
|
| 10 |
+
## 1. Recommended Modal GPU type & estimated cost
|
| 11 |
+
|
| 12 |
+
### 1.1 Pricing table (from primary source)
|
| 13 |
+
|
| 14 |
+
All values copied verbatim from <https://modal.com/pricing> (fetched for this report). Modal bills per **second** of compute, not per minute or hour.
|
| 15 |
+
|
| 16 |
+
| GPU | Modal `gpu=` string | $ / sec | $ / hour | VRAM | Verdict for this smoke |
|---|---|---|---|---|---|
| Nvidia T4 | `"T4"` | 0.000164 | 0.590 | 16 GB | Too small for safe headroom on 3 fwd passes |
| **Nvidia L4** | `"L4"` | **0.000222** | **0.799** | 24 GB | ✅ **Recommended**: cheapest GPU that fits comfortably |
| Nvidia A10 | `"A10"` | 0.000306 | 1.102 | 24 GB | Acceptable; ~38% pricier than L4 for marginal speedup at sub-1B |
| Nvidia L40S | `"L40S"` | 0.000542 | 1.951 | 48 GB | Overkill: Modal's default rec, but unjustified at 0.5B |
| Nvidia A100-40GB | `"A100-40GB"` | 0.000583 | 2.099 | 40 GB | Overkill |
| Nvidia A100-80GB | `"A100-80GB"` | 0.000694 | 2.498 | 80 GB | Overkill |
| Nvidia H100 | `"H100!"` | 0.001097 | 3.949 | 80 GB | Wasteful |
|
| 25 |
+
|
| 26 |
+
(`H100!` suffix = pin to H100, opt out of Modal's automatic H200 upgrade. See <https://modal.com/docs/guide/gpu#automatic-upgrades-to-h200s>.)
|
| 27 |
+
|
| 28 |
+
**Auxiliary costs** (also primary, same page):
|
| 29 |
+
- CPU: $0.0000131 / physical-core / sec → ~$0.047 / core-hour. Min 0.125 cores per container.
|
| 30 |
+
- RAM: $0.00000222 / GiB / sec → ~$0.008 / GiB-hour.
|
| 31 |
+
- Volumes: $0.09 / GiB / month (first 1 TiB / mo free on the workspace).
|
| 32 |
+
- Starter plan: **$30 / month free credits** — your smoke is free if you haven't burned the budget elsewhere.
|
| 33 |
+
|
| 34 |
+
### 1.2 Why L4, not A10 or A100-40GB
|
| 35 |
+
|
| 36 |
+
The skill mlops/modal-llm-training defaults to L4/A10 for "small smokes" and that holds here. The framing: *Qwen2.5-0.5B in fp16 is ~1 GB of weights. The 3-channel loss does ≥3 forward passes per step (student-grad, teacher no-grad for SDPO, chosen+rejected for DPO). Even with the teacher forward held in memory you are nowhere near 24 GB.*
|
| 37 |
+
|
| 38 |
+
Concrete VRAM math for the workload (back of envelope, batch=2, seq=1024, bf16):
|
| 39 |
+
- Weights: ~1.0 GB
|
| 40 |
+
- Optimizer state (AdamW, fp32 m+v): ~4 GB (8 bytes × 0.5B params)
|
| 41 |
+
- Gradients (bf16): ~1 GB
|
| 42 |
+
- Activations for student fwd at B=2,T=1024: ~1–2 GB
|
| 43 |
+
- Teacher fwd (no grad, no act save): ~0.3 GB
|
| 44 |
+
- DPO chosen+rejected fwds (with grad): ~2–3 GB
|
| 45 |
+
- HF transformers overhead, KV scratch, framework: ~2 GB
|
| 46 |
+
- **Subtotal: ~11–14 GB** — comfortably inside 24 GB on L4.
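
The back-of-envelope budget above can be re-added mechanically. This is a small sketch that only restates the listed estimates; the dictionary keys and variable names are illustrative, and the 0.5B parameter count is the rounded figure the list itself uses.

```python
# (low, high) estimates in GB, copied from the list above.
VRAM_GB = {
    "weights": (1.0, 1.0),
    "optimizer_state_adamw_fp32": (4.0, 4.0),
    "gradients_bf16": (1.0, 1.0),
    "student_activations": (1.0, 2.0),
    "teacher_forward_no_grad": (0.3, 0.3),
    "dpo_chosen_rejected": (2.0, 3.0),
    "framework_overhead": (2.0, 2.0),
}

low_gb = sum(lo for lo, _ in VRAM_GB.values())
high_gb = sum(hi for _, hi in VRAM_GB.values())

# Cross-check the optimizer line: two fp32 moments = 8 bytes per parameter.
params = 0.5e9
optimizer_gb = params * 8 / 1e9

L4_VRAM_GB = 24
print(f"estimated peak: {low_gb:.1f} to {high_gb:.1f} GB")
print(f"headroom on L4 at the high end: {L4_VRAM_GB - high_gb:.1f} GB")
```

Swapping in measured numbers from the smoke's own peak-memory capture would turn this from an estimate into a regression check.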
|
| 47 |
+
|
| 48 |
+
**A10 is also fine** but costs 38% more for ~30–50% extra throughput on a workload whose step time is already dominated by Python overhead (see §3). Pay the L4 rate.
|
| 49 |
+
|
| 50 |
+
**A100-40GB is wrong.** You're paying 2.6× the L4 rate for memory you don't use and FLOPS that, on a 0.5B model with bs=2, you can't saturate. The Modal docs explicitly warn against this: *"Before you jump for the most powerful (and so most expensive) GPU, make sure you understand where the bottlenecks are…"* (<https://modal.com/docs/guide/gpu#b200-gpus>).
|
| 51 |
+
|
| 52 |
+
**T4 declined** because: (a) only 16 GB VRAM — tight given 3 fwd passes; (b) old Turing arch lacks bf16 hardware; you'd be on fp16/fp32, which trips up `transformers` flash-attention paths and adds debug surface area on a smoke that's already debugging custom loss code.
|
| 53 |
+
|
| 54 |
+
### 1.3 Cost projection for the actual smoke
|
| 55 |
+
|
| 56 |
+
Assume a 30-min wall-clock budget that breaks down realistically as:
|
| 57 |
+
- Container cold-start + image pull: 30–90 s on the first run. Modal's container infrastructure boots in ~1 s, but an image with torch+transformers needs a one-time pull. See <https://modal.com/docs/guide/cold-start>.
|
| 58 |
+
- HF model download (Qwen 2.5-0.5B = ~1 GB shards) on first run: 15–45 s — **should be cached on a Modal Volume after run 1**.
|
| 59 |
+
- Setup inside fn (CUDA init, model.from_pretrained, optimizer build): 20–40 s.
|
| 60 |
+
- 50 training steps × ~2–4 s/step (3-channel loss, bs=2, seq=1024 on L4): **100–200 s**.
|
| 61 |
+
- Logging, save, exit: 5 s.
|
| 62 |
+
|
| 63 |
+
**Realistic total: ~3–7 minutes of GPU-billed time per run.**
|
| 64 |
+
|
| 65 |
+
Cost per run on L4:
|
| 66 |
+
- Lower bound (3 min): 180 s × $0.000222 = **$0.040**
|
| 67 |
+
- Upper bound (7 min): 420 s × $0.000222 = **$0.093**
|
| 68 |
+
- Plus CPU/RAM overhead (4 cores × 16 GB RAM): ~420 s × (4 × $0.0000131 + 16 × $0.00000222) = ~$0.037
|
| 69 |
+
|
| 70 |
+
**Per-run all-in: $0.08 – $0.13 on L4.** You can run the smoke ~50× before nudging the $5 cap. Comfortable.
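
The per-run range can be reproduced from the rates quoted in the bullets above. This is a sketch under one assumption that is inferred from the arithmetic rather than stated in the text: the $0.08 lower bound appears to add the full 7-minute CPU/RAM overhead to the 3-minute GPU cost. The function and constant names are illustrative.

```python
# Per-run cost on L4, using the per-second rates quoted in the bullets above.
GPU_L4_PER_S = 0.000222      # $/s for one L4
CPU_PER_CORE_S = 0.0000131   # $/s per CPU core
RAM_PER_GIB_S = 0.00000222   # $/s per GiB of RAM


def run_cost(gpu_seconds, overhead_seconds, cores=4, ram_gib=16):
    """Return (gpu_cost, cpu_ram_cost) in dollars for one run."""
    gpu = gpu_seconds * GPU_L4_PER_S
    cpu_ram = overhead_seconds * (cores * CPU_PER_CORE_S + ram_gib * RAM_PER_GIB_S)
    return gpu, cpu_ram


# Assumption: the 420 s CPU/RAM overhead is charged in both cases,
# which makes the lower bound deliberately conservative.
low = sum(run_cost(gpu_seconds=180, overhead_seconds=420))
high = sum(run_cost(gpu_seconds=420, overhead_seconds=420))
runs_under_cap = round(5.00 / 0.10)  # $5 cap at roughly $0.10 per run

print(f"per-run: ${low:.2f}-${high:.2f}, about {runs_under_cap} runs under the cap")
```

Charging CPU/RAM for only the 180 s of the short run would put the lower bound nearer $0.06, so the quoted range errs on the expensive side.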
|
| 71 |
+
|
| 72 |
+
For comparison, A10 same scenario: ~$0.11 – $0.18 per run. A100-40GB: ~$0.21 – $0.34. Still all under cap, but L4 is the rational pick.
|
| 73 |
+
|
| 74 |
+
### 1.4 Region & preemption multipliers (DON'T trip on these)
|
| 75 |
+
|
| 76 |
+
From the pricing-page footer:
|
| 77 |
+
- **Region selection: 1.5–1.75× base price.** Don't pin to a region unless you must.
|
| 78 |
+
- **Non-preemptible execution: 3× base price.** Default is preemptible; leave it. A 30-min smoke that gets preempted is fine; just retry. Opting into non-preemptible mode would push L4 from ~$0.80/hr to ~$2.40/hr and is unjustified.
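
To see what the multipliers do to the hourly bill, the per-second L4 rate used in §1.3 can be scaled directly. This is a sketch; `hourly_rate` is an illustrative helper, and each multiplier is applied on its own because the text does not say how they combine.

```python
# Effective hourly L4 price under the pricing-page multipliers quoted above.
L4_PER_S = 0.000222  # base $/s, as used in the cost projection


def hourly_rate(base_per_s, multiplier=1.0):
    """Convert a per-second base rate to $/hr after applying one multiplier."""
    return base_per_s * 3600 * multiplier


base = hourly_rate(L4_PER_S)
region_low = hourly_rate(L4_PER_S, 1.5)
region_high = hourly_rate(L4_PER_S, 1.75)
non_preemptible = hourly_rate(L4_PER_S, 3.0)

print(f"base ${base:.2f}/hr, region-pinned ${region_low:.2f}-${region_high:.2f}/hr, "
      f"non-preemptible ${non_preemptible:.2f}/hr")
```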
|
| 79 |
+
|
| 80 |
+
---
|
| 81 |
+
|
| 82 |
+
## 2. Minimal `modal_app.py` skeleton
|
| 83 |
+
|
| 84 |
+
This is the actual file to drop into the repo, e.g. at `spikes/005-integrated-trainer-skeleton/modal_app.py`. It is intentionally one file, with no abstraction, sized for the smoke. Image pins are conservative — match what the user is running locally to avoid version drift between local debugging and Modal runs.
|
| 85 |
+
|
| 86 |
+
```python
"""modal_app.py: GPU smoke for the Composer 2.5 Replication Framework.

Goal: run ~50 forward+backward steps of the 3-channel loss
(GRPO + SDPO-KL + trace-replay-DPO) against Qwen/Qwen2.5-0.5B-Instruct,
capture peak VRAM and per-step latency, and exit. Single L4, single container.

Run:   modal run modal_app.py
Logs:  the function's print() output streams to your terminal.
"""

from __future__ import annotations

import modal

# ---------------------------------------------------------------------------
# 1) Image
# ---------------------------------------------------------------------------
# Pin torch to a CUDA build that matches Modal's L4 driver (CUDA 12.x).
# Pin transformers/peft/trl to a known-good combination; the trainer skeleton
# was developed against transformers >= 4.45 and trl >= 0.12 for GRPOTrainer.
# If you bump any of these, re-verify GRPOTrainer._compute_loss is still the
# correct override hook (DeepWiki audit anchor: huggingface/trl).
image = (
    modal.Image.debian_slim(python_version="3.11")
    .apt_install("git")
    .pip_install(
        "torch==2.4.1",  # CUDA 12.1 wheel from PyPI default index
        "transformers==4.46.3",
        "accelerate==1.1.1",
        "peft==0.14.0",
        "trl==0.12.2",
        "datasets==3.1.0",
        # The hf_transfer extra is required because HF_HUB_ENABLE_HF_TRANSFER=1
        # is set below; without it, downloads fail.
        "huggingface_hub[hf_transfer]==0.26.5",
    )
    .env({
        # Force HF to use the mounted Volume for model + dataset cache.
        "HF_HOME": "/cache/hf",
        "TRANSFORMERS_CACHE": "/cache/hf",
        "HF_HUB_ENABLE_HF_TRANSFER": "1",  # parallel download for the model
        # Make Python flush prints immediately so we see step times live.
        "PYTHONUNBUFFERED": "1",
        # Avoid tokenizer fork warnings and thread contention in the smoke.
        "TOKENIZERS_PARALLELISM": "false",
    })
)

# ---------------------------------------------------------------------------
# 2) Persistent volume for HF cache (so model isn't re-downloaded each run)
# ---------------------------------------------------------------------------
# 1 GB of Qwen weights persists here. First run pays the download cost,
# every subsequent run reuses the volume. Below 1 TiB / mo: free.
hf_cache = modal.Volume.from_name("hf-cache-composer-smoke", create_if_missing=True)

# ---------------------------------------------------------------------------
# 3) App + secrets
# ---------------------------------------------------------------------------
app = modal.App("composer-replication-smoke")

# Optional: only needed if you switch to a gated model. Qwen2.5-0.5B is open.
# The named secret must already exist in your Modal workspace; if it does not,
# drop this line and the `secrets=` argument below.
hf_secret = modal.Secret.from_name("huggingface-token", required_keys=[])


# ---------------------------------------------------------------------------
# 4) The smoke function
# ---------------------------------------------------------------------------
@app.function(
    image=image,
    gpu="L4",                 # see §1: cheapest 24 GB option that fits
    cpu=4.0,                  # 4 cores is plenty for tokenization on a sub-1B
    memory=16 * 1024,         # 16 GiB RAM is plenty
    volumes={"/cache": hf_cache},
    timeout=60 * 30,          # hard 30-min cap matches the smoke spec
    secrets=[hf_secret],
    # NB: keep preemptible (default). Don't pay 3× to pin.
    # NB: don't pin region; the 1.5–1.75× tax is unjustified for a smoke.
)
def smoke():
    import time

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"

    print(f"[smoke] torch={torch.__version__} cuda={torch.version.cuda} "
          f"device={torch.cuda.get_device_name(0)} "
          f"vram={torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB")

    # -------------------------------------------------------------------
    # Load tokenizer + model in bf16, which L4 supports (Ada Lovelace).
    # -------------------------------------------------------------------
    t0 = time.perf_counter()
    tok = AutoTokenizer.from_pretrained(MODEL_ID, cache_dir="/cache/hf")
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        cache_dir="/cache/hf",
        torch_dtype=torch.bfloat16,
        device_map="cuda:0",
    )
    model.train()
    print(f"[smoke] model load: {time.perf_counter()-t0:.1f}s "
          f"params={sum(p.numel() for p in model.parameters())/1e6:.1f}M")

    # -------------------------------------------------------------------
    # 50-step verification loop.
    #
    # NOTE: this stub uses a synthetic batch (a single forward+backward
    # against an LM-head loss), *not* the full 3-channel loss. The point
    # is to (a) verify the Modal harness, (b) measure the per-step time
    # of a vanilla AutoModelForCausalLM step on this GPU as a baseline.
    #
    # Replace the body of the for-loop with the actual ComposerReplicationTrainer
    # `_compute_loss` call once data_collator outputs are stubbed/mocked.
    # See: spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py
    # -------------------------------------------------------------------
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)

    # Synthetic batch: bs=2, seq=1024, matching the realistic smoke shape.
    B, T = 2, 1024
    input_ids = torch.randint(0, tok.vocab_size, (B, T), device="cuda:0")
    labels = input_ids.clone()

    torch.cuda.reset_peak_memory_stats()
    step_times = []
    for step in range(50):
        t = time.perf_counter()
        out = model(input_ids=input_ids, labels=labels)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        torch.cuda.synchronize()  # wait for the GPU before reading the clock
        dt = time.perf_counter() - t
        step_times.append(dt)
        if step % 10 == 0:
            print(f"[smoke] step {step:>3d} loss={out.loss.item():.4f} "
                  f"dt={dt*1000:.1f}ms peak_vram={torch.cuda.max_memory_allocated()/1e9:.2f}GB")

    # -------------------------------------------------------------------
    # Final report.
    # -------------------------------------------------------------------
    median_ms = sorted(step_times)[len(step_times)//2] * 1000
    p95_ms = sorted(step_times)[int(len(step_times)*0.95)] * 1000
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"\n[smoke] DONE. median_step={median_ms:.1f}ms p95={p95_ms:.1f}ms "
          f"peak_vram={peak_gb:.2f}GB total_time={sum(step_times):.1f}s")

    # Persist cache for the next run.
    hf_cache.commit()


@app.local_entrypoint()
def main():
    smoke.remote()
```
|
| 239 |
+
|
| 240 |
+
### 2.1 What's deliberately *not* in the skeleton
|
| 241 |
+
|
| 242 |
+
- **No `flash-attn` install.** The `flash-attn` wheel build is a notorious time sink on Modal images (compiles against the CUDA toolkit). For a 0.5B smoke, SDPA (PyTorch's built-in scaled-dot-product attention) is fine and is on by default in transformers ≥ 4.45.
|
| 243 |
+
- **No `bitsandbytes`, no `unsloth`, no `xformers`.** All add build complexity. None give you anything on a smoke.
|
| 244 |
+
- **No DeepSpeed, no FSDP, no `accelerate launch`.** This is single-GPU; `accelerate` is in the image only because `trl` imports it. We don't invoke it.
|
| 245 |
+
- **No web endpoint, no `@app.cls`, no `enter` method.** A `@app.function()` with no warm-up is correct for a one-shot smoke. `enter`/lifecycle methods are for serving and amortizing model load across many calls — not relevant when you call once.
|
| 246 |
+
- **No `min_containers` or `buffer_containers`.** Those are warm-pool knobs for serving — they cost money. Default scale-from-zero is right.
|
| 247 |
+
- **No `Image.from_registry`.** `debian_slim` + `pip_install` is faster than pulling a CUDA base image when you don't need a custom CUDA toolkit.
|
| 248 |
+
|
| 249 |
+
### 2.2 What you do need to add when you wire the real loss
|
| 250 |
+
|
| 251 |
+
Replace the synthetic `for step in range(50)` body with:
|
| 252 |
+
|
| 253 |
+
```python
|
| 254 |
+
from data_collator import ComposerDataCollator # spike 005 path
|
| 255 |
+
from trl_path.composer_trainer import ComposerReplicationTrainer
|
| 256 |
+
# ...
|
| 257 |
+
# Build a small fixed dataset of (prompt, response, hint, dpo_pair) tuples
|
| 258 |
+
# inline in the smoke (10–20 examples). Don't pull a real RL rollout — the
|
| 259 |
+
# point is to verify the loss path, not the rollout path.
|
| 260 |
+
```
|
| 261 |
+
|
| 262 |
+
The smoke does **not** need a real rollout/sampling phase. Stub `inputs` with the keys `_compute_sdpo_loss` and `_compute_trace_replay_loss` consume (`ctx_teacher_input_ids`, `dpo_chosen_input_ids`, `dpo_chosen_response_mask`, `dpo_chosen_ref_logprobs`, `sdpo_loss_mask`, …) using fixed tensors. That's the real verification — does the 3-channel loss compute and back-propagate without shape errors. The trainer skeleton's logging will tell you per-channel values.
|
| 263 |
+
|
| 264 |
+
---
|
| 265 |
+
|
| 266 |
+
## 3. Gotchas that bite *this specific workload*
|
| 267 |
+
|
| 268 |
+
The Modal docs and the `mlops/modal-llm-training` skill cover ~30 lessons aimed at 7B–30B training. Most of them don't apply here. The ones that do:
|
| 269 |
+
|
| 270 |
+
### 3.1 The teacher forward in SDPO doubles your effective batch memory — but only briefly
|
| 271 |
+
|
| 272 |
+
`ComposerReplicationTrainer._compute_sdpo_loss` does this (composer_trainer.py L138–143):
|
| 273 |
+
|
| 274 |
+
```python
|
| 275 |
+
student_logits = model(input_ids=inputs["input_ids"]).logits # with grad
|
| 276 |
+
with torch.no_grad():
|
| 277 |
+
    teacher_logits = model(input_ids=inputs["ctx_teacher_input_ids"]).logits
|
| 278 |
+
```
|
| 279 |
+
|
| 280 |
+
Two issues:
|
| 281 |
+
|
| 282 |
+
1. **Both logits tensors are held simultaneously** in `_compute_sdpo_loss` — they're handed to `generalized_jsd_loss` which keeps them alive for the JSD math. For Qwen 2.5-0.5B (vocab = 151,936), one logits tensor at B=2,T=1024 in bf16 is `2 * 1024 * 151936 * 2 bytes ≈ 622 MB`. Two of them = ~1.2 GB. **Negligible on a 24 GB L4** but worth noting because logits are surprisingly fat for the Qwen vocab.
|
| 283 |
+
2. **Use the `top_k` arg in `generalized_jsd_loss`** if you ever want to scale this up. The docstring (`opsd_loss.py` L54) explicitly recommends it: *"top_k: restrict KL to top-k tokens of the teacher distribution. Saves compute on large vocabularies (Qwen3 vocab = 152K)."* On the smoke, leave it `None` to verify the unrestricted path; flip it on for real training.
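
The logits-size estimate in item 1 is plain arithmetic and easy to re-check for other shapes. In this sketch `logits_bytes` is an illustrative helper, not part of the trainer.

```python
# Size of one full-vocabulary logits tensor for the smoke shape.
B, T = 2, 1024
VOCAB = 151_936       # Qwen 2.5 vocabulary size quoted above
BYTES_PER_BF16 = 2


def logits_bytes(batch, seq_len, vocab, bytes_per_elem=BYTES_PER_BF16):
    """Bytes occupied by a (batch, seq_len, vocab) logits tensor."""
    return batch * seq_len * vocab * bytes_per_elem


one = logits_bytes(B, T, VOCAB)
both = 2 * one  # student + teacher logits alive at the same time

print(f"one tensor: {one / 1e6:.0f} MB, student + teacher: {both / 1e9:.2f} GB")
```

Because the size scales linearly in both `B` and `T`, the same helper shows how quickly the two tensors grow if the batch or sequence length is raised for real training.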
|
| 284 |
+
|
| 285 |
+
### 3.2 The DPO channel does TWO more grad'd forwards per step
|
| 286 |
+
|
| 287 |
+
`_compute_trace_replay_loss` (composer_trainer.py L191–198) calls `_sequence_logprobs(model, dpo_chosen_…)` and `_sequence_logprobs(model, dpo_rejected_…)`. Both are with-grad. So each training step is:
|
| 288 |
+
|
| 289 |
+
| Forward | Grad? | Notes |
|
| 290 |
+
|---------|-------|-------|
|
| 291 |
+
| `super()._compute_loss` (GRPO) | yes | parent's standard fwd |
|
| 292 |
+
| Student in SDPO | yes | only when alpha_sdpo ≠ 0 |
|
| 293 |
+
| Teacher in SDPO | no | hint-conditioned context |
|
| 294 |
+
| DPO chosen | yes | only when beta_replay ≠ 0 |
|
| 295 |
+
| DPO rejected | yes | only when beta_replay ≠ 0 |
|
| 296 |
+
|
| 297 |
+
**That's up to 4 grad'd forwards before the backward.** PyTorch will hold activations for all of them in the autograd graph until `.backward()` runs. For the smoke this is fine (0.5B × 4 act tapes ~ 4 GB at B=2,T=1024) but for any real training run on a larger model: **enable gradient checkpointing or run the SDPO/DPO channels in alternating steps** rather than every step.
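
The table can be read as a small counting rule. The sketch below encodes it under the assumption that the SDPO teacher forward runs only when the SDPO channel is active; `forwards_per_step` is an illustrative helper, not a trainer method.

```python
def forwards_per_step(alpha_sdpo, beta_replay):
    """Return (grad_forwards, no_grad_forwards) for one training step."""
    grad, no_grad = 1, 0       # the GRPO parent forward always runs
    if alpha_sdpo != 0:
        grad += 1              # SDPO student forward (with grad)
        no_grad += 1           # SDPO teacher forward (no grad); assumed tied to alpha_sdpo
    if beta_replay != 0:
        grad += 2              # DPO chosen + DPO rejected (both with grad)
    return grad, no_grad


for alpha, beta in [(0.1, 0.05), (0.1, 0.0), (0.0, 0.0)]:
    grad, no_grad = forwards_per_step(alpha, beta)
    print(f"alpha_sdpo={alpha}, beta_replay={beta}: {grad} grad'd, {no_grad} no-grad")
```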
|
| 298 |
+
|
| 299 |
+
For the smoke specifically: **set `alpha_sdpo=0.1` and `beta_replay=0.05`** (the trainer defaults) and verify that peak memory stays below 16 GB. If it does not, suspect a bug in the data collator producing too-long sequences.
|
| 300 |
+
|
| 301 |
+
### 3.3 `requires_grad=True` on the zero-tensor short-circuit is a footgun
|
| 302 |
+
|
| 303 |
+
In `composer_trainer.py` L136 and L155, when SDPO is short-circuited:
|
| 304 |
+
|
| 305 |
+
```python
|
| 306 |
+
return torch.tensor(0.0, device=_device_of(model), requires_grad=True)
|
| 307 |
+
```
|
| 308 |
+
|
| 309 |
+
This is **not in the model's autograd graph**: it's a leaf tensor with `requires_grad=True` but no parent op. When you sum it into `total = grpo_loss + alpha * sdpo_kl + beta * replay_dpo`, the `0.0` contributes nothing to the parameter gradients and doesn't break things, *but* on a step where ALL three channels short-circuit (e.g., a smoke step with no error sites and no DPO pairs), `total.backward()` still succeeds because the leaf itself requires grad, while no model parameter receives a gradient and `optimizer.step()` silently does nothing. (Without `requires_grad=True` on the zero, the same step would instead fail with `RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn`.) **The smoke will hit this** if your synthetic batch lacks `ctx_teacher_input_ids` and `dpo_chosen_input_ids`.
|
| 310 |
+
|
| 311 |
+
Fix in the smoke: ensure the synthetic batch includes at least one `ctx_teacher_input_ids` row (it can be a copy of `input_ids` to keep things trivial) so SDPO doesn't short-circuit on every step.
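
A cheap guard in the smoke can make the all-channels-short-circuited case fail loudly before any loss is computed. This is a sketch: the two key names come from the text above, while the helper names and the assumption that a missing or empty key is what triggers the short-circuit are illustrative.

```python
# Keys whose presence is assumed to keep each optional channel from short-circuiting.
CHANNEL_TRIGGER_KEYS = {
    "sdpo": "ctx_teacher_input_ids",
    "replay_dpo": "dpo_chosen_input_ids",
}


def active_channels(batch):
    """Names of optional channels whose trigger key is present and non-empty."""
    return [
        name
        for name, key in CHANNEL_TRIGGER_KEYS.items()
        if batch.get(key) is not None and len(batch[key]) > 0
    ]


def check_batch(batch):
    """Raise if every optional channel would short-circuit on this batch."""
    active = active_channels(batch)
    if not active:
        missing = ", ".join(CHANNEL_TRIGGER_KEYS.values())
        raise ValueError(f"synthetic batch has none of: {missing}")
    return active


# Plain lists stand in for tensors here; len() behaves the same on a tensor's first dim.
batch = {"input_ids": [[1, 2, 3]], "ctx_teacher_input_ids": [[1, 2, 3]]}
print(check_batch(batch))
```

Calling `check_batch` once on the synthetic batch, before the training loop, is enough for the smoke.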
|
| 312 |
+
|
| 313 |
+
### 3.4 `torch.cuda.synchronize()` before timing reads
|
| 314 |
+
|
| 315 |
+
If you don't `torch.cuda.synchronize()` before reading `time.perf_counter()` you'll measure CPU dispatch time, not GPU step time. The skeleton above includes it. The Modal runtime doesn't change this — same rule as local.
|
| 316 |
+
|
| 317 |
+
### 3.5 The HF cache Volume must be `commit()`ed
|
| 318 |
+
|
| 319 |
+
From <https://modal.com/docs/guide/volumes>: *Volume writes are not persisted across runs unless you call `.commit()` (or use `volume.batch_upload`).* The skeleton calls `hf_cache.commit()` at the end. If you forget, run 2 will re-download the model. This is the only "Modal-flavored" gotcha that bites a smoke.
|
| 320 |
+
|
| 321 |
+
### 3.6 What does NOT bite
|
| 322 |
+
|
| 323 |
+
These are the lessons from `mlops/modal-llm-training` that are **not relevant to a 0.5B smoke** — don't waste mental cycles on them:
|
| 324 |
+
|
| 325 |
+
- ❌ FSDP / DeepSpeed sharding setup. Single GPU.
|
| 326 |
+
- ❌ `accelerate launch` / multi-process distributed. Single GPU.
|
| 327 |
+
- ❌ Flash-attention version pinning vs torch version. SDPA is fine for 0.5B.
|
| 328 |
+
- ❌ Tensor parallelism / sequence parallelism. Single GPU.
|
| 329 |
+
- ❌ Multi-node clusters. Single node.
|
| 330 |
+
- ❌ Memory snapshotting (`enable_memory_snapshot=True`). It's a 30-min one-shot. The cold-start penalty is ~30 s on a smoke that runs for 5 min — 10% overhead, not worth the snapshot setup time.
|
| 331 |
+
- ❌ Region pinning for data locality. The whole input is `from_pretrained`, served by HF — Modal's default region is fine.
|
| 332 |
+
- ❌ Custom CUDA install (`Image.from_registry("nvidia/cuda:…")`). The pre-built torch wheel ships its own CUDA.
|
| 333 |
+
|
| 334 |
+
---
|
| 335 |
+
|
| 336 |
+
## 4. Decision rule: Modal vs the local 5090
|
| 337 |
+
|
| 338 |
+
### 4.1 The numbers
|
| 339 |
+
|
| 340 |
+
**Local 5090** (32 GB VRAM, Blackwell, ~1.6 PFLOPS bf16):
|
| 341 |
+
- Step time for Qwen-0.5B at B=2, T=1024, 3-channel loss (≈4 grad'd fwds + bwd): expect **~150–400 ms per step** based on parameter-count + Blackwell's bf16 throughput. Call it 300 ms.
|
| 342 |
+
- 50 steps: **~15 seconds of pure compute**.
|
| 343 |
+
- Plus model load (one-time, from local HF cache): ~5 seconds.
|
| 344 |
+
- Plus data collator setup: ~3 seconds.
|
| 345 |
+
- **Wall clock: ~25–40 seconds.**
|
| 346 |
+
- **Cost: $0** (electricity ignored — the 5090 draws ~600 W under load × 40 s = 6.7 Wh ≈ $0.001).
|
| 347 |
+
|
| 348 |
+
**Modal L4** (24 GB VRAM, Ada Lovelace, ~0.12 PFLOPS bf16):
|
| 349 |
+
- Step time for the same workload on L4: **~1.5–4 s per step.** (L4 is roughly 13× lower bf16 throughput than 5090, but the workload at B=2 won't saturate the 5090, so realistic gap is ~5–10×.) Call it 2 s.
|
| 350 |
+
- 50 steps: **~100 seconds of pure compute**.
|
| 351 |
+
- Plus container cold start, image pull, model download (cached after run 1), CUDA init: **30–90 s on first run, 20–40 s afterward**.
|
| 352 |
+
- **Wall clock: ~3–5 minutes per run (worst case 7 min on a cold first run).**
|
| 353 |
+
- **Cost: $0.08–$0.13 per run.**
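
The headline numbers in both lists follow from the assumed step times. This sketch re-derives them; the step times are the "call it" estimates from the bullets above, not measurements.

```python
STEPS = 50


def compute_seconds(step_s, steps=STEPS):
    """Pure compute time for the smoke at a given per-step latency."""
    return step_s * steps


local_compute = compute_seconds(0.3)   # 5090: "call it 300 ms"
modal_compute = compute_seconds(2.0)   # L4: "call it 2 s"
local_floor = local_compute + 5 + 3    # + model load + data collator setup
energy_wh = 600 * 40 / 3600            # 600 W for 40 s, in watt-hours

print(f"local compute {local_compute:.0f} s (floor {local_floor:.0f} s), "
      f"Modal compute {modal_compute:.0f} s, "
      f"compute gap {modal_compute / local_compute:.1f}x, energy {energy_wh:.1f} Wh")
```

The compute-only gap of about 6.7× sits inside the 5–10× range quoted for the L4; cold start and image pull come on top of that on the Modal side.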
|
| 354 |
+
|
| 355 |
+
### 4.2 The decision rule
|
| 356 |
+
|
| 357 |
+
> **For this specific 30-min smoke: run on the 5090. Do not use Modal.**
|
| 358 |
+
|
| 359 |
+
Reasoning:
|
| 360 |
+
|
| 361 |
+
1. **Latency:** the 5090 finishes the smoke in ~30 s. Modal's L4 needs ~5 minutes including cold start. That's a **10× iteration penalty** on a workload where the entire point is iterate-and-fix-the-shape-error cycles. Every minute spent waiting on Modal is a minute in which the user could have run the smoke twice locally.
|
| 362 |
+
2. **Memory headroom:** the 5090's 32 GB is **larger** than the L4's 24 GB. There is no memory motivation to leave the local box.
|
| 363 |
+
3. **Network friction:** every Modal run requires `modal run`, syncing local code, waiting for the image, and watching logs. Local is `python local_smoke.py` (the file described in §4.3), or just import-and-run in a notebook.
|
| 364 |
+
4. **Cost asymmetry vs. iteration cost:** $0.10/run is not the issue. The issue is **30 minutes of attention spent on Modal infra is 30 minutes not spent debugging the loss**.
|
| 365 |
+
5. **The framework hasn't been verified to run end-to-end yet.** The first hundred bugs you'll find are local Python issues — wrong tensor shapes, missing keys in the collator, the `requires_grad=True` zero-tensor footgun (§3.3), TRL version mismatches. Debugging those over a Modal round-trip is masochism.
|
| 366 |
+
|
| 367 |
+
**When Modal becomes correct:**
|
| 368 |
+
|
| 369 |
+
| Scenario | Modal? | Why |
|
| 370 |
+
|----------|--------|-----|
|
| 371 |
+
| 30-min smoke on 0.5B (this task) | **No** | 5090 wins on every dimension |
|
| 372 |
+
| Sweep alpha_sdpo, beta_replay across 8 configs in parallel | **Yes** | 8× Modal containers in parallel beats 8 sequential runs on one 5090 |
|
| 373 |
+
| Scale to Qwen2.5-7B (real training) | **Yes** | 7B needs >32 GB for grad+optimizer, so 5090 is out; you want A100-80GB or H100 |
|
| 374 |
+
| Scale to multi-node (40B+) | **Yes (with caveats)** | Modal multi-node is in beta — see <https://modal.com/docs/guide/multi-node-training> |
|
| 375 |
+
| 24/7 inference of trained model | **Maybe** | Depends on QPS; Modal serverless wins for spiky, loses for steady |
|
| 376 |
+
|
| 377 |
+
### 4.3 Recommended workflow
|
| 378 |
+
|
| 379 |
+
1. **Write the smoke as `local_smoke.py`** that runs on the 5090. Same body as `modal_app.py`'s `smoke()` function, minus the `@app.function` decorator. Iterate there until 50 steps run cleanly.
|
| 380 |
+
2. **Then** drop the body into `modal_app.py` (the skeleton in §2). The Modal version's value is to verify "does it run on cloud Linux without local dotfile interference" and to baseline L4 step-time vs the 5090. That's a one-shot validation, not a development loop.
|
| 381 |
+
3. **For the real training run** (an actual training run, not a smoke), start with A100-40GB on Modal (or H100 if you've got the credits). At the L4 step time of ~2 s, 10,000 steps would take 2 s × 10,000 = ~5.5 hours, which is tolerable for a one-off but painful for a run you expect to repeat.
|
| 382 |
+
|
| 383 |
+
---
|
| 384 |
+
|
| 385 |
+
## 5. References
|
| 386 |
+
|
| 387 |
+
All claims in this document are sourced from:
|
| 388 |
+
|
| 389 |
+
- **Pricing**: <https://modal.com/pricing> (canonical; updated regularly by Modal — re-fetch if cost-sensitive). Per-second numbers in §1.1 captured from this page at report-write time.
|
| 390 |
+
- **GPU naming**: <https://modal.com/docs/guide/gpu> — confirms `gpu="L4"`, `gpu="A10"` (not `"A10G"`), `gpu="A100-40GB"`, `gpu="H100!"` syntax.
|
| 391 |
+
- **Cold starts**: <https://modal.com/docs/guide/cold-start> — "Containers boot in about one second" + the warm-up period is image pull + global imports + `enter` methods.
|
| 392 |
+
- **Volumes**: <https://modal.com/docs/guide/volumes> — `commit()` semantics for HF cache persistence.
|
| 393 |
+
- **Region/preemption multipliers**: pricing page footer + <https://modal.com/docs/guide/preemption>.
|
| 394 |
+
- **Multi-node beta**: <https://modal.com/docs/guide/multi-node-training>.
|
| 395 |
+
- **Examples (for `Image.pip_install` patterns)**: <https://github.com/modal-labs/modal-examples> — see `06_gpu_and_ml/llm-finetuning/` for similar 0.5B/3B finetune patterns.
|
| 396 |
+
- **TRL `GRPOTrainer._compute_loss` extension point**: verified in `composer_trainer.py` header comment ("DeepWiki audit of huggingface/trl, 2026-05-25"). Confirmed `super()._compute_loss(model, inputs)` works as the framework's parent-call.
|
| 397 |
+
- **Local trainer code reviewed**:
|
| 398 |
+
- `/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py`
|
| 399 |
+
- `/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/opsd_loss.py`
|
| 400 |
+
|
| 401 |
+
---
|
| 402 |
+
|
| 403 |
+
## 6. TL;DR
|
| 404 |
+
|
| 405 |
+
1. **GPU: L4. Cost: ~$0.10/run. Total budget burn: ~50× re-runs before the $5 cap.** Don't pay for A10, A100, or H100 on a 0.5B smoke.
|
| 406 |
+
2. **Skeleton: §2** — `gpu="L4"`, 4 cores, 16 GB RAM, 30-min timeout, persistent HF cache Volume, default preemption, no region pin.
|
| 407 |
+
3. **Workload-specific gotchas: §3** — 3-channel loss does up to 4 grad'd forwards/step (memory headroom check), the zero-tensor `requires_grad=True` short-circuit can break `backward()`, and `volume.commit()` is mandatory.
|
| 408 |
+
4. **Decision: run on the 5090, not Modal.** 5090 finishes the smoke in ~30 s vs Modal's ~5 min including cold start, with $0 marginal cost and 10× faster iteration. Reserve Modal for parameter sweeps and 7B+ training.
|
|
@@ -0,0 +1,403 @@
|
| 1 |
+
# TRACE_SOURCE_RECONNAISSANCE.md
|
| 2 |
+
|
| 3 |
+
Spike 007 trace-source audit, feeding ADR-002.
|
| 4 |
+
|
| 5 |
+
Status: **DECIDED** — recommend **(a) Claude Code session JSONL** (`~/.claude/projects/<encoded>/<sessionId>.jsonl`).
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
## 0. TL;DR
|
| 10 |
+
|
| 11 |
+
Of the six candidates audited, Claude Code session JSONL wins on every axis except "official Anthropic-published schema" (no such doc exists), and for that single weakness there is now a community-maintained reverse-engineered JSON Schema validated against ~50,000 messages from real sessions, plus three independent third-party schema specs. The user has **1,015 .jsonl sessions on this machine** today; the eight largest sampled span 550 → 17,315 lines and contain **6,762 multi-turn `tool_use` messages**. Acquisition cost is zero. Licensing is clean: the JSONL files are local user-owned data; the proprietary Claude Code binary is not redistributed by us.
|
| 12 |
+
|
| 13 |
+
The runners-up — OpenHands (well-documented but acquisition is non-trivial), SWE-bench trajectory submissions (heterogeneous schemas across submitters), Aider markdown chat history (lossy / unparseable for tool calls), and Cline (no public stable export format) — each lose on at least one of the four axes.
|
| 14 |
+
|
| 15 |
+
---
|
| 16 |
+
|
| 17 |
+
## 1. Context: TraceExample dataclass field reality
|
| 18 |
+
|
| 19 |
+
**Important correction to the parent task description.** The task brief said "TraceExample dataclass with fields state_text, action_taken, hint_text (optional), reward (float), teacher_id (str)". Reading the actual file at
|
| 20 |
+
`/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/teacher_replay.py` shows the existing types are different — there is no `TraceExample` class. The closest existing types are two `TypedDict`s used by `replay_trace()` and `extract_dpo_pairs()`:
|
| 21 |
+
|
| 22 |
+
```python
from typing import TypedDict


class TraceState(TypedDict):
    state_id: str          # unique within the trace
    messages: list[dict]   # conversation up to and including this step's user prompt
    student_action: str    # what the student actually did at this step


class DPOPair(TypedDict):
    state_id: str
    state_messages: list[dict]
    chosen: str            # teacher-consensus action
    rejected: str          # student action
    n_teachers_agreeing: int
```
|
| 35 |
+
|
| 36 |
+
The mapping sketch in §6 below targets `TraceState` (the *input* to teacher replay), since that is the type a `TraceIngester` is upstream of. If Spike 007 also wants a unified `TraceExample` per the brief, the natural shape is `TraceState` ∪ `{teacher_id: str | None, reward: float | None, hint_text: str | None}` — flagged for ADR-002 to settle.
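
For concreteness, the "`TraceState` ∪ extra fields" shape could look like the following. This is only a sketch for ADR-002 to accept or reject: `TraceExample` does not exist in the repo today, and `TraceState` is re-declared here solely to keep the snippet self-contained.

```python
from typing import Optional, TypedDict


class TraceState(TypedDict):
    state_id: str
    messages: list[dict]
    student_action: str


class TraceExample(TraceState):
    """Hypothetical unified record: TraceState plus the three fields from the brief."""
    teacher_id: Optional[str]
    reward: Optional[float]
    hint_text: Optional[str]


example: TraceExample = {
    "state_id": "s-0001",
    "messages": [{"role": "user", "content": "fix the failing test"}],
    "student_action": "ran pytest",
    "teacher_id": None,   # unknown until teacher replay has run
    "reward": None,
    "hint_text": None,
}
print(sorted(example))
```

Keeping the three extra fields nullable lets an ingester emit records before teacher replay has produced a teacher, reward, or hint.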
|
| 37 |
+
|
| 38 |
+
---
|
| 39 |
+
|
| 40 |
+
## 2. Candidate audit summary

Scoring legend: `+` good, `~` mixed, `-` bad, on each of the four required axes.

| # | Candidate | Schema documented | Real ≥5 multi-turn traces | Hint-receptive signal density | License OK | Verdict |
|---|---|---|---|---|---|---|
| **a** | **Claude Code JSONL** (`~/.claude/projects/`) | `~` Anthropic publishes high-level format note; community schemas are detailed and validated | **+** 1,015 local sessions, 5+ trivially | **+** Per-step `assistant.message.content[].tool_use` blocks → discrete actions, ideal teacher-correction sites | **+** User-owned local files; framework MIT | **CHOSEN** |
| b | Cline VS Code extension | `-` No published stable export schema | `~` Requires running Cline + manual export | `~` Plausible if exported but unverified | `~` Cline source Apache-2.0 but trace format isn't a stable contract | reject |
| c | OpenHands trajectories | **+** Well-documented (events/, base_state.json, Pydantic Event models) | `-` Need to *run* OpenHands or download eval traces, so not zero-cost | **+** ActionEvent/ObservationEvent split is conceptually ideal | **+** OpenHands MIT-licensed | strong runner-up |
| d | Aider chat history | `~` Format is "markdown, level-4 headings for user input", which is fragile | `~` Available if Aider was used | `-` Tool calls are flattened into prose; recovering structured actions is lossy | `+` Aider Apache-2.0 | reject |
| e | SWE-bench / Lite leaderboard `trajs/` | `-` Each submitter chooses a free-form text format (md/json/yaml) | **+** ~hundreds of submissions on github.com/swe-bench/experiments | `~` Heterogeneous; structured ones (e.g. mini-swe-agent `.traj.json`) are good, others are essentially logs | **+** Public submissions with usage rights for research | reject as primary; usable as future cross-validation set |
| f | SWE-smith-trajectories on HF | **+** Standard OpenAI messages format, documented per dataset card | **+** 5,017 trajectories, 76,002 rows, public | **+** Single-attempt per-instance SWE-agent runs | **+** Apache-2.0 dataset license | strong runner-up; **complement, not replacement** |

The (f) row was discovered during audit (the parent task allowed "any other public source you find that is better"). It's a strong candidate but answers a *different* question: SWE-smith trajectories give us reproducible benchmark traces; Claude Code JSONL gives us *the user's actual workflow*. For Spike 007's purpose (verify the teacher-replay path works on a real, signal-dense trace at zero acquisition cost), (a) is the right primary; (f) is queued for a later cross-validation phase.

---

## 3. Chosen format spec: Claude Code session JSONL

### 3.1 Location and naming

- **Root**: `~/.claude/projects/` (overridable via `CLAUDE_CONFIG_DIR`).
  Source: <https://code.claude.com/docs/en/sessions> ("Transcripts are stored as JSONL at `~/.claude/projects/<encoded-cwd>/<sessionId>.jsonl`").
- **Project-key encoding**: the absolute working-directory path with `/`, `\`, and `:` each replaced by `-`, with a leading `-`. (A hidden directory's leading dot also becomes `-`, which produces a double dash, as in `VIGOR--overstory` in the §4 paths.)
  Source: <https://github.com/jamie-bitflight/claude_skills/blob/main/plugins/agentskill-kaizen/skills/transcript-analysis/references/session-log-schema.md> §"Project key encoding".
- **File**: `<sessionId>.jsonl`. Subagent transcripts are `agent-<agentId>.jsonl`; a `SessionReader` should *skip* files starting with `agent-` when listing main sessions.
  Source: same `claude_skills` doc, §"Subagent File Location".
- **Encoding**: UTF-8, newline-delimited JSON. One JSON object per line, with no `[`/`]` wrapping. Local cleanup defaults to 30 days, configurable via `cleanupPeriodDays` in `~/.claude/settings.json`.
  Source: <https://code.claude.com/docs/en/data-usage> ("Local caching: Claude Code clients store session transcripts locally in plaintext under `~/.claude/projects/` for 30 days by default to enable session resumption.")

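The naming rules above can be sketched as two small helpers. This is a minimal illustration, assuming the substitutions listed in §3.1 are the only ones applied; `encode_project_key` and `list_main_sessions` are hypothetical names, not part of Claude Code or of spike-005.

```python
import re
from pathlib import Path


def encode_project_key(cwd: str) -> str:
    """Encode an absolute working directory as a project-key directory name.

    Assumption: only the substitutions described in §3.1 are applied.
    """
    # A dot that starts a path component (a hidden directory) becomes "-".
    key = re.sub(r"(?<=[/\\])\.", "-", cwd)
    # Path separators and the Windows drive colon each become "-".
    return re.sub(r"[/\\:]", "-", key)


def list_main_sessions(project_dir: Path) -> list[Path]:
    """Return main-session transcripts, skipping subagent `agent-*.jsonl` files."""
    return sorted(
        p for p in project_dir.glob("*.jsonl") if not p.name.startswith("agent-")
    )
```

Applied to `/mnt/e/CS/HF/eidolon`, `encode_project_key` gives `-mnt-e-CS-HF-eidolon`, which matches the directory name used in §4.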
### 3.2 Common record fields

Every record (both user and assistant types) carries:

| field | type | meaning |
|---|---|---|
| `parentUuid` | `string \| null` | UUID of the parent record (null on the first record) |
| `uuid` | `string` | This record's UUID |
| `sessionId` | `string` | UUID of the session (matches filename) |
| `timestamp` | `string` (ISO-8601) | Wall-clock time of the record |
| `cwd` | `string` | Absolute working directory |
| `version` | `string` | Claude Code version (e.g. `"2.1.143"`) |
| `gitBranch` | `string` | Empty string `""` when not in a git repo |
| `isSidechain` | `boolean` | True for sub-agent (Task tool) chains |
| `userType` | `string` | `"external"` or similar |
| `type` | `string` | Discriminator (see §3.3) |
| `entrypoint` | `string` | e.g. `"sdk-cli"` |

Sources for these fields:
- <https://github.com/KyleAMathews/claude-code-ui/blob/main/spec.md> §"Type Definitions" → `BaseMessageEntry`
- <https://github.com/jamie-bitflight/claude_skills/blob/main/plugins/agentskill-kaizen/skills/transcript-analysis/references/session-log-schema.md> §"Top-Level Record Fields"
- <https://github.com/moru-ai/agent-schemas/blob/main/claude-code/v2.1.1/session.schema.json> (machine-validated against ~50,000 messages from 480 real sessions)
- Direct inspection (this doc): `head` of `~/.claude/projects/-mnt-e-CS-HF-eidolon/c6967343-51a3-4b1b-9472-a569e96114b1.jsonl` confirms presence of every field above.

### 3.3 Record types (`type` discriminator)

| `type` | Role |
|---|---|
| `user` | Both human prompts AND tool results (distinguished by `message.content[].type`) |
| `assistant` | Model output: text, `thinking`, and `tool_use` blocks |
| `system` | Hook summaries, stop notices |
| `summary` | Context-compaction markers |
| `attachment` | Hook stdout/stderr, e.g. `SessionStart` hook output |
| `queue-operation` | Prompt enqueue/dequeue events |
| `file-history-snapshot` | File-state tracking for undo |
| `last-prompt` | Bookkeeping for resume |

Source: <https://github.com/KyleAMathews/claude-code-ui/blob/main/spec.md> §"Entry Types"; corroborated by direct `Counter` inspection of one local session showing `attachment, assistant, user, last-prompt, queue-operation` types in expected proportions.

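The `Counter` inspection mentioned above can be reproduced with a few lines. This is a sketch rather than the exact script used in the audit; the function name and the `<unparseable>` / `<missing>` bucket labels are illustrative assumptions.

```python
import json
from collections import Counter
from pathlib import Path


def count_record_types(path: Path) -> Counter:
    """Tally the top-level `type` field across one session JSONL file."""
    counts: Counter = Counter()
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                counts["<unparseable>"] += 1  # hypothetical bucket label
                continue
            kind = rec.get("type") if isinstance(rec, dict) else None
            counts[kind if isinstance(kind, str) else "<missing>"] += 1
    return counts
```

Calling `count_record_types(path).most_common()` on a session file lists the discriminator values in descending frequency, which is enough to spot record types that are missing from the table.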
### 3.4 The two record types we care about

#### Assistant record carrying a tool call (the "student action")

Real example, redacted from `~/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-doc-adapter-skeleton/39df59f0-674c-413a-b333-cdac0cea9db7.jsonl`:

```json
{
  "type": "assistant",
  "uuid": "24a16a51-3133-4ba5-9d23-472864286154",
  "parentUuid": "1b11c3b3-832b-4473-a944-b61a1f3f2594",
  "sessionId": "39df59f0-…",
  "timestamp": "2026-05-16T04:52:21.947Z",
  "message": {
    "role": "assistant",
    "model": "claude-opus-4-7",
    "content": [
      {
        "type": "tool_use",
        "id": "toolu_bdrk_012HC2dggmSgtVAtWWzwikZq",
        "name": "Bash",
        "input": {
          "command": "ov mail check --agent builder-doc-adapter-skeleton 2>&1 | head -200",
          "description": "Check builder agent inbox"
        }
      }
    ],
    "stop_reason": "tool_use",
    "usage": { "input_tokens": 6, "cache_creation_input_tokens": 48287, "output_tokens": 1021, ... }
  }
}
```

The student's *action* at this step is the JSON of `message.content[i]` where `content[i].type == "tool_use"` (or the array of them if there are multiple `tool_use` blocks; for a pure-text reply, the `content[i].text` of the `text` block).

#### User record carrying a tool result (the "observation")

```json
{
  "type": "user",
  "uuid": "b9f9414b-…",
  "parentUuid": "24a16a51-…",            // matches the assistant uuid above
  "sessionId": "39df59f0-…",
  "timestamp": "2026-05-16T04:52:23.229Z",
  "message": {
    "role": "user",
    "content": [
      {
        "tool_use_id": "toolu_bdrk_012HC2dggmSgtVAtWWzwikZq",
        "type": "tool_result",
        "content": " No new messages",
        "is_error": false
      }
    ]
  },
  "toolUseResult": {                      // duplicate, structured form
    "stdout": " No new messages",
    "stderr": "",
    "interrupted": false,
    "isImage": false,
    "noOutputExpected": false
  },
  "sourceToolAssistantUUID": "24a16a51-…" // back-pointer to the assistant uuid
}
```

User records carrying actual human prompts have `message.content` as a list with `{"type":"text","text":"..."}` blocks (or, in older logs, `message.content` as a plain string).

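To illustrate how the two record shapes link up, the sketch below pairs each `tool_use` block with the `tool_result` block whose `tool_use_id` matches its `id`. It assumes records have already been parsed into dicts shaped like the examples above; `pair_actions_with_observations` is a hypothetical helper, not an existing spike-005 function.

```python
from __future__ import annotations


def _content_blocks(rec: dict) -> list[dict]:
    """Return the list-form content blocks of a record, or [] for string content."""
    msg = rec.get("message")
    content = msg.get("content") if isinstance(msg, dict) else None
    if not isinstance(content, list):  # e.g. older plain-string human prompts
        return []
    return [b for b in content if isinstance(b, dict)]


def pair_actions_with_observations(
    records: list[dict],
) -> list[tuple[dict, dict | None]]:
    """Pair every tool_use block with its tool_result block (None if absent)."""
    # Index tool results by the id of the tool call they answer.
    results: dict[str, dict] = {}
    for rec in records:
        if rec.get("type") != "user":
            continue
        for block in _content_blocks(rec):
            if block.get("type") == "tool_result":
                results[block.get("tool_use_id")] = block
    # Walk assistant records in order and look up each call's result.
    pairs: list[tuple[dict, dict | None]] = []
    for rec in records:
        if rec.get("type") != "assistant":
            continue
        for block in _content_blocks(rec):
            if block.get("type") == "tool_use":
                pairs.append((block, results.get(block.get("id"))))
    return pairs
```

A `None` observation marks a tool call with no recorded result, for example when a session was interrupted mid-call.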
### 3.5 Schema stability

- **Anthropic's official documentation** acknowledges the location and "each line is a JSON object for a message, tool use, or metadata entry" but does **not** publish a versioned schema.
- **Practical stability**: moru-ai/agent-schemas tracked v2.0.76 → v2.1.1; only one new field of note (`toolUseResult`). Schema pins `additionalProperties: true` for forward compatibility. This level of stability is sufficient for Spike 007 (a research spike, not a long-lived product API).
- **Mitigation**: pin to a specific Claude Code `version` field range and version-gate the ingester (e.g. accept `2.1.x`, warn on others).

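The mitigation above could look roughly like the following gate. It is a sketch under assumptions: the accepted prefix `2.1.` is taken from the example in the bullet, while `check_trace_version`, `ACCEPTED_PREFIXES`, and the `strict` switch are hypothetical names (whether to hard-fail or warn is left open in §7).

```python
from __future__ import annotations

import warnings

# Assumption: only the 2.1.x line has been inspected so far.
ACCEPTED_PREFIXES = ("2.1.",)


def check_trace_version(version: str | None, *, strict: bool = False) -> bool:
    """Return True if `version` is in the tested range.

    Outside the range: raise ValueError when `strict`, otherwise warn.
    """
    ok = isinstance(version, str) and version.startswith(ACCEPTED_PREFIXES)
    if not ok:
        msg = f"untested Claude Code trace version: {version!r}"
        if strict:
            raise ValueError(msg)
        warnings.warn(msg, stacklevel=2)
    return ok
```

An ingester would call this once per record (or once per file, on the first record that carries a `version` field) before emitting any states.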
### 3.6 Licensing

- The Claude Code binary is **proprietary** (Anthropic Commercial Terms of Service, <https://github.com/anthropics/claude-code/blob/1e95326e12183286fc6cbd828c8a86a0d8e03c62/LICENSE.md>).
- The session JSONL files are **local user data** generated on the user's machine during ordinary use. Anthropic's data-usage doc explicitly calls them "local caching … session transcripts locally in plaintext"; they belong to the user.
- Our framework is MIT-licensed and we are **not redistributing the Claude Code binary or any third-party trace files**. We are reading the user's own local logs (analogous to processing one's own `.bash_history`).
- We MUST NOT publish raw trace files in our repo without the user's consent (PII risk: cwd, gitBranch, file contents). The framework should ship only the *ingester*, plus a tiny synthetic-fixture trace for unit tests.

---

## 4. Acquiring the 5 real example traces

**Zero acquisition cost.** All five live on this machine right now.

Discovery command (used during this audit):

```bash
find ~/.claude/projects -name "*.jsonl" 2>/dev/null | wc -l
# → 1015 files
```

Five concrete pre-selected sessions, each multi-turn (≥ 100 tool_use messages), each from a distinct project, each ≥ 50 KB:

| # | Tool-use msgs | User msgs | Asst msgs | Total lines | Path |
|---|---|---|---|---|---|
| 1 | 2,830 | 3,199 | 4,325 | 17,315 | `/home/codeseys/.claude/projects/-mnt-e-CS-HF-eidolon/c6967343-51a3-4b1b-9472-a569e96114b1.jsonl` |
| 2 | 1,350 | 1,407 | 2,016 | 7,673 | `/home/codeseys/.claude/projects/-mnt-e-CS-github-agent-manager/c42b68ea-d410-455e-bc71-92ec6c4adce9.jsonl` |
| 3 | 984 | 1,032 | 1,549 | 5,783 | `/home/codeseys/.claude/projects/-mnt-e-CS-HF-streaming-speech-to-speech/73c9925c-d5e5-48fc-a97b-a58687c2fb3c.jsonl` |
| 4 | 717 | 759 | 1,142 | 4,036 | `/home/codeseys/.claude/projects/-mnt-e-CS-github/6ac8e20f-98ec-4279-9957-e68862a90c5e.jsonl` |
| 5 | 125 | 126 | 197 | 629 | `/home/codeseys/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-iteration-checkpoint/e4a34e2b-40c6-49ce-b253-912a43224aae.jsonl` |

(All five inspected programmatically during this audit; counts above are real, not estimates.)

For users on other machines: `find ~/.claude/projects -name '*.jsonl' -size +50k | head` will surface candidates. For repository CI we will commit a small (~5 KB) **synthetic** fixture conforming to the schema, never any of the user's real traces.

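The per-session counts in the table can be approximated with a short script. The exact counting rules used during the audit are not recorded here, so the sketch below is an assumption: it counts an assistant record as a "tool-use message" when its content holds at least one `tool_use` block, and `session_stats` / `is_candidate` are hypothetical names.

```python
import json
from pathlib import Path


def session_stats(path: Path) -> dict[str, int]:
    """Count lines, user/assistant records, and tool-use messages in one file."""
    stats = {"tool_use_msgs": 0, "user_msgs": 0, "asst_msgs": 0, "total_lines": 0}
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            stats["total_lines"] += 1
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # blank or truncated line
            if not isinstance(rec, dict):
                continue
            if rec.get("type") == "user":
                stats["user_msgs"] += 1
            elif rec.get("type") == "assistant":
                stats["asst_msgs"] += 1
                msg = rec.get("message")
                content = msg.get("content") if isinstance(msg, dict) else None
                if isinstance(content, list) and any(
                    isinstance(b, dict) and b.get("type") == "tool_use"
                    for b in content
                ):
                    stats["tool_use_msgs"] += 1
    return stats


def is_candidate(path: Path, *, min_tool_use: int = 100, min_bytes: int = 50_000) -> bool:
    """Apply the selection thresholds stated above (≥ 100 tool-use msgs, ≥ 50 KB)."""
    if path.stat().st_size < min_bytes:
        return False
    return session_stats(path)["tool_use_msgs"] >= min_tool_use
```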

---

## 5. Decision-relevant tradeoffs vs runners-up

### Why we are NOT picking OpenHands trajectories (c)
- **Pro**: cleanest schema we audited, built on Pydantic `Event` / `ActionEvent` / `ObservationEvent` models, source: <https://docs.openhands.dev/sdk/arch/events>, source code: <https://github.com/OpenHands/OpenHands/blob/3ec999e8/openhands/events/serialization/event.py>. Tool-call structure is *more* normalized than Claude Code's (explicit Action/Observation typing).
- **Con**: acquisition is not zero-cost here. Persistence dir defaults to `workspace/conversations/` and only exists if the user has *run OpenHands locally*. Public eval trajectories are spread across the eval/ folder rather than a clean public bucket.
- **Decisive**: Spike 001's economic floor was measured on 50 synthetic states. Spike 007's purpose is to verify ingestion + replay on real traces *that already exist*. (a) gives that today; (c) requires standing up OpenHands first, plus the storage format split between v0 (per-event JSON files) and v1 (timestamped files) per <https://github.com/All-Hands-AI/OpenHands/issues/8701>, which is a flux risk.
- **Future use**: if the framework ever ships "trace ingester adapters" plural, OpenHands is the second adapter to write, because its event-typed model is conceptually superior.

### Why we are NOT picking SWE-bench leaderboard trajectories (e)
- **Pro**: hundreds of submissions on <https://github.com/swe-bench/experiments>, with required `trajs/` folders.
- **Con**: leaderboard rules say "The reasoning trace can be represented with **any text based file format (e.g. md, json, yaml)**" (source: <https://github.com/swe-bench/experiments> README). Each submitter picks their own. Building a generic ingester is a per-submission engineering project, not a single adapter. SWE-agent uses one shape (`{"action", "observation", "response"}` arrays, confirmed via <https://huggingface.co/datasets/JetBrains-Research/swe-traj-complete>); mini-swe-agent uses `.traj.json` with OpenAI messages format (<https://huggingface.co/datasets/tarsur385/swebench-verified-trajectories>).
- **Decisive**: heterogeneous schema = fragile ingester = wrong choice for *first* spike.

### Why we are NOT picking Aider (d)
- The `chat_history_file` is **markdown** (`.aider.chat.history.md`), per <https://aider.chat/docs/config/dotenv.html>. Source code at <https://github.com/Aider-AI/aider/blob/bdb4d9ff/aider/history.py> shows it's literally `f.write(text)` of formatted prose with `####` for user input.
- **Decisive**: tool calls in Aider are *applied as edits*, not preserved as discrete structured actions in the markdown log. Reconstructing "the action the student took at step k" is lossy. The `.aider.llm.history` log is closer to what we want but is opt-in and not always present.

### Why we are NOT picking Cline (b)
- No public commitment to a stable export schema. Cline's storage is internal to the VS Code extension (workspace state DB + per-task JSON in extension storage). Searching for "Cline trace export schema" yields no Anthropic-style spec doc. Workable in principle, but reverse-engineering an extension's storage is not the right ground for a 1-week spike.

### Why we are NOT picking SWE-smith-trajectories (f)
- This is the **strongest external dataset** we found and **should be Spike 007's stretch goal and the primary source for the later cross-validation phase** noted in §2: 5,017 fine-tuning trajectories from SWE-agent + Claude 3.7 Sonnet, 4.22 GB on HuggingFace, OpenAI messages format. Source: <https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories>.
- **Why not first**: the messages-only format collapses tool calls and tool results into the OpenAI chat-completions wire format with text-encoded tool blocks. That works for SFT but is *less* signal-dense for the teacher-correction spike than Claude Code's `tool_use` blocks: in Claude Code's format each tool call's `name` and `input` fields are structurally separated, making "did the teacher pick a different tool?" a one-line check.

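As an illustration of that one-line check, the sketch below compares tool names between two actions. It assumes both actions are serialized as a JSON list of `{"name", "input"}` objects (the shape produced by `_serialize_action` in §6); `tool_names` and `picked_different_tool` are hypothetical helpers.

```python
import json


def tool_names(action_json: str) -> list[str]:
    """Extract the ordered tool names from a serialized tool_use action."""
    return [call.get("name") for call in json.loads(action_json)]


def picked_different_tool(student_action: str, teacher_action: str) -> bool:
    """True when the teacher chose a different tool (or tool sequence)."""
    return tool_names(student_action) != tool_names(teacher_action)


student = json.dumps([{"name": "Bash", "input": {"command": "ls"}}])
teacher = json.dumps([{"name": "Read", "input": {"file_path": "README.md"}}])
print(picked_different_tool(student, teacher))  # True: Bash vs Read
```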

---

## 6. TraceIngester sketch

Drop-in adapter for spike-005's `replay_trace()`. Targets `TraceState` (the actual existing TypedDict; see §1).

```python
# spikes/007-trace-ingester/trace_ingester.py
from __future__ import annotations

import json
from collections.abc import Iterator
from pathlib import Path

# Re-use the existing TypedDicts from spike-005:
#   from spikes.005_integrated_trainer_skeleton.teacher_replay import TraceState

# A "step" in the trace is each assistant record with a non-empty action. The
# state visible to the model at that step = all messages strictly before it,
# in OpenAI/Anthropic chat format. The student_action = the tool_use payload(s),
# or the concatenated text for a text-only turn.


def _record_to_chat_message(rec: dict, *, skip_thinking: bool = True) -> dict | None:
    """Turn one Claude Code JSONL record into an OpenAI/Anthropic chat-message
    dict, or return None for non-conversational records (queue-operation,
    attachment, file-history-snapshot, system, last-prompt, summary)."""
    t = rec.get("type")
    if t not in ("user", "assistant"):
        return None
    msg = rec.get("message")
    if not isinstance(msg, dict):
        return None
    role = msg.get("role")
    content = msg.get("content")
    if role not in ("user", "assistant") or content is None:
        return None
    # Strip thinking blocks: they are not portable across teacher models and
    # should not influence the teacher's decision at replay time.
    if skip_thinking and isinstance(content, list):
        content = [c for c in content
                   if not (isinstance(c, dict) and c.get("type") == "thinking")]
    return {"role": role, "content": content}


def _serialize_action(content_blocks: list[dict]) -> str:
    """Canonicalize the student's action at a step.

    For tool_use steps: JSON-encode the (name, input) pairs.
    For text-only steps: return the concatenated text.
    """
    tool_uses = [b for b in content_blocks if isinstance(b, dict) and b.get("type") == "tool_use"]
    if tool_uses:
        return json.dumps(
            [{"name": tu.get("name"), "input": tu.get("input")} for tu in tool_uses],
            sort_keys=True,
        )
    texts = [b.get("text", "") for b in content_blocks if isinstance(b, dict) and b.get("type") == "text"]
    return "\n".join(t for t in texts if t)


class TraceIngester:
    """Reads a Claude Code session JSONL and yields TraceState records.

    One TraceState is emitted per assistant record whose serialized action has
    at least `min_action_chars` characters. The `messages` field is the prior
    conversation (user and assistant records only; no system prompt is
    injected) up to but not including the current assistant turn;
    `student_action` is the canonicalized serialization of that turn's
    content blocks.
    """

    def __init__(self, *, skip_thinking: bool = True, min_action_chars: int = 1) -> None:
        self.skip_thinking = skip_thinking
        self.min_action_chars = min_action_chars

    def ingest(self, path: str | Path) -> Iterator[dict]:  # yields TraceState
        path = Path(path)
        prior_messages: list[dict] = []
        session_id_for_state = path.stem  # filename = session UUID

        with path.open("r", encoding="utf-8") as f:
            for line_idx, line in enumerate(f):
                line = line.strip()
                if not line:
                    continue
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    continue  # tolerate truncated last-line writes

                # Honor the constructor flag (previously stored but never used).
                chat_msg = _record_to_chat_message(rec, skip_thinking=self.skip_thinking)
                if chat_msg is None:
                    continue

                if chat_msg["role"] == "assistant":
                    # Emit a TraceState representing "before this turn".
                    blocks = chat_msg["content"] if isinstance(chat_msg["content"], list) else []
                    student_action = _serialize_action(blocks)
                    if len(student_action) >= self.min_action_chars:
                        yield {
                            "state_id": f"{session_id_for_state}:{rec.get('uuid', line_idx)}",
                            "messages": list(prior_messages),  # snapshot
                            "student_action": student_action,
                        }
                # Append to history regardless (so subsequent turns see it).
                prior_messages.append(chat_msg)
```

Notes:
- We skip `thinking` blocks because (1) they're Anthropic-specific and (2) feeding them to other-vendor teachers (GPT/DeepSeek) leaks reasoning the teacher should produce on its own. This matches the philosophy used in spike-005's `_normalize_action`.
- We do NOT inject a system prompt: Claude Code's initial system prompt is not in the JSONL (it's set at SDK init and visible only via `attachment` records). Downstream callers may want to prepend a synthetic system message for teacher fairness. Open question for ADR-002.
- `state_id = f"{sessionId}:{recordUuid}"` is globally unique and stable across re-ingest.
- Failures (unparseable lines, missing fields) are tolerated silently. A counters-based sibling method `ingest_with_stats(path)` is a small follow-up.

### 6.1 Smoke-test plan (for Spike 007 itself)

```python
ingester = TraceIngester()
states = list(ingester.ingest("/home/codeseys/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-iteration-checkpoint/e4a34e2b-40c6-49ce-b253-912a43224aae.jsonl"))
# Expect at most 197 states (the asst-message count from §4); assistant
# records left empty after thinking blocks are stripped are skipped.
# Then teacher-replay on the first 5 states, confirm cost is in the
# spike-001 ballpark ($0.05–$0.20 for 5 states × 3 teachers).
```

Spike 001 baseline to beat: $0.98/trace mean (50-state synthetic), $0.30/trace projected with VOI gating. On real states a ~5–20× cost increase is plausible due to longer message histories (10k+ tokens vs synthetic ~300 tokens), so a relevant **economic check** for Spike 007 is: if the first 5 states cost > $5 (i.e. > $1/state), the VOI gate from Spike 001 is *required* before scaling. Flag this finding in the spike write-up.

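The arithmetic behind that check can be written down explicitly. The sketch below assumes the $0.98 figure covers one 50-state trace with all teachers included, so the per-state cost is simply $0.98 / 50; that linear scaling is an assumption of this example, and the function names are hypothetical.

```python
SPIKE_001_COST_PER_TRACE = 0.98   # USD, ungated mean from Spike 001
SPIKE_001_STATES_PER_TRACE = 50   # assumption: one synthetic trace = 50 states
GATE_THRESHOLD_PER_STATE = 1.00   # USD; the "> $1/state" rule stated above


def projected_cost(n_states: int, multiplier: float = 1.0) -> float:
    """Project replay cost for `n_states`, scaled by a real-trace multiplier."""
    per_state = SPIKE_001_COST_PER_TRACE / SPIKE_001_STATES_PER_TRACE
    return n_states * per_state * multiplier


def voi_gate_required(observed_cost: float, n_states: int) -> bool:
    """Apply the economic check: more than $1 per state means gate first."""
    return observed_cost / n_states > GATE_THRESHOLD_PER_STATE


# 5 states at the synthetic rate, then at the 5x and 20x multipliers.
for m in (1.0, 5.0, 20.0):
    print(f"{m:>4.0f}x: ${projected_cost(5, m):.2f}")
```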

---

## 7. Open questions for ADR-002

1. Do we promote `TraceState` to a top-level `TraceExample` dataclass, with optional `teacher_id`, `reward`, `hint_text`? Or keep `TraceState` as ingester output and `DPOPair` as trainer input, treating the brief's "TraceExample" as conceptual?
2. Should `TraceIngester.ingest()` emit one record per **assistant turn** (current sketch) or per **assistant `tool_use` block** within a turn? Some Claude Code records have multiple tool_use blocks in one assistant message.
3. Synthetic system prompt at replay time: yes/no? If yes, what content?
4. Trace-version pinning: hard-fail or warn when `version` field falls outside a known-tested range?
5. Subagent transcripts (`agent-*.jsonl`): include or skip? They are denser per-turn but their parent context is the orchestrator, not the user, which changes the teacher-replay semantics.

---

## 8. References (primary sources only)

Anthropic / Claude Code official:
- <https://code.claude.com/docs/en/sessions>: session storage location and "JSONL, one JSON per line"
- <https://code.claude.com/docs/en/data-usage>: "local caching … session transcripts locally in plaintext under `~/.claude/projects/` for 30 days by default"
- <https://code.claude.com/docs/en/legal-and-compliance>: Commercial Terms vs Consumer Terms applicability
- <https://github.com/anthropics/claude-code/blob/1e95326e12183286fc6cbd828c8a86a0d8e03c62/LICENSE.md>: proprietary license

Community schemas (reverse-engineered from real session data):
- <https://github.com/moru-ai/agent-schemas/blob/main/claude-code/v2.1.1/session.schema.json>: JSON Schema Draft 2020-12, validated against ~50,000 messages from 480 sessions
- <https://github.com/KyleAMathews/claude-code-ui/blob/main/spec.md> §"Claude Code Session Log Format": Entry types and TypeScript discriminated union
- <https://github.com/jamie-bitflight/claude_skills/blob/main/plugins/agentskill-kaizen/skills/transcript-analysis/references/session-log-schema.md>: top-level fields, project-key encoding, subagent file location
- <https://github.com/dagster-io/erk/blob/master/docs/learned/sessions/layout.md>: directory structure, plan-mode `slug` field
- <https://github.com/pedropaulovc/claude-code-types>: TypeScript type definitions from session logs

Runners-up reference points:
- OpenHands events: <https://docs.openhands.dev/sdk/arch/events>, <https://docs.openhands.dev/sdk/guides/convo-persistence>, <https://github.com/OpenHands/OpenHands/blob/3ec999e8/openhands/events/serialization/event.py>, <https://github.com/All-Hands-AI/OpenHands/issues/8701>
- SWE-bench experiments: <https://github.com/swe-bench/experiments>
- SWE-smith trajectories on HF: <https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories>
- mini-swe-agent traj.json: <https://huggingface.co/datasets/tarsur385/swebench-verified-trajectories>
- swe-traj-complete (SWE-agent format example): <https://huggingface.co/datasets/JetBrains-Research/swe-traj-complete>
- Aider history file format: <https://aider.chat/docs/config/dotenv.html>, <https://github.com/Aider-AI/aider/blob/bdb4d9ff/aider/history.py>, <https://github.com/paul-gauthier/aider/blob/main/aider/io.py>

Internal references:
- `spikes/005-integrated-trainer-skeleton/teacher_replay.py`: `TraceState`, `DPOPair`, `replay_trace`, `extract_dpo_pairs` (read in full during this audit; see §1 for actual field list)
- Spike 001 economic floor: $0.98/trace mean ungated, $0.30/trace projected with VOI gating