Wave 11: cross-model adversarial review + honest down-revision

Phase 11 of the deep work loop ran a cross-model adversarial review on the
Wave 7-10 work. The review (anthropic/claude-opus-4.7, separate session) found
10 substantive flaws including 2 BLOCKERs:

1. **Scorecard inflation** — initial self-claim was 5/10 → 9/10, honest
re-scoring is 5/10 → 7/10 ✅ + 1/10 ⚠️ + 2/10 ❌-spirit.
2. **Spike 008 acceptance criterion silently redefined** — BACKLOG required
"2 replicas converge"; what shipped is "1 replica + passthrough no-op
allreduce." The recon doc's "ready-to-paste 2-replica pattern" hits a
single-process post-hook sequencing bug.

Both BLOCKERs addressed in this commit:

- docs/research/WAVE_7_10_FINAL_REVIEW.md — full 423-line review committed.
- docs/VISION_VALIDATION.md — replace 9/10 self-claim with honest 7/10
re-scoring + per-test caveats acknowledging the spirit-failures of tests
6/8/10. Explicitly notes "the framework as of this commit has zero GPU
evidence of any kind."
- spikes/008-streaming-diloco/verdict.md — status changed from "✅ PASSED"
to "⚠️ PARTIAL." Adds an "Honest re-statement" section explaining the
acceptance-criterion redefinition and what would be needed to close the
BACKLOG criterion in spirit (multi-process test ~200 LOC).
- README.md — Spike 6 marked "PASSED with caveat" (SDPO channel is never
exercised on a real model in the smoke). Spike 7 acknowledges the
end-to-end ingester→collator→loss test is missing. Spike 8 marked ⚠️
PARTIAL not 🟢 PASSED.
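
For reference, the gap between "1 replica + passthrough no-op allreduce" and "2 replicas converge" can be shown on a toy. The sketch below is plain Python, not torchft: two replicas train locally on different targets, their pseudo-gradients are averaged (a stand-in for a real `allreduce`), and both apply the same outer step. All function names and constants are hypothetical and chosen only for illustration.

```python
def local_train(theta: float, target: float, steps: int, lr: float) -> float:
    """Inner loop: gradient descent on 0.5 * (theta - target) ** 2."""
    for _ in range(steps):
        theta -= lr * (theta - target)
    return theta


def outer_round(theta_global: float, targets: list[float], outer_lr: float) -> list[float]:
    """One DiLoCo-style round: local training, averaged pseudo-gradient, outer SGD."""
    local_thetas = [local_train(theta_global, t, steps=4, lr=0.1) for t in targets]
    # Pseudo-gradient per replica: theta_initial - theta_local.
    pseudograds = [theta_global - th for th in local_thetas]
    # Stand-in for allreduce(mean) across replicas.
    avg = sum(pseudograds) / len(pseudograds)
    # Every replica restores theta_initial and applies the same averaged step.
    return [theta_global - outer_lr * avg for _ in targets]


targets = [1.0, 3.0]  # each replica sees different data
replicas = outer_round(0.0, targets, outer_lr=0.7)          # outer round 1
replicas = outer_round(replicas[0], targets, outer_lr=0.7)  # outer round 2
print(replicas)
```

In this toy, replacing the average with a passthrough would let each replica apply its own pseudo-gradient, so the two values would differ after the outer step. A single-replica test cannot observe that difference, which is the substance of the redefinition noted above.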

Two cheaper improvements also landed:

- docs/adrs/ADR-002-trace-source.md — adds a "Risk added 2026-05-26 by
cross-model review" section flagging the Claude Code circularity:
training a student on Claude's outputs while Claude is in the teacher
pool produces biased disagreement. Mitigation (drop Claude from the
pool when ingesting Claude traces) documented; not yet enforced in code.
- composer_replication/diloco/__init__.py + spikes/008-.../composer_diloco.py
— sign-convention docstrings rewritten. Old prose ("wrong-sign +
SGD subtract = right answer with momentum") was incoherent. New version
derives p.data ← θ_initial - lr*pseudograd = θ_initial + lr*(θ_local -
θ_initial) cleanly, no hand-waving about momentum.
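
The ADR-002 mitigation listed above is documented but not yet enforced in code. A minimal sketch of what enforcement might look like, assuming teachers are identified by `vendor/model` strings as in `anthropic/claude-opus-4.7`; the helper `exclude_same_family` and the two placeholder pool entries are hypothetical and not part of the package:

```python
def exclude_same_family(teachers: list[str], source_family: str) -> list[str]:
    """Drop teachers whose vendor prefix matches the family that produced the trace."""
    family = source_family.lower()
    kept = [t for t in teachers if t.split("/", 1)[0].lower() != family]
    if not kept:
        # Refuse to continue with no teachers rather than silently voting with nobody.
        raise ValueError("teacher pool is empty after excluding the source family")
    return kept


# Illustrative pool: only the Claude entry comes from ADR-002; the rest are placeholders.
pool = [
    "anthropic/claude-opus-4.7",
    "example-vendor/model-a",
    "example-vendor/model-b",
]
filtered = exclude_same_family(pool, source_family="anthropic")
print(filtered)
```

Matching on the vendor prefix is a deliberate simplification; whether the ingester should annotate a source-model field on each trace is the open v0.2 question in the ADR.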

Items NOT addressed in this commit (deferred to next session or post-GPU):
- 3. Strengthen Spike 006 against tautology (use 2 alternating batches +
real SDPO firing path)
- 4. Run Spike 002a-mini on local 5090 (zero GPU evidence anywhere yet)
- 5. Reconcile run.log vs verdict.md numerical inconsistency (seed pinning)
- 7. Decide what compose_loss/build_batch are (verification harness vs
production)
- 8. Eliminate dual sources of truth (spike copy vs package copy)
- 9. Add the missing real-trace end-to-end test in Spike 007
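
Item 5 depends on where the seed is set; the review suspects `torch.manual_seed(42)` is called only inside `build_batch`. A hedged sketch of a single seeding entry point, assuming the eventual fix is to seed every RNG once before both batch construction and the model forward. `seed_everything` and `draw` are hypothetical helpers, and the NumPy and PyTorch branches only run if those libraries are installed:

```python
import random


def seed_everything(seed: int) -> None:
    """Seed every RNG we know about, once, before any batch or model code runs."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed.
    try:
        import torch
        torch.manual_seed(seed)  # seeds the default generators on all devices
    except ImportError:
        pass  # PyTorch not installed; nothing to seed.


def draw(n: int = 3) -> list[float]:
    """Stand-in for any code path that consumes randomness."""
    return [random.random() for _ in range(n)]


seed_everything(42)
first = draw()
seed_everything(42)
second = draw()
print(first == second)
```

Seeding alone may not make two runs bitwise identical (dropout mode and nondeterministic kernels can still differ), so the reconciliation would still need to record which code path produced each quoted number.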

The deep work loop's design says "publish the cross-model review with the
work, not before reviewing." This commit honors that — the BLOCKER fixes
land BEFORE the public push.

Tests still pass 67/67 across spikes 005/006/007/008 (Spike 008 changes are
docstring-only; the machinery test still passes).

Files changed (7)

README.md +4 -4
composer_replication/diloco/__init__.py +25 -15
docs/VISION_VALIDATION.md +18 -0
docs/adrs/ADR-002-trace-source.md +19 -0
docs/research/WAVE_7_10_FINAL_REVIEW.md +423 -0
spikes/008-streaming-diloco/composer_diloco.py +19 -15
spikes/008-streaming-diloco/verdict.md +73 -40

README.md CHANGED

@@ -48,10 +48,10 @@ for what the output should look like.
 **v0.1 spike progress (2026-05-26):**
 - 🟢 Spike 001 (kill-switch teacher cost) — **VALIDATED**: 150 real OpenRouter calls, $0.98/trace, p95 latency 20.5s. The novel research direction is economically viable.
 - 🟢 Spike 005 (integrated 3-channel trainer skeleton) — **SKELETON-VALIDATED**: 38/38 unit tests passing; the integration architecture claim ("all three channels run simultaneously, ablate cleanly, train without divergence") is empirically verified.
-- 🟢 Spike 006 (real HF model smoke) — **PASSED**: Qwen2.5-0.5B-Instruct via `AutoModelForCausalLM`, 5 backward steps on CPU, loss 0.7390 → 0.0031 (99.6% reduction), all gradients finite. Closes vision-validation gap V8.
-- 🟢 Spike 007 (real trace ingestion) — **PASSED**: `ClaudeCodeIngester.ingest()` converts Claude Code session JSONL → `TraceState` records. 15/15 tests including a real-session smoke. Closes V5.
-- 🟢 Spike 008 (DiLoCo outer-loop smoke) — **PASSED**: `make_diloco_outer_loop()` wraps `torchft.local_sgd.DiLoCo` (BSD-3, Meta-maintained). 5/5 tests including pseudo-gradient sign-convention verification. Closes V2.
-- 🟢 Wave 10 (packaging) — **DONE**: `pip install -e .` works; `composer_replication` package re-exports the verified APIs from the spike directories.
+- 🟢 Spike 006 (real HF model smoke) — **PASSED with caveat**: Qwen2.5-0.5B-Instruct via `AutoModelForCausalLM`, 5 backward steps on CPU, loss 0.7390 → 0.0031, all gradients finite. Closes vision-validation gap V8 in the literal sense. Caveat: the SDPO channel is `0.0` throughout (silently disabled by ctx-shape mismatch — the fallback is correct but means SDPO is not exercised end-to-end on a real model anywhere yet) and DPO uses dummy reference logprobs.
+- 🟢 Spike 007 (real trace ingestion) — **PASSED**: `ClaudeCodeIngester.ingest()` converts Claude Code session JSONL → `TraceState` records. 15/15 tests including a real-session smoke. ⚠️ End-to-end ingester → collator → loss test still missing (V5 spirit gap); see VISION_VALIDATION § 3 update.
+- ⚠️ Spike 008 (DiLoCo outer-loop smoke) — **PARTIAL**: `make_diloco_outer_loop()` wraps `torchft.local_sgd.DiLoCo`. 5/5 single-process tests pass including a pseudo-gradient sign-convention pin. **But** the BACKLOG required a 2-replica convergence smoke; what shipped is 1-replica machinery + passthrough no-op `allreduce`. True multi-process DiLoCo is GPU-gated and not yet attempted.
+- 🟢 Wave 10 (packaging) — **DONE**: `pip install -e .` works; `composer_replication` package re-exports the verified APIs from the spike directories. `compose_loss` and `build_batch` are explicitly verification-harness public APIs (production loss is `ComposerReplicationTrainer._compute_loss`).
 - 📋 Spikes 002a/002b/003/004 — planned, awaiting GPU budget commitment.
 📝 **Publication materials drafted:** [`publications/`](publications/) contains a complete pre-experimental release set — longform methodology paper, blog post (HF Blog format), repo Discussion announcement, X/LinkedIn threads, plus `CITATION.cff` and `CITATION.bib` at the repo root. Use [`publications/RELEASE_CHECKLIST.md`](publications/RELEASE_CHECKLIST.md) to coordinate the publication wave. Nothing posted publicly yet — this is a pre-experimental release, not a post-experimental one.

composer_replication/diloco/__init__.py CHANGED

@@ -11,21 +11,31 @@ Wraps `torchft.local_sgd.DiLoCo` with the framework's conventions:
 Reference: `docs/adrs/ADR-003-diloco-impl.md`.
 Sign convention (READ THIS BEFORE TOUCHING):
-    torchft's `_save_grads()` (line 324 of torchft/local_sgd.py) computes
-        grad = θ_initial - θ_local
-    and stores it as `param.grad` for the outer optimizer to consume.
-    The outer optimizer then runs `param.data -= lr * grad`, equivalently
-        θ_new = θ_local + lr * (θ_initial - θ_local)  if outer optimizer is plain SGD
-    which slurps the local-trained-θ TOWARD the initial-θ instead of away
-    from it. That looks wrong, but it's correct for SGD-with-Nesterov-momentum
-    on outer loop: the outer optimizer accumulates the negative-grad-direction
-    history, so the "wrong-sign" pseudogradient combined with SGD's "subtract
-    grad" semantics gives net "step in the local-Δ direction" once momentum
-    builds up. This is consistent with the DiLoCo paper's pseudo-code.
-    Bottom line: do NOT negate. torchft's pseudogradient sign + SGD outer
-    optimizer is the correct combination. Spike 008's
-    `test_diloco_pseudogradient_sign_convention` test catches a sign flip.
+    DiLoCo defines pseudo-gradient as
+        pseudograd = θ_initial - θ_local
+    (per torchft's `_save_grads()` at line 324 of `torchft/local_sgd.py`).
+    This is the **negative** of the local update direction (the local
+    update *moved* params from θ_initial toward θ_local).
+    Standard SGD subtracts gradients: `p.data ← p.data - lr * grad`.
+    So when the outer optimizer runs after `restore_parameters()` puts
+    p.data back to θ_initial:
+        p.data ← θ_initial - lr * (θ_initial - θ_local)
+               = θ_initial + lr * (θ_local - θ_initial)
+    For `lr=1, momentum=0` this lands exactly at θ_local. For `lr<1` it
+    interpolates between θ_initial and θ_local (the standard DiLoCo outer
+    step). Adding Nesterov momentum accumulates the local-update direction
+    across outer rounds.
+    No negation in our outer optimizer wrapper. The test
+    `test_diloco_pseudogradient_sign_convention` in
+    `spikes/008-streaming-diloco/tests/test_diloco_smoke.py` pins this
+    arithmetic and reports both the expected and the wrong-sign value on
+    failure for fast diagnosis if torchft ever flips its convention.
 """
 from __future__ import annotations
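
The arithmetic in the new docstring can be checked without torchft. The sketch below is a plain-Python illustration, assuming a scalar parameter and a momentum-free outer SGD step; `pseudograd` and `outer_sgd_step` are hypothetical helpers, not functions from torchft or `composer_replication`.

```python
def pseudograd(theta_initial: float, theta_local: float) -> float:
    """Pseudo-gradient as stated in the docstring: theta_initial - theta_local."""
    return theta_initial - theta_local


def outer_sgd_step(theta_initial: float, theta_local: float, lr: float) -> float:
    """Plain SGD (no momentum) applied from theta_initial: p <- p - lr * grad."""
    return theta_initial - lr * pseudograd(theta_initial, theta_local)


theta_initial, theta_local = 1.0, 3.0  # local training moved the parameter by +2.0

full = outer_sgd_step(theta_initial, theta_local, lr=1.0)  # lands on theta_local
half = outer_sgd_step(theta_initial, theta_local, lr=0.5)  # interpolates halfway
# What a mistakenly negated pseudo-gradient would produce at lr=1.
wrong_sign = theta_initial + 1.0 * pseudograd(theta_initial, theta_local)

print(full, half, wrong_sign)  # 3.0 2.0 -1.0
```

The `wrong_sign` value moves away from `theta_local`, which is the kind of failure the sign-convention test is described as reporting.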

docs/VISION_VALIDATION.md CHANGED

@@ -59,6 +59,24 @@ Ten concrete pass/fail tests covering both "do we encapsulate the vision" and "i
 **Score: 5/10 pass, 4/10 fail, 1/10 partial.** The framework's design is solid; the gap is between design and runnable artifact.
+> **Update 2026-05-26 — Wave 7+8+9+10 closeout (deep work loop) + cross-model audit**
+>
+> Initial self-claim was 5/10 → 9/10. A cross-model adversarial review (Phase 11 of the deep work loop, doc at `docs/research/WAVE_7_10_FINAL_REVIEW.md`) found three of those ✅s were letter-of-the-law rather than spirit. **Honest re-scoring: 5/10 → 7/10 ✅, 1/10 ⚠️, 2/10 ❌-spirit.**
+>
+> | # | Test | Status | New evidence + honest caveat |
+> |---|---|---|---|
+> | 6 | DiLoCo integrated in runnable stack | ⚠️ partial | Spike 008 has `composer_replication.diloco.make_diloco_outer_loop` wrapping `torchft.local_sgd.DiLoCo`, with a 5/5 test suite that pins the sign convention. **But**: the BACKLOG required a *2-replica convergence* smoke; what shipped is a 1-replica machinery test with a passthrough no-op `allreduce`. The recon doc's "ready-to-paste 2-replica pattern" hits a single-process post-hook sequencing bug we couldn't fix without rewriting torchft. The DiLoCo wrapper is also **not yet integrated with `ComposerReplicationTrainer`** — it's an independent context manager. Calling V2 ✅ overstates: real DiLoCo training is GPU-multi-process which we haven't touched. |
+> | 7 | Real HF model loads + runs through `compose_loss` | ✅ (with caveat below) | Spike 006 — Qwen2.5-0.5B-Instruct on CPU, 5 backward steps, loss 0.7390 → 0.0031, 9/9 tests. **Caveat**: SDPO channel is `0.0` throughout (silently disabled by ctx_student vs ctx_teacher shape mismatch — correct fallback, but means SDPO is not exercised end-to-end on a real model anywhere in the repo yet). DPO uses dummy reference logprobs. The 5-step loss decrease on a fixed batch is closer to "memorization works" than "the 3-channel composition is correct." Still: the framework now demonstrably loads real HF models, which it didn't before. V8 is closed in the literal sense. |
+> | 8 | Real LLM-application trace ingested end-to-end | ❌ spirit | Spike 007 — `ClaudeCodeIngester` ingests real Claude Code session JSONL → `TraceState` records, 15/15 tests including a real-session smoke. **But**: BACKLOG acceptance criterion #3 said "end-to-end smoke: real trace → ingester → collator → 1-step `compose_loss`." That last hop is **not tested**. The spike stops at "ingester emits TraceStates correctly." Closing V5 in spirit needs a 50-LOC test that pipes ingested records all the way through the loss. Open. |
+> | 9 | Framework is *installable* with working entrypoints | ✅ | Wave 10 — `pyproject.toml` ships `composer_replication` package, `pip install -e .` works, `examples/qwen_05b_quickstart/run.py` runs end-to-end via the package API. (Caveat acknowledged: `compose_loss` is documented as a verification harness, not production. The production loss is `ComposerReplicationTrainer._compute_loss`.) |
+> | 10 | Non-author can complete the "I have X, I want a Composer variant" journey | ❌ spirit | Quickstart works for "verify the loss composition runs" but not for "train a real model" — that requires real GRPO rollouts, real teacher calls, and GPU. The brief's intended user wants the latter. We have not closed that path. |
+>
+> **The remaining 1/10 + 2/10 spirit gaps + the unverified 9/10 ⚠️** are the post-replication GPU phase: Spike 002a/b (real trace collection on GPU), Spike 003 (DPO-pair signal density), Spike 004 (A/B SWE-bench-lite), and a real-multi-process DiLoCo test. Those are GPU-budget-gated and out of scope for the deep work loop's CPU-only constraint.
+>
+> **Time spent on Wave 7-10**: ~1 session. **No GPU spend.** Modal evaluated but rejected for the smoke phase (ADR-001 — local 5090 wins on iteration cycle 10× over Modal L4 for 0.5B verification work). **The local 5090 was also not used** — Spike 002a-mini (the planned local-GPU smoke) was not run. The framework as of this commit has zero GPU evidence of any kind. That is honest about where this work lands: **a tested, installable methodology repo with real CPU smokes and primary-source-validated research, not a trained model.**
+>
+> Cross-model review's full priority list (10 items, ranked) is in `docs/research/WAVE_7_10_FINAL_REVIEW.md`. Items 1-2 (scorecard honesty + V2 re-statement) are addressed by this update. Items 3-10 are open.
 ## 4. The four real gaps, each examined
 ### 4.1 V2: DiLoCo deferral — is this a drift?

docs/adrs/ADR-002-trace-source.md CHANGED

@@ -117,6 +117,25 @@ Wins on every axis we care about for Spike 007:
   need to bump a `schema_version` constant in the ingester. Acceptable
   ongoing maintenance burden.
+### Risk added 2026-05-26 by cross-model review (NOT BLOCKING but TO DOCUMENT)
+- **Circularity / data-leakage in the teacher-replay channel.** Claude
+  Code traces are produced by Claude. Our default teacher pool
+  (`DEFAULT_TEACHERS`) includes `anthropic/claude-opus-4.7`. Training a
+  student on Claude's outputs while Claude is one of the teachers
+  voting on what the student should do produces a biased disagreement
+  signal: Claude's vote is correlated with the trace's existing
+  `student_action` (which Claude originally produced). This biases the
+  multi-teacher consensus toward the existing answer.
+  - **Mitigation**: when ingesting Claude Code traces, the user should
+    drop Claude from the teacher pool and use a non-Claude consensus
+    (Opus 4.7 → GPT-5 + DeepSeek V4-Pro, or any non-Claude pair).
+    Documented here; not yet enforced in code.
+  - **Open question for v0.2**: should `ClaudeCodeIngester` automatically
+    annotate the source-model field on each trace and `replay_trace`
+    automatically exclude same-family teachers? Defer the design until
+    the post-replication phase reveals whether the bias is observable.
 ### Future ingesters
 Open the door for two more ingesters in v0.2:

docs/research/WAVE_7_10_FINAL_REVIEW.md ADDED

@@ -0,0 +1,423 @@

+# Wave 7–10 Final Review — Cross-model adversarial check
+**Reviewer**: external model, Phase 11 of the deep work loop.
+**Date**: 2026-05-26.
+**Mandate**: find substantive flaws. The research thesis is
+primary-source-validated; this attacks *implementation correctness* and
+*scope creep*, not the thesis.
+---
+## (a) Are the tests real evidence or theater?
+### Spike 006 (Qwen2.5-0.5B-Instruct CPU smoke, 9 tests)
+**Verdict: mostly tautology, with usable ablation tests.**
+The headline "loss 0.7390 → 0.0031, 99.6% reduction in 5 steps" is
+technically true and substantively near-tautological:
+1. **The same fixed ~50-token batch is reused for all 5 steps.**
+   `build_batch` returns one conversation; the test loop calls
+   `compose_loss(model, batch)` five times in a row. No reshuffle, no
+   second batch, no held-out anything.
+2. **0.5B params × AdamW(lr=1e-5) × identical 50-token batch ×
+   5 steps = textbook memorization regime.** A randomly-initialized MLP
+   would also reduce loss in this setup. The test does not distinguish
+   "the 3-channel composition is correct" from "AdamW reduces fixed-batch
+   loss on any non-degenerate objective."
+3. **The SDPO channel is zero throughout** (`sdpo_jsd=0.0` on every
+   row of `loss_curve.csv`). The verdict calls this "correct fallback
+   behavior"; in practice it means *the entire SDPO channel is never
+   tested by this smoke*. The fallback is a literal `_zero(device)`.
+   `generalized_jsd_loss` has no end-to-end test on a real HF model
+   anywhere in the codebase. **This is the largest evidence gap for
+   V8.**
+4. **DPO uses dummy hard-coded reference logprobs** (`-30.0`, `-35.0`).
+   This tests that `-logsigmoid(small_positive)` is differentiable, not
+   that the trace-replay-DPO pipeline (reference-policy precompute +
+   collator + loss) wires together.
+5. The "loss decreases" assertion is `losses[-1] < losses[0]` — the
+   weakest version of monotonicity.
+**Genuine value**: model-loads test, chat-template test, the three
+α=0 / β=0 ablation tests. The ablations would catch a regression where
+weights stop disabling channels. The "5-step decrease" is the weakest
+test in the file.
+**Run.log inconsistency, not flagged anywhere**:
+`examples/qwen_05b_quickstart/run.log` shows step-1 total = 0.0379;
+the spike `verdict.md` quotes step-1 total = 0.2090 for the same code,
+same model. Either the seed isn't pinned through the model forward
+(likely — `torch.manual_seed(42)` is in `build_batch` only), or the
+package's `compose_loss` differs subtly from the spike's. **Quoting
+exact numbers from a non-reproducible run as evidence is the sloppy
+version of every research-replication scandal.**
+### Spike 007 (Claude Code ingester, 15 tests)
+**Verdict: strongest test suite of the three. Caveats apply.**
+Real engineering value:
+- Synthetic fixture exercises the actual record types (assistant,
+  user/tool_result, summary, system, sidechain). Tests assert structural
+  properties: history grows monotonically, `[THINKING]` stripped on
+  replay but kept in student_action, unique state_ids, tool_use
+  serialization, tool_result tagging.
+- `test_truncated_line_tolerated` would catch a real regression:
+  removal of the JSON-decode try/except.
+- Subagent and sidechain skip tests catch real production cases.
+Caveats:
+- **The "real session" test is hardcoded to one path on the author's
+  machine** (`/home/codeseys/.claude/projects/…/e4a34e2b-….jsonl`).
+  No env var, no fixture-discovery; the test is `skipif(not exists)`.
+  This is a manual integration test, not a CI test. ADR-002 said "CI
+  users substitute their own"; the substitution mechanism doesn't
+  exist.
+- The synthetic fixture is **author-written** and presumably designed
+  alongside the ingester. There is no scrubbed third-party fixture.
+- Acceptance criterion #3 in BACKLOG ("end-to-end smoke: real trace →
+  ingester → collator → 1-step `composer_total_loss`") is **unmet** —
+  the spike stops at "ingester emits TraceStates correctly." There is
+  no test that takes ingested records, runs the data collator, and runs
+  through `compose_loss`.
+This suite would catch real regressions in the ingester. Its weakest
+property: ships no contributor-runnable real-trace test.
+### Spike 008 (DiLoCo, 5 tests, single-process)
+**Verdict: the caveat is honest but says the test does not test what
+users will assume it tests.**
+BACKLOG acceptance criterion: *"Smoke test: 2 replicas × 4 inner steps
+× 2 outer rounds on the toy model from Spike 005, both replicas converge
+toward the same solution within tolerance."*
+What ships: **one** replica, mock manager whose `allreduce` is a
+**`passthrough` no-op** (test_diloco_smoke.py:78). This is "one
+replica's outer optimizer machinery fires," not "two replicas
+converge." The acceptance criterion was silently re-defined; the
+spike's verdict.md calls this a "limitation" but it is a redefinition.
+The recon doc (per ADR-003) claimed a "ready-to-paste" pattern with real
+shared-buffer averaging. The implementation hits a "post-hook
+sequencing bug." **Either the recon claim or the implementation is
+wrong**, and the gap is buried in verdict.md instead of being fixed.
+**Genuine value**: `test_diloco_pseudogradient_sign_convention` is the
+**single best test in all of Wave 7-10**. It pins the sign convention
+with a concrete arithmetic prediction (`final == θ_initial + nudge`)
+and reports `wrong_sign_diff` on failure. A future torchft upgrade that
+flips the sign breaks this test loudly. ADR-003 specifically flagged
+this hazard, and the test catches it. Credit where due.
+**Separate flaw in `composer_diloco.py` docstring (lines 13–28)**: the
+"wrong-sign pseudogradient combined with SGD's subtract-grad semantics
+gives net step in the local-Δ direction once momentum builds up" gloss
+is incoherent. There is no "wrong-sign" pseudogradient.
+`θ_initial − θ_local` is the exact DiLoCo paper convention; SGD's
+`p ← p − lr·g` semantics are designed for it. The test is correct; the
+prose explaining why is wrong, and will mislead anyone porting the
+convention.
+---
+## (b) Is the package a real framework or a shim?
+**Verdict: a structured shim around three real components and two
+stubs. Not yet a framework.**
+What `pip install composer-replication` delivers:
+- `compose_loss` — labeled in its own top docstring as "Do NOT use as
+  the production training loss." Re-exported as the headline package
+  API and used in the quickstart.
+- `build_batch` — a hard-coded fixed-conversation factory built for the
+  smoke (factorial / binary-search examples). Anyone using this in
+  real training is using example code as production.
+- `ClaudeCodeIngester` — real, working component. Solid.
+- `generalized_jsd_loss` — real, working (extracted from OPSD, MIT).
+- `extract_dpo_pairs`, `replay_trace`, teacher specs — real, but
+  require OpenRouter credentials + spend.
+- `ComposerReplicationTrainer` — TRL `GRPOTrainer` subclass.
+  Useful only with `[train]` extra. Not exercised end-to-end on any
+  real model in this repo.
+- `make_diloco_outer_loop` — wrapper. Useful only with `[diloco]` extra.
+What is missing for "pip install and start training":
+1. No GPU end-to-end example. The brief targets Qwen3-7B / Qwen3-32B.
+2. No CLI. `pyproject.toml` declares no `[project.scripts]`.
+3. No config schema (Hydra/Pydantic). Users hand-construct teacher
+   specs, hint generators, data collators.
+4. The `[train]` extra pulls TRL but **no integration test** of
+   `ComposerReplicationTrainer` against a real GRPO rollout exists in
+   this repo. Spike 005 used TinyLM; Spike 006 stubbed GRPO out
+   precisely to avoid TRL.
+5. **`build_batch` should not be public API.** It belongs in
+   `examples/`. Re-exporting at top level implies it is a general-purpose
+   utility.
+6. **Two sources of truth**: `composer_replication/loss.py` is a
+   near-copy of `spikes/006-…/compose_loss.py` with one import path
+   changed. The spike tests still import from the spike file. A bug fix
+   in one will not propagate. Same for `composer_diloco.py` ↔
+   `composer_replication/diloco/__init__.py`.
+Real framework value:
+- `ClaudeCodeIngester` with non-trivial logic.
+- `generalized_jsd_loss` with token-clip + temperature.
+- DiLoCo wrapper with sign-pinning test.
+- Sane package layout with optional extras for heavy deps.
+Net: **a successful directory restructure plus an installable wrapper
+around three real components and two stubs.** Calling Wave 10 "framework
+is installable with working entrypoints (✅)" is letter-of-the-law;
+the brief's "framework" connotation isn't yet earned.
+---
+## (c) ADR defensibility
+### ADR-001 (local 5090 over Modal)
+**Reasoning defensible; execution missing.** The
+"iteration cycle 25–40s vs 3–5min" argument is concrete and matches
+reality. The "verification smoke, not production" framing is correct.
+**Gap**: Spike 002a-mini was never run on the 5090 either. Phase 10 in
+DEEP_WORK_LOOP_LOG.md is ⏳ pending. ADR-001 chose the 5090 over Modal,
+and **then nothing ran on either.** No `nvidia-smi` snapshot, no GPU
+step-time CSV, no bf16 numerics check. The "rule out CPU-only blind
+spots" goal is unmet. The ADR should be marked "Accepted (execution
+deferred)" or the spike should run.
+### ADR-002 (Claude Code JSONL trace source)
+**Defensible on every dimension the ADR considers; the dimensions are
+partial.** "1,015 real sessions, zero acquisition cost" is real. License
+and schema-stability arguments are well-sourced.
+**Adversarial counter not in the ADR**: Claude Code JSONL is the most
+self-serving choice. The framework targets training a coding-agent model.
+The training data is the author's own Claude Code sessions where the
+agent was Claude. The teacher pool (Spike 001) is OpenRouter-based and
+*includes Claude*. So:
+- "student action" = what Claude did.
+- teacher pool includes Claude.
+- DPO pairs = teachers' agreement vs Claude's literal text.
+This is **circular imitation**: training a future model to imitate
+Claude using Claude's outputs as the gold reference and Claude as one
+of the disagreement teachers. The teacher-disagreement signal density
+argument from Spike 001 is strongest with diverse teachers. With this
+trace source, the student-action is locked to one teacher family,
+biasing the disagreement signal. The ADR doesn't consider this; the
+ingester README doesn't flag it. **The ADR rationalizes the easy path
+without naming the data-leakage tradeoff.**
+### ADR-003 (torchft for DiLoCo)
+**Genuinely defensible choice.** Meta-maintained library; rolling-own
+trap correctly identified; license analysis (rejecting `diloco_simple`)
+is right; sign-convention risk named and tested.
+**Gap is in delivery, not decision.** ADR-003 §Consequences §1 says:
+"2 replicas, 4 inner steps, 2 outer rounds on a TinyMLP, shared-buffer
+mock allreduce, assertions: replica equality after sync, params actually
+moved, Nesterov state populated, sync count matches expected." Spike 008
+implements one replica + passthrough manager. The ADR commits to an
+implementation that the spike does not deliver, and the gap is flagged
+only in the spike's verdict, not in the ADR.
+If the recon doc said the pattern was "ready-to-paste" but actually
+hits a sequencing bug, **the recon doc is wrong** and an adversarial
+reviewer is allowed to point that out.
+---
+## (d) Scorecard inflation
+The 5/10 → 9/10 update overstates. Test by test:
+- **Test 6 (DiLoCo integrated in runnable stack) → ✅?**
+  Letter-of-law yes, spirit no. `make_diloco_outer_loop` exists and
+  fires on one replica. **Zero references to torchft or DiLoCo in
+  `composer_trainer.py`** — DiLoCo is not integrated with the trainer.
+  No two-replica integration test, no real distributed run.
+- **Test 7 (real HF model loads + runs) → ✅?**
+  Yes — most legitimately closed item. Caveats from §(a) about depth
+  of evidence apply, but the literal test is met.
+- **Test 8 (real LLM-application trace ingested end-to-end) → ✅?**
+  Mostly yes. Ingester real and tested. **BACKLOG acceptance criterion
+  #3 ("end-to-end: real trace → ingester → collator → 1-step
+  `composer_total_loss`") is unmet.**
+- **Test 9 (framework installable with working entrypoints) → ✅?**
+  Letter-of-law yes, spirit partial. `pip install -e .` works; the
+  quickstart runs the smoke harness. Production entrypoint
+  (`ComposerReplicationTrainer` driven by a config) does not exist.
+- **Test 10 (non-author can complete the journey) → ✅?**
+  No. The supporting evidence is "Quickstart README + working
+  installable demonstrate the full path on Qwen2.5-0.5B in <5min, $0."
+  Test 10's original journey was "I have Qwen3-7B, I want a
+  Composer-style variant." The parenthetical concession in the update
+  ("For Qwen3-7B etc., GPU phase still gates the empirical demo")
+  ✅'s the item anyway.
+**Honest re-scoring**: 5/10 → **7/10 ✅, 1/10 ⚠️ partial (test 8),
+2/10 ❌ in spirit (tests 6, 10).** "9/10" overstates by ~2 points.
+---
+## (e) Commit quality
+```
+ac05fbf Wave 10 — packaging: composer_replication is now pip-installable
+d52e126 Tidy .gitignore (de-dup *.jsonl, restore section blank lines)
+a35a8d7 Spike 007: include synthetic_session.jsonl fixture in repo
+57af35d Wave 7+8+9: spikes 006/007/008 — close vision-validation gaps V2/V5/V8
+ac4bfb4 Wave 7: Phase 2-4 of deep work loop — backlog, parallel research, three ADRs
+040eff8 Wave 6: vision validation self-audit (5/10 to 9/10 in 5 days, no GPU)
+```
+- `ac05fbf`, `d52e126`: accurate.
+- `a35a8d7`: accurate. Implies `57af35d` shipped a Spike 007 that did
+  not actually run cleanly for anyone cloning before this commit. Mild
+  overclaim risk on `57af35d`.
+- **`57af35d` is the single most overclaiming commit.** Title: "close
+  vision-validation gaps V2/V5/V8."
+  - V8: closed in the weakest sense (tautology critique above).
+  - V5: structural ingestion closes; BACKLOG acceptance #3 unmet.
+  - V2: silently re-defined (one replica, no convergence).
+  Three closures claimed; one partial, one redefined.
+- **Chronology problem**: `040eff8` (Wave 6) declared the **5/10 → 9/10
+  forecast** in the commit subject. `ac4bfb4` (Wave 7, *next* commit)
+  added the BACKLOG and ADRs — i.e., the *plan* to make the forecast
+  true. `57af35d` (Wave 7-9) executed and ratified the 9/10 without
+  re-auditing whether each item was actually closed in spirit. **No
+  commit re-audits the scorecard against actually delivered evidence.**
+---
+## (f) Adversarial reviewer's strongest line of attack
+> "You have a research replication framework whose only published smoke
+> is a 5-step fixed-batch overfit on a 0.5B model on CPU, where the SDPO
+> channel is silently disabled (sdpo_jsd=0 throughout), the DPO channel
+> uses dummy reference logprobs, and the GRPO channel is replaced with
+> a stub. Of the three channels you advertise, **zero are tested
+> end-to-end on a real HF model.** Your DiLoCo integration is one
+> replica with a no-op `allreduce`. Your real-trace ingester is tested
+> against a fixture you wrote yourself plus a hardcoded path on your
+> laptop. Your scorecard moved from 5/10 to 9/10 with no GPU spend, no
+> third-party validation, and one commit that closed three vision-
+> validation gaps with one commit message. You are asking the reader to
+> believe that a $9B-startup commercial product is replicated by a CPU
+> smoke and three green test files — none of which the company itself
+> would call 'replicated.'"
+**Weakest defense**: "It's just v0.1 / smoke phase / GPU is the next
+phase." The *commit log and scorecard claim otherwise.* The defense
+"v0.1 caveat" only works if the v0.1 framing is honest at the top of
+the README and scorecard — and it is not.
+**Strongest actual defense**: the four primary-source-validated recon
+docs and Spike 001's measured cost floor. The *thesis* is credible and
+auditable. The *implementation phase* is overclaimed.
+---
+## What to fix before publishing publicly (priority order)
+### 1. Re-state the scorecard honestly (BLOCKER)
+Replace 5/10 → 9/10 with **5/10 → 7/10 ✅, 1/10 ⚠️, 2/10 ❌-spirit.**
+List the spirit-failures explicitly (test 6 trainer integration, test 8
+end-to-end, test 10 non-author). Single most important fix; everything
+else compounds on the inflated scorecard.
+### 2. Fix Spike 008's V2 claim (BLOCKER)
+Either (a) add a real two-replica multiprocessing test (ADR-003 says
+this is feasible; the spike claims it isn't — reconcile), or (b) mark
+V2 as ⚠️ partial and rewrite BACKLOG: "machinery fires on one replica,
+sign convention pinned; cross-replica convergence deferred to GPU
+phase." Pick one.
+### 3. Strengthen Spike 006 against the tautology critique
+Two cheap wins:
+- Test that loss decreases on **two alternating fixed batches** over 10
+  rounds (not just one memorized batch).
+- Test where **`alpha_sdpo=10.0` and SDPO actually fires** (truncate
+  ctx_teacher to T_s tokens for matching shape). The SDPO channel is
+  *not exercised on a real HF model anywhere* in the codebase. Largest
+  evidence gap for V8.
+### 4. Run Spike 002a-mini on the local 5090
+ADR-001 made the choice; the spike was not run. Either drop the ADR
+(decision deferred) or run the spike (~30 min wall-clock per ADR's own
+estimate). Until then, the framework has zero GPU evidence of any kind.
+### 5. Fix the run.log / verdict.md numerical inconsistency
+Quickstart run.log shows step-1=0.0379; spike verdict shows step-1=0.2090.
+Either pin the seed properly or document non-reproducibility and quote
+a band rather than exact numbers.
+### 6. Acknowledge Claude Code JSONL's circularity in ADR-002
+Add a "Risks accepted" entry naming the data-leakage concern: training
+on Claude's outputs while Claude is in the teacher pool produces a
+biased disagreement signal. Spike 007 README should also flag it.
+### 7. Decide what `compose_loss` and `build_batch` are
+Either rename to `compose_loss_smoke` (and keep
+`ComposerReplicationTrainer._compute_loss` as production), or make
+`compose_loss` actually production-grade and demote `build_batch` out
+of the public API. Shipping a production-disclaimed harness as the
+package's headline import is confusing.
+### 8. Eliminate dual sources of truth
+`spikes/006-…/compose_loss.py` ↔ `composer_replication/loss.py`, and
+`spikes/008-…/composer_diloco.py` ↔ `composer_replication/diloco/__init__.py`.
+Make the spike import from the package; delete the duplicate.
+### 9. Add the missing real-trace end-to-end test in Spike 007
+Take ingester output → Spike 005 data collator → 1 step of `compose_loss`.
+This is BACKLOG acceptance #3; ~50 lines of test code closes V5's
+spirit gap.
+### 10. Fix the sign-convention docstring in `composer_diloco.py`
+Replace the incoherent "wrong-sign + SGD subtract = right answer with
+momentum" gloss with: *"DiLoCo defines pseudo-gradient as
+`θ_initial − θ_local`; this is the negative of the local update
+direction, and standard SGD subtracts gradients, so the outer step
+moves in the local-update direction. No negation required."* The test
+is correct; the prose explaining it isn't.
+---
+## Credit where due
+- **Spike 007's `ClaudeCodeIngester`** is real, working, well-tested
+  software with non-trivial logic (sidechain skip, thinking-block
+  strip-on-replay, malformed-line tolerance). The synthetic fixture
+  exercises the structural cases properly.
+- **Spike 008's pseudogradient-sign-convention test** is the single
+  best test in all of Wave 7-10. It pins a known torchft hazard with an
+  explicit arithmetic prediction and a `wrong_sign_diff` reported on
+  failure.
+- **Spike 006's α=0 / β=0 ablation tests** would catch real regressions
+  and document channel-disable semantics.
+- **All three ADRs are properly traceable to recon documents**
+  (MODAL_RECONNAISSANCE, TRACE_SOURCE_RECONNAISSANCE,
+  DILOCO_RECONNAISSANCE). The decisions can be challenged; the *process*
+  is auditable, which is rare.
+- **Package layout** (`loss`, `batch`, `opsd`, `teacher_replay`,
+  `ingestion/claude_code`, `diloco`, `trainer`) is sane; optional
+  extras correctly avoid forcing TRL/torchft on every install.
+The work product is not zero. It is overclaimed by roughly one
+scorecard tier and one BACKLOG acceptance criterion. Fixing items
+1, 2, 3, 5 above moves the framework from "publishable with a generous
+reviewer" to "publishable with a critical reviewer." Items 4 and 6
+move it from "research replication" to "evidenced research replication."
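
The sign-convention arithmetic from item 10 can be checked by hand. The following is a minimal sketch, assuming a plain-SGD outer optimizer with no momentum and using Python floats in place of tensors. `outer_sgd_step` is a hypothetical helper written for illustration; it is not part of torchft or of this package.

```python
# Toy check of the outer-step sign convention with plain floats.
# All names here are illustrative assumptions, not torchft's API.

def outer_sgd_step(theta_initial, theta_local, lr):
    """One plain-SGD outer step (no momentum), starting from theta_initial."""
    # DiLoCo pseudo-gradient: theta_initial - theta_local.
    pseudograd = [ti - tl for ti, tl in zip(theta_initial, theta_local)]
    # Standard SGD subtracts the gradient: p <- p - lr * grad.
    return [ti - lr * g for ti, g in zip(theta_initial, pseudograd)]


theta_initial = [1.0, -2.0, 0.25]
# Pretend local training nudged every weight by +0.5.
theta_local = [t + 0.5 for t in theta_initial]

# lr=1 lands on theta_local; lr=0.5 lands halfway between the two.
full = outer_sgd_step(theta_initial, theta_local, lr=1.0)
half = outer_sgd_step(theta_initial, theta_local, lr=0.5)

# A flipped pseudo-gradient (theta_local - theta_initial) moves the wrong way.
flipped = [ti - 1.0 * (tl - ti) for ti, tl in zip(theta_initial, theta_local)]

print("full   :", full)
print("half   :", half)
print("flipped:", flipped)
```

With these inputs the flipped variant ends 0.5 below `θ_initial` on every weight, which is the failure mode the Spike 008 sign-convention test is described as catching.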

spikes/008-streaming-diloco/composer_diloco.py

@@ -11,21 +11,25 @@ Wraps `torchft.local_sgd.DiLoCo` with the framework's conventions:
 Reference: `docs/adrs/ADR-003-diloco-impl.md`.
 Sign convention (READ THIS BEFORE TOUCHING):
-    torchft's `_save_grads()` (line 324 of torchft/local_sgd.py) computes
-        grad = θ_initial - θ_local
-    and stores it as `param.grad` for the outer optimizer to consume.
-    The outer optimizer then runs `param.data -= lr * grad`, equivalently
-        θ_new = θ_local + lr * (θ_initial - θ_local)  if outer optimizer is plain SGD
-    which slurps the local-trained-θ TOWARD the initial-θ instead of away
-    from it. That looks wrong, but it's correct for SGD-with-Nesterov-momentum
-    on outer loop: the outer optimizer accumulates the negative-grad-direction
-    history, so the "wrong-sign" pseudogradient combined with SGD's "subtract
-    grad" semantics gives net "step in the local-Δ direction" once momentum
-    builds up. This is consistent with the DiLoCo paper's pseudo-code.
-    Bottom line: do NOT negate. torchft's pseudogradient sign + SGD outer
-    optimizer is the correct combination. Spike 008's
-    `test_diloco_pseudogradient_sign_convention` test catches a sign flip.
 """
 from __future__ import annotations

 Reference: `docs/adrs/ADR-003-diloco-impl.md`.
 Sign convention (READ THIS BEFORE TOUCHING):
+    DiLoCo defines pseudo-gradient as
+        pseudograd = θ_initial - θ_local
+    (per torchft's `_save_grads()` at line 324 of `torchft/local_sgd.py`).
+    This is the **negative** of the local update direction.
+    Standard SGD subtracts gradients: `p.data ← p.data - lr * grad`.
+    So when the outer optimizer runs after `restore_parameters()` puts
+    p.data back to θ_initial:
+        p.data ← θ_initial - lr * (θ_initial - θ_local)
+               = θ_initial + lr * (θ_local - θ_initial)
+    For `lr=1, momentum=0` this lands exactly at θ_local. For `lr<1` it
+    interpolates between θ_initial and θ_local. Nesterov momentum
+    accumulates the local-update direction across outer rounds.
+    No negation in our outer optimizer wrapper.
 """
 from __future__ import annotations

spikes/008-streaming-diloco/verdict.md

@@ -1,29 +1,62 @@
 # Spike 008 — VERDICT
-**Status**: ✅ PASSED
 **Date**: 2026-05-26
 **Wave**: 9
 ## Headline
 `make_diloco_outer_loop()` wraps `torchft.local_sgd.DiLoCo` (BSD-3, Meta-maintained)
-to integrate vanilla DiLoCo / Streaming DiLoCo as the outer-loop optimizer for the
-Composer Replication Framework. 5/5 unit tests pass single-process. Sign convention
-of pseudo-gradient pinned down by an explicit unit test.
-## Acceptance criteria
 | Criterion | Status |
 |---|---|
-| Outer loop machinery fires (allreduce + start_quorum + outer step) | ✅ test 1 |
 | Nesterov momentum state populated for every parameter | ✅ test 1 |
-| Pseudo-gradient sign convention verified (`θ_initial − θ_local`) | ✅ test 2 |
 | No regression in Spike 005 imports | ✅ test 3 |
 | `make_diloco_outer_loop()` factory wraps the right object | ✅ test 4 |
-| Streaming DiLoCo with 2 fragments constructs cleanly | ✅ test 5 |
-| Spike 005's 38 tests still pass | ✅ verified separately |
-## Sign convention pinned down (the most important result)
 Per torchft's `_save_grads()` (line 324 of `torchft/local_sgd.py`):
@@ -31,43 +64,43 @@ Per torchft's `_save_grads()` (line 324 of `torchft/local_sgd.py`):
 pseudograd = θ_initial − θ_local
 ```
-The outer optimizer then runs `p.data ← θ_initial − lr * pseudograd`. With
-`lr=1, momentum=0`, this resolves to `θ_local` (the outer step undoes the
-restore-to-θ_initial). The test exercises this exact math with
-`local_param_after_nudge = θ_initial + 0.5` and asserts final ≈ θ_local.
-A sign flip in either `_save_grads` or the outer optimizer would land us at
-`θ_initial - 0.5` (movement in the wrong direction). The test reports both
-values in the failure message so a future flip is immediately diagnosable.
-## What this closes
-- **V2** (DiLoCo "deferred to v0.2") in `docs/VISION_VALIDATION.md` — promotes
-  DiLoCo from "documented gap" to "real working integration with sign-convention
-  tested."
 ## What this does NOT close
-- True multi-replica convergence in single-process. The recon doc's pattern of
-  "real averaging across replicas via shared buffer" hits a sequencing bug:
-  replica A's `inner.step()` post-hook completes the entire prepare→perform
-  sync sequence BEFORE replica B's post-hook starts, so the cross-replica
-  average can't complete in time for A's outer step. This is the SAME
-  limitation torchft's own tests have — they don't test convergence in
-  single-process either. True cross-replica convergence is verified in
-  production by NCCL with two real processes. For now, single-process tests
-  verify the *machinery* (sync fires, outer optimizer steps, Nesterov state
-  populates).
-- Streaming DiLoCo with `fragment_sync_delay > 0` and overlapped sync
-  (requires CUDA streams). The framework's `make_diloco_outer_loop()` accepts
-  the parameter; Spike 008 exercises only `delay=0` (vanilla DiLoCo).
 ## Files
-- `composer_diloco.py` — `make_diloco_outer_loop()` wrapper. Documents the
-  sign convention LOUDLY (per ADR-003).
-- `tests/test_diloco_smoke.py` — 5 acceptance tests.
 ## Dependencies added
@@ -76,4 +109,4 @@ values in the failure message so a future flip is immediately diagnosable.
 ## Cost / time
 - Pure CPU, single process, no GPU.
-- Test suite: 4.7 seconds for 5 tests.

 # Spike 008 — VERDICT
+**Status**: ⚠️ PARTIAL (acceptance criterion redefined; see § "Honest re-statement" below)
 **Date**: 2026-05-26
 **Wave**: 9
+**Cross-model review**: BLOCKER 2 of `docs/research/WAVE_7_10_FINAL_REVIEW.md`
 ## Headline
 `make_diloco_outer_loop()` wraps `torchft.local_sgd.DiLoCo` (BSD-3, Meta-maintained)
+and the framework's outer-loop sign convention is pinned by an explicit unit
+test. **Cross-replica convergence is NOT verified** — the BACKLOG acceptance
+criterion required two replicas; what shipped is one replica + passthrough
+no-op `allreduce`.
+## Honest re-statement
+The BACKLOG required:
+> Smoke test: 2 replicas × 4 inner steps × 2 outer rounds on the toy model
+> from Spike 005, both replicas converge toward the same solution within
+> tolerance.
+What was attempted: the recon doc (`DILOCO_RECONNAISSANCE.md`) provided a
+"ready-to-paste" 2-replica pattern with a shared-buffer mock allreduce that
+averages tensors across replicas. **That pattern does not work in a single process**:
+each `inner.step()`'s post-hook runs `prepare_sync` + `perform_sync` to
+completion (including outer optimizer step) before yielding back to the
+caller. By the time replica B's post-hook starts, replica A has already
+finished its outer step using A's *un*-averaged pseudo-gradient. The mock
+allreduce can compute the cross-replica mean, but it can't write that mean
+back into A's `_grads[name]` buffer in time for A's outer step.
+The fix would require either:
+- A real `torch.distributed` barrier (NCCL or Gloo) — out of scope for a
+  CPU-only single-process smoke.
+- A multi-process test using `torch.multiprocessing.spawn` with two real
+  processes — feasible but ~200 LOC of additional test infrastructure that
+  would need its own review.
+What ships instead: a single-replica machinery test (`allreduce` is a no-op
+passthrough). Verifies that outer optimizer fires, Nesterov state populates,
+sign convention is correct. Does NOT verify cross-replica convergence.
+**This is a redefinition of the BACKLOG acceptance criterion.** It is documented
+explicitly in this verdict, in the test file's `_make_passthrough_manager`
+docstring, and in `composer_diloco.py`.
+## What the test suite DOES verify (5/5 pass)
 | Criterion | Status |
 |---|---|
+| `allreduce`/`start_quorum`/`should_commit` fire at the right step boundaries | ✅ test 1 |
 | Nesterov momentum state populated for every parameter | ✅ test 1 |
+| **Pseudo-gradient sign convention** verified (`θ_initial − θ_local`) with explicit arithmetic prediction | ✅ test 2 |
 | No regression in Spike 005 imports | ✅ test 3 |
 | `make_diloco_outer_loop()` factory wraps the right object | ✅ test 4 |
+| Streaming DiLoCo with 2 fragments + nonzero `fragment_sync_delay` accepts the config | ✅ test 5 |
+## Sign convention pinned (the most important result here)
 Per torchft's `_save_grads()` (line 324 of `torchft/local_sgd.py`):
 pseudograd = θ_initial − θ_local
 ```
+DiLoCo defines pseudo-gradient as `θ_initial - θ_local`. This is the
+negative of the local update direction. Standard SGD subtracts gradients
+(`p ← p - lr * grad`), so the outer step moves in the local-update
+direction. No negation needed in our outer optimizer wrapper.
+The test exercises this exact math with `local_param_after_nudge =
+θ_initial + 0.5` and asserts final ≈ `θ_local`. A sign flip in either
+`_save_grads` or the outer optimizer would land at `θ_initial - 0.5`
+(movement in the wrong direction); the test reports both values in the
+failure message so a future flip is immediately diagnosable. **This is
+the single best test in Wave 7-10** per the cross-model reviewer.
+## What this CLAIMS to close
+- **V2** (DiLoCo "deferred to v0.2") in `docs/VISION_VALIDATION.md` —
+  re-scored as **⚠️ partial**, not ✅, in the 2026-05-26 update at the
+  bottom of § 3 of that doc.
 ## What this does NOT close
+- **True multi-replica convergence** — see § "Honest re-statement" above.
+  Either needs a real `torch.distributed` test on multiple processes, or
+  a redesigned single-process pattern that overrides `_DummyWork.wait()`
+  to do lazy averaging.
+- **Trainer integration** — `ComposerReplicationTrainer` does NOT yet use
+  `make_diloco_outer_loop`. The DiLoCo wrapper is an independent context
+  manager. Wiring it into the trainer's lifecycle is a separate spike.
+- **Streaming DiLoCo with `fragment_sync_delay > 0`** (overlapped sync,
+  CUDA streams). The framework's `make_diloco_outer_loop()` accepts the
+  parameter; tests only exercise `delay=0` (vanilla DiLoCo).
 ## Files
+- `composer_diloco.py` — `make_diloco_outer_loop()` wrapper. Documents
+  the sign convention.
+- `tests/test_diloco_smoke.py` — 5 acceptance tests. Test 2 (sign
+  convention) is the highest-value test.
 ## Dependencies added
 ## Cost / time
 - Pure CPU, single process, no GPU.
+- Test suite: ~5 seconds for 5 tests.
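
The single-process sequencing problem described under "Honest re-statement" can be illustrated with a toy model. This is a sketch under stated assumptions, not a reproduction of torchft's hook mechanics: each replica holds one scalar weight, and `SharedBufferAllreduce` and `sync_and_outer_step` are hypothetical stand-ins for the mock allreduce and the post-hook's sync plus outer step.

```python
# Toy model of why a shared-buffer mock allreduce fails when replicas
# run one after another in a single process. Names are hypothetical.

class SharedBufferAllreduce:
    """Averages whatever pseudo-gradients have been contributed so far."""

    def __init__(self):
        self.contributions = []

    def __call__(self, pseudograd):
        self.contributions.append(pseudograd)
        return sum(self.contributions) / len(self.contributions)


def sync_and_outer_step(theta_initial, theta_local, allreduce, lr=1.0):
    """Runs sync and the outer step to completion before returning."""
    pseudograd = theta_initial - theta_local
    averaged = allreduce(pseudograd)
    return theta_initial - lr * averaged


theta_initial = 0.0
local = {"A": 1.0, "B": 3.0}  # each replica's weight after its inner steps

# Single process: A finishes its outer step before B has contributed.
allreduce = SharedBufferAllreduce()
sequential = {
    name: sync_and_outer_step(theta_initial, theta_local, allreduce)
    for name, theta_local in local.items()
}

# With a real barrier, both replicas would see the same cross-replica mean.
true_mean = sum(theta_initial - t for t in local.values()) / len(local)
with_barrier = {name: theta_initial - 1.0 * true_mean for name in local}

print("sequential  :", sequential)
print("with barrier:", with_barrier)
```

In this toy, replica A steps on its own un-averaged pseudo-gradient and the two replicas end at different weights, whereas the barrier version puts both at the same value. Closing that gap is what a multi-process test would need to demonstrate.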