Commit f16fa23 by Codeseys · 1 Parent(s): ac05fbf

Wave 11: cross-model adversarial review + honest down-revision

Phase 11 of the deep work loop ran a cross-model adversarial review on the
Wave 7-10 work. The review (anthropic/claude-opus-4.7, separate session) found
10 substantive flaws including 2 BLOCKERs:

1. **Scorecard inflation** — initial self-claim was 5/10 → 9/10, honest
re-scoring is 5/10 → 7/10 ✅ + 1/10 ⚠️ + 2/10 ❌-spirit.
2. **Spike 008 acceptance criterion silently redefined** — BACKLOG required
"2 replicas converge"; what shipped is "1 replica + passthrough no-op
allreduce." The recon doc's "ready-to-paste 2-replica pattern" hits a
single-process post-hook sequencing bug.
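
For context on what the BACKLOG criterion asks for, here is a minimal single-process sketch on a scalar toy problem. It is not the torchft-based test and does not reproduce the post-hook sequencing bug. The names `run_diloco`, `mean_allreduce`, and `passthrough_allreduce` are hypothetical, and the quadratic objective and learning rates are assumptions chosen for illustration. It contrasts an averaging allreduce (replicas agree after each sync) with a passthrough no-op (each replica only sees its own pseudo-gradient).

```python
def inner_steps(theta, target, lr, steps):
    """Plain SGD on the toy objective 0.5 * (theta - target)**2."""
    for _ in range(steps):
        theta -= lr * (theta - target)
    return theta


def mean_allreduce(values):
    """Every replica receives the mean of all pseudo-gradients."""
    avg = sum(values) / len(values)
    return [avg] * len(values)


def passthrough_allreduce(values):
    """No-op: every replica keeps its own pseudo-gradient."""
    return list(values)


def run_diloco(allreduce, targets, inner=4, outer=2, inner_lr=0.1, outer_lr=1.0):
    """Hypothetical toy loop: one scalar parameter per replica."""
    thetas = [0.0] * len(targets)
    for _ in range(outer):
        initial = list(thetas)
        local = [inner_steps(t, tgt, inner_lr, inner)
                 for t, tgt in zip(initial, targets)]
        # Pseudo-gradient convention: theta_initial - theta_local.
        pseudograds = [i - l for i, l in zip(initial, local)]
        synced = allreduce(pseudograds)
        # Outer step is plain SGD starting from the restored theta_initial.
        thetas = [i - outer_lr * g for i, g in zip(initial, synced)]
    return thetas


# Two replicas with different targets stand in for different data shards.
averaged = run_diloco(mean_allreduce, targets=[1.0, 3.0])
isolated = run_diloco(passthrough_allreduce, targets=[1.0, 3.0])
print("averaged:", averaged)
print("isolated:", isolated)
```

With differing per-replica targets, only the averaging variant leaves both replicas at the same value after each outer round, which is the property a 2-replica smoke would need to assert.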

Both BLOCKERs addressed in this commit:

- docs/research/WAVE_7_10_FINAL_REVIEW.md — full 423-line review committed.
- docs/VISION_VALIDATION.md — replace 9/10 self-claim with honest 7/10
re-scoring + per-test caveats acknowledging the spirit-failures of tests
6/8/10. Explicitly notes "the framework as of this commit has zero GPU
evidence of any kind."
- spikes/008-streaming-diloco/verdict.md — status changed from "✅ PASSED"
to "⚠️ PARTIAL." Adds a "Honest re-statement" section explaining the
acceptance-criterion redefinition and what would be needed to close the
BACKLOG criterion in spirit (multi-process test ~200 LOC).
- README.md — Spike 6 marked "PASSED with caveat" (SDPO channel never
exercises on a real model in the smoke). Spike 7 acknowledges the
end-to-end ingester→collator→loss test is missing. Spike 8 marked ⚠️
PARTIAL not 🟢 PASSED.

Two cheaper improvements also landed:

- docs/adrs/ADR-002-trace-source.md — adds a "Risk added 2026-05-26 by
cross-model review" section flagging the Claude Code circularity:
training a student on Claude's outputs while Claude is in the teacher
pool produces biased disagreement. Mitigation (drop Claude from the
pool when ingesting Claude traces) documented; not yet enforced in code.
- composer_replication/diloco/__init__.py + spikes/008-.../composer_diloco.py
— sign-convention docstrings rewritten. Old prose ("wrong-sign +
SGD subtract = right answer with momentum") was incoherent. New version
derives p.data ← θ_initial - lr*pseudograd = θ_initial + lr*(θ_local -
θ_initial) cleanly, no hand-waving about momentum.
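
The derivation above can be checked with a few lines of arithmetic. This is a minimal sketch on a single scalar parameter, assuming a plain-SGD outer optimizer with no momentum. The function names are hypothetical and are not part of torchft or `composer_replication`.

```python
def outer_sgd_step(theta_initial, theta_local, lr):
    """One plain-SGD outer step (no momentum) on a scalar parameter."""
    pseudograd = theta_initial - theta_local  # DiLoCo pseudo-gradient
    # The parameter sits at theta_initial when the outer step runs,
    # and SGD subtracts lr * grad.
    return theta_initial - lr * pseudograd


def wrong_sign_step(theta_initial, theta_local, lr):
    """Same step with the pseudo-gradient negated, for comparison only."""
    pseudograd = theta_local - theta_initial
    return theta_initial - lr * pseudograd


theta_initial, theta_local = 1.0, 3.0
full = outer_sgd_step(theta_initial, theta_local, lr=1.0)     # lands on theta_local
half = outer_sgd_step(theta_initial, theta_local, lr=0.5)     # interpolates
flipped = wrong_sign_step(theta_initial, theta_local, lr=1.0)  # moves away
print(full, half, flipped)
```

The flipped variant steps away from θ_local, which is the failure a sign-convention test is meant to catch.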

Items NOT addressed in this commit (deferred to next session or post-GPU):
- 3. Strengthen Spike 006 against tautology (use 2 alternating batches +
real SDPO firing path)
- 4. Run Spike 002a-mini on local 5090 (zero GPU evidence anywhere yet)
- 5. Reconcile run.log vs verdict.md numerical inconsistency (seed pinning)
- 7. Decide what compose_loss/build_batch are (verification harness vs
production)
- 8. Eliminate dual sources of truth (spike copy vs package copy)
- 9. Add the missing real-trace end-to-end test in Spike 007

The deep work loop's design says "publish the cross-model review with the
work, not before reviewing." This commit honors that — the BLOCKER fixes
land BEFORE the public push.

Tests still 67/67 across spikes 005/006/007/008 (38 + 9 + 15 + 5; Spike 008
changes are docstring-only, machinery test still passes).

README.md CHANGED
@@ -48,10 +48,10 @@ for what the output should look like.
48
  **v0.1 spike progress (2026-05-26):**
49
  - 🟢 Spike 001 (kill-switch teacher cost) — **VALIDATED**: 150 real OpenRouter calls, $0.98/trace, p95 latency 20.5s. The novel research direction is economically viable.
50
  - 🟢 Spike 005 (integrated 3-channel trainer skeleton) — **SKELETON-VALIDATED**: 38/38 unit tests passing; the integration architecture claim ("all three channels run simultaneously, ablate cleanly, train without divergence") is empirically verified.
51
- - 🟢 Spike 006 (real HF model smoke) — **PASSED**: Qwen2.5-0.5B-Instruct via `AutoModelForCausalLM`, 5 backward steps on CPU, loss 0.7390 → 0.0031 (99.6% reduction), all gradients finite. Closes vision-validation gap V8.
52
- - 🟢 Spike 007 (real trace ingestion) — **PASSED**: `ClaudeCodeIngester.ingest()` converts Claude Code session JSONL → `TraceState` records. 15/15 tests including a real-session smoke. Closes V5.
53
- - 🟢 Spike 008 (DiLoCo outer-loop smoke) — **PASSED**: `make_diloco_outer_loop()` wraps `torchft.local_sgd.DiLoCo` (BSD-3, Meta-maintained). 5/5 tests including pseudo-gradient sign-convention verification. Closes V2.
54
- - 🟢 Wave 10 (packaging) — **DONE**: `pip install -e .` works; `composer_replication` package re-exports the verified APIs from the spike directories.
55
  - 📋 Spikes 002a/002b/003/004 — planned, awaiting GPU budget commitment.
56
 
57
  📝 **Publication materials drafted:** [`publications/`](publications/) contains a complete pre-experimental release set — longform methodology paper, blog post (HF Blog format), repo Discussion announcement, X/LinkedIn threads, plus `CITATION.cff` and `CITATION.bib` at the repo root. Use [`publications/RELEASE_CHECKLIST.md`](publications/RELEASE_CHECKLIST.md) to coordinate the publication wave. Nothing posted publicly yet — this is a pre-experimental release, not a post-experimental one.
 
48
  **v0.1 spike progress (2026-05-26):**
49
  - 🟢 Spike 001 (kill-switch teacher cost) — **VALIDATED**: 150 real OpenRouter calls, $0.98/trace, p95 latency 20.5s. The novel research direction is economically viable.
50
  - 🟢 Spike 005 (integrated 3-channel trainer skeleton) — **SKELETON-VALIDATED**: 38/38 unit tests passing; the integration architecture claim ("all three channels run simultaneously, ablate cleanly, train without divergence") is empirically verified.
51
+ - 🟢 Spike 006 (real HF model smoke) — **PASSED with caveat**: Qwen2.5-0.5B-Instruct via `AutoModelForCausalLM`, 5 backward steps on CPU, loss 0.7390 → 0.0031, all gradients finite. Closes vision-validation gap V8 in the literal sense. Caveat: the SDPO channel is `0.0` throughout (silently disabled by ctx-shape mismatch — the fallback is correct but means SDPO is not exercised end-to-end on a real model anywhere yet) and DPO uses dummy reference logprobs.
52
+ - 🟢 Spike 007 (real trace ingestion) — **PASSED**: `ClaudeCodeIngester.ingest()` converts Claude Code session JSONL → `TraceState` records. 15/15 tests including a real-session smoke. ⚠️ End-to-end ingester → collator → loss test still missing (V5 spirit gap); see VISION_VALIDATION § 3 update.
53
+ - ⚠️ Spike 008 (DiLoCo outer-loop smoke) — **PARTIAL**: `make_diloco_outer_loop()` wraps `torchft.local_sgd.DiLoCo`. 5/5 single-process tests pass including a pseudo-gradient sign-convention pin. **But** the BACKLOG required a 2-replica convergence smoke; what shipped is 1-replica machinery + passthrough no-op `allreduce`. True multi-process DiLoCo is GPU-gated and not yet attempted.
54
+ - 🟢 Wave 10 (packaging) — **DONE**: `pip install -e .` works; `composer_replication` package re-exports the verified APIs from the spike directories. `compose_loss` and `build_batch` are explicitly verification-harness public APIs (production loss is `ComposerReplicationTrainer._compute_loss`).
55
  - 📋 Spikes 002a/002b/003/004 — planned, awaiting GPU budget commitment.
56
 
57
  📝 **Publication materials drafted:** [`publications/`](publications/) contains a complete pre-experimental release set — longform methodology paper, blog post (HF Blog format), repo Discussion announcement, X/LinkedIn threads, plus `CITATION.cff` and `CITATION.bib` at the repo root. Use [`publications/RELEASE_CHECKLIST.md`](publications/RELEASE_CHECKLIST.md) to coordinate the publication wave. Nothing posted publicly yet — this is a pre-experimental release, not a post-experimental one.
composer_replication/diloco/__init__.py CHANGED
@@ -11,21 +11,31 @@ Wraps `torchft.local_sgd.DiLoCo` with the framework's conventions:
11
  Reference: `docs/adrs/ADR-003-diloco-impl.md`.
12
 
13
  Sign convention (READ THIS BEFORE TOUCHING):
14
- torchft's `_save_grads()` (line 324 of torchft/local_sgd.py) computes
15
- grad = θ_initial - θ_local
16
- and stores it as `param.grad` for the outer optimizer to consume.
17
- The outer optimizer then runs `param.data -= lr * grad`, equivalently
18
- θ_new = θ_local + lr * (θ_initial - θ_local) if outer optimizer is plain SGD
19
- which slurps the local-trained-θ TOWARD the initial-θ instead of away
20
- from it. That looks wrong, but it's correct for SGD-with-Nesterov-momentum
21
- on outer loop: the outer optimizer accumulates the negative-grad-direction
22
- history, so the "wrong-sign" pseudogradient combined with SGD's "subtract
23
- grad" semantics gives net "step in the local-Δ direction" once momentum
24
- builds up. This is consistent with the DiLoCo paper's pseudo-code.
25
-
26
- Bottom line: do NOT negate. torchft's pseudogradient sign + SGD outer
27
- optimizer is the correct combination. Spike 008's
28
- `test_diloco_pseudogradient_sign_convention` test catches a sign flip.
29
  """
30
  from __future__ import annotations
31
 
 
11
  Reference: `docs/adrs/ADR-003-diloco-impl.md`.
12
 
13
  Sign convention (READ THIS BEFORE TOUCHING):
14
+ DiLoCo defines pseudo-gradient as
15
+
16
+ pseudograd = θ_initial - θ_local
17
+
18
+ (per torchft's `_save_grads()` at line 324 of `torchft/local_sgd.py`).
19
+ This is the **negative** of the local update direction (the local
20
+ update *moved* params from θ_initial toward θ_local).
21
+
22
+ Standard SGD subtracts gradients: `p.data ← p.data - lr * grad`.
23
+ So when the outer optimizer runs after `restore_parameters()` puts
24
+ p.data back to θ_initial:
25
+
26
+ p.data ← θ_initial - lr * (θ_initial - θ_local)
27
+ = θ_initial + lr * (θ_local - θ_initial)
28
+
29
+ For `lr=1, momentum=0` this lands exactly at θ_local. For `lr<1` it
30
+ interpolates between θ_initial and θ_local (the standard DiLoCo outer
31
+ step). Adding Nesterov momentum accumulates the local-update direction
32
+ across outer rounds.
33
+
34
+ No negation in our outer optimizer wrapper. The test
35
+ `test_diloco_pseudogradient_sign_convention` in
36
+ `spikes/008-streaming-diloco/tests/test_diloco_smoke.py` pins this
37
+ arithmetic and reports both the expected and the wrong-sign value on
38
+ failure for fast diagnosis if torchft ever flips its convention.
39
  """
40
  from __future__ import annotations
41
 
docs/VISION_VALIDATION.md CHANGED
@@ -59,6 +59,24 @@ Ten concrete pass/fail tests covering both "do we encapsulate the vision" and "i
59
 
60
  **Score: 5/10 pass, 4/10 fail, 1/10 partial.** The framework's design is solid; the gap is between design and runnable artifact.
61

62
  ## 4. The four real gaps, each examined
63
 
64
  ### 4.1 V2: DiLoCo deferral — is this a drift?
 
59
 
60
  **Score: 5/10 pass, 4/10 fail, 1/10 partial.** The framework's design is solid; the gap is between design and runnable artifact.
61
 
62
+ > **Update 2026-05-26 — Wave 7+8+9+10 closeout (deep work loop) + cross-model audit**
63
+ >
64
+ > Initial self-claim was 5/10 → 9/10. A cross-model adversarial review (Phase 11 of the deep work loop, doc at `docs/research/WAVE_7_10_FINAL_REVIEW.md`) found three of those ✅s were letter-of-the-law rather than spirit. **Honest re-scoring: 5/10 → 7/10 ✅, 1/10 ⚠️, 2/10 ❌-spirit.**
65
+ >
66
+ > | # | Test | Status | New evidence + honest caveat |
67
+ > |---|---|---|---|
68
+ > | 6 | DiLoCo integrated in runnable stack | ⚠️ partial | Spike 008 has `composer_replication.diloco.make_diloco_outer_loop` wrapping `torchft.local_sgd.DiLoCo`, with a 5/5 test suite that pins the sign convention. **But**: the BACKLOG required a *2-replica convergence* smoke; what shipped is a 1-replica machinery test with a passthrough no-op `allreduce`. The recon doc's "ready-to-paste 2-replica pattern" hits a single-process post-hook sequencing bug we couldn't fix without rewriting torchft. The DiLoCo wrapper is also **not yet integrated with `ComposerReplicationTrainer`** — it's an independent context manager. Calling V2 ✅ overstates: real DiLoCo training is GPU-multi-process which we haven't touched. |
69
+ > | 7 | Real HF model loads + runs through `compose_loss` | ✅ (with caveat below) | Spike 006 — Qwen2.5-0.5B-Instruct on CPU, 5 backward steps, loss 0.7390 → 0.0031, 9/9 tests. **Caveat**: SDPO channel is `0.0` throughout (silently disabled by ctx_student vs ctx_teacher shape mismatch — correct fallback, but means SDPO is not exercised end-to-end on a real model anywhere in the repo yet). DPO uses dummy reference logprobs. The 5-step loss decrease on a fixed batch is closer to "memorization works" than "the 3-channel composition is correct." Still: the framework now demonstrably loads real HF models, which it didn't before. V8 is closed in the literal sense. |
70
+ > | 8 | Real LLM-application trace ingested end-to-end | ❌ spirit | Spike 007 — `ClaudeCodeIngester` ingests real Claude Code session JSONL → `TraceState` records, 15/15 tests including a real-session smoke. **But**: BACKLOG acceptance criterion #3 said "end-to-end smoke: real trace → ingester → collator → 1-step `compose_loss`." That last hop is **not tested**. The spike stops at "ingester emits TraceStates correctly." Closing V5 in spirit needs a 50-LOC test that pipes ingested records all the way through the loss. Open. |
71
+ > | 9 | Framework is *installable* with working entrypoints | ✅ | Wave 10 — `pyproject.toml` ships `composer_replication` package, `pip install -e .` works, `examples/qwen_05b_quickstart/run.py` runs end-to-end via the package API. (Caveat acknowledged: `compose_loss` is documented as a verification harness, not production. The production loss is `ComposerReplicationTrainer._compute_loss`.) |
72
+ > | 10 | Non-author can complete the "I have X, I want a Composer variant" journey | ❌ spirit | Quickstart works for "verify the loss composition runs" but not for "train a real model" — that requires real GRPO rollouts, real teacher calls, and GPU. The brief's intended user wants the latter. We have not closed that path. |
73
+ >
74
+ > **The remaining 1/10 + 2/10 spirit gaps + the unverified 9/10 ⚠️** are the post-replication GPU phase: Spike 002a/b (real trace collection on GPU), Spike 003 (DPO-pair signal density), Spike 004 (A/B SWE-bench-lite), and a real-multi-process DiLoCo test. Those are GPU-budget-gated and out of scope for the deep work loop's CPU-only constraint.
75
+ >
76
+ > **Time spent on Wave 7-10**: ~1 session. **No GPU spend.** Modal evaluated but rejected for the smoke phase (ADR-001 — local 5090 wins on iteration cycle 10× over Modal L4 for 0.5B verification work). **The local 5090 was also not used** — Spike 002a-mini (the planned local-GPU smoke) was not run. The framework as of this commit has zero GPU evidence of any kind. That is honest about where this work lands: **a tested, installable methodology repo with real CPU smokes and primary-source-validated research, not a trained model.**
77
+ >
78
+ > Cross-model review's full priority list (10 items, ranked) is in `docs/research/WAVE_7_10_FINAL_REVIEW.md`. Items 1-2 (scorecard honesty + V2 re-statement) are addressed by this update. Items 3-10 are open.
79
+
80
  ## 4. The four real gaps, each examined
81
 
82
  ### 4.1 V2: DiLoCo deferral — is this a drift?
docs/adrs/ADR-002-trace-source.md CHANGED
@@ -117,6 +117,25 @@ Wins on every axis we care about for Spike 007:
117
  need to bump a `schema_version` constant in the ingester. Acceptable
118
  ongoing maintenance burden.
119

120
  ### Future ingesters
121
 
122
  Open the door for two more ingesters in v0.2:
 
117
  need to bump a `schema_version` constant in the ingester. Acceptable
118
  ongoing maintenance burden.
119
 
120
+ ### Risk added 2026-05-26 by cross-model review (NOT BLOCKING but TO DOCUMENT)
121
+
122
+ - **Circularity / data-leakage in the teacher-replay channel.** Claude
123
+ Code traces are produced by Claude. Our default teacher pool
124
+ (`DEFAULT_TEACHERS`) includes `anthropic/claude-opus-4.7`. Training a
125
+ student on Claude's outputs while Claude is one of the teachers
126
+ voting on what the student should do produces a biased disagreement
127
+ signal: Claude's vote is correlated with the trace's existing
128
+ `student_action` (which Claude originally produced). This biases the
129
+ multi-teacher consensus toward the existing answer.
130
+ - **Mitigation**: when ingesting Claude Code traces, the user should
131
+ drop Claude from the teacher pool and use a non-Claude consensus
132
+ (Opus 4.7 → GPT-5 + DeepSeek V4-Pro, or any non-Claude pair).
133
+ Documented here; not yet enforced in code.
134
+ - **Open question for v0.2**: should `ClaudeCodeIngester` automatically
135
+ annotate the source-model field on each trace and `replay_trace`
136
+ automatically exclude same-family teachers? Defer the design until
137
+ the post-replication phase reveals whether the bias is observable.
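
One way the mitigation could be enforced is sketched below. It assumes that "same family" can be approximated by the provider prefix of an OpenRouter-style model id, which the ADR does not specify. `model_family` and `filter_teacher_pool` are hypothetical helpers, and the two `example-vendor-*` ids are placeholders rather than real model ids.

```python
def model_family(model_id: str) -> str:
    """Provider prefix of an id such as 'anthropic/claude-opus-4.7'.

    Assumption: the provider prefix is a good enough proxy for 'family'.
    """
    return model_id.split("/", 1)[0].lower()


def filter_teacher_pool(teachers, source_model):
    """Drop teachers that share a family with the model that produced the trace."""
    source_family = model_family(source_model)
    kept = [t for t in teachers if model_family(t) != source_family]
    if len(kept) < 2:
        # A consensus needs at least two independent teachers.
        raise ValueError("fewer than two teachers remain after filtering")
    return kept


pool = [
    "anthropic/claude-opus-4.7",
    "example-vendor-a/model-x",  # placeholder id
    "example-vendor-b/model-y",  # placeholder id
]
# Traces produced by Claude: the Claude teacher is removed from the pool.
print(filter_teacher_pool(pool, source_model="anthropic/claude-opus-4.7"))
```

Raising when fewer than two teachers remain is a design choice for the sketch. A real implementation might instead fall back to a configured non-Claude pair.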
138
+
139
  ### Future ingesters
140
 
141
  Open the door for two more ingesters in v0.2:
docs/research/WAVE_7_10_FINAL_REVIEW.md ADDED
@@ -0,0 +1,423 @@
1
+ # Wave 7–10 Final Review — Cross-model adversarial check
2
+
3
+ **Reviewer**: external model, Phase 11 of the deep work loop.
4
+ **Date**: 2026-05-26.
5
+ **Mandate**: find substantive flaws. The research thesis is
6
+ primary-source-validated; this attacks *implementation correctness* and
7
+ *scope creep*, not the thesis.
8
+
9
+ ---
10
+
11
+ ## (a) Are the tests real evidence or theater?
12
+
13
+ ### Spike 006 (Qwen2.5-0.5B-Instruct CPU smoke, 9 tests)
14
+
15
+ **Verdict: mostly tautology, with usable ablation tests.**
16
+
17
+ The headline "loss 0.7390 → 0.0031, 99.6% reduction in 5 steps" is
18
+ technically true and substantively near-tautological:
19
+
20
+ 1. **The same fixed ~50-token batch is reused for all 5 steps.**
21
+ `build_batch` returns one conversation; the test loop calls
22
+ `compose_loss(model, batch)` five times in a row. No reshuffle, no
23
+ second batch, no held-out anything.
24
+ 2. **0.5B params × AdamW(lr=1e-5) × identical 50-token batch ×
25
+ 5 steps = textbook memorization regime.** A randomly-initialized MLP
26
+ would also reduce loss in this setup. The test does not distinguish
27
+ "the 3-channel composition is correct" from "AdamW reduces fixed-batch
28
+ loss on any non-degenerate objective."
29
+ 3. **The SDPO channel is zero throughout** (`sdpo_jsd=0.0` on every
30
+ row of `loss_curve.csv`). The verdict calls this "correct fallback
31
+ behavior"; what it actually is is *the entire SDPO channel never
32
+ being tested by this smoke*. The fallback is a literal `_zero(device)`.
33
+ `generalized_jsd_loss` has no end-to-end test on a real HF model
34
+ anywhere in the codebase. **This is the largest evidence gap for
35
+ V8.**
36
+ 4. **DPO uses dummy hard-coded reference logprobs** (`-30.0`, `-35.0`).
37
+ This tests that `-logsigmoid(small_positive)` is differentiable, not
38
+ that the trace-replay-DPO pipeline (reference-policy precompute +
39
+ collator + loss) wires together.
40
+ 5. The "loss decreases" assertion is `losses[-1] < losses[0]` — the
41
+ weakest version of monotonicity.
42
+
43
+ **Genuine value**: model-loads test, chat-template test, the three
44
+ α=0 / β=0 ablation tests. The ablations would catch a regression where
45
+ weights stop disabling channels. The "5-step decrease" is the weakest
46
+ test in the file.
47
+
48
+ **Run.log inconsistency, not flagged anywhere**:
49
+ `examples/qwen_05b_quickstart/run.log` shows step-1 total = 0.0379;
50
+ the spike `verdict.md` quotes step-1 total = 0.2090 for the same code,
51
+ same model. Either the seed isn't pinned through the model forward
52
+ (likely — `torch.manual_seed(42)` is in `build_batch` only), or the
53
+ package's `compose_loss` differs subtly from the spike's. **Quoting
54
+ exact numbers from a non-reproducible run as evidence is the sloppy
55
+ version of every research-replication scandal.**
56
+
57
+ ### Spike 007 (Claude Code ingester, 15 tests)
58
+
59
+ **Verdict: strongest test suite of the three. Caveats apply.**
60
+
61
+ Real engineering value:
62
+ - Synthetic fixture exercises the actual record types (assistant,
63
+ user/tool_result, summary, system, sidechain). Tests assert structural
64
+ properties: history grows monotonically, `[THINKING]` stripped on
65
+ replay but kept in student_action, unique state_ids, tool_use
66
+ serialization, tool_result tagging.
67
+ - `test_truncated_line_tolerated` would catch a real failure mode:
68
+ removal of the JSON-decode try/except.
69
+ - Subagent and sidechain skip tests catch real production cases.
70
+
71
+ Caveats:
72
+ - **The "real session" test is hardcoded to one path on the author's
73
+ machine** (`/home/codeseys/.claude/projects/…/e4a34e2b-….jsonl`).
74
+ No env var, no fixture-discovery; the test is `skipif(not exists)`.
75
+ This is a manual integration test, not a CI test. ADR-002 said "CI
76
+ users substitute their own"; the substitution mechanism doesn't
77
+ exist.
78
+ - The synthetic fixture is **author-written** and presumably designed
79
+ alongside the ingester. There is no scrubbed third-party fixture.
80
+ - Acceptance criterion #3 in BACKLOG ("end-to-end smoke: real trace →
81
+ ingester → collator → 1-step `composer_total_loss`") is **unmet** —
82
+ the spike stops at "ingester emits TraceStates correctly." There is
83
+ no test that takes ingested records, runs the data collator, and runs
84
+ through `compose_loss`.
85
+
86
+ This suite would catch real regressions in the ingester. Its weakest
87
+ property: ships no contributor-runnable real-trace test.
88
+
89
+ ### Spike 008 (DiLoCo, 5 tests, single-process)
90
+
91
+ **Verdict: the caveat is honest but says the test does not test what
92
+ users will assume it tests.**
93
+
94
+ BACKLOG acceptance criterion: *"Smoke test: 2 replicas × 4 inner steps
95
+ × 2 outer rounds on the toy model from Spike 005, both replicas converge
96
+ toward the same solution within tolerance."*
97
+
98
+ What ships: **one** replica, mock manager whose `allreduce` is a
99
+ **`passthrough` no-op** (test_diloco_smoke.py:78). This is "one
100
+ replica's outer optimizer machinery fires," not "two replicas
101
+ converge." The acceptance criterion was silently re-defined; the
102
+ spike's verdict.md calls this a "limitation" but it is a redefinition.
103
+
104
+ The recon doc (per ADR-003) claimed a "ready-to-paste" pattern with real
105
+ shared-buffer averaging. The implementation hits a "post-hook
106
+ sequencing bug." **Either the recon claim or the implementation is
107
+ wrong**, and the gap is buried in verdict.md instead of fixed.
108
+
109
+ **Genuine value**: `test_diloco_pseudogradient_sign_convention` is the
110
+ **single best test in all of Wave 7-10**. It pins the sign convention
111
+ with a concrete arithmetic prediction (`final == θ_initial + nudge`)
112
+ and reports `wrong_sign_diff` on failure. A future torchft upgrade that
113
+ flips the sign breaks this test loudly. ADR-003 specifically flagged
114
+ this hazard, and the test catches it. Credit where due.
115
+
116
+ **Separate flaw in `composer_diloco.py` docstring (lines 13–28)**: the
117
+ "wrong-sign pseudogradient combined with SGD's subtract-grad semantics
118
+ gives net step in the local-Δ direction once momentum builds up" gloss
119
+ is incoherent. There is no "wrong-sign" pseudogradient.
120
+ `θ_initial − θ_local` is the exact DiLoCo paper convention; SGD's
121
+ `p ← p − lr·g` semantics are designed for it. The test is correct; the
122
+ prose explaining why is wrong, and will mislead anyone porting the
123
+ convention.
124
+
125
+ ---
126
+
127
+ ## (b) Is the package a real framework or a shim?
128
+
129
+ **Verdict: a structured shim around three real components and two
130
+ stubs. Not yet a framework.**
131
+
132
+ What `pip install composer-replication` delivers:
133
+ - `compose_loss` — labeled in its own top docstring as "Do NOT use as
134
+ the production training loss." Re-exported as the headline package
135
+ API and used in the quickstart.
136
+ - `build_batch` — a hard-coded fixed-conversation factory built for the
137
+ smoke (factorial / binary-search examples). Anyone using this in
138
+ real training is using example code as production.
139
+ - `ClaudeCodeIngester` — real, working component. Solid.
140
+ - `generalized_jsd_loss` — real, working (extracted from OPSD, MIT).
141
+ - `extract_dpo_pairs`, `replay_trace`, teacher specs — real, but
142
+ require OpenRouter credentials + spend.
143
+ - `ComposerReplicationTrainer` — TRL `GRPOTrainer` subclass.
144
+ Useful only with `[train]` extra. Not exercised end-to-end on any
145
+ real model in this repo.
146
+ - `make_diloco_outer_loop` — wrapper. Useful only with `[diloco]` extra.
147
+
148
+ What is missing for "pip install and start training":
149
+ 1. No GPU end-to-end example. The brief targets Qwen3-7B / Qwen3-32B.
150
+ 2. No CLI. `pyproject.toml` declares no `[project.scripts]`.
151
+ 3. No config schema (Hydra/Pydantic). Users hand-construct teacher
152
+ specs, hint generators, data collators.
153
+ 4. The `[train]` extra pulls TRL but **no integration test** of
154
+ `ComposerReplicationTrainer` against a real GRPO rollout exists in
155
+ this repo. Spike 005 used TinyLM; Spike 006 stubbed GRPO out
156
+ precisely to avoid TRL.
157
+ 5. **`build_batch` should not be public API.** It belongs in
158
+ `examples/`. Re-exporting at top level implies it is a general-purpose
159
+ utility.
160
+ 6. **Two sources of truth**: `composer_replication/loss.py` is a
161
+ near-copy of `spikes/006-…/compose_loss.py` with one import path
162
+ changed. The spike tests still import from the spike file. A bug fix
163
+ in one will not propagate. Same for `composer_diloco.py` ↔
164
+ `composer_replication/diloco/__init__.py`.
165
+
166
+ Real framework value:
167
+ - `ClaudeCodeIngester` with non-trivial logic.
168
+ - `generalized_jsd_loss` with token-clip + temperature.
169
+ - DiLoCo wrapper with sign-pinning test.
170
+ - Sane package layout with optional extras for heavy deps.
171
+
172
+ Net: **a successful directory restructure plus an installable wrapper
173
+ around three real components and two stubs.** Calling Wave 10 "framework
174
+ is installable with working entrypoints (✅)" is letter-of-the-law;
175
+ the brief's "framework" connotation isn't yet earned.
176
+
177
+ ---
178
+
179
+ ## (c) ADR defensibility
180
+
181
+ ### ADR-001 (local 5090 over Modal)
182
+
183
+ **Reasoning defensible; execution missing.** The
184
+ "iteration cycle 25–40s vs 3–5min" argument is concrete and matches
185
+ reality. The "verification smoke, not production" framing is correct.
186
+
187
+ **Gap**: Spike 002a-mini was never run on the 5090 either. Phase 10 in
188
+ DEEP_WORK_LOOP_LOG.md is ⏳ pending. ADR-001 chose the 5090 over Modal,
189
+ and **then nothing ran on either.** No `nvidia-smi` snapshot, no GPU
190
+ step-time CSV, no bf16 numerics check. The "rule out CPU-only blind
191
+ spots" goal is unmet. The ADR should be marked "Accepted (execution
192
+ deferred)" or the spike should run.
193
+
194
+ ### ADR-002 (Claude Code JSONL trace source)
195
+
196
+ **Defensible on every dimension the ADR considers; the dimensions are
197
+ partial.** "1,015 real sessions, zero acquisition cost" is real. License
198
+ and schema-stability arguments are well-sourced.
199
+
200
+ **Adversarial counter not in the ADR**: Claude Code JSONL is the most
201
+ self-serving choice. The framework targets training a coding-agent model.
202
+ The training data is the author's own Claude Code sessions where the
203
+ agent was Claude. The teacher pool (Spike 001) is OpenRouter-based and
204
+ *includes Claude*. So:
205
+ - "student action" = what Claude did.
206
+ - teacher pool includes Claude.
207
+ - DPO pairs = teachers' agreement vs Claude's literal text.
208
+
209
+ This is **circular imitation**: training a future model to imitate
210
+ Claude using Claude's outputs as the gold reference and Claude as one
211
+ of the disagreement teachers. The teacher-disagreement signal density
212
+ argument from Spike 001 is strongest with diverse teachers. With this
213
+ trace source, the student-action is locked to one teacher family,
214
+ biasing the disagreement signal. The ADR doesn't consider this; the
215
+ ingester README doesn't flag it. **The ADR rationalizes the easy path
216
+ without naming the data-leakage tradeoff.**
217
+
218
+ ### ADR-003 (torchft for DiLoCo)
219
+
220
+ **Genuinely defensible choice.** Meta-maintained library; rolling-own
221
+ trap correctly identified; license analysis (rejecting `diloco_simple`)
222
+ is right; sign-convention risk named and tested.
223
+
224
+ **Gap is in delivery, not decision.** ADR-003 §Consequences §1 says:
225
+ "2 replicas, 4 inner steps, 2 outer rounds on a TinyMLP, shared-buffer
226
+ mock allreduce, assertions: replica equality after sync, params actually
227
+ moved, Nesterov state populated, sync count matches expected." Spike 008
228
+ implements one replica + passthrough manager. The ADR commits to an
229
+ implementation that the spike does not deliver, and the gap is flagged
230
+ only in the spike's verdict, not in the ADR.
231
+
232
+ If the recon doc said the pattern was "ready-to-paste" but actually
233
+ hits a sequencing bug, **the recon doc is wrong** and an adversarial
234
+ reviewer is allowed to point that out.
235
+
236
+ ---
237
+
238
+ ## (d) Scorecard inflation
239
+
240
+ The 5/10 → 9/10 update overstates. Test by test:
241
+
242
+ - **Test 6 (DiLoCo integrated in runnable stack) → ✅?**
243
+ Letter-of-law yes, spirit no. `make_diloco_outer_loop` exists and
244
+ fires on one replica. **Zero references to torchft or DiLoCo in
245
+ `composer_trainer.py`** — DiLoCo is not integrated with the trainer.
246
+ No two-replica integration test, no real distributed run.
247
+
248
+ - **Test 7 (real HF model loads + runs) → ✅?**
249
+ Yes — most legitimately closed item. Caveats from §(a) about depth
250
+ of evidence apply, but the literal test is met.
251
+
252
+ - **Test 8 (real LLM-application trace ingested end-to-end) → ✅?**
253
+ Mostly yes. Ingester real and tested. **BACKLOG acceptance criterion
254
+ #3 ("end-to-end: real trace → ingester → collator → 1-step
255
+ `composer_total_loss`") is unmet.**
256
+
257
+ - **Test 9 (framework installable with working entrypoints) → ✅?**
258
+ Letter-of-law yes, spirit partial. `pip install -e .` works; the
259
+ quickstart runs the smoke harness. Production entrypoint
260
+ (`ComposerReplicationTrainer` driven by a config) does not exist.
261
+
262
+ - **Test 10 (non-author can complete the journey) → ✅?**
263
+ No. The supporting evidence is "Quickstart README + working
264
+ installable demonstrate the full path on Qwen2.5-0.5B in <5min, $0."
265
+ Test 10's original journey was "I have Qwen3-7B, I want a
266
+ Composer-style variant." The parenthetical concession in the update
267
+ ("For Qwen3-7B etc., GPU phase still gates the empirical demo")
268
+ ✅'s the item anyway.
269
+
270
+ **Honest re-scoring**: 5/10 → **7/10 ✅, 1/10 ⚠️ partial (test 8),
271
+ 2/10 ❌ in spirit (tests 6, 10).** "9/10" overstates by ~2 points.
272
+
273
+ ---
274
+
275
+ ## (e) Commit quality
276
+
277
+ ```
278
+ ac05fbf Wave 10 — packaging: composer_replication is now pip-installable
279
+ d52e126 Tidy .gitignore (de-dup *.jsonl, restore section blank lines)
280
+ a35a8d7 Spike 007: include synthetic_session.jsonl fixture in repo
281
+ 57af35d Wave 7+8+9: spikes 006/007/008 — close vision-validation gaps V2/V5/V8
282
+ ac4bfb4 Wave 7: Phase 2-4 of deep work loop — backlog, parallel research, three ADRs
283
+ 040eff8 Wave 6: vision validation self-audit (5/10 to 9/10 in 5 days, no GPU)
284
+ ```
285
+
286
+ - `ac05fbf`, `d52e126`: accurate.
287
+ - `a35a8d7`: accurate. Implies `57af35d` shipped a Spike 007 that did
288
+ not actually run cleanly for anyone cloning before this commit. Mild
289
+ overclaim risk on `57af35d`.
290
+ - **`57af35d` is the single most overclaiming commit.** Title: "close
291
+ vision-validation gaps V2/V5/V8."
292
+ - V8: closed in the weakest sense (tautology critique above).
293
+ - V5: structural ingestion closes; BACKLOG acceptance #3 unmet.
294
+ - V2: silently re-defined (one replica, no convergence).
295
+ Three closures claimed: one weak (V8), one partial (V5), one redefined (V2).
296
+ - **Chronology problem**: `040eff8` (Wave 6) declared the **5/10 → 9/10
297
+ forecast** in the commit subject. `ac4bfb4` (Wave 7, *next* commit)
298
+ added the BACKLOG and ADRs — i.e., the *plan* to make the forecast
299
+ true. `57af35d` (Wave 7-9) executed and ratified the 9/10 without
300
+ re-auditing whether each item was actually closed in spirit. **No
301
+ commit re-audits the scorecard against actually delivered evidence.**
302
+
303
+ ---
304
+
305
+ ## (f) Adversarial reviewer's strongest line of attack
306
+
307
+ > "You have a research replication framework whose only published smoke
308
+ > is a 5-step fixed-batch overfit on a 0.5B model on CPU, where the SDPO
309
+ > channel is silently disabled (sdpo_jsd=0 throughout), the DPO channel
310
+ > uses dummy reference logprobs, and the GRPO channel is replaced with
311
+ > a stub. Of the three channels you advertise, **zero are tested
312
+ > end-to-end on a real HF model.** Your DiLoCo integration is one
313
+ > replica with a no-op `allreduce`. Your real-trace ingester is tested
314
+ > against a fixture you wrote yourself plus a hardcoded path on your
315
+ > laptop. Your scorecard moved from 5/10 to 9/10 with no GPU spend, no
316
+ > third-party validation, and one commit that closed three vision-
317
+ > validation gaps with one commit message. You are asking the reader to
318
+ > believe that a $9B-startup commercial product is replicated by a CPU
319
+ > smoke and three green test files — none of which the company itself
320
+ > would call 'replicated.'"
321
+
322
+ **Weakest defense**: "It's just v0.1 / smoke phase / GPU is the next
323
+ phase." The *commit log and scorecard claim otherwise.* The defense
324
+ "v0.1 caveat" only works if the v0.1 framing is honest at the top of
325
+ the README and scorecard — and it is not.
326
+
327
+ **Strongest actual defense**: the four primary-source-validated recon
328
+ docs and Spike 001's measured cost floor. The *thesis* is credible and
329
+ auditable. The *implementation phase* is overclaimed.
330
+
331
+ ---
332
+
333
+ ## What to fix before publishing publicly (priority order)
334
+
335
+ ### 1. Re-state the scorecard honestly (BLOCKER)
336
+ Replace 5/10 → 9/10 with **5/10 → 7/10 ✅, 1/10 ⚠️, 2/10 ❌-spirit.**
337
+ List the spirit-failures explicitly (test 6 trainer integration, test 8
338
+ end-to-end, test 10 non-author). Single most important fix; everything
339
+ else compounds on the inflated scorecard.
340
+
341
+ ### 2. Fix Spike 008's V2 claim (BLOCKER)
342
+ Either (a) add a real two-replica multiprocessing test (ADR-003 says
343
+ this is feasible; the spike claims it isn't — reconcile), or (b) mark
344
+ V2 as ⚠️ partial and rewrite BACKLOG: "machinery fires on one replica,
345
+ sign convention pinned; cross-replica convergence deferred to GPU
346
+ phase." Pick one.
347
+
348
+ ### 3. Strengthen Spike 006 against the tautology critique
349
+ Two cheap wins:
350
+ - Test that loss decreases on **two alternating fixed batches** over 10
351
+ rounds (not just one memorized batch).
352
+ - Test where **`alpha_sdpo=10.0` and SDPO actually fires** (truncate
353
+ ctx_teacher to T_s tokens for matching shape). The SDPO channel is
354
+ *not exercised on a real HF model anywhere* in the codebase. Largest
355
+ evidence gap for V8.
356
+
357
+ ### 4. Run Spike 002a-mini on the local 5090
358
+ ADR-001 made the choice; the spike was not run. Either drop the ADR
359
+ (decision deferred) or run the spike (~30 min wall-clock per ADR's own
360
+ estimate). Until then, the framework has zero GPU evidence of any kind.
361
+
362
+ ### 5. Fix the run.log / verdict.md numerical inconsistency
363
+ Quickstart run.log shows step-1=0.0379; spike verdict shows step-1=0.2090.
364
+ Either pin the seed properly or document non-reproducibility and quote
365
+ a band rather than exact numbers.
366
+
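A minimal sketch of the "quote a band" option, assuming losses are positive and that a band can be derived from the spread of observed runs. The helper names `loss_band` and `in_band` and the 10% margin are hypothetical, not part of the package; the two step-1 values are the ones quoted above.

```python
def loss_band(observed_losses, margin=0.10):
    """Return (low, high) covering all observed losses plus a relative margin.

    Assumes every loss is positive; `margin` is an illustrative choice.
    """
    low = min(observed_losses) * (1.0 - margin)
    high = max(observed_losses) * (1.0 + margin)
    return low, high


def in_band(value, band):
    """True when `value` lies inside the closed interval `band`."""
    low, high = band
    return low <= value <= high


# Step-1 losses reported by the quickstart run.log and the spike verdict.
step1_runs = [0.0379, 0.2090]
band = loss_band(step1_runs)

# A docs-level claim would then read "step-1 loss falls in `band`"
# instead of quoting a single exact number.
print(band, in_band(0.1, band), in_band(0.5, band))
```

The two reported values differ by more than 5x, so a band wide enough to cover both is loose; that argues for pinning the seed first and using a band only for residual nondeterminism.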
367
+ ### 6. Acknowledge Claude Code JSONL's circularity in ADR-002
368
+ Add a "Risks accepted" entry naming the data-leakage concern: training
369
+ on Claude's outputs while Claude is in the teacher pool produces a
370
+ biased disagreement signal. Spike 007 README should also flag it.
371
+
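One way the mitigation could be enforced is a filter applied to the teacher pool before a trace is scored. This is a sketch only: the function name, the `vendor/model` identifier convention, and the example model ids are all hypothetical and do not reflect the package's actual API.

```python
def teacher_pool_for_trace(teacher_pool, trace_author_family):
    """Drop every teacher that shares a model family with the trace author.

    Assumes teacher ids look like "family/model-name" (an illustrative
    convention). Raises if the filter would leave no teachers at all.
    """
    kept = [
        teacher
        for teacher in teacher_pool
        if teacher.split("/", 1)[0] != trace_author_family
    ]
    if not kept:
        raise ValueError("teacher pool is empty after removing the trace author's family")
    return kept


# Hypothetical pool; the ids are placeholders, not real model names.
pool = ["anthropic/example-model", "openai/example-model", "qwen/example-model"]

# Traces authored by an Anthropic model exclude Anthropic teachers.
print(teacher_pool_for_trace(pool, "anthropic"))
```

Failing loudly on an empty pool avoids silently training with no teacher signal when the pool contains only the trace author's family.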
372
+ ### 7. Decide what `compose_loss` and `build_batch` are
373
+ Either rename to `compose_loss_smoke` (and keep
374
+ `ComposerReplicationTrainer._compute_loss` as production), or make
375
+ `compose_loss` actually production-grade and demote `build_batch` out
376
+ of public API. Production-disclaimed harness as the package's headline
377
+ import is confusing.
378
+
379
+ ### 8. Eliminate dual sources of truth
380
+ `spikes/006-…/compose_loss.py` ↔ `composer_replication/loss.py`, and
381
+ `spikes/008-…/composer_diloco.py` ↔ `composer_replication/diloco/__init__.py`.
382
+ Make the spike import from the package; delete the duplicate.
383
+
384
+ ### 9. Add the missing real-trace end-to-end test in Spike 007
385
+ Take ingester output → Spike 005 data collator → 1 step of `compose_loss`.
386
+ This is BACKLOG acceptance #3; ~50 lines of test code closes V5's
387
+ spirit gap.
388
+
389
+ ### 10. Fix the sign-convention docstring in `composer_diloco.py`
390
+ Replace the incoherent "wrong-sign + SGD subtract = right answer with
391
+ momentum" gloss with: *"DiLoCo defines pseudo-gradient as
392
+ `θ_initial − θ_local`; this is the negative of the local update
393
+ direction, and standard SGD subtracts gradients, so the outer step
394
+ moves in the local-update direction. No negation required."* The test
395
+ is correct; the prose explaining it isn't.
396
+
397
+ ---
398
+
399
+ ## Credit where due
400
+
401
+ - **Spike 007's `ClaudeCodeIngester`** is real, working, well-tested
402
+ software with non-trivial logic (sidechain skip, thinking-block
403
+ strip-on-replay, malformed-line tolerance). The synthetic fixture
404
+ exercises the structural cases properly.
405
+ - **Spike 008's pseudogradient-sign-convention test** is the single
406
+ best test in all of Wave 7-10. It pins a known torchft hazard with an
407
+ explicit arithmetic prediction and a `wrong_sign_diff` reported on
408
+ failure.
409
+ - **Spike 006's α=0 / β=0 ablation tests** would catch real regressions
410
+ and document channel-disable semantics.
411
+ - **All three ADRs are properly traceable to recon documents**
412
+ (MODAL_RECONNAISSANCE, TRACE_SOURCE_RECONNAISSANCE,
413
+ DILOCO_RECONNAISSANCE). The decisions can be challenged; the *process*
414
+ is auditable, which is rare.
415
+ - **Package layout** (`loss`, `batch`, `opsd`, `teacher_replay`,
416
+ `ingestion/claude_code`, `diloco`, `trainer`) is sane; optional
417
+ extras correctly avoid forcing TRL/torchft on every install.
418
+
419
+ The work product is not zero. It is overclaimed by roughly one
420
+ scorecard tier and one BACKLOG acceptance criterion. Fixing items
421
+ 1, 2, 3, 5 above moves the framework from "publishable with a generous
422
+ reviewer" to "publishable with a critical reviewer." Items 4 and 6
423
+ move it from "research replication" to "evidenced research replication."
spikes/008-streaming-diloco/composer_diloco.py CHANGED
@@ -11,21 +11,25 @@ Wraps `torchft.local_sgd.DiLoCo` with the framework's conventions:
11
  Reference: `docs/adrs/ADR-003-diloco-impl.md`.
12
 
13
  Sign convention (READ THIS BEFORE TOUCHING):
14
- torchft's `_save_grads()` (line 324 of torchft/local_sgd.py) computes
15
- grad = θ_initial - θ_local
16
- and stores it as `param.grad` for the outer optimizer to consume.
17
- The outer optimizer then runs `param.data -= lr * grad`, equivalently
18
- θ_new = θ_local + lr * (θ_initial - θ_local) if outer optimizer is plain SGD
19
- which slurps the local-trained-θ TOWARD the initial-θ instead of away
20
- from it. That looks wrong, but it's correct for SGD-with-Nesterov-momentum
21
- on outer loop: the outer optimizer accumulates the negative-grad-direction
22
- history, so the "wrong-sign" pseudogradient combined with SGD's "subtract
23
- grad" semantics gives net "step in the local-Δ direction" once momentum
24
- builds up. This is consistent with the DiLoCo paper's pseudo-code.
25
-
26
- Bottom line: do NOT negate. torchft's pseudogradient sign + SGD outer
27
- optimizer is the correct combination. Spike 008's
28
- `test_diloco_pseudogradient_sign_convention` test catches a sign flip.
29
  """
30
  from __future__ import annotations
31
 
 
11
  Reference: `docs/adrs/ADR-003-diloco-impl.md`.
12
 
13
  Sign convention (READ THIS BEFORE TOUCHING):
14
+ DiLoCo defines pseudo-gradient as
15
+
16
+ pseudograd = θ_initial - θ_local
17
+
18
+ (per torchft's `_save_grads()` at line 324 of `torchft/local_sgd.py`).
19
+ This is the **negative** of the local update direction.
20
+
21
+ Standard SGD subtracts gradients: `p.data ← p.data - lr * grad`.
22
+ So when the outer optimizer runs after `restore_parameters()` puts
23
+ p.data back to θ_initial:
24
+
25
+ p.data ← θ_initial - lr * (θ_initial - θ_local)
26
+ = θ_initial + lr * (θ_local - θ_initial)
27
+
28
+ For `lr=1, momentum=0` this lands exactly at θ_local. For `lr<1` it
29
+ interpolates between θ_initial and θ_local. Nesterov momentum
30
+ accumulates the local-update direction across outer rounds.
31
+
32
+ No negation in our outer optimizer wrapper.
33
  """
34
  from __future__ import annotations
35
 
spikes/008-streaming-diloco/verdict.md CHANGED
@@ -1,29 +1,62 @@
1
  # Spike 008 — VERDICT
2
 
3
- **Status**: PASSED
4
  **Date**: 2026-05-26
5
  **Wave**: 9
 
6
 
7
  ## Headline
8
 
9
  `make_diloco_outer_loop()` wraps `torchft.local_sgd.DiLoCo` (BSD-3, Meta-maintained)
10
- to integrate vanilla DiLoCo / Streaming DiLoCo as the outer-loop optimizer for the
11
- Composer Replication Framework. 5/5 unit tests pass single-process. Sign convention
12
- of pseudo-gradient pinned down by an explicit unit test.
13
-
14
- ## Acceptance criteria
15
 
16
  | Criterion | Status |
17
  |---|---|
18
- | Outer loop machinery fires (allreduce + start_quorum + outer step) | ✅ test 1 |
19
  | Nesterov momentum state populated for every parameter | ✅ test 1 |
20
- | Pseudo-gradient sign convention verified (`θ_initial − θ_local`) | ✅ test 2 |
21
  | No regression in Spike 005 imports | ✅ test 3 |
22
  | `make_diloco_outer_loop()` factory wraps the right object | ✅ test 4 |
23
- | Streaming DiLoCo with 2 fragments constructs cleanly | ✅ test 5 |
24
- | Spike 005's 38 tests still pass | ✅ verified separately |
25
 
26
- ## Sign convention pinned down (the most important result)
27
 
28
  Per torchft's `_save_grads()` (line 324 of `torchft/local_sgd.py`):
29
 
@@ -31,43 +64,43 @@ Per torchft's `_save_grads()` (line 324 of `torchft/local_sgd.py`):
31
  pseudograd = θ_initial − θ_local
32
  ```
33
 
34
- The outer optimizer then runs `p.data ← θ_initial − lr * pseudograd`. With
35
- `lr=1, momentum=0`, this resolves to `θ_local` (the outer step undoes the
36
- restore-to-θ_initial). The test exercises this exact math with
37
- `local_param_after_nudge = θ_initial + 0.5` and asserts final ≈ θ_local.
38
 
39
- A sign flip in either `_save_grads` or the outer optimizer would land us at
40
- `θ_initial - 0.5` (movement in the wrong direction). The test reports both
41
- values in the failure message so a future flip is immediately diagnosable.
42
 
43
- ## What this closes
44
 
45
- - **V2** (DiLoCo "deferred to v0.2") in `docs/VISION_VALIDATION.md` — promotes
46
- DiLoCo from "documented gap" to "real working integration with sign-convention
47
- tested."
48
 
49
  ## What this does NOT close
50
 
51
- - True multi-replica convergence in single-process. The recon doc's pattern of
52
- "real averaging across replicas via shared buffer" hits a sequencing bug:
53
- replica A's `inner.step()` post-hook completes the entire prepare→perform
54
- sync sequence BEFORE replica B's post-hook starts, so the cross-replica
55
- average can't complete in time for A's outer step. This is the SAME
56
- limitation torchft's own tests have: they don't test convergence in
57
- single-process either. True cross-replica convergence is verified in
58
- production by NCCL with two real processes. For now, single-process tests
59
- verify the *machinery* (sync fires, outer optimizer steps, Nesterov state
60
- populates).
61
-
62
- - Streaming DiLoCo with `fragment_sync_delay > 0` and overlapped sync
63
- (requires CUDA streams). The framework's `make_diloco_outer_loop()` accepts
64
- the parameter; Spike 008 exercises only `delay=0` (vanilla DiLoCo).
65
 
66
  ## Files
67
 
68
- - `composer_diloco.py` — `make_diloco_outer_loop()` wrapper. Documents the
69
- sign convention LOUDLY (per ADR-003).
70
- - `tests/test_diloco_smoke.py` — 5 acceptance tests.
 
71
 
72
  ## Dependencies added
73
 
@@ -76,4 +109,4 @@ values in the failure message so a future flip is immediately diagnosable.
76
  ## Cost / time
77
 
78
  - Pure CPU, single process, no GPU.
79
- - Test suite: 4.7 seconds for 5 tests.
 
1
  # Spike 008 — VERDICT
2
 
3
+ **Status**: ⚠️ PARTIAL (acceptance criterion redefined; see § "Honest re-statement" below)
4
  **Date**: 2026-05-26
5
  **Wave**: 9
6
+ **Cross-model review**: BLOCKER 2 of `docs/research/WAVE_7_10_FINAL_REVIEW.md`
7
 
8
  ## Headline
9
 
10
  `make_diloco_outer_loop()` wraps `torchft.local_sgd.DiLoCo` (BSD-3, Meta-maintained)
11
+ and the framework's outer-loop sign convention is pinned by an explicit unit
12
+ test. **Cross-replica convergence is NOT verified**: the BACKLOG acceptance
13
+ criterion required two replicas; what shipped is one replica + passthrough
14
+ no-op `allreduce`.
15
+
16
+ ## Honest re-statement
17
+
18
+ The BACKLOG required:
19
+
20
+ > Smoke test: 2 replicas × 4 inner steps × 2 outer rounds on the toy model
21
+ > from Spike 005, both replicas converge toward the same solution within
22
+ > tolerance.
23
+
24
+ What was attempted: the recon doc (`DILOCO_RECONNAISSANCE.md`) provided a
25
+ "ready-to-paste" 2-replica pattern with a shared-buffer mock allreduce that
26
+ averages tensors across replicas. **That pattern does not work in single-process**:
27
+ each `inner.step()`'s post-hook runs `prepare_sync` + `perform_sync` to
28
+ completion (including outer optimizer step) before yielding back to the
29
+ caller. By the time replica B's post-hook starts, replica A has already
30
+ finished its outer step using A's *un*-averaged pseudo-gradient. The mock
31
+ allreduce can compute the cross-replica mean, but it can't write that mean
32
+ back into A's `_grads[name]` buffer in time for A's outer step.
33
+
34
+ The fix would require either:
35
+ - A real `torch.distributed` barrier (NCCL or Gloo) — out of scope for a
36
+ CPU-only single-process smoke.
37
+ - A multi-process test using `torch.multiprocessing.spawn` with two real
38
+ processes — feasible but ~200 LOC of additional test infrastructure that
39
+ would need its own review.
40
+
41
+ What ships instead: a single-replica machinery test (`allreduce` is a no-op
42
+ passthrough). Verifies that outer optimizer fires, Nesterov state populates,
43
+ sign convention is correct. Does NOT verify cross-replica convergence.
44
+ **This is a redefinition of the BACKLOG acceptance criterion.** Documented
45
+ explicitly in this verdict + the test file's `_make_passthrough_manager`
46
+ docstring + composer_diloco.py.
47
+
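The sequencing problem described above can be reproduced with a toy model that uses plain floats instead of torchft. Everything here is an illustrative stand-in: `run_eager` mimics a post-hook that finishes its outer step before the next replica contributes, and `run_with_barrier` mimics what a real barrier would provide. Neither is torchft's API.

```python
def outer_step(theta_initial, pseudograd, lr=1.0):
    """Plain SGD outer step starting from the restored initial parameter."""
    return theta_initial - lr * pseudograd


def run_eager(theta_initial, local_thetas):
    """Each replica's hook runs to completion before the next one starts.

    The shared buffer only holds the contributions made so far, so the
    first replica averages over itself alone.
    """
    shared = []
    results = []
    for theta_local in local_thetas:
        shared.append(theta_initial - theta_local)
        averaged = sum(shared) / len(shared)
        results.append(outer_step(theta_initial, averaged))
    return results


def run_with_barrier(theta_initial, local_thetas):
    """All replicas contribute first; only then does anyone take the outer step."""
    shared = [theta_initial - theta_local for theta_local in local_thetas]
    averaged = sum(shared) / len(shared)
    return [outer_step(theta_initial, averaged) for _ in local_thetas]


eager = run_eager(0.0, [1.0, 3.0])           # replica A steps before B contributes
synced = run_with_barrier(0.0, [1.0, 3.0])   # both replicas see the full mean
print(eager, synced)
```

In the eager run replica A ends at its own local value (1.0) while B ends at the mean (2.0), so the replicas disagree; with the barrier both end at 2.0.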
48
+ ## What the test suite DOES verify (5/5 pass)
49
 
50
  | Criterion | Status |
51
  |---|---|
52
+ | `allreduce`/`start_quorum`/`should_commit` fire at the right step boundaries | ✅ test 1 |
53
  | Nesterov momentum state populated for every parameter | ✅ test 1 |
54
+ | **Pseudo-gradient sign convention** verified (`θ_initial − θ_local`) with explicit arithmetic prediction | ✅ test 2 |
55
  | No regression in Spike 005 imports | ✅ test 3 |
56
  | `make_diloco_outer_loop()` factory wraps the right object | ✅ test 4 |
57
+ | Streaming DiLoCo with 2 fragments + nonzero `fragment_sync_delay` accepts the config | ✅ test 5 |
 
58
 
59
+ ## Sign convention pinned (the most important result here)
60
 
61
  Per torchft's `_save_grads()` (line 324 of `torchft/local_sgd.py`):
62
 
 
64
  pseudograd = θ_initial − θ_local
65
  ```
66
 
67
+ DiLoCo defines pseudo-gradient as `θ_initial - θ_local`. This is the
68
+ negative of the local update direction. Standard SGD subtracts gradients
69
+ (`p ← p - lr * grad`), so the outer step moves in the local-update
70
+ direction. No negation needed in our outer optimizer wrapper.
71
 
72
+ The test exercises this exact math with `local_param_after_nudge =
73
+ θ_initial + 0.5` and asserts final ≈ `θ_local`. A sign flip in either
74
+ `_save_grads` or the outer optimizer would land at `θ_initial - 0.5`
75
+ (movement in the wrong direction); the test reports both values in the
76
+ failure message so a future flip is immediately diagnosable. **This is
77
+ the single best test in Wave 7-10** per the cross-model reviewer.
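The arithmetic behind that test can be checked with scalars. This sketch assumes a plain SGD outer step with no momentum and an arbitrary starting value of 2.0; the function names are illustrative and are not the names used in `tests/test_diloco_smoke.py`.

```python
def pseudograd(theta_initial, theta_local):
    """DiLoCo pseudo-gradient: the negative of the local update direction."""
    return theta_initial - theta_local


def outer_sgd_step(theta_initial, grad, lr=1.0):
    """Plain SGD: subtract lr * grad from the restored initial parameter."""
    return theta_initial - lr * grad


theta_initial = 2.0
theta_local = theta_initial + 0.5  # the 0.5 nudge used in the test

grad = pseudograd(theta_initial, theta_local)

correct = outer_sgd_step(theta_initial, grad)            # lands on theta_local
flipped = outer_sgd_step(theta_initial, -grad)           # a sign flip moves the wrong way
halfway = outer_sgd_step(theta_initial, grad, lr=0.5)    # lr < 1 interpolates

print(correct, flipped, halfway)
```

With `lr=1` the correct sign recovers `θ_local` exactly, the flipped sign lands at `θ_initial - 0.5`, and `lr=0.5` lands halfway between `θ_initial` and `θ_local`.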
78
 
79
+ ## What this CLAIMS to close
80
 
81
+ - **V2** (DiLoCo "deferred to v0.2") in `docs/VISION_VALIDATION.md` —
82
+ re-scored as **⚠️ partial**, not ✅, in the 2026-05-26 update at the
83
+ bottom of § 3 of that doc.
84
 
85
  ## What this does NOT close
86
 
87
+ - **True multi-replica convergence**: see § "Honest re-statement" above.
88
+ Either needs a real `torch.distributed` test on multiple processes, or
89
+ a redesigned single-process pattern that overrides `_DummyWork.wait()`
90
+ to do lazy averaging.
91
+ - **Trainer integration**: `ComposerReplicationTrainer` does NOT yet use
92
+ `make_diloco_outer_loop`. The DiLoCo wrapper is an independent context
93
+ manager. Wiring it into the trainer's lifecycle is a separate spike.
94
+ - **Streaming DiLoCo with `fragment_sync_delay > 0`** (overlapped sync,
95
+ CUDA streams). The framework's `make_diloco_outer_loop()` accepts the
96
+ parameter; tests only exercise `delay=0` (vanilla DiLoCo).
97
 
98
  ## Files
99
 
100
+ - `composer_diloco.py` — `make_diloco_outer_loop()` wrapper. Documents
101
+ the sign convention.
102
+ - `tests/test_diloco_smoke.py` — 5 acceptance tests. Test 2 (sign
103
+ convention) is the highest-value test.
104
 
105
  ## Dependencies added
106
 
 
109
  ## Cost / time
110
 
111
  - Pure CPU, single process, no GPU.
112
+ - Test suite: ~5 seconds for 5 tests.