Wave 11: cross-model adversarial review + honest down-revision
Phase 11 of the deep work loop ran a cross-model adversarial review on the
Wave 7-10 work. The review (anthropic/claude-opus-4.7, separate session) found
10 substantive flaws including 2 BLOCKERs:
1. **Scorecard inflation** — initial self-claim was 5/10 → 9/10, honest
re-scoring is 5/10 → 7/10 ✅ + 1/10 ⚠️ + 2/10 ❌-spirit.
2. **Spike 008 acceptance criterion silently redefined** — BACKLOG required
"2 replicas converge"; what shipped is "1 replica + passthrough no-op
allreduce." The recon doc's "ready-to-paste 2-replica pattern" hits a
single-process post-hook sequencing bug.
Both BLOCKERs addressed in this commit:
- docs/research/WAVE_7_10_FINAL_REVIEW.md — full 423-line review committed.
- docs/VISION_VALIDATION.md — replace 9/10 self-claim with honest 7/10
re-scoring + per-test caveats acknowledging the spirit-failures of tests
6/8/10. Explicitly notes "the framework as of this commit has zero GPU
evidence of any kind."
- spikes/008-streaming-diloco/verdict.md — status changed from "✅ PASSED"
to "⚠️ PARTIAL." Adds an "Honest re-statement" section explaining the
acceptance-criterion redefinition and what would be needed to close the
BACKLOG criterion in spirit (multi-process test ~200 LOC).
- README.md — Spike 6 marked "PASSED with caveat" (SDPO channel never
exercises on a real model in the smoke). Spike 7 acknowledges the
end-to-end ingester→collator→loss test is missing. Spike 8 marked ⚠️
PARTIAL not 🟢 PASSED.
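To make the gap between "2 replicas converge" and "1 replica + passthrough no-op allreduce" concrete, here is a toy single-process sketch. It is not the torchft code path and not the Spike 008 test; the helper names (`average_allreduce`, `passthrough_allreduce`, `outer_round`) and the scalar parameters are assumptions made purely for illustration.

```python
# Toy single-process illustration; scalar parameters stand in for model weights.
# None of these helpers come from torchft or the framework.

def average_allreduce(values):
    """Real averaging: every replica receives the mean of all contributions."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def passthrough_allreduce(values):
    """No-op: every replica keeps its own value, as in a passthrough mock."""
    return list(values)


def outer_round(thetas, local_deltas, allreduce, lr=1.0):
    """One outer round: local update, pseudo-gradient exchange, outer SGD step."""
    pseudograds = [
        theta - (theta + delta)  # theta_initial - theta_local
        for theta, delta in zip(thetas, local_deltas)
    ]
    synced = allreduce(pseudograds)
    return [theta - lr * g for theta, g in zip(thetas, synced)]


start = [0.0, 0.0]         # two replicas starting from the same weights
local_deltas = [1.0, 3.0]  # each replica's local training moved it differently

averaged = outer_round(start, local_deltas, average_allreduce)
passthrough = outer_round(start, local_deltas, passthrough_allreduce)

print("averaged:   ", averaged)
print("passthrough:", passthrough)
```

With averaging, both replicas land on the same value after the outer step. With the passthrough mock they stay apart, which is why a one-replica passthrough test cannot show convergence.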
Two cheaper improvements also landed:
- docs/adrs/ADR-002-trace-source.md — adds a "Risk added 2026-05-26 by
cross-model review" section flagging the Claude Code circularity:
training a student on Claude's outputs while Claude is in the teacher
pool produces biased disagreement. Mitigation (drop Claude from the
pool when ingesting Claude traces) documented; not yet enforced in code.
- composer_replication/diloco/__init__.py + spikes/008-.../composer_diloco.py
— sign-convention docstrings rewritten. Old prose ("wrong-sign +
SGD subtract = right answer with momentum") was incoherent. New version
derives p.data ← θ_initial - lr*pseudograd = θ_initial + lr*(θ_local -
θ_initial) cleanly, no hand-waving about momentum.
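As a quick sanity check on the derivation above, the following is a minimal pure-Python sketch of the outer step on a scalar parameter. It assumes plain SGD with no momentum and does not touch torchft; the helper names `pseudograd` and `outer_sgd_step` are hypothetical and exist only for this illustration.

```python
# Illustrative only: a scalar stands in for a parameter tensor, no torchft involved.

def pseudograd(theta_initial: float, theta_local: float) -> float:
    """DiLoCo pseudo-gradient as described above: theta_initial - theta_local."""
    return theta_initial - theta_local


def outer_sgd_step(theta_initial: float, theta_local: float, lr: float) -> float:
    """Plain SGD (no momentum) applied at theta_initial: p <- p - lr * grad."""
    return theta_initial - lr * pseudograd(theta_initial, theta_local)


theta_initial, theta_local = 1.0, 3.0  # local training moved the parameter by +2.0

full_step = outer_sgd_step(theta_initial, theta_local, lr=1.0)
half_step = outer_sgd_step(theta_initial, theta_local, lr=0.5)
# What an extra negation in the outer optimizer would produce at lr=1.
wrong_sign = theta_initial + 1.0 * pseudograd(theta_initial, theta_local)

print(full_step, half_step, wrong_sign)
```

At `lr=1.0` the step lands on `theta_local`, at `lr=0.5` it lands halfway, and the wrong-sign variant moves away from `theta_local`.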
Items NOT addressed in this commit (deferred to next session or post-GPU):
- 3. Strengthen Spike 006 against tautology (use 2 alternating batches +
real SDPO firing path)
- 4. Run Spike 002a-mini on local 5090 (zero GPU evidence anywhere yet)
- 5. Reconcile run.log vs verdict.md numerical inconsistency (seed pinning)
- 7. Decide what compose_loss/build_batch are (verification harness vs
production)
- 8. Eliminate dual sources of truth (spike copy vs package copy)
- 9. Add the missing real-trace end-to-end test in Spike 007
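For item 3, one small ingredient could be a stricter loss check than the endpoint comparison `losses[-1] < losses[0]` that the review quotes. The sketch below is an assumption about how such a check might look with two alternating batches. The helper names are hypothetical and the loss values are made up for illustration, not taken from any run.

```python
# Hypothetical helpers; the numbers below are invented for illustration only.

def endpoint_decrease(losses):
    """The weak check: the last loss is below the first."""
    return losses[-1] < losses[0]


def strictly_decreasing(losses):
    """A stricter check: every step improves on the previous one."""
    return all(later < earlier for earlier, later in zip(losses, losses[1:]))


def split_by_batch(losses, n_batches):
    """Separate an interleaved loss log (A, B, A, B, ...) into per-batch curves."""
    return [losses[i::n_batches] for i in range(n_batches)]


# Steps alternate between batch A and batch B.
losses = [0.90, 0.80, 0.95, 0.70, 0.60, 0.50]
curve_a, curve_b = split_by_batch(losses, 2)

print("endpoint check passes:", endpoint_decrease(losses))
print("batch A strictly decreasing:", strictly_decreasing(curve_a))
print("batch B strictly decreasing:", strictly_decreasing(curve_b))
```

Here the endpoint check passes even though batch A's loss rises at one step, which is the kind of signal a per-batch check would surface.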
The deep work loop's design says "publish the cross-model review with the
work, not before reviewing." This commit honors that — the BLOCKER fixes
land BEFORE the public push.
Tests still 67/67 across spikes 005/007/008 (Spike 008 changes are
docstring-only, machinery test still passes).
- README.md +4 -4
- composer_replication/diloco/__init__.py +25 -15
- docs/VISION_VALIDATION.md +18 -0
- docs/adrs/ADR-002-trace-source.md +19 -0
- docs/research/WAVE_7_10_FINAL_REVIEW.md +423 -0
- spikes/008-streaming-diloco/composer_diloco.py +19 -15
- spikes/008-streaming-diloco/verdict.md +73 -40
@@ -48,10 +48,10 @@ for what the output should look like.
 **v0.1 spike progress (2026-05-26):**
 - 🟢 Spike 001 (kill-switch teacher cost) — **VALIDATED**: 150 real OpenRouter calls, $0.98/trace, p95 latency 20.5s. The novel research direction is economically viable.
 - 🟢 Spike 005 (integrated 3-channel trainer skeleton) — **SKELETON-VALIDATED**: 38/38 unit tests passing; the integration architecture claim ("all three channels run simultaneously, ablate cleanly, train without divergence") is empirically verified.
-- 🟢 Spike 006 (real HF model smoke) — **PASSED**: Qwen2.5-0.5B-Instruct via `AutoModelForCausalLM`, 5 backward steps on CPU, loss 0.7390 → 0.0031
-- 🟢 Spike 007 (real trace ingestion) — **PASSED**: `ClaudeCodeIngester.ingest()` converts Claude Code session JSONL → `TraceState` records. 15/15 tests including a real-session smoke.
--
-- 🟢 Wave 10 (packaging) — **DONE**: `pip install -e .` works; `composer_replication` package re-exports the verified APIs from the spike directories.
+- 🟢 Spike 006 (real HF model smoke) — **PASSED with caveat**: Qwen2.5-0.5B-Instruct via `AutoModelForCausalLM`, 5 backward steps on CPU, loss 0.7390 → 0.0031, all gradients finite. Closes vision-validation gap V8 in the literal sense. Caveat: the SDPO channel is `0.0` throughout (silently disabled by ctx-shape mismatch — the fallback is correct but means SDPO is not exercised end-to-end on a real model anywhere yet) and DPO uses dummy reference logprobs.
+- 🟢 Spike 007 (real trace ingestion) — **PASSED**: `ClaudeCodeIngester.ingest()` converts Claude Code session JSONL → `TraceState` records. 15/15 tests including a real-session smoke. ⚠️ End-to-end ingester → collator → loss test still missing (V5 spirit gap); see VISION_VALIDATION § 3 update.
+- ⚠️ Spike 008 (DiLoCo outer-loop smoke) — **PARTIAL**: `make_diloco_outer_loop()` wraps `torchft.local_sgd.DiLoCo`. 5/5 single-process tests pass including a pseudo-gradient sign-convention pin. **But** the BACKLOG required a 2-replica convergence smoke; what shipped is 1-replica machinery + passthrough no-op `allreduce`. True multi-process DiLoCo is GPU-gated and not yet attempted.
+- 🟢 Wave 10 (packaging) — **DONE**: `pip install -e .` works; `composer_replication` package re-exports the verified APIs from the spike directories. `compose_loss` and `build_batch` are explicitly verification-harness public APIs (production loss is `ComposerReplicationTrainer._compute_loss`).
 - 📋 Spikes 002a/002b/003/004 — planned, awaiting GPU budget commitment.
 
 📝 **Publication materials drafted:** [`publications/`](publications/) contains a complete pre-experimental release set — longform methodology paper, blog post (HF Blog format), repo Discussion announcement, X/LinkedIn threads, plus `CITATION.cff` and `CITATION.bib` at the repo root. Use [`publications/RELEASE_CHECKLIST.md`](publications/RELEASE_CHECKLIST.md) to coordinate the publication wave. Nothing posted publicly yet — this is a pre-experimental release, not a post-experimental one.
@@ -11,21 +11,31 @@ Wraps `torchft.local_sgd.DiLoCo` with the framework's conventions:
 Reference: `docs/adrs/ADR-003-diloco-impl.md`.
 
 Sign convention (READ THIS BEFORE TOUCHING):
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+DiLoCo defines pseudo-gradient as
+
+    pseudograd = θ_initial - θ_local
+
+(per torchft's `_save_grads()` at line 324 of `torchft/local_sgd.py`).
+This is the **negative** of the local update direction (the local
+update *moved* params from θ_initial toward θ_local).
+
+Standard SGD subtracts gradients: `p.data ← p.data - lr * grad`.
+So when the outer optimizer runs after `restore_parameters()` puts
+p.data back to θ_initial:
+
+    p.data ← θ_initial - lr * (θ_initial - θ_local)
+           = θ_initial + lr * (θ_local - θ_initial)
+
+For `lr=1, momentum=0` this lands exactly at θ_local. For `lr<1` it
+interpolates between θ_initial and θ_local (the standard DiLoCo outer
+step). Adding Nesterov momentum accumulates the local-update direction
+across outer rounds.
+
+No negation in our outer optimizer wrapper. The test
+`test_diloco_pseudogradient_sign_convention` in
+`spikes/008-streaming-diloco/tests/test_diloco_smoke.py` pins this
+arithmetic and reports both the expected and the wrong-sign value on
+failure for fast diagnosis if torchft ever flips its convention.
 """
 from __future__ import annotations
 
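The docstring's remark that Nesterov momentum accumulates the local-update direction across outer rounds can be illustrated with a scalar sketch. This is an assumption-laden toy: the update rule follows the one documented for `torch.optim.SGD` with `nesterov=True` and zero dampening, the function name `nesterov_outer_rounds` is hypothetical, and the values `lr=0.7`, `momentum=0.9` are illustrative rather than taken from the repository.

```python
# Scalar toy of an outer optimizer with Nesterov momentum (no torch, no torchft).

def nesterov_outer_rounds(theta, local_deltas, lr, momentum):
    """Apply one outer step per entry in local_deltas and return the trajectory."""
    buf = None          # momentum buffer, created on the first step
    trajectory = []
    for delta in local_deltas:
        theta_local = theta + delta     # result of the inner (local) training
        grad = theta - theta_local      # pseudo-gradient: theta_initial - theta_local
        buf = grad if buf is None else momentum * buf + grad
        effective = grad + momentum * buf   # Nesterov look-ahead term
        theta = theta - lr * effective      # SGD subtracts the gradient
        trajectory.append(theta)
    return trajectory


# Two outer rounds where local training moves the parameter by +1.0 each time.
trajectory = nesterov_outer_rounds(0.0, [1.0, 1.0], lr=0.7, momentum=0.9)
first_move = trajectory[0] - 0.0
second_move = trajectory[1] - trajectory[0]
print(trajectory, first_move, second_move)
```

Both outer steps move in the same direction as the local update, and the second step is larger than the first because the buffer carries the earlier round's direction forward.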
@@ -59,6 +59,24 @@ Ten concrete pass/fail tests covering both "do we encapsulate the vision" and "i
 
 **Score: 5/10 pass, 4/10 fail, 1/10 partial.** The framework's design is solid; the gap is between design and runnable artifact.
 
+> **Update 2026-05-26 — Wave 7+8+9+10 closeout (deep work loop) + cross-model audit**
+>
+> Initial self-claim was 5/10 → 9/10. A cross-model adversarial review (Phase 11 of the deep work loop, doc at `docs/research/WAVE_7_10_FINAL_REVIEW.md`) found three of those ✅s were letter-of-the-law rather than spirit. **Honest re-scoring: 5/10 → 7/10 ✅, 1/10 ⚠️, 2/10 ❌-spirit.**
+>
+> | # | Test | Status | New evidence + honest caveat |
+> |---|---|---|---|
+> | 6 | DiLoCo integrated in runnable stack | ⚠️ partial | Spike 008 has `composer_replication.diloco.make_diloco_outer_loop` wrapping `torchft.local_sgd.DiLoCo`, with a 5/5 test suite that pins the sign convention. **But**: the BACKLOG required a *2-replica convergence* smoke; what shipped is a 1-replica machinery test with a passthrough no-op `allreduce`. The recon doc's "ready-to-paste 2-replica pattern" hits a single-process post-hook sequencing bug we couldn't fix without rewriting torchft. The DiLoCo wrapper is also **not yet integrated with `ComposerReplicationTrainer`** — it's an independent context manager. Calling V2 ✅ overstates: real DiLoCo training is GPU-multi-process which we haven't touched. |
+> | 7 | Real HF model loads + runs through `compose_loss` | ✅ (with caveat below) | Spike 006 — Qwen2.5-0.5B-Instruct on CPU, 5 backward steps, loss 0.7390 → 0.0031, 9/9 tests. **Caveat**: SDPO channel is `0.0` throughout (silently disabled by ctx_student vs ctx_teacher shape mismatch — correct fallback, but means SDPO is not exercised end-to-end on a real model anywhere in the repo yet). DPO uses dummy reference logprobs. The 5-step loss decrease on a fixed batch is closer to "memorization works" than "the 3-channel composition is correct." Still: the framework now demonstrably loads real HF models, which it didn't before. V8 is closed in the literal sense. |
+> | 8 | Real LLM-application trace ingested end-to-end | ❌ spirit | Spike 007 — `ClaudeCodeIngester` ingests real Claude Code session JSONL → `TraceState` records, 15/15 tests including a real-session smoke. **But**: BACKLOG acceptance criterion #3 said "end-to-end smoke: real trace → ingester → collator → 1-step `compose_loss`." That last hop is **not tested**. The spike stops at "ingester emits TraceStates correctly." Closing V5 in spirit needs a 50-LOC test that pipes ingested records all the way through the loss. Open. |
+> | 9 | Framework is *installable* with working entrypoints | ✅ | Wave 10 — `pyproject.toml` ships `composer_replication` package, `pip install -e .` works, `examples/qwen_05b_quickstart/run.py` runs end-to-end via the package API. (Caveat acknowledged: `compose_loss` is documented as a verification harness, not production. The production loss is `ComposerReplicationTrainer._compute_loss`.) |
+> | 10 | Non-author can complete the "I have X, I want a Composer variant" journey | ❌ spirit | Quickstart works for "verify the loss composition runs" but not for "train a real model" — that requires real GRPO rollouts, real teacher calls, and GPU. The brief's intended user wants the latter. We have not closed that path. |
+>
+> **The remaining 1/10 + 2/10 spirit gaps + the unverified 9/10 ⚠️** are the post-replication GPU phase: Spike 002a/b (real trace collection on GPU), Spike 003 (DPO-pair signal density), Spike 004 (A/B SWE-bench-lite), and a real-multi-process DiLoCo test. Those are GPU-budget-gated and out of scope for the deep work loop's CPU-only constraint.
+>
+> **Time spent on Wave 7-10**: ~1 session. **No GPU spend.** Modal evaluated but rejected for the smoke phase (ADR-001 — local 5090 wins on iteration cycle 10× over Modal L4 for 0.5B verification work). **The local 5090 was also not used** — Spike 002a-mini (the planned local-GPU smoke) was not run. The framework as of this commit has zero GPU evidence of any kind. That is honest about where this work lands: **a tested, installable methodology repo with real CPU smokes and primary-source-validated research, not a trained model.**
+>
+> Cross-model review's full priority list (10 items, ranked) is in `docs/research/WAVE_7_10_FINAL_REVIEW.md`. Items 1-2 (scorecard honesty + V2 re-statement) are addressed by this update. Items 3-10 are open.
+
 ## 4. The four real gaps, each examined
 
 ### 4.1 V2: DiLoCo deferral — is this a drift?
@@ -117,6 +117,25 @@ Wins on every axis we care about for Spike 007:
 need to bump a `schema_version` constant in the ingester. Acceptable
 ongoing maintenance burden.
 
+### Risk added 2026-05-26 by cross-model review (NOT BLOCKING but TO DOCUMENT)
+
+- **Circularity / data-leakage in the teacher-replay channel.** Claude
+Code traces are produced by Claude. Our default teacher pool
+(`DEFAULT_TEACHERS`) includes `anthropic/claude-opus-4.7`. Training a
+student on Claude's outputs while Claude is one of the teachers
+voting on what the student should do produces a biased disagreement
+signal: Claude's vote is correlated with the trace's existing
+`student_action` (which Claude originally produced). This biases the
+multi-teacher consensus toward the existing answer.
+- **Mitigation**: when ingesting Claude Code traces, the user should
+drop Claude from the teacher pool and use a non-Claude consensus
+(Opus 4.7 → GPT-5 + DeepSeek V4-Pro, or any non-Claude pair).
+Documented here; not yet enforced in code.
+- **Open question for v0.2**: should `ClaudeCodeIngester` automatically
+annotate the source-model field on each trace and `replay_trace`
+automatically exclude same-family teachers? Defer the design until
+the post-replication phase reveals whether the bias is observable.
+
 ### Future ingesters
 
 Open the door for two more ingesters in v0.2:
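The mitigation in the ADR-002 hunk above (drop same-family teachers when the trace came from that family) is documented but not enforced in code. A minimal sketch of what such a filter might look like follows. Everything here is hypothetical: the function names, the idea of keying on the provider prefix of the model ID, the `source_model` value, and every model ID except `anthropic/claude-opus-4.7`, which is the only one named in the ADR text.

```python
# Hypothetical filter; not part of composer_replication.

def model_family(model_id: str) -> str:
    """Provider prefix of a 'provider/model' style ID, lowercased."""
    return model_id.split("/", 1)[0].lower()


def exclude_same_family(teachers: list[str], source_model: str) -> list[str]:
    """Drop every teacher whose provider prefix matches the trace's source model."""
    source_family = model_family(source_model)
    return [t for t in teachers if model_family(t) != source_family]


# Only the Claude entry comes from the ADR; the other IDs are placeholders.
teachers = [
    "anthropic/claude-opus-4.7",
    "openai/gpt-5",
    "deepseek/deepseek-v4-pro",
]

# Assumed annotation that an ingester could attach to each trace.
filtered = exclude_same_family(teachers, source_model="anthropic/claude-code")
print(filtered)
```

Keying on the provider prefix is one possible notion of "same family". A real implementation would need to decide how to treat models that are served under several provider names.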
@@ -0,0 +1,423 @@
+# Wave 7–10 Final Review — Cross-model adversarial check
+
+**Reviewer**: external model, Phase 11 of the deep work loop.
+**Date**: 2026-05-26.
+**Mandate**: find substantive flaws. The research thesis is
+primary-source-validated; this attacks *implementation correctness* and
+*scope creep*, not the thesis.
+
+---
+
+## (a) Are the tests real evidence or theater?
+
+### Spike 006 (Qwen2.5-0.5B-Instruct CPU smoke, 9 tests)
+
+**Verdict: mostly tautology, with usable ablation tests.**
+
+The headline "loss 0.7390 → 0.0031, 99.6% reduction in 5 steps" is
+technically true and substantively near-tautological:
+
+1. **The same fixed ~50-token batch is reused for all 5 steps.**
+`build_batch` returns one conversation; the test loop calls
+`compose_loss(model, batch)` five times in a row. No reshuffle, no
+second batch, no held-out anything.
+2. **0.5B params × AdamW(lr=1e-5) × identical 50-token batch ×
+5 steps = textbook memorization regime.** A randomly-initialized MLP
+would also reduce loss in this setup. The test does not distinguish
+"the 3-channel composition is correct" from "AdamW reduces fixed-batch
+loss on any non-degenerate objective."
+3. **The SDPO channel is zero throughout** (`sdpo_jsd=0.0` on every
+row of `loss_curve.csv`). The verdict calls this "correct fallback
+behavior"; what it actually is is *the entire SDPO channel never
+being tested by this smoke*. The fallback is a literal `_zero(device)`.
+`generalized_jsd_loss` has no end-to-end test on a real HF model
+anywhere in the codebase. **This is the largest evidence gap for
+V8.**
+4. **DPO uses dummy hard-coded reference logprobs** (`-30.0`, `-35.0`).
+This tests that `-logsigmoid(small_positive)` is differentiable, not
+that the trace-replay-DPO pipeline (reference-policy precompute +
+collator + loss) wires together.
+5. The "loss decreases" assertion is `losses[-1] < losses[0]` — the
+weakest version of monotonicity.
+
+**Genuine value**: model-loads test, chat-template test, the three
+α=0 / β=0 ablation tests. The ablations would catch a regression where
+weights stop disabling channels. The "5-step decrease" is the weakest
+test in the file.
+
+**Run.log inconsistency, not flagged anywhere**:
+`examples/qwen_05b_quickstart/run.log` shows step-1 total = 0.0379;
+the spike `verdict.md` quotes step-1 total = 0.2090 for the same code,
+same model. Either the seed isn't pinned through the model forward
+(likely — `torch.manual_seed(42)` is in `build_batch` only), or the
+package's `compose_loss` differs subtly from the spike's. **Quoting
+exact numbers from a non-reproducible run as evidence is the sloppy
+version of every research-replication scandal.**
+
+### Spike 007 (Claude Code ingester, 15 tests)
+
+**Verdict: strongest test suite of the three. Caveats apply.**
+
+Real engineering value:
+- Synthetic fixture exercises the actual record types (assistant,
+user/tool_result, summary, system, sidechain). Tests assert structural
+properties: history grows monotonically, `[THINKING]` stripped on
+replay but kept in student_action, unique state_ids, tool_use
+serialization, tool_result tagging.
+- `test_truncated_line_tolerated` would catch a real failure-mode
+removal of the JSON-decode try/except.
+- Subagent and sidechain skip tests catch real production cases.
+
+Caveats:
+- **The "real session" test is hardcoded to one path on the author's
+machine** (`/home/codeseys/.claude/projects/…/e4a34e2b-….jsonl`).
+No env var, no fixture-discovery; the test is `skipif(not exists)`.
+This is a manual integration test, not a CI test. ADR-002 said "CI
+users substitute their own"; the substitution mechanism doesn't
+exist.
+- The synthetic fixture is **author-written** and presumably designed
+alongside the ingester. There is no scrubbed third-party fixture.
+- Acceptance criterion #3 in BACKLOG ("end-to-end smoke: real trace →
+ingester → collator → 1-step `composer_total_loss`") is **unmet** —
+the spike stops at "ingester emits TraceStates correctly." There is
+no test that takes ingested records, runs the data collator, and runs
+through `compose_loss`.
+
+This suite would catch real regressions in the ingester. Its weakest
+property: ships no contributor-runnable real-trace test.
+
+### Spike 008 (DiLoCo, 5 tests, single-process)
+
+**Verdict: the caveat is honest but says the test does not test what
+users will assume it tests.**
+
+BACKLOG acceptance criterion: *"Smoke test: 2 replicas × 4 inner steps
+× 2 outer rounds on the toy model from Spike 005, both replicas converge
+toward the same solution within tolerance."*
+
+What ships: **one** replica, mock manager whose `allreduce` is a
+**`passthrough` no-op** (test_diloco_smoke.py:78). This is "one
+replica's outer optimizer machinery fires," not "two replicas
+converge." The acceptance criterion was silently re-defined; the
+spike's verdict.md calls this a "limitation" but it is a redefinition.
+
+The recon doc (per ADR-003) claimed a "ready-to-paste" pattern with real
+shared-buffer averaging. The implementation hits a "post-hook
+sequencing bug." **One of the recon claim and the implementation is
+wrong**, and the gap is buried in verdict.md instead of fixed.
+
+**Genuine value**: `test_diloco_pseudogradient_sign_convention` is the
+**single best test in all of Wave 7-10**. It pins the sign convention
+with a concrete arithmetic prediction (`final == θ_initial + nudge`)
+and reports `wrong_sign_diff` on failure. A future torchft upgrade that
+flips the sign breaks this test loudly. ADR-003 specifically flagged
+this hazard, and the test catches it. Credit where due.
+
+**Separate flaw in `composer_diloco.py` docstring (lines 13–28)**: the
+"wrong-sign pseudogradient combined with SGD's subtract-grad semantics
+gives net step in the local-Δ direction once momentum builds up" gloss
+is incoherent. There is no "wrong-sign" pseudogradient.
+`θ_initial − θ_local` is the exact DiLoCo paper convention; SGD's
+`p ← p − lr·g` semantics are designed for it. The test is correct; the
+prose explaining why is wrong, and will mislead anyone porting the
+convention.
+
+---
+
+## (b) Is the package a real framework or a shim?
+
+**Verdict: a structured shim around three real components and two
+stubs. Not yet a framework.**
+
+What `pip install composer-replication` delivers:
+- `compose_loss` — labeled in its own top docstring as "Do NOT use as
+the production training loss." Re-exported as the headline package
+API and used in the quickstart.
+- `build_batch` — a hard-coded fixed-conversation factory built for the
+smoke (factorial / binary-search examples). Anyone using this in
+real training is using example code as production.
+- `ClaudeCodeIngester` — real, working component. Solid.
+- `generalized_jsd_loss` — real, working (extracted from OPSD, MIT).
+- `extract_dpo_pairs`, `replay_trace`, teacher specs — real, but
+require OpenRouter credentials + spend.
+- `ComposerReplicationTrainer` — TRL `GRPOTrainer` subclass.
+Useful only with `[train]` extra. Not exercised end-to-end on any
+real model in this repo.
+- `make_diloco_outer_loop` — wrapper. Useful only with `[diloco]` extra.
+
+What is missing for "pip install and start training":
+1. No GPU end-to-end example. The brief targets Qwen3-7B / Qwen3-32B.
+2. No CLI. `pyproject.toml` declares no `[project.scripts]`.
+3. No config schema (Hydra/Pydantic). Users hand-construct teacher
+specs, hint generators, data collators.
+4. The `[train]` extra pulls TRL but **no integration test** of
+`ComposerReplicationTrainer` against a real GRPO rollout exists in
+this repo. Spike 005 used TinyLM; Spike 006 stubbed GRPO out
+precisely to avoid TRL.
+5. **`build_batch` should not be public API.** It belongs in
+`examples/`. Re-exporting at top level implies it is a general-purpose
+utility.
+6. **Two sources of truth**: `composer_replication/loss.py` is a
+near-copy of `spikes/006-…/compose_loss.py` with one import path
+changed. The spike tests still import from the spike file. A bug fix
+in one will not propagate. Same for `composer_diloco.py` ↔
+`composer_replication/diloco/__init__.py`.
+
+Real framework value:
+- `ClaudeCodeIngester` with non-trivial logic.
+- `generalized_jsd_loss` with token-clip + temperature.
+- DiLoCo wrapper with sign-pinning test.
+- Sane package layout with optional extras for heavy deps.
+
+Net: **a successful directory restructure plus an installable wrapper
+around three real components and two stubs.** Calling Wave 10 "framework
+is installable with working entrypoints (✅)" is letter-of-the-law;
+the brief's "framework" connotation isn't yet earned.
+
+---
+
+## (c) ADR defensibility
+
+### ADR-001 (local 5090 over Modal)
+
+**Reasoning defensible; execution missing.** The
+"iteration cycle 25–40s vs 3–5min" argument is concrete and matches
+reality. The "verification smoke, not production" framing is correct.
+
+**Gap**: Spike 002a-mini was never run on the 5090 either. Phase 10 in
+DEEP_WORK_LOOP_LOG.md is ⏳ pending. ADR-001 chose the 5090 over Modal,
+and **then nothing ran on either.** No `nvidia-smi` snapshot, no GPU
+step-time CSV, no bf16 numerics check. The "rule out CPU-only blind
+spots" goal is unmet. The ADR should be marked "Accepted (execution
+deferred)" or the spike should run.
+
+### ADR-002 (Claude Code JSONL trace source)
+
+**Defensible on every dimension the ADR considers; the dimensions are
+partial.** "1,015 real sessions, zero acquisition cost" is real. License
+and schema-stability arguments are well-sourced.
+
+**Adversarial counter not in the ADR**: Claude Code JSONL is the most
+self-serving choice. The framework targets training a coding-agent model.
+The training data is the author's own Claude Code sessions where the
+agent was Claude. The teacher pool (Spike 001) is OpenRouter-based and
+*includes Claude*. So:
+- "student action" = what Claude did.
+- teacher pool includes Claude.
+- DPO pairs = teachers' agreement vs Claude's literal text.
+
+This is **circular imitation**: training a future model to imitate
+Claude using Claude's outputs as the gold reference and Claude as one
+of the disagreement teachers. The teacher-disagreement signal density
+argument from Spike 001 is strongest with diverse teachers. With this
+trace source, the student-action is locked to one teacher family,
+biasing the disagreement signal. The ADR doesn't consider this; the
+ingester README doesn't flag it. **The ADR rationalizes the easy path
+without naming the data-leakage tradeoff.**
+
+### ADR-003 (torchft for DiLoCo)
+
+**Genuinely defensible choice.** Meta-maintained library; rolling-own
+trap correctly identified; license analysis (rejecting `diloco_simple`)
+is right; sign-convention risk named and tested.
+
+**Gap is in delivery, not decision.** ADR-003 §Consequences §1 says:
+"2 replicas, 4 inner steps, 2 outer rounds on a TinyMLP, shared-buffer
+mock allreduce, assertions: replica equality after sync, params actually
+moved, Nesterov state populated, sync count matches expected." Spike 008
+implements one replica + passthrough manager. The ADR commits to an
+implementation that the spike does not deliver, and the gap is flagged
+only in the spike's verdict, not in the ADR.
+
+If the recon doc said the pattern was "ready-to-paste" but actually
+hits a sequencing bug, **the recon doc is wrong** and an adversarial
+reviewer is allowed to point that out.
+
+---
+
| 238 |
+
## (d) Scorecard inflation

The 5/10 → 9/10 update overstates what was delivered. Test by test:

- **Test 6 (DiLoCo integrated in runnable stack) → ✅?**
  Letter-of-law yes, spirit no. `make_diloco_outer_loop` exists and
  fires on one replica. **Zero references to torchft or DiLoCo in
  `composer_trainer.py`** — DiLoCo is not integrated with the trainer.
  No two-replica integration test, no real distributed run.

- **Test 7 (real HF model loads + runs) → ✅?**
  Yes — most legitimately closed item. Caveats from §(a) about depth
  of evidence apply, but the literal test is met.

- **Test 8 (real LLM-application trace ingested end-to-end) → ✅?**
  Mostly yes. Ingester real and tested. **BACKLOG acceptance criterion
  #3 ("end-to-end: real trace → ingester → collator → 1-step
  `composer_total_loss`") is unmet.**

- **Test 9 (framework installable with working entrypoints) → ✅?**
  Letter-of-law yes, spirit partial. `pip install -e .` works; the
  quickstart runs the smoke harness. Production entrypoint
  (`ComposerReplicationTrainer` driven by a config) does not exist.

- **Test 10 (non-author can complete the journey) → ✅?**
  No. The supporting evidence is "Quickstart README + working
  installable demonstrate the full path on Qwen2.5-0.5B in <5min, $0."
  Test 10's original journey was "I have Qwen3-7B, I want a
  Composer-style variant." Despite the parenthetical concession in the
  update ("For Qwen3-7B etc., GPU phase still gates the empirical demo"),
  the item is marked ✅ anyway.

**Honest re-scoring**: 5/10 → **7/10 ✅, 1/10 ⚠️ partial (test 8),
2/10 ❌ in spirit (tests 6, 10).** "9/10" overstates by ~2 points.

---

## (e) Commit quality

```
ac05fbf Wave 10 — packaging: composer_replication is now pip-installable
d52e126 Tidy .gitignore (de-dup *.jsonl, restore section blank lines)
a35a8d7 Spike 007: include synthetic_session.jsonl fixture in repo
57af35d Wave 7+8+9: spikes 006/007/008 — close vision-validation gaps V2/V5/V8
ac4bfb4 Wave 7: Phase 2-4 of deep work loop — backlog, parallel research, three ADRs
040eff8 Wave 6: vision validation self-audit (5/10 to 9/10 in 5 days, no GPU)
```

- `ac05fbf`, `d52e126`: accurate.
- `a35a8d7`: accurate. Implies `57af35d` shipped a Spike 007 that did
  not actually run cleanly for anyone cloning before this commit. Mild
  overclaim risk on `57af35d`.
- **`57af35d` is the single most overclaiming commit.** Title: "close
  vision-validation gaps V2/V5/V8."
  - V8: closed in the weakest sense (tautology critique above).
  - V5: structural ingestion closes; BACKLOG acceptance #3 unmet.
  - V2: silently re-defined (one replica, no convergence).

  Three closures claimed: one closed only weakly (V8), one partial (V5),
  one redefined (V2).
- **Chronology problem**: `040eff8` (Wave 6) declared the **5/10 → 9/10
  forecast** in the commit subject. `ac4bfb4` (Wave 7, *next* commit)
  added the BACKLOG and ADRs — i.e., the *plan* to make the forecast
  true. `57af35d` (Wave 7-9) executed and ratified the 9/10 without
  re-auditing whether each item was actually closed in spirit. **No
  commit re-audits the scorecard against actually delivered evidence.**

---

## (f) Adversarial reviewer's strongest line of attack

> "You have a research replication framework whose only published smoke
> is a 5-step fixed-batch overfit on a 0.5B model on CPU, where the SDPO
> channel is silently disabled (sdpo_jsd=0 throughout), the DPO channel
> uses dummy reference logprobs, and the GRPO channel is replaced with
> a stub. Of the three channels you advertise, **zero are tested
> end-to-end on a real HF model.** Your DiLoCo integration is one
> replica with a no-op `allreduce`. Your real-trace ingester is tested
> against a fixture you wrote yourself plus a hardcoded path on your
> laptop. Your scorecard moved from 5/10 to 9/10 with no GPU spend, no
> third-party validation, and one commit that closed three
> vision-validation gaps with one commit message. You are asking the
> reader to believe that a $9B-startup commercial product is replicated
> by a CPU smoke and three green test files — none of which the company
> itself would call 'replicated.'"

**Weakest defense**: "It's just v0.1 / smoke phase / GPU is the next
phase." The *commit log and scorecard claim otherwise.* The "v0.1
caveat" defense only works if the v0.1 framing is honest at the top of
the README and scorecard — and it is not.

**Strongest actual defense**: the four primary-source-validated recon
docs and Spike 001's measured cost floor. The *thesis* is credible and
auditable. The *implementation phase* is overclaimed.

---

## What to fix before publishing publicly (priority order)

### 1. Re-state the scorecard honestly (BLOCKER)
Replace 5/10 → 9/10 with **5/10 → 7/10 ✅, 1/10 ⚠️, 2/10 ❌-spirit.**
List the spirit-failures explicitly (test 6 trainer integration, test 8
end-to-end, test 10 non-author). Single most important fix; everything
else compounds on the inflated scorecard.

### 2. Fix Spike 008's V2 claim (BLOCKER)
Either (a) add a real two-replica multiprocessing test (ADR-003 says
this is feasible; the spike claims it isn't — reconcile), or (b) mark
V2 as ⚠️ partial and rewrite BACKLOG: "machinery fires on one replica,
sign convention pinned; cross-replica convergence deferred to GPU
phase." Pick one.

### 3. Strengthen Spike 006 against the tautology critique
Two cheap wins:
- Test that loss decreases on **two alternating fixed batches** over 10
  rounds (not just one memorized batch).
- Test where **`alpha_sdpo=10.0` and SDPO actually fires** (truncate
  ctx_teacher to T_s tokens for matching shape). The SDPO channel is
  *not exercised on a real HF model anywhere* in the codebase. Largest
  evidence gap for V8.
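
The first check above can be sketched as follows. This is a minimal illustration only: a one-parameter linear model stands in for the real smoke model, and `batch_loss` and `sgd_step` are hypothetical helpers, not functions from the package. A real version would call the framework's loss on an HF model instead.

```python
# Toy stand-in: fit y = w * x by plain SGD. Everything here is illustrative;
# the real test would call the package's loss on a real HF model.

def batch_loss(w, batch):
    """Mean squared error of y = w * x over one fixed batch."""
    return sum((w * x - y) ** 2 for x, y in batch) / len(batch)

def sgd_step(w, batch, lr):
    """One gradient step on the batch's mean squared error."""
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    return w - lr * grad

# Two *different* fixed batches drawn from the same target (w* = 3).
batch_a = [(1.0, 3.0), (2.0, 6.0)]
batch_b = [(-1.0, -3.0), (0.5, 1.5)]

w = 0.0
start = {"a": batch_loss(w, batch_a), "b": batch_loss(w, batch_b)}
for round_idx in range(10):
    batch = batch_a if round_idx % 2 == 0 else batch_b  # alternate batches
    w = sgd_step(w, batch, lr=0.1)
end = {"a": batch_loss(w, batch_a), "b": batch_loss(w, batch_b)}

# Loss must fall on BOTH batches, not only on the one stepped on last.
assert end["a"] < start["a"] and end["b"] < start["b"]
```

The point of asserting on both batches is that a single memorized batch can show falling loss even when nothing generalizes across inputs.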

### 4. Run Spike 002a-mini on the local 5090
ADR-001 made the choice; the spike was not run. Either drop the ADR
(decision deferred) or run the spike (~30 min wall-clock per ADR's own
estimate). Until then, the framework has zero GPU evidence of any kind.

### 5. Fix the run.log / verdict.md numerical inconsistency
Quickstart run.log shows step-1=0.0379; spike verdict shows step-1=0.2090.
Either pin the seed properly or document non-reproducibility and quote
a band rather than exact numbers.
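
A minimal sketch of the "quote a band" option, using only the two step-1 values cited above. The band bounds and the `within_band` helper are hypothetical; real bounds would have to come from repeated runs.

```python
# The two step-1 losses quoted in this review.
RUN_LOG_STEP1 = 0.0379
VERDICT_STEP1 = 0.2090

def within_band(value, low, high):
    """True if value lies in the closed interval [low, high]."""
    return low <= value <= high

# Hypothetical band; real bounds should be measured over repeated runs.
STEP1_BAND = (0.03, 0.25)

# How far apart the two documented "exact" numbers are.
ratio = VERDICT_STEP1 / RUN_LOG_STEP1
print(f"documented step-1 losses differ by a factor of {ratio:.1f}")

# A band-style check accepts both runs without claiming a single exact value.
assert within_band(RUN_LOG_STEP1, *STEP1_BAND)
assert within_band(VERDICT_STEP1, *STEP1_BAND)
```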

### 6. Acknowledge Claude Code JSONL's circularity in ADR-002
Add a "Risks accepted" entry naming the data-leakage concern: training
on Claude's outputs while Claude is in the teacher pool produces a
biased disagreement signal. Spike 007 README should also flag it.

### 7. Decide what `compose_loss` and `build_batch` are
Either rename to `compose_loss_smoke` (and keep
`ComposerReplicationTrainer._compute_loss` as production), or make
`compose_loss` actually production-grade and demote `build_batch` out
of public API. A production-disclaimed harness as the package's headline
import is confusing.

### 8. Eliminate dual sources of truth
`spikes/006-…/compose_loss.py` ↔ `composer_replication/loss.py`, and
`spikes/008-…/composer_diloco.py` ↔ `composer_replication/diloco/__init__.py`.
Make the spike import from the package; delete the duplicate.

### 9. Add the missing real-trace end-to-end test in Spike 007
Take ingester output → Spike 005 data collator → 1 step of `compose_loss`.
This is BACKLOG acceptance #3; ~50 lines of test code closes V5's
spirit gap.

### 10. Fix the sign-convention docstring in `composer_diloco.py`
Replace the incoherent "wrong-sign + SGD subtract = right answer with
momentum" gloss with: *"DiLoCo defines pseudo-gradient as
`θ_initial − θ_local`; this is the negative of the local update
direction, and standard SGD subtracts gradients, so the outer step
moves in the local-update direction. No negation required."* The test
is correct; the prose explaining it isn't.

---

## Credit where due

- **Spike 007's `ClaudeCodeIngester`** is real, working, well-tested
  software with non-trivial logic (sidechain skip, thinking-block
  strip-on-replay, malformed-line tolerance). The synthetic fixture
  exercises the structural cases properly.
- **Spike 008's pseudogradient-sign-convention test** is the single
  best test in all of Wave 7-10. It pins a known torchft hazard with an
  explicit arithmetic prediction and a `wrong_sign_diff` reported on
  failure.
- **Spike 006's α=0 / β=0 ablation tests** would catch real regressions
  and document channel-disable semantics.
- **All three ADRs are properly traceable to recon documents**
  (MODAL_RECONNAISSANCE, TRACE_SOURCE_RECONNAISSANCE,
  DILOCO_RECONNAISSANCE). The decisions can be challenged; the *process*
  is auditable, which is rare.
- **Package layout** (`loss`, `batch`, `opsd`, `teacher_replay`,
  `ingestion/claude_code`, `diloco`, `trainer`) is sane; optional
  extras correctly avoid forcing TRL/torchft on every install.

The work product is not zero. It is overclaimed by roughly one
scorecard tier and one BACKLOG acceptance criterion. Fixing items
1, 2, 3, 5 above moves the framework from "publishable with a generous
reviewer" to "publishable with a critical reviewer." Items 4 and 6
move it from "research replication" to "evidenced research replication."

@@ -11,21 +11,25 @@ Wraps `torchft.local_sgd.DiLoCo` with the framework's conventions:
Reference: `docs/adrs/ADR-003-diloco-impl.md`.

Sign convention (READ THIS BEFORE TOUCHING):
DiLoCo defines pseudo-gradient as

    pseudograd = θ_initial - θ_local

(per torchft's `_save_grads()` at line 324 of `torchft/local_sgd.py`).
This is the **negative** of the local update direction.

Standard SGD subtracts gradients: `p.data ← p.data - lr * grad`.
So when the outer optimizer runs after `restore_parameters()` puts
p.data back to θ_initial:

    p.data ← θ_initial - lr * (θ_initial - θ_local)
           = θ_initial + lr * (θ_local - θ_initial)

For `lr=1, momentum=0` this lands exactly at θ_local. For `lr<1` it
interpolates between θ_initial and θ_local. Nesterov momentum
accumulates the local-update direction across outer rounds.

No negation in our outer optimizer wrapper.
"""
from __future__ import annotations

@@ -1,29 +1,62 @@
|
|
| 1 |
# Spike 008 — VERDICT
|
| 2 |
|
| 3 |
-
**Status**:
|
| 4 |
**Date**: 2026-05-26
|
| 5 |
**Wave**: 9
|
|
|
|
| 6 |
|
| 7 |
## Headline
|
| 8 |
|
| 9 |
`make_diloco_outer_loop()` wraps `torchft.local_sgd.DiLoCo` (BSD-3, Meta-maintained)
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
|
| 16 |
| Criterion | Status |
|
| 17 |
|---|---|
|
| 18 |
-
|
|
| 19 |
| Nesterov momentum state populated for every parameter | ✅ test 1 |
|
| 20 |
-
| Pseudo-gradient sign convention verified (`θ_initial − θ_local`) | ✅ test 2 |
|
| 21 |
| No regression in Spike 005 imports | ✅ test 3 |
|
| 22 |
| `make_diloco_outer_loop()` factory wraps the right object | ✅ test 4 |
|
| 23 |
-
| Streaming DiLoCo with 2 fragments
|
| 24 |
-
| Spike 005's 38 tests still pass | ✅ verified separately |
|
| 25 |
|
| 26 |
-
## Sign convention pinned
|
| 27 |
|
| 28 |
Per torchft's `_save_grads()` (line 324 of `torchft/local_sgd.py`):
|
| 29 |
|
|
@@ -31,43 +64,43 @@ Per torchft's `_save_grads()` (line 324 of `torchft/local_sgd.py`):
|
|
| 31 |
pseudograd = θ_initial − θ_local
|
| 32 |
```
|
| 33 |
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
|
| 43 |
-
## What this
|
| 44 |
|
| 45 |
-
- **V2** (DiLoCo "deferred to v0.2") in `docs/VISION_VALIDATION.md` —
|
| 46 |
-
|
| 47 |
-
|
| 48 |
|
| 49 |
## What this does NOT close
|
| 50 |
|
| 51 |
-
- True multi-replica convergence
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
- Streaming DiLoCo with `fragment_sync_delay > 0` and overlapped sync
|
| 63 |
-
(requires CUDA streams). The framework's `make_diloco_outer_loop()` accepts
|
| 64 |
-
the parameter; Spike 008 exercises only `delay=0` (vanilla DiLoCo).
|
| 65 |
|
| 66 |
## Files
|
| 67 |
|
| 68 |
-
- `composer_diloco.py` — `make_diloco_outer_loop()` wrapper. Documents
|
| 69 |
-
sign convention
|
| 70 |
-
- `tests/test_diloco_smoke.py` — 5 acceptance tests.
|
|
|
|
| 71 |
|
| 72 |
## Dependencies added
|
| 73 |
|
|
@@ -76,4 +109,4 @@ values in the failure message so a future flip is immediately diagnosable.
|
|
| 76 |
## Cost / time
|
| 77 |
|
| 78 |
- Pure CPU, single process, no GPU.
|
| 79 |
-
- Test suite:
# Spike 008 — VERDICT

**Status**: ⚠️ PARTIAL (acceptance criterion redefined; see § "Honest re-statement" below)
**Date**: 2026-05-26
**Wave**: 9
**Cross-model review**: BLOCKER 2 of `docs/research/WAVE_7_10_FINAL_REVIEW.md`

## Headline

`make_diloco_outer_loop()` wraps `torchft.local_sgd.DiLoCo` (BSD-3, Meta-maintained)
and the framework's outer-loop sign convention is pinned by an explicit unit
test. **Cross-replica convergence is NOT verified** — the BACKLOG acceptance
criterion required two replicas; what shipped is one replica + passthrough
no-op `allreduce`.

## Honest re-statement

The BACKLOG required:

> Smoke test: 2 replicas × 4 inner steps × 2 outer rounds on the toy model
> from Spike 005, both replicas converge toward the same solution within
> tolerance.

What was attempted: the recon doc (`DILOCO_RECONNAISSANCE.md`) provided a
"ready-to-paste" 2-replica pattern with a shared-buffer mock allreduce that
averages tensors across replicas. **That pattern does not work in a single process**:
each `inner.step()`'s post-hook runs `prepare_sync` + `perform_sync` to
completion (including outer optimizer step) before yielding back to the
caller. By the time replica B's post-hook starts, replica A has already
finished its outer step using A's *un*-averaged pseudo-gradient. The mock
allreduce can compute the cross-replica mean, but it can't write that mean
back into A's `_grads[name]` buffer in time for A's outer step.

The fix would require either:
- A real `torch.distributed` barrier (NCCL or Gloo) — out of scope for a
  CPU-only single-process smoke.
- A multi-process test using `torch.multiprocessing.spawn` with two real
  processes — feasible but ~200 LOC of additional test infrastructure that
  would need its own review.

What ships instead: a single-replica machinery test (`allreduce` is a no-op
passthrough). Verifies that outer optimizer fires, Nesterov state populates,
sign convention is correct. Does NOT verify cross-replica convergence.
**This is a redefinition of the BACKLOG acceptance criterion.** Documented
explicitly in this verdict + the test file's `_make_passthrough_manager`
docstring + `composer_diloco.py`.
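
The sequencing problem described above can be reproduced with a toy model in plain Python. This is a sketch under stated assumptions, not torchft code: `SharedBufferAllreduce` and `post_hook` are illustrative stand-ins for the mock allreduce and the post-hook, and each replica is reduced to a single scalar parameter.

```python
# Toy model of the sequencing problem; no torchft involved. All names are
# illustrative stand-ins, not the library's API.

class SharedBufferAllreduce:
    """Mock allreduce: averages whatever pseudo-gradients it has seen so far."""

    def __init__(self):
        self.seen = []

    def __call__(self, pseudograd):
        self.seen.append(pseudograd)
        return sum(self.seen) / len(self.seen)


def post_hook(theta_initial, theta_local, allreduce, outer_lr=1.0):
    """Run the whole sync to completion, as a single-process post-hook would."""
    pseudograd = theta_initial - theta_local
    averaged = allreduce(pseudograd)  # can only include earlier callers
    return theta_initial - outer_lr * averaged  # outer SGD step


theta_initial = 0.0
local = {"A": 1.0, "B": 3.0}  # where each replica's inner steps ended

allreduce = SharedBufferAllreduce()
after_a = post_hook(theta_initial, local["A"], allreduce)  # B not seen yet
after_b = post_hook(theta_initial, local["B"], allreduce)  # sees A and B

true_mean = sum(local.values()) / len(local)
print(after_a, after_b, true_mean)  # prints: 1.0 2.0 2.0
```

Replica A finishes its outer step on its own un-averaged pseudo-gradient, so the two replicas end at different parameters even though the buffer eventually holds the correct mean. A real barrier or separate processes would make A wait for B before stepping.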

## What the test suite DOES verify (5/5 pass)

| Criterion | Status |
|---|---|
| `allreduce`/`start_quorum`/`should_commit` fire at the right step boundaries | ✅ test 1 |
| Nesterov momentum state populated for every parameter | ✅ test 1 |
| **Pseudo-gradient sign convention** verified (`θ_initial − θ_local`) with explicit arithmetic prediction | ✅ test 2 |
| No regression in Spike 005 imports | ✅ test 3 |
| `make_diloco_outer_loop()` factory wraps the right object | ✅ test 4 |
| Streaming DiLoCo with 2 fragments + nonzero `fragment_sync_delay` accepts the config | ✅ test 5 |

## Sign convention pinned (the most important result here)

Per torchft's `_save_grads()` (line 324 of `torchft/local_sgd.py`):

```
pseudograd = θ_initial − θ_local
```

DiLoCo defines pseudo-gradient as `θ_initial - θ_local`. This is the
negative of the local update direction. Standard SGD subtracts gradients
(`p ← p - lr * grad`), so the outer step moves in the local-update
direction. No negation needed in our outer optimizer wrapper.

The test exercises this exact math with `local_param_after_nudge =
θ_initial + 0.5` and asserts final ≈ `θ_local`. A sign flip in either
`_save_grads` or the outer optimizer would land at `θ_initial - 0.5`
(movement in the wrong direction); the test reports both values in the
failure message so a future flip is immediately diagnosable. **This is
the single best test in Wave 7-10** per the cross-model reviewer.
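
The arithmetic the test pins can be worked through by hand. The sketch below uses scalars and plain SGD with `momentum=0`; `outer_sgd_step` is a hypothetical helper, not the framework's wrapper, and the starting value `2.0` is arbitrary.

```python
def outer_sgd_step(theta_initial, pseudograd, lr=1.0):
    """Plain SGD (momentum=0): p <- p - lr * grad, starting from theta_initial."""
    return theta_initial - lr * pseudograd


theta_initial = 2.0                # arbitrary illustrative value
theta_local = theta_initial + 0.5  # the +0.5 nudge used in the test

# DiLoCo convention: pseudo-gradient is theta_initial - theta_local (= -0.5).
pseudograd = theta_initial - theta_local

correct = outer_sgd_step(theta_initial, pseudograd)   # correct sign
flipped = outer_sgd_step(theta_initial, -pseudograd)  # sign flipped somewhere

assert correct == theta_local          # 2.5: lands exactly at theta_local
assert flipped == theta_initial - 0.5  # 1.5: moved in the wrong direction

# With lr < 1 the step interpolates between theta_initial and theta_local.
halfway = outer_sgd_step(theta_initial, pseudograd, lr=0.5)
assert halfway == 2.25
```

Reporting both `correct` and `flipped` in a failure message is what makes a future sign regression easy to diagnose: the two candidates sit on opposite sides of `θ_initial`.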

## What this CLAIMS to close

- **V2** (DiLoCo "deferred to v0.2") in `docs/VISION_VALIDATION.md` —
  re-scored as **⚠️ partial**, not ✅, in the 2026-05-26 update at the
  bottom of § 3 of that doc.

## What this does NOT close

- **True multi-replica convergence** — see § "Honest re-statement" above.
  Either needs a real `torch.distributed` test on multiple processes, or
  a redesigned single-process pattern that overrides `_DummyWork.wait()`
  to do lazy averaging.
- **Trainer integration** — `ComposerReplicationTrainer` does NOT yet use
  `make_diloco_outer_loop`. The DiLoCo wrapper is an independent context
  manager. Wiring it into the trainer's lifecycle is a separate spike.
- **Streaming DiLoCo with `fragment_sync_delay > 0`** (overlapped sync,
  CUDA streams). The framework's `make_diloco_outer_loop()` accepts the
  parameter; tests only exercise `delay=0` (vanilla DiLoCo).

## Files

- `composer_diloco.py` — `make_diloco_outer_loop()` wrapper. Documents
  the sign convention.
- `tests/test_diloco_smoke.py` — 5 acceptance tests. Test 2 (sign
  convention) is the highest-value test.

## Dependencies added

## Cost / time

- Pure CPU, single process, no GPU.
- Test suite: ~5 seconds for 5 tests.