Wave 11: cross-model adversarial review + honest down-revision

f16fa23 13 days ago

20.5 kB

	# Wave 7–10 Final Review — Cross-model adversarial check

	Reviewer: external model, Phase 11 of the deep work loop.
	Date: 2026-05-26.
	Mandate: find substantive flaws. The research thesis is
	primary-source-validated; this attacks implementation correctness and
	scope creep, not the thesis.

	---

	## (a) Are the tests real evidence or theater?

	### Spike 006 (Qwen2.5-0.5B-Instruct CPU smoke, 9 tests)

	Verdict: mostly tautology, with usable ablation tests.

	The headline "loss 0.7390 → 0.0031, 99.6% reduction in 5 steps" is
	technically true and substantively near-tautological:

	1. The same fixed ~50-token batch is reused for all 5 steps.
	`build_batch` returns one conversation; the test loop calls
	`compose_loss(model, batch)` five times in a row. No reshuffle, no
	second batch, no held-out anything.
	2. **0.5B params × AdamW(lr=1e-5) × identical 50-token batch ×
	5 steps = textbook memorization regime.** A randomly-initialized MLP
	would also reduce loss in this setup. The test does not distinguish
	"the 3-channel composition is correct" from "AdamW reduces fixed-batch
	loss on any non-degenerate objective."
	3. The SDPO channel is zero throughout (`sdpo_jsd=0.0` on every
	row of `loss_curve.csv`). The verdict calls this "correct fallback
	behavior"; what it actually is is *the entire SDPO channel never
	being tested by this smoke*. The fallback is a literal `_zero(device)`.
	`generalized_jsd_loss` has no end-to-end test on a real HF model
	anywhere in the codebase. **This is the largest evidence gap for
	V8.**
	4. DPO uses dummy hard-coded reference logprobs (`-30.0`, `-35.0`).
	This tests that `-logsigmoid(small_positive)` is differentiable, not
	that the trace-replay-DPO pipeline (reference-policy precompute +
	collator + loss) wires together.
	5. The "loss decreases" assertion is `losses[-1] < losses[0]` — the
	weakest version of monotonicity.

	Genuine value: model-loads test, chat-template test, the three
	α=0 / β=0 ablation tests. The ablations would catch a regression where
	weights stop disabling channels. The "5-step decrease" is the weakest
	test in the file.

	Run.log inconsistency, not flagged anywhere:
	`examples/qwen_05b_quickstart/run.log` shows step-1 total = 0.0379;
	the spike `verdict.md` quotes step-1 total = 0.2090 for the same code,
	same model. Either the seed isn't pinned through the model forward
	(likely — `torch.manual_seed(42)` is in `build_batch` only), or the
	package's `compose_loss` differs subtly from the spike's. **Quoting
	exact numbers from a non-reproducible run as evidence is the sloppy
	version of every research-replication scandal.**

	### Spike 007 (Claude Code ingester, 15 tests)

	Verdict: strongest test suite of the three. Caveats apply.

	Real engineering value:
	- Synthetic fixture exercises the actual record types (assistant,
	user/tool_result, summary, system, sidechain). Tests assert structural
	properties: history grows monotonically, `[THINKING]` stripped on
	replay but kept in student_action, unique state_ids, tool_use
	serialization, tool_result tagging.
	- `test_truncated_line_tolerated` would catch a real failure-mode
	removal of the JSON-decode try/except.
	- Subagent and sidechain skip tests catch real production cases.

	Caveats:
	- **The "real session" test is hardcoded to one path on the author's
	machine** (`/home/codeseys/.claude/projects/…/e4a34e2b-….jsonl`).
	No env var, no fixture-discovery; the test is `skipif(not exists)`.
	This is a manual integration test, not a CI test. ADR-002 said "CI
	users substitute their own"; the substitution mechanism doesn't
	exist.
	- The synthetic fixture is author-written and presumably designed
	alongside the ingester. There is no scrubbed third-party fixture.
	- Acceptance criterion #3 in BACKLOG ("end-to-end smoke: real trace →
	ingester → collator → 1-step `composer_total_loss`") is unmet —
	the spike stops at "ingester emits TraceStates correctly." There is
	no test that takes ingested records, runs the data collator, and runs
	through `compose_loss`.

	This suite would catch real regressions in the ingester. Its weakest
	property: ships no contributor-runnable real-trace test.

	### Spike 008 (DiLoCo, 5 tests, single-process)

	**Verdict: the caveat is honest but says the test does not test what
	users will assume it tests.**

	BACKLOG acceptance criterion: *"Smoke test: 2 replicas × 4 inner steps
	× 2 outer rounds on the toy model from Spike 005, both replicas converge
	toward the same solution within tolerance."*

	What ships: one replica, mock manager whose `allreduce` is a
	`passthrough` no-op (test_diloco_smoke.py:78). This is "one
	replica's outer optimizer machinery fires," not "two replicas
	converge." The acceptance criterion was silently re-defined; the
	spike's verdict.md calls this a "limitation" but it is a redefinition.

	The recon doc (per ADR-003) claimed a "ready-to-paste" pattern with real
	shared-buffer averaging. The implementation hits a "post-hook
	sequencing bug." **One of the recon claim and the implementation is
	wrong**, and the gap is buried in verdict.md instead of fixed.

	Genuine value: `test_diloco_pseudogradient_sign_convention` is the
	single best test in all of Wave 7-10. It pins the sign convention
	with a concrete arithmetic prediction (`final == θ_initial + nudge`)
	and reports `wrong_sign_diff` on failure. A future torchft upgrade that
	flips the sign breaks this test loudly. ADR-003 specifically flagged
	this hazard, and the test catches it. Credit where due.

	Separate flaw in `composer_diloco.py` docstring (lines 13–28): the
	"wrong-sign pseudogradient combined with SGD's subtract-grad semantics
	gives net step in the local-Δ direction once momentum builds up" gloss
	is incoherent. There is no "wrong-sign" pseudogradient.
	`θ_initial − θ_local` is the exact DiLoCo paper convention; SGD's
	`p ← p − lr·g` semantics are designed for it. The test is correct; the
	prose explaining why is wrong, and will mislead anyone porting the
	convention.

	---

	## (b) Is the package a real framework or a shim?

	**Verdict: a structured shim around three real components and two
	stubs. Not yet a framework.**

	What `pip install composer-replication` delivers:
	- `compose_loss` — labeled in its own top docstring as "Do NOT use as
	the production training loss." Re-exported as the headline package
	API and used in the quickstart.
	- `build_batch` — a hard-coded fixed-conversation factory built for the
	smoke (factorial / binary-search examples). Anyone using this in
	real training is using example code as production.
	- `ClaudeCodeIngester` — real, working component. Solid.
	- `generalized_jsd_loss` — real, working (extracted from OPSD, MIT).
	- `extract_dpo_pairs`, `replay_trace`, teacher specs — real, but
	require OpenRouter credentials + spend.
	- `ComposerReplicationTrainer` — TRL `GRPOTrainer` subclass.
	Useful only with `[train]` extra. Not exercised end-to-end on any
	real model in this repo.
	- `make_diloco_outer_loop` — wrapper. Useful only with `[diloco]` extra.

	What is missing for "pip install and start training":
	1. No GPU end-to-end example. The brief targets Qwen3-7B / Qwen3-32B.
	2. No CLI. `pyproject.toml` declares no `[project.scripts]`.
	3. No config schema (Hydra/Pydantic). Users hand-construct teacher
	specs, hint generators, data collators.
	4. The `[train]` extra pulls TRL but no integration test of
	`ComposerReplicationTrainer` against a real GRPO rollout exists in
	this repo. Spike 005 used TinyLM; Spike 006 stubbed GRPO out
	precisely to avoid TRL.
	5. `build_batch` should not be public API. It belongs in
	`examples/`. Re-exporting at top level implies it is a general-purpose
	utility.
	6. Two sources of truth: `composer_replication/loss.py` is a
	near-copy of `spikes/006-…/compose_loss.py` with one import path
	changed. The spike tests still import from the spike file. A bug fix
	in one will not propagate. Same for `composer_diloco.py` ↔
	`composer_replication/diloco/__init__.py`.

	Real framework value:
	- `ClaudeCodeIngester` with non-trivial logic.
	- `generalized_jsd_loss` with token-clip + temperature.
	- DiLoCo wrapper with sign-pinning test.
	- Sane package layout with optional extras for heavy deps.

	Net: **a successful directory restructure plus an installable wrapper
	around three real components and two stubs.** Calling Wave 10 "framework
	is installable with working entrypoints (✅)" is letter-of-the-law;
	the brief's "framework" connotation isn't yet earned.

	---

	## (c) ADR defensibility

	### ADR-001 (local 5090 over Modal)

	Reasoning defensible; execution missing. The
	"iteration cycle 25–40s vs 3–5min" argument is concrete and matches
	reality. The "verification smoke, not production" framing is correct.

	Gap: Spike 002a-mini was never run on the 5090 either. Phase 10 in
	DEEP_WORK_LOOP_LOG.md is ⏳ pending. ADR-001 chose the 5090 over Modal,
	and then nothing ran on either. No `nvidia-smi` snapshot, no GPU
	step-time CSV, no bf16 numerics check. The "rule out CPU-only blind
	spots" goal is unmet. The ADR should be marked "Accepted (execution
	deferred)" or the spike should run.

	### ADR-002 (Claude Code JSONL trace source)

	**Defensible on every dimension the ADR considers; the dimensions are
	partial.** "1,015 real sessions, zero acquisition cost" is real. License
	and schema-stability arguments are well-sourced.

	Adversarial counter not in the ADR: Claude Code JSONL is the most
	self-serving choice. The framework targets training a coding-agent model.
	The training data is the author's own Claude Code sessions where the
	agent was Claude. The teacher pool (Spike 001) is OpenRouter-based and
	includes Claude. So:
	- "student action" = what Claude did.
	- teacher pool includes Claude.
	- DPO pairs = teachers' agreement vs Claude's literal text.

	This is circular imitation: training a future model to imitate
	Claude using Claude's outputs as the gold reference and Claude as one
	of the disagreement teachers. The teacher-disagreement signal density
	argument from Spike 001 is strongest with diverse teachers. With this
	trace source, the student-action is locked to one teacher family,
	biasing the disagreement signal. The ADR doesn't consider this; the
	ingester README doesn't flag it. **The ADR rationalizes the easy path
	without naming the data-leakage tradeoff.**

	### ADR-003 (torchft for DiLoCo)

	Genuinely defensible choice. Meta-maintained library; rolling-own
	trap correctly identified; license analysis (rejecting `diloco_simple`)
	is right; sign-convention risk named and tested.

	Gap is in delivery, not decision. ADR-003 §Consequences §1 says:
	"2 replicas, 4 inner steps, 2 outer rounds on a TinyMLP, shared-buffer
	mock allreduce, assertions: replica equality after sync, params actually
	moved, Nesterov state populated, sync count matches expected." Spike 008
	implements one replica + passthrough manager. The ADR commits to an
	implementation that the spike does not deliver, and the gap is flagged
	only in the spike's verdict, not in the ADR.

	If the recon doc said the pattern was "ready-to-paste" but actually
	hits a sequencing bug, the recon doc is wrong and an adversarial
	reviewer is allowed to point that out.

	---

	## (d) Scorecard inflation

	The 5/10 → 9/10 update overstates. Test by test:

	- Test 6 (DiLoCo integrated in runnable stack) → ✅?
	Letter-of-law yes, spirit no. `make_diloco_outer_loop` exists and
	fires on one replica. **Zero references to torchft or DiLoCo in
	`composer_trainer.py`** — DiLoCo is not integrated with the trainer.
	No two-replica integration test, no real distributed run.

	- Test 7 (real HF model loads + runs) → ✅?
	Yes — most legitimately closed item. Caveats from §(a) about depth
	of evidence apply, but the literal test is met.

	- Test 8 (real LLM-application trace ingested end-to-end) → ✅?
	Mostly yes. Ingester real and tested. **BACKLOG acceptance criterion
	#3 ("end-to-end: real trace → ingester → collator → 1-step
	`composer_total_loss`") is unmet.**

	- Test 9 (framework installable with working entrypoints) → ✅?
	Letter-of-law yes, spirit partial. `pip install -e .` works; the
	quickstart runs the smoke harness. Production entrypoint
	(`ComposerReplicationTrainer` driven by a config) does not exist.

	- Test 10 (non-author can complete the journey) → ✅?
	No. The supporting evidence is "Quickstart README + working
	installable demonstrate the full path on Qwen2.5-0.5B in <5min, $0."
	Test 10's original journey was "I have Qwen3-7B, I want a
	Composer-style variant." The parenthetical concession in the update
	("For Qwen3-7B etc., GPU phase still gates the empirical demo")
	✅'s the item anyway.

	Honest re-scoring: 5/10 → **7/10 ✅, 1/10 ⚠️ partial (test 8),
	2/10 ❌ in spirit (tests 6, 10).** "9/10" overstates by ~2 points.

	---

	## (e) Commit quality

	```
	ac05fbf Wave 10 — packaging: composer_replication is now pip-installable
	d52e126 Tidy .gitignore (de-dup *.jsonl, restore section blank lines)
	a35a8d7 Spike 007: include synthetic_session.jsonl fixture in repo
	57af35d Wave 7+8+9: spikes 006/007/008 — close vision-validation gaps V2/V5/V8
	ac4bfb4 Wave 7: Phase 2-4 of deep work loop — backlog, parallel research, three ADRs
	040eff8 Wave 6: vision validation self-audit (5/10 to 9/10 in 5 days, no GPU)
	```

	- `ac05fbf`, `d52e126`: accurate.
	- `a35a8d7`: accurate. Implies `57af35d` shipped a Spike 007 that did
	not actually run cleanly for anyone cloning before this commit. Mild
	overclaim risk on `57af35d`.
	- `57af35d` is the single most overclaiming commit. Title: "close
	vision-validation gaps V2/V5/V8."
	- V8: closed in the weakest sense (tautology critique above).
	- V5: structural ingestion closes; BACKLOG acceptance #3 unmet.
	- V2: silently re-defined (one replica, no convergence).
	Three closures claimed; one partial, one redefined.
	- Chronology problem: `040eff8` (Wave 6) declared the **5/10 → 9/10
	forecast** in the commit subject. `ac4bfb4` (Wave 7, next commit)
	added the BACKLOG and ADRs — i.e., the plan to make the forecast
	true. `57af35d` (Wave 7-9) executed and ratified the 9/10 without
	re-auditing whether each item was actually closed in spirit. **No
	commit re-audits the scorecard against actually delivered evidence.**

	---

	## (f) Adversarial reviewer's strongest line of attack

	> "You have a research replication framework whose only published smoke
	> is a 5-step fixed-batch overfit on a 0.5B model on CPU, where the SDPO
	> channel is silently disabled (sdpo_jsd=0 throughout), the DPO channel
	> uses dummy reference logprobs, and the GRPO channel is replaced with
	> a stub. Of the three channels you advertise, **zero are tested
	> end-to-end on a real HF model.** Your DiLoCo integration is one
	> replica with a no-op `allreduce`. Your real-trace ingester is tested
	> against a fixture you wrote yourself plus a hardcoded path on your
	> laptop. Your scorecard moved from 5/10 to 9/10 with no GPU spend, no
	> third-party validation, and one commit that closed three vision-
	> validation gaps with one commit message. You are asking the reader to
	> believe that a $9B-startup commercial product is replicated by a CPU
	> smoke and three green test files — none of which the company itself
	> would call 'replicated.'"

	Weakest defense: "It's just v0.1 / smoke phase / GPU is the next
	phase." The commit log and scorecard claim otherwise. The defense
	"v0.1 caveat" only works if the v0.1 framing is honest at the top of
	the README and scorecard — and it is not.

	Strongest actual defense: the four primary-source-validated recon
	docs and Spike 001's measured cost floor. The thesis is credible and
	auditable. The implementation phase is overclaimed.

	---

	## What to fix before publishing publicly (priority order)

	### 1. Re-state the scorecard honestly (BLOCKER)
	Replace 5/10 → 9/10 with 5/10 → 7/10 ✅, 1/10 ⚠️, 2/10 ❌-spirit.
	List the spirit-failures explicitly (test 6 trainer integration, test 8
	end-to-end, test 10 non-author). Single most important fix; everything
	else compounds on the inflated scorecard.

	### 2. Fix Spike 008's V2 claim (BLOCKER)
	Either (a) add a real two-replica multiprocessing test (ADR-003 says
	this is feasible; the spike claims it isn't — reconcile), or (b) mark
	V2 as ⚠️ partial and rewrite BACKLOG: "machinery fires on one replica,
	sign convention pinned; cross-replica convergence deferred to GPU
	phase." Pick one.

	### 3. Strengthen Spike 006 against the tautology critique
	Two cheap wins:
	- Test that loss decreases on two alternating fixed batches over 10
	rounds (not just one memorized batch).
	- Test where `alpha_sdpo=10.0` and SDPO actually fires (truncate
	ctx_teacher to T_s tokens for matching shape). The SDPO channel is
	not exercised on a real HF model anywhere in the codebase. Largest
	evidence gap for V8.

	### 4. Run Spike 002a-mini on the local 5090
	ADR-001 made the choice; the spike was not run. Either drop the ADR
	(decision deferred) or run the spike (~30 min wall-clock per ADR's own
	estimate). Until then, the framework has zero GPU evidence of any kind.

	### 5. Fix the run.log / verdict.md numerical inconsistency
	Quickstart run.log shows step-1=0.0379; spike verdict shows step-1=0.2090.
	Either pin the seed properly or document non-reproducibility and quote
	a band rather than exact numbers.

	### 6. Acknowledge Claude Code JSONL's circularity in ADR-002
	Add a "Risks accepted" entry naming the data-leakage concern: training
	on Claude's outputs while Claude is in the teacher pool produces a
	biased disagreement signal. Spike 007 README should also flag it.

	### 7. Decide what `compose_loss` and `build_batch` are
	Either rename to `compose_loss_smoke` (and keep
	`ComposerReplicationTrainer._compute_loss` as production), or make
	`compose_loss` actually production-grade and demote `build_batch` out
	of public API. Production-disclaimed harness as the package's headline
	import is confusing.

	### 8. Eliminate dual sources of truth
	`spikes/006-…/compose_loss.py` ↔ `composer_replication/loss.py`, and
	`spikes/008-…/composer_diloco.py` ↔ `composer_replication/diloco/__init__.py`.
	Make the spike import from the package; delete the duplicate.

	### 9. Add the missing real-trace end-to-end test in Spike 007
	Take ingester output → Spike 005 data collator → 1 step of `compose_loss`.
	This is BACKLOG acceptance #3; ~50 lines of test code closes V5's
	spirit gap.

	### 10. Fix the sign-convention docstring in `composer_diloco.py`
	Replace the incoherent "wrong-sign + SGD subtract = right answer with
	momentum" gloss with: *"DiLoCo defines pseudo-gradient as
	`θ_initial − θ_local`; this is the negative of the local update
	direction, and standard SGD subtracts gradients, so the outer step
	moves in the local-update direction. No negation required."* The test
	is correct; the prose explaining it isn't.

	---

	## Credit where due

	- Spike 007's `ClaudeCodeIngester` is real, working, well-tested
	software with non-trivial logic (sidechain skip, thinking-block
	strip-on-replay, malformed-line tolerance). The synthetic fixture
	exercises the structural cases properly.
	- Spike 008's pseudogradient-sign-convention test is the single
	best test in all of Wave 7-10. It pins a known torchft hazard with an
	explicit arithmetic prediction and a `wrong_sign_diff` reported on
	failure.
	- Spike 006's α=0 / β=0 ablation tests would catch real regressions
	and document channel-disable semantics.
	- All three ADRs are properly traceable to recon documents
	(MODAL_RECONNAISSANCE, TRACE_SOURCE_RECONNAISSANCE,
	DILOCO_RECONNAISSANCE). The decisions can be challenged; the process
	is auditable, which is rare.
	- Package layout (`loss`, `batch`, `opsd`, `teacher_replay`,
	`ingestion/claude_code`, `diloco`, `trainer`) is sane; optional
	extras correctly avoid forcing TRL/torchft on every install.

	The work product is not zero. It is overclaimed by roughly one
	scorecard tier and one BACKLOG acceptance criterion. Fixing items
	1, 2, 3, 5 above moves the framework from "publishable with a generous
	reviewer" to "publishable with a critical reviewer." Items 4 and 6
	move it from "research replication" to "evidenced research replication."