Wave 6: vision validation self-audit (5/10 to 9/10 in 5 days, no GPU)

Self-audit of whether the framework actually encapsulates the original
brief. Recovered the brief verbatim from the originating session, decomposed
into 8 atomic clauses (V1-V8), built a traceability matrix to current
deliverables, and ran adversarial self-review.

Honest scorecard: 5/10 pass, 4/10 fail, 1/10 partial.

Strengths verified:
V1 Composer 2.5 understanding (audited)
V3 Five-component stack (TRL+VeRL coded; OpenEnv default; Forge paused; Monarch design-aligned)
V6 N-teacher distillation economic feasibility
V7 Rigorous research process

Real gaps found:
V2 DiLoCo: deferred to v0.2 but brief said "combine with diloco" - drift
V4 "framework" vs "skeleton": no installable package, no examples dir
V5 "real llm-application traces" - we used synthetic 50-state fixture
V8 "any HF model" - architecture generic, never tested on real HF model

Three CPU-only gap-closer spikes proposed:
Spike 006 (real-HF-model smoke, half day) - closes V8
Spike 007 (real-trace ingestion adapter, 1 day) - closes V5
Spike 008 (Streaming DiLoCo smoke, 2 days) - closes V2

Plus Wave 7 packaging (half day) closes V4.

Total: ~5 days CPU-only work takes framework from 5/10 to 9/10 vision
encapsulation. The remaining 1/10 ("does the method improve training")
is the GPU-bound empirical question that gates the v0.1 paper - correctly
out-of-scope for vision encapsulation.

Adversarial self-review answered 5 objections including:
- "synthetic states don't generalize to real 10K-token rollouts"
- "VeRL path has zero unit tests"
- "Monarch integration is paper-only"
- "any HF model overstates - what about base no-chat-template / VLM?"

Files:
docs/VISION_VALIDATION.md (29KB, 236 lines, comprehensive)
README.md (added vision-audit link)

No code changed. 38/38 tests still pass. Nothing posted publicly.

Files changed (2) hide show

README.md +2 -0
docs/VISION_VALIDATION.md +236 -0

README.md CHANGED Viewed

@@ -40,6 +40,8 @@ This repository is the **"paper of the project"** — it is the methodology / re
 📝 **Publication materials drafted:** [`publications/`](publications/) contains a complete pre-experimental release set — longform methodology paper, blog post (HF Blog format), repo Discussion announcement, X/LinkedIn threads, plus `CITATION.cff` and `CITATION.bib` at the repo root. Use [`publications/RELEASE_CHECKLIST.md`](publications/RELEASE_CHECKLIST.md) to coordinate the publication wave. Nothing posted publicly yet — this is a pre-experimental release, not a post-experimental one.
 See [`spikes/README.md`](spikes/README.md) for the 5-stage spike plan, [`docs/INTEGRATION_ARCHITECTURE.md`](docs/INTEGRATION_ARCHITECTURE.md) for the per-framework extension-point analysis, and [`spikes/005-integrated-trainer-skeleton/`](spikes/005-integrated-trainer-skeleton/) for runnable trainer code.
 ---

 📝 **Publication materials drafted:** [`publications/`](publications/) contains a complete pre-experimental release set — longform methodology paper, blog post (HF Blog format), repo Discussion announcement, X/LinkedIn threads, plus `CITATION.cff` and `CITATION.bib` at the repo root. Use [`publications/RELEASE_CHECKLIST.md`](publications/RELEASE_CHECKLIST.md) to coordinate the publication wave. Nothing posted publicly yet — this is a pre-experimental release, not a post-experimental one.
+🔍 **Vision encapsulation self-audit:** [`docs/VISION_VALIDATION.md`](docs/VISION_VALIDATION.md) — clause-by-clause traceability matrix of the original brief against current deliverables, honest gap analysis (4 real gaps identified), adversarial self-review (5 strongest objections steelmanned), 10-point pass/fail scorecard, and a recommended sequence of three CPU-only spikes (006/007/008) + a packaging wave that takes the framework from **5/10 to 9/10 vision encapsulation in ~5 days, no GPU budget required**. The remaining 1/10 (does the method actually improve training) is the empirical question that gates the v0.1 paper.
 See [`spikes/README.md`](spikes/README.md) for the 5-stage spike plan, [`docs/INTEGRATION_ARCHITECTURE.md`](docs/INTEGRATION_ARCHITECTURE.md) for the per-framework extension-point analysis, and [`spikes/005-integrated-trainer-skeleton/`](spikes/005-integrated-trainer-skeleton/) for runnable trainer code.
 ---

docs/VISION_VALIDATION.md ADDED Viewed

	@@ -0,0 +1,236 @@

+# Vision Validation: Does the Framework Encapsulate the Original Brief?
+> **Status:** Self-audit, 2026-05-25 (Wave 6).
+> **Question:** Does what we've built reflect what was originally asked for, or did we drift?
+> **Method:** Recover original brief verbatim → atomic-clause decomposition → traceability matrix → adversarial self-review → user-journey simulation → concrete pass/fail scorecard with gap-closing actions.
+This document is **uncomfortable on purpose.** Unit tests answer "does the code work"; this answers "is the code doing what was asked." Those are different questions, and skipping the second is a common failure mode in research projects that drift between brief and ship.
+## 1. The original vision, recovered verbatim
+From the originating message in session `20260525_005800_723eccb8` (timestamp `1779696689.2033243`):
+> *"can you dive into Composer 2.5 and understand what makes it so much better? I want to see if I can take that and combine it with **diloco (decoupled, open, any variant of diloco)** and monarch/torchforge/openenv/VeRL/TRL and make a framework that we can use to further RL training of models to take them to the next level. One of the ideas that I had that might be a parallel to this is to **use traces from an llm-application usage** then **replay the traces with different models** to see at each llm-step what the llm would do. by doing this we get distillation data from any number of models that could be used to train the target model further. can we reserach all of this and see how we could try to set this up as a framework **to take any model from huggingface and be able to further RL train it to get results to Composer 2.5 which is post-trained kimi-k2.5**"*
+Atomic-clause decomposition (the unit each deliverable maps onto):
+| Clause | Vision element | Verbatim phrasing |
+|---|---|---|
+| **V1** | Understand Composer 2.5 internals | *"dive into Composer 2.5 and understand what makes it so much better"* |
+| **V2** | Integrate **DiLoCo** (any variant: decoupled, open, etc.) | *"combine it with **diloco (decoupled, open, any variant of diloco)**"* |
+| **V3** | Integrate **Monarch / TorchForge / OpenEnv / VeRL / TRL** | *"and monarch/torchforge/openenv/VeRL/TRL"* |
+| **V4** | Build it as a **framework** (not a one-off recipe) | *"make a framework that we can use to further RL training of models"* |
+| **V5** | **Trace-replay from real llm-application usage** as the novel idea | *"use traces from an llm-application usage then replay the traces with different models to see at each llm-step"* |
+| **V6** | N-teacher distillation from those traces | *"distillation data from any number of models that could be used to train the target model further"* |
+| **V7** | Research the whole space rigorously | *"can we reserach all of this"* |
+| **V8** | **Generalize to any HF model**, target Composer-2.5-quality outcomes | *"to take **any model from huggingface** and be able to further RL train it to get results to Composer 2.5"* |
+## 2. Traceability matrix — what's where in the repo
+Map each vision clause to its concrete deliverable. Citations are file paths in this repository.
+| Vision | Status | Deliverable evidence | Honest assessment |
+|---|---|---|---|
+| **V1** Composer 2.5 internals | 🟢 Strong | `research/01-composer-2.5.md` (parallel-research dispatch), `docs/COMPOSER_RECIPE_MAPPING.md` (primary-source audit, every claim tagged `[BLOG-VERIFIED]` / `[INFERRED]` / `[EXTRAPOLATED]`) | Caught and patched the SDPO/OPSD discovery that the initial dispatch missed. Audit notice on the original research note. **Solid.** |
+| **V2** DiLoCo integration | 🟡 **Deferred** | `research/02-diloco-family.md` (covered conceptually); `framework/composer-replication-framework.md` § "Distributed sync" (decision: defer until multi-cluster); `docs/INTEGRATION_ARCHITECTURE.md` (mentions DiLoCo as v0.2) | We decided DiLoCo is v0.2 work. **The decision is documented but it is a deviation from the original brief.** The brief said "combine it with diloco," not "consider diloco." See § 4.1. |
+| **V3** Monarch / TorchForge / OpenEnv / VeRL / TRL | 🟢 Strong | `research/03-monarch-torchforge-openenv.md`, `research/04-verl-trl.md`, `docs/INTEGRATION_ARCHITECTURE.md` (extension-point matrix), `spikes/005-integrated-trainer-skeleton/` (TRL + VeRL working code) | TRL and VeRL paths are coded; OpenEnv is the env substrate; Monarch + TorchForge are correctly assessed (Forge is "development paused"). **Five out of five components addressed; two have working code, three are correctly characterized as patterns/reference.** |
+| **V4** Framework, not one-off recipe | 🟡 **Skeleton, not framework** | `spikes/005-integrated-trainer-skeleton/` has component-modular code (`opsd_loss.py`, `teacher_replay.py`, `data_collator.py`, two trainer paths); `docs/INTEGRATION_ARCHITECTURE.md` documents the composition contract | What we have is a **trainer skeleton with verified composition**, not yet a productized framework with installable package, CLI, examples directory. See § 4.2. |
+| **V5** Real llm-application traces | 🔴 **Substituted with synthetic** | `spikes/001-teacher-replay-cost/synthesize_trace.py` builds 50 hand-crafted SWE-bench-lite-shaped states; `spikes/002a-trace-collection-trl/README.md` plans the real-trace path but unrun | **The brief explicitly says "traces from an llm-application usage."** We validated the *replay mechanism* on synthetic states. Real traces from a real agentic application are not yet ingested. See § 4.3. |
+| **V6** N-teacher distillation | 🟢 Strong | `spikes/001-teacher-replay-cost/` ($0.98/trace verified, 150 real OpenRouter calls, 0 errors); `spikes/005-integrated-trainer-skeleton/teacher_replay.py` (DPO-pair extractor, 7 unit tests); economic feasibility is the strongest empirical result so far | Verified. The novel claim's economic floor is established. **Strongest part of the work.** |
+| **V7** Rigorous research | 🟢 Strong | 5 deep-dives by 5 LLM families (`research/01..05`), 16KB methodology paper (`publications/PAPER_v0.md`), recipe-mapping audit, integration architecture, 38/38 unit tests, full citation graph | Process discipline visible: blog audit caught primary-source omissions, DeepWiki audits verify framework extension surfaces, every claim is sourced. **Solid.** |
+| **V8** Any HF model → Composer-quality | 🔴 **Architecturally yes, empirically untested** | `spikes/005-integrated-trainer-skeleton/` plans `Qwen3-7B` (v0.0) → `Qwen3-32B` (v0.1); but the only smoke test is on a 10K-parameter custom `TinyLM`. No real HF model has been touched yet. | **Massive gap between architecture and evidence.** The framework targets `AutoModel.from_pretrained(...)` but has never loaded one. See § 4.4. |
+## 3. Honest scorecard
+Ten concrete pass/fail tests covering both "do we encapsulate the vision" and "is what we have actually correct":
+| # | Test | Pass/Fail | Evidence | Gap-closer (if fail) |
+|---|---|---|---|---|
+| 1 | Original brief is recoverable verbatim and clause-decomposed | ✅ | § 1 of this doc | — |
+| 2 | Each of {Composer, DiLoCo, Monarch, Forge, OpenEnv, VeRL, TRL, trace-replay, HF-base} has a documented deliverable | ✅ | § 2 traceability | — |
+| 3 | The Composer 2.5 mechanism (SDPO/OPSD link) is correctly identified | ✅ | `docs/COMPOSER_RECIPE_MAPPING.md` § 2.1; primary-source audited | — |
+| 4 | The novel TR-DPO channel is empirically feasible (not just paper) | ✅ | spike 001: 150 real calls, $0.98/trace, 0 errors | — |
+| 5 | All three reward channels compose and don't fight each other | ✅ | spike 005 `test_loss_composition_smoke.py`: 5-step train decreases loss | — |
+| 6 | DiLoCo is integrated *somewhere* in the runnable stack | ❌ | Conceptually documented in `research/02`; **no code** | Spike 008 (proposed § 6) — Streaming DiLoCo outer loop on a stub trainer |
+| 7 | A real HuggingFace model can load + run a single forward pass through `ComposerReplicationTrainer` | ❌ | TinyLM only; never `AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.5B")` | Spike 006 (proposed § 6) — real-HF-model smoke test |
+| 8 | At least one trace from real LLM-application usage is ingested end-to-end | ❌ | Synthetic 50-state fixture only | Spike 007 (proposed § 6) — real-trace ingestion from Cline / OpenHands / Claude Code session export |
+| 9 | The framework is *installable* (a user can `pip install` and have working entrypoints) | ❌ | No `pyproject.toml`, no installable package | Wave 7 — packaging |
+| 10 | A non-author can complete the "I have Qwen3-7B, I want a Composer-style variant" journey by reading docs | ⚠️ partial | Possible to read your way through `INTEGRATION_ARCHITECTURE.md` + `composer_trainer.py`, but no end-to-end runnable example | Spike 006 + a `examples/qwen3_7b_quickstart.md` |
+**Score: 5/10 pass, 4/10 fail, 1/10 partial.** The framework's design is solid; the gap is between design and runnable artifact.
+## 4. The four real gaps, each examined
+### 4.1 V2: DiLoCo deferral — is this a drift?
+**The drift:** the brief says *"combine it with diloco."* The framework documents say DiLoCo is v0.2 work, deferred until training spans multiple clusters.
+**The defense:** Streaming DiLoCo's outer loop is only useful when training cannot fit on one cluster. For a Qwen3-7B (v0.0) or Qwen3-32B (v0.1) run on a single 8×H100 node, FSDP2 is sufficient — adding a DiLoCo outer loop would be bolt-on infrastructure with no measurable benefit. PRIME-RL (which we recommend as the substrate) has DiLoCo-shape sync between geographically distributed inference workers, but the trainer itself is single-cluster FSDP2. INTELLECT-2 (Prime Intellect's 32B QwQ run) is the only production-scale precedent for trainer-side DiLoCo, and even there the headline contribution was the orchestrator/trainer/inference split, not the gradient sync.
+**The honest read:** the deferral is technically defensible, but it is **a deviation from the brief** and we should not pretend otherwise. The user explicitly said "any variant of diloco" — meaning the brief permits weak forms (e.g., outer-loop sync between geographically distributed inference workers, even with a single trainer cluster). We could add that *now*, on the existing v0.0 architecture, without waiting for v0.2.
+**Concrete gap-closer:** **Spike 008** — implement Streaming DiLoCo outer-loop sync between two simulated "clusters" (could literally be two FSDP groups on the same machine for the smoke test). Validates that the outer-loop integrates with PRIME-RL's vLLM↔FSDP2 weight broadcast (SHARDCAST) without breaking the GRPO + SDPO + TR-DPO loss composition. Estimated effort: ~2 days, no new GPU budget if we use a tiny model. Closes V2.
+### 4.2 V4: framework vs. skeleton
+**The drift:** the brief says *"make a framework."* We have a "trainer skeleton" with component-modular code and 38 unit tests but no installable package, no CLI, no examples directory, no quickstart that resolves to a working training run.
+**The honest read:** spike 005 is genuinely modular (`opsd_loss.py`, `teacher_replay.py`, `data_collator.py`, `composer_trainer.py`, `composer_adv.py` are independent components composing through clean interfaces). But "framework" carries connotations of installability, examples, documentation site, versioned releases. We have the *components* of a framework; we have not assembled them into the *artifact* a third party would call a framework.
+**Concrete gap-closer:** **Wave 7 — packaging.** Add a top-level `pyproject.toml` with `composer-replication-framework` as a package; expose `from composer_replication_framework import ComposerReplicationTrainer, ComposerDataCollator, generalized_jsd_loss`; ship `examples/qwen3_7b_swe_bench_lite/` with a runnable `train.py`. Estimated effort: ~half a day once spike 006 (real-model smoke) lands. Closes V4 properly.
+### 4.3 V5: synthetic states vs. real llm-application traces
+**The drift:** the brief is unambiguous: *"use traces from an llm-application usage."* Spike 001 used **50 hand-crafted SWE-bench-lite-shaped states**, not real traces from a real agentic coding application.
+**The defense:** the goal of spike 001 was to measure the *economic floor* of N-teacher replay. Synthetic states with realistic shape and token-count distributions are sufficient for that purpose — we get unbiased latency and cost numbers. The shape of real traces (multi-turn, ~250-500 tokens of context per state, tool-call decision points) was matched.
+**The honest read:** spike 001's economic verdict generalizes to real traces *if* their shape is similar. But the brief's intent is bigger than "measure cost" — the brief envisions ingesting real traces (e.g., from Cursor session exports, OpenHands traces, Claude Code transcripts, Cline rollouts) and harvesting them for training data. **We have the *replay mechanism* but no *ingestion pipeline*.** Real traces have warts our synthetic ones don't: malformed tool calls, mid-rollout context truncation, vendor-specific schema, PII to scrub.
+**Concrete gap-closer:** **Spike 007 — real-trace ingestion.** Pick one real source (proposal: Claude Code session JSONL exports, since I have access to my own and they're well-structured), write an adapter that converts to the `TraceExample` schema the data collator expects, run it through spike 005's pipeline. Validates the real → synthetic → trainer path works without contortion. Estimated effort: ~1 day, no GPU. Closes V5 substantively.
+### 4.4 V8: HF model generalization — architecture vs. evidence
+**The drift:** the brief targets *"any model from huggingface"* with Composer-2.5-quality outcomes. The architecture is designed for `AutoModelForCausalLM.from_pretrained(...)`, but the only smoke test is on a 10K-parameter custom `TinyLM`.
+**The defense:** the integration claim ("all three channels compose, ablate, and train without divergence") is *generic*. It holds for any sufficiently-differentiable model. The TinyLM is a stand-in for any HF model. Real-HF-model testing is GPU-bound work that's properly in the spike 002+ tier.
+**The honest read:** "the architecture is generic" is theoretically true and practically dangerous. Real HF models have:
+- Tokenizer chat templates that the data collator must respect (`StubTokenizer` in spike 005 fakes `apply_chat_template`).
+- Real vocab sizes (Qwen3 = 152K vs TinyLM's 64) where top-k restrictions in the SDPO loss matter.
+- FlashAttention-2 attention paths the OPSD reference relies on.
+- vLLM rollout integration for the outer GRPO loop.
+Any of these can break the "tested in the small" claim when scaled up. **We should run a single forward pass and a single loss computation on a real (small, but real) HF model before claiming the framework generalizes.**
+**Concrete gap-closer:** **Spike 006 — real-HF-model smoke.** Load `Qwen/Qwen3-0.5B` (or `Qwen/Qwen2.5-0.5B-Instruct` if 3 isn't out yet), wire up a real `AutoTokenizer` (not `StubTokenizer`) to the data collator, run a single forward pass + a single backward pass through `composer_total_loss` with all three channels active. CPU-only is fine; the test is wiring correctness, not training. Estimated effort: ~half a day, no GPU rental. **Closes V8's evidence gap without requiring spike 002–004's GPU budget.**
+## 5. User-journey walkthrough — find the breaks
+Simulate the brief's intended user: *"I have Qwen3-7B. I want a Composer-style variant. Walk me through it."*
+The journey, as it would actually run today:
+| Step | What the user does | What happens | Break? |
+|---|---|---|---|
+| 1 | Lands on the HF repo | Reads the README | ✅ Clear status, links to publications |
+| 2 | Reads `publications/PAPER_v0.md` | Understands the architecture | ✅ Comprehensive |
+| 3 | Reads `docs/INTEGRATION_ARCHITECTURE.md` | Picks TRL path | ✅ Clear extension-point matrix |
+| 4 | Clones the repo, navigates to `spikes/005-integrated-trainer-skeleton/` | Reads the skeleton README | ✅ |
+| 5 | Tries to install dependencies | **No `pyproject.toml`. No `requirements.txt` at the spike level.** | ❌ Break — has to figure out deps from imports |
+| 6 | Tries to run an example | **No `examples/` directory.** Skeleton tests pass but there's no end-to-end "load Qwen3-7B + train one step" script | ❌ Break — has to assemble it themselves from components |
+| 7 | Tries `from trl_path.composer_trainer import ComposerReplicationTrainer` | Works iff TRL is installed | ⚠️ Latent: needs TRL ≥ some version, undocumented |
+| 8 | Wires up an `AutoModel` and `AutoTokenizer` | **Untested — `data_collator.py` falls back to a stub tokenizer code path; real chat templates may not be exercised** | ⚠️ Latent risk |
+| 9 | Tries to source training data | **No instructions for trace collection (spike 002 unrun); synthetic stub fixture is in spike 001 but not labeled as a starter dataset** | ❌ Break |
+| 10 | Realizes they need teacher API credentials | `OPENROUTER_API_KEY` envvar — documented in `teacher_replay.py` docstring but not in a top-level setup guide | ⚠️ Findable but suboptimal |
+| 11 | Wants to run the spike-004 A/B comparison | **No script. No config template. Spike 004 README is planning notes, not runnable code** | ❌ Break — they'd have to write the experiment harness themselves |
+**Verdict:** the architecture is reachable from docs, but a third-party can't currently complete the journey end-to-end without significant assembly. The framework needs **packaging + examples + a quickstart** to credibly claim "any HF model" generalization.
+## 6. Proposed gap-closing spikes (no GPU budget required)
+These three sub-projects close the four real gaps identified in § 4 and don't need GPU rental — they're CPU-only or use $5 of API. They can run *before* spike 002–004 to make the framework actually deliver on the brief.
+### Spike 006 — Real-HF-model smoke (closes V8)
+- **Goal:** load `Qwen/Qwen2.5-0.5B-Instruct` via `AutoModelForCausalLM` + `AutoTokenizer`, wire to `ComposerDataCollator` with real `apply_chat_template`, run one forward + one backward through `composer_total_loss(α=0.1, β=0.05)`, verify finite gradient on every parameter.
+- **Hardware:** CPU sufficient (model fits in ~1GB RAM).
+- **Effort:** ~half a day.
+- **Pass criterion:** test in `tests/test_real_hf_model_smoke.py` passes; loss is finite and decreases over 5 steps on a fixed batch.
+### Spike 007 — Real-trace ingestion (closes V5)
+- **Goal:** Write `adapters/claude_code.py` (or `cline.py` or `openhands.py` — pick one) that converts a real session export into a list of `TraceExample` dicts. Run spike 001's `replay_trace` on 5 real states. Run spike 005's pipeline end-to-end on the resulting batch.
+- **Hardware:** CPU.
+- **Effort:** ~1 day.
+- **Pass criterion:** `python adapters/claude_code.py < session.jsonl > traces.jsonl` produces collator-compatible output; `composer_total_loss` runs on it without error; one DPO pair successfully extracted from teacher disagreement.
+### Spike 008 — Streaming DiLoCo smoke (closes V2)
+- **Goal:** Bolt a Streaming DiLoCo outer loop onto the `composer_total_loss` smoke test. Use two FSDP process groups on the same node as a stand-in for two clusters. Verify pseudo-gradient sync every H steps doesn't break loss composition.
+- **Hardware:** CPU sufficient (TinyLM scale).
+- **Effort:** ~2 days (DiLoCo's PyTorch reference is a ~200 LOC outer loop).
+- **Pass criterion:** 5-step training run with α=0.1, β=0.05, DiLoCo H=2 still decreases loss; pseudo-gradient sync produces no NaN.
+After 006 + 007 + 008, the scorecard goes from 5/10 pass to 8/10 pass. Wave 7 (packaging) closes #9 and #10. **Then the framework genuinely delivers on the brief, before any of the GPU-bound spikes 002–004 run.**
+## 7. Adversarial self-review
+The five strongest objections to "this framework encapsulates the vision," steelmanned and answered:
+### Objection 1: "You spent more time on publication materials than on closing real gaps."
+**Steelman:** Wave 5 produced 1,200 lines of publication materials. Wave 6 (this doc) is more publication. Meanwhile spike 006/007/008 — actual integration work — is unwritten. Optimizing for paper-readiness over framework-readiness is a known failure mode.
+**Answer:** Conceded with caveat. Wave 5 was at the user's explicit request. This wave (vision validation) was the first introspective check; gap-closers are now scoped and ready to execute. The next wave should be 006 (real-model smoke) before any further publication work. **Reordering accepted.**
+### Objection 2: "Spike 001's $0.98/trace doesn't generalize to real traces. You measured what's cheap (50 short hand-crafted states), not what's real (10K-token rollouts with embedded code blobs)."
+**Steelman:** Real Cursor / Cline / OpenHands rollouts can have 10–100K tokens of context. At Opus pricing ($15/Mtok input), a 50-step replay over 50K-token rollouts costs $37.50 in input tokens alone *per teacher per trace*. With 3 teachers that's ~$112 per trace, dwarfing the synthetic-state estimate.
+**Answer:** Largely valid. Spike 001's verdict is *valid* but its *generalization* to real traces is unproven. Spike 007 (real-trace ingestion) is the right next experiment. Once it lands, we can rerun spike 001's analysis on real-shaped traces and report an updated cost number. The framework's economic claim should be updated to include both the synthetic-floor result *and* a real-trace measurement when available.
+### Objection 3: "VeRL is recommended for v0.2 but the only smoke test is in TRL. The VeRL `composer_adv.py` has zero unit tests."
+**Steelman:** `verl_path/composer_adv.py` ships as untested code. The DeepWiki audit confirmed extension surfaces but didn't validate that the resulting `compute_grpo_composer_advantage` function correctly composes with VeRL's `compute_advantage` dispatcher. Anyone choosing the VeRL path is running unverified code.
+**Answer:** Correct. The VeRL path is a *design verified by primary-source audit*, not *a tested implementation*. Closing this requires installing VeRL + Ray + a real model — non-trivial. Reasonable interim mitigation: explicitly mark `verl_path/` as `STATUS: design-only` in its README and warn users the TRL path is the only tested one. Long-term gap-closer: spike 002b's PRIME-RL/VeRL run is the natural place to validate.
+### Objection 4: "The 'integration with Monarch and TorchForge' claim is paper-thin. Forge is paused. Monarch K8s is documented but not used. Where's the runnable Monarch ActorMesh?"
+**Steelman:** The integration matrix in `INTEGRATION_ARCHITECTURE.md` mentions Monarch ActorMesh patterns for SDPO and TR-DPO, but no code. TorchForge is "paused" so we route around it. The actual integration with Meta's stack is documentation, not code. This is V3 partial, not V3 strong.
+**Answer:** Half right. Forge being paused upstream is genuinely orthogonal to our work — we can't depend on a paused project. But the Monarch integration *is* paper-only. The honest framing: we *integrate with the design philosophy* of Monarch (single-controller actor-mesh orchestration) by ensuring the components are *placeable* on a Monarch mesh, but we have no runnable Monarch code. Should soften the V3 claim in the README from "integrate with all five" to "integrate with TRL + VeRL + OpenEnv (coded); align with Monarch + Forge philosophy (documented)."
+### Objection 5: "You claim 'any HF model from huggingface' but the architecture is implicitly designed around a chat-template-having causal LM. What about base models without chat templates? Encoder-decoder models? VLMs?"
+**Steelman:** `data_collator.py::_tokenize_messages` calls `apply_chat_template` and falls back to plain text concat. For a base model without a chat template (e.g. `gpt2`, `Qwen3-0.5B-Base`), the fallback path may not produce coherent input. For encoder-decoder models the whole "single forward, gather logits" assumption breaks. VLMs add another dimension. The "any HF model" claim is overstated.
+**Answer:** Accurate. The framework is designed for **causal LM with chat templates**, which is the standard target for agentic-coding RL post-training. The brief's "any model from huggingface" should be re-scoped to **"any HuggingFace causal LM with a chat template (Qwen, Llama, Mistral, DeepSeek, Phi, Gemma families)."** README + paper should say this explicitly. Encoder-decoder, base-no-chat-template, and VLM support is out of scope for v0.0/v0.1. Adding it would be a separate research direction.
+## 8. What this validation actually says
+The framework **partially encapsulates** the vision. Strengths:
+- Composer 2.5 mechanism is correctly identified and grounded in published prior art (V1 ✅)
+- Five-component agentic-RL stack is mapped with primary-source-audited extension points (V3 ✅)
+- N-teacher distillation channel is empirically feasible (V6 ✅)
+- Research process is rigorous, sourced, and self-correcting (V7 ✅)
+- Composition smoke testing on a tiny model is a real empirical claim (parts of V4, V8 ✅)
+Real gaps the brief asked for that we punted on:
+- DiLoCo deferral is documented but is a deviation (V2 🟡)
+- "Framework" is a skeleton; no installable package or examples (V4 🟡)
+- Real llm-application traces are not ingested anywhere (V5 🔴)
+- "Any HF model" is architecturally generic but never tested on a real HF model (V8 🔴)
+These gaps are closable without GPU rental via three CPU-only spikes (006, 007, 008) totaling ~3.5 days of effort. After they land + a packaging wave (Wave 7), the scorecard goes from 5/10 to 9/10 and the framework genuinely delivers on the original brief.
+The gap that remains unclosed at 9/10 is #4 ("the framework actually trains better models than baseline GRPO") which requires the GPU spikes 002–004. That's correctly out-of-scope for vision encapsulation — it's *empirical validation of the methodology*, not encapsulation of the brief.
+## 9. Recommended next moves, ordered
+In recommended-do-next order:
+1. **Spike 006 (real-HF-model smoke)** — half a day, CPU-only. Closes V8's biggest credibility gap and surfaces any tokenizer / chat-template / vocab-size issues hiding in the skeleton. *Highest value per hour.*
+2. **Spike 007 (real-trace ingestion)** — 1 day, CPU + ~$5 OpenRouter. Closes V5's real-vs-synthetic gap. Picks one source (Claude Code session JSONL? Cline transcripts?) and writes the adapter.
+3. **Soften over-claims in README and paper** — half-hour. README: "TRL coded, VeRL design-only." Paper: "any causal LM with a chat template." `verl_path/README.md`: add `STATUS: design-only — validate via spike 002b before production use`.
+4. **Wave 7 (packaging)** — half a day, after 006 + 007 land. `pyproject.toml` + `examples/qwen3_05b_quickstart/` directory + entry-point exposure.
+5. **Spike 008 (Streaming DiLoCo smoke)** — 2 days, CPU-only. Closes V2's biggest deviation from the brief. Lowest priority because the deferral is technically defensible, but worth doing for completeness.
+Items 3 + 4 are documentation/packaging chores. Items 1 + 2 + 5 are real engineering. None require GPU budget. Total: ~5 days of sequential effort to take the framework from 5/10 to 9/10 vision encapsulation.
+After that, GPU-bound spikes 002–004 are what move 9/10 → empirical validation of the methodology itself, which is a separate project phase (the "v0.1 follow-up paper" framing in the publication wave).
+## 10. How to keep validating going forward
+This document captures one snapshot. Vision encapsulation drifts as projects evolve. Three mechanisms to keep checking:
+- **Re-audit at every release wave.** Each commit message currently includes a wave number. Wave 7 (packaging), wave 8 (post-spike-006), etc. should each end with a one-paragraph "vision-check delta" — what gaps closed, what new ones opened, whether the README's claims still match the code.
+- **External review via the HF Discussions tab.** The pre-experimental release posts (drafted in `publications/`) explicitly ask for critical reads of the integration architecture and adjacent-work pointers. Specifically pin one Discussion thread for "vision encapsulation feedback" — invite people to point out where the framework deviates from its stated goals.
+- **Calibration check post-spike-004.** When the GPU spikes finally run and produce a result (positive or negative), revisit this document and update the scorecard. If TR-DPO doesn't beat plain GRPO, V6 needs to be downgraded. If it does, V8 should incorporate the empirical evidence. The validation is a living artifact, not a one-shot audit.
+---
+*Self-audit complete. Honest read: 5/10 today, 9/10 reachable in ~5 days of CPU-only work, last 1/10 is the empirical question that gates the v0.1 paper.*