Merge docs/refine-2026-06: documentation refinement pass
Brings the ccode-ultracode docs engagement (4 waves + 1 reconciliation)
into main: channel-3 provenance correction, gap-honesty, dead-link fixes,
archived point-in-time wave reviews (with redirect stubs), new OVERVIEW.md,
ADR-014 indexing. Reconciled the main-lags-master foot-gun guidance to its
RESOLVED state (main is canonical and synced). Docs-only; no source touched.
- BACKLOG.md +2 -2
- README.md +5 -2
- docs/ALTERED_MINDS_TIE_IN.md +18 -5
- docs/DEEP_WORK_LOOP_LOG.md +5 -45
- docs/HF_REPO_LAYOUT.md +2 -2
- docs/INTEGRATION_RECIPES.md +1 -1
- docs/OVERVIEW.md +97 -0
- docs/USER_GUIDE.md +7 -0
- docs/VISION_VALIDATION.md +23 -6
- docs/WAVE_COMPOSER_DATAGEN_RL_2026-05-29.md +7 -73
- docs/_archive/DEEP_WORK_LOOP_LOG.md +47 -0
- docs/_archive/README.md +39 -0
- docs/_archive/WAVE_COMPOSER_DATAGEN_RL_2026-05-29.md +75 -0
- docs/_archive/reviews/cross-family-adr-008-009-010-2026-05-29/SYNTHESIS.md +90 -0
- docs/{reviews → _archive/reviews}/cross-family-adr-008-009-010-2026-05-29/review_deepseek-v4-pro.md +0 -0
- docs/{reviews → _archive/reviews}/cross-family-adr-008-009-010-2026-05-29/review_gemini-3.1-pro.md +0 -0
- docs/{reviews → _archive/reviews}/cross-family-adr-008-009-010-2026-05-29/review_gpt-5.5.md +0 -0
- docs/{reviews → _archive/reviews}/cross-family-adr-008-009-010-2026-05-29/review_grok-4.3.md +0 -0
- docs/_archive/reviews/final-verify-deep-work-2026-05-29/SYNTHESIS.md +33 -0
- docs/{reviews → _archive/reviews}/final-verify-deep-work-2026-05-29/verify_gemini-3.1-pro.md +0 -0
- docs/{reviews → _archive/reviews}/final-verify-deep-work-2026-05-29/verify_gpt-5.5.md +0 -0
- docs/_refine-2026-06-SUMMARY.md +113 -0
- docs/adrs/README.md +7 -0
- docs/research/WAVE_13_FINAL_REVIEW.md +6 -237
- docs/research/WAVE_14_FINAL_REVIEW.md +5 -261
- docs/research/WAVE_15_FINAL_REVIEW.md +5 -74
- docs/research/WAVE_7_10_FINAL_REVIEW.md +6 -421
- docs/research/_archive/README.md +39 -0
- docs/research/_archive/WAVE_13_FINAL_REVIEW.md +239 -0
- docs/research/_archive/WAVE_14_FINAL_REVIEW.md +263 -0
- docs/research/_archive/WAVE_15_FINAL_REVIEW.md +76 -0
- docs/research/_archive/WAVE_7_10_FINAL_REVIEW.md +423 -0
- docs/reviews/cross-family-adr-008-009-010-2026-05-29/SYNTHESIS.md +7 -88
- docs/reviews/final-verify-deep-work-2026-05-29/SYNTHESIS.md +6 -31
- framework/composer-replication-framework.md +2 -2
- publications/HF_DISCUSSION_POST.md +6 -6
BACKLOG.md
CHANGED
|
@@ -71,8 +71,8 @@ Updated 2026-05-29 to reflect shipped waves (ingestion, diloco, packaging, datag
|
|
| 71 |
**Acceptance**:
|
| 72 |
1. `pyproject.toml` at repo root, package name `composer_replication`.
|
| 73 |
2. `composer_replication/` dir with `__init__.py` re-exporting `composer_total_loss`, `OPSDLoss`, `TeacherReplayBuffer`, `compose_loss`, `TraceIngester`, etc.
|
| 74 |
-
3. `examples/
|
| 75 |
-
4. README quickstart updated to `pip install -e .` + `python examples/
|
| 76 |
5. `pip install -e .` succeeds and quickstart runs end-to-end on CPU.
|
| 77 |
|
| 78 |
### Post-Skeleton Waves (Datagen, Alignment, Quality)
|
|
|
|
| 71 |
**Acceptance**:
|
| 72 |
1. `pyproject.toml` at repo root, package name `composer_replication`.
|
| 73 |
2. `composer_replication/` dir with `__init__.py` re-exporting `composer_total_loss`, `OPSDLoss`, `TeacherReplayBuffer`, `compose_loss`, `TraceIngester`, etc.
|
| 74 |
+
3. `examples/qwen_05b_quickstart/` with end-to-end script that loads model, runs 10 training steps, prints loss curve.
|
| 75 |
+
4. README quickstart updated to `pip install -e .` + `python examples/qwen_05b_quickstart/run.py`.
|
| 76 |
5. `pip install -e .` succeeds and quickstart runs end-to-end on CPU.
|
| 77 |
|
| 78 |
### Post-Skeleton Waves (Datagen, Alignment, Quality)
|
README.md
CHANGED
|
@@ -28,11 +28,14 @@ pretty_name: "Composer 2.5 Replication Framework — Research Synthesis"
|
|
| 28 |
> **Author:** [Codeseys](https://huggingface.co/Codeseys)
|
| 29 |
> **Goal:** Replicate Cursor's [Composer 2.5](https://cursor.com/blog/composer-2-5) (a post-trained Kimi K2.5 specialised for agentic coding) on **any HuggingFace causal LM with a chat template** (Qwen, Llama, Mistral, DeepSeek, Phi, Gemma families), using a synthesis of decentralized RL post-training techniques. *(LM-CE + DPO channels empirically verified on Qwen2.5-0.5B-Instruct via Spike 006; the SDPO channel is verified via the `compose_loss` integration test suite + the `examples/gsm8k_grpo_with_sdpo/` and `examples/sdpo_with_real_traces/` end-to-end smokes — Spike 006's run zeroed the SDPO channel because the hint-context shape didn't align with the student context, the same alignment discipline `ComposerDataCollator` enforces in production. Encoder-decoder models, base models without chat templates, and VLMs are out of scope for v0.)*
|
| 30 |
|
| 31 |
-
This repository is the **"paper of the project"** — it is the methodology / research / framework specification for an open replication of Cursor's Composer 2.5 system, plus a **novel multi-teacher trace-replay distillation channel** that stacks on top of the Composer recipe.
|
| 32 |
|
| 33 |
## Install
|
| 34 |
|
| 35 |
```bash
|
| 36 |
pip install -e .
|
| 37 |
python examples/qwen_05b_quickstart/run.py
|
| 38 |
```
|
|
@@ -163,7 +166,7 @@ The novel contribution is channel (3) — no published work systematically repla
|
|
| 163 |
| Phase | Timeline | Goal | Trained variant repo | Data repo |
|
| 164 |
|---|---|---|---|---|
|
| 165 |
| **v0.0 spike** | 1–2 weeks | Prove trace-replay-DPO beats plain GRPO on Qwen3-7B + SWE-bench-lite | `Codeseys/composer-replication-qwen3-7b-v0` | `Codeseys/composer-replication-traces-v0` |
|
| 166 |
-
| **v0.1** | 1–2 months | Full Composer recipe (RLVR + hint-distill + trace-replay
|
| 167 |
| **v0.2** | 3–6 months | Decentralized scaling: Streaming DiLoCo + SHARDCAST + Monarch. Multi-cluster reproduction of v0.1 across **Modal + HF Jobs + on-prem** via the new serverless-DiLoCo executor abstraction (ADR-005). | `Codeseys/composer-replication-qwen3-32b-decentralized` | (re-uses v1 data) |
|
| 168 |
|
| 169 |
Each variant will get its own model repo (LoRA adapters or full fine-tunes) per the [HF multi-artifact research project layout](https://huggingface.co/docs/hub/repositories). This methodology repo will be linked from each variant's README and via an HF Collection once v0.0 produces a result.
|
|
|
|
| 28 |
> **Author:** [Codeseys](https://huggingface.co/Codeseys)
|
| 29 |
> **Goal:** Replicate Cursor's [Composer 2.5](https://cursor.com/blog/composer-2-5) (a post-trained Kimi K2.5 specialised for agentic coding) on **any HuggingFace causal LM with a chat template** (Qwen, Llama, Mistral, DeepSeek, Phi, Gemma families), using a synthesis of decentralized RL post-training techniques. *(LM-CE + DPO channels empirically verified on Qwen2.5-0.5B-Instruct via Spike 006; the SDPO channel is verified via the `compose_loss` integration test suite + the `examples/gsm8k_grpo_with_sdpo/` and `examples/sdpo_with_real_traces/` end-to-end smokes — Spike 006's run zeroed the SDPO channel because the hint-context shape didn't align with the student context, the same alignment discipline `ComposerDataCollator` enforces in production. Encoder-decoder models, base models without chat templates, and VLMs are out of scope for v0.)*
|
| 30 |
|
| 31 |
+
This repository is the **"paper of the project"** — it is the methodology / research / framework specification for an open replication of Cursor's Composer 2.5 system, plus a **novel multi-teacher trace-replay distillation channel** that stacks on top of the Composer recipe (this channel is the *framework's own addition* — it is **not** part of Cursor's actual recipe; see [ADR-014](docs/adrs/ADR-014-policy-optimization-objective-menu.md)).
|
| 32 |
+
|
| 33 |
+
> 🧭 **New here?** Read [`docs/OVERVIEW.md`](docs/OVERVIEW.md) for a 5-minute, honestly-scoped tour: what this is, the three channels with their real provenance, what's proven, and what's still gapped.
|
| 34 |
|
| 35 |
## Install
|
| 36 |
|
| 37 |
```bash
|
| 38 |
+
# `main` is canonical and synced with `master` — a fresh clone works as-is.
|
| 39 |
pip install -e .
|
| 40 |
python examples/qwen_05b_quickstart/run.py
|
| 41 |
```
|
|
|
|
| 166 |
| Phase | Timeline | Goal | Trained variant repo | Data repo |
|
| 167 |
|---|---|---|---|---|
|
| 168 |
| **v0.0 spike** | 1–2 weeks | Prove trace-replay-DPO beats plain GRPO on Qwen3-7B + SWE-bench-lite | `Codeseys/composer-replication-qwen3-7b-v0` | `Codeseys/composer-replication-traces-v0` |
|
| 169 |
+
| **v0.1** | 1–2 months | Full Composer recipe (RLVR + hint-distill, i.e. channels 1–2 = Dr.GRPO + SDPO) **plus** the framework's own additive trace-replay-DPO channel, on Qwen3-32B + Feature Deletion env. Match Cursor's ~50% SWE-bench-multilingual at 32B scale. *(trace-replay-DPO is our addition, not part of Cursor's recipe — see [ADR-014](docs/adrs/ADR-014-policy-optimization-objective-menu.md).)* | `Codeseys/composer-replication-qwen3-32b-v1` | `Codeseys/composer-replication-traces-v1` |
|
| 170 |
| **v0.2** | 3–6 months | Decentralized scaling: Streaming DiLoCo + SHARDCAST + Monarch. Multi-cluster reproduction of v0.1 across **Modal + HF Jobs + on-prem** via the new serverless-DiLoCo executor abstraction (ADR-005). | `Codeseys/composer-replication-qwen3-32b-decentralized` | (re-uses v1 data) |
|
| 171 |
|
| 172 |
Each variant will get its own model repo (LoRA adapters or full fine-tunes) per the [HF multi-artifact research project layout](https://huggingface.co/docs/hub/repositories). This methodology repo will be linked from each variant's README and via an HF Collection once v0.0 produces a result.
|
docs/ALTERED_MINDS_TIE_IN.md
CHANGED
|
@@ -111,11 +111,24 @@ only — never rationale style, so distorted-but-persuasive reasoning is not
|
|
| 111 |
rewarded), and `dual_kl_logger` (logs KL-to-altered-init **and** KL-to-base each
|
| 112 |
step — the washout-vs-amplification instrument).
|
| 113 |
|
| 114 |
-
Train for ~500 steps per arm on a single GPU (
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
|
| 119 |
|
| 120 |
### Phase 4 — re-evaluate
|
| 121 |
|
|
|
|
| 111 |
rewarded), and `dual_kl_logger` (logs KL-to-altered-init **and** KL-to-base each
|
| 112 |
step — the washout-vs-amplification instrument).
|
| 113 |
|
| 114 |
+
Train for ~500 steps per arm on a single GPU. **Runnability today (2026-06):**
|
| 115 |
+
only **A1 (GRPO-only)** has a real Modal runner — [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md)
|
| 116 |
+
records that "the A1 run used `dr_grpo`" and that wiring the `objective=` menu through the
|
| 117 |
+
rest of the ladder runners is an open follow-up (Qwen-0.5B feasibility-test confirmed;
|
| 118 |
+
for Llama-8B use Modal + the framework's `ServerlessExecutor` per ADR-005 — local 5090
|
| 119 |
+
is too small). **A2 (SDPO) / A3 (replay-DPO) / A4 (combined) are scaffold + plan-builder
|
| 120 |
+
only**: running them on a real 8B checkpoint additionally needs a real error-trace SDPO
|
| 121 |
+
dataset, a replay-DPO preference corpus, and an A100 entrypoint that don't exist yet —
|
| 122 |
+
none of those is a closed artifact today. The real 8B/LMA-checkpoint run is *additionally*
|
| 123 |
+
**user-gated** (it spends grant budget). [ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md)
|
| 124 |
+
ships the ladder scaffolding + the A1 capability, proven CPU-only on a small model
|
| 125 |
+
(`examples/altered_minds_channel_ladder/`); its sole remaining acceptance-gate box is that
|
| 126 |
+
user-gated real-spend go/no-go.
|
| 127 |
+
|
| 128 |
+
> **strip_thinking × SDPO foot-gun (A2/A4).** When the SDPO arms become runnable on real
|
| 129 |
+
> agent traces, SDPO REQUIRES `strip_thinking=False`: ~67% of error-recovery turns are
|
| 130 |
+
> pure thinking, so stripping them yields empty SDPO masks (the channel silently
|
| 131 |
+
> contributes nothing). Keep thinking tokens in the context for any SDPO-active arm.
|
| 132 |
|
| 133 |
### Phase 4 — re-evaluate
|
| 134 |
|
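The `strip_thinking` × SDPO foot-gun described in the diff above can be made concrete with a minimal sketch. This is not the framework's code: the `Token` record, its `is_thinking` tag, and the `sdpo_mask` helper are hypothetical names introduced only for illustration, and the real collator may represent thinking tokens differently.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    is_thinking: bool  # hypothetical tag; real traces may mark this differently


def sdpo_mask(turn: list[Token], strip_thinking: bool) -> list[int]:
    """Return a 0/1 mask over the tokens that survive preprocessing.

    Hypothetical helper: a 1 means the token can contribute to the SDPO term.
    """
    kept = [t for t in turn if not (strip_thinking and t.is_thinking)]
    return [1] * len(kept)


# An error-recovery turn that is pure thinking (no non-thinking tokens at all).
pure_thinking_turn = [
    Token("the", True),
    Token("test", True),
    Token("failed", True),
]

mask_stripped = sdpo_mask(pure_thinking_turn, strip_thinking=True)
mask_kept = sdpo_mask(pure_thinking_turn, strip_thinking=False)

# With stripping, no positions are left to supervise on this turn.
print(len(mask_stripped), len(mask_kept))
```

Under these assumptions, a pure-thinking turn leaves nothing for the SDPO term to act on once thinking is stripped, which is why the note asks for `strip_thinking=False` on any SDPO-active arm.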
docs/DEEP_WORK_LOOP_LOG.md
CHANGED
|
@@ -1,47 +1,7 @@
|
|
| 1 |
# Deep Work Loop Log — Composer 2.5 Replication Framework
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
> Take any HuggingFace model → further RL train it using:
|
| 10 |
-
> 1. RLVR (tests-pass reward),
|
| 11 |
-
> 2. SDPO/hint-distillation (Composer 2.5's "targeted RL with textual feedback"),
|
| 12 |
-
> 3. multi-teacher trace-replay DPO,
|
| 13 |
-
> integrated against TRL/VeRL/OpenEnv with DiLoCo-style outer loop sync.
|
| 14 |
-
>
|
| 15 |
-
> Output: a published, reproducible framework — the "Composer 2.5 replication" the open ecosystem is missing.
|
| 16 |
-
|
| 17 |
-
## Starting state
|
| 18 |
-
|
| 19 |
-
- HEAD: `040eff8` (Wave 6: vision validation self-audit, 5/10 scorecard)
|
| 20 |
-
- Tests: 38/38 green in `spikes/005-integrated-trainer-skeleton/`
|
| 21 |
-
- Working tree: clean
|
| 22 |
-
|
| 23 |
-
## Phase ledger
|
| 24 |
-
|
| 25 |
-
| Phase | Description | Status | Started | Done |
|
| 26 |
-
|---|---|---|---|---|
|
| 27 |
-
| 1 | commit-state | ✅ | 2026-05-26 | 2026-05-26 |
|
| 28 |
-
| 2 | backlog-audit (BACKLOG.md from VISION_VALIDATION) | ✅ | 2026-05-26 | 2026-05-26 |
|
| 29 |
-
| 3 | parallel-research (3 subagents) | 🟡 | 2026-05-26 | |
|
| 30 |
-
| 4 | architect with ADRs (ADR-001..003) | ⏳ | | |
|
| 31 |
-
| 5 | plan in waves (W7–W10) | ⏳ | | |
|
| 32 |
-
| 6 | execute W7 — Spike 006 (real HF model smoke) | ⏳ | | |
|
| 33 |
-
| 7 | execute W8 — Spike 007 (real trace ingestion) | ⏳ | | |
|
| 34 |
-
| 8 | execute W9 — Spike 008 (DiLoCo smoke) | ⏳ | | |
|
| 35 |
-
| 9 | execute W10 — packaging | ⏳ | | |
|
| 36 |
-
| 10 | (Modal-gated) Spike 002a-mini real GPU smoke | ⏳ | | |
|
| 37 |
-
| 11 | cross-model-final-review | ⏳ | | |
|
| 38 |
-
| 12 | update scorecard + push | ⏳ | | |
|
| 39 |
-
|
| 40 |
-
## Constraints
|
| 41 |
-
|
| 42 |
-
- Verify ALL claims against primary sources (Wave 2 lesson — subagent synthesis is not evidence).
|
| 43 |
-
- Tests must pass before commit.
|
| 44 |
-
- Memory L1 is at 99% — write to L2 wiki + L3 fact_store, not L1.
|
| 45 |
-
- Modal budget: $20 hard cap for this loop. Anything more goes to user for approval.
|
| 46 |
-
- No `upload_file` mixing with `git push` — `git push hf master:main` only.
|
| 47 |
-
- Commit messages via `-F /tmp/<wave>-commit-msg.txt`.
|
|
|
|
| 1 |
# Deep Work Loop Log — Composer 2.5 Replication Framework
|
| 2 |
|
| 3 |
+
> **📦 Archived (2026-06-08).** This running log of the deep-work-loop sessions has
|
| 4 |
+
> been moved to [`docs/_archive/DEEP_WORK_LOOP_LOG.md`](_archive/DEEP_WORK_LOOP_LOG.md).
|
| 5 |
+
> It is preserved verbatim for provenance but is superseded by the current
|
| 6 |
+
> [`docs/METHODOLOGY.md`](METHODOLOGY.md) and [`BACKLOG.md`](../BACKLOG.md). See
|
| 7 |
+
> [`docs/_archive/README.md`](_archive/README.md) for the archive index.
|
docs/HF_REPO_LAYOUT.md
CHANGED
|
@@ -17,7 +17,7 @@ When the v0.0 spike produces a result, the following repos will be created:
|
|
| 17 |
| Repo | Type | Created when | Contents |
|
| 18 |
|---|---|---|---|
|
| 19 |
| `Codeseys/composer-replication-traces-v0` | dataset | v0.0 spike data is collected | 100 frozen agentic-coding traces (JSON), used for trace-replay-distillation experiments |
|
| 20 |
-
| `Codeseys/composer-replication-qwen3-7b-v0` | model | v0.0 spike produces a checkpoint | LoRA adapter or full fine-tune of Qwen3-7B trained with GRPO + trace-replay-DPO |
|
| 21 |
| `Codeseys/composer-replication-qwen3-7b-v0-baseline` | model | v0.0 spike produces a baseline checkpoint | Same training, plain GRPO only (A/B comparison) |
|
| 22 |
|
| 23 |
After v0.1:
|
|
@@ -26,7 +26,7 @@ After v0.1:
|
|
| 26 |
|---|---|---|
|
| 27 |
| `Codeseys/composer-replication-traces-v1` | dataset | Larger trace corpus + Feature-Deletion environment seed repos |
|
| 28 |
| `Codeseys/composer-replication-feature-deletion-env-v1` | dataset | Repos with passing tests, with deletion masks for the env to apply |
|
| 29 |
-
| `Codeseys/composer-replication-qwen3-32b-v1` | model | Full Composer-
|
| 30 |
|
| 31 |
All trained-variant repos will:
|
| 32 |
- Link back to **this repo** (`Codeseys/composer-replication-framework`) in their `README.md` as the methodology source.
|
|
|
|
| 17 |
| Repo | Type | Created when | Contents |
|
| 18 |
|---|---|---|---|
|
| 19 |
| `Codeseys/composer-replication-traces-v0` | dataset | v0.0 spike data is collected | 100 frozen agentic-coding traces (JSON), used for trace-replay-distillation experiments |
|
| 20 |
+
| `Codeseys/composer-replication-qwen3-7b-v0` | model | v0.0 spike produces a checkpoint | LoRA adapter or full fine-tune of Qwen3-7B trained with Dr.GRPO + the framework's own additive trace-replay-DPO research channel (channel 3 is our addition, **not** part of Cursor's Composer recipe — see [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md)) |
|
| 21 |
| `Codeseys/composer-replication-qwen3-7b-v0-baseline` | model | v0.0 spike produces a baseline checkpoint | Same training, plain GRPO only (A/B comparison) |
|
| 22 |
|
| 23 |
After v0.1:
|
|
|
|
| 26 |
|---|---|---|
|
| 27 |
| `Codeseys/composer-replication-traces-v1` | dataset | Larger trace corpus + Feature-Deletion environment seed repos |
|
| 28 |
| `Codeseys/composer-replication-feature-deletion-env-v1` | dataset | Repos with passing tests, with deletion masks for the env to apply |
|
| 29 |
+
| `Codeseys/composer-replication-qwen3-32b-v1` | model | Full Composer-channel (Dr.GRPO + SDPO) v1 trained variant, combined with the framework's additive trace-replay-DPO research channel (trace-replay is our addition, not Composer's recipe) |
|
| 30 |
|
| 31 |
All trained-variant repos will:
|
| 32 |
- Link back to **this repo** (`Codeseys/composer-replication-framework`) in their `README.md` as the methodology source.
|
docs/INTEGRATION_RECIPES.md
CHANGED
|
@@ -978,7 +978,7 @@ adapter boundary, not because the loss math is wrong.
|
|
| 978 |
- ADRs:
|
| 979 |
[`docs/adrs/ADR-005-serverless-diloco.md`](adrs/ADR-005-serverless-diloco.md),
|
| 980 |
[`docs/adrs/ADR-006-rl-frameworks.md`](adrs/ADR-006-rl-frameworks.md),
|
| 981 |
-
[`docs/adrs/ADR-007-distillation-losses.md`](adrs/ADR-007-distillation-losses.md)
|
| 982 |
|
| 983 |
---
|
| 984 |
|
|
|
|
| 978 |
- ADRs:
|
| 979 |
[`docs/adrs/ADR-005-serverless-diloco.md`](adrs/ADR-005-serverless-diloco.md),
|
| 980 |
[`docs/adrs/ADR-006-rl-frameworks.md`](adrs/ADR-006-rl-frameworks.md),
|
| 981 |
+
[`docs/adrs/ADR-007-self-distillation-losses.md`](adrs/ADR-007-self-distillation-losses.md)
|
| 982 |
|
| 983 |
---
|
| 984 |
|
docs/OVERVIEW.md
ADDED
|
@@ -0,0 +1,97 @@
|
|
| 1 |
+
# Overview — Composer 2.5 Replication Framework (5-minute read)
|
| 2 |
+
|
| 3 |
+
*Current through [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md) (2026-06). For
|
| 4 |
+
the front-door pitch see [`README.md`](../README.md); for the honest gap list see
|
| 5 |
+
[`BACKLOG.md`](../BACKLOG.md); for the clause-by-clause vision audit see
|
| 6 |
+
[`docs/VISION_VALIDATION.md`](VISION_VALIDATION.md).*
|
| 7 |
+
|
| 8 |
+
## What it is
|
| 9 |
+
|
| 10 |
+
An **open, methodology-first replication of Cursor's [Composer 2.5](https://cursor.com/blog/composer-2-5)**
|
| 11 |
+
recipe — the post-training pipeline that turned a Kimi-K2.5 MoE base into a strong agentic
|
| 12 |
+
coder — generalized so it runs on **any HuggingFace causal LM with a chat template** (Qwen,
|
| 13 |
+
Llama, Mistral, DeepSeek, Phi, Gemma families). It ships as an installable Python package
|
| 14 |
+
(`pip install -e .` → `composer_replication`) plus a research corpus (ADRs, deep-dives,
|
| 15 |
+
recipes). Encoder-decoder models, base models without chat templates, and VLMs are out of
|
| 16 |
+
scope for v0.
|
| 17 |
+
|
| 18 |
+
This repo is the **methodology repo** ("the paper of the project"). Trained-variant model
|
| 19 |
+
repos and trace datasets are split out per [`docs/HF_REPO_LAYOUT.md`](HF_REPO_LAYOUT.md).
|
| 20 |
+
|
| 21 |
+
## The three channels — with honest provenance
|
| 22 |
+
|
| 23 |
+
The framework composes a single training loss out of three additive channels. **Two replicate
|
| 24 |
+
Cursor's published recipe; the third is the framework's own research addition.** Getting this
|
| 25 |
+
provenance right is the whole point — see [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md).
|
| 26 |
+
|
| 27 |
+
| # | Channel | What it is | Provenance |
|
| 28 |
+
|---|---|---|---|
|
| 29 |
+
| **1** | **Base policy optimization** | RL on verifiable rewards (RLVR). Default **Dr.GRPO**, now a **selectable menu** (`make_po_config(objective=…)` over `{grpo, dr_grpo, bnpo, dapo, gspo, cispo}`, per ADR-014). | ✅ **Genuine replication.** Composer 2's report (arXiv:2603.24477) resolves the base objective as Dr.GRPO. |
|
| 30 |
+
| **2** | **SDPO self-distillation** | Composer's "targeted RL with textual feedback": insert a hint into the context → use that hint-conditioned forward pass as a *self-teacher* → on-policy KL pulls the student toward it at the error turn. Published as SDPO/OPSD (arXiv:2601.20802 / 2601.18734, MIT code). | ✅ **Genuine replication.** This is Composer 2.5's headline trick; Cursor cites the SDPO/OPSD papers in the blog's footnote 1. |
|
| 31 |
+
| **3** | **Trace-replay-DPO** | Replay each step of a frozen agentic trace with N external teachers; turn teacher (dis)agreement into DPO preference pairs. A deliberate β-gated washout probe in the A0→A4 channel ladder ([ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md)). | ⚠️ **The framework's OWN additive research channel — NOT part of Cursor's recipe.** Composer's primary sources contain no DPO, no preference pairs, no reward models, no multiple teachers. Stacks *on top of* the genuine replication; it does not define it. |
|
| 32 |
+
|
| 33 |
+
> **Read this before citing the framework.** Any statement of the form "Composer does
|
| 34 |
+
> trace-replay-DPO" or "the replication target includes channel 3" is **wrong**. Cursor's
|
| 35 |
+
> recipe = channels 1 + 2. Channel 3 is our addition, and the docs are careful to say so.
|
| 36 |
+
|
| 37 |
+
The full loss (verification-harness form) is `total = lm_ce + α·sdpo_jsd + β·trace_replay_dpo`;
|
| 38 |
+
production uses `ComposerReplicationTrainer._compute_loss` (a real `trl.GRPOTrainer` subclass),
|
| 39 |
+
where channel 1 is real GRPO rather than the LM-CE stub. See
|
| 40 |
+
[`docs/USER_GUIDE.md`](USER_GUIDE.md) and [`docs/COMPOSER_RECIPE_MAPPING.md`](COMPOSER_RECIPE_MAPPING.md).
|
| 41 |
+
|
| 42 |
+
## What's proven
|
| 43 |
+
|
| 44 |
+
- **CPU SDPO-fires.** On real Qwen2.5-0.5B-Instruct, the SDPO channel demonstrably fires
|
| 45 |
+
(`sdpo_jsd > 0`) and SDPO-on vs SDPO-off totals differ — the "is the loss decrease just
|
| 46 |
+
memorization?" critique is closed (Spike 006-strict).
|
| 47 |
+
- **Real GPU run.** Qwen2.5-0.5B in bf16 on a local 5090 (sm_120): 50 steps, loss
|
| 48 |
+
0.7354 → 0.00034, 5.31 GB peak VRAM (Spike 002a-mini).
|
| 49 |
+
- **A1 8B-ladder Modal run.** The GRPO-only arm (A1) of the LMA channel ladder has a real
|
| 50 |
+
Modal runner and has been run with `dr_grpo`.
|
| 51 |
+
- **GSM8K GRPO.** The `examples/gsm8k_grpo*` end-to-end examples exercise the production
|
| 52 |
+
trainer on a real reasoning benchmark.
|
| 53 |
+
- **Economic feasibility of channel 3.** 150 real OpenRouter calls, $0.98/trace mean, 0
|
| 54 |
+
errors (Spike 001).
|
| 55 |
+
- **Installable + tested.** `pip install -e .` works; **115 passing tests + 1 skip-marked**
|
| 56 |
+
(canonical count: [`docs/V1_V8_COVERAGE.md`](V1_V8_COVERAGE.md)).
|
| 57 |
+
|
| 58 |
+
## What's gapped (honest, NOT closed)
|
| 59 |
+
|
| 60 |
+
1. **Docker / TorchForge substrate E2E** is **hardware-blocked** — the test exists and skips
|
| 61 |
+
cleanly, but there is no local multi-GPU rig to run the orchestrator layer end-to-end.
|
| 62 |
+
2. **The full 8B LMA channel ladder (A2–A4) is not yet runnable.** Only **A1 (GRPO-only)**
|
| 63 |
+
has a real Modal runner. **A2 (SDPO) / A3 (replay-DPO) / A4 (combined)** are scaffold +
|
| 64 |
+
plan-builder only — running them on a real 8B checkpoint additionally needs a real
|
| 65 |
+
error-trace SDPO dataset, a replay-DPO preference corpus, and an A100 entrypoint that
|
| 66 |
+
don't exist yet. The real 8B run is *additionally* user-budget-gated.
|
| 67 |
+
3. **The empirical question** — does the method actually beat plain GRPO at scale? — is the
|
| 68 |
+
GPU-budget-gated v0.1 work (Spikes 002b/003/004) and remains open by design.
|
| 69 |
+
|
| 70 |
+
See [`BACKLOG.md`](../BACKLOG.md) for the live gap list and [`docs/TROUBLESHOOTING.md`](TROUBLESHOOTING.md)
|
| 71 |
+
for known foot-guns.
|
| 72 |
+
|
| 73 |
+
## Foot-guns worth knowing on day one
|
| 74 |
+
|
| 75 |
+
- **Branch sync (resolved 2026-06-09).** `main` is canonical and kept in sync with `master`,
|
| 76 |
+
so a fresh Hub clone of `main` installs the complete tree. If you ever hit an `ImportError` on
|
| 77 |
+
`make_dr_grpo_config`, your clone is stale (`git fetch && git checkout main`). Historically
|
| 78 |
+
`main` lagged `master`; that's fixed as long as both stay synced.
|
| 79 |
+
- **`strip_thinking` × SDPO.** On real agent traces, SDPO requires `strip_thinking=False`:
|
| 80 |
+
~67% of error-recovery turns are pure thinking, so stripping them yields empty SDPO masks.
|
| 81 |
+
- **KL estimator delta.** TRL uses the **k3** estimator; Composer's report describes **k1**.
|
| 82 |
+
This is a documented, intentional delta — the framework does not silently claim k1 parity.
|
| 83 |
+
- **`compose_loss` is the verification harness, not production.** Its channel-1 is an LM-CE
|
| 84 |
+
stub, not real GRPO. Production training is `ComposerReplicationTrainer`.
|
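The k1/k3 delta in the foot-gun list above can be illustrated with the standard per-sample estimators of KL(π ‖ π_ref) for samples drawn from the policy π. With d = log π_ref(x) − log π(x), k1 = −d and k3 = exp(d) − 1 − d. The sketch below is plain Python with made-up log-probabilities; it is neither TRL's nor Cursor's code, and the function names are illustrative.

```python
import math


def k1(logp: float, ref_logp: float) -> float:
    """k1 estimator: the log-ratio itself. Can be negative on a single sample."""
    return logp - ref_logp


def k3(logp: float, ref_logp: float) -> float:
    """k3 estimator: exp(d) - 1 - d with d = ref_logp - logp. Never negative."""
    d = ref_logp - logp
    return math.exp(d) - 1.0 - d


# Illustrative (made-up) per-token log-probs under the policy and the reference.
policy_logps = [-1.2, -0.4, -2.0, -0.9]
ref_logps = [-1.0, -0.7, -1.5, -0.9]

k1_vals = [k1(p, r) for p, r in zip(policy_logps, ref_logps)]
k3_vals = [k3(p, r) for p, r in zip(policy_logps, ref_logps)]

print("k1:", [round(v, 4) for v in k1_vals])
print("k3:", [round(v, 4) for v in k3_vals])
```

The two estimators agree only in expectation, so per-token penalties (and therefore gradients) differ between a k1-based and a k3-based setup. That is the reason the delta is worth documenting rather than treating as parity.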
| 85 |
+
|
| 86 |
+
## Where to go next
|
| 87 |
+
|
| 88 |
+
| You want to… | Read |
|
| 89 |
+
|---|---|
|
| 90 |
+
| Pitch / status / roadmap | [`README.md`](../README.md) |
|
| 91 |
+
| Run it end-to-end | [`docs/USER_GUIDE.md`](USER_GUIDE.md) |
|
| 92 |
+
| Wire the loss into TRL / VeRL / PRIME-RL / DiLoCo / Monarch | [`docs/INTEGRATION_RECIPES.md`](INTEGRATION_RECIPES.md) |
|
| 93 |
+
| Exact kwargs / signatures | [`docs/API_REFERENCE.md`](API_REFERENCE.md) |
|
| 94 |
+
| Why each design decision | [`docs/adrs/README.md`](adrs/README.md) |
|
| 95 |
+
| How Cursor's recipe maps to our components | [`docs/COMPOSER_RECIPE_MAPPING.md`](COMPOSER_RECIPE_MAPPING.md) |
|
| 96 |
+
| Honest gaps / open work | [`BACKLOG.md`](../BACKLOG.md), [`docs/VISION_VALIDATION.md`](VISION_VALIDATION.md) |
|
| 97 |
+
| Fix a broken install / run | [`docs/TROUBLESHOOTING.md`](TROUBLESHOOTING.md) |
|
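The verification-harness loss quoted in the OVERVIEW.md diff above, `total = lm_ce + α·sdpo_jsd + β·trace_replay_dpo`, can be sketched as a plain weighted sum. This is a minimal illustration under stated assumptions: `compose_total` is a hypothetical name (not the package's `compose_loss` signature), and the channel values and weights are made up.

```python
def compose_total(
    lm_ce: float,
    sdpo_jsd: float,
    trace_replay_dpo: float,
    alpha: float,
    beta: float,
) -> float:
    """total = lm_ce + alpha * sdpo_jsd + beta * trace_replay_dpo"""
    return lm_ce + alpha * sdpo_jsd + beta * trace_replay_dpo


# Made-up per-channel values, for illustration only.
channels = dict(lm_ce=0.70, sdpo_jsd=0.10, trace_replay_dpo=0.40)

full = compose_total(**channels, alpha=0.5, beta=0.25)
no_channel_3 = compose_total(**channels, alpha=0.5, beta=0.0)  # beta = 0 gates channel 3 off
no_channel_2 = compose_total(**channels, alpha=0.0, beta=0.25)  # alpha = 0 gates SDPO off

print(full, no_channel_3, no_channel_2)
```

Because the channels are additive, setting `beta=0` recovers the channels 1 + 2 combination that the overview identifies as Cursor's recipe, while a non-zero `beta` adds the framework's own trace-replay-DPO term on top.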
docs/USER_GUIDE.md
CHANGED
|
@@ -62,6 +62,13 @@ cd composer-replication-framework
|
|
| 62 |
pip install -e .
|
| 63 |
```
|
| 64 |
|
| 65 |
That gets you `torch>=2.0` + `transformers>=4.46` and is enough for the
|
| 66 |
verification harness on CPU (sections 3, 5, 6).
|
| 67 |
|
|
|
|
| 62 |
pip install -e .
|
| 63 |
```
|
| 64 |
|
| 65 |
+
> **Branch note (resolved 2026-06-09).** `main` is the canonical branch and is kept in
|
| 66 |
+
> sync with `master` (`main == master`). A fresh clone of `main` has the complete tree
|
| 67 |
+
> (incl. `make_dr_grpo_config` / `make_po_config`), so no branch switch is needed.
|
| 68 |
+
> Historically `main` lagged `master` — if you ever see an `ImportError` on those symbols,
|
| 69 |
+
> the clone is stale; `git fetch && git checkout main` (or pin a current SHA) fixes it.
|
| 70 |
+
> See [`docs/HF_REPO_LAYOUT.md`](HF_REPO_LAYOUT.md).
|
| 71 |
+
|
| 72 |
That gets you `torch>=2.0` + `transformers>=4.46` and is enough for the
|
| 73 |
verification harness on CPU (sections 3, 5, 6).
|
| 74 |
|
docs/VISION_VALIDATION.md
CHANGED
|
@@ -1,11 +1,28 @@
|
|
| 1 |
# Vision Validation: Does the Framework Encapsulate the Original Brief?
|
| 2 |
|
| 3 |
-
> **## Status as of 2026-
|
| 4 |
-
> The framework is past-skeleton: 8 subpackages (`composer_replication/*`),
|
| 5 |
>
|
| 6 |
-
> **Two remaining honest gaps:**
|
| 7 |
-
> 1. Docker/TorchForge substrate E2E is hardware-blocked
|
| 8 |
-
>
|
| 9 |
|
| 10 |
> **Status:** Self-audit, 2026-05-25 (Wave 6).
|
| 11 |
> **Question:** Does what we've built reflect what was originally asked for, or did we drift?
|
|
@@ -259,7 +276,7 @@ In recommended-do-next order:
|
|
| 259 |
1. **Spike 006 (real-HF-model smoke)** — half a day, CPU-only. Closes V8's biggest credibility gap and surfaces any tokenizer / chat-template / vocab-size issues hiding in the skeleton. *Highest value per hour.*
|
| 260 |
2. **Spike 007 (real-trace ingestion)** — 1 day, CPU + ~$5 OpenRouter. Closes V5's real-vs-synthetic gap. Picks one source (Claude Code session JSONL? Cline transcripts?) and writes the adapter.
|
| 261 |
3. **Soften over-claims in README and paper** — half-hour. README: "TRL coded, VeRL design-only." Paper: "any causal LM with a chat template." `verl_path/README.md`: add `STATUS: design-only — validate via spike 002b before production use`.
|
| 262 |
-
4. **Wave 7 (packaging)** — half a day, after 006 + 007 land. `pyproject.toml` + `examples/
|
| 263 |
5. **Spike 008 (Streaming DiLoCo smoke)** — 2 days, CPU-only. Closes V2's biggest deviation from the brief. Lowest priority because the deferral is technically defensible, but worth doing for completeness.
|
| 264 |
|
| 265 |
Items 3 + 4 are documentation/packaging chores. Items 1 + 2 + 5 are real engineering. None require GPU budget. Total: ~5 days of sequential effort to take the framework from 5/10 to 9/10 vision encapsulation.
|
|
|
|
| 1 |
# Vision Validation: Does the Framework Encapsulate the Original Brief?
|
| 2 |
|
| 3 |
+
> **## Status as of 2026-06 (current through ADR-014)**
|
| 4 |
+
> The framework is past-skeleton: 8 subpackages (`composer_replication/*`), 115 passing
|
| 5 |
+
> tests + 1 skip-marked (see [`docs/V1_V8_COVERAGE.md`](V1_V8_COVERAGE.md) for the
|
| 6 |
+
> canonical count), and operational end-to-end examples (`gsm8k_grpo`,
|
| 7 |
+
> `sdpo_with_real_traces_production`). The 3-channel loss, layered hint-generation,
|
| 8 |
+
> trace-ingestion, and DiLoCo have all shipped and been cross-family reviewed. The base
|
| 9 |
+
> RL objective is now a **selectable menu** (default Dr.GRPO; ADR-014) rather than
|
| 10 |
+
> hardcoded. **Channel 3 (trace-replay-DPO) is the framework's own additive research
|
| 11 |
+
> channel — not part of Cursor's Composer recipe** (Composer = channels 1 Dr.GRPO + 2
|
| 12 |
+
> SDPO only; ADR-014).
|
| 13 |
>
|
| 14 |
+
> **Two remaining honest gaps (NOT closed):**
|
| 15 |
+
> 1. Docker/TorchForge substrate E2E is hardware-blocked — the test exists and skips
|
| 16 |
+
> cleanly, but lacking a local multi-GPU rig the orchestrator layer is unrun.
|
| 17 |
+
> 2. The 8B LMA channel-ladder is **not fully runnable today**: only **A1 (GRPO-only)**
|
| 18 |
+
> has a real Modal runner — [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md)
|
| 19 |
+
> records that "the A1 run used `dr_grpo`" and that threading the `objective=` menu
|
| 20 |
+
> through the rest of the ladder runners is an open follow-up. **A2 (SDPO) / A3
|
| 21 |
+
> (replay-DPO) / A4 (combined)** are scaffold + plan-builder only; running them on a
|
| 22 |
+
> real 8B checkpoint additionally needs a real error-trace SDPO dataset, a replay-DPO
|
| 23 |
+
> preference corpus, and an A100 entrypoint that don't exist yet. The real 8B run is
|
| 24 |
+
> *additionally* user-budget-gated (the sole remaining acceptance-gate box in
|
| 25 |
+
> [ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md)).
|
| 26 |
|
| 27 |
> **Status:** Self-audit, 2026-05-25 (Wave 6).
|
| 28 |
> **Question:** Does what we've built reflect what was originally asked for, or did we drift?
|
|
|
|
| 276 |
1. **Spike 006 (real-HF-model smoke)** — half a day, CPU-only. Closes V8's biggest credibility gap and surfaces any tokenizer / chat-template / vocab-size issues hiding in the skeleton. *Highest value per hour.*
|
| 277 |
2. **Spike 007 (real-trace ingestion)** — 1 day, CPU + ~$5 OpenRouter. Closes V5's real-vs-synthetic gap. Picks one source (Claude Code session JSONL? Cline transcripts?) and writes the adapter.
|
| 278 |
3. **Soften over-claims in README and paper** — half-hour. README: "TRL coded, VeRL design-only." Paper: "any causal LM with a chat template." `verl_path/README.md`: add `STATUS: design-only — validate via spike 002b before production use`.
|
| 279 |
+
4. **Wave 7 (packaging)** — half a day, after 006 + 007 land. `pyproject.toml` + `examples/qwen_05b_quickstart/` directory + entry-point exposure.
|
| 280 |
5. **Spike 008 (Streaming DiLoCo smoke)** — 2 days, CPU-only. Closes V2's biggest deviation from the brief. Lowest priority because the deferral is technically defensible, but worth doing for completeness.
|
| 281 |
|
| 282 |
Items 3 + 4 are documentation/packaging chores. Items 1 + 2 + 5 are real engineering. None require GPU budget. Total: ~5 days of sequential effort to take the framework from 5/10 to 9/10 vision encapsulation.
|
docs/WAVE_COMPOSER_DATAGEN_RL_2026-05-29.md
CHANGED
|
@@ -1,75 +1,9 @@
|
|
| 1 |
# Wave: Composer 2.5 data-gen + targeted-RL textual-feedback (2026-05-29)
|
| 2 |
|
| 3 |
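-
> Deep-work-loop wave bringing Composer 2.5's **dataset generation** and
|
| 4 |
-
> **targeted RL with textual feedback** into the framework, per Codeseys'
|
| 5 |
-
> directive. This doc encapsulates what shipped and ties it to the ask.
|
| 6 |
-
|
| 7 |
-
## The ask (verbatim intent)
|
| 8 |
-
|
| 9 |
-
> "make sure that our framework for data-generation and massive deep RL similar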
|
| 10 |
-
> to composer 2.5 is able to … train a model like nanochat on modal … also …
|
| 11 |
-
> diloco … bring the composer 2.5 blog's discussion on dataset generation into
|
| 12 |
-
> the fray … the composer 2.5 RL method of targeted RL with textual feedback
|
| 13 |
-
> should also be included in the RL framework as well as the dataset generation
|
| 14 |
-
> framework."
|
| 15 |
-
|
| 16 |
-
## What shipped (all on `hf/master`)
|
| 17 |
-
|
| 18 |
-
| Capability | Status | Artifact |
|
| 19 |
-
|---|---|---|
|
| 20 |
-
| **Targeted RL w/ textual feedback** (SDPO) live in a Dr. GRPO loop | ✅ ADR-008 accepted | `trainer/composer_trainer.py` (`make_dr_grpo_config`, strict SDPO alignment), `examples/composer_grpo_sdpo_smoke/` |
|
| 21 |
-
| **Hint generation** (the unstated #1 gap) — layered generator | ✅ ADR-009 accepted | `hint_generator.py` (Template→RawError→LLMJudge→sibling-bootstrap, `as_collator_hook`) |
|
| 22 |
-
| **Dataset generation** (Feature Deletion) | ✅ ADR-010 accepted | `composer_replication/datagen/` (env, sandbox+safeguards, monitor, curriculum, validator, SWE-substrate adapter) |
|
| 23 |
-
| nanochat-on-Modal + DiLoCo A/B | ✅ (prior wave, same session) | DiLoCo tied vanilla within noise; SFT ChatCORE 0.3076; RL Pass@1 doubled |
|
| 24 |
-
| SDPO loss math end-to-end on real traces | ✅ (prior wave, same session) | `examples/sdpo_real_trace_train_smoke/` |
|
| 25 |
-
|
| 26 |
-
## Key research resolutions (5 docs, `research/06-10`)
|
| 27 |
-
|
| 28 |
-
- **RL algorithm = Dr. GRPO** (Composer 2 tech report arXiv:2603.24477): length-standardization removed, no std-dev advantage norm, Adam, single-epoch, k1 (`−log r`) KL. TRL 1.5.0 supports this natively (`loss_type="dr_grpo"`, `scale_rewards="none"`).
|
| 29 |
-
- **Hint generation is unstated in every Cursor artifact** — so the SDPO/OPSD reconstruction is the legitimate path, not a missed detail. SDPO's "successful-rollout-as-implicit-feedback" is the bootstrap fallback.
|
| 30 |
-
- **Composer 2 hint-free behavior-shaping alternative** documented (aux scalar rewards + nonlinear length/effort penalty) for when hints aren't available.
|
| 31 |
-
- Corrections to the mapping doc: optimizer Adam (not Muon — 2.5-blog-only); sharding FSDP+CP+decoupled-EP (not HSDP).
|
| 32 |
-
|
| 33 |
-
## Test posture
|
| 34 |
-
|
| 35 |
-
- 187 passed / 16 skipped across the full package (no regressions).
|
| 36 |
-
- New: 7 (Dr.GRPO config + SDPO alignment), 12 (layered hints), 19 (datagen).
|
| 37 |
-
- The SDPO channel is proven both as loss-math-on-real-traces AND wired into a live `trl.GRPOTrainer` Dr. GRPO step.
|
| 38 |
-
|
| 39 |
-
## Reusable for the adjacent project
|
| 40 |
-
|
| 41 |
-
`composer_replication.datagen` is the data-gen primitive Codeseys wanted for
|
| 42 |
-
another project: **"invert a solved-repo dataset into a reimplement-to-pass
|
| 43 |
-
verifiable-reward task,"** with reward-hacking safeguards and an online
|
| 44 |
-
difficulty curriculum. `SweBenchAdapter` makes any SWE-bench-shaped dataset
|
| 45 |
-
(SWE-Gym, R2E-Gym, SWE-rebench — tens of thousands of tasks) a drop-in source.
|
| 46 |
-
|
| 47 |
-
## Owed / unblocked-by (constraints this wave hit)
|
| 48 |
-
|
| 49 |
-
- **Cross-family adversarial ADR review** (GPT-5.5 / Gemini 3.1 Pro / DeepSeek
|
| 50 |
-
V4 Pro / Kimi K2.6 / Grok 4.3) — ✅ **DONE 2026-05-29** once OpenRouter credits
|
| 51 |
-
were restored. 4/5 clean reviews (Kimi starved its budget). Verdicts: ADR-008
|
| 52 |
-
3-REJECT/1-fix, ADR-009 unanimous-fix, ADR-010 3-REJECT/1-fix — the REJECTs all
|
| 53 |
-
amounted to "accepted-status outran the evidence," not "design is wrong." The
|
| 54 |
-
review **caught a real policy-corrupting P0 the solo pre-mortem missed**: the
|
| 55 |
-
SDPO alignment guard was a shape-check that let hint-shifted tokens silently
|
| 56 |
-
misalign. That plus 6 other convergent defects (sandbox scrub unimplemented,
|
| 57 |
-
curriculum partial-reward→full-pass, reward_fn truncation, shell injection,
|
| 58 |
-
LLM-judge cache non-determinism, scale_rewards assert dup) were **fixed and
|
| 59 |
-
tested** (192 pass / 16 skip, was 187). Full synthesis + the 4 reviews:
|
| 60 |
-
`docs/reviews/cross-family-adr-008-009-010-2026-05-29/`. Each ADR now carries a
|
| 61 |
-
"Post-acceptance cross-family review" section documenting fixed vs open items.
|
| 62 |
-
- **Live Docker substrate-inversion e2e** for ADR-010 (pull one SWE-bench-Lite
|
| 63 |
-
image, run the 4 gates against it) — STILL deferred (no Docker in the CPU dev
|
| 64 |
-
env). The review confirmed this is the gate that matters: it closes the
|
| 65 |
-
remaining OPEN findings (real reachability/provenance, FakeSandbox-tautology
|
| 66 |
-
objection). Marked `[~]` in ADR-010. Needs a box with Docker + a SWE-bench-Lite
|
| 67 |
-
image.
|
| 68 |
-
- Gateway restarted 6× during the original wave; all work kept durable via
|
| 69 |
-
phase-boundary commits + detached `systemd-run` scopes.
|
| 70 |
-
|
| 71 |
-
## Commit trail
|
| 72 |
-
|
| 73 |
-
`7090729` (SDPO smoke) → `6049d00` (research) → `36ab61e` (ADRs) →
|
| 74 |
-
`bde5c5e` (ADR-008 code) → `2a34df4` (ADR-008 smoke+accept) →
|
| 75 |
-
`84740d4` (ADR-009) → `9336af3` (ADR-010).
|
|
|
|
| 1 |
# Wave: Composer 2.5 data-gen + targeted-RL textual-feedback (2026-05-29)
|
| 2 |
|
| 3 |
+
> **📦 Archived (2026-06-08).** This point-in-time wave log has been moved to
|
| 4 |
+
> [`docs/_archive/WAVE_COMPOSER_DATAGEN_RL_2026-05-29.md`](_archive/WAVE_COMPOSER_DATAGEN_RL_2026-05-29.md).
|
| 5 |
+
> It is preserved verbatim for provenance but is superseded by the current
|
| 6 |
+
> [`docs/METHODOLOGY.md`](METHODOLOGY.md), [`BACKLOG.md`](../BACKLOG.md), and the
|
| 7 |
+
> accepted ADRs (notably [ADR-008](adrs/ADR-008-drgrpo-sdpo-live-channel.md) and
|
| 8 |
+
> [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md)). See
|
| 9 |
+
> [`docs/_archive/README.md`](_archive/README.md) for the archive index.
|
docs/_archive/DEEP_WORK_LOOP_LOG.md
ADDED
|
@@ -0,0 +1,47 @@
|
|
| 1 |
+
# Deep Work Loop Log — Composer 2.5 Replication Framework
|
| 2 |
+
|
| 3 |
+
Started: 2026-05-26
|
| 4 |
+
Operator: Codeseys (Hermes Agent autonomous loop)
|
| 5 |
+
Skill: `deep-work-loop` v1.0.0
|
| 6 |
+
|
| 7 |
+
## Vision
|
| 8 |
+
|
| 9 |
+
> Take any HuggingFace model → further RL train it using:
|
| 10 |
+
> 1. RLVR (tests-pass reward),
|
| 11 |
+
> 2. SDPO/hint-distillation (Composer 2.5's "targeted RL with textual feedback"),
|
| 12 |
+
> 3. multi-teacher trace-replay DPO,
|
| 13 |
+
> integrated against TRL/VeRL/OpenEnv with DiLoCo-style outer loop sync.
|
| 14 |
+
>
|
| 15 |
+
> Output: a published, reproducible framework — the "Composer 2.5 replication" the open ecosystem is missing.
|
| 16 |
+
|
| 17 |
+
## Starting state
|
| 18 |
+
|
| 19 |
+
- HEAD: `040eff8` (Wave 6: vision validation self-audit, 5/10 scorecard)
|
| 20 |
+
- Tests: 38/38 green in `spikes/005-integrated-trainer-skeleton/`
|
| 21 |
+
- Working tree: clean
|
| 22 |
+
|
| 23 |
+
## Phase ledger
|
| 24 |
+
|
| 25 |
+
| Phase | Description | Status | Started | Done |
|
| 26 |
+
|---|---|---|---|---|
|
| 27 |
+
| 1 | commit-state | ✅ | 2026-05-26 | 2026-05-26 |
|
| 28 |
+
| 2 | backlog-audit (BACKLOG.md from VISION_VALIDATION) | ✅ | 2026-05-26 | 2026-05-26 |
|
| 29 |
+
| 3 | parallel-research (3 subagents) | 🟡 | 2026-05-26 | |
|
| 30 |
+
| 4 | architect with ADRs (ADR-001..003) | ⏳ | | |
|
| 31 |
+
| 5 | plan in waves (W7–W10) | ⏳ | | |
|
| 32 |
+
| 6 | execute W7 — Spike 006 (real HF model smoke) | ⏳ | | |
|
| 33 |
+
| 7 | execute W8 — Spike 007 (real trace ingestion) | ⏳ | | |
|
| 34 |
+
| 8 | execute W9 — Spike 008 (DiLoCo smoke) | ⏳ | | |
|
| 35 |
+
| 9 | execute W10 — packaging | ⏳ | | |
|
| 36 |
+
| 10 | (Modal-gated) Spike 002a-mini real GPU smoke | ⏳ | | |
|
| 37 |
+
| 11 | cross-model-final-review | ⏳ | | |
|
| 38 |
+
| 12 | update scorecard + push | ⏳ | | |
|
| 39 |
+
|
| 40 |
+
## Constraints
|
| 41 |
+
|
| 42 |
+
- Verify ALL claims against primary sources (Wave 2 lesson — subagent synthesis is not evidence).
|
| 43 |
+
- Tests must pass before commit.
|
| 44 |
+
- Memory L1 is at 99% — write to L2 wiki + L3 fact_store, not L1.
|
| 45 |
+
- Modal budget: $20 hard cap for this loop. Anything more goes to user for approval.
|
| 46 |
+
- No `upload_file` mixing with `git push` — `git push hf master:main` only.
|
| 47 |
+
- Commit messages via `-F /tmp/<wave>-commit-msg.txt`.
|
docs/_archive/README.md
ADDED
|
@@ -0,0 +1,39 @@
|
|
| 1 |
+
# docs/_archive — historical, point-in-time artifacts
|
| 2 |
+
|
| 3 |
+
This directory holds **non-research** documents that captured the state of the
|
| 4 |
+
project at a specific moment (wave logs, dated cross-family review bundles).
|
| 5 |
+
They are preserved **verbatim for provenance** and are **not** maintained as
|
| 6 |
+
current truth.
|
| 7 |
+
|
| 8 |
+
> **What's current instead.** For the live state of the framework, read
|
| 9 |
+
> [`docs/OVERVIEW.md`](../OVERVIEW.md), [`README.md`](../../README.md),
|
| 10 |
+
> [`BACKLOG.md`](../../BACKLOG.md), [`docs/METHODOLOGY.md`](../METHODOLOGY.md),
|
| 11 |
+
> [`docs/V1_V8_COVERAGE.md`](../V1_V8_COVERAGE.md), and the accepted ADRs under
|
| 12 |
+
> [`docs/adrs/`](../adrs/README.md). Where an archived doc and a current doc
|
| 13 |
+
> disagree, the current doc wins.
|
| 14 |
+
|
| 15 |
+
Each archived file's original path still contains a one-line **redirect stub** so
|
| 16 |
+
that older prose references (including those baked into immutable accepted ADRs)
|
| 17 |
+
keep resolving.
|
| 18 |
+
|
| 19 |
+
## Contents
|
| 20 |
+
|
| 21 |
+
| Archived file | Original path | What it is | Superseded by |
|
| 22 |
+
|---|---|---|---|
|
| 23 |
+
| [`WAVE_COMPOSER_DATAGEN_RL_2026-05-29.md`](WAVE_COMPOSER_DATAGEN_RL_2026-05-29.md) | `docs/WAVE_COMPOSER_DATAGEN_RL_2026-05-29.md` | Wave log: Composer datagen + targeted-RL textual-feedback work, 2026-05-29 | ADR-008, ADR-009, ADR-010, ADR-014; `BACKLOG.md` |
|
| 24 |
+
| [`DEEP_WORK_LOOP_LOG.md`](DEEP_WORK_LOOP_LOG.md) | `docs/DEEP_WORK_LOOP_LOG.md` | Running log of deep-work-loop sessions | `docs/METHODOLOGY.md`; `BACKLOG.md` |
|
| 25 |
+
| [`reviews/cross-family-adr-008-009-010-2026-05-29/`](reviews/cross-family-adr-008-009-010-2026-05-29/) | `docs/reviews/cross-family-adr-008-009-010-2026-05-29/` | 5-family adversarial review of ADR-008/009/010 (SYNTHESIS + 4 per-model reviews) | ADR-008 / ADR-012 acceptance records |
|
| 26 |
+
| [`reviews/final-verify-deep-work-2026-05-29/`](reviews/final-verify-deep-work-2026-05-29/) | `docs/reviews/final-verify-deep-work-2026-05-29/` | Phase-8 final cross-family verify (SYNTHESIS + per-model verifies) | accepted ADRs; `BACKLOG.md` |
|
| 27 |
+
|
| 28 |
+
## Why archive instead of delete
|
| 29 |
+
|
| 30 |
+
These documents record *how the decisions were reached* — the adversarial
|
| 31 |
+
findings, the wave-by-wave reasoning, the cross-model disagreements. That
|
| 32 |
+
provenance is valuable for anyone auditing the framework's claims, and the
|
| 33 |
+
accepted ADRs cite these paths by name. Deleting them would orphan those
|
| 34 |
+
citations and erase the audit trail. Archiving keeps the trail intact while
|
| 35 |
+
making clear they are snapshots, not the live spec.
|
| 36 |
+
|
| 37 |
+
> Research-flavored point-in-time reviews (the `WAVE_*_FINAL_REVIEW.md`
|
| 38 |
+
> cross-model audits and the Wave-16 recon audit) live in the sibling
|
| 39 |
+
> [`docs/research/_archive/`](../research/_archive/README.md) instead.
|
docs/_archive/WAVE_COMPOSER_DATAGEN_RL_2026-05-29.md
ADDED
|
@@ -0,0 +1,75 @@
|
|
| 1 |
+
# Wave: Composer 2.5 data-gen + targeted-RL textual-feedback (2026-05-29)
|
| 2 |
+
|
| 3 |
+
> Deep-work-loop wave bringing Composer 2.5's **dataset generation** and
|
| 4 |
+
> **targeted RL with textual feedback** into the framework, per Codeseys'
|
| 5 |
+
> directive. This doc encapsulates what shipped and ties it to the ask.
|
| 6 |
+
|
| 7 |
+
## The ask (verbatim intent)
|
| 8 |
+
|
| 9 |
+
> "make sure that our framework for data-generation and massive deep RL similar
|
| 10 |
+
> to composer 2.5 is able to … train a model like nanochat on modal … also …
|
| 11 |
+
> diloco … bring the composer 2.5 blog's discussion on dataset generation into
|
| 12 |
+
> the fray … the composer 2.5 RL method of targeted RL with textual feedback
|
| 13 |
+
> should also be included in the RL framework as well as the dataset generation
|
| 14 |
+
> framework."
|
| 15 |
+
|
| 16 |
+
## What shipped (all on `hf/master`)
|
| 17 |
+
|
| 18 |
+
| Capability | Status | Artifact |
|
| 19 |
+
|---|---|---|
|
| 20 |
+
| **Targeted RL w/ textual feedback** (SDPO) live in a Dr. GRPO loop | ✅ ADR-008 accepted | `trainer/composer_trainer.py` (`make_dr_grpo_config`, strict SDPO alignment), `examples/composer_grpo_sdpo_smoke/` |
|
| 21 |
+
| **Hint generation** (the unstated #1 gap) — layered generator | ✅ ADR-009 accepted | `hint_generator.py` (Template→RawError→LLMJudge→sibling-bootstrap, `as_collator_hook`) |
|
| 22 |
+
| **Dataset generation** (Feature Deletion) | ✅ ADR-010 accepted | `composer_replication/datagen/` (env, sandbox+safeguards, monitor, curriculum, validator, SWE-substrate adapter) |
|
| 23 |
+
| nanochat-on-Modal + DiLoCo A/B | ✅ (prior wave, same session) | DiLoCo tied vanilla within noise; SFT ChatCORE 0.3076; RL Pass@1 doubled |
|
| 24 |
+
| SDPO loss math end-to-end on real traces | ✅ (prior wave, same session) | `examples/sdpo_real_trace_train_smoke/` |
|
| 25 |
+
|
| 26 |
+
## Key research resolutions (5 docs, `research/06-10`)
|
| 27 |
+
|
| 28 |
+
- **RL algorithm = Dr. GRPO** (Composer 2 tech report arXiv:2603.24477): length-standardization removed, no std-dev advantage norm, Adam, single-epoch, k1 (`−log r`) KL. TRL 1.5.0 supports this natively (`loss_type="dr_grpo"`, `scale_rewards="none"`).
|
| 29 |
+
- **Hint generation is unstated in every Cursor artifact** — so the SDPO/OPSD reconstruction is the legitimate path, not a missed detail. SDPO's "successful-rollout-as-implicit-feedback" is the bootstrap fallback.
|
| 30 |
+
- **Composer 2 hint-free behavior-shaping alternative** documented (aux scalar rewards + nonlinear length/effort penalty) for when hints aren't available.
|
| 31 |
+
- Corrections to the mapping doc: optimizer Adam (not Muon — 2.5-blog-only); sharding FSDP+CP+decoupled-EP (not HSDP).
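
The first resolution above ("no std-dev advantage norm") can be illustrated with a minimal pure-Python sketch. It assumes the common reading of Dr. GRPO: each rollout's reward is centered on its group mean but not divided by the group standard deviation. The function names are hypothetical and this is neither the framework's code nor TRL's; in TRL the behavior is selected through the config flags quoted in the bullet.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Baseline GRPO-style advantage: center on the group mean, then scale by group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr. GRPO-style advantage (sketch): center on the group mean only, no std scaling."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


# One prompt, four sampled rollouts, binary tests-pass reward.
group = [1.0, 0.0, 0.0, 0.0]
print(dr_grpo_advantages(group))
```

The centered advantages always sum to zero within a group; dropping the std division means that groups with nearly uniform rewards no longer have small reward differences amplified.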
|
| 32 |
+
|
| 33 |
+
## Test posture
|
| 34 |
+
|
| 35 |
+
- 187 passed / 16 skipped across the full package (no regressions).
|
| 36 |
+
- New: 7 (Dr.GRPO config + SDPO alignment), 12 (layered hints), 19 (datagen).
|
| 37 |
+
- The SDPO channel is proven both as loss-math-on-real-traces AND wired into a live `trl.GRPOTrainer` Dr. GRPO step.
|
| 38 |
+
|
| 39 |
+
## Reusable for the adjacent project
|
| 40 |
+
|
| 41 |
+
`composer_replication.datagen` is the data-gen primitive Codeseys wanted for
|
| 42 |
+
another project: **"invert a solved-repo dataset into a reimplement-to-pass
|
| 43 |
+
verifiable-reward task,"** with reward-hacking safeguards and an online
|
| 44 |
+
difficulty curriculum. `SweBenchAdapter` makes any SWE-bench-shaped dataset
|
| 45 |
+
(SWE-Gym, R2E-Gym, SWE-rebench — tens of thousands of tasks) a drop-in source.
|
| 46 |
+
|
| 47 |
+
## Owed / unblocked-by (constraints this wave hit)
|
| 48 |
+
|
| 49 |
+
- **Cross-family adversarial ADR review** (GPT-5.5 / Gemini 3.1 Pro / DeepSeek
|
| 50 |
+
V4 Pro / Kimi K2.6 / Grok 4.3) — ✅ **DONE 2026-05-29** once OpenRouter credits
|
| 51 |
+
were restored. 4/5 clean reviews (Kimi starved its budget). Verdicts: ADR-008
|
| 52 |
+
3-REJECT/1-fix, ADR-009 unanimous-fix, ADR-010 3-REJECT/1-fix — the REJECTs all
|
| 53 |
+
amounted to "accepted-status outran the evidence," not "design is wrong." The
|
| 54 |
+
review **caught a real policy-corrupting P0 the solo pre-mortem missed**: the
|
| 55 |
+
SDPO alignment guard was a shape-check that let hint-shifted tokens silently
|
| 56 |
+
misalign. That plus 6 other convergent defects (sandbox scrub unimplemented,
|
| 57 |
+
curriculum partial-reward→full-pass, reward_fn truncation, shell injection,
|
| 58 |
+
LLM-judge cache non-determinism, scale_rewards assert dup) were **fixed and
|
| 59 |
+
tested** (192 pass / 16 skip, was 187). Full synthesis + the 4 reviews:
|
| 60 |
+
`docs/reviews/cross-family-adr-008-009-010-2026-05-29/`. Each ADR now carries a
|
| 61 |
+
"Post-acceptance cross-family review" section documenting fixed vs open items.
|
| 62 |
+
- **Live Docker substrate-inversion e2e** for ADR-010 (pull one SWE-bench-Lite
|
| 63 |
+
image, run the 4 gates against it) — STILL deferred (no Docker in the CPU dev
|
| 64 |
+
env). The review confirmed this is the gate that matters: it closes the
|
| 65 |
+
remaining OPEN findings (real reachability/provenance, FakeSandbox-tautology
|
| 66 |
+
objection). Marked `[~]` in ADR-010. Needs a box with Docker + a SWE-bench-Lite
|
| 67 |
+
image.
|
| 68 |
+
- Gateway restarted 6× during the original wave; all work kept durable via
|
| 69 |
+
phase-boundary commits + detached `systemd-run` scopes.
|
| 70 |
+
|
| 71 |
+
## Commit trail
|
| 72 |
+
|
| 73 |
+
`7090729` (SDPO smoke) → `6049d00` (research) → `36ab61e` (ADRs) →
|
| 74 |
+
`bde5c5e` (ADR-008 code) → `2a34df4` (ADR-008 smoke+accept) →
|
| 75 |
+
`84740d4` (ADR-009) → `9336af3` (ADR-010).
|
docs/_archive/reviews/cross-family-adr-008-009-010-2026-05-29/SYNTHESIS.md
ADDED
|
@@ -0,0 +1,90 @@
|
|
| 1 |
+
# Cross-family adversarial review — ADR-008 / 009 / 010 (2026-05-29)
|
| 2 |
+
|
| 3 |
+
> The "owed" cross-family adversarial ADR review from the Composer 2.5
|
| 4 |
+
> data-gen + targeted-RL wave. Deferred during the wave because OpenRouter ran
|
| 5 |
+
> out of credits mid-run (blocking `delegate_task`); run here once credits were
|
| 6 |
+
> restored (~$50 free at review time). This is the proper GPT-5.5 / Gemini /
|
| 7 |
+
> DeepSeek / Kimi / Grok scatter the wave summary promised.
|
| 8 |
+
|
| 9 |
+
## Method
|
| 10 |
+
|
| 11 |
+
Route-fidelity-safe urllib scatter (no `delegate_task` per-task-override risk)
|
| 12 |
+
to 5 diverse families, full corpus embedded inline: the 3 ADRs **plus their
|
| 13 |
+
implementation code plus the tests** (~100KB / ~25K tokens), so reviewers
|
| 14 |
+
critiqued the *implementation against the decision*, not just the prose.
|
| 15 |
+
|
| 16 |
+
| Family | Slug (served) | Result |
|
| 17 |
+
|---|---|---|
|
| 18 |
+
| OpenAI | gpt-5.5-20260423 | clean, very specific (7981 tok) |
|
| 19 |
+
| Google | gemini-3.1-pro-preview-20260219 | clean, sharpest (9174 tok @ retry 16K) |
|
| 20 |
+
| DeepSeek | deepseek-v4-pro-20260423 | clean, math/correctness lens (10308 tok @ retry 16K) |
|
| 21 |
+
| xAI | grok-4.3-20260430 | clean, terse/decisive (1645 tok) |
|
| 22 |
+
| Moonshot | kimi-k2.6-20260420 | starved token budget (63K reasoning, content=None) — excluded |
|
| 23 |
+
|
| 24 |
+
4 of 5 clean reviews. Raw reviews: `review_*.md` in this directory.
|
| 25 |
+
|
| 26 |
+
## Verdict matrix
|
| 27 |
+
|
| 28 |
+
| ADR | GPT-5.5 | Gemini | DeepSeek | Grok | Consensus |
|
| 29 |
+
|---|---|---|---|---|---|
|
| 30 |
+
| 008 (Dr.GRPO+SDPO) | REJECT | REJECT | ACCEPT-w-FIXES | REJECT | **3 REJECT / 1 fix** |
|
| 31 |
+
| 009 (layered hints) | ACCEPT-w-FIXES | ACCEPT-w-FIXES | ACCEPT-w-FIXES | ACCEPT | **unanimous: fixable** |
|
| 32 |
+
| 010 (FeatureDeletion datagen) | REJECT | REJECT | REJECT | ACCEPT-w-FIXES | **3 REJECT / 1 fix** |
|
| 33 |
+
|
| 34 |
+
The REJECTs were not "the design is wrong" — every reviewer agreed the
|
| 35 |
+
architecture is sound. They were "the **accepted** status outruns the
|
| 36 |
+
**evidence**": gates marked `[x]` were satisfied by shape-checks and FakeSandbox
|
| 37 |
+
tautologies, not by the hard guarantee the gate language implied.
|
| 38 |
+
|
| 39 |
+
## Convergent findings (≥2 reviewers, ALL verified against code by the orchestrator)
|
| 40 |
+
|
| 41 |
+
| # | Finding | Reviewers | Status |
|
| 42 |
+
|---|---|---|---|
|
| 43 |
+
| 1 | SDPO alignment guard = shape-check only; hint-shifted tokens silently misalign → policy poisoning | 4/4 | **FIXED** (explicit alignment-index gather + strict-raise; tests rewritten) |
|
| 44 |
+
| 2 | Sandbox denylist trivially bypassable (`python -c`, abs paths, `sh -c`) + ADR-claimed scrub UNIMPLEMENTED | 4/4 | **FIXED** (`_scrub_tree` primary control implemented + tested) |
|
| 45 |
+
| 3 | Curriculum `int(reward>0)` logs 0.5 partial as full pass → premature retire | 3/4 | **FIXED** (fractional credit + tests) |
|
| 46 |
+
| 4 | `scale_rewards` assertion dup literal `("none","False","False")` | 4/4 | **FIXED** (case-insensitive) |
|
| 47 |
+
| 5 | CPU smoke is tautological — never enters `_compute_sdpo_loss` (early return) | 3/4 | **RE-SCOPED** (gate honestly reworded) |
|
| 48 |
+
| 6 | FakeSandbox validator tests prove plumbing, not real inversion | 3/4 | **RE-SCOPED** (= the `[~]` Docker gate) |
|
| 49 |
+
| 7 | HackMonitor is substring matcher, not AST-provenance as advertised | 2/4 | **RE-SCOPED + OPEN** (follow-up) |
|
| 50 |
+
| 8 | Validator gate-2/"deletion reachable" doesn't test reachability | 2/4 | **OPEN** (needs Docker materializers) |
|
| 51 |
+
| 9 | `shell=True` + node-id interpolation = injection + param-test breakage | 2/4 | **FIXED** (`shlex.quote`) |
|
| 52 |
+
| 10 | LLM-judge cache key non-deterministic (memory addr) + unbounded output | 2/4 | **FIXED** (addr-strip + version + clamp + atomic write) |
|
| 53 |
+
| 11 | KL estimator k1 never configured/asserted | 2/4 | **OPEN** (fidelity follow-up) |
|
| 54 |
+
| 12 | `reward_fn` zip truncates on length mismatch | 1 (GPT) | **FIXED** (length-guard) |
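
Finding 3 in the table above is easy to reproduce in isolation. The sketch below is a hypothetical stand-in for the curriculum statistics (the class and field names are assumptions, not the contents of `curriculum.py`): it contrasts the `int(reward>0)` rule with fractional credit, and treats a hacked rollout as an exposure that earns no credit.

```python
from dataclasses import dataclass


def binary_pass(reward: float) -> int:
    """The flagged rule: any positive reward counts as a full pass."""
    return int(reward > 0)


@dataclass
class TaskStats:  # hypothetical name, for illustration only
    exposures: int = 0
    credit: float = 0.0

    def record(self, reward: float, hacked: bool = False) -> None:
        """Count every rollout as an exposure; add fractional credit unless it was hacked."""
        self.exposures += 1
        if not hacked:
            self.credit += min(max(reward, 0.0), 1.0)

    @property
    def pass_rate(self) -> float:
        return self.credit / self.exposures if self.exposures else 0.0


stats = TaskStats()
for reward in (0.5, 0.5):
    stats.record(reward)
print(stats.pass_rate, binary_pass(0.5))
```

Under the binary rule two half-passing rollouts look like a 100% pass rate, which is what would retire a task prematurely; with fractional credit the same rollouts score 0.5.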
|
| 55 |
+
|
| 56 |
+
## What was fixed in this pass (all tested, 192 pass / 16 skip, 0 fail)
|
| 57 |
+
|
| 58 |
+
- **SDPO alignment** (`composer_trainer.py`): require `student_response_idx` /
|
| 59 |
+
`teacher_response_idx` from the collator; gather aligned post-hint logits
|
| 60 |
+
before JSD; strict-mode raises on missing/mismatched indices. The
|
| 61 |
+
silent-misalignment P0 (the only finding that would actually corrupt a run) is
|
| 62 |
+
closed and tested across different-length sequences.
|
| 63 |
+
- **Sandbox scrub** (`sandbox.py`): `boot()` physically removes byte-code/type
|
| 64 |
+
caches + `.git`/`.hg` — the ADR-claimed PRIMARY reward-hack control, previously
|
| 65 |
+
absent. Denylist re-documented as defense-in-depth + `shlex.quote` on node ids.
|
| 66 |
+
- **Curriculum** (`curriculum.py` + `env.py`): fractional pass credit; hacked /
|
| 67 |
+
guard-broken rollouts contribute 0 credit but count as exposures.
|
| 68 |
+
- **reward_fn** (`env.py`): length-guard against silent truncation.
|
| 69 |
+
- **LLM judge** (`hint_generator.py`): address-stripped + versioned cache key,
|
| 70 |
+
600-char output clamp, atomic disk write.
|
| 71 |
+
- **`make_dr_grpo_config`** (`composer_trainer.py`): fixed the duplicated
|
| 72 |
+
assertion literal.
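
As an illustration of the `shlex.quote` fix in the sandbox bullet above, here is a minimal sketch. The command layout and the helper name are assumptions (the source does not show how `sandbox.py` builds its command line); only `shlex.quote` and `shlex.split` are real standard-library calls.

```python
import shlex


def build_test_command(node_id: str) -> str:
    """Build a shell command string for one test node id, quoting the id (hypothetical helper)."""
    return f"python -m pytest {shlex.quote(node_id)}"


# A parametrized id contains brackets and spaces; a hostile id contains shell metacharacters.
for node_id in ("tests/test_x.py::test_y[a b]", "tests/test_x.py::test_y; rm -rf /"):
    cmd = build_test_command(node_id)
    # After quoting, the shell sees the whole id as a single argument.
    print(shlex.split(cmd)[-1] == node_id)
```

Quoting addresses both halves of finding 9 at once: the injection risk and the breakage of parametrized test ids. Passing an argument list with `shell=False` would avoid the shell entirely, but that is a different design from the one the review records.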
|
| 73 |
+
|
| 74 |
+
## What remains OPEN (honest follow-ups, none corrupt a run)
|
| 75 |
+
|
| 76 |
+
1. **Live Docker substrate-inversion e2e** — the central `[~]` gate. Closes
|
| 77 |
+
findings #6, #8 (real reachability/provenance on a materialized repo) and the
|
| 78 |
+
"FakeSandbox is tautological" objection. **Blocked: no Docker in this env.**
|
| 79 |
+
2. **k1 KL estimator** assert/config vs TRL 1.5.0's actual GRPO KL branch (#11).
|
| 80 |
+
3. **HackMonitor → real AST provenance** check (#7).
|
| 81 |
+
4. **Hint-generator default layer routing** so style/communication sites reach
|
| 82 |
+
the judge (ADR-009 P1).
|
| 83 |
+
5. **Curriculum turn/think-token signals** for full Composer recipe fidelity.
|
| 84 |
+
|
| 85 |
+
All five are documented in the ADRs' "Post-acceptance cross-family review"
|
| 86 |
+
sections.
|
| 87 |
+
|
| 88 |
+
## Cost
|
| 89 |
+
|
| 90 |
+
~$2-4 OpenRouter (2 scatters: 5×8K + 3×16K tokens on a 25K-token corpus).
|
docs/{reviews → _archive/reviews}/cross-family-adr-008-009-010-2026-05-29/review_deepseek-v4-pro.md
RENAMED
|
File without changes
|
docs/{reviews → _archive/reviews}/cross-family-adr-008-009-010-2026-05-29/review_gemini-3.1-pro.md
RENAMED
|
File without changes
|
docs/{reviews → _archive/reviews}/cross-family-adr-008-009-010-2026-05-29/review_gpt-5.5.md
RENAMED
|
File without changes
|
docs/{reviews → _archive/reviews}/cross-family-adr-008-009-010-2026-05-29/review_grok-4.3.md
RENAMED
|
File without changes
|
docs/_archive/reviews/final-verify-deep-work-2026-05-29/SYNTHESIS.md
ADDED
|
@@ -0,0 +1,33 @@
|
|
| 1 |
+
# Phase-8 final cross-family verify — deep-work-loop 2026-05-29
|
| 2 |
+
|
| 3 |
+
The deep-work-loop exit gate (Hard Rule 6: two independent confirmations). After
|
| 4 |
+
Waves A+B shipped (ADR-011/012/013, 227 tests), the net session code diff
|
| 5 |
+
(~92KB) was scattered to 3 diverse families for a final "would this break a real
|
| 6 |
+
run" pass. 2 of 3 returned clean (GPT-5.5, Gemini; DeepSeek starved its reasoning
|
| 7 |
+
budget). The verify EARNED ITS KEEP — it found a latent P0 + a convergent bug +
|
| 8 |
+
a self-bug in this session's own new code.
|
| 9 |
+
|
| 10 |
+
## Findings (all verified against code, all FIXED)
|
| 11 |
+
|
| 12 |
+
| # | Finding | Severity | Reviewers | Fix |
|
| 13 |
+
|---|---|---|---|---|
|
| 14 |
+
| 1 | SDPO sentinel-mask used only `student_response_valid`, ignored `teacher_response_valid` — a future divergent teacher tail would distill against clamped position 0 | P0 (latent) | GPT-5.5 | `aligned_mask = (s_idx>=0)&(t_idx>=0)&student_valid&teacher_valid` |
|
| 15 |
+
| 2 | Curriculum `_TaskStats` shared one `n_effort` denominator for both turns + think_tokens → corrupts a mean when only one signal is present | P1 (convergent) | GPT-5.5 + Gemini | separate `n_turns`/`n_think` counters |
|
| 16 |
+
| 3 | Docker E2E test ran `pip install pytest` under `--network none` → would fail exactly when the gate activates | P1 | GPT-5.5 | dropped pytest; stdlib `python -c` expression runner, no network needed |
|
| 17 |
+
| 4 | MMLU reward: `Answer: A or B` hedge parsed as `A` with `n_distinct=1` → full credit | P1 | GPT-5.5 | `_HEDGE_RE` adds the hedged second letter to the distinct set → multiple-answers penalty |
|
| 18 |
+
| 5 | HackMonitor read-markers bare-substring matched `import` in `important`, `cat` in `concatenate`; and scanned the submitted patch as read-evidence → false positives | P1 | GPT-5.5 | whole-word `_READ_VERB_RE` + exclude `submit_patch`/`patch`/`diff` payloads from BOTH layers |
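
The mask expression in finding 1 can be checked with a small pure-Python sketch. It uses plain lists instead of tensors and assumes that `-1` is the sentinel for "no aligned position"; the function names and the example values are illustrative, not taken from `composer_trainer.py`.

```python
def student_only_mask(s_idx, student_valid):
    """The flagged behavior: only the student side is consulted."""
    return [(s >= 0) and bool(sv) for s, sv in zip(s_idx, student_valid)]


def aligned_mask(s_idx, t_idx, student_valid, teacher_valid):
    """The fixed rule: a position counts only if both indices and both validity flags agree."""
    if not (len(s_idx) == len(t_idx) == len(student_valid) == len(teacher_valid)):
        raise ValueError("index and validity sequences must have equal length")
    return [
        (s >= 0) and (t >= 0) and bool(sv) and bool(tv)
        for s, t, sv, tv in zip(s_idx, t_idx, student_valid, teacher_valid)
    ]


# The teacher sequence ends one token earlier than the student's (divergent tail).
s_idx = [3, 4, 5, -1]
t_idx = [7, 8, -1, -1]
student_valid = [1, 1, 1, 0]
teacher_valid = [1, 1, 0, 0]

print(student_only_mask(s_idx, student_valid))
print(aligned_mask(s_idx, t_idx, student_valid, teacher_valid))
```

The third position is the latent P0: the student-only mask keeps it even though the teacher has no token there, so a clamped gather would distill against position 0. The combined mask drops it.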
|
| 19 |
+
|
| 20 |
+
Finding 5's fix surfaced a follow-on: layer-1's signature matcher ALSO scanned
|
| 21 |
+
the patch payload (a legit patch mentioning `__pycache__` in a comment got
|
| 22 |
+
flagged). The regression test caught it; fixed by excluding the patch payload
|
| 23 |
+
from layer-1 too. Net: the monitor now only treats actual read ACTIONS as
|
| 24 |
+
provenance evidence, never the agent's own output patch.
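The difference between the bare-substring and whole-word matchers, plus the payload exclusion, can be sketched as follows. This is an illustration under assumptions: the verb list, the `PAYLOAD_TOOLS` set, and the action dict shape (`tool`, `text`) are hypothetical, and the real `_READ_VERB_RE` pattern is not reproduced in this summary.

```python
import re

# Hypothetical verb list; the real _READ_VERB_RE is not shown in this document.
READ_VERBS = ("cat", "import", "open", "read")
READ_VERB_RE = re.compile(r"\b(?:" + "|".join(READ_VERBS) + r")\b")

# Tool names whose payload is the agent's own output, not a read action.
PAYLOAD_TOOLS = {"submit_patch", "patch", "diff"}


def substring_match(text: str) -> bool:
    """Old behaviour: fires on 'cat' inside 'concatenate'."""
    return any(verb in text for verb in READ_VERBS)


def whole_word_match(text: str) -> bool:
    """New behaviour: the verb must stand alone as a word."""
    return READ_VERB_RE.search(text) is not None


def has_read_evidence(actions) -> bool:
    """Scan only real actions; skip patch/diff payloads entirely."""
    for action in actions:
        if action["tool"] in PAYLOAD_TOOLS:
            continue
        if whole_word_match(action["text"]):
            return True
    return False
```

The `\b` anchors are what stop `import` from matching inside `important`; the `PAYLOAD_TOOLS` skip is what keeps a patch that merely mentions a read verb from counting as evidence.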
|
| 25 |
+
|
| 26 |
+
## Outcome
|
| 27 |
+
|
| 28 |
+
All 5 findings fixed + 5 regression tests added. Final suite: **232 passed / 18
|
| 29 |
+
skipped, 0 failed**. Both the execution view (workers reported done) and the
|
| 30 |
+
review view (final verify findings all closed) confirm the loop is empty — the
|
| 31 |
+
two independent confirmations the deep-work-loop requires.
|
| 32 |
+
|
| 33 |
+
Raw reviews: `verify_gpt-5.5.md`, `verify_gemini-3.1-pro.md`.
|
docs/{reviews → _archive/reviews}/final-verify-deep-work-2026-05-29/verify_gemini-3.1-pro.md
RENAMED
|
File without changes
|
docs/{reviews → _archive/reviews}/final-verify-deep-work-2026-05-29/verify_gpt-5.5.md
RENAMED
|
File without changes
|
docs/_refine-2026-06-SUMMARY.md
ADDED
|
@@ -0,0 +1,113 @@
|
|
| 1 |
+
# Docs Refine 2026-06 — Change Summary
|
| 2 |
+
|
| 3 |
+
> Branch: `docs/refine-2026-06` (off `master` HEAD `aae66fa`). **Docs-only.** Not merged,
|
| 4 |
+
> no PR opened — left for human review. Commit range: `aae66fa..e130879` (3 commits).
|
| 5 |
+
|
| 6 |
+
This engagement refined the documentation corpus to (1) enforce the ground-truth provenance
|
| 7 |
+
correction recorded in [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md), (2)
|
| 8 |
+
archive point-in-time historical artifacts without breaking references, and (3) add a single
|
| 9 |
+
honest newcomer overview. **No `.py`, `pyproject.toml`, or any file under
|
| 10 |
+
`composer_replication/ examples/ spikes/ tests/` was touched** — proven by
|
| 11 |
+
`git diff --name-only aae66fa..HEAD` showing only `.md` paths (see end of this doc).
|
| 12 |
+
|
| 13 |
+
## Method
|
| 14 |
+
|
| 15 |
+
Plan → parallel read-only audit (4 agents over the living docs) → apply fixes in the main
|
| 16 |
+
thread → two independent adversarial review passes (one over the post-wave-1 commit, one over
|
| 17 |
+
the whole changeset) → iterate to convergence. Both adversaries + a deterministic link/invariant
|
| 18 |
+
script signed off with zero blockers.
|
| 19 |
+
|
| 20 |
+
## Commits
|
| 21 |
+
|
| 22 |
+
| SHA | Wave | Theme |
|
| 23 |
+
|---|---|---|
|
| 24 |
+
| `20e3bd9` | Wave 1 | Correctness: channel-3 provenance, gap honesty, dead links |
|
| 25 |
+
| `f00833d` | Wave 2 | Archive point-in-time wave reviews + dated review bundles (move + redirect stub) |
|
| 26 |
+
| `e130879` | Wave 3 | Add `docs/OVERVIEW.md`, index ADR-014, fold in adversarial-review corrections |
|
| 27 |
+
|
| 28 |
+
## Files touched — what changed and why
|
| 29 |
+
|
| 30 |
+
### Wave 1 — correctness (commit `20e3bd9`)
|
| 31 |
+
|
| 32 |
+
| File | Change | Fact |
|
| 33 |
+
|---|---|---|
|
| 34 |
+
| `README.md` | v0.1 roadmap cell reframed: "Full Composer recipe" = channels 1 (Dr.GRPO) + 2 (SDPO); trace-replay-DPO labelled the framework's own addition with an ADR-014 link. | A |
|
| 35 |
+
| `docs/HF_REPO_LAYOUT.md` | v0 and v1 trained-variant rows: stop bundling trace-replay-DPO into "Composer recipe"; mark it additive. | A |
|
| 36 |
+
| `docs/VISION_VALIDATION.md` | Status banner: stale "210 passing tests" → "115 + 1 skip-marked" pointing to the canonical `V1_V8_COVERAGE.md`; note the PO-objective menu (default Dr.GRPO, ADR-014); keep both honest gaps OPEN (Docker e2e; A1-done / A2–A4-scaffold). | B, D |
|
| 37 |
+
| `docs/ALTERED_MINDS_TIE_IN.md` | Phase-3: only A1 has a real Modal runner; A2/A3/A4 scaffold + plan-builder only, blocked on dataset construction; real 8B run additionally user-gated. Added a `strip_thinking=False`-for-SDPO foot-gun note. | D, E |
|
| 38 |
+
| `docs/USER_GUIDE.md` | Clone+install block: add `git checkout master` + a branch foot-gun callout (HF `main` lags `master`; else ImportError on `make_dr_grpo_config`). | F |
|
| 39 |
+
| `BACKLOG.md` | Fixed two dead paths `examples/qwen3_05b_quickstart/` → `examples/qwen_05b_quickstart/`. | dead-link |
|
| 40 |
+
| `docs/INTEGRATION_RECIPES.md` | Fixed dead link `ADR-007-distillation-losses.md` → `ADR-007-self-distillation-losses.md`. | dead-link |
|
| 41 |
+
| `framework/composer-replication-framework.md` | Fixed 2 root-relative links that 404 from a subdirectory on HF Hub (`docs/…`, `spikes/…` → `../…`). | dead-link |
|
| 42 |
+
| `publications/HF_DISCUSSION_POST.md` | Fixed 7 root-relative links (same subdirectory-resolution issue → `../` / same-dir). | dead-link |
|
| 43 |
+
|
| 44 |
+
### Wave 2 — archive historical artifacts (commit `f00833d`)
|
| 45 |
+
|
| 46 |
+
Moved (via `git mv`, history preserved) into archives, with a **one-line redirect stub** left
|
| 47 |
+
at every original path so prose references — including those baked into immutable accepted ADRs
|
| 48 |
+
(ADR-007/008/012) and off-limits spike verdicts that cannot be edited — keep resolving:
|
| 49 |
+
|
| 50 |
+
- → `docs/research/_archive/`: `WAVE_7_10_FINAL_REVIEW.md`, `WAVE_13/14/15_FINAL_REVIEW.md`.
|
| 51 |
+
- → `docs/_archive/`: `DEEP_WORK_LOOP_LOG.md`, `WAVE_COMPOSER_DATAGEN_RL_2026-05-29.md`.
|
| 52 |
+
- → `docs/_archive/reviews/`: the two dated review bundles
|
| 53 |
+
(`cross-family-adr-008-009-010-2026-05-29/`, `final-verify-deep-work-2026-05-29/`) — all 8
|
| 54 |
+
per-model review/verify files moved as renames; a `SYNTHESIS.md` redirect stub remains at each
|
| 55 |
+
origin (the entry point ADRs cite by directory).
|
| 56 |
+
- Added `docs/_archive/README.md` and `docs/research/_archive/README.md` indexing what was
|
| 57 |
+
archived and why ("point-in-time, superseded by current METHODOLOGY / BACKLOG / V1_V8_COVERAGE
|
| 58 |
+
/ ADRs"), extending the existing `_archive` convention (`WAVE_16_RECON_AUDIT.md` already lived
|
| 59 |
+
in `docs/research/_archive/`).
|
| 60 |
+
|
| 61 |
+
### Wave 3 — new overview + ADR index + review fixes (commit `e130879`)
|
| 62 |
+
|
| 63 |
+
| File | Change |
|
| 64 |
+
|---|---|
|
| 65 |
+
| `docs/OVERVIEW.md` | **New.** 5-minute newcomer tour: what it is, the three channels with honest provenance (1+2 = genuine Composer replication; 3 = framework's additive channel), what's proven (CPU SDPO-fires, A1 8B Modal run, GSM8K GRPO, $0.98/trace, 115 tests), what's gapped (Docker e2e, A2–A4 ladder), day-one foot-guns (main-lag, strip_thinking, k1/k3 delta, compose_loss-is-harness). Linked from README + both `_archive` READMEs. |
|
| 66 |
+
| `README.md` | Added a "🧭 New here? → OVERVIEW.md" pointer + clarified the intro that trace-replay is the framework's own addition (not Cursor's). Added the master-branch guard to the Install block (adversary finding). |
|
| 67 |
+
| `docs/adrs/README.md` | Added the missing ADR-014 row + a provenance note recording the channel-3 correction. |
|
| 68 |
+
| `docs/ALTERED_MINDS_TIE_IN.md`, `docs/VISION_VALIDATION.md` | Adversary corrections: dropped the parent-commit SHA mislabelled as "HEAD"; re-attributed the A2–A4 gap claim to cite ADR-014 only for "the A1 run used dr_grpo" (and ADR-013 for its sole user-gated box) instead of ADR-014's acceptance gate, which doesn't contain the dataset-construction detail; fixed one more stale `qwen3_05b_quickstart` path. |
|
| 69 |
+
|
| 70 |
+
## Deliberately left alone — and why
|
| 71 |
+
|
| 72 |
+
- **Accepted ADR bodies (ADR-001…014).** Immutable once `accepted` (per the ADR index's own
|
| 73 |
+
rule). Only the ADR *index* (`docs/adrs/README.md`) was updated. The provenance correction
|
| 74 |
+
was propagated into the *living* docs that ADR-014 supersedes, not by editing older ADRs.
|
| 75 |
+
- **`research/01..12`, `framework/`, `publications/PAPER_v0.md` deep-dive bodies.** Preserved as
|
| 76 |
+
point-in-time research snapshots (only the 9 dead links in `framework/` +
|
| 77 |
+
`publications/HF_DISCUSSION_POST.md` were repaired). `docs/COMPOSER_RECIPE_MAPPING.md` and
|
| 78 |
+
`docs/METHODOLOGY.md` were audited and found **already correct** on channel-3 provenance (they
|
| 79 |
+
already frame channel 3 as "NOVEL — our addition / not in Composer"), so they were not rewritten.
|
| 80 |
+
- **`docs/VISION_VALIDATION.md` dated update blocks (e.g. the "77 tests" Wave-12 line).** These
|
| 81 |
+
are explicitly-dated historical self-audit snapshots in the doc's house style; only the current
|
| 82 |
+
top status banner was refreshed. Rewriting dated snapshots would falsify the audit trail.
|
| 83 |
+
- **The two `qwen3_7b_*` proposals in VISION_VALIDATION (lines ~80, ~138).** These accurately
|
| 84 |
+
record example dirs that were *proposed* in the Wave-6 audit but never built under any name;
|
| 85 |
+
"fixing" them to the 0.5B path would misrepresent history. Only the line describing the
|
| 86 |
+
packaging deliverable that actually shipped (`qwen_05b_quickstart`) was corrected.
|
| 87 |
+
|
| 88 |
+
## Notes for a maintainer (issues found but NOT fixable under docs-only / scope)
|
| 89 |
+
|
| 90 |
+
1. **Off-limits dead link.** `examples/gsm8k_grpo_with_sdpo/README.md:66` links to
|
| 91 |
+
`docs/adrs/ADR-002-channel2-sdpo.md`, which does not exist (ADR-002 is `ADR-002-trace-source.md`;
|
| 92 |
+
the SDPO design decision is **ADR-008**). This file is under `examples/` (off-limits to this
|
| 93 |
+
docs-only engagement), so it was left unchanged. **Recommended fix:** repoint that link to
|
| 94 |
+
`docs/adrs/ADR-008-drgrpo-sdpo-live-channel.md`.
|
| 95 |
+
2. **API_REFERENCE freshness gap.** `docs/API_REFERENCE.md` documents the `composer_replication`
|
| 96 |
+
surface but has **no section for the trainer config factories** — neither `make_dr_grpo_config`
|
| 97 |
+
(ADR-008) nor the new `make_po_config(objective=…)` / `PO_OBJECTIVES` menu (ADR-014). This is a
|
| 98 |
+
*missing* doc, not a *wrong* one; it was not added here because the public signatures could not
|
| 99 |
+
be verified against source without reading the `.py` (out of scope), and Invariant 7 forbids
|
| 100 |
+
fabricating an API surface. **Recommended fix:** add a config-factory section to API_REFERENCE
|
| 101 |
+
from the verified `composer_replication/trainer/composer_trainer.py` signatures.
|
| 102 |
+
|
| 103 |
+
## Verification (proof, not claim)
|
| 104 |
+
|
| 105 |
+
- **Docs-only invariant:** `git diff --name-only aae66fa..HEAD` → every path ends in `.md`
|
| 106 |
+
(no `.py`, no `pyproject.toml`, nothing under `composer_replication/ examples/ spikes/ tests/`).
|
| 107 |
+
- **Link integrity:** a scripted scan of every relative `[](…)` link across all in-scope `.md`
|
| 108 |
+
(root + `docs/` + `framework/` + `publications/`, excluding the 4 off-limits trees) reports
|
| 109 |
+
**zero dead links** in the changeset. The one known dead link (`ADR-002-channel2-sdpo.md`) is
|
| 110 |
+
pre-existing and lives in off-limits `examples/` — see note 1.
|
| 111 |
+
- **Archive integrity:** every archived file has both a redirect stub at its origin and a
|
| 112 |
+
full-content copy under `_archive/`; all 8 review-dir files preserved (renames, no content loss).
|
| 113 |
+
- **Two adversarial reviews + a deterministic check** all returned zero blockers.
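The link-integrity check above could look roughly like the sketch below. It is not the script used in this engagement (that script is not included here); the function name `dead_relative_links` and the regex are assumptions, and this version does not implement the off-limits-tree exclusion.

```python
import re
from pathlib import Path

# Matches [text](target); stops the target at whitespace or ')'.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Anything with a URL scheme (https:, mailto:, ...) is not a relative link.
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def dead_relative_links(root):
    """Return (file, target) pairs whose relative target does not exist."""
    root = Path(root)
    dead = []
    for md in sorted(root.rglob("*.md")):
        text = md.read_text(encoding="utf-8")
        for target in LINK_RE.findall(text):
            if SCHEME_RE.match(target) or target.startswith("#"):
                continue  # external URL or same-page anchor
            path_part = target.split("#", 1)[0]
            # Resolve against the linking file's directory, which is how
            # a subdirectory link such as ../OVERVIEW.md is interpreted.
            if not (md.parent / path_part).exists():
                dead.append((md.relative_to(root).as_posix(), target))
    return dead
```

Resolving against `md.parent` rather than the repository root is the point: it reproduces the subdirectory-resolution failure that the Wave 1 `../` fixes addressed.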
|
docs/adrs/README.md
CHANGED
|
@@ -15,5 +15,12 @@
|
|
| 15 |
| [ADR-011](ADR-011-sdpo-alignment-indices.md) | Collator-emitted SDPO alignment indices (close strict-guard regression) | accepted (amends ADR-008) | 2026-05-29 |
|
| 16 |
| [ADR-012](ADR-012-close-review-findings.md) | Close open cross-family-review findings (KL/hint-routing/provenance/curriculum) | accepted (amends 008/009/010) | 2026-05-29 |
|
| 17 |
| [ADR-013](ADR-013-lma-integration-channel-ladder.md) | LMA integration — isolated-channel ladder (supersedes tie-in Phase-3 hyperparams) | accepted | 2026-05-29 |
|
|
|
|
| 18 |
|
| 19 |
Sorted by number ascending. ADRs are immutable after `accepted`; supersede or amend rather than edit.
|
| 15 |
| [ADR-011](ADR-011-sdpo-alignment-indices.md) | Collator-emitted SDPO alignment indices (close strict-guard regression) | accepted (amends ADR-008) | 2026-05-29 |
|
| 16 |
| [ADR-012](ADR-012-close-review-findings.md) | Close open cross-family-review findings (KL/hint-routing/provenance/curriculum) | accepted (amends 008/009/010) | 2026-05-29 |
|
| 17 |
| [ADR-013](ADR-013-lma-integration-channel-ladder.md) | LMA integration — isolated-channel ladder (supersedes tie-in Phase-3 hyperparams) | accepted | 2026-05-29 |
|
| 18 |
+
| [ADR-014](ADR-014-policy-optimization-objective-menu.md) | Policy-optimization objective MENU: base RL objective selectable (default Dr.GRPO) over TRL 1.5.0 GRPOConfig (builds-on ADR-006/007/008) | accepted | 2026-05-30 |
|
| 19 |
|
| 20 |
Sorted by number ascending. ADRs are immutable after `accepted`; supersede or amend rather than edit.
|
| 21 |
+
|
| 22 |
+
> **Provenance note (ADR-014).** ADR-014 also records the canonical correction that the
|
| 23 |
+
> framework's **trace-replay-DPO channel (channel 3) is an additive research channel, NOT
|
| 24 |
+
> part of Cursor's Composer recipe** -- Composer's primary sources contain no DPO / preference
|
| 25 |
+
> pairs / reward models / multiple teachers. Genuine replication is channels 1 (Dr.GRPO base)
|
| 26 |
+
> + 2 (SDPO). See [`docs/OVERVIEW.md`](../OVERVIEW.md) for the honest three-channel summary.
|
docs/research/WAVE_13_FINAL_REVIEW.md
CHANGED
|
@@ -1,239 +1,8 @@
|
|
| 1 |
# Wave 13 Adversarial Cross-Model Review
|
| 2 |
|
| 3 |
-
**
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
**CONDITIONAL PASS with two BLOCKERs.** Wave 13 substantially advances
|
| 11 |
-
the brief expansion (serverless DiLoCo abstraction, replaysim
|
| 12 |
-
normalization, three distillation losses, PRIME-RL recipe, Monarch
|
| 13 |
-
tie-in). The **distillation losses are the strongest deliverable** —
|
| 14 |
-
real, well-tested, mathematically faithful to the cited papers. The
|
| 15 |
-
serverless-DiLoCo local executor + ObjectStoreAllReduce barrier are
|
| 16 |
-
also genuine and exercised by 3 real multi-process tests.
|
| 17 |
-
|
| 18 |
-
**However, two material claims are not test-validated, and one new
|
| 19 |
-
module silently produces a degenerate loss in its primary code path.**
|
| 20 |
-
ADR claims that say "X is added to compose_loss" describe code that
|
| 21 |
-
wasn't actually written. The MockManager → DiLoCo "drop-in" is
|
| 22 |
-
unverified end-to-end.
|
| 23 |
-
|
| 24 |
-
Wave 11's reviewer found 2 genuine BLOCKERs. This review finds **2
|
| 25 |
-
BLOCKERs + 4 SUGGESTIONs + 2 NITs**.
|
| 26 |
-
|
| 27 |
-
---
|
| 28 |
-
|
| 29 |
-
## Finding 1 — BLOCKER: PRIME-RL `composer_loss.loss_fn` SDPO term is mathematically degenerate (always 0)
|
| 30 |
-
|
| 31 |
-
**Severity:** BLOCKER
|
| 32 |
-
**Evidence:** `composer_replication/recipes/prime_rl/composer_loss.py:79-86`
|
| 33 |
-
|
| 34 |
-
The PRIME-RL composer-loss adapter applies `unsqueeze(-1)` to `(B, T)`
|
| 35 |
-
log-prob tensors before passing them to `generalized_jsd_loss`, which
|
| 36 |
-
calls `F.log_softmax(..., dim=-1)`. Softmax of a single-element vector
|
| 37 |
-
is exactly 1.0; its log is 0. Therefore both `student_log_probs` and
|
| 38 |
-
`teacher_log_probs` are identically zero, the JSD between them is 0,
|
| 39 |
-
and the SDPO contribution **is always 0 regardless of `alpha_sdpo` or
|
| 40 |
-
the actual log-prob values.**
|
| 41 |
-
|
| 42 |
-
```python
|
| 43 |
-
>>> import torch, torch.nn.functional as F
|
| 44 |
-
>>> F.log_softmax(torch.randn(2, 3, 1), dim=-1)
|
| 45 |
-
tensor([[[0.],[0.],[0.]],[[0.],[0.],[0.]]])
|
| 46 |
-
```
|
| 47 |
-
|
| 48 |
-
The docstring calls this "a deliberate approximation," but it is not
|
| 49 |
-
an approximation — it's a mathematically degenerate operation that
|
| 50 |
-
silently disables channel 2.
|
| 51 |
-
|
| 52 |
-
**Fix direction:**
|
| 53 |
-
- Gate the SDPO branch behind `len(trainer_lp.shape) >= 3`, raising
|
| 54 |
-
`NotImplementedError` until PRIME-RL surfaces full logits.
|
| 55 |
-
- Update `prime_rl_recipe.md` and ADR-006 to stop claiming PRIME-RL
|
| 56 |
-
has working SDPO; mark it deferred.
|
| 57 |
-
|
| 58 |
-
---
|
| 59 |
-
|
| 60 |
-
## Finding 2 — BLOCKER: ADR-007 declares `compose_loss` kwargs that were never added
|
| 61 |
-
|
| 62 |
-
**Severity:** BLOCKER
|
| 63 |
-
**Evidence:**
|
| 64 |
-
- `docs/adrs/ADR-007-self-distillation-losses.md:103-108` claims:
|
| 65 |
-
> `composer_replication.compose_loss` gets new optional kwargs:
|
| 66 |
-
> - `dpo_variant: Literal["dpo", "simpo"] = "dpo"` — switches channel 3
|
| 67 |
-
> - `sdpo_wrapper: Literal["none", "taid", "entropy_opd"] = "none"` — wraps channel 2
|
| 68 |
-
> - `taid_schedule_step: int | None = None`
|
| 69 |
-
> - `taid_total_steps: int | None = None`
|
| 70 |
-
- `composer_replication/loss.py:54-65` actual signature has **none**
|
| 71 |
-
of these. `grep -n "dpo_variant\|sdpo_wrapper\|taid"
|
| 72 |
-
composer_replication/loss.py` returns empty.
|
| 73 |
-
|
| 74 |
-
The new losses live in `composer_replication.distillation` as
|
| 75 |
-
standalone functions but **are not wired into the framework's actual
|
| 76 |
-
loss composition.** A user reading ADR-007 + the README would believe
|
| 77 |
-
`compose_loss(model, inputs, dpo_variant="simpo", sdpo_wrapper="taid", ...)`
|
| 78 |
-
works; it would raise `TypeError`. The 17 distillation tests verify
|
| 79 |
-
the standalone losses but never exercise integration.
|
| 80 |
-
|
| 81 |
-
**Fix direction:**
|
| 82 |
-
- Either (a) add the kwargs to `compose_loss` and write at least one
|
| 83 |
-
integration test combining e.g. SDPO+TAID (~30 LOC change), or
|
| 84 |
-
- (b) downgrade ADR-007 status to "Standalone losses landed;
|
| 85 |
-
integration deferred to Wave 14."
|
| 86 |
-
|
| 87 |
-
---
|
| 88 |
-
|
| 89 |
-
## Finding 3 — SUGGESTION: `default.yaml` replaysim recipe uses string ops on list-of-dict fields
|
| 90 |
-
|
| 91 |
-
**Severity:** SUGGESTION (would be BLOCKER if a test exercised the real path)
|
| 92 |
-
**Evidence:**
|
| 93 |
-
- `composer_replication/recipes/replaysim/default.yaml` configures
|
| 94 |
-
`text_length_filter`, `words_num_filter`, `special_characters_filter`,
|
| 95 |
-
`document_deduplicator` with `text_keys: ["chosen", "rejected"]`.
|
| 96 |
-
- In the record produced by `_dpo_pair_to_dj_record`, `chosen` and
|
| 97 |
-
`rejected` are **lists of dicts**
|
| 98 |
-
(`[{"role": "assistant", "content": "..."}]`) — not strings.
|
| 99 |
-
- data-juicer's `text_length_filter` expects string-typed fields;
|
| 100 |
-
running it on a list will either crash or no-op silently.
|
| 101 |
-
|
| 102 |
-
The reason no test catches this: tests only validate the real path *if
|
| 103 |
-
data-juicer is installed*, and even then only check `__init__` succeeds.
|
| 104 |
-
There is no test that calls `normalize()` against a real data-juicer
|
| 105 |
-
executor with the default recipe.
|
| 106 |
-
|
| 107 |
-
**Fix direction:**
|
| 108 |
-
- Reshape `_dpo_pair_to_dj_record` to extract `content` strings
|
| 109 |
-
alongside the messages-format list.
|
| 110 |
-
- Add one test (skip-marked unless `data_juicer` is importable) that
|
| 111 |
-
runs the real op-graph on 3 hand-crafted records.
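A minimal sketch of the first fix direction, assuming a pair dict with `chosen` / `rejected` message lists. The function names and the `chosen_text` / `rejected_text` keys are hypothetical, not the framework's actual `_dpo_pair_to_dj_record`; the recipe's `text_keys` would then need to point at the string fields.

```python
def flatten_messages(messages):
    """Join the `content` of each message into one plain string."""
    return "\n".join(m["content"] for m in messages)


def dpo_pair_to_record(pair):
    """Keep the messages-format lists and add string-typed siblings."""
    return {
        "chosen": pair["chosen"],
        "rejected": pair["rejected"],
        # String fields that length/word-count filters can operate on.
        "chosen_text": flatten_messages(pair["chosen"]),
        "rejected_text": flatten_messages(pair["rejected"]),
    }


record = dpo_pair_to_record({
    "chosen": [{"role": "assistant", "content": "Use a context manager."}],
    "rejected": [{"role": "assistant", "content": "Just leave the file open."}],
})
```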
|
| 112 |
-
|
| 113 |
-
---
|
| 114 |
-
|
| 115 |
-
## Finding 4 — SUGGESTION: MockManager → torchft.DiLoCo "drop-in" claim is unverified end-to-end
|
| 116 |
-
|
| 117 |
-
**Severity:** SUGGESTION
|
| 118 |
-
**Evidence:**
|
| 119 |
-
- `composer_replication/diloco/serverless/allreduce.py:188-191` claims
|
| 120 |
-
MockManager "drops into" `make_diloco_outer_loop`.
|
| 121 |
-
- The only test covering MockManager (`test_mock_manager_shape_compat`)
|
| 122 |
-
is a `hasattr` smoke that calls `.allreduce` on a `world_size=1`
|
| 123 |
-
store (passthrough).
|
| 124 |
-
- torchft.Manager has additional surface area
|
| 125 |
-
(`current_step`, `is_leader`, `_pg`, `report_error`,
|
| 126 |
-
internal step accounting) that DiLoCo's `_apply_pseudogradient`
|
| 127 |
-
may consult depending on version.
|
| 128 |
-
|
| 129 |
-
**Fix direction:**
|
| 130 |
-
- Add a single integration test that constructs
|
| 131 |
-
`make_diloco_outer_loop(manager=MockManager(store), ...)` against a
|
| 132 |
-
tiny `nn.Linear` and runs one outer round — even single-process.
|
| 133 |
-
- Audit `torchft/local_sgd.py` for the `Manager`-rooted call sites and
|
| 134 |
-
add stubs for any methods DiLoCo actually consults beyond `allreduce`.
|
| 135 |
-
|
| 136 |
-
---
|
| 137 |
-
|
| 138 |
-
## Finding 5 — SUGGESTION: README claim "9 multi-process tests" is mildly inflated
|
| 139 |
-
|
| 140 |
-
**Severity:** SUGGESTION (NIT bordering)
|
| 141 |
-
**Evidence:**
|
| 142 |
-
- README.md and V1_V8_COVERAGE both state: *"9 multi-process tests
|
| 143 |
-
pinning the allreduce barrier."*
|
| 144 |
-
- Actual breakdown:
|
| 145 |
-
- 4 single-process unit tests + `test_mock_manager_shape_compat` (5)
|
| 146 |
-
- 4 multi-process tests spawning subprocesses (parametrized [2,3] of
|
| 147 |
-
`_runs_allreduce_across_replicas`, `_handles_multiple_rounds`,
|
| 148 |
-
`_reports_failed_replicas`)
|
| 149 |
-
- Of the 4 multi-process tests, only **3 actually exercise the
|
| 150 |
-
allreduce barrier**; `_reports_failed_replicas` deliberately raises
|
| 151 |
-
before any allreduce call.
|
| 152 |
-
|
| 153 |
-
**Wave 13 clearly does NOT fake-pass via world_size=1** — the multi-
|
| 154 |
-
process barrier is real. But the count is rounded up.
|
| 155 |
-
|
| 156 |
-
**Fix direction:** Replace "9 multi-process tests" with "9 tests
|
| 157 |
-
covering the serverless DiLoCo layer, of which 4 spawn real
|
| 158 |
-
subprocesses and 3 exercise the allreduce barrier across replicas."
|
| 159 |
-
|
| 160 |
-
---
|
| 161 |
-
|
| 162 |
-
## Finding 6 — SUGGESTION: PRIME-RL channel 1 is REINFORCE not GRPO; ignores `inference_logprobs`
|
| 163 |
-
|
| 164 |
-
**Severity:** SUGGESTION
|
| 165 |
-
**Evidence:** `composer_replication/recipes/prime_rl/composer_loss.py:62-68`
|
| 166 |
-
computes:
|
| 167 |
-
```python
|
| 168 |
-
grpo_loss = -(advantages * trainer_lp * mask).sum() / mask.sum().clamp_min(epsilon)
|
| 169 |
-
```
|
| 170 |
-
|
| 171 |
-
This is plain REINFORCE with advantage. PRIME-RL's `LossInputs`
|
| 172 |
-
exposes `inference_logprobs` precisely because GRPO-with-replay-buffer
|
| 173 |
-
requires the importance-sampling ratio
|
| 174 |
-
`exp(trainer_lp - inference_lp)` (PPO-style clipped objective).
|
| 175 |
-
|
| 176 |
-
The file says "SKELETON" so this isn't a hidden bug per se, but the
|
| 177 |
-
loss is **labeled GRPO and is not GRPO**.
|
| 178 |
-
|
| 179 |
-
**Fix direction:** Either implement the ratio + clipping (~20 LOC) or
|
| 180 |
-
rename channel-1 comment to "REINFORCE-with-advantage stub" with a TODO.
|
| 181 |
-
|
| 182 |
-
---
|
| 183 |
-
|
| 184 |
-
## Finding 7 — NIT: ModalExecutor / HFJobsExecutor are skeleton-only with `NotImplementedError` in `__init__`
|
| 185 |
-
|
| 186 |
-
**Severity:** NIT (this is documented, but README phrasing is slightly soft)
|
| 187 |
-
**Evidence:** Honestly documented as skeletons in the code, ADR-005,
|
| 188 |
-
and README. NIT: a user trying `ModalExecutor()` gets a runtime error
|
| 189 |
-
rather than an import-time clue.
|
| 190 |
-
|
| 191 |
-
**Fix direction:** Low priority. Update README phrase to "skeleton-only
|
| 192 |
-
— raises NotImplementedError until v0.x." Or use a `__getattr__` on
|
| 193 |
-
the package that raises a clearer message.
|
| 194 |
-
|
| 195 |
-
---
|
| 196 |
-
|
| 197 |
-
## Finding 8 — NIT: SimPO test uses positive log-probs (impossible values)
|
| 198 |
-
|
| 199 |
-
**Severity:** NIT
|
| 200 |
-
**Evidence:** `test_distillation_losses.py:27-46` calls `simpo_loss`
|
| 201 |
-
with `chosen=tensor([0.5, 0.4, 0.3])`. Log-probabilities are bounded
|
| 202 |
-
above by 0; positive values aren't possible from any softmax. The tests
|
| 203 |
-
still verify the formula correctly, but the test inputs aren't legal.
|
| 204 |
-
|
| 205 |
-
**Fix direction:** Use negative values — purely cosmetic.
|
| 206 |
-
|
| 207 |
-
---
|
| 208 |
-
|
| 209 |
-
## Cross-cutting risk check
|
| 210 |
-
|
| 211 |
-
73 tests passed in 29.29s on the CPU-fast subset. Spike 008 5/5 still
|
| 212 |
-
pass. The new `composer_replication.diloco.serverless` package is
|
| 213 |
-
purely additive; the existing `make_diloco_outer_loop` is untouched.
|
| 214 |
-
**No cross-wave regressions detected on CPU.** GPU tests + slow CPU
|
| 215 |
-
e2e tests not re-run; regression risk low since Wave 13 doesn't touch
|
| 216 |
-
their dependencies.
|
| 217 |
-
|
| 218 |
-
---
|
| 219 |
-
|
| 220 |
-
## Summary scorecard
|
| 221 |
-
|
| 222 |
-
| Item | Verdict |
|
| 223 |
-
|---|---|
|
| 224 |
-
| Distillation module (SimPO/TAID/Entropy-Aware OPD) standalone | ✅ Real, well-tested, paper-faithful |
|
| 225 |
-
| Distillation integrated into `compose_loss` | ❌ **Not implemented** despite ADR-007 (Finding 2) |
|
| 226 |
-
| ObjectStoreAllReduce + LocalProcessExecutor | ✅ Real multi-process barrier validated |
|
| 227 |
-
| MockManager → DiLoCo drop-in | 🟡 Shape-checked only; integration unverified (Finding 4) |
|
| 228 |
-
| Modal/HFJobs adapters | 🟡 Honestly documented as skeletons (Finding 7) |
|
| 229 |
-
| Replaysim DJNormalizer passthrough | ✅ Works |
|
| 230 |
-
| Replaysim default.yaml against real data-juicer | ❌ **Recipe field types don't match record shape** (Finding 3) |
|
| 231 |
-
| PRIME-RL composer_loss.loss_fn | ❌ **SDPO term silently 0** (Finding 1); channel 1 is REINFORCE not GRPO (Finding 6) |
|
| 232 |
-
| Monarch actors | ✅ Honest skeleton; raises NotImplementedError |
|
| 233 |
-
| Altered-minds tie-in doc | ✅ Design-only, scoped honestly |
|
| 234 |
-
| 35 new tests | All pass; 3 of 4 multi-process tests are genuine (Finding 5) |
|
| 235 |
-
|
| 236 |
-
**Recommendation:** Address Findings 1 and 2 before publishing the
|
| 237 |
-
Wave 13 expansion as "closed." Findings 3 and 4 should be addressed
|
| 238 |
-
before any user attempts the real data-juicer or real torchft DiLoCo
|
| 239 |
-
path. Findings 5–8 are cleanup.
|
|
|
|
| 1 |
# Wave 13 Adversarial Cross-Model Review
|
| 2 |
|
| 3 |
+
> **📦 Archived (2026-06-08).** This point-in-time wave review has been moved to
|
| 4 |
+
> [`docs/research/_archive/WAVE_13_FINAL_REVIEW.md`](_archive/WAVE_13_FINAL_REVIEW.md).
|
| 5 |
+
> It is preserved verbatim for provenance (ADR-007 cites its "Finding 2") but is
|
| 6 |
+
> superseded by the current [`docs/METHODOLOGY.md`](../METHODOLOGY.md) and
|
| 7 |
+
> [`BACKLOG.md`](../../BACKLOG.md). See
|
| 8 |
+
> [`docs/research/_archive/README.md`](_archive/README.md) for the archive index.
|
docs/research/WAVE_14_FINAL_REVIEW.md
CHANGED
|
@@ -1,263 +1,7 @@
|
|
| 1 |
# Wave 14 Adversarial Cross-Model Review
|
| 2 |
|
| 3 |
-
**
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
**CONDITIONAL PASS with 1 BLOCKER + 4 SUGGESTIONs + 2 NITs.** Wave 14
|
| 10 |
-
closes Wave 13 BLOCKER 2 (T1 — compose_loss kwargs) and Suggestion 3
|
| 11 |
-
(T2 — replaysim) cleanly. T3 (MockManager surface audit) is solid but
|
| 12 |
-
only tests `world_size=1`. **T4 (PRIME-RL "real GRPO + DPPO") does not
|
| 13 |
-
match PRIME-RL's actual `default_loss_fn`** despite claiming to mirror
|
| 14 |
-
it; that error has been pasted into USER_GUIDE.md, INTEGRATION_RECIPES.md,
|
| 15 |
-
and API_REFERENCE.md.
|
| 16 |
-
|
| 17 |
-
Same signal-to-noise as Wave 11 + Wave 13 reviewers: 1 genuine BLOCKER.
|
| 18 |
-
|
| 19 |
-
---
|
| 20 |
-
|
| 21 |
-
## Finding 1 — BLOCKER: T4 PRIME-RL "DPPO importance-sampling-ratio clip" is neither importance sampling nor matches PRIME-RL.
|
| 22 |
-
|
| 23 |
-
**Severity:** BLOCKER
|
| 24 |
-
**Evidence:** `composer_replication/recipes/prime_rl/composer_loss.py:118-131`.
|
| 25 |
-
The implementation computes
|
| 26 |
-
```python
|
| 27 |
-
grpo_loss = -(advantages * trainer_lp * keep_mask).sum() / keep_mask.sum()
|
| 28 |
-
```
|
| 29 |
-
That's **pure REINFORCE-with-advantage** — the masking gate is the only
|
| 30 |
-
nod toward DPPO; there is no importance-sampling ratio multiplication
|
| 31 |
-
anywhere.
|
| 32 |
-
|
| 33 |
-
**Real PRIME-RL** (`/tmp/prime-rl-clone/src/prime_rl/trainer/rl/loss.py:128-153`,
|
| 34 |
-
the `default_loss_fn` on `main` as of 2026-05-26):
|
| 35 |
-
```python
|
| 36 |
-
log_importance_ratio = trainer_logprobs - inference_logprobs
|
| 37 |
-
importance_ratio = torch.exp(log_importance_ratio)
|
| 38 |
-
probs_diff = torch.exp(trainer_logprobs) - torch.exp(inference_logprobs)
|
| 39 |
-
positive_advantages = advantages > 0
|
| 40 |
-
dppo_invalid_mask_high = probs_diff > loss_config.dppo_mask_high
|
| 41 |
-
dppo_invalid_mask_low = probs_diff < -loss_config.dppo_mask_low
|
| 42 |
-
dppo_invalid_mask = torch.where(positive_advantages, dppo_invalid_mask_high, dppo_invalid_mask_low)
|
| 43 |
-
keep_mask = loss_mask & ~dppo_invalid_mask
|
| 44 |
-
pg_loss = -(keep_mask * advantages * importance_ratio).sum() # NO division
|
| 45 |
-
kl_loss = adv_tau * (log_importance_ratio**2 * keep_mask).sum() # KL term
|
| 46 |
-
```
|
| 47 |
-
|
| 48 |
-
**Three concrete divergences from Wave 14's implementation:**
|
| 49 |
-
|
| 50 |
-
1. **Mask gate is on `probs_diff`** (a probability-space quantity), NOT
|
| 51 |
-
`log_ratio` (a log-space quantity). These have different magnitudes:
|
| 52 |
-
`probs_diff=0.2` corresponds to `log_ratio=log(1.25)≈0.22` for a
|
| 53 |
-
trainer prob of 1.0 vs inference prob of 0.8. With our `log_ratio>4.0`
|
| 54 |
-
gate, the mask never fires for normal training distributions; PRIME-RL's
|
| 55 |
-
`probs_diff>0.2` gate fires routinely.
|
| 56 |
-
|
| 57 |
-
2. **PRIME-RL multiplies by `importance_ratio = exp(log_ratio)`**;
|
| 58 |
-
Wave 14 multiplies by `trainer_lp` directly. This is the difference
|
| 59 |
-
between actual policy-gradient correction (PRIME-RL) and naive
|
| 60 |
-
REINFORCE.
|
| 61 |
-
|
| 62 |
-
3. **PRIME-RL's mask is sign-conditioned on advantage** (positive
|
| 63 |
-
advantages clipped against `dppo_mask_high`, negative against
|
| 64 |
-
`-dppo_mask_low`); Wave 14 ORs them together unconditionally.
|
| 65 |
-
|
| 66 |
-
**Plus:** the KL term is missing entirely.
|
| 67 |
-
|
| 68 |
-
**Plus:** the defaults claimed as "PRIME-RL's defaults" — `dppo_mask_high=4.0,
|
| 69 |
-
dppo_mask_low=-4.0` — are wrong. PRIME-RL's `DefaultLossConfig`
|
| 70 |
-
(`configs/trainer.py:412-424`) sets `dppo_mask_high=0.2, dppo_mask_low=0.2`
|
| 71 |
-
with `Field(..., ge=0)` validation that would *reject* a negative value.
|
| 72 |
-
PRIME-RL's code negates at use site: `probs_diff < -loss_config.dppo_mask_low`.
|
| 73 |
-
|
| 74 |
-
**Plus:** the docstring (`composer_loss.py:32-49`), USER_GUIDE.md:599-608,
|
| 75 |
-
INTEGRATION_RECIPES.md:426-429 + 482-487, and API_REFERENCE.md:1364 all
|
| 76 |
-
repeat the wrong formula and the wrong "matches PRIME-RL" claim.
|
| 77 |
-
|
| 78 |
-
**Fix direction:** Either (a) actually mirror `default_loss_fn` (mask on
|
| 79 |
-
`probs_diff`, multiply by `importance_ratio`, add KL term, advantage-
|
| 80 |
-
conditioned mask, `.sum()` reduction with token-count returned for
|
| 81 |
-
caller-side scaling), or (b) drop the "matches PRIME-RL" framing and
|
| 82 |
-
rename to "REINFORCE-with-advantage stub + log-ratio mask" everywhere.
|
| 83 |
-
|
| 84 |
-
Wave 13 Finding 6 is **not actually closed** by Wave 14.
|
| 85 |
-
|
| 86 |
-
---
|
| 87 |
-
|
| 88 |
-
## Finding 2 — SUGGESTION: ADR-007 still says Wave 14 hasn't done the integration.
|
| 89 |
-
|
| 90 |
-
**Severity:** SUGGESTION
|
| 91 |
-
**Evidence:** `docs/adrs/ADR-007-self-distillation-losses.md:104-122` reads:
|
| 92 |
-
> **Wave 14+ work — `compose_loss` integration is NOT in this wave**
|
| 93 |
-
> ... Wave 14 plan: add the four kwargs ...
|
| 94 |
-
|
| 95 |
-
But Wave 14 *did* add them (verified — `loss.py:80-93`). The ADR was
|
| 96 |
-
written defensively after Wave 13 review and never updated when T1 landed.
|
| 97 |
-
|
| 98 |
-
**Net effect:** a user reading ADR-007 is told the SimPO/TAID kwargs
|
| 99 |
-
don't work; a user reading USER_GUIDE/API_REFERENCE is told they do.
|
| 100 |
-
|
| 101 |
-
**Fix direction:** flip ADR-007 status section to "Closed in Wave 14 —
|
| 102 |
-
see test_compose_loss_integration.py".
|
| 103 |
-
|
| 104 |
-
---
|
| 105 |
-
|
| 106 |
-
## Finding 3 — SUGGESTION: ModalExecutor instantiation example in INTEGRATION_RECIPES is dead code.
|
| 107 |
-
|
| 108 |
-
**Severity:** SUGGESTION
|
| 109 |
-
**Evidence:** `docs/INTEGRATION_RECIPES.md:519-533` shows
|
| 110 |
-
```python
|
| 111 |
-
executor = ModalExecutor(app="composer-prime-rl")
|
| 112 |
-
executor.launch_replicas(...)
|
| 113 |
-
```
|
| 114 |
-
But `composer_replication/diloco/serverless/modal.py:64-66` raises
|
| 115 |
-
`NotImplementedError` from `__init__`. Same pattern in `HFJobsExecutor`.
|
| 116 |
-
The recipe doc warns about skeleton-status much further down (line 731),
|
| 117 |
-
but the inline code example at line 519 will break the moment a reader
|
| 118 |
-
copy-pastes it.
|
| 119 |
-
|
| 120 |
-
Wave 13 Finding 7 noted this softness; Wave 14 made it worse by writing
|
| 121 |
-
example code that calls a constructor that always raises.
|
| 122 |
-
|
| 123 |
-
**Fix direction:** in every code block that calls `ModalExecutor(...)`,
|
| 124 |
-
prepend a comment `# Wave 14: skeleton — raises NotImplementedError`
|
| 125 |
-
or flip examples to `LocalProcessExecutor`.
|
| 126 |
-
|
| 127 |
-
---
|
| 128 |
-
|
| 129 |
-
## Finding 4 — SUGGESTION: MockManager + DiLoCo integration test only exercises `world_size=1`.
|
| 130 |
-
|
| 131 |
-
**Severity:** SUGGESTION
|
| 132 |
-
**Evidence:** `composer_replication/diloco/serverless/tests/test_serverless_diloco_integration.py:44-51`,
|
| 133 |
-
`:108-109`, `:161`. Both `test_mockmanager_diloco_outer_round_completes`
|
| 134 |
-
and `test_mockmanager_diloco_two_outer_rounds_step_counter` use
|
| 135 |
-
`world_size=1`.
|
| 136 |
-
|
| 137 |
-
With one replica, `ObjectStoreAllReduce.allreduce` returns the tensor
|
| 138 |
-
unchanged (its own mean), so an averaging bug in the multi-replica path
|
| 139 |
-
could not be caught by this test. The pseudo-gradient sign convention
|
| 140 |
-
is pinned by the unrelated spike-008 test, but **no test combines
|
| 141 |
-
MockManager + DiLoCo + multi-process** — i.e. the actual deployment
|
| 142 |
-
scenario is unverified end-to-end.
|
| 143 |
-
|
| 144 |
-
Wave 13 Finding 4 is closed in spirit (call surface is now exhaustive)
|
| 145 |
-
but not in the deepest sense.
|
| 146 |
-
|
| 147 |
-
**Fix direction:** add one multi-process test that spawns `n_replicas`
|
| 148 |
-
subprocesses, each constructing `MockManager(store) → make_diloco_outer_loop`,
|
| 149 |
-
and asserts that after one outer round all replicas converge to the same
|
| 150 |
-
parameter values (i.e. averaging actually happened).
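
Why `world_size=1` cannot catch an averaging bug can be shown with a toy model. The sketch below uses plain Python floats; `mean_allreduce` and `buggy_allreduce` are hypothetical stand-ins, not the real `ObjectStoreAllReduce`, and the bug shown (returning the first replica's value) is just one assumed example of a broken averaging path.

```python
def mean_allreduce(replica_values):
    """Correct behaviour: every replica receives the mean of all values."""
    return sum(replica_values) / len(replica_values)


def buggy_allreduce(replica_values):
    """Assumed bug for illustration: forgets to average, returns rank 0's value."""
    return replica_values[0]


# With a single replica the two implementations are indistinguishable.
single = [3.0]
print(mean_allreduce(single), buggy_allreduce(single))

# With two replicas holding different values the bug becomes visible.
pair = [3.0, 5.0]
print(mean_allreduce(pair), buggy_allreduce(pair))
```

A multi-replica test only has discriminating power if the replicas start the outer round with different values; otherwise the mean again equals each input.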
|
| 151 |
-
|
| 152 |
-
---
|
| 153 |
-
|
| 154 |
-
## Finding 5 — SUGGESTION: T4 unit tests pin the wrong implementation as ground truth.
|
| 155 |
-
|
| 156 |
-
**Severity:** SUGGESTION
|
| 157 |
-
**Evidence:** `composer_replication/recipes/prime_rl/tests/test_composer_loss.py:90-128`
|
| 158 |
-
(`test_dppo_mask_clips_extreme_ratios`). The expected value `1.5/3` is
|
| 159 |
-
computed against the buggy formula (Finding 1).
|
| 160 |
-
|
| 161 |
-
The 10 PRIME-RL tests all pass — but they're testing self-consistency,
|
| 162 |
-
not parity with PRIME-RL. A reader looking at "10 unit tests, all green"
|
| 163 |
-
infers correctness; correctness is not what they verify. This is the
|
| 164 |
-
kind of test honesty failure that Wave 11 + Wave 13 reviewers found in
|
| 165 |
-
different forms.
|
| 166 |
-
|
| 167 |
-
**Fix direction:** add at least one test whose expected value is
|
| 168 |
-
hand-computed from `default_loss_fn` in PRIME-RL (or import + invoke
|
| 169 |
-
`default_loss_fn` if the dependency is available, mark the test
|
| 170 |
-
`@pytest.mark.skipif(not _HAS_PRIME_RL)`).
|
| 171 |
-
|
| 172 |
-
---
|
| 173 |
-
|
| 174 |
-
## Finding 6 — NIT: README/test-count drift.
|
| 175 |
-
|
| 176 |
-
Wave 14 task description claims "124 tests passing as of Wave 14"; actual
|
| 177 |
-
`pytest --collect-only` reports **134 collected**. Of those, the 61-test
|
| 178 |
-
wave-relevant subset all pass. Not a defect, but the headline number is
|
| 179 |
-
now off in the same way Wave 13's "9 multi-process tests" was off.
|
| 180 |
-
|
| 181 |
-
---
|
| 182 |
-
|
| 183 |
-
## Finding 7 — NIT: `loss_fn` docstring claims "DPPO importance-sampling-ratio clipping — implemented" (`composer_loss.py:9`).
|
| 184 |
-
|
| 185 |
-
Implementation contains no importance-ratio multiplication anywhere.
|
| 186 |
-
Even if Finding 1 is rejected and the team decides "PRIME-RL match isn't
|
| 187 |
-
a goal", the docstring is internally false: it announces ISR clipping
|
| 188 |
-
in a function that does not multiply by `exp(log_ratio)`.
|
| 189 |
-
|
| 190 |
-
---
|
| 191 |
-
|
| 192 |
-
## Cross-cutting
|
| 193 |
-
|
| 194 |
-
The four doc subagents wrote internally consistent text but inherited
|
| 195 |
-
T4's mathematical error. **Three of the four doc files repeat the same
|
| 196 |
-
wrong formula verbatim.** This is exactly the failure mode Wave 11/13
|
| 197 |
-
reviewers flagged: parallel subagents cross-citing each other rather
|
| 198 |
-
than the upstream source of truth.
|
| 199 |
-
|
| 200 |
-
The 61 tests in the Wave-14-touched dirs pass cleanly. T1, T2, and T3
|
| 201 |
-
are real closures with real coverage. The framework is in a **better**
|
| 202 |
-
state than end-of-Wave-13 — but it has not actually closed Wave 13
|
| 203 |
-
Finding 6, and it has propagated a subtler version of the same
|
| 204 |
-
mathematical-mismatch bug into the user-facing documentation.
|
| 205 |
-
|
| 206 |
-
---
|
| 207 |
-
|
| 208 |
-
## Summary scorecard
|
| 209 |
-
|
| 210 |
-
| Wave 13 Finding | Wave 14 status | Verdict |
|
| 211 |
-
|---|---|---|
|
| 212 |
-
| BLOCKER 1 (PRIME-RL SDPO degenerate) | Fixed parent-side; channel 2 raises NotImplementedError | ✅ closed |
|
| 213 |
-
| BLOCKER 2 (compose_loss kwargs not added) | T1 added them + 11 integration tests | ✅ closed |
|
| 214 |
-
| Suggestion 3 (replaysim YAML field types) | T2 dual-shape reshape + real DJ e2e + caught related bug | ✅ closed |
|
| 215 |
-
| Suggestion 4 (MockManager → DiLoCo gap) | T3 surface audit + integration test | 🟡 closed for `world_size=1`; multi-process unverified |
|
| 216 |
-
| Suggestion 5 ("9 multi-process tests" inflated count) | Not addressed | 🟡 carried over |
|
| 217 |
-
| Suggestion 6 (PRIME-RL channel 1 REINFORCE not GRPO) | T4 thought it closed this | ❌ **NOT closed — mathematically wrong** |
|
| 218 |
-
| Suggestion 7 (Modal/HFJobs skeleton clarity) | Made worse by INTEGRATION_RECIPES dead code | 🟡 regression |
|
| 219 |
-
| NIT 8 (SimPO test positive log-probs) | Not addressed | 🟡 carried over |
|
| 220 |
-
|
| 221 |
-
## Wave 14b follow-up (2026-05-26)
|
| 222 |
-
|
| 223 |
-
After Wave 14b closed Finding 1 by re-reading PRIME-RL upstream and
|
| 224 |
-
matching `default_loss_fn` byte-for-byte, the Wave 14b subagent flagged
|
| 225 |
-
a **new** structural issue not in the Wave 14 review:
|
| 226 |
-
|
| 227 |
-
**PRIME-RL's `setup_loss_fns` (upstream `loss.py:320-327`) expects the
|
| 228 |
-
custom loss function to return `LossOutputs(loss, metrics={...})`, not
|
| 229 |
-
a bare scalar tensor.** Our recipe still returns a bare scalar. This
|
| 230 |
-
predates Wave 14 (it's been wrong since the recipe was first written in
|
| 231 |
-
Wave 13) but was never caught because no test runs against actual
|
| 232 |
-
PRIME-RL.
|
| 233 |
-
|
| 234 |
-
**Status:** documented; deferred to Wave 15. Not blocking for Wave 14b's
|
| 235 |
-
closure of Finding 1, because the formula now matches upstream — the
|
| 236 |
-
return-shape mismatch is a separate adapter-level issue. Tests still
|
| 237 |
-
pass because they invoke our `loss_fn` directly without going through
|
| 238 |
-
PRIME-RL's `compute_loss` pipeline.
|
| 239 |
-
|
| 240 |
-
**Fix direction (Wave 15):** wrap the return value in a duck-typed
|
| 241 |
-
`LossOutputs` (provided by PRIME-RL when installed; substituted with a
|
| 242 |
-
NamedTuple shim when not). Add an integration smoke test against PRIME-RL
|
| 243 |
-
to catch this and similar adapter-shape regressions.
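
One possible shape for the shim described above is sketched here. The import path `prime_rl.trainer.rl.loss` is an assumption inferred from the upstream file location cited in this review, the fallback field names follow the `LossOutputs(loss, metrics={...})` call shape, and `wrap_loss` is a hypothetical helper name.

```python
from typing import Any, NamedTuple

try:
    # Assumed import path; adjust if upstream exposes LossOutputs elsewhere.
    from prime_rl.trainer.rl.loss import LossOutputs
except ImportError:
    class LossOutputs(NamedTuple):
        """Fallback shim used only when PRIME-RL is not installed."""
        loss: Any
        metrics: dict


def wrap_loss(loss, **metrics):
    """Wrap a bare loss value in a LossOutputs-shaped object (hypothetical helper)."""
    return LossOutputs(loss, metrics=dict(metrics))


out = wrap_loss(0.5, kept_tokens=3)
print(out.loss, out.metrics)
```

The shim only guarantees attribute access by `.loss` and `.metrics`; whether that is sufficient for `setup_loss_fns` still needs the integration smoke test proposed above.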
|
| 244 |
-
|
| 245 |
-
## Final Wave 14 + 14b status
|
| 246 |
-
|
| 247 |
-
| Wave 13 / 14 finding | Wave 14b status |
|
| 248 |
-
|---|---|
|
| 249 |
-
| W13 BLOCKER 1: PRIME-RL SDPO degenerate | ✅ closed (parent, channel 2 deferred) |
|
| 250 |
-
| W13 BLOCKER 2: compose_loss kwargs not added | ✅ closed (Wave 14 T1) |
|
| 251 |
-
| W13 Suggestion 3: replaysim YAML field types | ✅ closed (Wave 14 T2) |
|
| 252 |
-
| W13 Suggestion 4: MockManager → DiLoCo gap | ✅ closed (Wave 14 T3 + Wave 14b multi-process test) |
|
| 253 |
-
| W13 Suggestion 6: PRIME-RL channel 1 REINFORCE not GRPO | ✅ **closed in Wave 14b** (matches upstream `default_loss_fn`) |
|
| 254 |
-
| W14 Finding 1: PRIME-RL impl wrong | ✅ closed in Wave 14b |
|
| 255 |
-
| W14 Finding 2: ADR-007 stale | ✅ closed in Wave 14b |
|
| 256 |
-
| W14 Finding 3: ModalExecutor dead code | ✅ closed in Wave 14b |
|
| 257 |
-
| W14 Finding 4: world_size=1 only | ✅ closed in Wave 14b (multi-process convergence test) |
|
| 258 |
-
| W14 Finding 5: tests pin wrong impl as ground truth | ✅ closed in Wave 14b (parity test added) |
|
| 259 |
-
| W14 NIT 6: test count drift | 🟡 carried |
|
| 260 |
-
| W14 NIT 7: docstring claims ISR clipping | ✅ closed in Wave 14b (real ISR now implemented) |
|
| 261 |
-
| **NEW (Wave 14b)**: PRIME-RL `LossOutputs` return shape | 🟡 deferred to Wave 15 |
|
| 262 |
-
|
| 263 |
-
**Tests as of Wave 15: 115 passing + 1 skip-marked (OPSD parity test, runs when upstream cloned).** (Wave 12: 72; Wave 13: 93; Wave 14: 124; Wave 14b: 130; Wave 15: 115 after TAID rewrite consolidation + OPSD parity.)
|
|
|
|
| 1 |
# Wave 14 Adversarial Cross-Model Review
|
| 2 |
|
| 3 |
+
> **📦 Archived (2026-06-08).** This point-in-time wave review has been moved to
|
| 4 |
+
> [`docs/research/_archive/WAVE_14_FINAL_REVIEW.md`](_archive/WAVE_14_FINAL_REVIEW.md).
|
| 5 |
+
> It is preserved verbatim for provenance but is superseded by the current
|
| 6 |
+
> [`docs/METHODOLOGY.md`](../METHODOLOGY.md) and [`BACKLOG.md`](../../BACKLOG.md). See
|
| 7 |
+
> [`docs/research/_archive/README.md`](_archive/README.md) for the archive index.
docs/research/WAVE_15_FINAL_REVIEW.md
CHANGED
|
@@ -1,76 +1,7 @@
|
|
| 1 |
# Wave 15 Final Review — Multi-Angle Self-Critique + Fix Wave
|
| 2 |
|
| 3 |
-
**
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
**The math reviewer found 2 BLOCKERs that all 8+ prior subagents missed.** Both came from `git clone`-ing upstream and doing line-by-line diffs against the framework's `composer_replication/opsd.py` and `composer_replication/distillation/taid.py` — something no prior reviewer had done for those files (Wave 14b reviewer did it for PRIME-RL only).
|
| 9 |
-
|
| 10 |
-
This validates the user's instinct that "every angle" multi-model orchestration is worth doing — the math angle, given a sharp prompt that mandated upstream verification, found genuine bugs in the framework's primary loss kernel.
|
| 11 |
-
|
| 12 |
-
## Wave 15a reviews (all 4 deliverables)
|
| 13 |
-
|
| 14 |
-
| Reviewer | Focus | BLOCKERs | Severity-weighted findings |
|
| 15 |
-
|---|---|---|---|
|
| 16 |
-
| Math correctness (Opus 4.7) | 7 claimed implementations vs primary sources | **2 BLOCKER + 3 minor** | `generalized_jsd_loss` math wrong; `taid_loss` algorithm wrong |
|
| 17 |
-
| Test honesty (Opus 4.7) | 3 specific test files | 0 BLOCKER + 3 weak-assertions | PRIME-RL parity skip silently never runs; bit-exact uses `allclose` not `equal`; entropy-OPD test is pure smoke |
|
| 18 |
-
| Documentation drift (Opus 4.7) | 6 major docs + ADRs | 0 BLOCKER + 7 drifts | test count drift (77/107/124 vs actual 145); `compose_loss` kwarg drift; PRIME-RL test count 10 vs 16; stale "Deferred to Wave 14" claim |
|
| 19 |
-
| User journey (Opus 4.7) | RL-finetune Qwen-7B on GSM8K | 0 BLOCKER + 10 friction items | **No GSM8K example** (#1 ask); no runnable `ComposerReplicationTrainer` recipe; data-collator gap undocumented; defaults activate channels users haven't configured |
|
| 20 |
-
|
| 21 |
-
Reports saved at `/tmp/wave15_{math,test,doc,user}_review.md`.
|
| 22 |
-
|
| 23 |
-
## Wave 15b — fix scatter outcomes
|
| 24 |
-
|
| 25 |
-
5 parallel fix subagents dispatched. Outcomes:
|
| 26 |
-
|
| 27 |
-
| Task | Subagent outcome |
|
| 28 |
-
|---|---|
|
| 29 |
-
| (1) OPSD math rewrite vs upstream | ✅ Completed. New parity test (skip-marked) verifies 31 cases against upstream `siyan-zhao/OPSD`. Mixture distribution now β-weighted (was hardcoded 0.5); β coefficient on correct terms (was swapped); reduction matches upstream (was off by a factor of 100-2000×). Docstring labels fixed (β=0 = reverse KL, β=1 = forward KL). |
|
| 30 |
-
| (2) TAID rewrite vs upstream | ⚠️ Subagent timed out at 600s but **work landed**: logit-space mix (was prob-space), current-student-detached anchor (was frozen step-0), forward-KL criterion (was JSD), optional `TAIDScheduler` for adaptive scheme. Docstring rewritten to acknowledge the breaking change. Tests updated. Parity test added. |
|
| 31 |
-
| (3) GSM8K example | ⚠️ Subagent timed out but **work landed**: `examples/gsm8k_grpo/run.py` runs end-to-end on CPU with Qwen2.5-0.5B-Instruct, 100 GSM8K rows, regex-based verifiable reward, 2 outer steps in 58s. README written by parent agent. The `run_with_sdpo.py` variant deferred to Wave 16. |
|
| 32 |
-
| (4) Doc drift + install ergonomics | ⚠️ Subagent timed out. **Parent completed:** flipped `alpha_sdpo` and `beta_replay` defaults to 0.0; added clear ImportError if TRL missing; fixed TROUBLESHOOTING `[replay]` extras claim; updated README + USER_GUIDE + INTEGRATION_RECIPES test counts to reference V1_V8_COVERAGE; closed stale "Deferred to Wave 14" claim. |
|
| 33 |
-
| (5) Test hardening + LossOutputs wrap | ✅ Completed (3 of 4 sub-tasks). PRIME-RL `loss_fn` now returns `LossOutputs(loss, metrics)`. Bit-exact test tightened to `torch.equal`. PRIME-RL parity test now emits visible warning when prime-rl unavailable. Gradient-flow tests deferred to Wave 16. |
|
| 34 |
-
|
| 35 |
-
## Final test count post-Wave-15: 115 passing + 1 skip-marked
|
| 36 |
-
|
| 37 |
-
- Wave-by-wave: 72 (W12) → 93 (W13) → 124 (W14) → 130 (W14b) → **115** (W15)
|
| 38 |
-
- Net decrease from 130: TAID rewrite consolidated 16 schedule-specific tests into 7 `t`-parameterized tests (smaller surface but stronger contracts: each test now exercises the actual paper algorithm). Plus 1 skip-marked OPSD parity test.
|
| 39 |
-
- Trade-off: fewer tests, but 2 BLOCKER-class math bugs eliminated. Net correctness improvement is large.
|
| 40 |
-
|
| 41 |
-
## What this round caught vs missed
|
| 42 |
-
|
| 43 |
-
### Caught (improvements over Wave 14b state)
|
| 44 |
-
- 2 math BLOCKERs in primary loss kernels, fixed against upstream byte-for-byte
|
| 45 |
-
- TAID rewrite from misnamed prob-space-JSD-with-frozen-anchor to actual SakanaAI/TAID
|
| 46 |
-
- PRIME-RL `LossOutputs` adapter wrap — recipe is now actually invokable from PRIME-RL
|
| 47 |
-
- GSM8K real-task example — closes the user-reviewer's #1 friction
|
| 48 |
-
- Default kwargs (`alpha_sdpo=0.1` → `0.0`) — no more silent activation of unconfigured channels
|
| 49 |
-
- TRL ImportError clarity — no more cryptic `object.__init__()` errors
|
| 50 |
-
- Test count drift — single canonical doc (V1_V8_COVERAGE)
|
| 51 |
-
- TROUBLESHOOTING `[replay]` extras correctly described
|
| 52 |
-
|
| 53 |
-
### Missed (Wave 16 candidates)
|
| 54 |
-
- `run_with_sdpo.py` — promised but not shipped this wave
|
| 55 |
-
- 3 gradient-flow tests for compose_loss channels (test reviewer's #4)
|
| 56 |
-
- Multi-process MockManager + DiLoCo convergence test was added in Wave 14b but only at world_size=2; user reviewer didn't probe larger
|
| 57 |
-
- Recon docs (`docs/research/*RECONNAISSANCE.md`) not cross-checked against current code state — likely some staleness
|
| 58 |
-
- PRIME-RL recipe still hasn't been run end-to-end against actual prime-rl (parity test skip-marks; LossOutputs wrap added but not exercised)
|
| 59 |
-
|
| 60 |
-
## Methodological lessons for future waves
|
| 61 |
-
|
| 62 |
-
1. **Prompt subagents to clone upstream and diff** when the task is "verify against external truth." 8+ prior reviewers checked papers but did not `git clone`. The instruction "read /tmp/X-clone/file.py and find every divergence" produced the BLOCKER-class findings.
|
| 63 |
-
|
| 64 |
-
2. **600s subagent timeout is the dominant constraint at this scope.** 3 of 5 fix subagents timed out despite making real progress. Workaround: write the report file FIRST as a skeleton, iterate in place. (Subagents that did this completed; subagents that read everything then tried to write at the end timed out.)
|
| 65 |
-
|
| 66 |
-
3. **Cross-cutting parallel-subagent failure mode**: subagents cite each other instead of upstream. Wave 14 caught this for PRIME-RL math. Wave 15 caught it for OPSD + TAID math. The mitigation is to mandate upstream verification in the prompt.
|
| 67 |
-
|
| 68 |
-
4. **Prompt injection in tool outputs**: one subagent flagged that fake "don't reproduce copyrighted material" instructions appeared in its tool outputs throughout, designed to make it abandon the OPSD math fix. The subagent correctly ignored the injection and completed the task. The framework's MIT-licensed work with attribution is fully authorized; no copyright concern.
|
| 69 |
-
|
| 70 |
-
## Open items for Wave 16
|
| 71 |
-
|
| 72 |
-
1. `examples/gsm8k_grpo_with_sdpo/` — demonstrate SDPO column wiring end-to-end
|
| 73 |
-
2. Gradient-flow tests for compose_loss channels (pre-staged in test reviewer's report)
|
| 74 |
-
3. Recon-doc currency sweep: cross-check `docs/research/*RECONNAISSANCE.md` against current code state
|
| 75 |
-
4. Real PRIME-RL end-to-end run with the new `LossOutputs` wrap (verify the wrap shape works in the real `setup_loss_fns` pipeline)
|
| 76 |
-
5. `INTEGRATION_RECIPES.md` `compose_loss` signature display — collapse to `...` and link to `API_REFERENCE.md`, OR sync to all 17 kwargs
|
|
|
|
| 1 |
# Wave 15 Final Review — Multi-Angle Self-Critique + Fix Wave
|
| 2 |
|
| 3 |
+
> **📦 Archived (2026-06-08).** This point-in-time wave review has been moved to
|
| 4 |
+
> [`docs/research/_archive/WAVE_15_FINAL_REVIEW.md`](_archive/WAVE_15_FINAL_REVIEW.md).
|
| 5 |
+
> It is preserved verbatim for provenance but is superseded by the current
|
| 6 |
+
> [`docs/METHODOLOGY.md`](../METHODOLOGY.md) and [`BACKLOG.md`](../../BACKLOG.md). See
|
| 7 |
+
> [`docs/research/_archive/README.md`](_archive/README.md) for the archive index.
|
docs/research/WAVE_7_10_FINAL_REVIEW.md
CHANGED
|
@@ -1,423 +1,8 @@
|
|
| 1 |
# Wave 7–10 Final Review — Cross-model adversarial check
|
| 2 |
|
| 3 |
-
**
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
---
|
| 10 |
-
|
| 11 |
-
## (a) Are the tests real evidence or theater?
|
| 12 |
-
|
| 13 |
-
### Spike 006 (Qwen2.5-0.5B-Instruct CPU smoke, 9 tests)
|
| 14 |
-
|
| 15 |
-
**Verdict: mostly tautology, with usable ablation tests.**
|
| 16 |
-
|
| 17 |
-
The headline "loss 0.7390 → 0.0031, 99.6% reduction in 5 steps" is
|
| 18 |
-
technically true and substantively near-tautological:
|
| 19 |
-
|
| 20 |
-
1. **The same fixed ~50-token batch is reused for all 5 steps.**
|
| 21 |
-
`build_batch` returns one conversation; the test loop calls
|
| 22 |
-
`compose_loss(model, batch)` five times in a row. No reshuffle, no
|
| 23 |
-
second batch, no held-out anything.
|
| 24 |
-
2. **0.5B params × AdamW(lr=1e-5) × identical 50-token batch ×
|
| 25 |
-
5 steps = textbook memorization regime.** A randomly-initialized MLP
|
| 26 |
-
would also reduce loss in this setup. The test does not distinguish
|
| 27 |
-
"the 3-channel composition is correct" from "AdamW reduces fixed-batch
|
| 28 |
-
loss on any non-degenerate objective."
|
| 29 |
-
3. **The SDPO channel is zero throughout** (`sdpo_jsd=0.0` on every
|
| 30 |
-
row of `loss_curve.csv`). The verdict calls this "correct fallback
|
| 31 |
-
behavior"; what it actually is is *the entire SDPO channel never
|
| 32 |
-
being tested by this smoke*. The fallback is a literal `_zero(device)`.
|
| 33 |
-
`generalized_jsd_loss` has no end-to-end test on a real HF model
|
| 34 |
-
anywhere in the codebase. **This is the largest evidence gap for
|
| 35 |
-
V8.**
|
| 36 |
-
4. **DPO uses dummy hard-coded reference logprobs** (`-30.0`, `-35.0`).
|
| 37 |
-
This tests that `-logsigmoid(small_positive)` is differentiable, not
|
| 38 |
-
that the trace-replay-DPO pipeline (reference-policy precompute +
|
| 39 |
-
collator + loss) wires together.
|
| 40 |
-
5. The "loss decreases" assertion is `losses[-1] < losses[0]` — the
|
| 41 |
-
weakest version of monotonicity.
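
To make item 5 concrete, the sketch below contrasts the endpoint check with a strict per-step check. Both function names are hypothetical, and the loss values are made up for illustration; they are not taken from `loss_curve.csv`.

```python
def endpoint_decrease(losses):
    """The check used by the smoke test: only first and last values matter."""
    return losses[-1] < losses[0]


def strictly_decreasing(losses):
    """A stronger check: every step must lower the loss."""
    return all(later < earlier for earlier, later in zip(losses, losses[1:]))


# Illustrative, made-up curve that oscillates but ends lower than it starts.
noisy_curve = [0.74, 0.90, 0.60, 0.70, 0.50]

print(endpoint_decrease(noisy_curve))
print(strictly_decreasing(noisy_curve))
```

A strict check may be too brittle for stochastic training, so an intermediate option such as comparing smoothed averages could be a reasonable middle ground.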
|
| 42 |
-
|
| 43 |
-
**Genuine value**: model-loads test, chat-template test, the three
|
| 44 |
-
α=0 / β=0 ablation tests. The ablations would catch a regression where
|
| 45 |
-
weights stop disabling channels. The "5-step decrease" is the weakest
|
| 46 |
-
test in the file.
|
| 47 |
-
|
| 48 |
-
**Run.log inconsistency, not flagged anywhere**:
|
| 49 |
-
`examples/qwen_05b_quickstart/run.log` shows step-1 total = 0.0379;
|
| 50 |
-
the spike `verdict.md` quotes step-1 total = 0.2090 for the same code,
|
| 51 |
-
same model. Either the seed isn't pinned through the model forward
|
| 52 |
-
(likely — `torch.manual_seed(42)` is in `build_batch` only), or the
|
| 53 |
-
package's `compose_loss` differs subtly from the spike's. **Quoting
|
| 54 |
-
exact numbers from a non-reproducible run as evidence is the sloppy
|
| 55 |
-
version of every research-replication scandal.**
|
| 56 |
-
|
| 57 |
-
### Spike 007 (Claude Code ingester, 15 tests)
|
| 58 |
-
|
| 59 |
-
**Verdict: strongest test suite of the three. Caveats apply.**
|
| 60 |
-
|
| 61 |
-
Real engineering value:
|
| 62 |
-
- Synthetic fixture exercises the actual record types (assistant,
|
| 63 |
-
user/tool_result, summary, system, sidechain). Tests assert structural
|
| 64 |
-
properties: history grows monotonically, `[THINKING]` stripped on
|
| 65 |
-
replay but kept in student_action, unique state_ids, tool_use
|
| 66 |
-
serialization, tool_result tagging.
|
| 67 |
-
- `test_truncated_line_tolerated` would catch a real failure mode:
|
| 68 |
-
removal of the JSON-decode try/except.
|
| 69 |
-
- Subagent and sidechain skip tests catch real production cases.
|
| 70 |
-
|
| 71 |
-
Caveats:
|
| 72 |
-
- **The "real session" test is hardcoded to one path on the author's
|
| 73 |
-
machine** (`/home/codeseys/.claude/projects/…/e4a34e2b-….jsonl`).
|
| 74 |
-
No env var, no fixture-discovery; the test is `skipif(not exists)`.
|
| 75 |
-
This is a manual integration test, not a CI test. ADR-002 said "CI
|
| 76 |
-
users substitute their own"; the substitution mechanism doesn't
|
| 77 |
-
exist.
|
| 78 |
-
- The synthetic fixture is **author-written** and presumably designed
|
| 79 |
-
alongside the ingester. There is no scrubbed third-party fixture.
|
| 80 |
-
- Acceptance criterion #3 in BACKLOG ("end-to-end smoke: real trace →
|
| 81 |
-
ingester → collator → 1-step `composer_total_loss`") is **unmet** —
|
| 82 |
-
the spike stops at "ingester emits TraceStates correctly." There is
|
| 83 |
-
no test that takes ingested records, runs the data collator, and runs
|
| 84 |
-
through `compose_loss`.
|
| 85 |
-
|
| 86 |
-
This suite would catch real regressions in the ingester. Its weakest
|
| 87 |
-
property: ships no contributor-runnable real-trace test.
|
| 88 |
-
|
| 89 |
-
### Spike 008 (DiLoCo, 5 tests, single-process)
|
| 90 |
-
|
| 91 |
-
**Verdict: the caveat is honest but says the test does not test what
|
| 92 |
-
users will assume it tests.**
|
| 93 |
-
|
| 94 |
-
BACKLOG acceptance criterion: *"Smoke test: 2 replicas × 4 inner steps
|
| 95 |
-
× 2 outer rounds on the toy model from Spike 005, both replicas converge
|
| 96 |
-
toward the same solution within tolerance."*
|
| 97 |
-
|
| 98 |
-
What ships: **one** replica, mock manager whose `allreduce` is a
|
| 99 |
-
**`passthrough` no-op** (test_diloco_smoke.py:78). This is "one
|
| 100 |
-
replica's outer optimizer machinery fires," not "two replicas
|
| 101 |
-
converge." The acceptance criterion was silently re-defined; the
|
| 102 |
-
spike's verdict.md calls this a "limitation" but it is a redefinition.
|
| 103 |
-
|
| 104 |
-
The recon doc (per ADR-003) claimed a "ready-to-paste" pattern with real
|
| 105 |
-
shared-buffer averaging. The implementation hits a "post-hook
|
| 106 |
-
sequencing bug." **Either the recon claim or the implementation is
|
| 107 |
-
wrong**, and the gap is buried in verdict.md instead of fixed.
|
| 108 |
-
|
| 109 |
-
**Genuine value**: `test_diloco_pseudogradient_sign_convention` is the
|
| 110 |
-
**single best test in all of Wave 7-10**. It pins the sign convention
|
| 111 |
-
with a concrete arithmetic prediction (`final == θ_initial + nudge`)
|
| 112 |
-
and reports `wrong_sign_diff` on failure. A future torchft upgrade that
|
| 113 |
-
flips the sign breaks this test loudly. ADR-003 specifically flagged
|
| 114 |
-
this hazard, and the test catches it. Credit where due.
|
| 115 |
-
|
| 116 |
-
**Separate flaw in `composer_diloco.py` docstring (lines 13–28)**: the
|
| 117 |
-
"wrong-sign pseudogradient combined with SGD's subtract-grad semantics
|
| 118 |
-
gives net step in the local-Δ direction once momentum builds up" gloss
|
| 119 |
-
is incoherent. There is no "wrong-sign" pseudogradient.
|
| 120 |
-
`θ_initial − θ_local` is the exact DiLoCo paper convention; SGD's
|
| 121 |
-
`p ← p − lr·g` semantics are designed for it. The test is correct; the
|
| 122 |
-
prose explaining why is wrong, and will mislead anyone porting the
|
| 123 |
-
convention.
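
The sign argument can be verified with scalar arithmetic. This is a minimal sketch assuming plain SGD as the outer optimizer with momentum omitted; `outer_sgd_step` is a hypothetical name and is not the wrapper in `composer_diloco.py`.

```python
def outer_sgd_step(theta_initial, theta_local, outer_lr=1.0):
    """One outer step with pseudo-gradient g = theta_initial - theta_local.

    Plain SGD update p <- p - lr * g, momentum omitted for clarity.
    """
    pseudo_grad = theta_initial - theta_local
    return theta_initial - outer_lr * pseudo_grad


theta_initial = 1.0
nudge = 0.5                      # net movement produced by the inner steps
theta_local = theta_initial + nudge

# With outer_lr = 1.0 the outer step lands exactly on the local parameters.
final = outer_sgd_step(theta_initial, theta_local)
print(final)

# A smaller outer learning rate moves part of the way in the same direction.
print(outer_sgd_step(theta_initial, theta_local, outer_lr=0.5))
```

Because the pseudo-gradient is negative when the local parameters moved up, subtracting it moves the global parameters in the local direction with no sign correction needed.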
|
| 124 |
-
|
| 125 |
-
---
|
| 126 |
-
|
| 127 |
-
## (b) Is the package a real framework or a shim?
|
| 128 |
-
|
| 129 |
-
**Verdict: a structured shim around three real components and two
|
| 130 |
-
stubs. Not yet a framework.**
|
| 131 |
-
|
| 132 |
-
What `pip install composer-replication` delivers:
|
| 133 |
-
- `compose_loss` — labeled in its own top docstring as "Do NOT use as
|
| 134 |
-
the production training loss." Re-exported as the headline package
|
| 135 |
-
API and used in the quickstart.
|
| 136 |
-
- `build_batch` — a hard-coded fixed-conversation factory built for the
|
| 137 |
-
smoke (factorial / binary-search examples). Anyone using this in
|
| 138 |
-
real training is using example code as production.
|
| 139 |
-
- `ClaudeCodeIngester` — real, working component. Solid.
|
| 140 |
-
- `generalized_jsd_loss` — real, working (extracted from OPSD, MIT).
|
| 141 |
-
- `extract_dpo_pairs`, `replay_trace`, teacher specs — real, but
|
| 142 |
-
require OpenRouter credentials + spend.
|
| 143 |
-
- `ComposerReplicationTrainer` — TRL `GRPOTrainer` subclass.
|
| 144 |
-
Useful only with `[train]` extra. Not exercised end-to-end on any
|
| 145 |
-
real model in this repo.
|
| 146 |
-
- `make_diloco_outer_loop` — wrapper. Useful only with `[diloco]` extra.
|
| 147 |
-
|
| 148 |
-
What is missing for "pip install and start training":
|
| 149 |
-
1. No GPU end-to-end example. The brief targets Qwen3-7B / Qwen3-32B.
|
| 150 |
-
2. No CLI. `pyproject.toml` declares no `[project.scripts]`.
|
| 151 |
-
3. No config schema (Hydra/Pydantic). Users hand-construct teacher
|
| 152 |
-
specs, hint generators, data collators.
|
| 153 |
-
4. The `[train]` extra pulls TRL but **no integration test** of
|
| 154 |
-
`ComposerReplicationTrainer` against a real GRPO rollout exists in
|
| 155 |
-
this repo. Spike 005 used TinyLM; Spike 006 stubbed GRPO out
|
| 156 |
-
precisely to avoid TRL.
|
| 157 |
-
5. **`build_batch` should not be public API.** It belongs in
|
| 158 |
-
`examples/`. Re-exporting at top level implies it is a general-purpose
|
| 159 |
-
utility.
|
| 160 |
-
6. **Two sources of truth**: `composer_replication/loss.py` is a
|
| 161 |
-
near-copy of `spikes/006-…/compose_loss.py` with one import path
|
| 162 |
-
changed. The spike tests still import from the spike file. A bug fix
|
| 163 |
-
in one will not propagate. Same for `composer_diloco.py` ↔
|
| 164 |
-
`composer_replication/diloco/__init__.py`.
|
| 165 |
-
|
| 166 |
-
Real framework value:
|
| 167 |
-
- `ClaudeCodeIngester` with non-trivial logic.
|
| 168 |
-
- `generalized_jsd_loss` with token-clip + temperature.
|
| 169 |
-
- DiLoCo wrapper with sign-pinning test.
|
| 170 |
-
- Sane package layout with optional extras for heavy deps.
|
| 171 |
-
|
| 172 |
-
Net: **a successful directory restructure plus an installable wrapper
|
| 173 |
-
around three real components and two stubs.** Calling Wave 10 "framework
|
| 174 |
-
is installable with working entrypoints (✅)" is letter-of-the-law;
|
| 175 |
-
the brief's "framework" connotation isn't yet earned.
|
| 176 |
-
|
| 177 |
-
---
|
| 178 |
-
|
| 179 |
-
## (c) ADR defensibility
|
| 180 |
-
|
| 181 |
-
### ADR-001 (local 5090 over Modal)
|
| 182 |
-
|
| 183 |
-
**Reasoning defensible; execution missing.** The
|
| 184 |
-
"iteration cycle 25–40s vs 3–5min" argument is concrete and matches
|
| 185 |
-
reality. The "verification smoke, not production" framing is correct.
|
| 186 |
-
|
| 187 |
-
**Gap**: Spike 002a-mini was never run on the 5090 either. Phase 10 in
|
| 188 |
-
DEEP_WORK_LOOP_LOG.md is ⏳ pending. ADR-001 chose the 5090 over Modal,
|
| 189 |
-
and **then nothing ran on either.** No `nvidia-smi` snapshot, no GPU
|
| 190 |
-
step-time CSV, no bf16 numerics check. The "rule out CPU-only blind
|
| 191 |
-
spots" goal is unmet. The ADR should be marked "Accepted (execution
|
| 192 |
-
deferred)" or the spike should run.
|
| 193 |
-
|
| 194 |
-
### ADR-002 (Claude Code JSONL trace source)
|
| 195 |
-
|
| 196 |
-
**Defensible on every dimension the ADR considers; the dimensions are
|
| 197 |
-
partial.** "1,015 real sessions, zero acquisition cost" is real. License
|
| 198 |
-
and schema-stability arguments are well-sourced.
|
| 199 |
-
|
| 200 |
-
**Adversarial counter not in the ADR**: Claude Code JSONL is the most
|
| 201 |
-
self-serving choice. The framework targets training a coding-agent model.
|
| 202 |
-
The training data is the author's own Claude Code sessions where the
|
| 203 |
-
agent was Claude. The teacher pool (Spike 001) is OpenRouter-based and
|
| 204 |
-
*includes Claude*. So:
|
| 205 |
-
- "student action" = what Claude did.
|
| 206 |
-
- teacher pool includes Claude.
|
| 207 |
-
- DPO pairs = teachers' agreement vs Claude's literal text.
|
| 208 |
-
|
| 209 |
-
This is **circular imitation**: training a future model to imitate
|
| 210 |
-
Claude using Claude's outputs as the gold reference and Claude as one
|
| 211 |
-
of the disagreement teachers. The teacher-disagreement signal density
|
| 212 |
-
argument from Spike 001 is strongest with diverse teachers. With this
|
| 213 |
-
trace source, the student-action is locked to one teacher family,
|
| 214 |
-
biasing the disagreement signal. The ADR doesn't consider this; the
|
| 215 |
-
ingester README doesn't flag it. **The ADR rationalizes the easy path
|
| 216 |
-
without naming the data-leakage tradeoff.**
|
| 217 |
-
|
| 218 |
-
### ADR-003 (torchft for DiLoCo)
|
| 219 |
-
|
| 220 |
-
**Genuinely defensible choice.** Meta-maintained library; rolling-own
|
| 221 |
-
trap correctly identified; license analysis (rejecting `diloco_simple`)
|
| 222 |
-
is right; sign-convention risk named and tested.
|
| 223 |
-
|
| 224 |
-
**Gap is in delivery, not decision.** ADR-003 §Consequences §1 says:
|
| 225 |
-
"2 replicas, 4 inner steps, 2 outer rounds on a TinyMLP, shared-buffer
|
| 226 |
-
mock allreduce, assertions: replica equality after sync, params actually
|
| 227 |
-
moved, Nesterov state populated, sync count matches expected." Spike 008
|
| 228 |
-
implements one replica + passthrough manager. The ADR commits to an
|
| 229 |
-
implementation that the spike does not deliver, and the gap is flagged
|
| 230 |
-
only in the spike's verdict, not in the ADR.
|
| 231 |
-
|
| 232 |
-
If the recon doc said the pattern was "ready-to-paste" but actually
|
| 233 |
-
hits a sequencing bug, **the recon doc is wrong** and an adversarial
|
| 234 |
-
reviewer is allowed to point that out.
|
| 235 |
-
|
| 236 |
-
---
|
| 237 |
-
|
| 238 |
-
## (d) Scorecard inflation
|
| 239 |
-
|
| 240 |
-
The 5/10 → 9/10 update overstates. Test by test:
|
| 241 |
-
|
| 242 |
-
- **Test 6 (DiLoCo integrated in runnable stack) → ✅?**
|
| 243 |
-
Letter-of-law yes, spirit no. `make_diloco_outer_loop` exists and
|
| 244 |
-
fires on one replica. **Zero references to torchft or DiLoCo in
|
| 245 |
-
`composer_trainer.py`** — DiLoCo is not integrated with the trainer.
|
| 246 |
-
No two-replica integration test, no real distributed run.
|
| 247 |
-
|
| 248 |
-
- **Test 7 (real HF model loads + runs) → ✅?**
|
| 249 |
-
Yes — most legitimately closed item. Caveats from §(a) about depth
|
| 250 |
-
of evidence apply, but the literal test is met.
|
| 251 |
-
|
| 252 |
-
- **Test 8 (real LLM-application trace ingested end-to-end) → ✅?**
|
| 253 |
-
Mostly yes. Ingester real and tested. **BACKLOG acceptance criterion
|
| 254 |
-
#3 ("end-to-end: real trace → ingester → collator → 1-step
|
| 255 |
-
`composer_total_loss`") is unmet.**
|
| 256 |
-
|
| 257 |
-
- **Test 9 (framework installable with working entrypoints) → ✅?**
|
| 258 |
-
Letter-of-law yes, spirit partial. `pip install -e .` works; the
|
| 259 |
-
quickstart runs the smoke harness. Production entrypoint
|
| 260 |
-
(`ComposerReplicationTrainer` driven by a config) does not exist.
|
| 261 |
-
|
| 262 |
-
- **Test 10 (non-author can complete the journey) → ✅?**
|
| 263 |
-
No. The supporting evidence is "Quickstart README + working
|
| 264 |
-
installable demonstrate the full path on Qwen2.5-0.5B in <5min, $0."
|
| 265 |
-
Test 10's original journey was "I have Qwen3-7B, I want a
|
| 266 |
-
Composer-style variant." The parenthetical concession in the update
|
| 267 |
-
("For Qwen3-7B etc., GPU phase still gates the empirical demo")
|
| 268 |
-
✅'s the item anyway.
|
| 269 |
-
|
| 270 |
-
**Honest re-scoring**: 5/10 → **7/10 ✅, 1/10 ⚠️ partial (test 8),
|
| 271 |
-
2/10 ❌ in spirit (tests 6, 10).** "9/10" overstates by ~2 points.
|
| 272 |
-
|
| 273 |
-
---
|
| 274 |
-
|
| 275 |
-
## (e) Commit quality
|
| 276 |
-
|
| 277 |
-
```
|
| 278 |
-
ac05fbf Wave 10 — packaging: composer_replication is now pip-installable
|
| 279 |
-
d52e126 Tidy .gitignore (de-dup *.jsonl, restore section blank lines)
|
| 280 |
-
a35a8d7 Spike 007: include synthetic_session.jsonl fixture in repo
|
| 281 |
-
57af35d Wave 7+8+9: spikes 006/007/008 — close vision-validation gaps V2/V5/V8
|
| 282 |
-
ac4bfb4 Wave 7: Phase 2-4 of deep work loop — backlog, parallel research, three ADRs
|
| 283 |
-
040eff8 Wave 6: vision validation self-audit (5/10 to 9/10 in 5 days, no GPU)
|
| 284 |
-
```
|
| 285 |
-
|
| 286 |
-
- `ac05fbf`, `d52e126`: accurate.
|
| 287 |
-
- `a35a8d7`: accurate. Implies `57af35d` shipped a Spike 007 that did
|
| 288 |
-
not actually run cleanly for anyone cloning before this commit. Mild
|
| 289 |
-
overclaim risk on `57af35d`.
|
| 290 |
-
- **`57af35d` is the single most overclaiming commit.** Title: "close
|
| 291 |
-
vision-validation gaps V2/V5/V8."
|
| 292 |
-
- V8: closed in the weakest sense (tautology critique above).
|
| 293 |
-
- V5: structural ingestion closes; BACKLOG acceptance #3 unmet.
|
| 294 |
-
- V2: silently re-defined (one replica, no convergence).
|
| 295 |
-
Three closures claimed; one weak, one partial, one redefined.
|
| 296 |
-
- **Chronology problem**: `040eff8` (Wave 6) declared the **5/10 → 9/10
|
| 297 |
-
forecast** in the commit subject. `ac4bfb4` (Wave 7, *next* commit)
|
| 298 |
-
added the BACKLOG and ADRs — i.e., the *plan* to make the forecast
|
| 299 |
-
true. `57af35d` (Wave 7-9) executed and ratified the 9/10 without
|
| 300 |
-
re-auditing whether each item was actually closed in spirit. **No
|
| 301 |
-
commit re-audits the scorecard against actually delivered evidence.**
|
| 302 |
-
|
| 303 |
-
---
|
| 304 |
-
|
| 305 |
-
## (f) Adversarial reviewer's strongest line of attack
|
| 306 |
-
|
| 307 |
-
> "You have a research replication framework whose only published smoke
|
| 308 |
-
> is a 5-step fixed-batch overfit on a 0.5B model on CPU, where the SDPO
|
| 309 |
-
> channel is silently disabled (sdpo_jsd=0 throughout), the DPO channel
|
| 310 |
-
> uses dummy reference logprobs, and the GRPO channel is replaced with
|
| 311 |
-
> a stub. Of the three channels you advertise, **zero are tested
|
| 312 |
-
> end-to-end on a real HF model.** Your DiLoCo integration is one
|
| 313 |
-
> replica with a no-op `allreduce`. Your real-trace ingester is tested
|
| 314 |
-
> against a fixture you wrote yourself plus a hardcoded path on your
|
| 315 |
-
> laptop. Your scorecard moved from 5/10 to 9/10 with no GPU spend, no
|
| 316 |
-
> third-party validation, and one commit that closed three vision-
|
| 317 |
-
> validation gaps with one commit message. You are asking the reader to
|
| 318 |
-
> believe that a $9B-startup commercial product is replicated by a CPU
|
| 319 |
-
> smoke and three green test files — none of which the company itself
|
| 320 |
-
> would call 'replicated.'"
|
| 321 |
-
|
| 322 |
-
**Weakest defense**: "It's just v0.1 / smoke phase / GPU is the next
|
| 323 |
-
phase." The *commit log and scorecard claim otherwise.* The defense
|
| 324 |
-
"v0.1 caveat" only works if the v0.1 framing is honest at the top of
|
| 325 |
-
the README and scorecard — and it is not.
|
| 326 |
-
|
| 327 |
-
**Strongest actual defense**: the four primary-source-validated recon
|
| 328 |
-
docs and Spike 001's measured cost floor. The *thesis* is credible and
|
| 329 |
-
auditable. The *implementation phase* is overclaimed.
|
| 330 |
-
|
| 331 |
-
---
|
| 332 |
-
|
| 333 |
-
## What to fix before publishing publicly (priority order)
|
| 334 |
-
|
| 335 |
-
### 1. Re-state the scorecard honestly (BLOCKER)
|
| 336 |
-
Replace 5/10 → 9/10 with **5/10 → 7/10 ✅, 1/10 ⚠️, 2/10 ❌-spirit.**
|
| 337 |
-
List the spirit-failures explicitly (test 6 trainer integration, test 8
|
| 338 |
-
end-to-end, test 10 non-author). Single most important fix; everything
|
| 339 |
-
else compounds on the inflated scorecard.
|
| 340 |
-
|
| 341 |
-
### 2. Fix Spike 008's V2 claim (BLOCKER)
|
| 342 |
-
Either (a) add a real two-replica multiprocessing test (ADR-003 says
|
| 343 |
-
this is feasible; the spike claims it isn't — reconcile), or (b) mark
|
| 344 |
-
V2 as ⚠️ partial and rewrite BACKLOG: "machinery fires on one replica,
|
| 345 |
-
sign convention pinned; cross-replica convergence deferred to GPU
|
| 346 |
-
phase." Pick one.
|
| 347 |
-
|
| 348 |
-
### 3. Strengthen Spike 006 against the tautology critique
|
| 349 |
-
Two cheap wins:
|
| 350 |
-
- Test that loss decreases on **two alternating fixed batches** over 10
|
| 351 |
-
rounds (not just one memorized batch).
|
| 352 |
-
- Test where **`alpha_sdpo=10.0` and SDPO actually fires** (truncate
|
| 353 |
-
ctx_teacher to T_s tokens for matching shape). The SDPO channel is
|
| 354 |
-
*not exercised on a real HF model anywhere* in the codebase. Largest
|
| 355 |
-
evidence gap for V8.
|
| 356 |
-
|
| 357 |
-
### 4. Run Spike 002a-mini on the local 5090
|
| 358 |
-
ADR-001 made the choice; the spike was not run. Either drop the ADR
|
| 359 |
-
(decision deferred) or run the spike (~30 min wall-clock per ADR's own
|
| 360 |
-
estimate). Until then, the framework has zero GPU evidence of any kind.
|
| 361 |
-
|
| 362 |
-
### 5. Fix the run.log / verdict.md numerical inconsistency
|
| 363 |
-
Quickstart run.log shows step-1=0.0379; spike verdict shows step-1=0.2090.
|
| 364 |
-
Either pin the seed properly or document non-reproducibility and quote
|
| 365 |
-
a band rather than exact numbers.
|
| 366 |
-
|
| 367 |
-
### 6. Acknowledge Claude Code JSONL's circularity in ADR-002
|
| 368 |
-
Add a "Risks accepted" entry naming the data-leakage concern: training
|
| 369 |
-
on Claude's outputs while Claude is in the teacher pool produces a
|
| 370 |
-
biased disagreement signal. Spike 007 README should also flag it.
|
| 371 |
-
|
| 372 |
-
### 7. Decide what `compose_loss` and `build_batch` are
|
| 373 |
-
Either rename to `compose_loss_smoke` (and keep
|
| 374 |
-
`ComposerReplicationTrainer._compute_loss` as production), or make
|
| 375 |
-
`compose_loss` actually production-grade and demote `build_batch` out
|
| 376 |
-
of public API. Production-disclaimed harness as the package's headline
|
| 377 |
-
import is confusing.
|
| 378 |
-
|
| 379 |
-
### 8. Eliminate dual sources of truth
|
| 380 |
-
`spikes/006-…/compose_loss.py` ↔ `composer_replication/loss.py`, and
|
| 381 |
-
`spikes/008-…/composer_diloco.py` ↔ `composer_replication/diloco/__init__.py`.
|
| 382 |
-
Make the spike import from the package; delete the duplicate.
|
| 383 |
-
|
| 384 |
-
### 9. Add the missing real-trace end-to-end test in Spike 007
|
| 385 |
-
Take ingester output → Spike 005 data collator → 1 step of `compose_loss`.
|
| 386 |
-
This is BACKLOG acceptance #3; ~50 lines of test code closes V5's
|
| 387 |
-
spirit gap.
|
| 388 |
-
|
| 389 |
-
### 10. Fix the sign-convention docstring in `composer_diloco.py`
|
| 390 |
-
Replace the incoherent "wrong-sign + SGD subtract = right answer with
|
| 391 |
-
momentum" gloss with: *"DiLoCo defines pseudo-gradient as
|
| 392 |
-
`θ_initial − θ_local`; this is the negative of the local update
|
| 393 |
-
direction, and standard SGD subtracts gradients, so the outer step
|
| 394 |
-
moves in the local-update direction. No negation required."* The test
|
| 395 |
-
is correct; the prose explaining it isn't.
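
The sign argument in item 10 can be checked with a scalar example. This is only a sketch under stated assumptions: plain Python floats stand in for parameters, the outer optimizer is assumed to be vanilla SGD with learning rate 1.0 and no momentum, and every name below is hypothetical rather than taken from `composer_diloco.py` or torchft.

```python
# Hypothetical scalar illustration of the DiLoCo outer step; not the torchft API.
theta_initial = 1.0   # parameters at the start of the outer round
theta_local = 0.7     # parameters after the inner steps on one replica

# DiLoCo pseudo-gradient as defined in the text: initial minus local.
pseudo_grad = theta_initial - theta_local

# Plain SGD outer step (assumed lr=1.0, no momentum): subtract the gradient.
outer_lr = 1.0
theta_next = theta_initial - outer_lr * pseudo_grad

local_update = theta_local - theta_initial   # direction the replica moved
outer_update = theta_next - theta_initial    # direction the outer step moved
same_direction = (local_update * outer_update) > 0
```

With these assumptions the outer step lands on the local parameters, so no extra negation is needed; momentum or a different learning rate changes the step size, not the sign.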
|
| 396 |
-
|
| 397 |
-
---
|
| 398 |
-
|
| 399 |
-
## Credit where due
|
| 400 |
-
|
| 401 |
-
- **Spike 007's `ClaudeCodeIngester`** is real, working, well-tested
|
| 402 |
-
software with non-trivial logic (sidechain skip, thinking-block
|
| 403 |
-
strip-on-replay, malformed-line tolerance). The synthetic fixture
|
| 404 |
-
exercises the structural cases properly.
|
| 405 |
-
- **Spike 008's pseudogradient-sign-convention test** is the single
|
| 406 |
-
best test in all of Wave 7-10. It pins a known torchft hazard with an
|
| 407 |
-
explicit arithmetic prediction and a `wrong_sign_diff` reported on
|
| 408 |
-
failure.
|
| 409 |
-
- **Spike 006's α=0 / β=0 ablation tests** would catch real regressions
|
| 410 |
-
and document channel-disable semantics.
|
| 411 |
-
- **All three ADRs are properly traceable to recon documents**
|
| 412 |
-
(MODAL_RECONNAISSANCE, TRACE_SOURCE_RECONNAISSANCE,
|
| 413 |
-
DILOCO_RECONNAISSANCE). The decisions can be challenged; the *process*
|
| 414 |
-
is auditable, which is rare.
|
| 415 |
-
- **Package layout** (`loss`, `batch`, `opsd`, `teacher_replay`,
|
| 416 |
-
`ingestion/claude_code`, `diloco`, `trainer`) is sane; optional
|
| 417 |
-
extras correctly avoid forcing TRL/torchft on every install.
|
| 418 |
-
|
| 419 |
-
The work product is not zero. It is overclaimed by roughly one
|
| 420 |
-
scorecard tier and one BACKLOG acceptance criterion. Fixing items
|
| 421 |
-
1, 2, 3, 5 above moves the framework from "publishable with a generous
|
| 422 |
-
reviewer" to "publishable with a critical reviewer." Items 4 and 6
|
| 423 |
-
move it from "research replication" to "evidenced research replication."
|
|
|
|
| 1 |
# Wave 7–10 Final Review — Cross-model adversarial check
|
| 2 |
|
| 3 |
+
> **📦 Archived (2026-06-08).** This point-in-time wave review has been moved to
|
| 4 |
+
> [`docs/research/_archive/WAVE_7_10_FINAL_REVIEW.md`](_archive/WAVE_7_10_FINAL_REVIEW.md).
|
| 5 |
+
> It is preserved verbatim for provenance but is superseded by the current
|
| 6 |
+
> [`docs/METHODOLOGY.md`](../METHODOLOGY.md), [`BACKLOG.md`](../../BACKLOG.md), and
|
| 7 |
+
> [`docs/V1_V8_COVERAGE.md`](../V1_V8_COVERAGE.md). See
|
| 8 |
+
> [`docs/research/_archive/README.md`](_archive/README.md) for the archive index.
|
docs/research/_archive/README.md
ADDED
|
@@ -0,0 +1,39 @@
|
| 1 |
+
# docs/research/_archive — historical research reviews
|
| 2 |
+
|
| 3 |
+
This directory holds **research-flavored, point-in-time** documents: the
|
| 4 |
+
cross-model adversarial wave reviews and recon audits that critiqued the
|
| 5 |
+
framework at a specific commit. They are preserved **verbatim for provenance**
|
| 6 |
+
and are **not** maintained as current truth.
|
| 7 |
+
|
| 8 |
+
> **What's current instead.** For live research/methodology, read
|
| 9 |
+
> [`docs/METHODOLOGY.md`](../../METHODOLOGY.md), the standing reconnaissance and
|
| 10 |
+
> landscape notes still in [`docs/research/`](../), and the accepted ADRs under
|
| 11 |
+
> [`docs/adrs/`](../../adrs/README.md). For the live framework state, see
|
| 12 |
+
> [`docs/OVERVIEW.md`](../../OVERVIEW.md) and [`BACKLOG.md`](../../../BACKLOG.md).
|
| 13 |
+
> Where an archived review and a current doc disagree, the current doc wins.
|
| 14 |
+
|
| 15 |
+
Each archived `WAVE_*_FINAL_REVIEW.md` file's original path under
|
| 16 |
+
`docs/research/` still contains a short **redirect stub** so older prose
|
| 17 |
+
references (e.g. ADR-007's citation of Wave-13 "Finding 2") keep resolving.
|
| 18 |
+
|
| 19 |
+
## Contents
|
| 20 |
+
|
| 21 |
+
| Archived file | Original path | What it is | Superseded by |
|
| 22 |
+
|---|---|---|---|
|
| 23 |
+
| [`WAVE_7_10_FINAL_REVIEW.md`](WAVE_7_10_FINAL_REVIEW.md) | `docs/research/WAVE_7_10_FINAL_REVIEW.md` | Cross-model adversarial check of Waves 7–10 | `docs/METHODOLOGY.md`; `docs/V1_V8_COVERAGE.md` |
|
| 24 |
+
| [`WAVE_13_FINAL_REVIEW.md`](WAVE_13_FINAL_REVIEW.md) | `docs/research/WAVE_13_FINAL_REVIEW.md` | Wave 13 adversarial cross-model review (ADR-007 cites Finding 2) | ADR-007; `BACKLOG.md` |
|
| 25 |
+
| [`WAVE_14_FINAL_REVIEW.md`](WAVE_14_FINAL_REVIEW.md) | `docs/research/WAVE_14_FINAL_REVIEW.md` | Wave 14 adversarial cross-model review | `docs/TROUBLESHOOTING.md`; `BACKLOG.md` |
|
| 26 |
+
| [`WAVE_15_FINAL_REVIEW.md`](WAVE_15_FINAL_REVIEW.md) | `docs/research/WAVE_15_FINAL_REVIEW.md` | Wave 15 multi-angle self-critique + fix wave | `docs/V1_V8_COVERAGE.md`; `BACKLOG.md` |
|
| 27 |
+
| [`WAVE_16_RECON_AUDIT.md`](WAVE_16_RECON_AUDIT.md) | (already archived here) | Wave 16 reconnaissance audit of the recon/landscape docs | the recon/landscape docs it audited |
|
| 28 |
+
|
| 29 |
+
## Why archive instead of delete
|
| 30 |
+
|
| 31 |
+
These reviews are the framework's adversarial audit trail — they record which
|
| 32 |
+
claims were challenged, which were fixed, and which were left open. Several
|
| 33 |
+
accepted ADRs cite them by path. Archiving (rather than deleting) preserves the
|
| 34 |
+
provenance and keeps those citations resolvable, while signalling that the
|
| 35 |
+
reviews are snapshots superseded by the current methodology and coverage docs.
|
| 36 |
+
|
| 37 |
+
> Non-research point-in-time artifacts (wave logs, the dated cross-family /
|
| 38 |
+
> final-verify review bundles) live in the sibling
|
| 39 |
+
> [`docs/_archive/`](../../_archive/README.md) instead.
|
docs/research/_archive/WAVE_13_FINAL_REVIEW.md
ADDED
|
@@ -0,0 +1,239 @@
|
| 1 |
+
# Wave 13 Adversarial Cross-Model Review
|
| 2 |
+
|
| 3 |
+
**Reviewer:** Claude Opus 4.7 (sub-agent via delegate_task)
|
| 4 |
+
**Date:** 2026-05-26
|
| 5 |
+
**Scope:** Wave 13 additions only (35 new tests, 4 ADRs, 6 new modules)
|
| 6 |
+
**Method:** Read-and-grep audit + targeted test runs (CPU)
|
| 7 |
+
|
| 8 |
+
## Top-line verdict
|
| 9 |
+
|
| 10 |
+
**CONDITIONAL PASS with two BLOCKERs.** Wave 13 substantially advances
|
| 11 |
+
the brief expansion (serverless DiLoCo abstraction, replaysim
|
| 12 |
+
normalization, three distillation losses, PRIME-RL recipe, Monarch
|
| 13 |
+
tie-in). The **distillation losses are the strongest deliverable** —
|
| 14 |
+
real, well-tested, mathematically faithful to the cited papers. The
|
| 15 |
+
serverless-DiLoCo local executor + ObjectStoreAllReduce barrier are
|
| 16 |
+
also genuine and exercised by 3 real multi-process tests.
|
| 17 |
+
|
| 18 |
+
**However, two material claims are not test-validated, and one new
|
| 19 |
+
module silently produces a degenerate loss in its primary code path.**
|
| 20 |
+
ADR claims that say "X is added to compose_loss" describe code that
|
| 21 |
+
wasn't actually written. The MockManager → DiLoCo "drop-in" is
|
| 22 |
+
unverified end-to-end.
|
| 23 |
+
|
| 24 |
+
Wave 11's reviewer found 2 genuine BLOCKERs. This review finds **2
|
| 25 |
+
BLOCKERs + 4 SUGGESTIONs + 2 NITs**.
|
| 26 |
+
|
| 27 |
+
---
|
| 28 |
+
|
| 29 |
+
## Finding 1 — BLOCKER: PRIME-RL `composer_loss.loss_fn` SDPO term is mathematically degenerate (always 0)
|
| 30 |
+
|
| 31 |
+
**Severity:** BLOCKER
|
| 32 |
+
**Evidence:** `composer_replication/recipes/prime_rl/composer_loss.py:79-86`
|
| 33 |
+
|
| 34 |
+
The PRIME-RL composer-loss adapter applies `unsqueeze(-1)` to `(B, T)`
|
| 35 |
+
log-prob tensors before passing them to `generalized_jsd_loss`, which
|
| 36 |
+
calls `F.log_softmax(..., dim=-1)`. Softmax of a single-element vector
|
| 37 |
+
is exactly 1.0; its log is 0. Therefore both `student_log_probs` and
|
| 38 |
+
`teacher_log_probs` are identically zero, the JSD between them is 0,
|
| 39 |
+
and the SDPO contribution **is always 0 regardless of `alpha_sdpo` or
|
| 40 |
+
the actual log-prob values.**
|
| 41 |
+
|
| 42 |
+
```python
|
| 43 |
+
>>> import torch, torch.nn.functional as F
|
| 44 |
+
>>> F.log_softmax(torch.randn(2, 3, 1), dim=-1)
|
| 45 |
+
tensor([[[0.],[0.],[0.]],[[0.],[0.],[0.]]])
|
| 46 |
+
```
|
| 47 |
+
|
| 48 |
+
The docstring calls this "a deliberate approximation," but it is not
|
| 49 |
+
an approximation — it's a mathematically degenerate operation that
|
| 50 |
+
silently disables channel 2.
|
| 51 |
+
|
| 52 |
+
**Fix direction:**
|
| 53 |
+
- Gate the SDPO branch behind `len(trainer_lp.shape) >= 3`, raising
|
| 54 |
+
`NotImplementedError` until PRIME-RL surfaces full logits.
|
| 55 |
+
- Update `prime_rl_recipe.md` and ADR-006 to stop claiming PRIME-RL
|
| 56 |
+
has working SDPO; mark it deferred.
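
The degeneracy in Finding 1 does not depend on PyTorch. The sketch below uses a hand-written pure-Python `log_softmax` as a stand-in for `F.log_softmax` (an assumption made for illustration; it is not the adapter's code) and treats each per-token log-prob as a size-1 "vocabulary" axis, which is what `unsqueeze(-1)` produces.

```python
import math

def log_softmax(values):
    """Numerically stable log-softmax over a plain Python list."""
    peak = max(values)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in values))
    return [v - log_norm for v in values]

# Two very different per-token log-probs, each wrapped as a length-1 vector,
# mirroring what unsqueeze(-1) does to a (B, T) tensor.
student = log_softmax([-0.1])
teacher = log_softmax([-7.5])

# With real distributions over three tokens the outputs do differ.
student_full = log_softmax([2.0, 0.5, -1.0])
teacher_full = log_softmax([-1.0, 0.5, 2.0])
```

Both length-1 results collapse to zero whatever the inputs were, so any divergence computed between them is zero as well. Only the full-vocabulary case carries a signal.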
|
| 57 |
+
|
| 58 |
+
---
|
| 59 |
+
|
| 60 |
+
## Finding 2 — BLOCKER: ADR-007 declares `compose_loss` kwargs that were never added
|
| 61 |
+
|
| 62 |
+
**Severity:** BLOCKER
|
| 63 |
+
**Evidence:**
|
| 64 |
+
- `docs/adrs/ADR-007-self-distillation-losses.md:103-108` claims:
|
| 65 |
+
> `composer_replication.compose_loss` gets new optional kwargs:
|
| 66 |
+
> - `dpo_variant: Literal["dpo", "simpo"] = "dpo"` — switches channel 3
|
| 67 |
+
> - `sdpo_wrapper: Literal["none", "taid", "entropy_opd"] = "none"` — wraps channel 2
|
| 68 |
+
> - `taid_schedule_step: int | None = None`
|
| 69 |
+
> - `taid_total_steps: int | None = None`
|
| 70 |
+
- `composer_replication/loss.py:54-65` actual signature has **none**
|
| 71 |
+
of these. `grep -n "dpo_variant\|sdpo_wrapper\|taid"
|
| 72 |
+
composer_replication/loss.py` returns empty.
|
| 73 |
+
|
| 74 |
+
The new losses live in `composer_replication.distillation` as
|
| 75 |
+
standalone functions but **are not wired into the framework's actual
|
| 76 |
+
loss composition.** A user reading ADR-007 + the README would believe
|
| 77 |
+
`compose_loss(model, inputs, dpo_variant="simpo", sdpo_wrapper="taid", ...)`
|
| 78 |
+
works; it would raise `TypeError`. The 17 distillation tests verify
|
| 79 |
+
the standalone losses but never exercise integration.
|
| 80 |
+
|
| 81 |
+
**Fix direction:**
|
| 82 |
+
- Either (a) add the kwargs to `compose_loss` and write at least one
|
| 83 |
+
integration test combining e.g. SDPO+TAID (~30 LOC change), or
|
| 84 |
+
- (b) downgrade ADR-007 status to "Standalone losses landed;
|
| 85 |
+
integration deferred to Wave 14."
|
| 86 |
+
|
| 87 |
+
---
|
| 88 |
+
|
| 89 |
+
## Finding 3 — SUGGESTION: `default.yaml` replaysim recipe uses string ops on list-of-dict fields
|
| 90 |
+
|
| 91 |
+
**Severity:** SUGGESTION (would be BLOCKER if a test exercised the real path)
|
| 92 |
+
**Evidence:**
|
| 93 |
+
- `composer_replication/recipes/replaysim/default.yaml` configures
|
| 94 |
+
`text_length_filter`, `words_num_filter`, `special_characters_filter`,
|
| 95 |
+
`document_deduplicator` with `text_keys: ["chosen", "rejected"]`.
|
| 96 |
+
- In the record produced by `_dpo_pair_to_dj_record`, `chosen` and
|
| 97 |
+
`rejected` are **lists of dicts**
|
| 98 |
+
(`[{"role": "assistant", "content": "..."}]`) — not strings.
|
| 99 |
+
- data-juicer's `text_length_filter` expects string-typed fields;
|
| 100 |
+
running it on a list will either crash or no-op silently.
|
| 101 |
+
|
| 102 |
+
The reason no test catches this: tests only validate the real path *if
|
| 103 |
+
data-juicer is installed*, and even then only check `__init__` succeeds.
|
| 104 |
+
There is no test that calls `normalize()` against a real data-juicer
|
| 105 |
+
executor with the default recipe.
|
| 106 |
+
|
| 107 |
+
**Fix direction:**
|
| 108 |
+
- Reshape `_dpo_pair_to_dj_record` to extract `content` strings
|
| 109 |
+
alongside the messages-format list.
|
| 110 |
+
- Add one test (skip-marked unless `data_juicer` is importable) that
|
| 111 |
+
runs the real op-graph on 3 hand-crafted records.
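
One possible shape for the first fix is sketched below. It assumes records carry `chosen` and `rejected` as messages-format lists, as described in the evidence. The helper names and the `chosen_text` / `rejected_text` keys are hypothetical, and the sketch does not call data-juicer itself.

```python
def flatten_messages(messages):
    """Join the `content` of each chat message into one plain string."""
    return "\n".join(m.get("content", "") for m in messages)

def add_text_fields(record):
    """Return a copy of the record with string-typed mirrors of chosen/rejected.

    `chosen_text` and `rejected_text` are hypothetical field names.
    """
    out = dict(record)
    out["chosen_text"] = flatten_messages(record["chosen"])
    out["rejected_text"] = flatten_messages(record["rejected"])
    return out

record = {
    "chosen": [{"role": "assistant", "content": "def f(n): return n + 1"}],
    "rejected": [{"role": "assistant", "content": "pass"}],
}
flat = add_text_fields(record)
```

The original messages lists are kept alongside the string mirrors, so downstream DPO code can still read them. The recipe's string filters would then need to point at the string-typed keys; how that is configured is outside this sketch.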
|
| 112 |
+
|
| 113 |
+
---
|
| 114 |
+
|
| 115 |
+
## Finding 4 — SUGGESTION: MockManager → torchft.DiLoCo "drop-in" claim is unverified end-to-end
|
| 116 |
+
|
| 117 |
+
**Severity:** SUGGESTION
|
| 118 |
+
**Evidence:**
|
| 119 |
+
- `composer_replication/diloco/serverless/allreduce.py:188-191` claims
|
| 120 |
+
MockManager "drops into" `make_diloco_outer_loop`.
|
| 121 |
+
- The only test covering MockManager (`test_mock_manager_shape_compat`)
|
| 122 |
+
is a `hasattr` smoke that calls `.allreduce` on a `world_size=1`
|
| 123 |
+
store (passthrough).
|
| 124 |
+
- torchft.Manager has additional surface area
|
| 125 |
+
(`current_step`, `is_leader`, `_pg`, `report_error`,
|
| 126 |
+
internal step accounting) that DiLoCo's `_apply_pseudogradient`
|
| 127 |
+
may consult depending on version.
|
| 128 |
+
|
| 129 |
+
**Fix direction:**
|
| 130 |
+
- Add a single integration test that constructs
|
| 131 |
+
`make_diloco_outer_loop(manager=MockManager(store), ...)` against a
|
| 132 |
+
tiny `nn.Linear` and runs one outer round — even single-process.
|
| 133 |
+
- Audit `torchft/local_sgd.py` for the `Manager`-rooted call sites and
|
| 134 |
+
add stubs for any methods DiLoCo actually consults beyond `allreduce`.
|
| 135 |
+
|
| 136 |
+
---
|
| 137 |
+
|
| 138 |
+
## Finding 5 — SUGGESTION: README claim "9 multi-process tests" is mildly inflated
|
| 139 |
+
|
| 140 |
+
**Severity:** SUGGESTION (NIT bordering)
|
| 141 |
+
**Evidence:**
|
| 142 |
+
- README.md and V1_V8_COVERAGE both state: *"9 multi-process tests
|
| 143 |
+
pinning the allreduce barrier."*
|
| 144 |
+
- Actual breakdown:
|
| 145 |
+
- 4 single-process unit tests + `test_mock_manager_shape_compat` (5)
|
| 146 |
+
- 4 multi-process tests spawning subprocesses (parametrized [2,3] of
|
| 147 |
+
`_runs_allreduce_across_replicas`, `_handles_multiple_rounds`,
|
| 148 |
+
`_reports_failed_replicas`)
|
| 149 |
+
- Of the 4 multi-process tests, only **3 actually exercise the
|
| 150 |
+
allreduce barrier**; `_reports_failed_replicas` deliberately raises
|
| 151 |
+
before any allreduce call.
|
| 152 |
+
|
| 153 |
+
**Wave 13 clearly does NOT fake-pass via world_size=1** — the multi-
|
| 154 |
+
process barrier is real. But the count is rounded up.
|
| 155 |
+
|
| 156 |
+
**Fix direction:** Replace "9 multi-process tests" with "9 tests
|
| 157 |
+
covering the serverless DiLoCo layer, of which 4 spawn real
|
| 158 |
+
subprocesses and 3 exercise the allreduce barrier across replicas."
|
| 159 |
+
|
| 160 |
+
---
|
| 161 |
+
|
| 162 |
+
## Finding 6 — SUGGESTION: PRIME-RL channel 1 is REINFORCE not GRPO; ignores `inference_logprobs`
|
| 163 |
+
|
| 164 |
+
**Severity:** SUGGESTION
|
| 165 |
+
**Evidence:** `composer_replication/recipes/prime_rl/composer_loss.py:62-68`
|
| 166 |
+
computes:
|
| 167 |
+
```python
grpo_loss = -(advantages * trainer_lp * mask).sum() / mask.sum().clamp_min(epsilon)
```
|
| 170 |
+
|
| 171 |
+
This is plain REINFORCE with advantage. PRIME-RL's `LossInputs`
|
| 172 |
+
exposes `inference_logprobs` precisely because GRPO-with-replay-buffer
|
| 173 |
+
requires the importance-sampling ratio
|
| 174 |
+
`exp(trainer_lp - inference_lp)` (PPO-style clipped objective).
|
| 175 |
+
|
| 176 |
+
The file says "SKELETON" so this isn't a hidden bug per se, but the
|
| 177 |
+
loss is **labeled GRPO and is not GRPO**.
|
| 178 |
+
|
| 179 |
+
**Fix direction:** Either implement the ratio + clipping (~20 LOC) or
|
| 180 |
+
rename channel-1 comment to "REINFORCE-with-advantage stub" with a TODO.
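
To make the distinction concrete, the following is a minimal pure-Python sketch that contrasts the REINFORCE-with-advantage term quoted above with a PPO-style clipped importance-ratio term. The function names, the `clip_eps` parameter, and its 0.2 default are illustrative assumptions; they are not taken from the recipe or from PRIME-RL.

```python
import math


def reinforce_term(advantage, trainer_lp):
    """Per-token REINFORCE-with-advantage loss: -A * log pi(a)."""
    return -advantage * trainer_lp


def clipped_ratio_term(advantage, trainer_lp, inference_lp, clip_eps=0.2):
    """Per-token PPO-style clipped surrogate loss.

    ratio = exp(trainer_lp - inference_lp) is the importance-sampling ratio.
    clip_eps is a hypothetical clipping width chosen only for illustration.
    """
    ratio = math.exp(trainer_lp - inference_lp)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Pessimistic (min) surrogate, negated so that it is a loss to minimize.
    return -min(ratio * advantage, clipped * advantage)


def masked_mean(values, mask):
    """Mean of `values` over positions where `mask` is 1."""
    total = sum(v * m for v, m in zip(values, mask))
    return total / max(sum(mask), 1e-8)
```

When the sample is exactly on-policy (`trainer_lp == inference_lp`) the ratio is 1 and the clipped term reduces to `-advantage`, whereas the REINFORCE term still scales with the log-probability itself. That difference is what the finding is pointing at.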
|
| 181 |
+
|
| 182 |
+
---
|
| 183 |
+
|
| 184 |
+
## Finding 7 — NIT: ModalExecutor / HFJobsExecutor are skeleton-only with `NotImplementedError` in `__init__`
|
| 185 |
+
|
| 186 |
+
**Severity:** NIT (this is documented, but README phrasing is slightly soft)
|
| 187 |
+
**Evidence:** Honestly documented as skeletons in the code, ADR-005,
|
| 188 |
+
and README. NIT: a user trying `ModalExecutor()` gets a runtime error
|
| 189 |
+
rather than an import-time clue.
|
| 190 |
+
|
| 191 |
+
**Fix direction:** Low priority. Update README phrase to "skeleton-only
|
| 192 |
+
— raises NotImplementedError until v0.x." Or use a `__getattr__` on
|
| 193 |
+
the package that raises a clearer message.
|
| 194 |
+
|
| 195 |
+
---
|
| 196 |
+
|
| 197 |
+
## Finding 8 — NIT: SimPO test uses positive log-probs (impossible values)
|
| 198 |
+
|
| 199 |
+
**Severity:** NIT
|
| 200 |
+
**Evidence:** `test_distillation_losses.py:27-46` calls `simpo_loss`
|
| 201 |
+
with `chosen=tensor([0.5, 0.4, 0.3])`. Log-probabilities are bounded
|
| 202 |
+
above by 0; positive values aren't possible from any softmax. The tests
|
| 203 |
+
still verify the formula correctly, but the test inputs aren't legal.
|
| 204 |
+
|
| 205 |
+
**Fix direction:** Use negative values — purely cosmetic.
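
As a hedged illustration of why positive inputs are not legal, the sketch below computes a log-softmax in pure Python and sums the chosen-token log-probabilities. The helper names `log_softmax` and `sequence_logprob` are hypothetical and are not the repository's `simpo_loss` API; they only show one way to derive test inputs that are guaranteed to be non-positive.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sequence_logprob(per_token_logits, token_ids):
    """Sum of the chosen tokens' log-probabilities (never positive)."""
    return sum(
        log_softmax(row)[tok] for row, tok in zip(per_token_logits, token_ids)
    )
```

Feeding values produced this way into the loss test keeps the formula check unchanged while making the inputs realizable by an actual model.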
|
| 206 |
+
|
| 207 |
+
---
|
| 208 |
+
|
| 209 |
+
## Cross-cutting risk check
|
| 210 |
+
|
| 211 |
+
73 tests passed in 29.29s on the CPU-fast subset. Spike 008 5/5 still
|
| 212 |
+
pass. The new `composer_replication.diloco.serverless` package is
|
| 213 |
+
purely additive; the existing `make_diloco_outer_loop` is untouched.
|
| 214 |
+
**No cross-wave regressions detected on CPU.** GPU tests + slow CPU
|
| 215 |
+
e2e tests not re-run; regression risk low since Wave 13 doesn't touch
|
| 216 |
+
their dependencies.
|
| 217 |
+
|
| 218 |
+
---
|
| 219 |
+
|
| 220 |
+
## Summary scorecard
|
| 221 |
+
|
| 222 |
+
| Item | Verdict |
|---|---|
| Distillation module (SimPO/TAID/Entropy-Aware OPD) standalone | ✅ Real, well-tested, paper-faithful |
| Distillation integrated into `compose_loss` | ❌ **Not implemented** despite ADR-007 (Finding 2) |
| ObjectStoreAllReduce + LocalProcessExecutor | ✅ Real multi-process barrier validated |
| MockManager → DiLoCo drop-in | 🟡 Shape-checked only; integration unverified (Finding 4) |
| Modal/HFJobs adapters | 🟡 Honestly documented as skeletons (Finding 7) |
| Replaysim DJNormalizer passthrough | ✅ Works |
| Replaysim default.yaml against real data-juicer | ❌ **Recipe field types don't match record shape** (Finding 3) |
| PRIME-RL composer_loss.loss_fn | ❌ **SDPO term silently 0** (Finding 1); channel 1 is REINFORCE not GRPO (Finding 6) |
| Monarch actors | ✅ Honest skeleton; raises NotImplementedError |
| Altered-minds tie-in doc | ✅ Design-only, scoped honestly |
| 35 new tests | All pass; 3 of 4 multi-process tests are genuine (Finding 5) |
|
| 235 |
+
|
| 236 |
+
**Recommendation:** Address Findings 1 and 2 before publishing the
|
| 237 |
+
Wave 13 expansion as "closed." Findings 3 and 4 should be addressed
|
| 238 |
+
before any user attempts the real data-juicer or real torchft DiLoCo
|
| 239 |
+
path. Findings 5–8 are cleanup.
|
docs/research/_archive/WAVE_14_FINAL_REVIEW.md
ADDED
|
@@ -0,0 +1,263 @@
| 1 |
+
# Wave 14 Adversarial Cross-Model Review
|
| 2 |
+
|
| 3 |
+
**Reviewer:** Claude Opus 4.7 (sub-agent via delegate_task)
|
| 4 |
+
**Date:** 2026-05-26
|
| 5 |
+
**Method:** Read every Wave 13 finding, every Wave 14 closure, all 4 doc files, **cloned PRIME-RL upstream to verify T4 claims**, ran 61 wave-relevant tests.
|
| 6 |
+
|
| 7 |
+
## Top-line verdict
|
| 8 |
+
|
| 9 |
+
**CONDITIONAL PASS with 1 BLOCKER + 4 SUGGESTIONs + 2 NITs.** Wave 14
|
| 10 |
+
closes Wave 13 BLOCKER 2 (T1 — compose_loss kwargs) and Suggestion 3
|
| 11 |
+
(T2 — replaysim) cleanly. T3 (MockManager surface audit) is solid but
|
| 12 |
+
only tests `world_size=1`. **T4 (PRIME-RL "real GRPO + DPPO") does not
|
| 13 |
+
match PRIME-RL's actual `default_loss_fn`** despite claiming to mirror
|
| 14 |
+
it; that error has been pasted into USER_GUIDE.md, INTEGRATION_RECIPES.md,
|
| 15 |
+
and API_REFERENCE.md.
|
| 16 |
+
|
| 17 |
+
Same signal-to-noise as Wave 11 + Wave 13 reviewers: 1 genuine BLOCKER.
|
| 18 |
+
|
| 19 |
+
---
|
| 20 |
+
|
| 21 |
+
## Finding 1 — BLOCKER: T4 PRIME-RL "DPPO importance-sampling-ratio clip" is neither importance sampling nor matches PRIME-RL.
|
| 22 |
+
|
| 23 |
+
**Severity:** BLOCKER
|
| 24 |
+
**Evidence:** `composer_replication/recipes/prime_rl/composer_loss.py:118-131`.
|
| 25 |
+
The implementation computes
|
| 26 |
+
```python
grpo_loss = -(advantages * trainer_lp * keep_mask).sum() / keep_mask.sum()
```
|
| 29 |
+
That's **pure REINFORCE-with-advantage** — the masking gate is the only
|
| 30 |
+
nod toward DPPO; there is no importance-sampling ratio multiplication
|
| 31 |
+
anywhere.
|
| 32 |
+
|
| 33 |
+
**Real PRIME-RL** (`/tmp/prime-rl-clone/src/prime_rl/trainer/rl/loss.py:128-153`,
|
| 34 |
+
the `default_loss_fn` on `main` as of 2026-05-26):
|
| 35 |
+
```python
log_importance_ratio = trainer_logprobs - inference_logprobs
importance_ratio = torch.exp(log_importance_ratio)
probs_diff = torch.exp(trainer_logprobs) - torch.exp(inference_logprobs)
positive_advantages = advantages > 0
dppo_invalid_mask_high = probs_diff > loss_config.dppo_mask_high
dppo_invalid_mask_low = probs_diff < -loss_config.dppo_mask_low
dppo_invalid_mask = torch.where(positive_advantages, dppo_invalid_mask_high, dppo_invalid_mask_low)
keep_mask = loss_mask & ~dppo_invalid_mask
pg_loss = -(keep_mask * advantages * importance_ratio).sum() # NO division
kl_loss = adv_tau * (log_importance_ratio**2 * keep_mask).sum() # KL term
```
|
| 47 |
+
|
| 48 |
+
**Three concrete divergences from Wave 14's implementation:**
|
| 49 |
+
|
| 50 |
+
1. **Mask gate is on `probs_diff`** (a probability-space quantity), NOT
|
| 51 |
+
`log_ratio` (a log-space quantity). These have different magnitudes:
|
| 52 |
+
`probs_diff=0.2` corresponds to `log_ratio=log(1.25)≈0.22` for a
|
| 53 |
+
trainer prob of 1.0 vs inference prob of 0.8. With our `log_ratio>4.0`
|
| 54 |
+
gate, the mask never fires for normal training distributions; PRIME-RL's
|
| 55 |
+
`probs_diff>0.2` gate fires routinely.
|
| 56 |
+
|
| 57 |
+
2. **PRIME-RL multiplies by `importance_ratio = exp(log_ratio)`**;
|
| 58 |
+
Wave 14 multiplies by `trainer_lp` directly. This is the difference
|
| 59 |
+
between actual policy-gradient correction (PRIME-RL) and naive
|
| 60 |
+
REINFORCE.
|
| 61 |
+
|
| 62 |
+
3. **PRIME-RL's mask is sign-conditioned on advantage** (positive
|
| 63 |
+
advantages clipped against `dppo_mask_high`, negative against
|
| 64 |
+
`-dppo_mask_low`); Wave 14 ORs them together unconditionally.
|
| 65 |
+
|
| 66 |
+
**Plus:** the KL term is missing entirely.
|
| 67 |
+
|
| 68 |
+
**Plus:** the defaults claimed as "PRIME-RL's defaults" — `dppo_mask_high=4.0,
|
| 69 |
+
dppo_mask_low=-4.0` — are wrong. PRIME-RL's `DefaultLossConfig`
|
| 70 |
+
(`configs/trainer.py:412-424`) sets `dppo_mask_high=0.2, dppo_mask_low=0.2`
|
| 71 |
+
with `Field(..., ge=0)` validation that would *reject* a negative value.
|
| 72 |
+
PRIME-RL's code negates at use site: `probs_diff < -loss_config.dppo_mask_low`.
|
| 73 |
+
|
| 74 |
+
**Plus:** the docstring (`composer_loss.py:32-49`), USER_GUIDE.md:599-608,
|
| 75 |
+
INTEGRATION_RECIPES.md:426-429 + 482-487, and API_REFERENCE.md:1364 all
|
| 76 |
+
repeat the wrong formula and the wrong "matches PRIME-RL" claim.
|
| 77 |
+
|
| 78 |
+
**Fix direction:** Either (a) actually mirror `default_loss_fn` (mask on
|
| 79 |
+
`probs_diff`, multiply by `importance_ratio`, add KL term, advantage-
|
| 80 |
+
conditioned mask, `.sum()` reduction with token-count returned for
|
| 81 |
+
caller-side scaling), or (b) drop the "matches PRIME-RL" framing and
|
| 82 |
+
rename to "REINFORCE-with-advantage stub + log-ratio mask" everywhere.
|
| 83 |
+
|
| 84 |
+
Wave 13 Finding 6 is **not actually closed** by Wave 14.
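
The gating logic described in this finding can be sketched in pure Python as follows. This is a list-based illustration that follows the PRIME-RL snippet quoted above, not the upstream implementation itself; the function names are hypothetical, the 0.2 thresholds are the defaults cited in this finding, and the `adv_tau=1.0` default is an assumption made only so the example runs.

```python
import math


def dppo_keep_mask(trainer_lp, inference_lp, advantages, loss_mask,
                   mask_high=0.2, mask_low=0.2):
    """Sign-conditioned mask on the probability-space difference."""
    keep = []
    for t_lp, i_lp, adv, m in zip(trainer_lp, inference_lp, advantages, loss_mask):
        probs_diff = math.exp(t_lp) - math.exp(i_lp)
        if adv > 0:
            invalid = probs_diff > mask_high    # gate for positive advantages
        else:
            invalid = probs_diff < -mask_low    # gate for the remaining tokens
        keep.append(bool(m) and not invalid)
    return keep


def pg_and_kl(trainer_lp, inference_lp, advantages, keep, adv_tau=1.0):
    """Summed policy-gradient and squared-log-ratio terms over kept tokens."""
    pg, kl = 0.0, 0.0
    for t_lp, i_lp, adv, k in zip(trainer_lp, inference_lp, advantages, keep):
        if not k:
            continue
        log_ratio = t_lp - i_lp
        pg -= adv * math.exp(log_ratio)   # multiplies by the importance ratio
        kl += adv_tau * log_ratio ** 2    # assumed coefficient, see lead-in
    return pg, kl
```

For a token with trainer probability 0.9, inference probability 0.6, and a positive advantage, `probs_diff` is 0.3 and the token is masked, even though its log-ratio is only about 0.41. A log-space threshold of 4.0 would never reach that token.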
|
| 85 |
+
|
| 86 |
+
---
|
| 87 |
+
|
| 88 |
+
## Finding 2 — SUGGESTION: ADR-007 still says Wave 14 hasn't done the integration.
|
| 89 |
+
|
| 90 |
+
**Severity:** SUGGESTION
|
| 91 |
+
**Evidence:** `docs/adrs/ADR-007-self-distillation-losses.md:104-122` reads:
|
| 92 |
+
> **Wave 14+ work — `compose_loss` integration is NOT in this wave**
> ... Wave 14 plan: add the four kwargs ...
|
| 94 |
+
|
| 95 |
+
But Wave 14 *did* add them (verified — `loss.py:80-93`). The ADR was
|
| 96 |
+
written defensively after Wave 13 review and never updated when T1 landed.
|
| 97 |
+
|
| 98 |
+
**Net effect:** a user reading ADR-007 is told the SimPO/TAID kwargs
|
| 99 |
+
don't work; a user reading USER_GUIDE/API_REFERENCE is told they do.
|
| 100 |
+
|
| 101 |
+
**Fix direction:** flip ADR-007 status section to "Closed in Wave 14 —
|
| 102 |
+
see test_compose_loss_integration.py".
|
| 103 |
+
|
| 104 |
+
---
|
| 105 |
+
|
| 106 |
+
## Finding 3 — SUGGESTION: ModalExecutor instantiation example in INTEGRATION_RECIPES is dead code.
|
| 107 |
+
|
| 108 |
+
**Severity:** SUGGESTION
|
| 109 |
+
**Evidence:** `docs/INTEGRATION_RECIPES.md:519-533` shows
|
| 110 |
+
```python
executor = ModalExecutor(app="composer-prime-rl")
executor.launch_replicas(...)
```
|
| 114 |
+
But `composer_replication/diloco/serverless/modal.py:64-66` raises
|
| 115 |
+
`NotImplementedError` from `__init__`. Same pattern in `HFJobsExecutor`.
|
| 116 |
+
The recipe doc warns about skeleton-status much further down (line 731),
|
| 117 |
+
but the inline code example at line 519 will break the moment a reader
|
| 118 |
+
copy-pastes it.
|
| 119 |
+
|
| 120 |
+
Wave 13 Finding 7 noted this softness; Wave 14 made it worse by writing
|
| 121 |
+
example code that calls a constructor that always raises.
|
| 122 |
+
|
| 123 |
+
**Fix direction:** in every code block that calls `ModalExecutor(...)`,
|
| 124 |
+
prepend a comment `# Wave 14: skeleton — raises NotImplementedError`
|
| 125 |
+
or flip examples to `LocalProcessExecutor`.
|
| 126 |
+
|
| 127 |
+
---
|
| 128 |
+
|
| 129 |
+
## Finding 4 — SUGGESTION: MockManager + DiLoCo integration test only exercises `world_size=1`.
|
| 130 |
+
|
| 131 |
+
**Severity:** SUGGESTION
|
| 132 |
+
**Evidence:** `composer_replication/diloco/serverless/tests/test_serverless_diloco_integration.py:44-51`,
|
| 133 |
+
`:108-109`, `:161`. Both `test_mockmanager_diloco_outer_round_completes`
|
| 134 |
+
and `test_mockmanager_diloco_two_outer_rounds_step_counter` use
|
| 135 |
+
`world_size=1`.
|
| 136 |
+
|
| 137 |
+
With one replica, `ObjectStoreAllReduce.allreduce` returns the tensor
|
| 138 |
+
unchanged (its own mean), so an averaging bug in the multi-replica path
|
| 139 |
+
could not be caught by this test. The pseudo-gradient sign convention
|
| 140 |
+
is pinned by the unrelated spike-008 test, but **no test combines
|
| 141 |
+
MockManager + DiLoCo + multi-process** — i.e. the actual deployment
|
| 142 |
+
scenario is unverified end-to-end.
|
| 143 |
+
|
| 144 |
+
Wave 13 Finding 4 is closed in spirit (call surface is now exhaustive)
|
| 145 |
+
but not in the deepest sense.
|
| 146 |
+
|
| 147 |
+
**Fix direction:** add one multi-process test that spawns `n_replicas`
|
| 148 |
+
subprocesses, each constructing `MockManager(store) → make_diloco_outer_loop`,
|
| 149 |
+
and asserts that after one outer round all replicas converge to the same
|
| 150 |
+
parameter values (i.e. averaging actually happened).
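
A toy illustration of why a single-replica test cannot catch an averaging bug: the two functions below are hypothetical stand-ins (they are not `ObjectStoreAllReduce`), one computing an element-wise mean across replicas and one deliberately wrong. They agree with one replica and disagree with two.

```python
def allreduce_mean(replica_tensors):
    """Element-wise mean across replicas; each replica is a list of floats."""
    n = len(replica_tensors)
    return [sum(column) / n for column in zip(*replica_tensors)]


def buggy_allreduce_first(replica_tensors):
    """Deliberately wrong: returns replica 0's values instead of the mean."""
    return list(replica_tensors[0])


one_replica = [[1.0, 2.0]]
two_replicas = [[1.0, 2.0], [3.0, 6.0]]

# With world_size=1 the bug is invisible: both functions return the input.
same_at_one = allreduce_mean(one_replica) == buggy_allreduce_first(one_replica)
# With world_size=2 the outputs diverge, so an assertion can detect the bug.
same_at_two = allreduce_mean(two_replicas) == buggy_allreduce_first(two_replicas)
```

A multi-process test of the kind proposed above plays the role of the `two_replicas` case: it only has discriminating power once at least two replicas contribute different values.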
|
| 151 |
+
|
| 152 |
+
---
|
| 153 |
+
|
| 154 |
+
## Finding 5 — SUGGESTION: T4 unit tests pin the wrong implementation as ground truth.
|
| 155 |
+
|
| 156 |
+
**Severity:** SUGGESTION
|
| 157 |
+
**Evidence:** `composer_replication/recipes/prime_rl/tests/test_composer_loss.py:90-128`
|
| 158 |
+
(`test_dppo_mask_clips_extreme_ratios`). The expected value `1.5/3` is
|
| 159 |
+
computed against the buggy formula (Finding 1).
|
| 160 |
+
|
| 161 |
+
The 10 PRIME-RL tests all pass — but they're testing self-consistency,
|
| 162 |
+
not parity with PRIME-RL. A reader looking at "10 unit tests, all green"
|
| 163 |
+
infers correctness; correctness is not what they verify. This is the
|
| 164 |
+
kind of test honesty failure that Wave 11 + Wave 13 reviewers found in
|
| 165 |
+
different forms.
|
| 166 |
+
|
| 167 |
+
**Fix direction:** add at least one test whose expected value is
|
| 168 |
+
hand-computed from `default_loss_fn` in PRIME-RL (or import + invoke
|
| 169 |
+
`default_loss_fn` if the dependency is available, mark the test
|
| 170 |
+
`@pytest.mark.skipif(not _HAS_PRIME_RL)`).
|
| 171 |
+
|
| 172 |
+
---
|
| 173 |
+
|
| 174 |
+
## Finding 6 — NIT: README/test-count drift.
|
| 175 |
+
|
| 176 |
+
Wave 14 task description claims "124 tests passing as of Wave 14"; actual
|
| 177 |
+
`pytest --collect-only` reports **134 collected**. Of those, the 61-test
|
| 178 |
+
wave-relevant subset all pass. Not a defect, but the headline number is
|
| 179 |
+
now off in the same way Wave 13's "9 multi-process tests" was off.
|
| 180 |
+
|
| 181 |
+
---
|
| 182 |
+
|
| 183 |
+
## Finding 7 — NIT: `loss_fn` docstring claims "DPPO importance-sampling-ratio clipping — implemented" (`composer_loss.py:9`).
|
| 184 |
+
|
| 185 |
+
Implementation contains no importance-ratio multiplication anywhere.
|
| 186 |
+
Even if Finding 1 is rejected and the team decides "PRIME-RL match isn't
|
| 187 |
+
a goal", the docstring is internally false: it announces ISR clipping
|
| 188 |
+
in a function that does not multiply by `exp(log_ratio)`.
|
| 189 |
+
|
| 190 |
+
---
|
| 191 |
+
|
| 192 |
+
## Cross-cutting
|
| 193 |
+
|
| 194 |
+
The four doc subagents wrote internally consistent text but inherited
|
| 195 |
+
T4's mathematical error. **Three of the four doc files repeat the same
|
| 196 |
+
wrong formula verbatim.** This is exactly the failure mode Wave 11/13
|
| 197 |
+
reviewers flagged: parallel subagents cross-citing each other rather
|
| 198 |
+
than the upstream source of truth.
|
| 199 |
+
|
| 200 |
+
The 61 tests in the Wave-14-touched dirs pass cleanly. T1, T2, and T3
|
| 201 |
+
are real closures with real coverage. The framework is in a **better**
|
| 202 |
+
state than end-of-Wave-13 — but it has not actually closed Wave 13
|
| 203 |
+
Finding 6, and it has propagated a subtler version of the same
|
| 204 |
+
mathematical-mismatch bug into the user-facing documentation.
|
| 205 |
+
|
| 206 |
+
---
|
| 207 |
+
|
| 208 |
+
## Summary scorecard
|
| 209 |
+
|
| 210 |
+
| Wave 13 Finding | Wave 14 status | Verdict |
|---|---|---|
| BLOCKER 1 (PRIME-RL SDPO degenerate) | Fixed parent-side; channel 2 raises NotImplementedError | ✅ closed |
| BLOCKER 2 (compose_loss kwargs not added) | T1 added them + 11 integration tests | ✅ closed |
| Suggestion 3 (replaysim YAML field types) | T2 dual-shape reshape + real DJ e2e + caught related bug | ✅ closed |
| Suggestion 4 (MockManager → DiLoCo gap) | T3 surface audit + integration test | 🟡 closed for `world_size=1`; multi-process unverified |
| Suggestion 5 ("9 multi-process tests" inflated count) | Not addressed | 🟡 carried over |
| Suggestion 6 (PRIME-RL channel 1 REINFORCE not GRPO) | T4 thought it closed this | ❌ **NOT closed — mathematically wrong** |
| Suggestion 7 (Modal/HFJobs skeleton clarity) | Made worse by INTEGRATION_RECIPES dead code | 🟡 regression |
| NIT 8 (SimPO test positive log-probs) | Not addressed | 🟡 carried over |
|
| 220 |
+
|
| 221 |
+
## Wave 14b follow-up (2026-05-26)
|
| 222 |
+
|
| 223 |
+
After Wave 14b closed Finding 1 by re-reading PRIME-RL upstream and
|
| 224 |
+
matching `default_loss_fn` byte-for-byte, the Wave 14b subagent flagged
|
| 225 |
+
a **new** structural issue not in the Wave 14 review:
|
| 226 |
+
|
| 227 |
+
**PRIME-RL's `setup_loss_fns` (upstream `loss.py:320-327`) expects the
|
| 228 |
+
custom loss function to return `LossOutputs(loss, metrics={...})`, not
|
| 229 |
+
a bare scalar tensor.** Our recipe still returns a bare scalar. This
|
| 230 |
+
predates Wave 14 (it's been wrong since the recipe was first written in
|
| 231 |
+
Wave 13) but was never caught because no test runs against actual
|
| 232 |
+
PRIME-RL.
|
| 233 |
+
|
| 234 |
+
**Status:** documented; deferred to Wave 15. Not blocking for Wave 14b's
|
| 235 |
+
closure of Finding 1, because the formula now matches upstream — the
|
| 236 |
+
return-shape mismatch is a separate adapter-level issue. Tests still
|
| 237 |
+
pass because they invoke our `loss_fn` directly without going through
|
| 238 |
+
PRIME-RL's `compute_loss` pipeline.
|
| 239 |
+
|
| 240 |
+
**Fix direction (Wave 15):** wrap the return value in a duck-typed
|
| 241 |
+
`LossOutputs` (provided by PRIME-RL when installed; substituted with a
|
| 242 |
+
NamedTuple shim when not). Add an integration smoke test against PRIME-RL
|
| 243 |
+
to catch this and similar adapter-shape regressions.
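
One possible shape for the NamedTuple shim mentioned above is sketched here. It assumes only what this review states, namely that the expected return value looks like `LossOutputs(loss, metrics={...})`; the names `LossOutputsShim` and `wrap_loss` are hypothetical, and the real import path of PRIME-RL's `LossOutputs` is intentionally not shown because it is not given in this document.

```python
from typing import Any, Dict, NamedTuple


class LossOutputsShim(NamedTuple):
    """Hypothetical stand-in mirroring the `LossOutputs(loss, metrics)` shape."""

    loss: Any                 # scalar loss (a tensor in the real recipe)
    metrics: Dict[str, Any]   # named diagnostics reported alongside the loss


def wrap_loss(loss, **metrics):
    """Wrap a bare loss value so callers can read `.loss` and `.metrics`."""
    return LossOutputsShim(loss=loss, metrics=dict(metrics))
```

Because a NamedTuple supports both attribute access and positional unpacking, a caller written against either style would work with the shim. Whether that matches the real pipeline's expectations is something only the proposed integration smoke test can confirm.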
|
| 244 |
+
|
| 245 |
+
## Final Wave 14 + 14b status
|
| 246 |
+
|
| 247 |
+
| Wave 13 / 14 finding | Wave 14b status |
|---|---|
| W13 BLOCKER 1: PRIME-RL SDPO degenerate | ✅ closed (parent, channel 2 deferred) |
| W13 BLOCKER 2: compose_loss kwargs not added | ✅ closed (Wave 14 T1) |
| W13 Suggestion 3: replaysim YAML field types | ✅ closed (Wave 14 T2) |
| W13 Suggestion 4: MockManager → DiLoCo gap | ✅ closed (Wave 14 T3 + Wave 14b multi-process test) |
| W13 Suggestion 6: PRIME-RL channel 1 REINFORCE not GRPO | ✅ **closed in Wave 14b** (matches upstream `default_loss_fn`) |
| W14 Finding 1: PRIME-RL impl wrong | ✅ closed in Wave 14b |
| W14 Finding 2: ADR-007 stale | ✅ closed in Wave 14b |
| W14 Finding 3: ModalExecutor dead code | ✅ closed in Wave 14b |
| W14 Finding 4: world_size=1 only | ✅ closed in Wave 14b (multi-process convergence test) |
| W14 Finding 5: tests pin wrong impl as ground truth | ✅ closed in Wave 14b (parity test added) |
| W14 NIT 6: test count drift | 🟡 carried |
| W14 NIT 7: docstring claims ISR clipping | ✅ closed in Wave 14b (real ISR now implemented) |
| **NEW (Wave 14b)**: PRIME-RL `LossOutputs` return shape | 🟡 deferred to Wave 15 |
|
| 262 |
+
|
| 263 |
+
**Tests as of Wave 15: 115 passing + 1 skip-marked (OPSD parity test, runs when upstream is cloned).** (Wave 12: 72; Wave 13: 93; Wave 14: 124; Wave 14b: 130; Wave 15: 115 after TAID rewrite consolidation + OPSD parity.)
|
docs/research/_archive/WAVE_15_FINAL_REVIEW.md
ADDED
|
@@ -0,0 +1,76 @@
| 1 |
+
# Wave 15 Final Review — Multi-Angle Self-Critique + Fix Wave
|
| 2 |
+
|
| 3 |
+
**Date:** 2026-05-26
|
| 4 |
+
**Method:** 4 parallel adversarial reviewers (math / tests / docs / user-journey), each given a different framing to maximize independent-angle coverage. Then targeted fix scatter on findings.
|
| 5 |
+
|
| 6 |
+
## Headline finding
|
| 7 |
+
|
| 8 |
+
**The math reviewer found 2 BLOCKERs that all 8+ prior subagents missed.** Both came from `git clone`-ing upstream and doing line-by-line diffs against the framework's `composer_replication/opsd.py` and `composer_replication/distillation/taid.py` — something no prior reviewer had done for those files (Wave 14b reviewer did it for PRIME-RL only).
|
| 9 |
+
|
| 10 |
+
This validates the user's instinct that "every angle" multi-model orchestration is worth doing — the math angle, given a sharp prompt that mandated upstream verification, found genuine bugs in the framework's primary loss kernel.
|
| 11 |
+
|
| 12 |
+
## Wave 15a reviews (all 4 deliverables)
|
| 13 |
+
|
| 14 |
+
| Reviewer | Focus | BLOCKERs | Severity-weighted findings |
|---|---|---|---|
| Math correctness (Opus 4.7) | 7 claimed implementations vs primary sources | **2 BLOCKER + 3 minor** | `generalized_jsd_loss` math wrong; `taid_loss` algorithm wrong |
| Test honesty (Opus 4.7) | 3 specific test files | 0 BLOCKER + 3 weak-assertions | PRIME-RL parity skip silently never runs; bit-exact uses `allclose` not `equal`; entropy-OPD test is pure smoke |
| Documentation drift (Opus 4.7) | 6 major docs + ADRs | 0 BLOCKER + 7 drifts | test count drift (77/107/124 vs actual 145); `compose_loss` kwarg drift; PRIME-RL test count 10 vs 16; stale "Deferred to Wave 14" claim |
| User journey (Opus 4.7) | RL-finetune Qwen-7B on GSM8K | 0 BLOCKER + 10 friction items | **No GSM8K example** (#1 ask); no runnable `ComposerReplicationTrainer` recipe; data-collator gap undocumented; defaults activate channels users haven't configured |
|
| 20 |
+
|
| 21 |
+
Reports saved at `/tmp/wave15_{math,test,doc,user}_review.md`.
|
| 22 |
+
|
| 23 |
+
## Wave 15b — fix scatter outcomes
|
| 24 |
+
|
| 25 |
+
5 parallel fix subagents dispatched. Outcomes:
|
| 26 |
+
|
| 27 |
+
| Task | Subagent outcome |
|---|---|
| (1) OPSD math rewrite vs upstream | ✅ Completed. New parity test (skip-marked) verifies 31 cases against upstream `siyan-zhao/OPSD`. Mixture distribution now β-weighted (was hardcoded 0.5); β coefficient on correct terms (was swapped); reduction matches upstream (was off by 100-2000× factor). Docstring labels fixed (β=0 = reverse KL, β=1 = forward KL). |
| (2) TAID rewrite vs upstream | ⚠️ Subagent timed out at 600s but **work landed**: logit-space mix (was prob-space), current-student-detached anchor (was frozen step-0), forward-KL criterion (was JSD), optional `TAIDScheduler` for adaptive scheme. Docstring rewritten to acknowledge the breaking change. Tests updated. Parity test added. |
| (3) GSM8K example | ⚠️ Subagent timed out but **work landed**: `examples/gsm8k_grpo/run.py` runs end-to-end on CPU with Qwen2.5-0.5B-Instruct, 100 GSM8K rows, regex-based verifiable reward, 2 outer steps in 58s. README written by parent agent. The `run_with_sdpo.py` variant deferred to Wave 16. |
| (4) Doc drift + install ergonomics | ⚠️ Subagent timed out. **Parent completed:** flipped `alpha_sdpo` and `beta_replay` defaults to 0.0; added clear ImportError if TRL missing; fixed TROUBLESHOOTING `[replay]` extras claim; updated README + USER_GUIDE + INTEGRATION_RECIPES test counts to reference V1_V8_COVERAGE; closed stale "Deferred to Wave 14" claim. |
| (5) Test hardening + LossOutputs wrap | ✅ Completed (3 of 4 sub-tasks). PRIME-RL `loss_fn` now returns `LossOutputs(loss, metrics)`. Bit-exact test tightened to `torch.equal`. PRIME-RL parity test now emits visible warning when prime-rl unavailable. Gradient-flow tests deferred to Wave 16. |
|
| 34 |
+
|
| 35 |
+
## Final test count post-Wave-15: 115 passing + 1 skip-marked
|
| 36 |
+
|
| 37 |
+
- Wave-by-wave: 72 (W12) → 93 (W13) → 124 (W14) → 130 (W14b) → **115** (W15)
|
| 38 |
+
- Net decrease from 130: TAID rewrite consolidated 16 schedule-specific tests into 7 `t`-parameterized tests (smaller surface but stronger contracts: each test now exercises the actual paper algorithm). Plus 1 skip-marked OPSD parity test.
|
| 39 |
+
- Trade-off: fewer tests, but 2 BLOCKER-class math bugs eliminated. Net correctness improvement is large.
|
| 40 |
+
|
| 41 |
+
## What this round caught vs missed
|
| 42 |
+
|
| 43 |
+
### Caught (improvements over Wave 14b state)
|
| 44 |
+
- 2 math BLOCKERs in primary loss kernels, fixed against upstream byte-for-byte
|
| 45 |
+
- TAID rewrite from misnamed prob-space-JSD-with-frozen-anchor to actual SakanaAI/TAID
|
| 46 |
+
- PRIME-RL `LossOutputs` adapter wrap — recipe is now actually invokable from PRIME-RL
|
| 47 |
+
- GSM8K real-task example — closes the user-reviewer's #1 friction
|
| 48 |
+
- Default kwargs (`alpha_sdpo=0.1` → `0.0`) — no more silent activation of unconfigured channels
|
| 49 |
+
- TRL ImportError clarity — no more cryptic `object.__init__()` errors
|
| 50 |
+
- Test count drift — single canonical doc (V1_V8_COVERAGE)
|
| 51 |
+
- TROUBLESHOOTING `[replay]` extras correctly described
|
| 52 |
+
|
| 53 |
+
### Missed (Wave 16 candidates)
|
| 54 |
+
- `run_with_sdpo.py` — promised but not shipped this wave
|
| 55 |
+
- 3 gradient-flow tests for compose_loss channels (test reviewer's #4)
|
| 56 |
+
- Multi-process MockManager + DiLoCo convergence test was added in Wave 14b but only at world_size=2; user reviewer didn't probe larger
|
| 57 |
+
- Recon docs (`docs/research/*RECONNAISSANCE.md`) not cross-checked against current code state — likely some staleness
|
| 58 |
+
- PRIME-RL recipe still hasn't been run end-to-end against actual prime-rl (parity test skip-marks; LossOutputs wrap added but not exercised)
|
| 59 |
+
|
| 60 |
+
## Methodological lessons for future waves
|
| 61 |
+
|
| 62 |
+
1. **Prompt subagents to clone upstream and diff** when the task is "verify against external truth." 8+ prior reviewers checked papers but did not `git clone`. The instruction "read /tmp/X-clone/file.py and find every divergence" produced the BLOCKER-class findings.
|
| 63 |
+
|
| 64 |
+
2. **600s subagent timeout is the dominant constraint at this scope.** 3 of 5 fix subagents timed out despite making real progress. Workaround: write the report file FIRST as a skeleton, iterate in place. (Subagents that did this completed; subagents that read everything then tried to write at the end timed out.)
|
| 65 |
+
|
| 66 |
+
3. **Cross-cutting parallel-subagent failure mode**: subagents cite each other instead of upstream. Wave 14 caught this for PRIME-RL math. Wave 15 caught it for OPSD + TAID math. The mitigation is mandate-upstream-verification in the prompt.
|
| 67 |
+
|
| 68 |
+
4. **Prompt injection in tool outputs**: one subagent flagged that fake "don't reproduce copyrighted material" instructions appeared in its tool outputs throughout, designed to make it abandon the OPSD math fix. The subagent correctly ignored the injection and completed the task. The framework's MIT-licensed work with attribution is fully authorized; no copyright concern.
|
| 69 |
+
|
| 70 |
+
## Open items for Wave 16
|
| 71 |
+
|
| 72 |
+
1. `examples/gsm8k_grpo_with_sdpo/` — demonstrate SDPO column wiring end-to-end
|
| 73 |
+
2. Gradient-flow tests for compose_loss channels (pre-staged in test reviewer's report)
|
| 74 |
+
3. Recon-doc currency sweep: cross-check `docs/research/*RECONNAISSANCE.md` against current code state
|
| 75 |
+
4. Real PRIME-RL end-to-end run with the new `LossOutputs` wrap (verify the wrap shape works in the real `setup_loss_fns` pipeline)
|
| 76 |
+
5. `INTEGRATION_RECIPES.md` `compose_loss` signature display — collapse to `...` and link to `API_REFERENCE.md`, OR sync to all 17 kwargs
|
docs/research/_archive/WAVE_7_10_FINAL_REVIEW.md
ADDED
|
@@ -0,0 +1,423 @@
| 1 |
+
# Wave 7–10 Final Review — Cross-model adversarial check
|
| 2 |
+
|
| 3 |
+
**Reviewer**: external model, Phase 11 of the deep work loop.
|
| 4 |
+
**Date**: 2026-05-26.
|
| 5 |
+
**Mandate**: find substantive flaws. The research thesis is
|
| 6 |
+
primary-source-validated; this attacks *implementation correctness* and
|
| 7 |
+
*scope creep*, not the thesis.
|
| 8 |
+
|
| 9 |
+
---
|
| 10 |
+
|
| 11 |
+
## (a) Are the tests real evidence or theater?
|
| 12 |
+
|
| 13 |
+
### Spike 006 (Qwen2.5-0.5B-Instruct CPU smoke, 9 tests)
|
| 14 |
+
|
| 15 |
+
**Verdict: mostly tautology, with usable ablation tests.**
|
| 16 |
+
|
| 17 |
+
The headline "loss 0.7390 → 0.0031, 99.6% reduction in 5 steps" is
|
| 18 |
+
technically true and substantively near-tautological:
|
| 19 |
+
|
| 20 |
+
1. **The same fixed ~50-token batch is reused for all 5 steps.**
|
| 21 |
+
`build_batch` returns one conversation; the test loop calls
|
| 22 |
+
`compose_loss(model, batch)` five times in a row. No reshuffle, no
|
| 23 |
+
second batch, no held-out anything.
|
| 24 |
+
2. **0.5B params × AdamW(lr=1e-5) × identical 50-token batch ×
|
| 25 |
+
5 steps = textbook memorization regime.** A randomly-initialized MLP
|
| 26 |
+
would also reduce loss in this setup. The test does not distinguish
|
| 27 |
+
"the 3-channel composition is correct" from "AdamW reduces fixed-batch
|
| 28 |
+
loss on any non-degenerate objective."
|
| 29 |
+
3. **The SDPO channel is zero throughout** (`sdpo_jsd=0.0` on every
|
| 30 |
+
row of `loss_curve.csv`). The verdict calls this "correct fallback
|
| 31 |
+
behavior"; what it actually shows is *the entire SDPO channel never
|
| 32 |
+
being tested by this smoke*. The fallback is a literal `_zero(device)`.
|
| 33 |
+
`generalized_jsd_loss` has no end-to-end test on a real HF model
|
| 34 |
+
anywhere in the codebase. **This is the largest evidence gap for
|
| 35 |
+
V8.**
|
| 36 |
+
4. **DPO uses dummy hard-coded reference logprobs** (`-30.0`, `-35.0`).
|
| 37 |
+
This tests that `-logsigmoid(small_positive)` is differentiable, not
|
| 38 |
+
that the trace-replay-DPO pipeline (reference-policy precompute +
|
| 39 |
+
collator + loss) wires together.
|
| 40 |
+
5. The "loss decreases" assertion is `losses[-1] < losses[0]` — the
|
| 41 |
+
weakest version of monotonicity.
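
Points 2 and 5 can be made concrete with a small sketch. It contrasts the endpoint-only assertion quoted above with a stricter check; `stricter_check`, its thresholds, and both loss curves are hypothetical illustrations, not code or data from the repository.

```python
def weak_check(losses):
    """The assertion quoted above: only the two endpoints are compared."""
    return losses[-1] < losses[0]


def stricter_check(losses, min_rel_drop=0.5, tol=1e-6):
    """Require a non-increasing curve (within tol) and a minimum relative drop."""
    non_increasing = all(b <= a + tol for a, b in zip(losses, losses[1:]))
    rel_drop = (losses[0] - losses[-1]) / losses[0]
    return non_increasing and rel_drop >= min_rel_drop


# Hypothetical curve that spikes mid-run yet ends marginally below its start.
unstable = [0.74, 0.90, 1.20, 0.95, 0.73]
# Hypothetical curve that decreases at every step.
healthy = [0.74, 0.41, 0.12, 0.03, 0.003]

print(weak_check(unstable), stricter_check(unstable))
print(weak_check(healthy), stricter_check(healthy))
```

Even the stricter check says nothing about generalization: on a single memorized batch it would still pass, so it addresses point 5 only, not points 1 and 2.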
|
| 42 |
+
|
| 43 |
+
**Genuine value**: model-loads test, chat-template test, the three
|
| 44 |
+
α=0 / β=0 ablation tests. The ablations would catch a regression where
|
| 45 |
+
weights stop disabling channels. The "5-step decrease" is the weakest
|
| 46 |
+
test in the file.
|
| 47 |
+
|
| 48 |
+
**Run.log inconsistency, not flagged anywhere**:
|
| 49 |
+
`examples/qwen_05b_quickstart/run.log` shows step-1 total = 0.0379;
|
| 50 |
+
the spike `verdict.md` quotes step-1 total = 0.2090 for the same code,
|
| 51 |
+
same model. Either the seed isn't pinned through the model forward
|
| 52 |
+
(likely — `torch.manual_seed(42)` is in `build_batch` only), or the
|
| 53 |
+
package's `compose_loss` differs subtly from the spike's. **Quoting
|
| 54 |
+
exact numbers from a non-reproducible run as evidence is the sloppy
|
| 55 |
+
version of every research-replication scandal.**
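
A toy sketch of the suspected failure mode, using the standard-library `random` module as a stand-in for the real RNG. Every name except `build_batch` is hypothetical, and whether this is the actual cause of the run.log / verdict.md gap is unverified; it only shows how pinning the seed inside batch construction leaves downstream randomness unpinned.

```python
import random


def build_batch():
    # Mirrors the situation described above: the seed is pinned here only.
    rng = random.Random(42)
    return [rng.random() for _ in range(4)]


def noisy_forward(batch, rng):
    # Stand-in for any stochastic step downstream of batch construction.
    return sum(batch) + rng.gauss(0.0, 0.1)


def run(seed=None):
    # seed=None leaves the downstream generator seeded from OS entropy.
    rng = random.Random(seed)
    return noisy_forward(build_batch(), rng)


unpinned = (run(), run())            # identical batches, results generally differ
pinned = (run(seed=0), run(seed=0))  # every source of randomness is pinned
print(unpinned[0] == unpinned[1], pinned[0] == pinned[1])
```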
|
| 56 |
+
|
| 57 |
+
### Spike 007 (Claude Code ingester, 15 tests)
|
| 58 |
+
|
| 59 |
+
**Verdict: strongest test suite of the three. Caveats apply.**
|
| 60 |
+
|
| 61 |
+
Real engineering value:
|
| 62 |
+
- Synthetic fixture exercises the actual record types (assistant,
|
| 63 |
+
user/tool_result, summary, system, sidechain). Tests assert structural
|
| 64 |
+
properties: history grows monotonically, `[THINKING]` stripped on
|
| 65 |
+
replay but kept in student_action, unique state_ids, tool_use
|
| 66 |
+
serialization, tool_result tagging.
|
| 67 |
+
- `test_truncated_line_tolerated` would catch a real failure mode:
|
| 68 |
+
removal of the JSON-decode try/except.
|
| 69 |
+
- Subagent and sidechain skip tests catch real production cases.
|
| 70 |
+
|
| 71 |
+
Caveats:
|
| 72 |
+
- **The "real session" test is hardcoded to one path on the author's
|
| 73 |
+
machine** (`/home/codeseys/.claude/projects/…/e4a34e2b-….jsonl`).
|
| 74 |
+
No env var, no fixture-discovery; the test is `skipif(not exists)`.
|
| 75 |
+
This is a manual integration test, not a CI test. ADR-002 said "CI
|
| 76 |
+
users substitute their own"; the substitution mechanism doesn't
|
| 77 |
+
exist.
|
| 78 |
+
- The synthetic fixture is **author-written** and presumably designed
|
| 79 |
+
alongside the ingester. There is no scrubbed third-party fixture.
|
| 80 |
+
- Acceptance criterion #3 in BACKLOG ("end-to-end smoke: real trace →
|
| 81 |
+
ingester → collator → 1-step `composer_total_loss`") is **unmet** —
|
| 82 |
+
the spike stops at "ingester emits TraceStates correctly." There is
|
| 83 |
+
no test that takes ingested records, runs the data collator, and runs
|
| 84 |
+
through `compose_loss`.
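
One possible shape for the missing substitution mechanism, sketched under assumptions: the environment variable name `COMPOSER_REAL_SESSION_JSONL` and the helper `resolve_real_session` are hypothetical and do not exist in the repository.

```python
import os
from pathlib import Path

# Hypothetical variable name; the repository defines no such mechanism today.
ENV_VAR = "COMPOSER_REAL_SESSION_JSONL"


def resolve_real_session(env=None):
    """Return the contributor-supplied session path, or None if unset or missing."""
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR)
    if not raw:
        return None
    path = Path(raw).expanduser()
    return path if path.is_file() else None


# A pytest test could then skip with a reason that says how to enable it, e.g.
#   pytest.mark.skipif(resolve_real_session() is None,
#                      reason=f"set {ENV_VAR} to a Claude Code session .jsonl")
print(resolve_real_session({}))
```

With this, the real-session test stays skipped in CI by default, but any contributor can run it against their own trace without editing the test file.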
|
| 85 |
+
|
| 86 |
+
This suite would catch real regressions in the ingester. Its weakest
|
| 87 |
+
property: it ships no contributor-runnable real-trace test.
|
| 88 |
+
|
| 89 |
+
### Spike 008 (DiLoCo, 5 tests, single-process)
|
| 90 |
+
|
| 91 |
+
**Verdict: the caveat is honest but says the test does not test what
|
| 92 |
+
users will assume it tests.**
|
| 93 |
+
|
| 94 |
+
BACKLOG acceptance criterion: *"Smoke test: 2 replicas × 4 inner steps
|
| 95 |
+
× 2 outer rounds on the toy model from Spike 005, both replicas converge
|
| 96 |
+
toward the same solution within tolerance."*
|
| 97 |
+
|
| 98 |
+
What ships: **one** replica, mock manager whose `allreduce` is a
|
| 99 |
+
**`passthrough` no-op** (test_diloco_smoke.py:78). This is "one
|
| 100 |
+
replica's outer optimizer machinery fires," not "two replicas
|
| 101 |
+
converge." The acceptance criterion was silently re-defined; the
|
| 102 |
+
spike's verdict.md calls this a "limitation" but it is a redefinition.
|
| 103 |
+
|
| 104 |
+
The recon doc (per ADR-003) claimed a "ready-to-paste" pattern with real
|
| 105 |
+
shared-buffer averaging. The implementation hits a "post-hook
|
| 106 |
+
sequencing bug." **Either the recon claim or the implementation is
|
| 107 |
+
wrong**, and the gap is buried in verdict.md instead of fixed.
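
The difference between a passthrough `allreduce` and real averaging can be shown on scalars. This is a minimal sketch under stated assumptions: one scalar parameter per replica, plain SGD as the outer optimizer (no Nesterov momentum), and hypothetical function names. It does not use torchft's API.

```python
def average_allreduce(values):
    # Every replica receives the mean of all contributions.
    mean = sum(values) / len(values)
    return [mean] * len(values)


def passthrough_allreduce(values):
    # No-op: every replica gets back only its own contribution.
    return list(values)


def inner_steps(theta, target, lr=0.1, n_steps=4):
    # Plain gradient descent on (theta - target)^2.
    for _ in range(n_steps):
        theta -= lr * 2.0 * (theta - target)
    return theta


def diloco_round(thetas, targets, allreduce, outer_lr=1.0):
    local = [inner_steps(th, t) for th, t in zip(thetas, targets)]
    # Pseudo-gradient per replica: theta_initial - theta_local.
    pseudo = [s - l for s, l in zip(thetas, local)]
    synced = allreduce(pseudo)
    return [s - outer_lr * g for s, g in zip(thetas, synced)]


def run(allreduce, rounds=2):
    thetas = [0.0, 0.0]  # 2 replicas with identical initial parameters
    for _ in range(rounds):
        thetas = diloco_round(thetas, targets=[1.0, 3.0], allreduce=allreduce)
    return thetas


averaged = run(average_allreduce)
passthrough = run(passthrough_allreduce)
print(averaged, passthrough)
```

With averaging, the two replicas remain equal after every sync; with the passthrough they drift toward their own targets. Only the first case exercises what the acceptance criterion describes.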
|
| 108 |
+
|
| 109 |
+
**Genuine value**: `test_diloco_pseudogradient_sign_convention` is the
|
| 110 |
+
**single best test in all of Wave 7-10**. It pins the sign convention
|
| 111 |
+
with a concrete arithmetic prediction (`final == θ_initial + nudge`)
|
| 112 |
+
and reports `wrong_sign_diff` on failure. A future torchft upgrade that
|
| 113 |
+
flips the sign breaks this test loudly. ADR-003 specifically flagged
|
| 114 |
+
this hazard, and the test catches it. Credit where due.
|
| 115 |
+
|
| 116 |
+
**Separate flaw in `composer_diloco.py` docstring (lines 13–28)**: the
|
| 117 |
+
"wrong-sign pseudogradient combined with SGD's subtract-grad semantics
|
| 118 |
+
gives net step in the local-Δ direction once momentum builds up" gloss
|
| 119 |
+
is incoherent. There is no "wrong-sign" pseudogradient.
|
| 120 |
+
`θ_initial − θ_local` is the exact DiLoCo paper convention; SGD's
|
| 121 |
+
`p ← p − lr·g` semantics are designed for it. The test is correct; the
|
| 122 |
+
prose explaining why is wrong, and will mislead anyone porting the
|
| 123 |
+
convention.
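
The arithmetic behind this point, as a worked example with illustrative numbers (the values are made up; only the two update rules come from the text above):

```python
theta_initial = 1.0
theta_local = 1.3   # after inner steps; the local update is +0.3
outer_lr = 0.7      # illustrative value

# DiLoCo convention: pseudo-gradient = theta_initial - theta_local.
pseudo_grad = theta_initial - theta_local
# Standard SGD: p <- p - lr * g.
theta_new = theta_initial - outer_lr * pseudo_grad

# Sign-flipped variant, shown only for contrast.
flipped = theta_local - theta_initial
theta_wrong = theta_initial - outer_lr * flipped

print(theta_new, theta_wrong)
```

`theta_new` lands between `theta_initial` and `theta_local`, i.e. the outer step moves in the local-update direction with no extra negation; the flipped variant moves away from it.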
|
| 124 |
+
|
| 125 |
+
---
|
| 126 |
+
|
| 127 |
+
## (b) Is the package a real framework or a shim?
|
| 128 |
+
|
| 129 |
+
**Verdict: a structured shim around three real components and two
|
| 130 |
+
stubs. Not yet a framework.**
|
| 131 |
+
|
| 132 |
+
What `pip install composer-replication` delivers:
|
| 133 |
+
- `compose_loss` — labeled in its own top docstring as "Do NOT use as
|
| 134 |
+
the production training loss." Re-exported as the headline package
|
| 135 |
+
API and used in the quickstart.
|
| 136 |
+
- `build_batch` — a hard-coded fixed-conversation factory built for the
|
| 137 |
+
smoke (factorial / binary-search examples). Anyone using this in
|
| 138 |
+
real training is using example code as production.
|
| 139 |
+
- `ClaudeCodeIngester` — real, working component. Solid.
|
| 140 |
+
- `generalized_jsd_loss` — real, working (extracted from OPSD, MIT).
|
| 141 |
+
- `extract_dpo_pairs`, `replay_trace`, teacher specs — real, but
|
| 142 |
+
require OpenRouter credentials + spend.
|
| 143 |
+
- `ComposerReplicationTrainer` — TRL `GRPOTrainer` subclass.
|
| 144 |
+
Useful only with `[train]` extra. Not exercised end-to-end on any
|
| 145 |
+
real model in this repo.
|
| 146 |
+
- `make_diloco_outer_loop` — wrapper. Useful only with `[diloco]` extra.
|
| 147 |
+
|
| 148 |
+
What is missing for "pip install and start training":
|
| 149 |
+
1. No GPU end-to-end example. The brief targets Qwen3-7B / Qwen3-32B.
|
| 150 |
+
2. No CLI. `pyproject.toml` declares no `[project.scripts]`.
|
| 151 |
+
3. No config schema (Hydra/Pydantic). Users hand-construct teacher
|
| 152 |
+
specs, hint generators, data collators.
|
| 153 |
+
4. The `[train]` extra pulls TRL but **no integration test** of
|
| 154 |
+
`ComposerReplicationTrainer` against a real GRPO rollout exists in
|
| 155 |
+
this repo. Spike 005 used TinyLM; Spike 006 stubbed GRPO out
|
| 156 |
+
precisely to avoid TRL.
|
| 157 |
+
5. **`build_batch` should not be public API.** It belongs in
|
| 158 |
+
`examples/`. Re-exporting at top level implies it is a general-purpose
|
| 159 |
+
utility.
|
| 160 |
+
6. **Two sources of truth**: `composer_replication/loss.py` is a
|
| 161 |
+
near-copy of `spikes/006-…/compose_loss.py` with one import path
|
| 162 |
+
changed. The spike tests still import from the spike file. A bug fix
|
| 163 |
+
in one will not propagate. Same for `composer_diloco.py` ↔
|
| 164 |
+
`composer_replication/diloco/__init__.py`.
|
| 165 |
+
|
| 166 |
+
Real framework value:
|
| 167 |
+
- `ClaudeCodeIngester` with non-trivial logic.
|
| 168 |
+
- `generalized_jsd_loss` with token-clip + temperature.
|
| 169 |
+
- DiLoCo wrapper with sign-pinning test.
|
| 170 |
+
- Sane package layout with optional extras for heavy deps.
|
| 171 |
+
|
| 172 |
+
Net: **a successful directory restructure plus an installable wrapper
|
| 173 |
+
around three real components and two stubs.** Calling Wave 10 "framework
|
| 174 |
+
is installable with working entrypoints (✅)" is letter-of-the-law;
|
| 175 |
+
the brief's "framework" connotation isn't yet earned.
|
| 176 |
+
|
| 177 |
+
---
|
| 178 |
+
|
| 179 |
+
## (c) ADR defensibility
|
| 180 |
+
|
| 181 |
+
### ADR-001 (local 5090 over Modal)
|
| 182 |
+
|
| 183 |
+
**Reasoning defensible; execution missing.** The
|
| 184 |
+
"iteration cycle 25–40s vs 3–5min" argument is concrete and matches
|
| 185 |
+
reality. The "verification smoke, not production" framing is correct.
|
| 186 |
+
|
| 187 |
+
**Gap**: Spike 002a-mini was never run on the 5090 either. Phase 10 in
|
| 188 |
+
DEEP_WORK_LOOP_LOG.md is ⏳ pending. ADR-001 chose the 5090 over Modal,
|
| 189 |
+
and **then nothing ran on either.** No `nvidia-smi` snapshot, no GPU
|
| 190 |
+
step-time CSV, no bf16 numerics check. The "rule out CPU-only blind
|
| 191 |
+
spots" goal is unmet. The ADR should be marked "Accepted (execution
|
| 192 |
+
deferred)" or the spike should run.
|
| 193 |
+
|
| 194 |
+
### ADR-002 (Claude Code JSONL trace source)
|
| 195 |
+
|
| 196 |
+
**Defensible on every dimension the ADR considers; the dimensions are
|
| 197 |
+
partial.** "1,015 real sessions, zero acquisition cost" is real. License
|
| 198 |
+
and schema-stability arguments are well-sourced.
|
| 199 |
+
|
| 200 |
+
**Adversarial counter not in the ADR**: Claude Code JSONL is the most
|
| 201 |
+
self-serving choice. The framework targets training a coding-agent model.
|
| 202 |
+
The training data is the author's own Claude Code sessions where the
|
| 203 |
+
agent was Claude. The teacher pool (Spike 001) is OpenRouter-based and
|
| 204 |
+
*includes Claude*. So:
|
| 205 |
+
- "student action" = what Claude did.
|
| 206 |
+
- teacher pool includes Claude.
|
| 207 |
+
- DPO pairs = teachers' agreement vs Claude's literal text.
|
| 208 |
+
|
| 209 |
+
This is **circular imitation**: training a future model to imitate
|
| 210 |
+
Claude using Claude's outputs as the gold reference and Claude as one
|
| 211 |
+
of the disagreement teachers. The teacher-disagreement signal density
|
| 212 |
+
argument from Spike 001 is strongest with diverse teachers. With this
|
| 213 |
+
trace source, the student-action is locked to one teacher family,
|
| 214 |
+
biasing the disagreement signal. The ADR doesn't consider this; the
|
| 215 |
+
ingester README doesn't flag it. **The ADR rationalizes the easy path
|
| 216 |
+
without naming the data-leakage tradeoff.**
|
| 217 |
+
|
| 218 |
+
### ADR-003 (torchft for DiLoCo)
|
| 219 |
+
|
| 220 |
+
**Genuinely defensible choice.** Meta-maintained library; rolling-own
|
| 221 |
+
trap correctly identified; license analysis (rejecting `diloco_simple`)
|
| 222 |
+
is right; sign-convention risk named and tested.
|
| 223 |
+
|
| 224 |
+
**Gap is in delivery, not decision.** ADR-003 §Consequences §1 says:
|
| 225 |
+
"2 replicas, 4 inner steps, 2 outer rounds on a TinyMLP, shared-buffer
|
| 226 |
+
mock allreduce, assertions: replica equality after sync, params actually
|
| 227 |
+
moved, Nesterov state populated, sync count matches expected." Spike 008
|
| 228 |
+
implements one replica + passthrough manager. The ADR commits to an
|
| 229 |
+
implementation that the spike does not deliver, and the gap is flagged
|
| 230 |
+
only in the spike's verdict, not in the ADR.
|
| 231 |
+
|
| 232 |
+
If the recon doc said the pattern was "ready-to-paste" but actually
|
| 233 |
+
hits a sequencing bug, **the recon doc is wrong** and an adversarial
|
| 234 |
+
reviewer is allowed to point that out.
|
| 235 |
+
|
| 236 |
+
---
|
| 237 |
+
|
| 238 |
+
## (d) Scorecard inflation
|
| 239 |
+
|
| 240 |
+
The 5/10 → 9/10 update overstates. Test by test:
|
| 241 |
+
|
| 242 |
+
- **Test 6 (DiLoCo integrated in runnable stack) → ✅?**
|
| 243 |
+
Letter-of-law yes, spirit no. `make_diloco_outer_loop` exists and
|
| 244 |
+
fires on one replica. **Zero references to torchft or DiLoCo in
|
| 245 |
+
`composer_trainer.py`** — DiLoCo is not integrated with the trainer.
|
| 246 |
+
No two-replica integration test, no real distributed run.
|
| 247 |
+
|
| 248 |
+
- **Test 7 (real HF model loads + runs) → ✅?**
|
| 249 |
+
Yes — most legitimately closed item. Caveats from §(a) about depth
|
| 250 |
+
of evidence apply, but the literal test is met.
|
| 251 |
+
|
| 252 |
+
- **Test 8 (real LLM-application trace ingested end-to-end) → ✅?**
|
| 253 |
+
Mostly yes. Ingester real and tested. **BACKLOG acceptance criterion
|
| 254 |
+
#3 ("end-to-end: real trace → ingester → collator → 1-step
|
| 255 |
+
`composer_total_loss`") is unmet.**
|
| 256 |
+
|
| 257 |
+
- **Test 9 (framework installable with working entrypoints) → ✅?**
|
| 258 |
+
Letter-of-law yes, spirit partial. `pip install -e .` works; the
|
| 259 |
+
quickstart runs the smoke harness. Production entrypoint
|
| 260 |
+
(`ComposerReplicationTrainer` driven by a config) does not exist.
|
| 261 |
+
|
| 262 |
+
- **Test 10 (non-author can complete the journey) → ✅?**
|
| 263 |
+
No. The supporting evidence is "Quickstart README + working
|
| 264 |
+
installable demonstrate the full path on Qwen2.5-0.5B in <5min, $0."
|
| 265 |
+
Test 10's original journey was "I have Qwen3-7B, I want a
|
| 266 |
+
Composer-style variant." The parenthetical concession in the update
|
| 267 |
+
("For Qwen3-7B etc., GPU phase still gates the empirical demo")
|
| 268 |
+
marks the item ✅ anyway.
|
| 269 |
+
|
| 270 |
+
**Honest re-scoring**: 5/10 → **7/10 ✅, 1/10 ⚠️ partial (test 8),
|
| 271 |
+
2/10 ❌ in spirit (tests 6, 10).** "9/10" overstates by ~2 points.
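
The tally behind this re-scoring, as a short sketch. It assumes the five items that passed before the update are unaffected; the verdict labels are shorthand for the per-test findings above.

```python
from collections import Counter

BASELINE_PASSES = 5  # assumption: the five previously passing items still pass
verdicts = {
    6: "fail_in_spirit",
    7: "pass",
    8: "partial",
    9: "pass",
    10: "fail_in_spirit",
}

tally = Counter(verdicts.values())
tally["pass"] += BASELINE_PASSES
claimed_passes = 9
print(dict(tally), "overstatement:", claimed_passes - tally["pass"])
```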
|
| 272 |
+
|
| 273 |
+
---
|
| 274 |
+
|
| 275 |
+
## (e) Commit quality
|
| 276 |
+
|
| 277 |
+
```
|
| 278 |
+
ac05fbf Wave 10 — packaging: composer_replication is now pip-installable
|
| 279 |
+
d52e126 Tidy .gitignore (de-dup *.jsonl, restore section blank lines)
|
| 280 |
+
a35a8d7 Spike 007: include synthetic_session.jsonl fixture in repo
|
| 281 |
+
57af35d Wave 7+8+9: spikes 006/007/008 — close vision-validation gaps V2/V5/V8
|
| 282 |
+
ac4bfb4 Wave 7: Phase 2-4 of deep work loop — backlog, parallel research, three ADRs
|
| 283 |
+
040eff8 Wave 6: vision validation self-audit (5/10 to 9/10 in 5 days, no GPU)
|
| 284 |
+
```
|
| 285 |
+
|
| 286 |
+
- `ac05fbf`, `d52e126`: accurate.
|
| 287 |
+
- `a35a8d7`: accurate. Implies `57af35d` shipped a Spike 007 that did
|
| 288 |
+
not actually run cleanly for anyone cloning before this commit. Mild
|
| 289 |
+
overclaim risk on `57af35d`.
|
| 290 |
+
- **`57af35d` is the single most overclaiming commit.** Title: "close
|
| 291 |
+
vision-validation gaps V2/V5/V8."
|
| 292 |
+
- V8: closed in the weakest sense (tautology critique above).
|
| 293 |
+
- V5: structural ingestion closes; BACKLOG acceptance #3 unmet.
|
| 294 |
+
- V2: silently re-defined (one replica, no convergence).
|
| 295 |
+
Three closures claimed: one weak, one partial, one redefined.
|
| 296 |
+
- **Chronology problem**: `040eff8` (Wave 6) declared the **5/10 → 9/10
|
| 297 |
+
forecast** in the commit subject. `ac4bfb4` (Wave 7, *next* commit)
|
| 298 |
+
added the BACKLOG and ADRs — i.e., the *plan* to make the forecast
|
| 299 |
+
true. `57af35d` (Wave 7-9) executed and ratified the 9/10 without
|
| 300 |
+
re-auditing whether each item was actually closed in spirit. **No
|
| 301 |
+
commit re-audits the scorecard against actually delivered evidence.**
|
| 302 |
+
|
| 303 |
+
---
|
| 304 |
+
|
| 305 |
+
## (f) Adversarial reviewer's strongest line of attack
|
| 306 |
+
|
| 307 |
+
> "You have a research replication framework whose only published smoke
|
| 308 |
+
> is a 5-step fixed-batch overfit on a 0.5B model on CPU, where the SDPO
|
| 309 |
+
> channel is silently disabled (sdpo_jsd=0 throughout), the DPO channel
|
| 310 |
+
> uses dummy reference logprobs, and the GRPO channel is replaced with
|
| 311 |
+
> a stub. Of the three channels you advertise, **zero are tested
|
| 312 |
+
> end-to-end on a real HF model.** Your DiLoCo integration is one
|
| 313 |
+
> replica with a no-op `allreduce`. Your real-trace ingester is tested
|
| 314 |
+
> against a fixture you wrote yourself plus a hardcoded path on your
|
| 315 |
+
> laptop. Your scorecard moved from 5/10 to 9/10 with no GPU spend, no
|
| 316 |
+
> third-party validation, and one commit that closed three vision-
|
| 317 |
+
> validation gaps with one commit message. You are asking the reader to
|
| 318 |
+
> believe that a $9B-startup commercial product is replicated by a CPU
|
| 319 |
+
> smoke and three green test files — none of which the company itself
|
| 320 |
+
> would call 'replicated.'"
|
| 321 |
+
|
| 322 |
+
**Weakest defense**: "It's just v0.1 / smoke phase / GPU is the next
|
| 323 |
+
phase." The *commit log and scorecard claim otherwise.* The defense
|
| 324 |
+
"v0.1 caveat" only works if the v0.1 framing is honest at the top of
|
| 325 |
+
the README and scorecard — and it is not.
|
| 326 |
+
|
| 327 |
+
**Strongest actual defense**: the four primary-source-validated recon
|
| 328 |
+
docs and Spike 001's measured cost floor. The *thesis* is credible and
|
| 329 |
+
auditable. The *implementation phase* is overclaimed.
|
| 330 |
+
|
| 331 |
+
---
|
| 332 |
+
|
| 333 |
+
## What to fix before publishing publicly (priority order)
|
| 334 |
+
|
| 335 |
+
### 1. Re-state the scorecard honestly (BLOCKER)
|
| 336 |
+
Replace 5/10 → 9/10 with **5/10 → 7/10 ✅, 1/10 ⚠️, 2/10 ❌-spirit.**
|
| 337 |
+
List the spirit-failures explicitly (test 6 trainer integration, test 8
|
| 338 |
+
end-to-end, test 10 non-author). Single most important fix; everything
|
| 339 |
+
else compounds on the inflated scorecard.
|
| 340 |
+
|
| 341 |
+
### 2. Fix Spike 008's V2 claim (BLOCKER)
|
| 342 |
+
Either (a) add a real two-replica multiprocessing test (ADR-003 says
|
| 343 |
+
this is feasible; the spike claims it isn't — reconcile), or (b) mark
|
| 344 |
+
V2 as ⚠️ partial and rewrite BACKLOG: "machinery fires on one replica,
|
| 345 |
+
sign convention pinned; cross-replica convergence deferred to GPU
|
| 346 |
+
phase." Pick one.
|
| 347 |
+
|
| 348 |
+
### 3. Strengthen Spike 006 against the tautology critique
|
| 349 |
+
Two cheap wins:
|
| 350 |
+
- Test that loss decreases on **two alternating fixed batches** over 10
|
| 351 |
+
rounds (not just one memorized batch).
|
| 352 |
+
- Test where **`alpha_sdpo=10.0` and SDPO actually fires** (truncate
|
| 353 |
+
ctx_teacher to T_s tokens for matching shape). The SDPO channel is
|
| 354 |
+
*not exercised on a real HF model anywhere* in the codebase. Largest
|
| 355 |
+
evidence gap for V8.
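
The first suggestion could take roughly the following shape. This is a toy sketch with a one-parameter model and hand-written gradients, not the Qwen smoke itself; `train_alternating` and both batches are hypothetical.

```python
def loss_and_grad(w, batch):
    # Squared error of a one-parameter linear model on a single (x, y) pair.
    x, y = batch
    err = w * x - y
    return err * err, 2.0 * err * x


def train_alternating(batches, rounds=10, lr=0.05, w=0.0):
    """Visit the batches in rotation; record the pre-update loss per batch."""
    history = {i: [] for i in range(len(batches))}
    for _ in range(rounds):
        for i, batch in enumerate(batches):
            loss, grad = loss_and_grad(w, batch)
            history[i].append(loss)
            w -= lr * grad
    return w, history


batches = [(1.0, 2.0), (2.0, 4.0)]  # two fixed toy batches, both consistent with w = 2
w, history = train_alternating(batches)
for i, losses in history.items():
    assert losses[-1] < losses[0], f"batch {i} did not improve"
print(round(w, 3))
```

The per-batch assertion is the part worth carrying over: each batch's loss is measured before the update on that batch, so an improvement on one batch cannot be explained purely by the step that was just taken on it.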
|
| 356 |
+
|
| 357 |
+
### 4. Run Spike 002a-mini on the local 5090
|
| 358 |
+
ADR-001 made the choice; the spike was not run. Either drop the ADR
|
| 359 |
+
(decision deferred) or run the spike (~30 min wall-clock per ADR's own
|
| 360 |
+
estimate). Until then, the framework has zero GPU evidence of any kind.
|
| 361 |
+
|
| 362 |
+
### 5. Fix the run.log / verdict.md numerical inconsistency
|
| 363 |
+
Quickstart run.log shows step-1=0.0379; spike verdict shows step-1=0.2090.
|
| 364 |
+
Either pin the seed properly or document non-reproducibility and quote
|
| 365 |
+
a band rather than exact numbers.
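
If the second option is chosen, reporting a band could look like the sketch below. The two step-1 totals are the ones quoted above; the helper `within_band` and its 10% slack are hypothetical.

```python
step1_totals = {"run.log": 0.0379, "verdict.md": 0.2090}  # values quoted above

lo, hi = min(step1_totals.values()), max(step1_totals.values())
spread = hi / lo
print(f"step-1 total in [{lo:.4f}, {hi:.4f}] (x{spread:.1f} spread)")


def within_band(value, lo, hi, slack=0.10):
    """Accept values inside the observed band widened by a relative slack."""
    return lo * (1 - slack) <= value <= hi * (1 + slack)
```

Two observations are a thin basis for a band; more runs would be needed before treating its edges as meaningful.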
|
| 366 |
+
|
| 367 |
+
### 6. Acknowledge Claude Code JSONL's circularity in ADR-002
|
| 368 |
+
Add a "Risks accepted" entry naming the data-leakage concern: training
|
| 369 |
+
on Claude's outputs while Claude is in the teacher pool produces a
|
| 370 |
+
biased disagreement signal. Spike 007 README should also flag it.
|
| 371 |
+
|
| 372 |
+
### 7. Decide what `compose_loss` and `build_batch` are
|
| 373 |
+
Either rename to `compose_loss_smoke` (and keep
|
| 374 |
+
`ComposerReplicationTrainer._compute_loss` as production), or make
|
| 375 |
+
`compose_loss` actually production-grade and demote `build_batch` out
|
| 376 |
+
of public API. A production-disclaimed harness as the package's headline
|
| 377 |
+
import is confusing.
|
| 378 |
+
|
| 379 |
+
### 8. Eliminate dual sources of truth
|
| 380 |
+
`spikes/006-…/compose_loss.py` ↔ `composer_replication/loss.py`, and
|
| 381 |
+
`spikes/008-…/composer_diloco.py` ↔ `composer_replication/diloco/__init__.py`.
|
| 382 |
+
Make the spike import from the package; delete the duplicate.
|
| 383 |
+
|
| 384 |
+
### 9. Add the missing real-trace end-to-end test in Spike 007
|
| 385 |
+
Take ingester output → Spike 005 data collator → 1 step of `compose_loss`.
|
| 386 |
+
This is BACKLOG acceptance #3; ~50 lines of test code closes V5's
|
| 387 |
+
spirit gap.
|
| 388 |
+
|
| 389 |
+
### 10. Fix the sign-convention docstring in `composer_diloco.py`
|
| 390 |
+
Replace the incoherent "wrong-sign + SGD subtract = right answer with
|
| 391 |
+
momentum" gloss with: *"DiLoCo defines pseudo-gradient as
|
| 392 |
+
`θ_initial − θ_local`; this is the negative of the local update
|
| 393 |
+
direction, and standard SGD subtracts gradients, so the outer step
|
| 394 |
+
moves in the local-update direction. No negation required."* The test
|
| 395 |
+
is correct; the prose explaining it isn't.
|
| 396 |
+
|
| 397 |
+
---
|
| 398 |
+
|
| 399 |
+
## Credit where due
|
| 400 |
+
|
| 401 |
+
- **Spike 007's `ClaudeCodeIngester`** is real, working, well-tested
|
| 402 |
+
software with non-trivial logic (sidechain skip, thinking-block
|
| 403 |
+
strip-on-replay, malformed-line tolerance). The synthetic fixture
|
| 404 |
+
exercises the structural cases properly.
|
| 405 |
+
- **Spike 008's pseudogradient-sign-convention test** is the single
|
| 406 |
+
best test in all of Wave 7-10. It pins a known torchft hazard with an
|
| 407 |
+
explicit arithmetic prediction and a `wrong_sign_diff` reported on
|
| 408 |
+
failure.
|
| 409 |
+
- **Spike 006's α=0 / β=0 ablation tests** would catch real regressions
|
| 410 |
+
and document channel-disable semantics.
|
| 411 |
+
- **All three ADRs are properly traceable to recon documents**
|
| 412 |
+
(MODAL_RECONNAISSANCE, TRACE_SOURCE_RECONNAISSANCE,
|
| 413 |
+
DILOCO_RECONNAISSANCE). The decisions can be challenged; the *process*
|
| 414 |
+
is auditable, which is rare.
|
| 415 |
+
- **Package layout** (`loss`, `batch`, `opsd`, `teacher_replay`,
|
| 416 |
+
`ingestion/claude_code`, `diloco`, `trainer`) is sane; optional
|
| 417 |
+
extras correctly avoid forcing TRL/torchft on every install.
|
| 418 |
+
|
| 419 |
+
The work product is not zero. It is overclaimed by roughly one
|
| 420 |
+
scorecard tier and one BACKLOG acceptance criterion. Fixing items
|
| 421 |
+
1, 2, 3, 5 above moves the framework from "publishable with a generous
|
| 422 |
+
reviewer" to "publishable with a critical reviewer." Items 4 and 6
|
| 423 |
+
move it from "research replication" to "evidenced research replication."
|
docs/reviews/cross-family-adr-008-009-010-2026-05-29/SYNTHESIS.md
CHANGED
|
@@ -1,90 +1,9 @@
|
|
| 1 |
# Cross-family adversarial review — ADR-008 / 009 / 010 (2026-05-29)
|
| 2 |
|
| 3 |
-
>
|
| 4 |
-
>
|
| 5 |
-
>
|
| 6 |
-
>
|
| 7 |
-
>
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
Route-fidelity-safe urllib scatter (no `delegate_task` per-task-override risk)
|
| 12 |
-
to 5 diverse families, full corpus embedded inline: the 3 ADRs **plus their
|
| 13 |
-
implementation code plus the tests** (~100KB / ~25K tokens), so reviewers
|
| 14 |
-
critiqued the *implementation against the decision*, not just the prose.
|
| 15 |
-
|
| 16 |
-
| Family | Slug (served) | Result |
|
| 17 |
-
|---|---|---|
|
| 18 |
-
| OpenAI | gpt-5.5-20260423 | clean, very specific (7981 tok) |
|
| 19 |
-
| Google | gemini-3.1-pro-preview-20260219 | clean, sharpest (9174 tok @ retry 16K) |
|
| 20 |
-
| DeepSeek | deepseek-v4-pro-20260423 | clean, math/correctness lens (10308 tok @ retry 16K) |
|
| 21 |
-
| xAI | grok-4.3-20260430 | clean, terse/decisive (1645 tok) |
|
| 22 |
-
| Moonshot | kimi-k2.6-20260420 | starved token budget (63K reasoning, content=None) — excluded |
|
| 23 |
-
|
| 24 |
-
4 of 5 clean reviews. Raw reviews: `review_*.md` in this directory.
|
| 25 |
-
|
| 26 |
-
## Verdict matrix
|
| 27 |
-
|
| 28 |
-
| ADR | GPT-5.5 | Gemini | DeepSeek | Grok | Consensus |
|
| 29 |
-
|---|---|---|---|---|---|
|
| 30 |
-
| 008 (Dr.GRPO+SDPO) | REJECT | REJECT | ACCEPT-w-FIXES | REJECT | **3 REJECT / 1 fix** |
|
| 31 |
-
| 009 (layered hints) | ACCEPT-w-FIXES | ACCEPT-w-FIXES | ACCEPT-w-FIXES | ACCEPT | **unanimous: fixable** |
|
| 32 |
-
| 010 (FeatureDeletion datagen) | REJECT | REJECT | REJECT | ACCEPT-w-FIXES | **3 REJECT / 1 fix** |
|
| 33 |
-
|
| 34 |
-
The REJECTs were not "the design is wrong" — every reviewer agreed the
|
| 35 |
-
architecture is sound. They were "the **accepted** status outruns the
|
| 36 |
-
**evidence**": gates marked `[x]` were satisfied by shape-checks and FakeSandbox
|
| 37 |
-
tautologies, not by the hard guarantee the gate language implied.
|
| 38 |
-
|
| 39 |
-
## Convergent findings (≥2 reviewers, ALL verified against code by the orchestrator)
|
| 40 |
-
|
| 41 |
-
| # | Finding | Reviewers | Status |
|
| 42 |
-
|---|---|---|---|
|
| 43 |
-
| 1 | SDPO alignment guard = shape-check only; hint-shifted tokens silently misalign → policy poisoning | 4/4 | **FIXED** (explicit alignment-index gather + strict-raise; tests rewritten) |
|
| 44 |
-
| 2 | Sandbox denylist trivially bypassable (`python -c`, abs paths, `sh -c`) + ADR-claimed scrub UNIMPLEMENTED | 4/4 | **FIXED** (`_scrub_tree` primary control implemented + tested) |
|
| 45 |
-
| 3 | Curriculum `int(reward>0)` logs 0.5 partial as full pass → premature retire | 3/4 | **FIXED** (fractional credit + tests) |
|
| 46 |
-
| 4 | `scale_rewards` assertion dup literal `("none","False","False")` | 4/4 | **FIXED** (case-insensitive) |
|
| 47 |
-
| 5 | CPU smoke is tautological — never enters `_compute_sdpo_loss` (early return) | 3/4 | **RE-SCOPED** (gate honestly reworded) |
|
| 48 |
-
| 6 | FakeSandbox validator tests prove plumbing, not real inversion | 3/4 | **RE-SCOPED** (= the `[~]` Docker gate) |
|
| 49 |
-
| 7 | HackMonitor is substring matcher, not AST-provenance as advertised | 2/4 | **RE-SCOPED + OPEN** (follow-up) |
|
| 50 |
-
| 8 | Validator gate-2/"deletion reachable" doesn't test reachability | 2/4 | **OPEN** (needs Docker materializers) |
|
| 51 |
-
| 9 | `shell=True` + node-id interpolation = injection + param-test breakage | 2/4 | **FIXED** (`shlex.quote`) |
|
| 52 |
-
| 10 | LLM-judge cache key non-deterministic (memory addr) + unbounded output | 2/4 | **FIXED** (addr-strip + version + clamp + atomic write) |
|
| 53 |
-
| 11 | KL estimator k1 never configured/asserted | 2/4 | **OPEN** (fidelity follow-up) |
|
| 54 |
-
| 12 | `reward_fn` zip truncates on length mismatch | 1 (GPT) | **FIXED** (length-guard) |
|
| 55 |
-
|
| 56 |
-
## What was fixed in this pass (all tested, 192 pass / 16 skip, 0 fail)
|
| 57 |
-
|
| 58 |
-
- **SDPO alignment** (`composer_trainer.py`): require `student_response_idx` /
|
| 59 |
-
`teacher_response_idx` from the collator; gather aligned post-hint logits
|
| 60 |
-
before JSD; strict-mode raises on missing/mismatched indices. The
|
| 61 |
-
silent-misalignment P0 (the only finding that would actually corrupt a run) is
|
| 62 |
-
closed and tested across different-length sequences.
|
| 63 |
-
- **Sandbox scrub** (`sandbox.py`): `boot()` physically removes byte-code/type
|
| 64 |
-
caches + `.git`/`.hg` — the ADR-claimed PRIMARY reward-hack control, previously
|
| 65 |
-
absent. Denylist re-documented as defense-in-depth + `shlex.quote` on node ids.
|
| 66 |
-
- **Curriculum** (`curriculum.py` + `env.py`): fractional pass credit; hacked /
|
| 67 |
-
guard-broken rollouts contribute 0 credit but count as exposures.
|
| 68 |
-
- **reward_fn** (`env.py`): length-guard against silent truncation.
|
| 69 |
-
- **LLM judge** (`hint_generator.py`): address-stripped + versioned cache key,
|
| 70 |
-
600-char output clamp, atomic disk write.
|
| 71 |
-
- **`make_dr_grpo_config`** (`composer_trainer.py`): fixed the duplicated
|
| 72 |
-
assertion literal.
|
| 73 |
-
|
| 74 |
-
## What remains OPEN (honest follow-ups, none corrupt a run)
|
| 75 |
-
|
| 76 |
-
1. **Live Docker substrate-inversion e2e** — the central `[~]` gate. Closes
|
| 77 |
-
findings #6, #8 (real reachability/provenance on a materialized repo) and the
|
| 78 |
-
"FakeSandbox is tautological" objection. **Blocked: no Docker in this env.**
|
| 79 |
-
2. **k1 KL estimator** assert/config vs TRL 1.5.0's actual GRPO KL branch (#11).
|
| 80 |
-
3. **HackMonitor → real AST provenance** check (#7).
|
| 81 |
-
4. **Hint-generator default layer routing** so style/communication sites reach
|
| 82 |
-
the judge (ADR-009 P1).
|
| 83 |
-
5. **Curriculum turn/think-token signals** for full Composer recipe fidelity.
|
| 84 |
-
|
| 85 |
-
All five are documented in the ADRs' "Post-acceptance cross-family review"
|
| 86 |
-
sections.
|
| 87 |
-
|
| 88 |
-
## Cost
|
| 89 |
-
|
| 90 |
-
~$2-4 OpenRouter (2 scatters: 5×8K + 3×16K tokens on a 25K-token corpus).
|
|
|
|
| 1 |
# Cross-family adversarial review — ADR-008 / 009 / 010 (2026-05-29)
|
| 2 |
|
| 3 |
+
> **📦 Archived (2026-06-08).** This cross-family review (SYNTHESIS + the four
|
| 4 |
+
> per-model reviews) has been moved to
|
| 5 |
+
> [`docs/_archive/reviews/cross-family-adr-008-009-010-2026-05-29/`](../../_archive/reviews/cross-family-adr-008-009-010-2026-05-29/).
|
| 6 |
+
> It is preserved verbatim for provenance — ADR-008 and ADR-012 cite this review
|
| 7 |
+
> directory — but is superseded as a status source by the accepted ADRs and the
|
| 8 |
+
> current [`BACKLOG.md`](../../../BACKLOG.md). See
|
| 9 |
+
> [`docs/_archive/README.md`](../../_archive/README.md) for the archive index.
|
docs/reviews/final-verify-deep-work-2026-05-29/SYNTHESIS.md
CHANGED
|
@@ -1,33 +1,8 @@
|
|
| 1 |
# Phase-8 final cross-family verify — deep-work-loop 2026-05-29
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
## Findings (all verified against code, all FIXED)
|
| 11 |
-
|
| 12 |
-
| # | Finding | Severity | Reviewers | Fix |
|
| 13 |
-
|---|---|---|---|---|
|
| 14 |
-
| 1 | SDPO sentinel-mask used only `student_response_valid`, ignored `teacher_response_valid` — a future divergent teacher tail would distill against clamped position 0 | P0 (latent) | GPT-5.5 | `aligned_mask = (s_idx>=0)&(t_idx>=0)&student_valid&teacher_valid` |
|
| 15 |
-
| 2 | Curriculum `_TaskStats` shared one `n_effort` denominator for both turns + think_tokens → corrupts a mean when only one signal is present | P1 (convergent) | GPT-5.5 + Gemini | separate `n_turns`/`n_think` counters |
|
| 16 |
-
| 3 | Docker E2E test ran `pip install pytest` under `--network none` → would fail exactly when the gate activates | P1 | GPT-5.5 | dropped pytest; stdlib `python -c` expression runner, no network needed |
|
| 17 |
-
| 4 | MMLU reward: `Answer: A or B` hedge parsed as `A` with `n_distinct=1` → full credit | P1 | GPT-5.5 | `_HEDGE_RE` adds the hedged second letter to the distinct set → multiple-answers penalty |
|
| 18 |
-
| 5 | HackMonitor read-markers bare-substring matched `import` in `important`, `cat` in `concatenate`; and scanned the submitted patch as read-evidence → false positives | P1 | GPT-5.5 | whole-word `_READ_VERB_RE` + exclude `submit_patch`/`patch`/`diff` payloads from BOTH layers |
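
The mask from finding 1 can be sketched in plain Python. This assumes a negative index (for example `-1`) is the sentinel for "no aligned position", which the `>=0` checks in the fix suggest; the real code presumably operates on tensors, and the function name here is illustrative only.

```python
def aligned_mask(s_idx, t_idx, student_valid, teacher_valid):
    """Keep a position only if both indices are real (>= 0) and both sides are valid."""
    return [
        s >= 0 and t >= 0 and sv and tv
        for s, t, sv, tv in zip(s_idx, t_idx, student_valid, teacher_valid)
    ]


mask = aligned_mask(
    s_idx=[0, 1, 2, -1],
    t_idx=[0, 1, -1, -1],
    student_valid=[True, True, True, False],
    teacher_valid=[True, False, True, False],
)
print(mask)
```

Position 1 is the case the original mask missed: the student side is valid but the teacher side is not, so the position has to be dropped.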
|
| 19 |
-
|
| 20 |
-
Finding 5's fix surfaced a follow-on: layer-1's signature matcher ALSO scanned
|
| 21 |
-
the patch payload (a legit patch mentioning `__pycache__` in a comment got
|
| 22 |
-
flagged). The regression test caught it; fixed by excluding the patch payload
|
| 23 |
-
from layer-1 too. Net: the monitor now only treats actual read ACTIONS as
|
| 24 |
-
provenance evidence, never the agent's own output patch.
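
A minimal sketch of the whole-word matching and payload exclusion described in finding 5. The verb list and the helper name `is_read_evidence` are assumptions for illustration; the actual `_READ_VERB_RE` in the repo may use a different pattern.

```python
import re

# Hypothetical verb list; \b anchors prevent matches inside longer words.
READ_VERB_RE = re.compile(r"\b(?:cat|import|open|read)\b")

# Actions whose text is the agent's own output, not a read of the repo.
PAYLOAD_ACTIONS = {"submit_patch", "patch", "diff"}


def is_read_evidence(action: str, text: str) -> bool:
    """Return True only for non-payload actions that contain a whole-word read verb."""
    if action in PAYLOAD_ACTIONS:
        return False
    return READ_VERB_RE.search(text) is not None


print(is_read_evidence("shell", "cat src/feature.py"))
print(is_read_evidence("shell", "it is important to concatenate the files"))
print(is_read_evidence("submit_patch", "import os"))
```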
|
| 25 |
-
|
| 26 |
-
## Outcome
|
| 27 |
-
|
| 28 |
-
All 5 findings fixed + 5 regression tests added. Final suite: **232 passed / 18
|
| 29 |
-
skipped, 0 failed**. Both the execution view (workers reported done) and the
|
| 30 |
-
review view (final verify findings all closed) confirm the loop is empty — the
|
| 31 |
-
two independent confirmations the deep-work-loop requires.
|
| 32 |
-
|
| 33 |
-
Raw reviews: `verify_gpt-5.5.md`, `verify_gemini-3.1-pro.md`.
|
|
|
|
| 1 |
# Phase-8 final cross-family verify — deep-work-loop 2026-05-29
|
| 2 |
|
| 3 |
+
> **📦 Archived (2026-06-08).** This Phase-8 final-verify review (SYNTHESIS + the
|
| 4 |
+
> per-model verifies) has been moved to
|
| 5 |
+
> [`docs/_archive/reviews/final-verify-deep-work-2026-05-29/`](../../_archive/reviews/final-verify-deep-work-2026-05-29/).
|
| 6 |
+
> It is preserved verbatim for provenance but is superseded as a status source by
|
| 7 |
+
> the accepted ADRs and the current [`BACKLOG.md`](../../../BACKLOG.md). See
|
| 8 |
+
> [`docs/_archive/README.md`](../../_archive/README.md) for the archive index.
|
framework/composer-replication-framework.md
CHANGED
|
@@ -41,7 +41,7 @@ From `01-composer-2.5.md`:
|
|
| 41 |
|
| 42 |
## How the 5 component pieces fit together
|
| 43 |
|
| 44 |
-
For the **rigorous integration architecture** — exact extension points in TRL (`GRPOTrainer._compute_loss` subclass), VeRL (`@register_adv_est` + `DataProto`), the OPSD loss `generalized_jsd_loss` lifted from `siyan-zhao/OPSD`, and the per-channel sequence diagrams — see [`docs/INTEGRATION_ARCHITECTURE.md`](docs/INTEGRATION_ARCHITECTURE.md). A working code skeleton with **38 passing unit tests** verifying the SDPO loss math, the trace-replay DPO-pair extraction, the data collator, and an end-to-end 5-step gradient run that decreases loss with all 3 channels active is at [`spikes/005-integrated-trainer-skeleton/`](spikes/005-integrated-trainer-skeleton/).
|
| 45 |
|
| 46 |
The high-level topology:
|
| 47 |
|
|
@@ -128,7 +128,7 @@ From `05-trace-replay-distillation.md`:
|
|
| 128 |
- Composer's hint-loss = **same-model self-teacher with hint context** pulls student at error sites (~1 extra forward pass / cheap, no API)
|
| 129 |
- Trace-replay-loss = **N external teachers** pull student at all sites (or high-uncertainty sites with VOI gating; ~$0.30/trace with gating per spike 001)
|
| 130 |
|
| 131 |
-
These are **complementary**, not competing. Both give per-step KL signals that bypass the long-horizon credit assignment problem, but they tap different supervision sources. v0.1 of the framework runs both simultaneously. See [`docs/COMPOSER_RECIPE_MAPPING.md`](docs/COMPOSER_RECIPE_MAPPING.md) for the precise mathematical distinction and the implementation-handle table.
|
| 132 |
|
| 133 |
**Cost mitigation** (the report does this analysis well):
|
| 134 |
- VOI gating (only query teachers when student entropy is high) → 60-80% savings
|
|
|
|
| 41 |
|
| 42 |
## How the 5 component pieces fit together
|
| 43 |
|
| 44 |
+
For the **rigorous integration architecture** — exact extension points in TRL (`GRPOTrainer._compute_loss` subclass), VeRL (`@register_adv_est` + `DataProto`), the OPSD loss `generalized_jsd_loss` lifted from `siyan-zhao/OPSD`, and the per-channel sequence diagrams — see [`docs/INTEGRATION_ARCHITECTURE.md`](../docs/INTEGRATION_ARCHITECTURE.md). A working code skeleton with **38 passing unit tests** verifying the SDPO loss math, the trace-replay DPO-pair extraction, the data collator, and an end-to-end 5-step gradient run that decreases loss with all 3 channels active is at [`spikes/005-integrated-trainer-skeleton/`](../spikes/005-integrated-trainer-skeleton/).
|
| 45 |
|
| 46 |
The high-level topology:
|
| 47 |
|
|
|
|
| 128 |
- Composer's hint-loss = **same-model self-teacher with hint context** pulls student at error sites (~1 extra forward pass / cheap, no API)
|
| 129 |
- Trace-replay-loss = **N external teachers** pull student at all sites (or high-uncertainty sites with VOI gating; ~$0.30/trace with gating per spike 001)
|
| 130 |
|
| 131 |
+
These are **complementary**, not competing. Both give per-step KL signals that bypass the long-horizon credit assignment problem, but they tap different supervision sources. v0.1 of the framework runs both simultaneously. See [`docs/COMPOSER_RECIPE_MAPPING.md`](../docs/COMPOSER_RECIPE_MAPPING.md) for the precise mathematical distinction and the implementation-handle table.
|
| 132 |
|
| 133 |
**Cost mitigation** (the report does this analysis well):
|
| 134 |
- VOI gating (only query teachers when student entropy is high) → 60-80% savings
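
The VOI gating idea can be sketched as an entropy threshold on the student's next-token distribution. The threshold value and the function names below are assumptions for illustration and do not come from the report or from spike 001.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_query_teachers(probs, threshold=1.0):
    """Query the external teachers only when the student is uncertain."""
    return entropy(probs) > threshold


confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
print(should_query_teachers(confident), should_query_teachers(uncertain))
```

How much this saves depends on how often the student's entropy exceeds the chosen threshold, so the threshold would need tuning against real traces.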
|
publications/HF_DISCUSSION_POST.md
CHANGED
|
@@ -12,15 +12,15 @@ I'm releasing this repo as a **pre-experimental methodology paper + integration
|
|
| 12 |
|
| 13 |
## What's in the box right now
|
| 14 |
|
| 15 |
-
**1. Methodology paper.** [`publications/PAPER_v0.md`](
|
| 16 |
|
| 17 |
-
**2. Composer 2.5 blog audit.** [`docs/COMPOSER_RECIPE_MAPPING.md`](docs/COMPOSER_RECIPE_MAPPING.md) — every claim tagged `[BLOG-VERIFIED]` / `[INFERRED]` / `[EXTRAPOLATED]`. Major finding: Cursor's "Targeted RL with Textual Feedback" is **mathematically the same as published SDPO** (Hübotter et al., ICLR 2026; [arXiv:2601.20802](https://arxiv.org/abs/2601.20802)) and OPSD (Zhao et al., [arXiv:2601.18734](https://arxiv.org/abs/2601.18734)), with **MIT-licensed reference code at [`siyan-zhao/OPSD`](https://github.com/siyan-zhao/OPSD)**. Cursor cites both papers in their blog's footnote 1. This was missed by the initial parallel-research dispatch I ran for this project — I only caught it when I read the blog directly.
|
| 18 |
|
| 19 |
-
**3. Integration architecture doc.** [`docs/INTEGRATION_ARCHITECTURE.md`](docs/INTEGRATION_ARCHITECTURE.md) — verified extension points (via [DeepWiki](https://deepwiki.com/) audits) for TRL, VeRL, TorchForge, Monarch, OpenEnv. Sequence diagrams. The unified loss form `total = grpo + α·sdpo + β·trace_replay_dpo` and an argument that the three channels don't compete for any shared resource.
|
| 20 |
|
| 21 |
-
**4. Empirical economic feasibility result for the novel channel.** [`spikes/001-teacher-replay-cost/verdict.md`](spikes/001-teacher-replay-cost/verdict.md) — 150 real OpenRouter calls (Opus 4.7 + GPT-5 + DeepSeek V4 Pro on 50 synthetic agentic-coding states), 0 errors, **mean per-trace cost $0.98 with 5× headroom on the $5 cap, p95 step latency 20.5s**. Reproducible: set `OPENROUTER_API_KEY` and run the three scripts.
|
| 22 |
|
| 23 |
-
**5. Working code skeleton with 38 passing unit tests.** [`spikes/005-integrated-trainer-skeleton/`](spikes/005-integrated-trainer-skeleton/) — ports of OPSD's `generalized_jsd_loss`, the teacher-replay DPO-pair extractor, the data collator, both a `ComposerReplicationTrainer(GRPOTrainer)` for TRL and a `@register_adv_est("grpo_composer")` stub for VeRL. The end-to-end loss-composition smoke test runs all three channels on a 10K-parameter custom model and confirms a 5-step train run *decreases* loss with α=0.1, β=0.05 — the channels don't fight each other.
|
| 24 |
|
| 25 |
```
|
| 26 |
$ python3 -m pytest tests/ -v
|
|
@@ -44,7 +44,7 @@ Translation: *I have a framework that compiles, integration that's verified at t
|
|
| 44 |
|
| 45 |
2. **Adjacent-work pointers.** Does multi-teacher trace-replay-with-disagreement-as-DPO-signal already exist in published work I missed? My survey (rStar / Math-Shepherd / OmegaPRM / Magpie / MoA) didn't find it but absence of evidence isn't evidence of absence.
|
| 46 |
|
| 47 |
-
3. **Reward-hacking ideas for the v0.1 environment.** Cursor mentions agents decompiling Java bytecode and reverse-engineering Python type-checking caches to recover deleted features in their Feature Deletion env. Their mitigation is opaque ("agentic monitoring tools"). I have proposals in [`PAPER_v0.md` §8](
|
| 48 |
|
| 49 |
4. **Collaboration interest for spike 002.** If you have a Modal account or a small GPU budget and want to run the trace-collection experiments — particularly the head-to-head TRL-vs-PRIME-RL comparison — I'd happily co-author a follow-up paper. Total budget for spikes 002–004 is ~$500 + a couple of weeks of wallclock.
|
| 50 |
|
|
|
|
| 12 |
|
| 13 |
## What's in the box right now
|
| 14 |
|
| 15 |
+
**1. Methodology paper.** [`publications/PAPER_v0.md`](PAPER_v0.md) — longform writeup of the framework, audit of Cursor's blog vs secondary-source extrapolations, integration architecture across TRL/VeRL/OpenEnv, and a risk-ordered spike plan.
|
| 16 |
|
| 17 |
+
**2. Composer 2.5 blog audit.** [`docs/COMPOSER_RECIPE_MAPPING.md`](../docs/COMPOSER_RECIPE_MAPPING.md) — every claim tagged `[BLOG-VERIFIED]` / `[INFERRED]` / `[EXTRAPOLATED]`. Major finding: Cursor's "Targeted RL with Textual Feedback" is **mathematically the same as published SDPO** (Hübotter et al., ICLR 2026; [arXiv:2601.20802](https://arxiv.org/abs/2601.20802)) and OPSD (Zhao et al., [arXiv:2601.18734](https://arxiv.org/abs/2601.18734)), with **MIT-licensed reference code at [`siyan-zhao/OPSD`](https://github.com/siyan-zhao/OPSD)**. Cursor cites both papers in their blog's footnote 1. This was missed by the initial parallel-research dispatch I ran for this project — I only caught it when I read the blog directly.
|
| 18 |
|
| 19 |
+
**3. Integration architecture doc.** [`docs/INTEGRATION_ARCHITECTURE.md`](../docs/INTEGRATION_ARCHITECTURE.md) — verified extension points (via [DeepWiki](https://deepwiki.com/) audits) for TRL, VeRL, TorchForge, Monarch, OpenEnv. Sequence diagrams. The unified loss form `total = grpo + α·sdpo + β·trace_replay_dpo` and an argument that the three channels don't compete for any shared resource.
|
| 20 |
|
| 21 |
+
**4. Empirical economic feasibility result for the novel channel.** [`spikes/001-teacher-replay-cost/verdict.md`](../spikes/001-teacher-replay-cost/verdict.md) — 150 real OpenRouter calls (Opus 4.7 + GPT-5 + DeepSeek V4 Pro on 50 synthetic agentic-coding states), 0 errors, **mean per-trace cost $0.98 with 5× headroom on the $5 cap, p95 step latency 20.5s**. Reproducible: set `OPENROUTER_API_KEY` and run the three scripts.
|
| 22 |
|
| 23 |
+
**5. Working code skeleton with 38 passing unit tests.** [`spikes/005-integrated-trainer-skeleton/`](../spikes/005-integrated-trainer-skeleton/) — ports of OPSD's `generalized_jsd_loss`, the teacher-replay DPO-pair extractor, the data collator, both a `ComposerReplicationTrainer(GRPOTrainer)` for TRL and a `@register_adv_est("grpo_composer")` stub for VeRL. The end-to-end loss-composition smoke test runs all three channels on a 10K-parameter custom model and confirms a 5-step train run *decreases* loss with α=0.1, β=0.05 — the channels don't fight each other.
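
For reference, the unified loss form from item 3, with the weights quoted in item 5, amounts to a weighted sum of three scalars. This is a sketch on plain floats with an illustrative function name; the skeleton itself computes each channel from model outputs.

```python
def total_loss(grpo, sdpo, trace_replay_dpo, alpha=0.1, beta=0.05):
    """total = grpo + alpha * sdpo + beta * trace_replay_dpo"""
    return grpo + alpha * sdpo + beta * trace_replay_dpo


print(total_loss(grpo=1.0, sdpo=2.0, trace_replay_dpo=4.0))
```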
|
| 24 |
|
| 25 |
```
|
| 26 |
$ python3 -m pytest tests/ -v
|
|
|
|
| 44 |
|
| 45 |
2. **Adjacent-work pointers.** Does multi-teacher trace-replay-with-disagreement-as-DPO-signal already exist in published work I missed? My survey (rStar / Math-Shepherd / OmegaPRM / Magpie / MoA) didn't find it but absence of evidence isn't evidence of absence.
|
| 46 |
|
| 47 |
+
3. **Reward-hacking ideas for the v0.1 environment.** Cursor mentions agents decompiling Java bytecode and reverse-engineering Python type-checking caches to recover deleted features in their Feature Deletion env. Their mitigation is opaque ("agentic monitoring tools"). I have proposals in [`PAPER_v0.md` §8](PAPER_v0.md#8-reward-hacking-safeguards-proposed-for-v01) but more eyes welcome.
|
| 48 |
|
| 49 |
4. **Collaboration interest for spike 002.** If you have a Modal account or a small GPU budget and want to run the trace-collection experiments — particularly the head-to-head TRL-vs-PRIME-RL comparison — I'd happily co-author a follow-up paper. Total budget for spikes 002–004 is ~$500 + a couple of weeks of wallclock.
|
| 50 |
|