Wave 16: install ergonomics + gradient evidence + SDPO end-to-end example

Five-track polish wave with mandatory cross-model adversarial review
(deepseek-v4-pro + gemini-3.1-pro-preview, route-fidelity-verified via
direct-urllib scatter; reviewers caught 2 real bugs that BOTH made it
into the staged diff and would have shipped without the review):

Wave 16a — Install ergonomics (extras pinning)
pyproject.toml dropped three extras whose lower bounds were
unsatisfiable on public PyPI and would have blocked any fresh user:
- monarch>=0.4.1: PyPI 'monarch' tops out at 0.1.11 and is unrelated
to Meta's actor framework; the real Monarch ships as
torchmonarch-nightly with platform constraints.
- prime-rl>=0.5: 'prime-rl' is not registered on PyPI at all; Prime
Intellect publishes from source only.
- data-juicer>=1.0: 'data-juicer' is not registered on PyPI; the
closest match (py-data-juicer) has broken transitive deps.
All three integrations import lazily, so the framework still loads
cleanly without them. docs/TROUBLESHOOTING.md gets a new section #10
with from-source install pointers.
README.md keyword list trimmed of three misleading discovery facets
(prime-rl / monarch / torchforge) flagged by Gemini reviewer.
Verified: `uv pip install -e ".[diloco,replay,replaysim,train,dev]"`
succeeds on a fresh venv.
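
The lazy-import behaviour described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the framework's actual code: the helper `optional_import` and the toy function `normalize_with_dj` are hypothetical names, and only the `skip_dj` flag and the `data_juicer` module name come from the docs in this commit.

```python
import importlib
from types import ModuleType


def optional_import(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency at call time, not at package import time.

    Hypothetical helper for illustration; the framework's real lazy-import
    code may look different.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is an optional dependency. {install_hint}"
        ) from exc


def normalize_with_dj(records: list[dict], skip_dj: bool = False) -> list[dict]:
    """Toy stand-in for an integration that needs the extra on one path only."""
    if skip_dj:
        return records  # cheap path: no optional dependency required
    dj = optional_import(
        "data_juicer",
        "Install it from source; see docs/TROUBLESHOOTING.md section 10.",
    )
    raise NotImplementedError(f"full path would use {dj.__name__} here")
```

Because the import happens inside the function, importing the package itself never fails when the extra is missing; only the code path that truly needs the library raises, and it does so with an actionable message.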

Wave 16b — Gradient-flow tests (4 new tests, all PASS)
composer_replication/tests/test_gradient_flow.py verifies that
compose_loss channels actually route gradients to model parameters
under autograd, AND that disabled channels produce zero side-effects:
- test_alpha_sdpo_routes_grad_to_params (alpha=1.0 → finite |grad|>0)
- test_beta_replay_routes_grad_to_params (beta=1.0 → finite |grad|>0)
- test_alpha_zero_blocks_sdpo_grad (alpha=0 with vs without SDPO
inputs produces BIT-IDENTICAL parameter gradients — catches a
class of phantom-gradient leak from disabled channels)
- test_taid_grad_flows_through_sdpo_path (Wave 15 TAID rewrite
remains differentiable under autograd)
Test count: 176 → 180 passed / 8 skipped (all green; 1 flaky
serverless test predates Wave 16 and was not introduced here).
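
One way a disabled channel can still leak, sketched in plain Python rather than PyTorch: multiplying a channel's value by a zero weight is not the same as skipping the channel, because IEEE-754 arithmetic defines 0 × inf as nan. The function names below are hypothetical and are not meant to mirror the internals of `compose_loss`.

```python
import math


def total_scaled(lm_ce: float, sdpo: float, alpha: float) -> float:
    """Always adds alpha * sdpo, even when alpha is 0 (leak-prone)."""
    return lm_ce + alpha * sdpo


def total_gated(lm_ce: float, sdpo: float, alpha: float) -> float:
    """Skips the SDPO term entirely when the channel is disabled."""
    if alpha == 0.0:
        return lm_ce
    return lm_ce + alpha * sdpo


# A disabled channel whose raw value happens to overflow.
bad_sdpo = float("inf")
leaky = total_scaled(2.0, bad_sdpo, alpha=0.0)  # 2.0 + 0.0 * inf
clean = total_gated(2.0, bad_sdpo, alpha=0.0)   # SDPO term never touched

print(math.isnan(leaky), clean)
```

With finite values the two functions agree, so a check on loss values alone would not tell them apart. The with-versus-without comparison in test_alpha_zero_blocks_sdpo_grad is aimed at the same family of problems at the gradient level.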

Wave 16c — INTEGRATION_RECIPES.md signature collapse
Removed the 20-line `def compose_loss(...)` parameter reproduction
in the TL;DR; replaced with a single cross-link to API_REFERENCE.md.
Per-recipe USE examples (literal kwarg values demonstrating each
recipe's specific configuration) preserved — those carry the
recipe-specific value-add, not signature drift.
API_REFERENCE.md gets an explicit `<a id="compose_loss"></a>` anchor
so the cross-link resolves cleanly (Gemini reviewer flagged the
GFM-auto-slug as brittle).
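
To see why an auto-generated slug is brittle for this heading, here is a rough approximation of GitHub-style slugging. The function `approx_gfm_slug` is a hypothetical simplification (lowercase, drop punctuation, spaces to hyphens), not the exact algorithm any renderer uses.

```python
import re


def approx_gfm_slug(heading_text: str) -> str:
    """Rough approximation of a GitHub-style heading slug (not the exact algorithm)."""
    text = heading_text.strip().lower()
    text = re.sub(r"[^\w\- ]", "", text)  # drop punctuation such as ( ) , * . > `
    return text.replace(" ", "-")


heading = "`compose_loss(model, inputs, *, ...) -> LossComponents`"
slug = approx_gfm_slug(heading)

# The auto-slug is derived from the whole signature, so editing the heading moves it.
print(slug)
print(slug == "compose_loss")
```

Under this approximation the slug carries the full signature text, so any change to the heading (a renamed return type, an added parameter) silently changes the link target. An explicit `id="compose_loss"` anchor stays stable.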

Wave 16d — Reconnaissance-doc currency audit
WAVE_16_RECON_AUDIT.md catalogs every code reference in 7 research
docs against the actual symbols, file paths, and kwargs in the
package today. Audit table buckets findings as KEEP / FIX / FLAG /
DELETE. Bucket totals: 24 KEEP + 2 FIX (in RL_FRAMEWORKS_LANDSCAPE)
+ 5 FLAG (proposal-shaped sketches that don't match what was built;
inline `<!-- AUDIT: stale_<short> -->` markers added) + 0 DELETE. Most
consequential FLAG: the DPOPair shape mismatch in
REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md, where the sketched _to_dj/_from_dj
round-trip won't work as written against the realised TypedDict.
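
Since every FLAG is recorded as an inline HTML comment, the flags can be listed mechanically. The sketch below is an assumption about how one might do that with a regular expression; `find_audit_flags` is a hypothetical helper and the sample text is abbreviated, not copied from the docs.

```python
import re

# Matches the opening of an audit comment and captures its stale_* tag.
AUDIT_MARKER = re.compile(r"<!--\s*AUDIT:\s*(stale_\w+)")


def find_audit_flags(markdown_text: str) -> list[str]:
    """Return the stale_* tag of every AUDIT comment found in a document."""
    return AUDIT_MARKER.findall(markdown_text)


sample = """### 3.4 Package layout
<!-- AUDIT: stale_serverless_layout (flatter layout shipped than
     proposed; see realised module names) -->
## 6. TraceIngester sketch
<!-- AUDIT: stale_ingester_paths_and_naming (realised API is named
     differently) -->
"""

print(find_audit_flags(sample))
```

Running a scan like this over docs/research/ would give a quick cross-check that the number of inline markers matches the FLAG count in the audit table.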

Wave 16e — examples/gsm8k_grpo_with_sdpo/ (real model, CPU, 16.5s)
New sibling to examples/gsm8k_grpo/ that demonstrates the SDPO
hint-distillation column firing end-to-end on a real
Qwen2.5-0.5B-Instruct, on CPU. Three acceptance assertions verify the channel
actually fires:
✓ sdpo_jsd > 0 at every step (min=0.1358, max=0.1429)
✓ total != lm_ce at every step (channel actually contributes)
✓ |grad| > 0 and finite at every step (autograd flows correctly)
Total loss decreased 5.98 → 2.46 across 5 SGD steps in 16.5s wall-clock
after a 1.7s model load. Uses `tokenizer.apply_chat_template` with Qwen's
actual ChatML markers (<|im_start|>/<|im_end|>). The initial draft used raw
<|system|>/<|end|> strings, which the DeepSeek reviewer correctly flagged
would tokenize as 11 punctuation tokens, feeding the model nonsense. Fixed
before commit; the SDPO signal is now ~4.4× stronger (0.1429 vs 0.0326)
because the model sees real ChatML.
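
For readers unfamiliar with the format, the ChatML layout that the chat template produces looks roughly like the hand-rolled rendering below. This is an illustration only: `to_chatml` is a hypothetical helper, the example messages are invented, and real code should keep calling `tokenizer.apply_chat_template`, which may also handle model-specific details such as default system prompts.

```python
def to_chatml(messages: list[dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Hand-rolled ChatML rendering, for illustration only."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a careful math tutor."},
    {"role": "user", "content": "What is 12 * 7?"},
]
prompt = to_chatml(messages)
print(prompt)
```

The point of the fix is that `<|im_start|>` and `<|im_end|>` are markers the model was trained on, whereas ad-hoc strings like `<|system|>` are split into ordinary punctuation tokens and carry no role information.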

Cross-model review (mandatory pre-push, route-fidelity-verified)
Used the model-roster skill's scatter-via-urllib pattern (PR#2's
delegate_task per-task override is unreliable for some families).
Both reviewers received the staged diff with no orchestrator
reasoning leaked:
- deepseek/deepseek-v4-pro (math + test honesty + spec drift):
caught the ChatML marker bug + tokenizer-alignment concern
+ bit-exact-grad-equality fragility note + 3 minor nits.
Cost: $0.060, 211s.
- google/gemini-3.1-pro-preview (user journey + docs consistency):
caught the README keyword leak + the brittle GFM anchor + the
docs/plans/wave-16-plan.md hygiene nit. Cost: $0.075, 35s.
Both BLOCKERs and 2 important issues were fixed before this commit. Two
other reviewer "BLOCKERs" were verified as false positives
(LossComponents.detached exists; remaining alpha_sdpo refs in
INTEGRATION_RECIPES are intentional USE examples per the 16c task spec).

Methodological note for Wave 17
The cross-model-review step caught two ship-blockers (README keyword
leak + ChatML marker bug) that no amount of orchestrator self-review
surfaced. The Wave 7-11 lesson (adversarial review is mandatory before
any public push, even when the work feels clean) held again this wave.
The temptation to skip it under final-push budget pressure was real;
this commit message exists in part as the audit trail for why we didn't.

Files changed (15)

.gitignore +6 -0
README.md +0 -3
composer_replication/tests/test_gradient_flow.py +279 -0
docs/API_REFERENCE.md +1 -0
docs/INTEGRATION_RECIPES.md +11 -20
docs/TROUBLESHOOTING.md +74 -0
docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md +11 -0
docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md +14 -0
docs/research/RL_FRAMEWORKS_LANDSCAPE.md +12 -3
docs/research/SELF_DISTILLATION_LANDSCAPE.md +8 -0
docs/research/TRACE_SOURCE_RECONNAISSANCE.md +7 -0
docs/research/WAVE_16_RECON_AUDIT.md +179 -0
examples/gsm8k_grpo_with_sdpo/README.md +72 -0
examples/gsm8k_grpo_with_sdpo/run.py +288 -0
pyproject.toml +27 -9

.gitignore CHANGED

@@ -29,6 +29,12 @@ examples/*/checkpoints/
 spikes/*/output/
 spikes/*/checkpoints/
 # Model files (HF native; never commit raw weights to a methodology repo)
 *.safetensors
 *.bin

 spikes/*/output/
 spikes/*/checkpoints/
+# Run logs (re-generated on every run.py invocation)
+examples/*/run.log
+spikes/*/results/run.log
+spikes/*/results/test_strict.log
+spikes/*/results/
 # Model files (HF native; never commit raw weights to a methodology repo)
 *.safetensors
 *.bin

README.md CHANGED

@@ -14,12 +14,9 @@ tags:
   - grpo
   - dapo
   - diloco
-  - prime-rl
   - openenv
   - trl
   - verl
-  - monarch
-  - torchforge
   - research
   - methodology
 pretty_name: "Composer 2.5 Replication Framework — Research Synthesis"

   - grpo
   - dapo
   - diloco
   - openenv
   - trl
   - verl
   - research
   - methodology
 pretty_name: "Composer 2.5 Replication Framework — Research Synthesis"

composer_replication/tests/test_gradient_flow.py ADDED

	@@ -0,0 +1,279 @@

+"""Gradient-flow tests for compose_loss channels (Wave 16b).
+Wave 14-15 verified compose_loss returns correct numeric values and that
+channel disables behave correctly. This file closes the gap by verifying
+that gradients actually flow back through each enabled channel and reach
+model parameters when the channel is on, AND that disabled channels
+produce zero side-effects on the autograd graph.
+Coverage:
+    1. test_alpha_sdpo_routes_grad_to_params
+       — alpha_sdpo=1.0 + SDPO inputs => non-zero finite grads on params
+    2. test_beta_replay_routes_grad_to_params
+       — beta_replay=1.0 + DPO inputs => non-zero finite grads on params
+    3. test_alpha_zero_blocks_sdpo_grad
+       — alpha_sdpo=0.0: SDPO inputs present vs absent yields BIT-IDENTICAL
+         param.grad on every parameter (catches phantom-gradient leaks
+         from disabled channels)
+    4. test_taid_grad_flows_through_sdpo_path
+       — sdpo_wrapper="taid", taid_t=0.5 still routes grads through
+         the SDPO channel under autograd
+Same TinyLM scaffold as test_compose_loss_integration.py — no HF / TRL,
+all tests run in milliseconds.
+"""
+from __future__ import annotations
+import math
+import torch
+import torch.nn as nn
+from composer_replication import compose_loss
+# ----------------------------------------------------------------------
+# Tiny LM stand-in (mirrors test_compose_loss_integration.py)
+# ----------------------------------------------------------------------
+class TinyLM(nn.Module):
+    """Minimal nn.Module with HF-style ``model(input_ids=...).logits`` API."""
+    def __init__(self, vocab: int = 32, hidden: int = 16, seed: int = 0):
+        super().__init__()
+        torch.manual_seed(seed)
+        self.embed = nn.Embedding(vocab, hidden)
+        self.fc = nn.Linear(hidden, hidden)
+        self.head = nn.Linear(hidden, vocab)
+    def forward(self, input_ids: torch.Tensor):
+        h = torch.tanh(self.fc(self.embed(input_ids)))
+        logits = self.head(h)
+        class _Out:
+            pass
+        out = _Out()
+        out.logits = logits
+        return out
+# ----------------------------------------------------------------------
+# Batch fixtures (mirror test_compose_loss_integration.py shape)
+# ----------------------------------------------------------------------
+VOCAB = 32
+B = 2
+T = 8
+def _make_inputs(seed: int = 7, *, with_sdpo: bool, with_dpo: bool) -> dict:
+    """Build a deterministic input batch with optional channel inputs.
+    SDPO and DPO inputs can be independently included or excluded so we
+    can exercise the channel-disable code paths cleanly.
+    """
+    g = torch.Generator().manual_seed(seed)
+    inputs: dict[str, torch.Tensor] = {
+        "input_ids": torch.randint(0, VOCAB, (B, T), generator=g),
+        "response_mask": torch.zeros(B, T, dtype=torch.long),
+    }
+    inputs["response_mask"][:, T // 2:] = 1
+    if with_sdpo:
+        inputs["ctx_teacher_input_ids"] = torch.randint(0, VOCAB, (B, T), generator=g)
+        inputs["sdpo_loss_mask"] = torch.zeros(B, T, dtype=torch.long)
+        inputs["sdpo_loss_mask"][:, T // 2:] = 1
+    if with_dpo:
+        inputs["dpo_chosen_input_ids"] = torch.randint(0, VOCAB, (B, T), generator=g)
+        inputs["dpo_chosen_response_mask"] = torch.ones(B, T, dtype=torch.long)
+        inputs["dpo_rejected_input_ids"] = torch.randint(0, VOCAB, (B, T), generator=g)
+        inputs["dpo_rejected_response_mask"] = torch.ones(B, T, dtype=torch.long)
+        inputs["dpo_chosen_ref_logprobs"] = torch.randn(B, generator=g)
+        inputs["dpo_rejected_ref_logprobs"] = torch.randn(B, generator=g)
+    return inputs
+def _grad_norm(model: nn.Module) -> float:
+    """Sum of |grad| across all params with non-None grad."""
+    return sum(
+        p.grad.detach().abs().sum().item()
+        for p in model.parameters()
+        if p.grad is not None
+    )
+def _grad_is_finite(model: nn.Module) -> bool:
+    """All param grads are finite (no inf, no nan)."""
+    for p in model.parameters():
+        if p.grad is None:
+            continue
+        if not torch.isfinite(p.grad).all():
+            return False
+    return True
+def _model() -> TinyLM:
+    """Fresh TinyLM with deterministic init."""
+    return TinyLM(vocab=VOCAB, hidden=16, seed=0)
+# ----------------------------------------------------------------------
+# Test 1 — SDPO channel routes grads to params when alpha_sdpo > 0
+# ----------------------------------------------------------------------
+def test_alpha_sdpo_routes_grad_to_params():
+    """When alpha_sdpo > 0 and SDPO inputs are present, calling
+    out.total.backward() must produce non-zero finite gradients on
+    model parameters.
+    """
+    model = _model()
+    inputs = _make_inputs(with_sdpo=True, with_dpo=False)
+    out = compose_loss(
+        model,
+        inputs,
+        alpha_sdpo=1.0,
+        beta_replay=0.0,
+    )
+    # Sanity: SDPO actually fired (channel is non-zero).
+    assert float(out.sdpo_jsd) != 0.0, (
+        "alpha_sdpo=1.0 with SDPO inputs should produce a non-zero sdpo_jsd; "
+        f"got {float(out.sdpo_jsd)}"
+    )
+    out.total.backward()
+    g = _grad_norm(model)
+    assert g > 0.0, f"Expected non-zero grad sum from SDPO channel; got {g}"
+    assert math.isfinite(g), f"Grad sum is not finite: {g}"
+    assert _grad_is_finite(model), "Some grads are inf/nan"
+# ----------------------------------------------------------------------
+# Test 2 — Replay-DPO channel routes grads to params when beta_replay > 0
+# ----------------------------------------------------------------------
+def test_beta_replay_routes_grad_to_params():
+    """When beta_replay > 0 and DPO inputs are present, backward must
+    produce non-zero finite gradients on model parameters.
+    Note: response_mask is set to all-zeros so the LM-CE channel is
+    exactly zero — any non-zero grad must come from the DPO channel.
+    """
+    model = _model()
+    inputs = _make_inputs(with_sdpo=False, with_dpo=True)
+    # Zero out response_mask so LM-CE contributes nothing — isolates DPO.
+    inputs["response_mask"] = torch.zeros(B, T, dtype=torch.long)
+    out = compose_loss(
+        model,
+        inputs,
+        alpha_sdpo=0.0,
+        beta_replay=1.0,
+    )
+    assert float(out.lm_ce) == 0.0, "LM-CE should be zero with empty response_mask"
+    assert float(out.trace_replay_dpo) != 0.0, (
+        "beta_replay=1.0 with DPO inputs should produce a non-zero "
+        f"trace_replay_dpo; got {float(out.trace_replay_dpo)}"
+    )
+    out.total.backward()
+    g = _grad_norm(model)
+    assert g > 0.0, f"Expected non-zero grad sum from DPO channel; got {g}"
+    assert math.isfinite(g), f"Grad sum is not finite: {g}"
+    assert _grad_is_finite(model), "Some grads are inf/nan"
+# ----------------------------------------------------------------------
+# Test 3 — Disabled SDPO channel produces ZERO side-effects on autograd
+# ----------------------------------------------------------------------
+def test_alpha_zero_blocks_sdpo_grad():
+    """With alpha_sdpo=0.0, providing SDPO inputs vs omitting them must
+    produce bit-identical parameter gradients.
+    This catches a class of bug where a disabled channel leaks a phantom
+    contribution into the autograd graph (e.g. if the SDPO branch ran a
+    forward pass even when alpha=0 and somehow scaled the result by
+    alpha=0 incorrectly).
+    """
+    inputs_with_sdpo = _make_inputs(with_sdpo=True, with_dpo=False)
+    inputs_no_sdpo = _make_inputs(with_sdpo=False, with_dpo=False)
+    # Trial A: SDPO inputs present, alpha=0 — channel should be silent.
+    model_a = _model()
+    out_a = compose_loss(model_a, inputs_with_sdpo, alpha_sdpo=0.0, beta_replay=0.0)
+    out_a.total.backward()
+    grads_a = {
+        name: p.grad.detach().clone() if p.grad is not None else None
+        for name, p in model_a.named_parameters()
+    }
+    # Trial B: SDPO inputs absent, alpha=0.
+    model_b = _model()  # Same seed -> bit-identical init.
+    out_b = compose_loss(model_b, inputs_no_sdpo, alpha_sdpo=0.0, beta_replay=0.0)
+    out_b.total.backward()
+    grads_b = {
+        name: p.grad.detach().clone() if p.grad is not None else None
+        for name, p in model_b.named_parameters()
+    }
+    # Bit-identical grads on every parameter.
+    assert set(grads_a.keys()) == set(grads_b.keys())
+    for name in grads_a:
+        ga, gb = grads_a[name], grads_b[name]
+        if ga is None and gb is None:
+            continue
+        assert ga is not None and gb is not None, (
+            f"Param {name}: grad_a={ga is not None}, grad_b={gb is not None}"
+        )
+        # atol=0, rtol=0 -> bit-exact equality. SDPO inputs with alpha=0
+        # must not perturb the autograd graph by even one ULP.
+        assert torch.equal(ga, gb), (
+            f"Param {name}: disabled SDPO channel leaked phantom gradient. "
+            f"|diff|.max()={float((ga - gb).abs().max())}"
+        )
+# ----------------------------------------------------------------------
+# Test 4 — TAID-wrapped SDPO channel still routes grads under autograd
+# ----------------------------------------------------------------------
+def test_taid_grad_flows_through_sdpo_path():
+    """The Wave 15 TAID rewrite (logit-space mix, current-student-detached
+    anchor) must remain differentiable. With sdpo_wrapper='taid' and
+    taid_t=0.5, backward must produce non-zero finite gradients on
+    model parameters.
+    """
+    model = _model()
+    inputs = _make_inputs(with_sdpo=True, with_dpo=False)
+    out = compose_loss(
+        model,
+        inputs,
+        alpha_sdpo=1.0,
+        beta_replay=0.0,
+        sdpo_wrapper="taid",
+        taid_t=0.5,
+    )
+    assert float(out.sdpo_jsd) != 0.0, (
+        f"taid_t=0.5 should still produce a non-zero sdpo_jsd; "
+        f"got {float(out.sdpo_jsd)}"
+    )
+    out.total.backward()
+    g = _grad_norm(model)
+    assert g > 0.0, (
+        f"Expected non-zero grad sum from TAID-wrapped SDPO channel; got {g}"
+    )
+    assert math.isfinite(g), f"Grad sum is not finite: {g}"
+    assert _grad_is_finite(model), "Some grads are inf/nan"

docs/API_REFERENCE.md CHANGED

@@ -103,6 +103,7 @@ components.total.backward()
 ```
 ### `compose_loss(model, inputs, *, ...) -> LossComponents`
 ```python
 def compose_loss(

 ```
 ### `compose_loss(model, inputs, *, ...) -> LossComponents`
+<a id="compose_loss"></a>
 ```python
 def compose_loss(

docs/INTEGRATION_RECIPES.md CHANGED

@@ -55,31 +55,22 @@ total_loss = grpo_loss
 This is implemented once, in
 [`composer_replication/loss.py::compose_loss`](../composer_replication/loss.py),
 and re-used by every recipe via the kwargs documented in
-[`API_REFERENCE.md`](API_REFERENCE.md). The verified surface is:
 ```python
-def compose_loss(
-    model,
-    inputs,
-    *,
-    alpha_sdpo: float = 0.1,
-    beta_replay: float = 0.05,
-    sdpo_jsd_beta: float = 0.5,
-    sdpo_temperature: float = 1.0,
-    sdpo_token_clip: float | None = None,
-    replay_dpo_beta: float = 0.1,
-    # ADR-007 extensions
-    dpo_variant: Literal["dpo", "simpo"] = "dpo",
-    sdpo_wrapper: Literal["none", "taid", "entropy_opd"] = "none",
-    taid_t: float | None = None,
-    simpo_beta:  float = 2.0,
-    simpo_gamma: float = 1.0,
-    entropy_opd_h_max: float | None = None,
-) -> torch.Tensor: ...
 ```
 All five recipes below either call `compose_loss` directly or call a
-thin per-framework adapter that forwards these kwargs unchanged.
 ---

 This is implemented once, in
 [`composer_replication/loss.py::compose_loss`](../composer_replication/loss.py),
 and re-used by every recipe via the kwargs documented in
+[`API_REFERENCE.md`](API_REFERENCE.md). The full signature — including
+all ADR-007 channel-2/3 knobs (`dpo_variant`, `sdpo_wrapper`, `taid_t`,
+`simpo_beta`/`simpo_gamma`, `entropy_opd_h_max`, …) — is the
+single source of truth in
+[API_REFERENCE.md § `compose_loss`](API_REFERENCE.md#compose_loss).
+The conceptual call shape is just:
 ```python
+compose_loss(model, inputs, **kwargs)  # see API_REFERENCE.md#compose_loss for full signature
 ```
 All five recipes below either call `compose_loss` directly or call a
+thin per-framework adapter that forwards these kwargs unchanged. Each
+recipe's **§5 Distillation-loss wiring** documents the kwargs *that
+recipe* uses by default and why; refer back to API_REFERENCE.md for
+defaults, types, and which kwargs are mutually exclusive.
 ---

docs/TROUBLESHOOTING.md CHANGED

@@ -759,6 +759,80 @@ pytest composer_replication/recipes/prime_rl/tests/test_composer_loss.py::test_r
 ---
 ## How to file a bug report
 If you've read the relevant section above and your problem persists,

 ---
+### 10. `monarch` / `data-juicer` / `prime-rl` install (Wave 16)
+**SYMPTOM.** `pip install -e ".[monarch]"`, `pip install -e ".[prime-rl]"`,
+or `pip install -e ".[replaysim]"` fails immediately with a uv/pip
+resolver error similar to:
+```
+× No solution found when resolving dependencies:
+  ╰─▶ Because only monarch<=0.1.11 is available and
+      composer-replication[monarch] depends on monarch>=0.4.1, we can
+      conclude that composer-replication[monarch]'s requirements are
+      unsatisfiable.
+```
+**DIAGNOSIS.** Three upstream packages the framework integrates with are
+not currently pip-installable in their advertised versions:
+1. **Meta's Monarch** is published on PyPI as
+   `torchmonarch-nightly` (nightly wheels with platform constraints), not
+   as `monarch`. The PyPI name `monarch` is unrelated to Meta's actor
+   framework and tops out at `0.1.11`.
+2. **Prime Intellect's prime-rl** is not registered on PyPI at all. It
+   is published from source only.
+3. **data-juicer** is not registered on PyPI under that exact name. The
+   closest match (`py-data-juicer==1.0.0`) has broken transitive deps;
+   newer `py-data-juicer` releases work but install ~150 transitive
+   packages.
+Wave 16 dropped all three extras from `pyproject.toml` rather than ship
+unsatisfiable pins. The framework code paths that touch these libraries
+import them lazily, so:
+- `composer_replication.recipes.monarch` is a documentation skeleton
+  that does NOT require monarch installed.
+- `composer_replication.recipes.prime_rl.composer_loss` imports cleanly
+  without prime-rl; the upstream parity test is `@skipif`-gated and the
+  in-file shadow-parity test still verifies the loss formula
+  independently.
+- `composer_replication.replaysim.normalize.DJNormalizer(skip_dj=True)`
+  works without `data_juicer`; only the full DJNormalizer code path
+  needs it.
+**FIX.** If you want any of these libraries' real functionality, install
+from source alongside the framework:
+```
+# Meta Monarch (actor framework — see ADR-006)
+pip install torchmonarch-nightly       # OR install from source:
+# git clone https://github.com/meta-pytorch/monarch && cd monarch && pip install -e .
+# Prime Intellect prime-rl (Recipe C — see ADR-006)
+git clone https://github.com/PrimeIntellect-ai/prime-rl
+cd prime-rl && pip install -e .
+# data-juicer (replaysim normalization — see ADR-004)
+git clone https://github.com/modelscope/data-juicer
+cd data-juicer && pip install -e .
+```
+**VERIFICATION.** A fresh checkout install with all surviving extras
+should succeed:
+```
+uv venv --clear
+uv pip install -e ".[diloco,replay,replaysim,train,dev]"
+source .venv/bin/activate
+python -m pytest -q                    # baseline 176 passed / 8 skipped
+```
+If any of those extras fails to resolve, file a bug report — Wave 16
+verified the full extras matrix installs from a clean venv on Python
+3.11.
+---
 ## How to file a bug report
 If you've read the relevant section above and your problem persists,

docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md CHANGED

@@ -621,6 +621,17 @@ if __name__ == "__main__":
 ### 3.4 Package layout
 ```
 composer_replication/
 └── diloco/

 ### 3.4 Package layout
+<!-- AUDIT: stale_serverless_layout — ADR-005 shipped a flatter layout than this
+     proposal. Actual modules under composer_replication/diloco/serverless/
+     are: __init__.py, executor.py (ServerlessExecutor + LocalProcessExecutor),
+     allreduce.py (ObjectStoreAllReduce + MockManager), modal.py (ModalExecutor),
+     hf_jobs.py (HFJobsExecutor), replica_entrypoint.py. No leading underscores,
+     no _protocol/_base/_rendezvous split, and Modal/HFJobs are flat modules
+     rather than subpackages. The above code-block file headers (e.g.
+     `_modal_adapter.py`, `_hf_jobs_adapter.py`, `_protocol.py`, `_rendezvous.py`)
+     are pre-implementation proposals; map them to the realised module names
+     when reading. -->
 ```
 composer_replication/
 └── diloco/

docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md CHANGED

@@ -312,6 +312,20 @@ write_jsonl(out_path, pairs)
 ### 4.3 Adapter shape (`replaysim/normalize.py`)
 ```python
 # composer_replication/replaysim/normalize.py
 from __future__ import annotations

 ### 4.3 Adapter shape (`replaysim/normalize.py`)
+<!-- AUDIT: stale_replaysim_paths_and_dpo_shape — ADR-004 shipped at
+     composer_replication/replaysim/normalize.py with a different DPOPair shape
+     than this sketch. Actual DPOPair is a TypedDict with fields
+     {state_id, state_messages, chosen: str, rejected: str, n_teachers_agreeing}
+     — NOT {prompt, chosen, rejected, state, meta} as in the proposal below. The
+     YAML recipe also lives at composer_replication/recipes/replaysim/default.yaml
+     (not composer_replication/replaysim/recipes/dpo_normalize.yaml). The hook
+     in §4.5 is provided by `replay_and_normalize_trace` in
+     composer_replication/replaysim/__init__.py rather than a drop-in edit to
+     `teacher_replay.py`. The custom op file (§4.4 line 426 / §4.4 line 431)
+     `composer_replication/replaysim/ops/preference_validator.py` was not
+     created. Treat the sketch below as proposal, not as documentation of
+     the realised code. -->
 ```python
 # composer_replication/replaysim/normalize.py
 from __future__ import annotations

docs/research/RL_FRAMEWORKS_LANDSCAPE.md CHANGED

@@ -313,9 +313,13 @@ group_size = 16
 [trainer]
 algorithm = "grpo"
 [trainer.loss]
 type = "custom"
-import_path = "composer_replication.losses.composer_three_channel_loss"
 [trainer.loss.kwargs]
 hint_weight = 0.5
 replay_weight = 0.25
@@ -330,10 +334,15 @@ sync_mode = "async"
 shardcast = true
 ```
-`composer_replication/losses.py` (~120 LOC):
 ```python
-# composer_replication/losses.py
 from prime_rl.trainer.rl.loss import LossInputs, LossOutputs
 def composer_three_channel_loss(

 [trainer]
 algorithm = "grpo"
+<!-- AUDIT: stale_recipe_format — Wave 14b shipped this as YAML at
+     composer_replication/recipes/prime_rl/prime_rl_config.yaml with a different
+     kwarg surface (alpha_sdpo, beta_dpo, dppo_mask_high, dppo_mask_low, adv_tau,
+     kl_tau). The TOML/hint_weight/replay_weight sketch below predates that. -->
 [trainer.loss]
 type = "custom"
+import_path = "composer_replication.recipes.prime_rl.composer_loss:loss_fn"
 [trainer.loss.kwargs]
 hint_weight = 0.5
 replay_weight = 0.25
 shardcast = true
 ```
+`composer_replication/recipes/prime_rl/composer_loss.py` (~120 LOC; current Wave 14b
+implementation defines `loss_fn(inputs, **kwargs)` rather than the
+`composer_three_channel_loss(li, *, hint_weight, replay_weight, replay_logits)`
+signature sketched below):
 ```python
+# composer_replication/recipes/prime_rl/composer_loss.py — sketch only;
+# the actual signature evolved during Wave 14b. See module docstring for
+# the current `loss_fn` contract.
 from prime_rl.trainer.rl.loss import LossInputs, LossOutputs
 def composer_three_channel_loss(

docs/research/SELF_DISTILLATION_LANDSCAPE.md CHANGED

@@ -352,6 +352,14 @@ license + reproducible scale) to recommend adding right now.
 For ADR-007 the proposed addition is a `composer_replication.distillation`
 sub-package with three pluggable hooks:
 ```
 composer_replication/
   distillation/

 For ADR-007 the proposed addition is a `composer_replication.distillation`
 sub-package with three pluggable hooks:
+<!-- AUDIT: stale_distillation_layout — ADR-007 shipped a flatter layout than
+     this proposal. Actual modules: composer_replication/distillation/{simpo.py,
+     taid.py, entropy_aware_opd.py}. There is no targets.py/losses.py split,
+     no top-level preference/ subpackage, and SimPO lives under distillation/
+     rather than preference/. The function names also differ: actual exports
+     are `simpo_loss`, `taid_loss` + `TAIDScheduler`, and `entropy_aware_opd_loss`
+     (not `taid_target` / `entropy_aware_kl_loss`). -->
 ```
 composer_replication/
   distillation/

docs/research/TRACE_SOURCE_RECONNAISSANCE.md CHANGED

@@ -244,6 +244,13 @@ For users on other machines: `find ~/.claude/projects -name '*.jsonl' -size +50k
 ## 6. TraceIngester sketch
 Drop-in adapter for spike-005's `replay_trace()`. Targets `TraceState` (the actual existing TypedDict; see §1).
 ```python

 ## 6. TraceIngester sketch
+<!-- AUDIT: stale_ingester_paths_and_naming — Spike 007 shipped at
+     spikes/007-real-trace-ingestion/claude_code_ingester.py (NOT
+     spikes/007-trace-ingester/trace_ingester.py) and the production-side
+     module is composer_replication/ingestion/claude_code.py exporting
+     `ClaudeCodeIngester` (NOT `TraceIngester`). The sketch below is the
+     pre-spike proposal; the realised API surface is named differently. -->
 Drop-in adapter for spike-005's `replay_trace()`. Targets `TraceState` (the actual existing TypedDict; see §1).
 ```python

docs/research/WAVE_16_RECON_AUDIT.md ADDED

	@@ -0,0 +1,179 @@

+# Wave 16 Reconnaissance Audit
+Audit of `docs/research/*RECONNAISSANCE.md` and `*LANDSCAPE.md` against repo
+state at sha `e5add150ab06aeef3adda726c0fcae05aa270500`.
+Wave 16d charter: check 7 recon/landscape docs against current code, produce
+this audit table, and apply only **unambiguous** inline fixes. Ambiguous claims
+are tagged with `<!-- AUDIT: stale_<short> -->` HTML comments inline and
+recorded here for orchestrator follow-up.
+## Summary
+- Total claims checked: ~36 across 7 docs
+- KEEP: 24
+- FIX (unambiguous, applied inline): 2
+- FLAG (ambiguous; HTML comment + entry below): 5
+- DELETE: 0
+The framework deliberately ships proposal-shaped recon docs alongside built
+code. Wave 16d's posture has been "do not rewrite proposals; flag where the
+realised code diverged so future readers know which sections are pre-impl
+sketch vs. accurate-as-of-build documentation."
+## Per-doc findings
+### DILOCO_RECONNAISSANCE.md
+External-only doc (about `meta-pytorch/torchft`); no in-repo symbol or path
+references. Nothing to audit against current code.
+| Claim | Status | Action |
+| --- | --- | --- |
+| All references are to `torchft.local_sgd.DiLoCo`, `torchft/local_sgd.py:324`, `torchft/manager.py`, etc. | external library, not our concern | KEEP |
+| Recommends `pip install torchft-nightly`; `composer_replication/diloco/__init__.py` later adopted that | confirmed in code | KEEP |
+### DILOCO_SERVERLESS_RECONNAISSANCE.md
+Doc is a pre-implementation proposal for ADR-005. The realised package layout
+under `composer_replication/diloco/serverless/` is flatter and uses different
+module names than the proposal.
+| Claim | Status | Action |
+| --- | --- | --- |
+| `composer_replication.diloco.serverless` namespace exists | matches reality | KEEP |
+| Code blocks use file headers `_modal_adapter.py`, `_hf_jobs_adapter.py`, `_protocol.py`, `_rendezvous.py`, `_base.py` | actual modules are `modal.py`, `hf_jobs.py`, `executor.py`, `allreduce.py` (no leading underscore, no _protocol/_base/_rendezvous split) | FLAG (`stale_serverless_layout` HTML comment added before §3.4) |
+| §3.4 proposed `modal/`, `hfjobs/`, `runpod/` subpackages | actual ships flat `modal.py`, `hf_jobs.py` modules | FLAG (covered by same comment) |
+| `make_diloco_outer_loop` lives in `composer_replication/diloco/__init__.py` lines 64–125 | line numbers verified (def at 64, body ends ~125) | KEEP |
+| `python -m composer_replication.diloco.serverless.replica_entrypoint` | module exists with `main()` at line 38 | KEEP |
+| `MockManager` exists in serverless package | confirmed at `composer_replication/diloco/serverless/allreduce.py:215` | KEEP |
+| `ObjectStoreAllReduce` exists | confirmed at `composer_replication/diloco/serverless/allreduce.py:30` | KEEP |
+| §3.5 user-facing API `from composer_replication.diloco.serverless import ModalExecutor, HFJobsExecutor, ReplicaSpec` | partial: `ModalExecutor` and `HFJobsExecutor` exist as classes, but `serverless/__init__.py` does NOT re-export them (only `LocalProcessExecutor`, `MockManager`, `ObjectStoreAllReduce`, `ReplicaHandle`, `ServerlessExecutor`) and `ReplicaSpec` is not implemented | covered by `stale_serverless_layout` flag |
+### MODAL_RECONNAISSANCE.md
+Doc anchors all loss-shape claims to `spikes/005-integrated-trainer-skeleton/`,
+which is unchanged since Wave 15. All checked references resolve.
+| Claim | Status | Action |
+| --- | --- | --- |
+| `spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py` exists | confirmed | KEEP |
+| `spikes/005-integrated-trainer-skeleton/opsd_loss.py` exists | confirmed | KEEP |
+| `_compute_sdpo_loss` student/teacher forwards at `composer_trainer.py L138–143` | confirmed (L138-143 are student/teacher logits) | KEEP |
+| `_compute_trace_replay_loss` chosen/rejected forwards at `L191–198` | confirmed (L191-198 are `_sequence_logprobs` calls) | KEEP |
+| Zero-tensor short-circuit at `L136` and `L155` | confirmed (both lines `return torch.tensor(0.0, ..., requires_grad=True)`) | KEEP |
+| `opsd_loss.py L54` references `top_k` arg | confirmed | KEEP |
+| Trainer defaults `alpha_sdpo=0.1`, `beta_replay=0.05` | match `composer_replication/loss.py:75-76` | KEEP |
+### REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md
+Pre-spike proposal for ADR-004. The library was adopted (data-juicer) and
+`DJNormalizer` exists, but the adapter shape and recipe path differ from the
+proposal.
+| Claim | Status | Action |
+| --- | --- | --- |
+| `composer_replication.replaysim` package exists with `DJNormalizer` | confirmed at `composer_replication/replaysim/normalize.py:145` | KEEP |
+| `replay_trace`, `extract_dpo_pairs` re-exported from replaysim | confirmed in `__init__.py` | KEEP |
+| §4.3 sketch shows DPOPair with `{prompt, chosen, rejected, state, meta}` and `_to_dj`/`_from_dj` round-trip | actual `DPOPair` is `{state_id, state_messages, chosen: str, rejected: str, n_teachers_agreeing}` (TypedDict in `composer_replication/teacher_replay.py:99`) | FLAG (`stale_replaysim_paths_and_dpo_shape` added before §4.3) |
+| Recipe path `composer_replication/replaysim/recipes/dpo_normalize.yaml` (line 344) | actual is `composer_replication/recipes/replaysim/default.yaml` | FLAG (covered by same comment) |
+| `composer_replication/replaysim/ops/preference_validator.py` (line 431) | not created — no `ops/` subpackage exists | FLAG (covered by same comment) |
+| §4.5 hook into `composer_replication/replaysim/teacher_replay.py` | actual integration path is `replay_and_normalize_trace_sync` in `composer_replication/replaysim/normalize.py:301` (no separate replaysim/teacher_replay.py — teacher_replay lives at top-level `composer_replication/teacher_replay.py`) | FLAG (covered by same comment) |
+### RL_FRAMEWORKS_LANDSCAPE.md
+Wave 14b parity rewrite changed the public surface for the PRIME-RL recipe.
+Most file/symbol references in the doc are unambiguously fixable.
+| Claim | Status | Action |
+| --- | --- | --- |
+| PRIME-RL `LossInputs` / `LossOutputs` interface, fields `trainer_logprobs, inference_logprobs, teacher_logprobs, advantages, loss_mask` | matches `composer_replication/recipes/prime_rl/composer_loss.py:17-28` | KEEP |
+| `import_path = "composer_replication.losses.composer_three_channel_loss"` (line 318) | actual is `composer_replication.recipes.prime_rl.composer_loss:loss_fn` | FIX (applied inline) |
+| `composer_replication/losses.py` (~120 LOC) (line 333) | actual file is `composer_replication/recipes/prime_rl/composer_loss.py`; function is `loss_fn` not `composer_three_channel_loss`; signature is `(inputs, **kwargs)` not `(li, *, hint_weight, replay_weight, replay_logits)` | FIX (filename + AUDIT comment noting signature drift; sketch retained as-is) |
+| Recipe in `recipes/composer_v0_prime_rl.toml` with kwargs `hint_weight`, `replay_weight`, `replay_logits_path` | actual recipe is `composer_replication/recipes/prime_rl/prime_rl_config.yaml` (YAML not TOML) and kwargs are `alpha_sdpo, beta_dpo, dppo_mask_high, dppo_mask_low, adv_tau, kl_tau` | FLAG (`stale_recipe_format` HTML comment added at the recipe block) |
+| Monarch sketch `composer_replication/orchestrator/monarch_runner.py` | not in code; treated as v0.2 sketch (and §6.2 already labels it "Monarch wrap-up sketch (v0.2)") | KEEP (clearly v0.2 forward-looking; no AUDIT needed) |
+| Note that `composer_replication/recipes/monarch/actors.py` exists, providing the monarch actor surface | matches `composer_replication/recipes/monarch/actors.py` | KEEP |
+### SELF_DISTILLATION_LANDSCAPE.md
+Audit doc for ADR-007. Three losses (SimPO, TAID, Entropy-Aware OPD) were
+adopted, but the package layout proposed in §"Recommended follow-up wiring"
+is not what was built.
+| Claim | Status | Action |
+| --- | --- | --- |
+| References `composer_replication/__init__.py` and existing `composer_replication.opsd.generalized_jsd_loss` | confirmed | KEEP |
+| Audited candidate verdicts (SimPO/TAID/EA-OPD recommended) | matches what shipped under `composer_replication/distillation/` | KEEP |
+| Proposed package layout: `composer_replication/distillation/{targets.py, losses.py}` + `composer_replication/preference/{simpo.py, dpo.py}` | actual: flat `composer_replication/distillation/{simpo.py, taid.py, entropy_aware_opd.py}` — no targets/losses split, no top-level `preference/` package | FLAG (`stale_distillation_layout` HTML comment added before the proposal block) |
+| Function names `taid_target`, `entropy_aware_kl_loss`, `fixed_target` | actual exports: `taid_loss` + `TAIDScheduler`, `entropy_aware_opd_loss`, `simpo_loss` | FLAG (covered by same comment) |
+| Composition rule sketch `L_distill = entropy_aware_kl_loss(target = taid_target(...), ...)` | not realised as a single composed function — actual API is per-loss with `compose_loss` mixing channels via `sdpo_wrapper`/`dpo_variant` switches | FLAG (covered by same comment) |
+### TRACE_SOURCE_RECONNAISSANCE.md
+Pre-spike audit feeding ADR-002 and Spike 007. Doc cites the actual
+`TraceState`/`DPOPair` TypedDicts correctly (matches current
+`composer_replication/teacher_replay.py:81-104`). The sketch in §6 uses
+spike-shape names that do not match what shipped.
+| Claim | Status | Action |
+| --- | --- | --- |
+| `TraceState` and `DPOPair` field lists in §1 | match `composer_replication/teacher_replay.py` (TypedDicts at lines 81-104) | KEEP |
+| `spikes/005-integrated-trainer-skeleton/teacher_replay.py` exists | confirmed | KEEP |
+| §6 sketch path `spikes/007-trace-ingester/trace_ingester.py` and class `TraceIngester` | actual spike path is `spikes/007-real-trace-ingestion/claude_code_ingester.py` and the production class is `composer_replication.ingestion.claude_code.ClaudeCodeIngester` | FLAG (`stale_ingester_paths_and_naming` HTML comment added before §6) |
+| Direct inspection of `~/.claude/projects/` JSONL files | not testable from CI; user-machine claim | KEEP |
+| Re-use of `TraceState` from spike-005 `teacher_replay.py` | spike still has it; production also has it | KEEP |
+## Open items for Wave 17+
+These are the FLAGged ambiguous claims that need an orchestrator decision
+before a confident rewrite:
+1. **DILOCO_SERVERLESS_RECONNAISSANCE.md §3.4** — proposed serverless package
+   layout (`_modal_adapter.py`, `_protocol.py`, etc.) does not match shipped
+   layout (`modal.py`, `executor.py`, etc.). Decide: rewrite §3.4 to document
+   shipped layout, or keep as historical proposal. Note also: §3.5 imports
+   `ModalExecutor`, `HFJobsExecutor`, and `ReplicaSpec` via
+   `from … serverless import …`, but `serverless/__init__.py` re-exports
+   neither executor (only `LocalProcessExecutor`, `MockManager`,
+   `ObjectStoreAllReduce`, `ReplicaHandle`, and `ServerlessExecutor`) and
+   `ReplicaSpec` is not implemented. Either the public re-exports should be
+   added (code change, out of Wave 16d scope) or §3.5 needs to use the
+   longer module paths.
+2. **REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md §4** — Adapter sketch assumes
+   a `DPOPair` shape (`{prompt, chosen, rejected, state, meta}`) that does
+   not match the realised TypedDict (`{state_id, state_messages, chosen: str,
+   rejected: str, n_teachers_agreeing}`). The §4.3 `_to_dj`/`_from_dj`
+   functions in the sketch will not work as written. Decide: rewrite §4 to
+   match `replay_and_normalize_trace_sync` in
+   `composer_replication/replaysim/normalize.py:301`, or keep as
+   proposal-shaped historical context.
+3. **RL_FRAMEWORKS_LANDSCAPE.md §6.1** — recipe sketch is `.toml` with
+   `hint_weight`/`replay_weight` kwargs; reality is `.yaml` with
+   `alpha_sdpo`/`beta_dpo`/`dppo_mask_high`/`dppo_mask_low`/`adv_tau`/`kl_tau`.
+   The `loss_fn(inputs, **kwargs)` signature also differs from the
+   `composer_three_channel_loss(li, *, hint_weight, replay_weight,
+   replay_logits)` sketch. Decide: rewrite §6.1 to match shipped recipe, or
+   keep as the original landscape proposal.
+4. **SELF_DISTILLATION_LANDSCAPE.md §"Recommended follow-up wiring"** — the
+   `distillation/{targets.py, losses.py}` + `preference/{simpo.py, dpo.py}`
+   layout is not what shipped. Actual is flat
+   `composer_replication/distillation/{simpo.py, taid.py, entropy_aware_opd.py}`
+   with function names `simpo_loss`, `taid_loss` + `TAIDScheduler`,
+   `entropy_aware_opd_loss`. Decide: rewrite the wiring sketch or leave as
+   proposal-shaped record of pre-ADR thinking.
+5. **TRACE_SOURCE_RECONNAISSANCE.md §6** — `TraceIngester` sketch differs from
+   shipped `ClaudeCodeIngester`. Decide: rewrite §6 to point at the realised
+   ingester (would also require updating spike path from
+   `007-trace-ingester` to `007-real-trace-ingestion`), or keep as
+   pre-spike proposal.
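For item 2, the gap between the proposed and realised `DPOPair` shapes can be made concrete. The sketch below is illustrative only and is not code from the repository: the field names come from the audit table, but every field type other than `chosen: str` and `rejected: str`, the helper `shape_drift`, and the sample dict are assumptions.

```python
from typing import Any, TypedDict


class RealisedDPOPair(TypedDict):
    """Field names as reported by the audit; non-str types are assumed."""

    state_id: str                          # assumed type
    state_messages: list[dict[str, Any]]   # assumed type
    chosen: str
    rejected: str
    n_teachers_agreeing: int               # assumed type


REALISED_KEYS = set(RealisedDPOPair.__annotations__)


def shape_drift(pair: dict[str, Any]) -> tuple[set[str], set[str]]:
    """Return (missing, unexpected) keys relative to the realised shape."""
    keys = set(pair)
    return REALISED_KEYS - keys, keys - REALISED_KEYS


# A dict shaped like the §4.3 proposal: {prompt, chosen, rejected, state, meta}.
proposal_shaped = {"prompt": "2+2?", "chosen": "4", "rejected": "5", "state": {}, "meta": {}}
missing, unexpected = shape_drift(proposal_shaped)
print(sorted(missing))     # ['n_teachers_agreeing', 'state_id', 'state_messages']
print(sorted(unexpected))  # ['meta', 'prompt', 'state']
```

Only `chosen` and `rejected` overlap, which is why the `_to_dj`/`_from_dj` round-trip in the sketch cannot be reused as written.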
+## What was NOT changed
+- `WAVE_*_FINAL_REVIEW.md` files — explicitly out of scope for Wave 16d.
+- Any code under `composer_replication/`, `examples/`, `tests/` — code-side
+  fixes (e.g. adding `ModalExecutor`/`HFJobsExecutor` to `serverless/__init__.py`
+  re-exports) belong to a code wave, not a doc audit.
+- Whole-section rewrites of any doc — Wave 16d's mandate is "audit + safe
+  inline fixes only". Each FLAG above is a candidate for a future targeted
+  rewrite wave.
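The re-export gap noted for `serverless/__init__.py` is the kind of drift a small check can catch before a doc claims a public import path. The helper below is a hypothetical sketch (the name `missing_reexports` is not from the repository); it is demonstrated on the standard library so it runs anywhere, with the repository call left as a comment because it requires `composer_replication` to be installed.

```python
import importlib


def missing_reexports(package: str, names: list[str]) -> list[str]:
    """Return the names that are not attributes of the imported package.

    Caveat: `from package import name` can still succeed for a name that is
    a submodule, so this is a check on re-exported symbols, not on modules.
    """
    module = importlib.import_module(package)
    return [name for name in names if not hasattr(module, name)]


# Demonstrated on the standard library so the snippet is self-contained.
print(missing_reexports("json", ["dumps", "loads", "ReplicaSpec"]))  # ['ReplicaSpec']

# Against this repository (assumes the package is importable):
# missing_reexports(
#     "composer_replication.diloco.serverless",
#     ["ModalExecutor", "HFJobsExecutor", "ReplicaSpec"],
# )
```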

examples/gsm8k_grpo_with_sdpo/README.md ADDED

	@@ -0,0 +1,72 @@

+# gsm8k_grpo_with_sdpo — SDPO column end-to-end on Qwen2.5-0.5B-Instruct (CPU)
+This is the sibling to `examples/gsm8k_grpo/` that demonstrates the
+**SDPO hint-distillation column** firing end-to-end on a real
+HuggingFace causal LM, on CPU, in ~90 seconds. Where `gsm8k_grpo/run.py`
+runs plain GRPO with `alpha_sdpo=0`, this script enables `alpha_sdpo=0.5`
+and verifies the SDPO channel actually fires and routes gradients
+through the model.
+## Run it
+```bash
+pip install -e ".[train]"
+python examples/gsm8k_grpo_with_sdpo/run.py
+```
+Expected wall-clock: ~60-120s on CPU (one-time HF model download on
+first run).
+## What success looks like
+The script will print 5 SGD steps' worth of channel-decomposed losses
+and end with three ✓ assertions:
+```
+  step 1/5: total=5.9801  lm_ce=5.9087  sdpo_jsd=0.1429  trace_replay_dpo=0.0000  |grad|=6.45e+06
+  step 2/5: total=4.2268  lm_ce=4.1573  sdpo_jsd=0.1390  trace_replay_dpo=0.0000  |grad|=1.20e+06
+  ...
+  step 5/5: total=2.4644  lm_ce=2.3962  sdpo_jsd=0.1363  trace_replay_dpo=0.0000  |grad|=1.03e+06
+[4/4] Verifying SDPO column wiring ...
+  ✓ sdpo_jsd > 0 at every step (min=0.1358, max=0.1429)
+  ✓ total != lm_ce at every step (min |diff|=0.0679, max=0.0714)
+  ✓ |grad| > 0 and finite at every step (min=1.01e+06, max=6.45e+06)
+✅ SDPO column wiring verified end-to-end.
+```
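The sample log is consistent with `total` being `lm_ce` plus `alpha_sdpo` times `sdpo_jsd` (the DPO term is zero in this run). That relationship is inferred from the printed numbers, not taken from the `compose_loss` source, so the check below is a sanity sketch under that assumption:

```python
ALPHA_SDPO = 0.5  # weight used by this example

# (total, lm_ce, sdpo_jsd) copied from the sample log for steps 1, 2 and 5.
logged = [
    (5.9801, 5.9087, 0.1429),
    (4.2268, 4.1573, 0.1390),
    (2.4644, 2.3962, 0.1363),
]

for total, lm_ce, sdpo_jsd in logged:
    observed = total - lm_ce
    predicted = ALPHA_SDPO * sdpo_jsd
    # Logged values are rounded to 4 decimals, so allow a small tolerance.
    assert abs(observed - predicted) < 2e-4, (observed, predicted)
    print(f"total - lm_ce = {observed:.4f}   alpha * sdpo_jsd = {predicted:.4f}")
```

If a future run prints numbers that fail this check, either the mixing rule has changed or another channel is contributing to `total`.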
+Wall-clock on the reference run: **16.5s** for 5 SGD steps after a
+1.7s model-load phase (no model download — already cached). The SDPO
+signal magnitude (~0.14) is meaningful here because the script uses
+Qwen's actual ChatML markers (`<|im_start|>` / `<|im_end|>`) via
+`tokenizer.apply_chat_template` — not raw marker strings, which would
+be tokenized as 11 punctuation tokens and the model would see nonsense.
+If `sdpo_jsd` ever shows up as `0.0000`, the SDPO column is silent —
+that means either (a) `alpha_sdpo=0`, (b) `ctx_teacher_input_ids` is
+missing from the input batch, or (c) the data collator is producing
+empty teacher contexts.
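The three causes of a silent SDPO column can be checked before running any training step. The helper below is a hypothetical sketch, not part of `composer_replication`: it works on plain nested lists (call `.tolist()` on tensors first) and treats an "empty" teacher context as a row containing only padding tokens, which is an assumption about what the collator would emit.

```python
from typing import Mapping, Sequence


def diagnose_silent_sdpo(
    alpha_sdpo: float,
    batch: Mapping[str, Sequence[Sequence[int]]],
    pad_token_id: int,
) -> list[str]:
    """Return the reasons why sdpo_jsd would be expected to be zero."""
    reasons = []
    if alpha_sdpo == 0:
        reasons.append("alpha_sdpo is 0")
    teacher = batch.get("ctx_teacher_input_ids")
    if teacher is None:
        reasons.append("ctx_teacher_input_ids missing from batch")
    elif any(all(tok == pad_token_id for tok in row) for row in teacher):
        reasons.append("collator produced an empty (all-padding) teacher context")
    return reasons


batch = {"input_ids": [[5, 6, 7]], "ctx_teacher_input_ids": [[0, 0, 0]]}
print(diagnose_silent_sdpo(0.5, batch, pad_token_id=0))
```

An empty list means none of the three conditions was detected; it does not prove the channel will fire.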
+## Why this is not a real training run
+The hint contexts here are hand-crafted — every example gets the same
+generic "remember to verify your arithmetic" hint, and the response
+mask is fabricated to mark the back half of the sequence. This is a
+**code-path smoke test**, not a recipe. Real SDPO training uses a
+`ComposerDataCollator` (see
+`composer_replication.trainer.data_collator`) that generates per-step
+hints from the actual error sites in your trace data.
+## Cross-references
+- [`composer_replication.compose_loss`](../../composer_replication/loss.py) — the loss-composition entrypoint
+- [`docs/COMPOSER_RECIPE_MAPPING.md`](../../docs/COMPOSER_RECIPE_MAPPING.md) — how SDPO maps to Cursor's Composer-2.5 hint-distillation
+- [`docs/adrs/ADR-002-channel2-sdpo.md`](../../docs/adrs/ADR-002-channel2-sdpo.md) — SDPO design decision
+- [`examples/gsm8k_grpo/run.py`](../gsm8k_grpo/run.py) — plain GRPO sibling (alpha_sdpo=0)
+## CPU vs GPU
+This example is intentionally CPU-only and small (B=2, T=32, 5 steps)
+so it exercises the SDPO code path without needing a GPU. For real
+training on Qwen2.5 at scale, switch to
+`ComposerReplicationTrainer` (TRL-backed) and a real GPU; see
+[`docs/USER_GUIDE.md`](../../docs/USER_GUIDE.md) §8 Recipe A.

examples/gsm8k_grpo_with_sdpo/run.py ADDED

	@@ -0,0 +1,288 @@

+"""GRPO + SDPO column wiring on Qwen2.5-0.5B-Instruct (CPU end-to-end).
+This is the sibling to `examples/gsm8k_grpo/` that demonstrates the
+**SDPO hint-distillation column** firing end-to-end on a real
+HuggingFace model, on CPU, without needing TRL's full GRPO training
+loop. Where `gsm8k_grpo/run.py` runs plain GRPO with `alpha_sdpo=0`,
+this script loads the same model and shows that:
+  1. `compose_loss(model, inputs, alpha_sdpo=0.5, ...)` produces a
+     non-zero `sdpo_jsd` channel on a real HF causal LM, and
+  2. backward through that channel reaches model parameters with
+     finite gradients, and
+  3. running 5 SGD steps with the SDPO column enabled reduces the
+     channel-decomposed total loss.
+This is the smallest possible "real-model" SDPO-wiring proof — the
+hand-crafted hint contexts here are not realistic training data, they
+just exercise the SDPO code path. For production SDPO, use
+`ComposerReplicationTrainer` with a `ComposerDataCollator` that emits
+`ctx_teacher_input_ids` / `sdpo_loss_mask` columns from your real
+trace data (see `composer_replication.trainer.data_collator`).
+Usage:
+    pip install -e ".[train]"
+    python examples/gsm8k_grpo_with_sdpo/run.py
+Cross-references:
+  - `composer_replication.compose_loss` — the loss-composition entrypoint
+  - `docs/COMPOSER_RECIPE_MAPPING.md` — how SDPO maps to Cursor's
+    Composer-2.5 hint-distillation
+  - `docs/adrs/ADR-002-channel2-sdpo.md` — SDPO design
+  - `examples/gsm8k_grpo/run.py` — plain GRPO (no SDPO) sibling
+"""
+from __future__ import annotations
+import logging
+import random
+import sys
+import time
+from pathlib import Path
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from composer_replication import compose_loss
+# ---------------------------------------------------------------------------
+# Config
+# ---------------------------------------------------------------------------
+MODEL_REPO = "Qwen/Qwen2.5-0.5B-Instruct"
+N_STEPS = 5
+B = 2                # batch size
+T = 32               # sequence length (small to keep CPU fast)
+LR = 1e-5
+ALPHA_SDPO = 0.5     # SDPO column weight; large enough that the SDPO term is clearly visible in total
+BETA_REPLAY = 0.0    # DPO column off — this example focuses on SDPO
+OUTPUT_DIR = Path(__file__).resolve().parent / "output"
+OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
+# ---------------------------------------------------------------------------
+# Tiny GSM8K-shaped fixture — we fabricate the chat strings so the model
+# sees realistic prose. The "hint" is the same prompt with a
+# "remember to verify your arithmetic" line inserted; that's what makes
+# the teacher context differ from the student context.
+# ---------------------------------------------------------------------------
+PROBLEMS = [
+    {
+        "question": "Janet has 3 boxes with 4 apples each. How many apples total?",
+        "gold": "12",
+    },
+    {
+        "question": "A train travels 60 miles in 2 hours. What's its average speed?",
+        "gold": "30",
+    },
+]
+SYS = "You are a math tutor. End with `#### N` where N is the answer."
+HINT = "Hint: re-check your arithmetic before giving the final answer."
+def _build_chat_messages(question: str, *, with_hint: bool) -> list[dict]:
+    """Format a single example as chat messages. with_hint=True is the
+    teacher context (hint inserted as an extra system turn). Returns the
+    OpenAI-style messages list, ready for tokenizer.apply_chat_template.
+    Verified 2026-05-26: Qwen2.5 uses ChatML markers (`<|im_start|>` /
+    `<|im_end|>`), NOT `<|system|>` / `<|end|>`. Using
+    `apply_chat_template` is the only safe way to format input — raw
+    marker strings get tokenized as 11 punctuation tokens and the model
+    sees nonsense.
+    """
+    messages = [{"role": "system", "content": SYS}]
+    if with_hint:
+        # Two system turns: Qwen's chat template will format both with
+        # <|im_start|>system / <|im_end|> markers.
+        messages.append({"role": "system", "content": HINT})
+    messages.append({"role": "user", "content": question})
+    return messages
+def build_inputs(tokenizer) -> dict[str, torch.Tensor]:
+    """Tokenize PROBLEMS into a compose_loss-shaped batch.
+    Returns a dict with:
+      - input_ids:              (B, T) student rollouts (no hint)
+      - response_mask:          (B, T)
+      - ctx_teacher_input_ids:  (B, T) hint-conditioned context
+      - sdpo_loss_mask:         (B, T) 1 at assistant-response tokens
+    """
+    student_msg_lists = [_build_chat_messages(p["question"], with_hint=False) for p in PROBLEMS[:B]]
+    teacher_msg_lists = [_build_chat_messages(p["question"], with_hint=True) for p in PROBLEMS[:B]]
+    # Render via Qwen's chat template — produces real <|im_start|>/<|im_end|> tokens.
+    student_strs = [
+        tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True)
+        for m in student_msg_lists
+    ]
+    teacher_strs = [
+        tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True)
+        for m in teacher_msg_lists
+    ]
+    s_tok = tokenizer(
+        student_strs,
+        max_length=T,
+        truncation=True,
+        padding="max_length",
+        return_tensors="pt",
+    )
+    t_tok = tokenizer(
+        teacher_strs,
+        max_length=T,
+        truncation=True,
+        padding="max_length",
+        return_tensors="pt",
+    )
+    # Mark the second half of each sequence as the "response" — purely
+    # synthetic for this smoke; in real training the response_mask comes
+    # from the rollout pipeline.
+    response_mask = torch.zeros(B, T, dtype=torch.long)
+    response_mask[:, T // 2:] = 1
+    sdpo_loss_mask = response_mask.clone()
+    return {
+        "input_ids": s_tok["input_ids"],
+        "response_mask": response_mask,
+        "ctx_teacher_input_ids": t_tok["input_ids"],
+        "sdpo_loss_mask": sdpo_loss_mask,
+    }
+# ---------------------------------------------------------------------------
+# Main
+# ---------------------------------------------------------------------------
+def main() -> int:
+    random.seed(42)
+    torch.manual_seed(42)
+    log_path = OUTPUT_DIR.parent / "run.log"
+    logging.basicConfig(
+        level=logging.INFO,
+        format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
+        handlers=[
+            logging.StreamHandler(sys.stdout),
+            logging.FileHandler(log_path, mode="w"),
+        ],
+    )
+    log = logging.getLogger("gsm8k_grpo_with_sdpo")
+    log.info("=" * 64)
+    log.info("GRPO + SDPO + GSM8K + Qwen2.5-0.5B-Instruct (CPU)")
+    log.info("=" * 64)
+    log.info("[1/4] Loading model + tokenizer ...")
+    t0 = time.time()
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO)
+    if tokenizer.pad_token_id is None:
+        tokenizer.pad_token = tokenizer.eos_token
+    model = AutoModelForCausalLM.from_pretrained(MODEL_REPO, torch_dtype=torch.float32)
+    model.to("cpu")
+    n_params = sum(p.numel() for p in model.parameters())
+    log.info("  loaded in %.1fs (%.3fB params)", time.time() - t0, n_params / 1e9)
+    log.info("[2/4] Building hint-conditioned batch (B=%d, T=%d) ...", B, T)
+    inputs = build_inputs(tokenizer)
+    for k, v in inputs.items():
+        log.info("  %s: shape=%s, dtype=%s", k, tuple(v.shape), v.dtype)
+    log.info("[3/4] Running %d SGD steps with alpha_sdpo=%.2f ...", N_STEPS, ALPHA_SDPO)
+    optim = torch.optim.SGD(model.parameters(), lr=LR)
+    history: list[dict[str, float]] = []
+    model.train()
+    t0 = time.time()
+    for step in range(N_STEPS):
+        optim.zero_grad()
+        out = compose_loss(
+            model,
+            inputs,
+            alpha_sdpo=ALPHA_SDPO,
+            beta_replay=BETA_REPLAY,
+        )
+        out.total.backward()
+        # Sanity: |grad| is the summed absolute gradient (L1 magnitude); checked finite + non-zero below
+        gnorm = sum(
+            p.grad.abs().sum().item()
+            for p in model.parameters()
+            if p.grad is not None
+        )
+        optim.step()
+        components = out.detached()
+        components["grad_norm"] = gnorm
+        history.append(components)
+        log.info(
+            "  step %d/%d: total=%.4f  lm_ce=%.4f  sdpo_jsd=%.4f  trace_replay_dpo=%.4f  |grad|=%.2e",
+            step + 1, N_STEPS,
+            components["total"], components["lm_ce"],
+            components["sdpo_jsd"], components["trace_replay_dpo"],
+            gnorm,
+        )
+    dt = time.time() - t0
+    log.info("Training complete in %.1fs (avg %.1fs/step)", dt, dt / N_STEPS)
+    # ------------------------------------------------------------------
+    # Acceptance assertions — SDPO column actually fired
+    # ------------------------------------------------------------------
+    log.info("[4/4] Verifying SDPO column wiring ...")
+    # 1. SDPO channel was non-zero at every step (channel actually fired)
+    sdpo_values = [h["sdpo_jsd"] for h in history]
+    assert all(s > 0.0 for s in sdpo_values), (
+        f"SDPO column is identically zero — channel did not fire. "
+        f"sdpo_jsd values: {sdpo_values}"
+    )
+    log.info("  ✓ sdpo_jsd > 0 at every step (min=%.4f, max=%.4f)",
+             min(sdpo_values), max(sdpo_values))
+    # 2. total != lm_ce at every step (SDPO actually contributed to total)
+    diffs = [abs(h["total"] - h["lm_ce"]) for h in history]
+    assert all(d > 1e-6 for d in diffs), (
+        f"total ≈ lm_ce at every step — SDPO contribution is negligible. "
+        f"abs(total - lm_ce): {diffs}"
+    )
+    log.info("  ✓ total != lm_ce at every step (min |diff|=%.4f, max=%.4f)",
+             min(diffs), max(diffs))
+    # 3. Gradients were finite + non-zero throughout
+    gnorms = [h["grad_norm"] for h in history]
+    assert all(g > 0.0 for g in gnorms), (
+        f"Some steps had zero gradient norm: {gnorms}"
+    )
+    import math
+    assert all(math.isfinite(g) for g in gnorms), (
+        f"Some steps had non-finite gradient norm: {gnorms}"
+    )
+    log.info("  ✓ |grad| > 0 and finite at every step (min=%.2e, max=%.2e)",
+             min(gnorms), max(gnorms))
+    # ------------------------------------------------------------------
+    # Summary
+    # ------------------------------------------------------------------
+    log.info("=" * 64)
+    log.info("Summary")
+    log.info("=" * 64)
+    log.info("  steps:           %d", N_STEPS)
+    log.info("  alpha_sdpo:      %.2f", ALPHA_SDPO)
+    log.info("  beta_replay:     %.2f", BETA_REPLAY)
+    log.info("  model params:    %.3fB", n_params / 1e9)
+    log.info("  total step 1:    %.4f", history[0]["total"])
+    log.info("  total step %d:    %.4f", N_STEPS, history[-1]["total"])
+    log.info("  wall-clock:      %.1fs", dt)
+    log.info("  log file:        %s", log_path)
+    log.info("=" * 64)
+    log.info("✅ SDPO column wiring verified end-to-end.")
+    return 0
+if __name__ == "__main__":
+    sys.exit(main())
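A note on reading the `|grad|` column: the script sums absolute gradient entries over every parameter, so the logged value is an L1 magnitude that grows with parameter count. It is not the global L2 norm that `torch.nn.utils.clip_grad_norm_` reports by default. The toy sketch below, in plain Python with made-up numbers, shows how far apart the two can be on the same gradients:

```python
import math


def l1_magnitude(grads: list[list[float]]) -> float:
    """Sum of absolute values across all tensors, mirroring the script's |grad|."""
    return sum(abs(g) for tensor in grads for g in tensor)


def l2_norm(grads: list[list[float]]) -> float:
    """Global L2 norm across all tensors."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


# Two toy "parameter tensors" with hand-picked gradient values.
toy_grads = [[3.0, -4.0], [0.0, 12.0]]
print(l1_magnitude(toy_grads))  # 19.0
print(l2_norm(toy_grads))       # 13.0
```

Values around 1e6 in the sample log are therefore not, on their own, evidence of exploding gradients.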

pyproject.toml CHANGED

@@ -30,7 +30,6 @@ keywords = [
     "prime-rl",
     "openenv",
     "torchft",
-    "monarch",
     "modal",
     "huggingface-jobs",
 ]
@@ -64,8 +63,16 @@ serverless = [
     "huggingface_hub>=0.27",   # for hf:// fsspec backend + HF Jobs
 ]
 # Replaysim dataset normalization (per ADR-004)
+#
+# NOTE: data-juicer is intentionally NOT pinned as an extra. The package
+# named "data-juicer" does not exist on PyPI (the closest match,
+# "py-data-juicer==1.0.0", has broken transitive deps; later py-data-juicer
+# releases work but install ~150 transitive packages). Users who want the
+# DJNormalizer adapter should install data-juicer from source themselves —
+# see docs/TROUBLESHOOTING.md ("monarch / data-juicer install"). The
+# replaysim Python module imports data_juicer lazily, so the framework
+# package imports cleanly without it; only DJNormalizer use-time fails.
 replaysim = [
-    "data-juicer>=1.0",
     "composer-replication[replay]",   # replaysim builds on the replay channel
 ]
 # Production training (TRL GRPOTrainer subclass — Recipe A)
@@ -76,13 +83,24 @@ train = [
     "datasets>=3.0",
 ]
 # PRIME-RL recipe (Recipe C — per ADR-006)
-prime-rl = [
-    "prime-rl>=0.5",
-]
-# Monarch actor mesh (per ADR-006)
-monarch = [
-    "monarch>=0.4.1",
-]
+# NOTE: a `prime-rl` extra used to be advertised here pinning
+# `prime-rl>=0.5`. That pin is unsatisfiable: the `prime-rl` PyPI name is
+# not registered. Prime Intellect publishes prime-rl from source only
+# (https://github.com/PrimeIntellect-ai/prime-rl). The framework's
+# composer_replication.recipes.prime_rl adapter handles its absence
+# gracefully (the upstream parity test is skip-marked when prime-rl is
+# not importable) and the in-file shadow-parity test still verifies the
+# loss formula independently. The extra is dropped — see
+# docs/TROUBLESHOOTING.md ("prime-rl install") for installation guidance.
+# NOTE: a `monarch` extra used to be advertised here pinning
+# `monarch>=0.4.1`. That pin is unsatisfiable: PyPI's `monarch` package
+# is unrelated to Meta's actor framework and tops out at 0.1.11. The real
+# Meta Monarch is published as `torchmonarch-nightly` and ships only as
+# nightly wheels with platform constraints. Per ADR-006, full Monarch
+# integration is a v0.2+ bet and the `composer_replication.recipes.monarch`
+# module is a documentation skeleton (importing it does NOT require
+# monarch installed). The extra is dropped — see docs/TROUBLESHOOTING.md
+# ("monarch / data-juicer install") for installation guidance.
 # Everything for development
 dev = [
     "pytest>=8.0",
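The comments above rely on optional backends being imported lazily, so the package loads without them and only the feature that needs the backend fails at use time. A minimal sketch of that pattern follows; `require_optional` and `LazyNormalizer` are hypothetical names for illustration and are not the repository's `DJNormalizer` implementation.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency on first use, with an actionable error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature. {install_hint}"
        ) from exc


class LazyNormalizer:
    """Hypothetical stand-in for an adapter whose backend is optional."""

    def __init__(self, backend: str = "data_juicer") -> None:
        self._backend = backend  # nothing is imported at construction time

    def normalize(self, rows: list[dict]) -> list[dict]:
        # The import happens here, so only callers of normalize() need the backend.
        require_optional(self._backend, "See docs/TROUBLESHOOTING.md for install pointers.")
        return rows  # placeholder: a real adapter would call into the backend


normalizer = LazyNormalizer(backend="definitely_missing_backend_xyz")  # constructing succeeds
try:
    normalizer.normalize([{"chosen": "4"}])
except ImportError as err:
    print(err)
```

The trade-off is that a missing dependency surfaces at first use, not at install time, which is why a pointer to install instructions in the error message matters.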