Wave 17: close all 5 audit FLAGs + SDPO context alignment + serverless re-exports

Three-track follow-up to Wave 16, with cross-model review (gemini-3.1-pro
APPROVED; grok-4.3 surfaced one BLOCKER that was empirically verified to
be a false positive; deepseek-v4-pro burned all of its max_tokens on
hidden reasoning and produced no visible output):

Wave 17a — serverless package re-exports (real code bug)
composer_replication/diloco/serverless/__init__.py now re-exports
ModalExecutor and HFJobsExecutor alongside LocalProcessExecutor. The
package docstring already documented `from composer_replication.diloco.serverless
import ModalExecutor` as the public API but the import would have
failed because the names weren't in __init__.py's import list or
__all__. Both modal.py and hf_jobs.py are import-safe (they don't
import the optional `modal` / `huggingface_hub` packages at module
level; those imports are deferred to class method calls), so adding the
re-exports doesn't introduce a new dependency.
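
The deferred-import pattern described above can be sketched as follows. This is a minimal illustration, not the project's ModalExecutor or HFJobsExecutor: the class name, the method name, and the deliberately nonexistent package name are all hypothetical.

```python
import importlib


class LazyBackendExecutor:
    """Hypothetical executor used only to illustrate deferred imports."""

    # Assumption: a package name that is certainly not installed, so the
    # sketch behaves the same on every machine.
    backend_module = "optional_backend_that_is_not_installed"

    def __init__(self, n_replicas: int = 2) -> None:
        # No optional import happens here, so importing the module and
        # constructing the class both work without the optional package.
        self.n_replicas = n_replicas

    def launch_replicas(self) -> list[str]:
        # The optional dependency is resolved only when it is needed.
        try:
            backend = importlib.import_module(self.backend_module)
        except ImportError as exc:
            raise RuntimeError(
                f"{self.backend_module!r} is required to launch replicas"
            ) from exc
        return [f"{backend.__name__}-replica-{i}" for i in range(self.n_replicas)]


executor = LazyBackendExecutor()  # succeeds without the optional package
```

With this shape, a package-level re-export only costs a class definition at import time; the missing dependency surfaces on the first call that needs it.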

Pinned by a regression test
(test_public_reexports_include_all_executors in
test_serverless_local.py) that asserts every executor adapter the
package docstring documents is in __all__ AND importable.

Wave 16's user-journey reviewer caught this as a BLOCKER. Wave 17a
fixes it.

Wave 17b — SDPO context alignment in the gsm8k_grpo_with_sdpo example
The previous version of build_inputs() tokenized student/teacher
contexts with default right-padding/right-truncation, which dropped
the `<|im_start|>assistant\n` marker for any prompt longer than T=32
(both of ours are: student=36, teacher=46). As a result, the SDPO mask
covered prompt-area positions instead of the assistant-response area,
and the channel computed JSD over MISALIGNED tokens.

Wave 17b flips both tokenizer.padding_side and tokenizer.truncation_side
to "left" under a try/finally, so:
- inputs shorter than T get LEFT-padded (assistant marker stays at T-1)
- inputs longer than T get LEFT-truncated (drops leading system
turns, keeps the assistant marker on the right)
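
The effect of the two sides can be illustrated on plain token-id lists, without a tokenizer. This is a sketch under assumptions: `fit_left`, `fit_right`, and the token ids are made up for illustration; only the lengths (36, 46, T=32) come from the numbers above.

```python
def fit_left(ids: list[int], length: int, pad_id: int = 0) -> list[int]:
    """Left-truncate or left-pad `ids` to exactly `length` items."""
    if len(ids) >= length:
        return ids[len(ids) - length:]          # drop tokens from the left
    return [pad_id] * (length - len(ids)) + ids  # pad on the left


def fit_right(ids: list[int], length: int, pad_id: int = 0) -> list[int]:
    """Right-truncate or right-pad `ids` (the default-like behaviour)."""
    if len(ids) >= length:
        return ids[:length]                      # drop tokens from the right
    return ids + [pad_id] * (length - len(ids))


T = 32
MARKER = [901, 902, 903]                    # stand-in for the assistant marker
student = list(range(1, 34)) + MARKER       # 36 tokens
teacher = list(range(101, 144)) + MARKER    # 46 tokens

left_s, left_t = fit_left(student, T), fit_left(teacher, T)
right_s, right_t = fit_right(student, T), fit_right(teacher, T)
short = fit_left([7, 8] + MARKER, 8)        # shorter than the target length
```

Left-fitting keeps `MARKER` at the right edge of both contexts; right-fitting cuts it off for both, which is the pre-fix failure mode.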

Verified bit-identical right-most 16 positions across student vs
teacher contexts (the assistant-generation marker + last few prompt
tokens are the common suffix; chat-template appends the same
`<|im_start|>assistant\n` regardless of how many system turns
preceded). Post-fix SDPO signal is 0.0611-0.0642 (vs 0.1358-0.1429
pre-fix); the lower number is the meaningful one because the channel
now computes JSD over actually-aligned positions instead of random
misaligned tokens. Total loss starts at 2.22 (vs 5.98 pre-fix) and
drops to 1.28 over 5 SGD steps in 21.3s on CPU.
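
Why misalignment inflates the number can be seen with a textbook Jensen-Shannon divergence averaged over masked positions. This is a sketch only: the formula actually used by the SDPO channel is not shown in this commit, and the function names and toy distributions are assumptions.

```python
import math


def kl(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats, skipping zero-probability entries of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence between two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def masked_mean_jsd(student, teacher, mask) -> float:
    """Mean per-position JSD over positions where mask is 1."""
    pairs = [(s, t) for s, t, keep in zip(student, teacher, mask) if keep]
    if not pairs:
        return 0.0
    return sum(jsd(s, t) for s, t in pairs) / len(pairs)


# Toy per-position distributions over a 2-token vocabulary.
seq = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
shifted = seq[1:] + seq[:1]   # same distributions, off by one position

aligned_score = masked_mean_jsd(seq, seq, [1, 1, 1])
misaligned_score = masked_mean_jsd(seq, shifted, [1, 1, 1])
```

Identical models score zero when positions line up and a positive value when they are shifted, so a larger pre-fix signal is consistent with comparing the wrong positions.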

This is the alignment discipline production SDPO requires —
composer_replication/trainer/composer_trainer.py:_compute_sdpo_loss
raises a shape-mismatch warning and skips the channel if student/
teacher logits don't match shape, so misalignment in production
silently disables the column.
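
A warn-and-skip guard of the kind described above might look like the sketch below. The function name, its signature, and the use of `warnings.warn` are assumptions; the body of `_compute_sdpo_loss` is not part of this commit.

```python
import warnings
from typing import Callable, Optional, Sequence


def sdpo_term_or_skip(
    student_shape: Sequence[int],
    teacher_shape: Sequence[int],
    compute: Callable[[], float],
) -> Optional[float]:
    """Return the SDPO term, or None (with a warning) on a shape mismatch."""
    if tuple(student_shape) != tuple(teacher_shape):
        warnings.warn(
            f"SDPO skipped: student logits {tuple(student_shape)} vs "
            f"teacher logits {tuple(teacher_shape)}",
            RuntimeWarning,
            stacklevel=2,
        )
        return None  # caller drops the channel for this step
    return compute()


# Usage: equal shapes run the channel, unequal shapes skip it.
ran = sdpo_term_or_skip((2, 32, 8), (2, 32, 8), lambda: 0.5)
```

A guard like this only sees shapes: two (B, T, V) tensors pass it whether or not their positions line up, which is why the example enforces alignment at tokenization time.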

Wave 17c — close all 5 audit FLAGs from WAVE_16_RECON_AUDIT.md
Each FLAGged proposal section in
docs/research/{DILOCO_SERVERLESS,REPLAYSIM_NORMALIZATION,
TRACE_SOURCE}_RECONNAISSANCE.md and
docs/research/{RL_FRAMEWORKS,SELF_DISTILLATION}_LANDSCAPE.md
now has a "Realised in v0.1" blockquote section IMMEDIATELY ABOVE
the historical proposal sketch, documenting:
- The actual public surface (imports + class/function names)
- The actual file paths and module layout
- Constructor signatures and key kwarg names that differ from
the proposal
- Cross-references to the realised tests / verification harnesses

The HTML AUDIT comments are removed (replaced by the blockquote
sections, which serve the same purpose with more useful content).
The historical proposal sketches are preserved verbatim below — they
document the shape of pre-ADR thinking that fed each ADR, which is
valuable archival context.

WAVE_16_RECON_AUDIT.md gets a closeout banner explaining all 5 FLAGs
are resolved + how (proposal docs got Realised-in-v0.1 sections;
FLAG #1's underlying code bug fixed in Wave 17a). The original
FLAG-list table is preserved for archival continuity.

Cross-model review (route-fidelity-verified via direct urllib scatter)
- gemini-3.1-pro-preview-20260219 ($0.049, 27s): APPROVED. Walked
through 2 of the 5 doc rewrites + verified the README expected-
output numbers match what run.py produces + sanity-checked the
closeout banner placement + spot-checked 2 cross-link paths. All
coverage passed clean.
- grok-4.3-20260430 ($0.013, 7s): REQUEST_CHANGES with 1 BLOCKER —
"eager `from .modal import ModalExecutor` at package import time
will raise if `modal` isn't installed". Verified false-positive
empirically: modal.py only imports typing + framework-internal
classes at module level (the actual `modal` package is deferred
to class-method bodies); test environment has modal NOT installed
and `from composer_replication.diloco.serverless import
ModalExecutor` works fine. Per subagent-driven-development.md
"verifying subagent claims": spot-check before applying.
- deepseek-v4-pro-20260423: failed to produce visible output — burned
all 4000 max_tokens on hidden reasoning_tokens. Methodological
note for future reviews: when using DeepSeek-V4-Pro for adversarial
review, set max_tokens >= 6000 to leave room for reasoning + output.

Test count after Wave 17: 177 passed / 2 skipped (was 176/2 in
the same scope; the +1 is the public-reexport regression test).
The example's run.py acceptance assertions pass with the new alignment
numbers.

Files changed (10)

composer_replication/diloco/serverless/__init__.py +4 -0
composer_replication/diloco/serverless/tests/test_serverless_local.py +42 -0
docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md +41 -10
docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md +40 -13
docs/research/RL_FRAMEWORKS_LANDSCAPE.md +36 -4
docs/research/SELF_DISTILLATION_LANDSCAPE.md +36 -7
docs/research/TRACE_SOURCE_RECONNAISSANCE.md +29 -6
docs/research/WAVE_16_RECON_AUDIT.md +10 -0
examples/gsm8k_grpo_with_sdpo/README.md +15 -12
examples/gsm8k_grpo_with_sdpo/run.py +59 -24

composer_replication/diloco/serverless/__init__.py CHANGED

@@ -52,10 +52,14 @@ from composer_replication.diloco.serverless.executor import (
     ReplicaHandle,
     ServerlessExecutor,
 )
+from composer_replication.diloco.serverless.hf_jobs import HFJobsExecutor
+from composer_replication.diloco.serverless.modal import ModalExecutor
 __all__ = [
+    "HFJobsExecutor",
     "LocalProcessExecutor",
     "MockManager",
+    "ModalExecutor",
     "ObjectStoreAllReduce",
     "ReplicaHandle",
     "ServerlessExecutor",

composer_replication/diloco/serverless/tests/test_serverless_local.py CHANGED

@@ -248,3 +248,45 @@ def test_mock_manager_shape_compat():
         assert hasattr(work, "wait") and callable(work.wait)
         assert work.wait() is True
         torch.testing.assert_close(buf, t, atol=1e-6, rtol=1e-6)
+# ---------------------------------------------------------------------
+# Public re-export surface (Wave 17a)
+# ---------------------------------------------------------------------
+def test_public_reexports_include_all_executors():
+    """`from composer_replication.diloco.serverless import …` must
+    surface every executor adapter the module's docstring claims, not
+    just the LocalProcessExecutor.
+    Wave 16's user-journey reviewer caught that ModalExecutor /
+    HFJobsExecutor were defined in `modal.py` / `hf_jobs.py` but not
+    re-exported from the package's `__init__.py`. Users who copied the
+    docstring's `from composer_replication.diloco.serverless import
+    ModalExecutor` line got an ImportError. Wave 17a added the missing
+    re-exports; this test pins them.
+    """
+    import composer_replication.diloco.serverless as ss
+    expected = {
+        "LocalProcessExecutor",
+        "ModalExecutor",
+        "HFJobsExecutor",
+        "MockManager",
+        "ObjectStoreAllReduce",
+        "ReplicaHandle",
+        "ServerlessExecutor",
+    }
+    actual = set(ss.__all__)
+    assert expected.issubset(actual), (
+        f"Missing re-exports: {expected - actual}. "
+        f"__all__ should include every executor adapter the package "
+        f"docstring documents."
+    )
+    # Also verify each name is actually importable, not just listed.
+    for name in expected:
+        assert hasattr(ss, name), (
+            f"{name} listed in __all__ but not present on package."
+        )

docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md CHANGED

@@ -621,16 +621,47 @@ if __name__ == "__main__":
 ### 3.4 Package layout
-<!-- AUDIT: stale_serverless_layout — ADR-005 shipped a flatter layout than this
-     proposal. Actual modules under composer_replication/diloco/serverless/
-     are: __init__.py, executor.py (ServerlessExecutor + LocalProcessExecutor),
-     allreduce.py (ObjectStoreAllReduce + MockManager), modal.py (ModalExecutor),
-     hf_jobs.py (HFJobsExecutor), replica_entrypoint.py. No leading underscores,
-     no _protocol/_base/_rendezvous split, and Modal/HFJobs are flat modules
-     rather than subpackages. The above code-block file headers (e.g.
-     `_modal_adapter.py`, `_hf_jobs_adapter.py`, `_protocol.py`, `_rendezvous.py`)
-     are pre-implementation proposals; map them to the realised module names
-     when reading. -->
+> **Realised in v0.1 (Wave 17 update):** ADR-005 shipped a flatter
+> layout than the proposal below. The actual `composer_replication/diloco/serverless/`
+> tree:
+>
+> ```
+> composer_replication/
+> └── diloco/
+>     ├── __init__.py            # existing: make_diloco_outer_loop, torchft import
+>     └── serverless/
+>         ├── __init__.py        # re-exports all public classes (Wave 17a)
+>         ├── executor.py        # ServerlessExecutor Protocol + ReplicaHandle
+>         │                      #   + LocalProcessExecutor concrete adapter
+>         ├── allreduce.py       # ObjectStoreAllReduce + MockManager
+>         ├── modal.py           # ModalExecutor (skeleton — see __init__ docstring)
+>         ├── hf_jobs.py         # HFJobsExecutor (skeleton — uses huggingface_hub.run_job)
+>         ├── replica_entrypoint.py    # script each replica runs
+>         └── tests/             # multi-process file:// rendezvous tests
+> ```
+>
+> No leading underscores, no `_protocol`/`_base`/`_rendezvous` split,
+> and Modal/HFJobs are flat modules rather than subpackages. The full
+> public re-export surface (verified by
+> `tests/test_serverless_local.py::test_public_reexports_include_all_executors`):
+>
+> ```python
+> from composer_replication.diloco.serverless import (
+>     ServerlessExecutor,        # Protocol — implement to add your own backend
+>     LocalProcessExecutor,      # multi-process local replicas (CPU/GPU)
+>     ModalExecutor,             # Modal cloud — skeleton in modal.py
+>     HFJobsExecutor,            # HuggingFace Jobs — skeleton in hf_jobs.py
+>     ObjectStoreAllReduce,      # fsspec-backed allreduce (s3://, gs://, file://, hf://)
+>     MockManager,               # torchft.Manager-shaped duck-type
+>     ReplicaHandle,             # opaque handle returned by launch_replicas
+> )
+> ```
+>
+> Wave 16's user-journey reviewer caught that earlier versions of this
+> `__init__.py` defined `ModalExecutor` and `HFJobsExecutor` in their
+> respective modules but failed to re-export them from the package
+> namespace. Wave 17a fixed the re-exports and added a regression
+> test. The proposal below predates that fix.
 ```
 composer_replication/

docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md CHANGED

@@ -312,19 +312,46 @@ write_jsonl(out_path, pairs)
 ### 4.3 Adapter shape (`replaysim/normalize.py`)
-<!-- AUDIT: stale_replaysim_paths_and_dpo_shape — ADR-004 shipped at
-     composer_replication/replaysim/normalize.py with a different DPOPair shape
-     than this sketch. Actual DPOPair is a TypedDict with fields
-     {state_id, state_messages, chosen: str, rejected: str, n_teachers_agreeing}
-     — NOT {prompt, chosen, rejected, state, meta} as in the proposal below. The
-     YAML recipe also lives at composer_replication/recipes/replaysim/default.yaml
-     (not composer_replication/replaysim/recipes/dpo_normalize.yaml). The hook
-     in §4.5 is provided by `replay_and_normalize_trace` in
-     composer_replication/replaysim/__init__.py rather than a drop-in edit to
-     `teacher_replay.py`. The custom op file (§4.4 line 426 / §4.4 line 431)
-     `composer_replication/replaysim/ops/preference_validator.py` was not
-     created. Treat the sketch below as proposal, not as documentation of
-     the realised code. -->
+> **Realised in v0.1 (Wave 17 update):** ADR-004 shipped with a different
+> public surface than the sketch below. The actual API:
+>
+> ```python
+> from composer_replication.replaysim import (
+>     replay_and_normalize_trace,    # convenience wrapper
+>     DJNormalizer,                  # the normalizer class
+>     DPOPair,                       # input TypedDict (from teacher_replay)
+>     NormalizedDPOPair,             # output TypedDict
+>     replay_trace, extract_dpo_pairs,  # re-exports of upstream stages
+> )
+> ```
+>
+> Key shape differences from the sketch:
+>
+> 1. **`DPOPair` is a TypedDict, not a dataclass.** Its actual fields
+>    are `{state_id: str, state_messages: list[dict], chosen: str,
+>    rejected: str, n_teachers_agreeing: int}` (defined in
+>    `composer_replication/teacher_replay.py:99`) — **not**
+>    `{prompt, chosen, rejected, state, meta}`. The `_to_dj`/`_from_dj`
+>    sketch round-trip below would not type-check against the realised
+>    TypedDict.
+> 2. **Recipe path is `composer_replication/recipes/replaysim/default.yaml`**,
+>    not `composer_replication/replaysim/recipes/dpo_normalize.yaml`.
+>    There is no `replaysim/recipes/` subpackage; recipes live under
+>    the top-level `recipes/` tree.
+> 3. **No `composer_replication/replaysim/ops/` subpackage exists.**
+>    The custom op file `preference_validator.py` was not created;
+>    data-juicer's stock ops + the framework's own validation in
+>    `DJNormalizer` covered the requirement.
+> 4. **The integration hook is `replay_and_normalize_trace(...)`** in
+>    `composer_replication/replaysim/__init__.py` (re-exported from
+>    `normalize.py`). It wraps the existing `replay_trace` +
+>    `extract_dpo_pairs` flow without modifying `teacher_replay.py`.
+>    There is no separate `composer_replication/replaysim/teacher_replay.py`
+>    — `teacher_replay` lives at top-level `composer_replication/teacher_replay.py`.
+>
+> The pre-spike sketch below is preserved as historical proposal context.
+> It documents the shape of thinking that fed ADR-004; the realised code
+> is the source of truth for the adapter contract.
 ```python
 # composer_replication/replaysim/normalize.py

docs/research/RL_FRAMEWORKS_LANDSCAPE.md CHANGED

@@ -313,10 +313,42 @@ group_size = 16
 [trainer]
 algorithm = "grpo"
-<!-- AUDIT: stale_recipe_format — Wave 14b shipped this as YAML at
-     composer_replication/recipes/prime_rl/prime_rl_config.yaml with a different
-     kwarg surface (alpha_sdpo, beta_dpo, dppo_mask_high, dppo_mask_low, adv_tau,
-     kl_tau). The TOML/hint_weight/replay_weight sketch below predates that. -->
+> **Realised in v0.1 (Wave 17 update):** Wave 14b shipped the PRIME-RL
+> recipe at `composer_replication/recipes/prime_rl/prime_rl_config.yaml`
+> as **YAML** with a different kwarg surface than the TOML sketch below.
+> The actual recipe shape:
+>
+> ```yaml
+> # composer_replication/recipes/prime_rl/prime_rl_config.yaml
+> model:
+>   base: "Qwen/Qwen2.5-0.5B"
+>   attn_implementation: "flash_attention_2"
+>   dtype: "bfloat16"
+> env:
+>   protocol: "verifiers"
+>   config: { name: "math/gsm8k", split: "train" }
+> loss:
+>   custom:
+>     import_path: "composer_replication.recipes.prime_rl.composer_loss:loss_fn"
+>     kwargs:
+>       alpha_sdpo:     0.0      # channel 2 deferred in v0
+>       beta_dpo:       0.0      # channel 3 out-of-scope for PRIME-RL v0
+>       dppo_mask_high: 0.2      # PRIME-RL DPPO convention (NOT textbook PPO)
+>       dppo_mask_low:  0.2      #   both must be >= 0 per Field(..., ge=0)
+>       adv_tau:        1.0      # advantage normalization
+>       kl_tau:         0.04     # KL coefficient
+> ```
+>
+> The realised `loss_fn(inputs, **kwargs)` matches PRIME-RL's
+> `LossInputs`/`LossOutputs` interface (read upstream `prime_rl/loss.py`
+> for parity verification — Wave 14b's shadow-parity test independently
+> restates the formula in
+> `composer_replication/recipes/prime_rl/tests/test_composer_loss.py`).
+>
+> The pre-Wave-14b TOML/`hint_weight`/`replay_weight` sketch below is
+> preserved as historical proposal context.
 [trainer.loss]
 type = "custom"
 import_path = "composer_replication.recipes.prime_rl.composer_loss:loss_fn"

docs/research/SELF_DISTILLATION_LANDSCAPE.md CHANGED

@@ -352,13 +352,42 @@ license + reproducible scale) to recommend adding right now.
 For ADR-007 the proposed addition is a `composer_replication.distillation`
 sub-package with three pluggable hooks:
-<!-- AUDIT: stale_distillation_layout — ADR-007 shipped a flatter layout than
-     this proposal. Actual modules: composer_replication/distillation/{simpo.py,
-     taid.py, entropy_aware_opd.py}. There is no targets.py/losses.py split,
-     no top-level preference/ subpackage, and SimPO lives under distillation/
-     rather than preference/. The function names also differ: actual exports
-     are `simpo_loss`, `taid_loss` + `TAIDScheduler`, and `entropy_aware_opd_loss`
-     (not `taid_target` / `entropy_aware_kl_loss`). -->
+> **Realised in v0.1 (Wave 17 update):** ADR-007 shipped a flatter
+> layout than the proposal below. Actual exports:
+>
+> ```
+> composer_replication/
+>   distillation/
+>     __init__.py
+>     simpo.py              # simpo_loss(chosen_avg_logprobs, rejected_avg_logprobs, *, beta, gamma)
+>                           # avg_sequence_logprob(logprobs, mask)  -- helper
+>     taid.py               # taid_loss(student_logits, teacher_logits, t, *, ...)
+>                           # TAIDScheduler  -- adaptive momentum schedule per the paper
+>     entropy_aware_opd.py  # entropy_aware_opd_loss(student_logits, teacher_logits, *, h_max, ...)
+> ```
+>
+> No `targets.py`/`losses.py` split, no top-level `preference/` package,
+> and SimPO lives under `distillation/` rather than `preference/` because
+> the three losses share a common dispatch surface (`compose_loss`'s
+> `dpo_variant` and `sdpo_wrapper` switches).
+>
+> The composition rule realised in `compose_loss` is per-loss flag-driven,
+> not a single composed-function call:
+>
+> ```python
+> compose_loss(model, inputs,
+>     dpo_variant="simpo",          # OR "dpo" (default)
+>     sdpo_wrapper="taid",          # OR "entropy_opd" OR "none" (default)
+>     taid_t=0.5,                    # required when sdpo_wrapper="taid"
+>     simpo_beta=2.0, simpo_gamma=1.0,  # used only when dpo_variant="simpo"
+>     entropy_opd_h_max=...,         # used only when sdpo_wrapper="entropy_opd"
+> )
+> ```
+>
+> The pre-ADR proposal sketch below is preserved as historical context.
+> The shipped function names are `simpo_loss`, `taid_loss` +
+> `TAIDScheduler`, and `entropy_aware_opd_loss` (not `taid_target` /
+> `entropy_aware_kl_loss`).
 ```
 composer_replication/

docs/research/TRACE_SOURCE_RECONNAISSANCE.md CHANGED

@@ -244,12 +244,35 @@ For users on other machines: `find ~/.claude/projects -name '*.jsonl' -size +50k
 ## 6. TraceIngester sketch
-<!-- AUDIT: stale_ingester_paths_and_naming — Spike 007 shipped at
-     spikes/007-real-trace-ingestion/claude_code_ingester.py (NOT
-     spikes/007-trace-ingester/trace_ingester.py) and the production-side
-     module is composer_replication/ingestion/claude_code.py exporting
-     `ClaudeCodeIngester` (NOT `TraceIngester`). The sketch below is the
-     pre-spike proposal; the realised API surface is named differently. -->
+> **Realised in v0.1 (Wave 17 update):** The realised ingester ships at
+> `composer_replication/ingestion/claude_code.py` exporting
+> `ClaudeCodeIngester`, with the spike at
+> `spikes/007-real-trace-ingestion/claude_code_ingester.py`. The
+> public production surface is:
+>
+> ```python
+> from pathlib import Path
+> from composer_replication.ingestion.claude_code import ClaudeCodeIngester
+>
+> ingester = ClaudeCodeIngester(skip_sidechain=True, strip_thinking=True)
+> for trace_state in ingester.ingest(Path("~/.claude/projects/.../session.jsonl").expanduser()):
+>     # trace_state matches the TraceState TypedDict from §1
+>     ...
+> stats = ingester.last_stats  # IngestionStats — turn counts, skip reasons
+> ```
+>
+> The shipped `ClaudeCodeIngester` differs from the pre-spike sketch
+> below in:
+> - Class name: `ClaudeCodeIngester` (not `TraceIngester`)
+> - Module path: `composer_replication.ingestion.claude_code` (not
+>   `spikes/007-trace-ingester/trace_ingester.py`)
+> - The constructor takes config kwargs (`system_prompt`,
+>   `skip_sidechain`, `strip_thinking`, `max_history_tokens`); paths
+>   are passed to `.ingest(Path)` per call instead of being held by the
+>   ingester
+> - The yielded type is `TraceState` (matches §1)
+>
+> The pre-spike sketch below is preserved as historical proposal context.
 Drop-in adapter for spike-005's `replay_trace()`. Targets `TraceState` (the actual existing TypedDict; see §1).

docs/research/WAVE_16_RECON_AUDIT.md CHANGED

@@ -125,6 +125,16 @@ spike-shape names that do not match what shipped.
 ## Open items for Wave 17+
+> **Wave 17 closeout (2026-05-26):** All 5 FLAGs below were resolved
+> by adding "Realised in v0.1" companion sections to the affected
+> proposal docs that document the shipped surface inline above the
+> historical sketch. Wave 17 also fixed the underlying code bug from
+> FLAG #1 — `serverless/__init__.py` now re-exports `ModalExecutor`
+> and `HFJobsExecutor` (verified by
+> `composer_replication/diloco/serverless/tests/test_serverless_local.py::test_public_reexports_include_all_executors`).
+> The original FLAG-list below is preserved as historical record of
+> what the audit found.
 These are the FLAGged ambiguous claims that need orchestrator decision before
 a confident rewrite:

examples/gsm8k_grpo_with_sdpo/README.md CHANGED

@@ -23,23 +23,26 @@ The script will print 5 SGD steps' worth of channel-decomposed losses
 and end with three ✓ assertions:
 ```
-  step 1/5: total=5.9801  lm_ce=5.9087  sdpo_jsd=0.1429  trace_replay_dpo=0.0000  |grad|=6.45e+06
-  step 2/5: total=4.2268  lm_ce=4.1573  sdpo_jsd=0.1390  trace_replay_dpo=0.0000  |grad|=1.20e+06
+  step 1/5: total=2.2215  lm_ce=2.1898  sdpo_jsd=0.0634  trace_replay_dpo=0.0000  |grad|=1.38e+06
+  step 2/5: total=1.7695  lm_ce=1.7374  sdpo_jsd=0.0642  trace_replay_dpo=0.0000  |grad|=1.12e+06
   ...
-  step 5/5: total=2.4644  lm_ce=2.3962  sdpo_jsd=0.1363  trace_replay_dpo=0.0000  |grad|=1.03e+06
+  step 5/5: total=1.2781  lm_ce=1.2465  sdpo_jsd=0.0631  trace_replay_dpo=0.0000  |grad|=8.24e+05
 [4/4] Verifying SDPO column wiring ...
-  ✓ sdpo_jsd > 0 at every step (min=0.1358, max=0.1429)
-  ✓ total != lm_ce at every step (min |diff|=0.0679, max=0.0714)
-  ✓ |grad| > 0 and finite at every step (min=1.01e+06, max=6.45e+06)
+  ✓ sdpo_jsd > 0 at every step (min=0.0611, max=0.0642)
+  ✓ total != lm_ce at every step (min |diff|=0.0306, max=0.0321)
+  ✓ |grad| > 0 and finite at every step (min=8.24e+05, max=1.38e+06)
 ✅ SDPO column wiring verified end-to-end.
 ```
-Wall-clock on the reference run: **16.5s** for 5 SGD steps after a
-1.7s model-load phase (no model download — already cached). The SDPO
-signal magnitude (~0.14) is meaningful here because the script uses
-Qwen's actual ChatML markers (`<|im_start|>` / `<|im_end|>`) via
-`tokenizer.apply_chat_template` — not raw marker strings, which would
-be tokenized as 11 punctuation tokens and the model would see nonsense.
+Wall-clock on the reference run: **21.3s** for 5 SGD steps after a
+1.7s model-load phase (no model download — already cached). The script
+left-pads and left-truncates the chat-template'd input so student and
+teacher contexts are bit-identical on the right-most 16 positions —
+the same alignment discipline production SDPO requires (see `build_inputs`
+docstring for the alignment rationale and the link to
+`composer_replication/trainer/data_collator.py`). Without left-truncation
+the assistant marker gets dropped and the SDPO mask covers prompt-area
+tokens instead, inflating the channel signal on misaligned positions.
 If `sdpo_jsd` ever shows up as `0.0000`, the SDPO column is silent —
 that means either (a) `alpha_sdpo=0`, (b) `ctx_teacher_input_ids` is

examples/gsm8k_grpo_with_sdpo/run.py CHANGED

@@ -106,15 +106,36 @@ def build_inputs(tokenizer) -> dict[str, torch.Tensor]:
     """Tokenize PROBLEMS into a compose_loss-shaped batch.
     Returns a dict with:
-      - input_ids:              (B, T) student rollouts (no hint)
-      - response_mask:          (B, T)
-      - ctx_teacher_input_ids:  (B, T) hint-conditioned context
-      - sdpo_loss_mask:         (B, T) 1 at assistant-response tokens
+      - input_ids:              (B, T) student rollouts (no hint), left-padded
+      - response_mask:          (B, T) 1 on the assistant-response area
+      - ctx_teacher_input_ids:  (B, T) hint-conditioned context, left-padded
+      - sdpo_loss_mask:         (B, T) 1 at the aligned post-prompt area
+    SDPO requires student and teacher logits to align position-by-position
+    over the loss mask. The student and teacher prompts have different
+    prefix lengths (teacher is longer because of the inserted hint
+    system turn), so we LEFT-pad both to T tokens — the right edge (the
+    assistant generation marker) lines up across the batch and across
+    student vs teacher. The SDPO mask covers the right-most ALIGN_LEN
+    positions, all of which correspond to identical "post-prompt /
+    assistant-response area" tokens in both contexts.
+    This matches the alignment discipline the production
+    `ComposerDataCollator` (composer_replication/trainer/data_collator.py)
+    must enforce: the post-hint section must have identical token
+    positions in student vs teacher, or `_compute_sdpo_loss` will
+    detect a shape mismatch and skip the channel for that step.
     """
+    # ALIGN_LEN: how many right-most positions to use for the SDPO loss.
+    # These positions correspond to the assistant-generation area, which
+    # is identical (token-for-token) across student and teacher because
+    # apply_chat_template appends the same `<|im_start|>assistant\n`
+    # marker regardless of how many system turns came before.
+    ALIGN_LEN = T // 2  # 16 of 32; same as response_mask back-half
     student_msg_lists = [_build_chat_messages(p["question"], with_hint=False) for p in PROBLEMS[:B]]
     teacher_msg_lists = [_build_chat_messages(p["question"], with_hint=True) for p in PROBLEMS[:B]]
-    # Render via Qwen's chat template — produces real <|im_start|>/<|im_end|> tokens.
     student_strs = [
         tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True)
         for m in student_msg_lists
@@ -124,26 +145,40 @@ def build_inputs(tokenizer) -> dict[str, torch.Tensor]:
         for m in teacher_msg_lists
     ]
-    s_tok = tokenizer(
-        student_strs,
-        max_length=T,
-        truncation=True,
-        padding="max_length",
-        return_tensors="pt",
-    )
-    t_tok = tokenizer(
-        teacher_strs,
-        max_length=T,
-        truncation=True,
-        padding="max_length",
-        return_tensors="pt",
-    )
-    # Mark the second half of each sequence as the "response" — purely
-    # synthetic for this smoke; in real training the response_mask comes
-    # from the rollout pipeline.
+    # LEFT-pad AND LEFT-truncate to T: temporarily flip both the
+    # tokenizer's padding side and truncation side. This ensures the
+    # right edge (the assistant generation marker) is preserved at
+    # position T-1 regardless of whether the input is shorter than T
+    # (gets left-padded) or longer than T (gets left-truncated, dropping
+    # the leading system turns first). Without this, the default
+    # right-truncation discards the assistant marker — which means the
+    # SDPO mask covers tokens from the system prompt instead of the
+    # assistant response area, and the channel computes JSD over
+    # nonsense alignment.
+    original_pad = tokenizer.padding_side
+    original_trunc = tokenizer.truncation_side
+    tokenizer.padding_side = "left"
+    tokenizer.truncation_side = "left"
+    try:
+        s_tok = tokenizer(
+            student_strs, max_length=T, truncation=True,
+            padding="max_length", return_tensors="pt",
+        )
+        t_tok = tokenizer(
+            teacher_strs, max_length=T, truncation=True,
+            padding="max_length", return_tensors="pt",
+        )
+    finally:
+        tokenizer.padding_side = original_pad
+        tokenizer.truncation_side = original_trunc
+    # response_mask: 1 on the right-most ALIGN_LEN tokens, 0 elsewhere
+    # (left padding + prompt area). For both student and teacher these
+    # positions cover the assistant-generation marker + any padding
+    # that happens to fall there. Same indices apply to both because
+    # of left-padding alignment.
     response_mask = torch.zeros(B, T, dtype=torch.long)
-    response_mask[:, T // 2:] = 1
+    response_mask[:, -ALIGN_LEN:] = 1
     sdpo_loss_mask = response_mask.clone()
     return {