architect: ADR-011/012/013 + research (alignment-index fix, review-findings closure, LMA channel-ladder)

Files changed (6)

docs/adrs/ADR-011-sdpo-alignment-indices.md +102 -0
docs/adrs/ADR-012-close-review-findings.md +107 -0
docs/adrs/ADR-013-lma-integration-channel-ladder.md +113 -0
docs/adrs/README.md +3 -0
research/11-sdpo-alignment-indices.md +232 -0
research/12-altered-model-rl-critique.md +66 -0

docs/adrs/ADR-011-sdpo-alignment-indices.md ADDED

	@@ -0,0 +1,102 @@

+---
+status: accepted
+date: 2026-05-29
+amends: ADR-008
+deciders: [Codeseys, ARIA]
+---
+# ADR-011: Collator-emitted SDPO alignment indices (close the strict-guard regression)
+## Context and Problem Statement
+The 2026-05-29 cross-family review of ADR-008 found the SDPO student/teacher
+alignment guard was a *shape-only* check (`student_logits.shape ==
+teacher_logits.shape`), which does not establish token-level alignment because
+the teacher context has a hint inserted at the error turn (shifting response
+tokens right). The fix made `ComposerReplicationTrainer._compute_sdpo_loss`
+**require** explicit `student_response_idx` / `teacher_response_idx` LongTensors
+and `torch.gather` the aligned post-hint logits before JSD, raising in strict
+mode (the default) when they are absent.
+**Regression introduced:** the production collator
+(`composer_replication/trainer/data_collator.py`) does NOT emit those index
+tensors, so the default (strict) SDPO path now raises against the real collator.
+This ADR closes that gap.
+The collator already solves the hard alignment problem: `_build_aligned_student_for_sdpo`
+builds a student sequence that mirrors the hint-conditioned teacher by inserting
+a placeholder system-message of identical token length where the teacher has the
+hint, and `_build_chat_aligned_mask` (per-message `apply_chat_template`
+prefix-delta subsequence matching) marks the post-hint recovery-turn content
+tokens with 1 in both `sdpo_loss_mask` (teacher) and `response_mask` (student).
+So the 1-positions of both masks already correspond to the same logical tokens.
+Research: `/tmp/composer-research/r1-alignment-indices.md` (DeepSeek V4 Pro,
+2026-05-29).
+## Decision Drivers
+- The loss already requires the indices; the collator must supply them or strict
+  SDPO is unusable (it raises).
+- The indices are *derivable from masks the collator already computes correctly*
+  — no extra tokenizer calls, no new alignment logic.
+- The contract must be forward-compatible: if the placeholder trick is ever
+  dropped for dynamic-length alignment, distinct student/teacher indices still
+  describe the alignment.
+- Ragged K (different rows have different #error-turn tokens) must be handled
+  without silent padding-token contributions to the JSD.
+## Considered Options
+- **A. Derive indices on-the-fly inside the loss from `sdpo_loss_mask`** — couples
+  the loss to the collator's placeholder implementation detail; rejected.
+- **B. Collator emits explicit `student/teacher_response_idx` + `*_valid` masks,
+  derived from the existing chat-aligned masks** (chosen).
+- **C. Drop the index requirement, revert to shape-check** — re-opens the P0 the
+  review caught; rejected.
+## Decision Outcome
+Chosen: **Option B.** The collator emits four new batch keys when SDPO is active:
+`student_response_idx` (B, K_max), `teacher_response_idx` (B, K_max),
+`student_response_valid` (B, K_max bool), `teacher_response_valid` (B, K_max bool).
+A `_mask_to_padded_indices(mask, pad_sentinel=-1)` helper converts a (B, T)
+response mask to a padded (B, K_max) index tensor + validity mask (sentinel -1
+for ragged padding). The loss masks sentinel positions by building an
+`aligned_labels` tensor (1 where valid, -100 elsewhere) passed to
+`generalized_jsd_loss` (which already honors the -100 ignore convention).
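The mask-to-index conversion and the 1/-100 label convention described above can be illustrated with a minimal, list-based sketch. It is a stand-in for the tensor helper, not the collator's implementation: plain Python lists replace tensors, and the names `mask_to_padded_indices`, `aligned_labels`, and `IGNORE_INDEX` are chosen here for illustration. The example input mirrors the K=3 / K=1 case from the acceptance gate.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # the ignore convention named in the ADR


def mask_to_padded_indices(
    mask: List[List[int]], pad_sentinel: int = -1
) -> Tuple[List[List[int]], List[List[bool]]]:
    """Convert a (B, T) 0/1 mask into (B, K_max) indices plus a validity mask."""
    # Positions of the 1-bits in each row; rows may have different lengths (ragged K).
    rows = [[t for t, bit in enumerate(row) if bit == 1] for row in mask]
    k_max = max((len(r) for r in rows), default=0)
    # Pad short rows with the sentinel and record which slots are real.
    idx = [r + [pad_sentinel] * (k_max - len(r)) for r in rows]
    valid = [[True] * len(r) + [False] * (k_max - len(r)) for r in rows]
    return idx, valid


def aligned_labels(valid: List[List[bool]]) -> List[List[int]]:
    """1 where the slot is a real position, IGNORE_INDEX where it is padding."""
    return [[1 if v else IGNORE_INDEX for v in row] for row in valid]


# Two rows with K=3 and K=1.
mask = [
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
]
idx, valid = mask_to_padded_indices(mask)
labels = aligned_labels(valid)
```

In the tensor version, sentinel slots still need a real (in-range) index before `torch.gather`; the label mask is what keeps them out of the loss.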
+### Consequences
+- **Positive**: strict SDPO works against the real collator; the silent-misalignment
+  P0 stays closed; no extra forward/tokenizer passes.
+- **Positive**: forward-compatible — distinct indices survive a non-placeholder future.
+- **Neutral**: a debug-mode assertion `(s_idx == t_idx)[valid].all()` can verify the
+  placeholder trick is still intact when sequences are same-length.
+- **Negative**: +4 batch keys; documented in the collator output contract.
+## Acceptance gate (must be green before status flips to accepted)
+- [ ] `_mask_to_padded_indices` implemented; ragged-K rows pad to K_max with
+  sentinel -1 + a `*_valid` bool tensor. Unit test: 2 rows with K=3 and K=1 →
+  (2, 3) idx with row-1 tail = -1 and valid[1] = [T,F,F].
+- [ ] `ComposerDataCollator.__call__` emits the 4 keys whenever
+  `sdpo_loss_mask` + `response_mask` are present. Unit test asserts presence +
+  shapes + that `student_response_idx == teacher_response_idx` at valid
+  positions for the same-length placeholder path.
+- [ ] `_compute_sdpo_loss` masks sentinels via `aligned_labels` (1/-100); a
+  sentinel position contributes 0 to the JSD. Unit test: a 2-row batch with
+  ragged K produces a finite loss and the K=1 row's padding doesn't leak.
+- [ ] End-to-end: real `ComposerDataCollator` (with a stub tokenizer + a hint
+  generator) → batch → `_compute_sdpo_loss` runs in **strict mode** without
+  raising and returns a finite, positive loss. (This is the regression the ADR
+  closes — it must be a test, not a claim.)
+- [ ] No regression: the existing alignment tests in
+  `test_dr_grpo_config_and_alignment.py` still pass.
+## More Information
+- `/tmp/composer-research/r1-alignment-indices.md` — full design + code sketch.
+- ADR-008 — the strict-guard fix this ADR completes (amends).
+- `composer_replication/trainer/data_collator.py` `_build_chat_aligned_mask`,
+  `_build_aligned_student_for_sdpo`.

docs/adrs/ADR-012-close-review-findings.md ADDED

	@@ -0,0 +1,107 @@

+---
+status: accepted
+date: 2026-05-29
+amends: [ADR-008, ADR-009, ADR-010]
+deciders: [Codeseys, ARIA]
+---
+# ADR-012: Close the open cross-family-review findings (KL estimator, hint routing, AST provenance, curriculum signals)
+## Context and Problem Statement
+The 2026-05-29 cross-family review left five OPEN follow-ups after the
+silent-misalignment P0 (ADR-011) was fixed. None corrupt a run, but they are
+fidelity/correctness gaps that the "accepted" ADRs acknowledged. This ADR
+bundles the four that are CPU-fixable now (the fifth — Docker substrate e2e — is
+ADR-010's hardware-blocked gate, tracked separately in ADR-013/B6).
+The findings (verified against code during the review):
+1. **k1 KL estimator unasserted (ADR-008, 2 reviewers).** `make_dr_grpo_config`
+   claims Composer uses the k1 (`−log r`) KL estimator and that "TRL's native
+   estimator" satisfies this, but nothing configures or verifies it against TRL
+   1.5.0's actual GRPO KL branch.
+2. **Hint default-routing eager raw-error (ADR-009, GPT-5.5 P1).** The default
+   composite is template → raw-error → judge; any uncovered *style/communication/
+   effort* site that carries an `error_message` is consumed by the raw-error
+   layer and never reaches the LLM judge — exactly the sites the judge exists to
+   cover. The fall-through test disables raw-error to force the judge, so it
+   doesn't validate the *default* path.
+3. **HackMonitor overclaimed as AST-provenance (ADR-010, DeepSeek P0).** It is a
+   substring matcher, defeated by string-concat (`"__py"+"cache__"`). The ADR's
+   §3c calls it an "AST provenance monitor."
+4. **Curriculum ignores turns/think-tokens (ADR-010, 2 reviewers).** The Composer
+   2 tech report keys the curriculum on rollout #turns + thinking-token count;
+   the implementation tracks only pass-rate.
+## Decision Outcome
+Fix all four CPU-fixable findings:
+1. **k1 KL — assert, don't just claim.** Add a `kl_estimator` check in
+   `make_dr_grpo_config`: probe TRL 1.5.0's GRPO KL branch (the `_compute_kl` /
+   `loss_type` path) and assert it is the k1 (`−log r`) family, OR — if TRL's
+   estimator is k3 — document that explicitly and expose a `kl_estimator="k1"`
+   knob that the trainer applies in its own KL term. Add a unit test computing KL
+   for known logprob pairs and asserting the k1 value (`−log r` with
+   `r = π_ref/π`, i.e. `logp − ref_logp` summed over tokens sampled from the
+   policy), distinguishing it from k3 (`exp(Δ) − Δ − 1` with
+   `Δ = ref_logp − logp`). If TRL cannot be forced to k1, the test documents the
+   delta and the ADR-008 claim is narrowed to "k1 in our own KL term; parent GRPO
+   KL is TRL's default."
+2. **Hint routing — error-kind aware.** Introduce a routing policy in
+   `default_composite`: tool/runtime error kinds may use the raw-error layer;
+   style/communication/effort kinds **skip raw-error and go to the judge**. A
+   `RoutingHintGenerator` (or a `route` predicate on the composite) implements
+   it. Test the *default* composite directly: a style site reaches the judge even
+   when it carries an `error_message`.
+3. **HackMonitor — either implement AST provenance or re-scope honestly.**
+   Implement a lightweight provenance check: scan the agent's *patch/diff* for
+   reintroduced `deleted_symbols` whose surrounding context indicates a
+   non-implementation path (copied from a cache dump, a decompiler output, a
+   sibling import that smuggles the symbol). Where full AST is overkill, use a
+   structured check: a deleted symbol reappearing verbatim adjacent to a
+   cache/decompiler/file-read action in the trajectory → flag. Keep the substring
+   layer as defense-in-depth. Update ADR-010 §3c language to "signature +
+   patch-provenance monitor" (not "AST"). Test: a string-concat-obfuscated cache
+   read is still flagged via the patch-provenance path (or the ADR honestly
+   documents the residual bypass).
+4. **Curriculum turn/think-token signals.** Extend `DifficultyCurriculum.update`
+   to accept optional `turns: float` and `think_tokens: float` per exposure,
+   track per-task moving averages, and incorporate them into the difficulty
+   weighting (higher turns/think-tokens at the same pass-rate ⇒ harder ⇒ keep on
+   the frontier longer). Backward-compatible (both optional, default None).
+   Test: two tasks with identical pass-rate but different mean turns weight the
+   higher-turn task ≥ the lower-turn task.
+### Consequences
+- **Positive**: the ADRs' claims now match the code; the review's OPEN list
+  shrinks to just the Docker-gated item.
+- **Neutral**: the HackMonitor remains heuristic (it always will be); the ADR
+  language is the thing being corrected, plus a real patch-provenance layer.
+- **Negative**: if TRL 1.5.0's KL genuinely can't be forced to k1, finding #1 is
+  a documentation fix + a knob, not a behavior change — that's an acceptable
+  honest outcome, recorded in the test.
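The error-kind-aware routing from finding 2 can be sketched as follows. Everything here is hypothetical scaffolding: `ErrorSite`, the three layer functions, and the kind labels are stand-ins, since the real `default_composite` and its error taxonomy are not shown in this change.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical kind labels; kinds listed here skip the raw-error layer.
JUDGE_ONLY_KINDS = frozenset({"style", "communication", "effort"})


@dataclass
class ErrorSite:
    kind: str
    error_message: Optional[str] = None


def template_layer(site: ErrorSite) -> Optional[str]:
    """Stand-in: pretend no template covers the site."""
    return None


def raw_error_layer(site: ErrorSite) -> Optional[str]:
    """Stand-in: echo the raw error message when one exists."""
    return site.error_message


def judge_layer(site: ErrorSite) -> str:
    """Stand-in for the LLM judge; always produces a hint."""
    return f"judge hint for {site.kind}"


def routed_hint(site: ErrorSite) -> Tuple[str, str]:
    """Return (layer_name, hint) under the error-kind-aware policy."""
    hint = template_layer(site)
    if hint is not None:
        return "template", hint
    # Style/communication/effort sites never consult the raw-error layer.
    if site.kind not in JUDGE_ONLY_KINDS:
        hint = raw_error_layer(site)
        if hint is not None:
            return "raw_error", hint
    return "judge", judge_layer(site)


style_route = routed_hint(ErrorSite("style", "line too long"))
tool_route = routed_hint(ErrorSite("tool", "ENOENT: no such file"))
```

The style site reaches the judge despite carrying an `error_message`, which is the behavior the default-path test is meant to pin down.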
+## Acceptance gate
+- [ ] k1 KL: unit test computes k1 vs k3 for known logprob pairs; the trainer's
+  effective KL term is asserted k1 (or the delta is documented + a `kl_estimator`
+  knob exists). ADR-008's KL bullet updated to match reality.
+- [ ] Hint routing: `default_composite` routes style/communication/effort sites
+  to the judge even with an `error_message` present; tested on the DEFAULT
+  composite (not a raw-error-disabled variant). ADR-009 routing note updated.
+- [ ] HackMonitor: patch-provenance check flags a string-concat-obfuscated cache
+  read OR the residual bypass is honestly documented; ADR-010 §3c re-worded from
+  "AST provenance" to the accurate description.
+- [ ] Curriculum: `update(turns=, think_tokens=)` optional args + moving averages
+  + weighting; backward-compatible; tested.
+- [ ] Full suite green, no regressions.
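One way the curriculum gate could be satisfied is sketched below. It is an assumption-laden illustration, not the project's `DifficultyCurriculum`: the class name, the exponential moving average, the `4p(1-p)` frontier term, and the scale constants are all hypothetical choices made to show optional `turns` / `think_tokens` inputs raising a task's weight at equal pass-rate.

```python
from dataclasses import dataclass
from typing import Dict, Optional


def _ema(old: Optional[float], new: Optional[float], momentum: float) -> Optional[float]:
    """Exponential moving average that ignores missing observations."""
    if new is None:
        return old
    if old is None:
        return float(new)
    return momentum * old + (1.0 - momentum) * float(new)


@dataclass
class TaskStats:
    pass_rate: Optional[float] = None
    turns: Optional[float] = None
    think_tokens: Optional[float] = None


class DifficultyCurriculumSketch:
    def __init__(self, momentum: float = 0.9, turn_scale: float = 10.0,
                 think_scale: float = 1000.0, floor: float = 0.05) -> None:
        self.momentum = momentum
        self.turn_scale = turn_scale      # hypothetical normalizer for turns
        self.think_scale = think_scale    # hypothetical normalizer for think tokens
        self.floor = floor                # keeps solved/unsolved tasks sampleable
        self.stats: Dict[str, TaskStats] = {}

    def update(self, task_id: str, passed: bool,
               turns: Optional[float] = None,
               think_tokens: Optional[float] = None) -> None:
        s = self.stats.setdefault(task_id, TaskStats())
        s.pass_rate = _ema(s.pass_rate, 1.0 if passed else 0.0, self.momentum)
        s.turns = _ema(s.turns, turns, self.momentum)
        s.think_tokens = _ema(s.think_tokens, think_tokens, self.momentum)

    def weight(self, task_id: str) -> float:
        s = self.stats[task_id]
        p = 0.5 if s.pass_rate is None else s.pass_rate
        frontier = self.floor + 4.0 * p * (1.0 - p)  # peaks at pass-rate 0.5
        effort = (s.turns or 0.0) / self.turn_scale \
            + (s.think_tokens or 0.0) / self.think_scale
        return frontier * (1.0 + effort)


curriculum = DifficultyCurriculumSketch()
for passed in (True, False):
    curriculum.update("low_turns", passed, turns=2)
    curriculum.update("high_turns", passed, turns=8)
curriculum.update("legacy", True)  # old call style, no effort signals
```

Both optional arguments default to `None`, so existing callers that pass only the pass/fail outcome keep working unchanged.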
+## More Information
+- Cross-family review: `docs/reviews/cross-family-adr-008-009-010-2026-05-29/`.
+- ADRs 008/009/010 "Post-acceptance cross-family review" sections.

docs/adrs/ADR-013-lma-integration-channel-ladder.md ADDED

	@@ -0,0 +1,113 @@

+---
+status: accepted
+date: 2026-05-29
+supersedes_section: docs/ALTERED_MINDS_TIE_IN.md §"Concrete plan" Phase 3 hyperparameters
+deciders: [Codeseys, ARIA]
+---
+# ADR-013: llm-mental-alterations (LMA) integration — isolated-channel ladder, not combined recipe
+## Context and Problem Statement
+The north-star use case for this framework is driving the sister project
+**llm-mental-alterations** (LMA): take an LMA personality-altered SFT checkpoint
+(Llama-3.1-8B with depression/anxiety/etc. induction) and apply this framework's
+3-channel RL (Dr.GRPO + SDPO + trace-replay-DPO) to ask whether task-driven RL
+**washes out, preserves, or amplifies** the alteration's cognitive-distortion
+signature (the multi-seed −31pp MMLU moral_scenarios class-3 collapse).
+`docs/ALTERED_MINDS_TIE_IN.md` (Wave 13) proposed a Phase-3 run with
+`alpha_sdpo=0.2`, `beta_replay=0.4`, all channels ON simultaneously. A
+cross-family research critique (GPT-5.5, 2026-05-29,
+`/tmp/composer-research/r2-altered-model-rl.md`) found this **scientifically
+uninterpretable**: a combined run confounds four effects (task RL,
+self-distillation of altered reasoning, frontier-teacher imitation, KL
+anchoring), so any observed change cannot be attributed. Worse, **SDPO against
+the altered model's own hint-conditioned forward pass is the channel most likely
+to AMPLIFY the distortion** (teacher==student-family; if hints don't add
+independent information, the optimum is to imitate the altered conditional
+distribution, sharpening a soft bias into a hard preference). SDPO here is an
+*experimental intervention*, not a benign stabilizer.
+## Decision Drivers
+- The deliverable is a *usable, interpretable* integration, not just "it runs."
+- Attribution requires isolating channels; combined-first defeats the experiment.
+- Reward must resist the altered model letter-format-hacking MMLU.
+- Must measure washout vs amplification: dual-KL logging (to altered-init AND to
+  unaltered-base) is the instrument.
+- This framework stays generic; LMA-specific code lives in LMA's repo using the
+  framework as a dependency (per the tie-in doc's repo-layout proposal).
+## Decision Outcome
+**Build the integration glue as a framework-side, model-agnostic scaffold +
+unit-tested runners, with the isolated-channel ladder baked in as the default
+experiment design.** Specifically:
+1. **`composer_replication/integrations/altered_minds/` (framework-side, generic)**
+   — a thin, tested adapter providing:
+   - `MMLUFormatReward`: structured-answer reward (parse final `Answer: X` /
+     JSON `{"answer":...}`; `+1` correct, `0` wrong, `−0.2` unparseable, `−0.1`
+     multiple-answers, length penalty past a rationale cap; option-order
+     randomization with original-label tracking). **Scores only the final
+     answer, never the rationale style** (avoids rewarding distorted-but-
+     persuasive reasoning).
+   - `dual_kl_logger`: logs KL(policy‖altered-init) AND KL(policy‖unaltered-base)
+     each step — the washout/amplification instrument. Optimizes neither by
+     default; both are diagnostics.
+   - `channel_ladder_configs()`: returns the A0–A4 config ladder (see below) so
+     a runner can sweep them with identical seeds/prompts.
+2. **Isolated-channel ladder (the experiment design, replaces α=0.2/β=0.4):**
+   | Arm | alpha_sdpo | beta_replay | Purpose |
+   |---|---|---|---|
+   | A0 | — | — | altered SFT, no RL (control) |
+   | A1 | 0.0 | 0.0 | GRPO-only baseline |
+   | A2 | **0.02** | 0.0 | +SDPO small (amplification probe) |
+   | A3 | 0.0 | **0.05** | +replay-DPO small (washout probe) |
+   | A4 | 0.02 | 0.05 | combined — only after A1–A3 interpretable |
+   KL-to-altered-init coef `kl_beta=0.02`, adaptive to target 0.01–0.03
+   nats/token; hard-stop/LR-cut if KL > ~0.08 or personality probes drift sharply.
+   Sweeps: `alpha_sdpo ∈ {0, 0.02, 0.05}`, `beta_replay ∈ {0, 0.05, 0.10}`.
+3. **LMA-repo runner scaffold (written to LMA, NOT this repo) — DEFERRED to an
+   explicit go:** `composer_replication_runs/{moral_scenarios_replay,train_grpo,
+   eval_post_rl}.py`. Built + unit-tested with mocks here; **not executed against
+   the real LMA budget/checkpoints without explicit user approval** (it spends
+   grant money on 8B runs).
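The ladder table can be expressed as data so a runner can sweep the arms with identical seeds and prompts. The sketch below is illustrative: the `ArmConfig` dataclass and its field names are assumptions, and only the α/β/`kl_beta` values and purposes are taken from the table.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ArmConfig:
    name: str
    rl_enabled: bool
    alpha_sdpo: Optional[float]   # None means the arm runs no RL at all
    beta_replay: Optional[float]
    kl_beta: Optional[float]      # KL-to-altered-init coefficient
    purpose: str


def channel_ladder_configs(kl_beta: float = 0.02) -> Dict[str, ArmConfig]:
    """Return the A0-A4 arms keyed by name."""
    return {
        "A0": ArmConfig("A0", False, None, None, None,
                        "altered SFT, no RL (control)"),
        "A1": ArmConfig("A1", True, 0.0, 0.0, kl_beta,
                        "GRPO-only baseline"),
        "A2": ArmConfig("A2", True, 0.02, 0.0, kl_beta,
                        "+SDPO small (amplification probe)"),
        "A3": ArmConfig("A3", True, 0.0, 0.05, kl_beta,
                        "+replay-DPO small (washout probe)"),
        "A4": ArmConfig("A4", True, 0.02, 0.05, kl_beta,
                        "combined, only after A1-A3 are interpretable"),
    }


ladder = channel_ladder_configs()
```

Keeping the arms in one frozen structure makes the "A1 has both channels off, A2 SDPO-only, A3 replay-only" assertions one-liners.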
+### Consequences
+- **Positive**: the integration is interpretable by construction; the
+  amplification hypothesis becomes testable (A2 vs A1).
+- **Positive**: framework stays generic (adapter is MMLU-format-generic, not
+  LMA-coupled); reusable for any "RL on an altered model" study.
+- **Negative**: the original tie-in doc's combined-first plan is superseded;
+  `docs/ALTERED_MINDS_TIE_IN.md` Phase-3 hyperparameters are updated to point here.
+- **Neutral**: the real 8B run stays user-gated (budget); this ADR ships the
+  *capability*, proven on a small CPU/Modal model, not the LMA result.
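A minimal sketch of the answer-only reward follows, under stated assumptions: options are the letters A-D, "multiple answers" means conflicting distinct letters, and the length penalty is linear in whitespace-separated words past a cap. The function names, the cap, and the penalty rate are hypothetical; option-order randomization is left out.

```python
import json
import re
from typing import List

# Matches a line consisting of "Answer: X" with X in A-D.
ANSWER_RE = re.compile(r"^\s*Answer:\s*([A-D])\s*$", re.MULTILINE)
# Matches flat JSON objects such as {"answer": "B"}.
JSON_RE = re.compile(r"\{[^{}]*\}")


def extract_answers(text: str) -> List[str]:
    """Collect every structured answer found in the completion."""
    found = ANSWER_RE.findall(text)
    for match in JSON_RE.finditer(text):
        try:
            obj = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue
        ans = obj.get("answer") if isinstance(obj, dict) else None
        if isinstance(ans, str) and ans.strip().upper() in {"A", "B", "C", "D"}:
            found.append(ans.strip().upper())
    return found


def mmlu_format_reward(text: str, gold: str, rationale_cap: int = 200,
                       penalty_per_word: float = 0.001) -> float:
    """Score only the final structured answer; the rationale text is never graded."""
    answers = extract_answers(text)
    distinct = set(answers)
    if not distinct:
        base = -0.2          # unparseable
    elif len(distinct) > 1:
        base = -0.1          # conflicting answers
    else:
        base = 1.0 if answers[0] == gold else 0.0
    extra_words = max(0, len(text.split()) - rationale_cap)
    return base - penalty_per_word * extra_words


r_correct = mmlu_format_reward("Harm is intended.\nAnswer: B", "B")
r_wrong = mmlu_format_reward("Answer: C", "B")
r_unparseable = mmlu_format_reward("I think it is B", "B")
r_multiple = mmlu_format_reward("Answer: A\nAnswer: C", "A")
r_json = mmlu_format_reward('{"answer": "D"}', "D")
```

Logging the distribution of extracted letters alongside the reward is what would expose an "always C" policy; the reward function itself does not detect it.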
+## Acceptance gate
+- [ ] `MMLUFormatReward` implemented + tested: correct→+1, wrong→0,
+  unparseable→−0.2, multiple-answers→−0.1, length-penalty; an "always C" /
+  option-prior exploit is detectable via logged option distribution. Rationale
+  style is NOT scored.
+- [ ] `dual_kl_logger` logs both KLs; unit test on a toy policy/ref pair asserts
+  KL(p‖p)==0 and KL increases as the policy moves.
+- [ ] `channel_ladder_configs()` returns A0–A4 with the documented α/β/kl_beta;
+  unit test asserts A1 has both channels off, A2 SDPO-only, A3 replay-only.
+- [ ] LMA runner scaffold exists with mock-driven unit tests (no real model load,
+  no Modal, no budget spend) proving the wiring: altered-ckpt → collator →
+  ComposerReplicationTrainer(A2 config) → reward_fn → step.
+- [ ] `docs/ALTERED_MINDS_TIE_IN.md` updated: Phase-3 hyperparameters replaced by
+  a pointer to this ADR's ladder; the amplification-risk finding documented.
+- [ ] **Out of scope (user-gated):** any real LMA checkpoint load or Modal/budget
+  spend. Documented as the explicit go-decision.
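The dual-KL diagnostic can be illustrated on toy categorical distributions. This is a sketch of the measurement only, assuming exact distributions are available; a real logger would estimate the KLs from token logprobs, and the function and key names here are hypothetical.

```python
import math
from typing import Dict, Sequence


def categorical_kl(p: Sequence[float], q: Sequence[float]) -> float:
    """Exact KL(p || q) for two categorical distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def dual_kl_log(policy: Sequence[float],
                altered_init: Sequence[float],
                unaltered_base: Sequence[float]) -> Dict[str, float]:
    """Report both diagnostics; neither is used as an optimization target."""
    return {
        "kl_to_altered_init": categorical_kl(policy, altered_init),
        "kl_to_unaltered_base": categorical_kl(policy, unaltered_base),
    }


altered_init = [0.7, 0.3]
unaltered_base = [0.5, 0.5]

step0 = dual_kl_log([0.7, 0.3], altered_init, unaltered_base)  # policy == altered init
step1 = dual_kl_log([0.6, 0.4], altered_init, unaltered_base)  # drifting toward base
```

Reading the two curves together is the point: KL to the altered init rising while KL to the unaltered base falls suggests washout, whereas both rising suggests drift somewhere else entirely.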
+## More Information
+- `/tmp/composer-research/r2-altered-model-rl.md` — the soundness critique.
+- `docs/ALTERED_MINDS_TIE_IN.md` — original tie-in (Phase-3 hyperparams superseded).
+- `~/wiki/projects/llm-mental-alterations.md` — LMA wave status, H-7 result, budget.

docs/adrs/README.md CHANGED

@@ -12,5 +12,8 @@
 | [ADR-008](ADR-008-drgrpo-sdpo-live-channel.md) | Target Dr. GRPO + host live SDPO channel in TRL trainer | accepted | 2026-05-29 |
 | [ADR-009](ADR-009-layered-hint-generator.md) | Layered HintGenerator for SDPO textual feedback | accepted | 2026-05-29 |
 | [ADR-010](ADR-010-feature-deletion-datagen.md) | FeatureDeletionEnv synthetic-data subsystem over OSS SWE substrates | accepted | 2026-05-29 |
+| [ADR-011](ADR-011-sdpo-alignment-indices.md) | Collator-emitted SDPO alignment indices (close strict-guard regression) | accepted (amends ADR-008) | 2026-05-29 |
+| [ADR-012](ADR-012-close-review-findings.md) | Close open cross-family-review findings (KL/hint-routing/provenance/curriculum) | accepted (amends 008/009/010) | 2026-05-29 |
+| [ADR-013](ADR-013-lma-integration-channel-ladder.md) | LMA integration — isolated-channel ladder (supersedes tie-in Phase-3 hyperparams) | accepted | 2026-05-29 |
 Sorted by number ascending. ADRs are immutable after `accepted`; supersede or amend rather than edit.

research/11-sdpo-alignment-indices.md ADDED

	@@ -0,0 +1,232 @@

+# SDPO Alignment Indices: Canonical Collator Design
+**Status**: Recommendation (Option B with ragged-K safety)
+**Date**: 2026-05-29
+**Context**: Cross-family review found the SDPO loss alignment guard was shape-only; the fix (commit 2026-05-29) now **requires** `student_response_idx` / `teacher_response_idx` LongTensors from the collator. The production collator (`composer_replication/trainer/data_collator.py`) does not yet emit them, so strict SDPO raises.
+---
+## 1. What the loss now demands
+`ComposerReplicationTrainer._compute_sdpo_loss` (lines 184–242 of `composer_trainer.py`) does:
+```python
+s_idx = inputs.get("student_response_idx")   # (B, K) LongTensor
+t_idx = inputs.get("teacher_response_idx")   # (B, K) LongTensor
+# … guard: raise if strict + missing …
+vocab = student_logits.size(-1)
+s_gather = s_idx.unsqueeze(-1).expand(-1, -1, vocab)  # (B, K, V)
+t_gather = t_idx.unsqueeze(-1).expand(-1, -1, vocab)
+student_aligned = torch.gather(student_logits, 1, s_gather)  # (B, K, V)
+teacher_aligned = torch.gather(teacher_logits, 1, t_gather)  # (B, K, V)
+# → generalized_jsd_loss(student_aligned, teacher_aligned, …)
+```
+It expects per-row indices into the sequence dimension (dim=1) selecting **K** aligned post-hint response-token positions. K is the number of error-turn response tokens in that row.
+---
+## 2. Option A vs Option B — Recommendation
+### Option A: derive indices from the existing equal-length mask
+Since the collator *already* builds same-length student/teacher sequences via `_build_aligned_student_for_sdpo` (placeholder system-message of identical token length at the hint-slot), both `sdpo_loss_mask` (teacher-side) and `response_mask` (student-side) mark the exact same token positions. We could simply do:
+```python
+student_response_idx = torch.nonzero(response_mask == 1, as_tuple=False)  # (N, 2) [row, position] pairs
+teacher_response_idx = torch.nonzero(sdpo_loss_mask == 1, as_tuple=False) # identical under the placeholder trick
+```
+**Pros**: zero new collator complexity; the mask is already correct.
+**Cons**: (a) it couples the loss to an *implementation detail* (the placeholder trick): if the collator ever drops same-length alignment, all rows silently break; (b) the mask selects a *subset* of the response tokens (only assistant content), while the indices could select *all* post-hint tokens including chat-template scaffolding; (c) when the student and teacher indices are identical, `torch.gather` selects exactly the positions the mask already marks, so it spends memory on a `(B, K, V)` copy without adding alignment information, and the loss could take a mask path instead.
+**Verdict**: Reject. The mask path should remain as a *fallback* inside `generalized_jsd_loss` when the collator hasn't been upgraded, but the canonical emission is distinct indices.
+### Option B (RECOMMENDED): emit distinct indices unconditionally
+The collator computes both `student_response_idx` and `teacher_response_idx` *explicitly* during `_build_aligned_student_for_sdpo` / `_build_sdpo_fields`. Even when sequences are same-length, emitting the indices:
+- Proves to the loss that alignment was *deliberately* solved (not accidentally same-length)
+- Survives a future where the placeholder trick is replaced by dynamic padding
+- Allows `teacher_response_idx` to differ from `student_response_idx` in any future generalization (e.g., the teacher omits some non-content tokens)
+**This is the canonical design.** The remainder of this document specifies the exact tensor construction.
+---
+## 3. Design: constructing the indices from the existing mask
+The collator already computes:
+- `sdpo_loss_mask`: (B, T) with 1 at teacher post-hint error-turn content tokens, `ignore_index` (-100) elsewhere.
+- `response_mask`: (B, T) with 1 at student assistant-content tokens (incl. post-hint), 0 elsewhere.
+Because the collator's `_build_chat_aligned_mask` uses per-message `apply_chat_template` prefix deltas to place mask bits *exactly* on content tokens regardless of scaffolding, the 1-positions in both masks correspond to the **same logical token** in an aligned comparison.
+### Step 1: per-row nonzero positions
+```python
+def _build_response_indices(
+    mask: torch.Tensor,       # (B, T), 1=response, 0=ignore
+    pad_sentinel: int = -1,
+) -> tuple[torch.Tensor, torch.Tensor]:
+    """Convert a per-row response mask to padded index tensors.
+    Returns:
+        idx:   (B, K_max) LongTensor — position indices, padded with sentinel
+        valid: (B, K_max) BoolTensor  — True where idx is a real position
+    """
+    B, T = mask.shape
+    rows = []
+    for b in range(B):
+        pos = torch.nonzero(mask[b] == 1, as_tuple=True)[0]  # (K_b,)
+        rows.append(pos)
+    K_max = max(r.numel() for r in rows) if rows else 0
+    if K_max == 0:
+        # No error sites in this batch — return empty sentinels
+        return (
+            torch.full((B, 0), pad_sentinel, dtype=torch.long, device=mask.device),
+            torch.zeros(B, 0, dtype=torch.bool, device=mask.device),
+        )
+    idx = torch.full((B, K_max), pad_sentinel, dtype=torch.long, device=mask.device)
+    valid = torch.zeros(B, K_max, dtype=torch.bool, device=mask.device)
+    for b, row in enumerate(rows):
+        k = row.numel()
+        idx[b, :k] = row
+        valid[b, :k] = True
+    return idx, valid
+```
+### Step 2: unified emission in the collator's `__call__`
+Inside `ComposerDataCollator.__call__` (after the `_build_aligned_student_for_sdpo` block, around line 191), add:
+```python
+# --- Emit SDPO alignment indices (ADR-008 gate) ---
+if "sdpo_loss_mask" in out and "response_mask" in out:
+    # Teacher-side: where does sdpo_loss_mask == 1?
+    # Note: sdpo_loss_mask uses ignore_index (-100) for non-loss tokens.
+    # We want positions where the value is exactly 1 (the in-loss marker).
+    t_mask = (out["sdpo_loss_mask"] == 1)  # (B, T)
+    t_idx, t_valid = _build_response_indices(t_mask)
+    # Student-side: where does response_mask == 1?
+    # response_mask is 0/1; 1 means assistant-response token.
+    s_mask = (out["response_mask"] == 1)    # (B, T)
+    s_idx, s_valid = _build_response_indices(s_mask)
+    # When sequences are same-length and aligned by the placeholder trick,
+    # s_idx will equal t_idx for every valid position. The loss can
+    # optionally assert this in debug mode, but the canonical contract
+    # is that the two index tensors describe the alignment, and they
+    # MAY differ in future collator versions.
+    out["student_response_idx"] = s_idx
+    out["teacher_response_idx"] = t_idx
+    out["student_response_valid"] = s_valid   # (B, K_max)
+    out["teacher_response_valid"] = t_valid   # (B, K_max); same K_max as s_valid only if both masks mark the same count per row
+```
+### Step 3: sentinel handling in the loss
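+The loss currently does `torch.gather` unconditionally. `torch.gather` does not wrap negative indices: a sentinel of -1 is out of range for the sequence dimension and raises (on CUDA it trips a device-side assert). The loss should therefore clamp sentinel indices to an in-range position (e.g. `idx.clamp(min=0)`) before building the gather index, then mask those positions out of the JSD. Update `_compute_sdpo_loss`: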
+```python
+# After gather (indices must be clamped to >= 0 first; torch.gather rejects -1):
+student_aligned = torch.gather(student_logits, 1, s_gather)  # (B, K, V)
+teacher_aligned = torch.gather(teacher_logits, 1, t_gather)  # (B, K, V)
+# Build a (B, K) mask: True where BOTH indices are valid (not sentinel).
+# When the collator guarantees s_valid == t_valid, use either.
+if "student_response_valid" in inputs:
+    aligned_mask = inputs["student_response_valid"]  # (B, K), already BoolTensor
+else:
+    aligned_mask = (s_idx >= 0) & (t_idx >= 0)  # sentinel=-1 guard
+# Pass this as the labels mask to generalized_jsd_loss.
+# The loss already handles labels != -100 masking; we repurpose it:
+#   labels[b, k] = 1  if aligned_mask[b, k] else -100
+aligned_labels = torch.where(
+    aligned_mask,
+    torch.ones_like(s_idx, dtype=torch.long),
+    torch.full_like(s_idx, -100, dtype=torch.long),
+)
+return generalized_jsd_loss(
+    student_logits=student_aligned,
+    teacher_logits=teacher_aligned,
+    labels=aligned_labels,
+    …
+)
+```
+---
+## 4. Why this is canonical
+| Property | How this design provides it |
+|---|---|
+| **Token-level alignment** | Indices are derived from `_build_chat_aligned_mask`, which uses per-message prefix deltas to locate content tokens inside the full chat-template tokenization — not naive segment concatenation. |
+| **Ragged-K safety** | Pad to `K_max` with sentinel -1; emit a `*_valid` BoolTensor. The loss masks sentinels via `labels=-100` (standard HF ignore convention). No silent padding-token contribution. |
+| **No additional forward passes** | The indices are computed from existing mask tensors inside `__call__` — zero extra tokenizer calls. |
+| **Forward-compatible** | Emitting distinct student/teacher indices survives a future where the placeholder trick is replaced by dynamic-length alignment. |
+| **Auditable** | A one-line assertion `(s_idx == t_idx).all()` in a debug build verifies the placeholder trick is still intact. |
+---
+## 5. Code sketch (full emission path)
+```python
+# === In ComposerDataCollator.__call__, after line 191 ===
+def _mask_to_padded_indices(
+    mask: torch.Tensor,          # (B, T) where 1 = valid position
+    pad_sentinel: int = -1,
+) -> tuple[torch.Tensor, torch.Tensor]:
+    """Convert (B,T) boolean mask → (B,K_max) index tensor + (B,K_max) validity mask."""
+    B, T = mask.shape
+    # Per-row nonzero — torch.nonzero on a 2D bool tensor gives (N,2); reshape.
+    nz = torch.nonzero(mask, as_tuple=False)  # (total_K, 2)
+    # Group by row:
+    counts = mask.sum(dim=1).long()           # (B,) — K per row
+    K_max = int(counts.max().item()) if counts.numel() else 0
+    if K_max == 0:
+        return (
+            torch.full((B, 0), pad_sentinel, dtype=torch.long, device=mask.device),
+            torch.zeros(B, 0, dtype=torch.bool, device=mask.device),
+        )
+    idx = torch.full((B, K_max), pad_sentinel, dtype=torch.long, device=mask.device)
+    valid = torch.zeros(B, K_max, dtype=torch.bool, device=mask.device)
+    # nz[:, 0] are batch indices, nz[:, 1] are position indices
+    batch_idx = nz[:, 0]  # (total_K,)
+    pos_idx = nz[:, 1]    # (total_K,)
+    # Build a per-batch write offset using cumsum
+    offsets = torch.zeros(B + 1, dtype=torch.long, device=mask.device)
+    offsets[1:] = counts.cumsum(dim=0)
+    for b in range(B):
+        start, end = offsets[b].item(), offsets[b + 1].item()
+        k = end - start
+        if k > 0:
+            idx[b, :k] = pos_idx[start:end]
+            valid[b, :k] = True
+    return idx, valid
+# --- Emission ---
+if "sdpo_loss_mask" in out and "response_mask" in out:
+    t_mask = (out["sdpo_loss_mask"] == 1)
+    s_mask = (out["response_mask"] == 1)
+    t_idx, t_valid = _mask_to_padded_indices(t_mask)
+    s_idx, s_valid = _mask_to_padded_indices(s_mask)
+    out["student_response_idx"] = s_idx
+    out["teacher_response_idx"] = t_idx
+    out["student_response_valid"] = s_valid
+    out["teacher_response_valid"] = t_valid
+```
+---
+## 6. Migration path
+1. Add `_mask_to_padded_indices` and the emission block to `data_collator.py`.
+2. The existing `_compute_sdpo_loss` in `composer_trainer.py` already handles the index path (lines 184–242); it only needs the sentinel-mask addition described in §3 Step 3.
+3. Update `_compute_sdpo_loss` to prefer the aligned index path even when shapes match — remove the legacy shape-only fallback from the strict path entirely.
+4. The non-strict path (`strict_sdpo_alignment=False`) can fall back to `torch.nonzero(sdpo_loss_mask == 1)` as a convenience for ad-hoc scripts, but the canonical production path is the explicit indices.

research/12-altered-model-rl-critique.md ADDED

	@@ -0,0 +1,66 @@

+# Critique: SDPO + Dr.GRPO + trace-replay-DPO on a personality-altered model
+## Bottom line
+The proposed three-channel stage is scientifically interesting but unsafe as a first *interpretable* run. Applied to an LMA personality-altered SFT model, the combined recipe confounds at least four effects: task RL, self-distillation of altered reasoning, frontier-teacher imitation, and KL anchoring. If capability or moral-scenarios behavior changes, the result will not identify whether RL washed out, preserved, or amplified the alteration.
+## 1. SDPO against the altered model's own hint-conditioned passes
+SDPO with the altered model as its own hint-conditioned teacher is only sound if the hints expose latent correct reasoning that the base forward pass underuses. On an altered model, that assumption is fragile. The hint-conditioned pass may instead expose the same distorted policy under a more verbose or more confident trajectory. Then SDPO becomes a consistency regularizer around an already-biased teacher, not an improvement signal.
+Main risks:
+- **Degenerate fixed point:** teacher and student are same-family, same checkpoint, and differ only by prompting. If hints do not add independent information, the optimum is to imitate the altered model's own conditional distribution.
+- **Amplification:** if the altered model has class-specific moral-scenarios distortion, hinting may increase rationalization of the distorted answer. SDPO can convert a soft bias into a sharper preference.
+- **Mode collapse / reduced diversity:** minimizing divergence to the model's own hinted output rewards low-entropy agreement. Combined with answer-only GRPO, this can produce brittle letter-pattern policies.
+- **False preservation:** staying close to the altered hint teacher may look like preserving the personality alteration while actually preserving only task-format artifacts.
+SDPO is therefore not a benign auxiliary loss here; it is the channel most likely to test the amplification hypothesis. Treat it as an experimental intervention, not a default stabilizer.
+## 2. Hyperparameters and attribution
+Do **not** start with alpha_sdpo=0.2 and beta_replay=0.4 in a combined run. Those weights are large enough that no observed outcome could be attributed to a single channel. The first real run should isolate channels:
+1. **GRPO-only baseline:** alpha_sdpo=0, beta_replay=0.
+2. **SDPO-only add-on:** alpha_sdpo small, beta_replay=0.
+3. **Replay-DPO-only add-on:** alpha_sdpo=0, beta_replay small.
+4. **Combined only after the above:** use the smallest weights that showed useful signal without personality drift or distortion amplification.
+Safe initial defaults, assuming loss terms are normalized to comparable token/batch scales:
+- `alpha_sdpo`: **0.02** for first SDPO run; sweep `{0.0, 0.02, 0.05}`. Avoid 0.2 until there is evidence no amplification occurs.
+- `beta_replay`: **0.05** for first replay run; sweep `{0.0, 0.05, 0.10}`. Avoid 0.4 initially because frontier teachers may wash out the alteration and dominate GRPO.
+- Combined pilot: `alpha_sdpo=0.02`, `beta_replay=0.05`, only after isolated runs.
+- KL coefficient to frozen altered SFT init: `kl_beta=0.02` initially, adapted to hold roughly **0.01–0.03 nats/token** on answer+reasoning tokens. Hard-stop or reduce LR if KL exceeds ~0.08 nats/token or if personality probes drift sharply.
+- Also log KL to the original unaltered base/SFT. Do not optimize this KL unless the goal is explicitly de-alteration; use it to measure washout.
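The adaptive KL rule above could look like the sketch below. The band edges, the initial coefficient, and the ~0.08 ceiling come from the text; the multiplicative `step` and the class name `KLController` are assumptions made for illustration, and the caller may choose to reduce the learning rate instead of stopping.

```python
from dataclasses import dataclass


@dataclass
class KLController:
    kl_beta: float = 0.02      # initial coefficient from the text
    target_low: float = 0.01   # nats/token, lower edge of the target band
    target_high: float = 0.03  # nats/token, upper edge of the target band
    hard_stop: float = 0.08    # nats/token, approximate ceiling from the text
    step: float = 1.5          # hypothetical multiplicative adjustment factor

    def update(self, measured_kl: float) -> str:
        """Adjust kl_beta from a measured per-token KL and report the action."""
        if measured_kl > self.hard_stop:
            return "stop"              # caller halts or reduces LR
        if measured_kl > self.target_high:
            self.kl_beta *= self.step  # drifting too far: anchor harder
            return "raise"
        if measured_kl < self.target_low:
            self.kl_beta /= self.step  # over-anchored: loosen
            return "lower"
        return "hold"
```

The measured value here is the KL to the frozen altered SFT init. The KL to the unaltered base is only logged and never fed to the controller.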
+## 3. Reward design to avoid MMLU letter hacking
+A letter-correctness reward is too hackable unless the interface is constrained.
+Recommended reward:
+- Require structured output: `{"answer":"A|B|C|D","rationale":"..."}` or a single final `Answer: X` after reasoning.
+- Parse only the final answer field; reject multiple final answers.
+- Reward: `+1` correct, `0` incorrect, `-0.2` invalid/unparseable, `-0.1` multiple answers, and a small length penalty once the rationale exceeds a length cap.
+- Randomize option order per epoch and track original label mapping.
+- Balance batches across subject and moral-scenarios classes, especially the collapsed class-3 subset.
+- Hold out a moral-scenarios diagnostic split that is never used for reward.
+- Add calibration metrics: entropy over options, answer distribution, invalid rate, rationale length, and per-class accuracy. A model that learns “always C” or exploits option priors should be obvious.
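Option-order randomization with label tracking, from the list above, might be sketched like this. It assumes four-option items and an integer seed (for example derived from the epoch and example id); the function names are hypothetical.

```python
import random
from typing import List, Tuple

LETTERS = "ABCD"


def shuffle_options(options: List[str], gold: str, seed: int) -> Tuple[List[str], str, List[int]]:
    """Permute the options and return (shuffled options, new gold letter, perm).

    perm[i] is the original index of the option now shown at position i.
    """
    rng = random.Random(seed)
    perm = list(range(len(options)))
    rng.shuffle(perm)
    shuffled = [options[i] for i in perm]
    new_gold = LETTERS[perm.index(LETTERS.index(gold))]
    return shuffled, new_gold, perm


def to_original_letter(answer: str, perm: List[int]) -> str:
    """Map a letter chosen under the shuffled order back to the original label."""
    return LETTERS[perm[LETTERS.index(answer)]]
```

Logging answers through `to_original_letter` keeps the answer-distribution metric comparable across epochs, so a position prior ("always C") shows up separately from a content preference.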
+Do not reward chain-of-thought style itself. If rationales are used, score only the final answer and, optionally, format validity; otherwise the model can learn persuasive distorted rationalizations.
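A minimal sketch of the recommended reward, assuming the `Answer: X` variant of the format. The reward values are those listed above; `rationale_cap`, `length_penalty`, and the whitespace word count used as a length proxy are placeholder assumptions, not tuned settings.

```python
import re

VALID = {"A", "B", "C", "D"}
# One final-answer line: "Answer: X" alone on its line.
FINAL_RE = re.compile(r"^Answer:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def letter_reward(completion: str, gold: str,
                  rationale_cap: int = 200,
                  length_penalty: float = 0.001) -> float:
    """Score only the final answer field; the rationale text is never scored."""
    matches = FINAL_RE.findall(completion)
    if len(matches) > 1:
        return -0.1  # multiple final answers
    if not matches or matches[0] not in VALID:
        return -0.2  # invalid or unparseable
    reward = 1.0 if matches[0] == gold else 0.0
    # Penalize whitespace-delimited words beyond the cap (counted over the whole completion).
    extra = max(0, len(completion.split()) - rationale_cap)
    return reward - length_penalty * extra
```

The parser is deliberately strict: `Answer: B.` or a lowercase letter counts as invalid, which keeps format drift visible in the invalid rate.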
+## 4. Cheapest experiment that distinguishes washout / preserve / amplify
+Use one altered checkpoint, one matched unaltered checkpoint, and a fixed evaluation harness. Run short, equal-token pilots with identical prompts/seeds:
+- A0: altered SFT, no RL.
+- A1: GRPO-only.
+- A2: GRPO + SDPO small.
+- A3: GRPO + replay-DPO small.
+- A4: combined small, only if A1-A3 are interpretable.
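The arm list above can be written down as a small run matrix, which makes it easy to check that each add-on arm changes exactly one channel relative to the GRPO-only baseline. The "small" weights reuse the initial defaults from §2; `PilotArm`, `pilot_arms`, and `changed_channels` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PilotArm:
    name: str
    rl: bool            # False means evaluate the checkpoint without RL
    alpha_sdpo: float
    beta_replay: float


def pilot_arms(alpha_small: float = 0.02, beta_small: float = 0.05) -> List[PilotArm]:
    """Arms A0-A4 with the small weights taken from the section 2 defaults."""
    return [
        PilotArm("A0", rl=False, alpha_sdpo=0.0, beta_replay=0.0),
        PilotArm("A1", rl=True, alpha_sdpo=0.0, beta_replay=0.0),
        PilotArm("A2", rl=True, alpha_sdpo=alpha_small, beta_replay=0.0),
        PilotArm("A3", rl=True, alpha_sdpo=0.0, beta_replay=beta_small),
        PilotArm("A4", rl=True, alpha_sdpo=alpha_small, beta_replay=beta_small),
    ]


def changed_channels(arm: PilotArm, baseline: PilotArm) -> List[str]:
    """Names of the loss weights that differ from the baseline arm."""
    changed = []
    if arm.alpha_sdpo != baseline.alpha_sdpo:
        changed.append("alpha_sdpo")
    if arm.beta_replay != baseline.beta_replay:
        changed.append("beta_replay")
    return changed
```

An outcome in A2 or A3 can be attributed to its single changed channel; A4 changes both, which is why it only runs once A1 through A3 are interpretable.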
+Evaluate before/after on: general MMLU, moral_scenarios by class, LMA personality/psychiatric probes, KL to altered init, KL to unaltered base, option distribution, and refusal/format invalid rates.
+Interpretation:
+- **Washout:** capability improves while personality markers and altered-specific moral signature move toward unaltered baseline.
+- **Preservation:** capability improves with stable personality markers and stable moral-scenarios signature.
+- **Amplification:** moral class-3 or cognitive-distortion probes worsen, confidence/entropy sharpens, or SDPO run diverges more than GRPO-only.
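The three interpretations can be turned into a rough decision rule, sketched below. The field definitions and the tolerance `tol` are illustrative assumptions, not calibrated thresholds, and the rule covers only the probe-based part of the amplification criterion (it ignores entropy sharpening and the SDPO-versus-GRPO comparison).

```python
from dataclasses import dataclass


@dataclass
class PilotDelta:
    capability: float     # change in general MMLU accuracy (after minus before)
    signature_gap: float  # change in distance to the unaltered baseline; negative = moved toward it
    distortion: float     # change in class-3 / distortion probe error; positive = worse


def classify(delta: PilotDelta, tol: float = 0.02) -> str:
    """Label one pilot arm; tol is a placeholder noise tolerance."""
    if delta.distortion > tol:
        return "amplification"
    if delta.capability > 0 and delta.signature_gap < -tol:
        return "washout"
    if delta.capability > 0 and abs(delta.signature_gap) <= tol:
        return "preservation"
    return "inconclusive"
```

Anything labeled `inconclusive` should be read from the raw metrics and not forced into one of the three buckets.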
+The proposed combined recipe (alpha_sdpo=0.2, beta_replay=0.4) should be considered a later stress test, not a first run.