Codeseys committed on
Commit
b4a584a
·
1 Parent(s): 185cce2

architect: ADR-011/012/013 + research (alignment-index fix, review-findings closure, LMA channel-ladder)

docs/adrs/ADR-011-sdpo-alignment-indices.md ADDED
@@ -0,0 +1,102 @@
+ ---
+ status: accepted
+ date: 2026-05-29
+ amends: ADR-008
+ deciders: [Codeseys, ARIA]
+ ---
+
+ # ADR-011: Collator-emitted SDPO alignment indices (close the strict-guard regression)
+
+ ## Context and Problem Statement
+
+ The 2026-05-29 cross-family review of ADR-008 found the SDPO student/teacher
+ alignment guard was a *shape-only* check (`student_logits.shape ==
+ teacher_logits.shape`), which does not establish token-level alignment because
+ the teacher context has a hint inserted at the error turn (shifting response
+ tokens right). The fix made `ComposerReplicationTrainer._compute_sdpo_loss`
+ **require** explicit `student_response_idx` / `teacher_response_idx` LongTensors
+ and `torch.gather` the aligned post-hint logits before JSD, raising in strict
+ mode (the default) when they are absent.
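
Why a shape-only guard is insufficient can be seen in a toy, framework-free sketch. Strings stand in for token ids, and every name and value below is an illustrative assumption, not data from the real collator: once both sequences are padded to the same length the shapes agree, yet the response tokens sit at different positions.

```python
# Toy sequences; strings stand in for token ids (all values are illustrative).
PAD = "<pad>"
student = ["<user>", "fix", "it", PAD, PAD]              # no hint, padded to length 5
teacher = ["<user>", "<hint>", "<hint>", "fix", "it"]    # hint inserted, response shifted right

# The shape-only guard passes although most positions hold different tokens.
same_shape = len(student) == len(teacher)
positionwise_match = [s == t for s, t in zip(student, teacher)]

# Explicit per-side indices select the same logical response tokens.
student_response_idx = [1, 2]
teacher_response_idx = [3, 4]
student_aligned = [student[i] for i in student_response_idx]
teacher_aligned = [teacher[i] for i in teacher_response_idx]

print(same_shape)                          # shapes agree
print(positionwise_match)                  # but only the first position matches
print(student_aligned == teacher_aligned)  # indexed selection is aligned
```

Comparing logits position by position would therefore pair response tokens with hint or padding tokens; selecting through per-side indices pairs like with like.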
+
+ **Regression introduced:** the production collator
+ (`composer_replication/trainer/data_collator.py`) does NOT emit those index
+ tensors, so the default (strict) SDPO path now raises against the real collator.
+ This ADR closes that gap.
+
+ The collator already solves the hard alignment problem: `_build_aligned_student_for_sdpo`
+ builds a student sequence that mirrors the hint-conditioned teacher by inserting
+ a placeholder system-message of identical token length where the teacher has the
+ hint, and `_build_chat_aligned_mask` (per-message `apply_chat_template`
+ prefix-delta subsequence matching) marks the post-hint recovery-turn content
+ tokens with 1 in both `sdpo_loss_mask` (teacher) and `response_mask` (student).
+ So the 1-positions of both masks already correspond to the same logical tokens.
+
+ Research: `/tmp/composer-research/r1-alignment-indices.md` (DeepSeek V4 Pro,
+ 2026-05-29).
+
+ ## Decision Drivers
+
+ - The loss already requires the indices; the collator must supply them or strict
+   SDPO is unusable (it raises).
+ - The indices are *derivable from masks the collator already computes correctly*,
+   with no extra tokenizer calls and no new alignment logic.
+ - The contract must be forward-compatible: if the placeholder trick is ever
+   dropped for dynamic-length alignment, distinct student/teacher indices still
+   describe the alignment.
+ - Ragged K (different rows have different numbers of error-turn tokens) must be
+   handled without silent padding-token contributions to the JSD.
+
+ ## Considered Options
+
+ - **A. Derive indices on-the-fly inside the loss from `sdpo_loss_mask`**: couples
+   the loss to the collator's placeholder implementation detail; rejected.
+ - **B. Collator emits explicit `student/teacher_response_idx` + `*_valid` masks,
+   derived from the existing chat-aligned masks** (chosen).
+ - **C. Drop the index requirement, revert to shape-check**: re-opens the P0 the
+   review caught; rejected.
+
+ ## Decision Outcome
+
+ Chosen: **Option B.** The collator emits four new batch keys when SDPO is active:
+ `student_response_idx` (B, K_max), `teacher_response_idx` (B, K_max),
+ `student_response_valid` (B, K_max bool), `teacher_response_valid` (B, K_max bool).
+ A `_mask_to_padded_indices(mask, pad_sentinel=-1)` helper converts a (B, T)
+ response mask to a padded (B, K_max) index tensor + validity mask (sentinel -1
+ for ragged padding). The loss masks sentinel positions by building an
+ `aligned_labels` tensor (1 where valid, -100 elsewhere) passed to
+ `generalized_jsd_loss` (which already honors the -100 ignore convention).
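
The index construction and sentinel masking described in this decision can be illustrated with a framework-free sketch. Plain Python lists stand in for tensors, and the function names `mask_to_padded_indices` and `aligned_labels` are illustrative assumptions, not the collator's actual helpers. The example batch mirrors the ragged-K case (K=3 and K=1).

```python
from typing import List, Tuple


def mask_to_padded_indices(
    mask: List[List[int]], pad_sentinel: int = -1
) -> Tuple[List[List[int]], List[List[bool]]]:
    """Convert a (B, T) 0/1 mask into padded (B, K_max) indices plus a validity mask."""
    rows = [[t for t, bit in enumerate(row) if bit == 1] for row in mask]
    k_max = max((len(r) for r in rows), default=0)
    idx = [r + [pad_sentinel] * (k_max - len(r)) for r in rows]
    valid = [[True] * len(r) + [False] * (k_max - len(r)) for r in rows]
    return idx, valid


def aligned_labels(valid: List[List[bool]], ignore_index: int = -100) -> List[List[int]]:
    """1 where the index is real, ignore_index where it is ragged padding."""
    return [[1 if v else ignore_index for v in row] for row in valid]


mask = [
    [0, 1, 1, 1, 0],  # row 0: K = 3
    [0, 0, 0, 1, 0],  # row 1: K = 1
]
idx, valid = mask_to_padded_indices(mask)
labels = aligned_labels(valid)
print(idx)     # row 1 is padded with the sentinel
print(valid)   # padding positions are marked invalid
print(labels)  # padding positions carry the ignore label
```

A tensor version would follow the same three steps: per-row positions, padding to the batch maximum, and a label tensor that makes the loss skip the padded slots.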
+
+ ### Consequences
+
+ - **Positive**: strict SDPO works against the real collator; the silent-misalignment
+   P0 stays closed; no extra forward/tokenizer passes.
+ - **Positive**: forward-compatible; distinct indices survive a non-placeholder future.
+ - **Neutral**: a debug-mode assertion `(s_idx == t_idx)[valid].all()` can verify the
+   placeholder trick is still intact when sequences are same-length.
+ - **Negative**: +4 batch keys; documented in the collator output contract.
+
+ ## Acceptance gate (must be green before status flips to accepted)
+
+ - [ ] `_mask_to_padded_indices` implemented; ragged-K rows pad to K_max with
+   sentinel -1 + a `*_valid` bool tensor. Unit test: 2 rows with K=3 and K=1 →
+   (2, 3) idx with row-1 tail = -1 and valid[1] = [T,F,F].
+ - [ ] `ComposerDataCollator.__call__` emits the 4 keys whenever
+   `sdpo_loss_mask` + `response_mask` are present. Unit test asserts presence +
+   shapes + that `student_response_idx == teacher_response_idx` at valid
+   positions for the same-length placeholder path.
+ - [ ] `_compute_sdpo_loss` masks sentinels via `aligned_labels` (1/-100); a
+   sentinel position contributes 0 to the JSD. Unit test: a 2-row batch with
+   ragged K produces a finite loss and the K=1 row's padding doesn't leak.
+ - [ ] End-to-end: real `ComposerDataCollator` (with a stub tokenizer + a hint
+   generator) → batch → `_compute_sdpo_loss` runs in **strict mode** without
+   raising and returns a finite, positive loss. (This is the regression the ADR
+   closes; it must be a test, not a claim.)
+ - [ ] No regression: the existing alignment tests in
+   `test_dr_grpo_config_and_alignment.py` still pass.
+
+ ## More Information
+
+ - `/tmp/composer-research/r1-alignment-indices.md`: full design + code sketch.
+ - ADR-008: the strict-guard fix this ADR completes (amends).
+ - `composer_replication/trainer/data_collator.py` `_build_chat_aligned_mask`,
+   `_build_aligned_student_for_sdpo`.
docs/adrs/ADR-012-close-review-findings.md ADDED
@@ -0,0 +1,107 @@
+ ---
+ status: accepted
+ date: 2026-05-29
+ amends: [ADR-008, ADR-009, ADR-010]
+ deciders: [Codeseys, ARIA]
+ ---
+
+ # ADR-012: Close the open cross-family-review findings (KL estimator, hint routing, AST provenance, curriculum signals)
+
+ ## Context and Problem Statement
+
+ The 2026-05-29 cross-family review left five OPEN follow-ups after the
+ silent-misalignment P0 (ADR-011) was fixed. None corrupt a run, but they are
+ fidelity/correctness gaps that the "accepted" ADRs acknowledged. This ADR
+ bundles the four that are CPU-fixable now (the fifth, Docker substrate e2e, is
+ ADR-010's hardware-blocked gate, tracked separately in ADR-013/B6).
+
+ The findings (verified against code during the review):
+
+ 1. **k1 KL estimator unasserted (ADR-008, 2 reviewers).** `make_dr_grpo_config`
+    claims Composer uses the k1 (`−log r`) KL estimator and that "TRL's native
+    estimator" satisfies this, but nothing configures or verifies it against TRL
+    1.5.0's actual GRPO KL branch.
+ 2. **Hint default-routing eager raw-error (ADR-009, GPT-5.5 P1).** The default
+    composite is template → raw-error → judge; any uncovered *style/communication/
+    effort* site that carries an `error_message` is consumed by the raw-error
+    layer and never reaches the LLM judge, although these are exactly the sites
+    the judge exists to cover. The fall-through test disables raw-error to force
+    the judge, so it doesn't validate the *default* path.
+ 3. **HackMonitor overclaimed as AST-provenance (ADR-010, DeepSeek P0).** It is a
+    substring matcher, defeated by string-concat (`"__py"+"cache__"`). The ADR's
+    §3c calls it an "AST provenance monitor."
+ 4. **Curriculum ignores turns/think-tokens (ADR-010, 2 reviewers).** The Composer
+    2 tech report keys the curriculum on rollout #turns + thinking-token count;
+    the implementation tracks only pass-rate.
+
+ ## Decision Outcome
+
+ Fix all four CPU-fixable findings:
+
+ 1. **k1 KL: assert, don't just claim.** Add a `kl_estimator` check in
+    `make_dr_grpo_config`: probe TRL 1.5.0's GRPO KL branch (the `_compute_kl` /
+    `loss_type` path) and assert it is the k1 (`−log r`) family, OR, if TRL's
+    estimator is k3, document that explicitly and expose a `kl_estimator="k1"`
+    knob that the trainer applies in its own KL term. Add a unit test computing KL
+    for known logprob pairs and asserting the k1 value (`−log r` with
+    `r = π_ref/π_θ`, i.e. `logp − ref_logp` summed), distinguishing it from k3
+    (`exp(Δ) − Δ − 1` with `Δ = ref_logp − logp`). If TRL cannot be forced to k1,
+    the test documents the delta and the ADR-008 claim is narrowed to "k1 in our
+    own KL term; parent GRPO KL is TRL's default."
+
+ 2. **Hint routing: error-kind aware.** Introduce a routing policy in
+    `default_composite`: tool/runtime error kinds may use the raw-error layer;
+    style/communication/effort kinds **skip raw-error and go to the judge**. A
+    `RoutingHintGenerator` (or a `route` predicate on the composite) implements
+    it. Test the *default* composite directly: a style site reaches the judge even
+    when it carries an `error_message`.
+
+ 3. **HackMonitor: either implement AST provenance or re-scope honestly.**
+    Implement a lightweight provenance check: scan the agent's *patch/diff* for
+    reintroduced `deleted_symbols` whose surrounding context indicates a
+    non-implementation path (copied from a cache dump, a decompiler output, a
+    sibling import that smuggles the symbol). Where full AST is overkill, use a
+    structured check: a deleted symbol reappearing verbatim adjacent to a
+    cache/decompiler/file-read action in the trajectory → flag. Keep the substring
+    layer as defense-in-depth. Update ADR-010 §3c language to "signature +
+    patch-provenance monitor" (not "AST"). Test: a string-concat-obfuscated cache
+    read is still flagged via the patch-provenance path (or the ADR honestly
+    documents the residual bypass).
+
+ 4. **Curriculum turn/think-token signals.** Extend `DifficultyCurriculum.update`
+    to accept optional `turns: float` and `think_tokens: float` per exposure,
+    track per-task moving averages, and incorporate them into the difficulty
+    weighting (higher turns/think-tokens at the same pass-rate ⇒ harder ⇒ keep on
+    the frontier longer). Backward-compatible (both optional, default None).
+    Test: two tasks with identical pass-rate but different mean turns weight the
+    higher-turn task ≥ the lower-turn task.
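
For finding #1, the difference between the two estimators can be checked on known log-probability pairs with a small standard-library sketch. It assumes the convention `r = π_ref/π_θ` for tokens sampled from the policy, so that k1 is `logp − ref_logp` and k3 uses `Δ = ref_logp − logp`; the function names are illustrative and nothing here calls TRL.

```python
import math
from typing import Sequence


def k1_estimate(logp: Sequence[float], ref_logp: Sequence[float]) -> float:
    """k1 = -log r with r = pi_ref / pi_theta, summed over tokens."""
    return sum(lp - rlp for lp, rlp in zip(logp, ref_logp))


def k3_estimate(logp: Sequence[float], ref_logp: Sequence[float]) -> float:
    """k3 = exp(delta) - delta - 1 with delta = ref_logp - logp, summed over tokens."""
    total = 0.0
    for lp, rlp in zip(logp, ref_logp):
        delta = rlp - lp
        total += math.exp(delta) - delta - 1.0
    return total


# Two tokens whose log-ratios cancel: k1 sums to zero, k3 stays positive.
logp = [math.log(0.5), math.log(0.25)]
ref_logp = [math.log(0.25), math.log(0.5)]
k1 = k1_estimate(logp, ref_logp)
k3 = k3_estimate(logp, ref_logp)
print(k1, k3)
```

The pair was chosen so the two estimators disagree clearly: per-sample k1 terms can be negative and cancel, while every k3 term is non-negative. A test of this shape would reveal which family a trainer's KL term belongs to.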
+
+ ### Consequences
+
+ - **Positive**: the ADRs' claims now match the code; the review's OPEN list
+   shrinks to just the Docker-gated item.
+ - **Neutral**: the HackMonitor remains heuristic (it always will be); the ADR
+   language is the thing being corrected, plus a real patch-provenance layer.
+ - **Negative**: if TRL 1.5.0's KL genuinely can't be forced to k1, finding #1 is
+   a documentation fix + a knob, not a behavior change; that's an acceptable
+   honest outcome, recorded in the test.
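
The error-kind-aware routing decided for finding #2 could look roughly like the sketch below. The `ErrorSite` dataclass, its field names, the kind strings, and the returned layer labels are all hypothetical stand-ins; only the policy itself (template first, style/communication/effort kinds skip raw-error, tool/runtime kinds may use it) comes from this ADR.

```python
from dataclasses import dataclass
from typing import Optional

# Kinds that must reach the LLM judge even when an error_message is present.
JUDGE_ONLY_KINDS = {"style", "communication", "effort"}


@dataclass
class ErrorSite:
    kind: str
    error_message: Optional[str] = None
    template_hint: Optional[str] = None  # hypothetical: set when a template covers the site


def route(site: ErrorSite) -> str:
    """Return which layer should produce the hint for this site."""
    if site.template_hint is not None:
        return "template"
    if site.kind in JUDGE_ONLY_KINDS:
        return "judge"  # skip raw-error regardless of error_message
    if site.error_message:
        return "raw_error"
    return "judge"


style_site = ErrorSite(kind="style", error_message="line too long")
tool_site = ErrorSite(kind="tool", error_message="exit status 1")
print(route(style_site), route(tool_site))
```

Testing the default policy directly, rather than a variant with raw-error disabled, is what distinguishes this check from the fall-through test criticized in the findings.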
+
+ ## Acceptance gate
+
+ - [ ] k1 KL: unit test computes k1 vs k3 for known logprob pairs; the trainer's
+   effective KL term is asserted k1 (or the delta is documented + a `kl_estimator`
+   knob exists). ADR-008's KL bullet updated to match reality.
+ - [ ] Hint routing: `default_composite` routes style/communication/effort sites
+   to the judge even with an `error_message` present; tested on the DEFAULT
+   composite (not a raw-error-disabled variant). ADR-009 routing note updated.
+ - [ ] HackMonitor: patch-provenance check flags a string-concat-obfuscated cache
+   read OR the residual bypass is honestly documented; ADR-010 §3c re-worded from
+   "AST provenance" to the accurate description.
+ - [ ] Curriculum: `update(turns=, think_tokens=)` optional args + moving averages
+   + weighting; backward-compatible; tested.
+ - [ ] Full suite green, no regressions.
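
The curriculum gate can be made concrete with a minimal sketch of optional `turns` / `think_tokens` signals tracked as moving averages. The class name, the `passed` argument, the momentum, the scale constants, and the weighting formula are all assumptions chosen for illustration; the ADR only fixes the optional arguments, the moving averages, and the ordering property.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TaskStats:
    pass_rate: Optional[float] = None
    turns: Optional[float] = None
    think_tokens: Optional[float] = None


class DifficultyCurriculumSketch:
    def __init__(self, momentum: float = 0.9, turns_scale: float = 10.0,
                 think_scale: float = 1000.0) -> None:
        self.momentum = momentum
        self.turns_scale = turns_scale
        self.think_scale = think_scale
        self.stats: Dict[str, TaskStats] = {}

    def _ema(self, old: Optional[float], new: float) -> float:
        # The first observation seeds the average; later ones are blended in.
        if old is None:
            return new
        return self.momentum * old + (1.0 - self.momentum) * new

    def update(self, task_id: str, passed: bool, turns: Optional[float] = None,
               think_tokens: Optional[float] = None) -> None:
        s = self.stats.setdefault(task_id, TaskStats())
        s.pass_rate = self._ema(s.pass_rate, 1.0 if passed else 0.0)
        if turns is not None:
            s.turns = self._ema(s.turns, turns)
        if think_tokens is not None:
            s.think_tokens = self._ema(s.think_tokens, think_tokens)

    def weight(self, task_id: str) -> float:
        s = self.stats[task_id]
        frontier = 1.0 - 2.0 * abs(s.pass_rate - 0.5)  # assumed: peaks at 50% pass-rate
        effort = 0.0
        if s.turns is not None:
            effort += s.turns / self.turns_scale
        if s.think_tokens is not None:
            effort += s.think_tokens / self.think_scale
        return frontier * (1.0 + effort)  # more effort at equal pass-rate => heavier


curriculum = DifficultyCurriculumSketch()
for passed in (True, False):
    curriculum.update("short_task", passed, turns=4)
    curriculum.update("long_task", passed, turns=12)
    curriculum.update("legacy_task", passed)  # old call style still works
print(curriculum.weight("short_task"), curriculum.weight("long_task"))
```

Because both new arguments default to `None`, callers that only report pass/fail get the pass-rate-only weighting unchanged.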
+
+ ## More Information
+
+ - Cross-family review: `docs/reviews/cross-family-adr-008-009-010-2026-05-29/`.
+ - ADRs 008/009/010 "Post-acceptance cross-family review" sections.
docs/adrs/ADR-013-lma-integration-channel-ladder.md ADDED
@@ -0,0 +1,113 @@
+ ---
+ status: accepted
+ date: 2026-05-29
+ supersedes_section: docs/ALTERED_MINDS_TIE_IN.md §"Concrete plan" Phase 3 hyperparameters
+ deciders: [Codeseys, ARIA]
+ ---
+
+ # ADR-013: llm-mental-alterations (LMA) integration: isolated-channel ladder, not combined recipe
+
+ ## Context and Problem Statement
+
+ The north-star use case for this framework is driving the sister project
+ **llm-mental-alterations** (LMA): take an LMA personality-altered SFT checkpoint
+ (Llama-3.1-8B with depression/anxiety/etc. induction) and apply this framework's
+ 3-channel RL (Dr.GRPO + SDPO + trace-replay-DPO) to ask whether task-driven RL
+ **washes out, preserves, or amplifies** the alteration's cognitive-distortion
+ signature (the multi-seed −31pp MMLU moral_scenarios class-3 collapse).
+
+ `docs/ALTERED_MINDS_TIE_IN.md` (Wave 13) proposed a Phase-3 run with
+ `alpha_sdpo=0.2`, `beta_replay=0.4`, all channels ON simultaneously. A
+ cross-family research critique (GPT-5.5, 2026-05-29,
+ `/tmp/composer-research/r2-altered-model-rl.md`) found this **scientifically
+ uninterpretable**: a combined run confounds four effects (task RL,
+ self-distillation of altered reasoning, frontier-teacher imitation, KL
+ anchoring), so any observed change cannot be attributed. Worse, **SDPO against
+ the altered model's own hint-conditioned forward pass is the channel most likely
+ to AMPLIFY the distortion** (teacher==student-family; if hints don't add
+ independent information, the optimum is to imitate the altered conditional
+ distribution, sharpening a soft bias into a hard preference). SDPO here is an
+ *experimental intervention*, not a benign stabilizer.
+
+ ## Decision Drivers
+
+ - The deliverable is a *usable, interpretable* integration, not just "it runs."
+ - Attribution requires isolating channels; combined-first defeats the experiment.
+ - Reward must resist the altered model letter-format-hacking MMLU.
+ - Must measure washout vs amplification: dual-KL logging (to altered-init AND to
+   unaltered-base) is the instrument.
+ - This framework stays generic; LMA-specific code lives in LMA's repo using the
+   framework as a dependency (per the tie-in doc's repo-layout proposal).
+
+ ## Decision Outcome
+
+ **Build the integration glue as a framework-side, model-agnostic scaffold +
+ unit-tested runners, with the isolated-channel ladder baked in as the default
+ experiment design.** Specifically:
+
+ 1. **`composer_replication/integrations/altered_minds/` (framework-side, generic)**:
+    a thin, tested adapter providing:
+    - `MMLUFormatReward`: structured-answer reward (parse final `Answer: X` /
+      JSON `{"answer":...}`; `+1` correct, `0` wrong, `−0.2` unparseable, `−0.1`
+      multiple-answers, length penalty past a rationale cap; option-order
+      randomization with original-label tracking). **Scores only the final
+      answer, never the rationale style** (avoids rewarding
+      distorted-but-persuasive reasoning).
+    - `dual_kl_logger`: logs KL(policy‖altered-init) AND KL(policy‖unaltered-base)
+      each step; this is the washout/amplification instrument. Optimizes neither
+      by default; both are diagnostics.
+    - `channel_ladder_configs()`: returns the A0–A4 config ladder (see below) so
+      a runner can sweep them with identical seeds/prompts.
+
+ 2. **Isolated-channel ladder (the experiment design, replaces α=0.2/β=0.4):**
+    | Arm | alpha_sdpo | beta_replay | Purpose |
+    |---|---|---|---|
+    | A0 | n/a | n/a | altered SFT, no RL (control) |
+    | A1 | 0.0 | 0.0 | GRPO-only baseline |
+    | A2 | **0.02** | 0.0 | +SDPO small (amplification probe) |
+    | A3 | 0.0 | **0.05** | +replay-DPO small (washout probe) |
+    | A4 | 0.02 | 0.05 | combined, only after A1–A3 interpretable |
+    KL-to-altered-init coef `kl_beta=0.02`, adaptive to target 0.01–0.03
+    nats/token; hard-stop/LR-cut if KL > ~0.08 or personality probes drift sharply.
+    Sweeps: `alpha_sdpo ∈ {0, 0.02, 0.05}`, `beta_replay ∈ {0, 0.05, 0.10}`.
+
+ 3. **LMA-repo runner scaffold (written to LMA, NOT this repo), DEFERRED to an
+    explicit go:**
+    `composer_replication_runs/{moral_scenarios_replay,train_grpo,eval_post_rl}.py`.
+    Built + unit-tested with mocks here; **not executed against
+    the real LMA budget/checkpoints without explicit user approval** (it spends
+    grant money on 8B runs).
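
The ladder table can be expressed as data, so a runner can sweep the arms with identical seeds and prompts. In this sketch the `ArmConfig` dataclass and its field layout are assumptions (the ADR names `channel_ladder_configs()` but does not specify its return type); the α/β/`kl_beta` values are the ones from the table.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ArmConfig:
    name: str
    run_rl: bool                   # False only for the no-RL control arm
    alpha_sdpo: Optional[float]    # None where the channel weight does not apply
    beta_replay: Optional[float]
    kl_beta: Optional[float]
    purpose: str


def channel_ladder_configs(kl_beta: float = 0.02) -> Dict[str, ArmConfig]:
    """Return the A0-A4 isolated-channel ladder keyed by arm name."""
    arms = [
        ArmConfig("A0", False, None, None, None, "altered SFT, no RL (control)"),
        ArmConfig("A1", True, 0.0, 0.0, kl_beta, "GRPO-only baseline"),
        ArmConfig("A2", True, 0.02, 0.0, kl_beta, "+SDPO small (amplification probe)"),
        ArmConfig("A3", True, 0.0, 0.05, kl_beta, "+replay-DPO small (washout probe)"),
        ArmConfig("A4", True, 0.02, 0.05, kl_beta, "combined, only after A1-A3 are interpretable"),
    ]
    return {arm.name: arm for arm in arms}


ladder = channel_ladder_configs()
for name, arm in ladder.items():
    print(name, arm.alpha_sdpo, arm.beta_replay, arm.purpose)
```

Keeping the arms immutable (`frozen=True`) makes it harder for a sweep script to mutate one arm and silently turn an isolated-channel run into a combined one.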
+
+ ### Consequences
+
+ - **Positive**: the integration is interpretable by construction; the
+   amplification hypothesis becomes testable (A2 vs A1).
+ - **Positive**: framework stays generic (adapter is MMLU-format-generic, not
+   LMA-coupled); reusable for any "RL on an altered model" study.
+ - **Negative**: the original tie-in doc's combined-first plan is superseded;
+   `docs/ALTERED_MINDS_TIE_IN.md` Phase-3 hyperparameters are updated to point here.
+ - **Neutral**: the real 8B run stays user-gated (budget); this ADR ships the
+   *capability*, proven on a small CPU/Modal model, not the LMA result.
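
A structured-answer reward of the kind `MMLUFormatReward` is meant to provide might be sketched as follows. The regular expression, the reading of "multiple answers" as more than one distinct parsed option, and the length-penalty parameters are assumptions made for illustration; the reward values (+1, 0, −0.2, −0.1) come from this ADR. Option-order randomization is left out of the sketch.

```python
import json
import re
from typing import List

ANSWER_RE = re.compile(r"^\s*Answer:\s*([A-D])\s*$", re.MULTILINE)
OPTIONS = {"A", "B", "C", "D"}


def parse_answers(text: str) -> List[str]:
    """Collect every final-answer candidate from `Answer: X` lines and flat JSON objects."""
    found = ANSWER_RE.findall(text)
    for match in re.finditer(r"\{[^{}]*\}", text):
        try:
            obj = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue
        answer = obj.get("answer")
        if isinstance(answer, str) and answer.strip().upper() in OPTIONS:
            found.append(answer.strip().upper())
    return found


def mmlu_format_reward(text: str, gold: str, rationale_cap: int = 400,
                       penalty_per_char: float = 0.0005,
                       max_length_penalty: float = 0.5) -> float:
    """Score only the parsed final answer; rationale style is never inspected."""
    answers = parse_answers(text)
    if not answers:
        base = -0.2                       # unparseable
    elif len(set(answers)) > 1:
        base = -0.1                       # conflicting answers
    else:
        base = 1.0 if answers[0] == gold else 0.0
    overflow = max(0, len(text) - rationale_cap)
    return base - min(max_length_penalty, overflow * penalty_per_char)


print(mmlu_format_reward("Because of X.\nAnswer: B", gold="B"))
print(mmlu_format_reward("Answer: A\nAnswer: C", gold="C"))
```

Logging the distribution of parsed options alongside the reward would be one way to surface an "always C" policy, since such a policy scores 0 on most items rather than being penalized.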
+
+ ## Acceptance gate
+
+ - [ ] `MMLUFormatReward` implemented + tested: correct→+1, wrong→0,
+   unparseable→−0.2, multiple-answers→−0.1, length-penalty; an "always C" /
+   option-prior exploit is detectable via logged option distribution. Rationale
+   style is NOT scored.
+ - [ ] `dual_kl_logger` logs both KLs; unit test on a toy policy/ref pair asserts
+   KL(p‖p)==0 and KL increases as the policy moves.
+ - [ ] `channel_ladder_configs()` returns A0–A4 with the documented α/β/kl_beta;
+   unit test asserts A1 has both channels off, A2 SDPO-only, A3 replay-only.
+ - [ ] LMA runner scaffold exists with mock-driven unit tests (no real model load,
+   no Modal, no budget spend) proving the wiring: altered-ckpt → collator →
+   ComposerReplicationTrainer(A2 config) → reward_fn → step.
+ - [ ] `docs/ALTERED_MINDS_TIE_IN.md` updated: Phase-3 hyperparameters replaced by
+   a pointer to this ADR's ladder; the amplification-risk finding documented.
+ - [ ] **Out of scope (user-gated):** any real LMA checkpoint load or Modal/budget
+   spend. Documented as the explicit go-decision.
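
The dual-KL diagnostic in the gate above can be illustrated on toy categorical distributions. The function names and metric keys are hypothetical, and a real logger would work on per-token log-probabilities rather than three-way distributions; the sketch only shows the two properties the gate asks for and how the two KLs move in opposite directions under washout.

```python
import math
from typing import Dict, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two categorical distributions of equal length."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def dual_kl_log(policy: Sequence[float], altered_init: Sequence[float],
                unaltered_base: Sequence[float]) -> Dict[str, float]:
    """Diagnostics only: neither value is fed back into the loss."""
    return {
        "kl/policy_vs_altered_init": kl_divergence(policy, altered_init),
        "kl/policy_vs_unaltered_base": kl_divergence(policy, unaltered_base),
    }


altered_init = [0.7, 0.2, 0.1]
unaltered_base = [0.25, 0.5, 0.25]
# A toy trajectory in which the policy drifts away from the altered init.
trajectory = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.4, 0.4, 0.2]]
logs = [dual_kl_log(p, altered_init, unaltered_base) for p in trajectory]
for entry in logs:
    print(entry)
```

In this toy trajectory the KL to the altered init grows while the KL to the unaltered base shrinks, which is the pattern one would read as washout; the reverse pattern would point toward preservation or amplification.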
+
+ ## More Information
+
+ - `/tmp/composer-research/r2-altered-model-rl.md`: the soundness critique.
+ - `docs/ALTERED_MINDS_TIE_IN.md`: original tie-in (Phase-3 hyperparams superseded).
+ - `~/wiki/projects/llm-mental-alterations.md`: LMA wave status, H-7 result, budget.
docs/adrs/README.md CHANGED
@@ -12,5 +12,8 @@
  | [ADR-008](ADR-008-drgrpo-sdpo-live-channel.md) | Target Dr. GRPO + host live SDPO channel in TRL trainer | accepted | 2026-05-29 |
  | [ADR-009](ADR-009-layered-hint-generator.md) | Layered HintGenerator for SDPO textual feedback | accepted | 2026-05-29 |
  | [ADR-010](ADR-010-feature-deletion-datagen.md) | FeatureDeletionEnv synthetic-data subsystem over OSS SWE substrates | accepted | 2026-05-29 |
+ | [ADR-011](ADR-011-sdpo-alignment-indices.md) | Collator-emitted SDPO alignment indices (close strict-guard regression) | accepted (amends ADR-008) | 2026-05-29 |
+ | [ADR-012](ADR-012-close-review-findings.md) | Close open cross-family-review findings (KL/hint-routing/provenance/curriculum) | accepted (amends 008/009/010) | 2026-05-29 |
+ | [ADR-013](ADR-013-lma-integration-channel-ladder.md) | LMA integration: isolated-channel ladder (supersedes tie-in Phase-3 hyperparams) | accepted | 2026-05-29 |
  
  Sorted by number ascending. ADRs are immutable after `accepted`; supersede or amend rather than edit.
research/11-sdpo-alignment-indices.md ADDED
@@ -0,0 +1,232 @@
+ # SDPO Alignment Indices: Canonical Collator Design
+
+ **Status**: Recommendation (Option B with ragged-K safety)
+ **Date**: 2026-05-29
+ **Context**: Cross-family review found the SDPO loss alignment guard was shape-only; the fix (commit 2026-05-29) now **requires** `student_response_idx` / `teacher_response_idx` LongTensors from the collator. The production collator (`composer_replication/trainer/data_collator.py`) does not yet emit them, so strict SDPO raises.
+
+ ---
+
+ ## 1. What the loss now demands
+
+ `ComposerReplicationTrainer._compute_sdpo_loss` (lines 184–242 of `composer_trainer.py`) does:
+
+ ```python
+ s_idx = inputs.get("student_response_idx")  # (B, K) LongTensor
+ t_idx = inputs.get("teacher_response_idx")  # (B, K) LongTensor
+ # … guard: raise if strict + missing …
+ vocab = student_logits.size(-1)
+ s_gather = s_idx.unsqueeze(-1).expand(-1, -1, vocab)  # (B, K, V)
+ t_gather = t_idx.unsqueeze(-1).expand(-1, -1, vocab)
+ student_aligned = torch.gather(student_logits, 1, s_gather)  # (B, K, V)
+ teacher_aligned = torch.gather(teacher_logits, 1, t_gather)  # (B, K, V)
+ # → generalized_jsd_loss(student_aligned, teacher_aligned, …)
+ ```
+
+ It expects per-row indices into the sequence dimension (dim=1) selecting **K** aligned post-hint response-token positions. K is the number of error-turn response tokens in that row.
+
+ ---
+
+ ## 2. Option A vs Option B: Recommendation
+
+ ### Option A: derive indices from the existing equal-length mask
+
+ Since the collator *already* builds same-length student/teacher sequences via `_build_aligned_student_for_sdpo` (placeholder system-message of identical token length at the hint-slot), both `sdpo_loss_mask` (teacher-side) and `response_mask` (student-side) mark the exact same token positions. We could simply do:
+
+ ```python
+ student_response_idx = torch.nonzero(response_mask == 1, as_tuple=False)  # (N, 2) [row, position] pairs
+ teacher_response_idx = torch.nonzero(sdpo_loss_mask == 1, as_tuple=False)  # identical
+ ```
+
+ **Pros**: zero new collator complexity; the mask is already correct.
+ **Cons**: (a) couples the loss to an *implementation detail* (the placeholder trick): if the collator ever drops same-length alignment, all rows silently break; (b) the mask selects a *subset* of the response tokens (only assistant content), while the indices could select *all* post-hint tokens including chat-template scaffolding; (c) `torch.gather` with identical indices on both sides selects the same positions a mask would, at extra memory cost, so the loss should just take a mask path when alignment is trivial.
+
+ **Verdict**: Reject. The mask path should remain as a *fallback* inside `generalized_jsd_loss` when the collator hasn't been upgraded, but the canonical emission is distinct indices.
+
+ ### Option B (RECOMMENDED): emit distinct indices unconditionally
+
+ The collator computes both `student_response_idx` and `teacher_response_idx` *explicitly* during `_build_aligned_student_for_sdpo` / `_build_sdpo_fields`. Even when sequences are same-length, emitting the indices:
+ - Proves to the loss that alignment was *deliberately* solved (not accidentally same-length)
+ - Survives a future where the placeholder trick is replaced by dynamic padding
+ - Allows `teacher_response_idx` to differ from `student_response_idx` in any future generalization (e.g., the teacher omits some non-content tokens)
+
+ **This is the canonical design.** The remainder of this document specifies the exact tensor construction.
+
+ ---
+
+ ## 3. Design: constructing the indices from the existing mask
+
+ The collator already computes:
+ - `sdpo_loss_mask`: (B, T) with 1 at teacher post-hint error-turn content tokens, `ignore_index` (-100) elsewhere.
+ - `response_mask`: (B, T) with 1 at student assistant-content tokens (incl. post-hint), 0 elsewhere.
+
+ Because the collator's `_build_chat_aligned_mask` uses per-message `apply_chat_template` prefix deltas to place mask bits *exactly* on content tokens regardless of scaffolding, the 1-positions in both masks correspond to the **same logical token** in an aligned comparison.
+
+ ### Step 1: per-row nonzero positions
+
+ ```python
+ def _build_response_indices(
+     mask: torch.Tensor,  # (B, T), 1=response, 0=ignore
+     pad_sentinel: int = -1,
+ ) -> tuple[torch.Tensor, torch.Tensor]:
+     """Convert a per-row response mask to padded index tensors.
+
+     Returns:
+         idx: (B, K_max) LongTensor; position indices, padded with sentinel
+         valid: (B, K_max) BoolTensor; True where idx is a real position
+     """
+     B, T = mask.shape
+     rows = []
+     for b in range(B):
+         pos = torch.nonzero(mask[b] == 1, as_tuple=True)[0]  # (K_b,)
+         rows.append(pos)
+     K_max = max(r.numel() for r in rows) if rows else 0
+     if K_max == 0:
+         # No error sites in this batch: return empty sentinels
+         return (
+             torch.full((B, 0), pad_sentinel, dtype=torch.long, device=mask.device),
+             torch.zeros(B, 0, dtype=torch.bool, device=mask.device),
+         )
+
+     idx = torch.full((B, K_max), pad_sentinel, dtype=torch.long, device=mask.device)
+     valid = torch.zeros(B, K_max, dtype=torch.bool, device=mask.device)
+     for b, row in enumerate(rows):
+         k = row.numel()
+         idx[b, :k] = row
+         valid[b, :k] = True
+     return idx, valid
+ ```
+
+ ### Step 2: unified emission in the collator's `__call__`
+
+ Inside `ComposerDataCollator.__call__` (after the `_build_aligned_student_for_sdpo` block, around line 191), add:
+
+ ```python
+ # --- Emit SDPO alignment indices (ADR-008 gate) ---
+ if "sdpo_loss_mask" in out and "response_mask" in out:
+     # Teacher-side: where does sdpo_loss_mask == 1?
+     # Note: sdpo_loss_mask uses ignore_index (-100) for non-loss tokens.
+     # We want positions where the value is exactly 1 (the in-loss marker).
+     t_mask = (out["sdpo_loss_mask"] == 1)  # (B, T)
+     t_idx, t_valid = _build_response_indices(t_mask)
+
+     # Student-side: where does response_mask == 1?
+     # response_mask is 0/1; 1 means assistant-response token.
+     s_mask = (out["response_mask"] == 1)  # (B, T)
+     s_idx, s_valid = _build_response_indices(s_mask)
+
+     # When sequences are same-length and aligned by the placeholder trick,
+     # s_idx will equal t_idx for every valid position. The loss can
+     # optionally assert this in debug mode, but the canonical contract
+     # is that the two index tensors describe the alignment, and they
+     # MAY differ in future collator versions.
+     out["student_response_idx"] = s_idx
+     out["teacher_response_idx"] = t_idx
+     out["student_response_valid"] = s_valid  # (B, K_max)
+     out["teacher_response_valid"] = t_valid  # (B, K_max); same K_max when both masks mark the same tokens
+ ```
+
+ ### Step 3: sentinel handling in the loss
+
+ The loss currently does `torch.gather` unconditionally. The sentinel value (-1) is not a valid `torch.gather` index: PyTorch does not wrap negative indices in `gather`, so an unclamped sentinel raises an out-of-bounds error instead of selecting the last token. The loss should therefore clamp sentinels to a safe index before gathering and mask those positions afterwards. Update `_compute_sdpo_loss`:
+
+ ```python
+ # Gather with sentinels clamped to 0 (torch.gather rejects negative indices):
+ student_aligned = torch.gather(student_logits, 1, s_gather.clamp(min=0))  # (B, K, V)
+ teacher_aligned = torch.gather(teacher_logits, 1, t_gather.clamp(min=0))  # (B, K, V)
+
+ # Build a (B, K) mask: True where BOTH indices are valid (not sentinel).
+ # When the collator guarantees s_valid == t_valid, use either.
+ if "student_response_valid" in inputs:
+     aligned_mask = inputs["student_response_valid"]  # (B, K), already BoolTensor
+ else:
+     aligned_mask = (s_idx >= 0) & (t_idx >= 0)  # sentinel=-1 guard
+
+ # Pass this as the labels mask to generalized_jsd_loss.
+ # The loss already handles labels != -100 masking; we repurpose it:
+ #   labels[b, k] = 1 if aligned_mask[b, k] else -100
+ aligned_labels = torch.where(
+     aligned_mask,
+     torch.ones_like(s_idx, dtype=torch.long),
+     torch.full_like(s_idx, -100, dtype=torch.long),
+ )
+
+ return generalized_jsd_loss(
+     student_logits=student_aligned,
+     teacher_logits=teacher_aligned,
+     labels=aligned_labels,
+
+ )
+ ```
+
+ ---
+
+ ## 4. Why this is canonical
+
+ | Property | How this design provides it |
+ |---|---|
+ | **Token-level alignment** | Indices are derived from `_build_chat_aligned_mask`, which uses per-message prefix deltas to locate content tokens inside the full chat-template tokenization, not naive segment concatenation. |
+ | **Ragged-K safety** | Pad to `K_max` with sentinel -1; emit a `*_valid` BoolTensor. The loss masks sentinels via `labels=-100` (standard HF ignore convention). No silent padding-token contribution. |
+ | **No additional forward passes** | The indices are computed from existing mask tensors inside `__call__`, with zero extra tokenizer calls. |
+ | **Forward-compatible** | Emitting distinct student/teacher indices survives a future where the placeholder trick is replaced by dynamic-length alignment. |
+ | **Auditable** | A one-line assertion `(s_idx == t_idx).all()` in a debug build verifies the placeholder trick is still intact. |
+
+ ---
+
+ ## 5. Code sketch (full emission path)
+
+ ```python
+ # === In ComposerDataCollator.__call__, after line 191 ===
+
+ def _mask_to_padded_indices(
+     mask: torch.Tensor,  # (B, T) where 1 = valid position
+     pad_sentinel: int = -1,
+ ) -> tuple[torch.Tensor, torch.Tensor]:
+     """Convert (B,T) boolean mask → (B,K_max) index tensor + (B,K_max) validity mask."""
+     B, T = mask.shape
+     # Per-row nonzero: torch.nonzero on a 2D bool tensor gives (N,2); reshape.
+     nz = torch.nonzero(mask, as_tuple=False)  # (total_K, 2)
+     # Group by row:
+     counts = mask.sum(dim=1).long()  # (B,), K per row
+     K_max = int(counts.max().item()) if counts.numel() else 0
+     if K_max == 0:
+         return (
+             torch.full((B, 0), pad_sentinel, dtype=torch.long, device=mask.device),
+             torch.zeros(B, 0, dtype=torch.bool, device=mask.device),
+         )
+     idx = torch.full((B, K_max), pad_sentinel, dtype=torch.long, device=mask.device)
+     valid = torch.zeros(B, K_max, dtype=torch.bool, device=mask.device)
+     # nz[:, 0] are batch indices, nz[:, 1] are position indices
+     batch_idx = nz[:, 0]  # (total_K,)
+     pos_idx = nz[:, 1]  # (total_K,)
+     # Build a per-batch write offset using cumsum
+     offsets = torch.zeros(B + 1, dtype=torch.long, device=mask.device)
+     offsets[1:] = counts.cumsum(dim=0)
+     for b in range(B):
+         start, end = offsets[b].item(), offsets[b + 1].item()
+         k = end - start
+         if k > 0:
+             idx[b, :k] = pos_idx[start:end]
+             valid[b, :k] = True
+     return idx, valid
+
+ # --- Emission ---
+ if "sdpo_loss_mask" in out and "response_mask" in out:
+     t_mask = (out["sdpo_loss_mask"] == 1)
+     s_mask = (out["response_mask"] == 1)
+     t_idx, t_valid = _mask_to_padded_indices(t_mask)
+     s_idx, s_valid = _mask_to_padded_indices(s_mask)
+     out["student_response_idx"] = s_idx
+     out["teacher_response_idx"] = t_idx
+     out["student_response_valid"] = s_valid
+     out["teacher_response_valid"] = t_valid
+ ```
+
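The padding contract of the sketch above can be checked against a dependency-free reference. The helper below is a hypothetical testing aid, not part of the collator: `mask_to_padded_indices_ref` is an assumed name, and it works on plain nested lists rather than tensors so that a unit test could compare its output with the tensor version row by row.

```python
def mask_to_padded_indices_ref(mask, pad_sentinel=-1):
    """Pure-Python reference: list-of-lists 0/1 mask -> (idx, valid).

    Mirrors the intended contract of the tensor helper: per-row positions
    of set bits in ascending order, right-padded to the longest row.
    """
    # Positions of set bits for each row, in ascending order.
    per_row = [[t for t, m in enumerate(row) if m] for row in mask]
    # Longest row decides the padded width; 0 for an empty batch.
    k_max = max((len(r) for r in per_row), default=0)
    idx = [r + [pad_sentinel] * (k_max - len(r)) for r in per_row]
    valid = [[True] * len(r) + [False] * (k_max - len(r)) for r in per_row]
    return idx, valid


# Ragged example: row 0 has two response tokens, row 1 has one.
idx, valid = mask_to_padded_indices_ref([[0, 1, 1, 0], [0, 0, 0, 1]])
# idx   == [[1, 2], [3, -1]]
# valid == [[True, True], [True, False]]
```

The sentinel only marks padding; any consumer that indexes with `idx` still has to consult `valid` first, since -1 is not a usable position.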
+ ---
+
+ ## 6. Migration path
+
+ 1. Add `_mask_to_padded_indices` and the emission block to `data_collator.py`.
+ 2. The existing `_compute_sdpo_loss` in `composer_trainer.py` already handles the index path (lines 184–242); it only needs the sentinel-mask addition described in §3 Step 3.
+ 3. Update `_compute_sdpo_loss` to prefer the aligned index path even when shapes match, and remove the legacy shape-only fallback from the strict path entirely.
+ 4. The non-strict path (`strict_sdpo_alignment=False`) can fall back to `torch.nonzero(sdpo_loss_mask == 1)` as a convenience for ad-hoc scripts, but the canonical production path is the explicit indices.
+
research/12-altered-model-rl-critique.md ADDED
@@ -0,0 +1,66 @@
+ # Critique: SDPO + Dr.GRPO + trace-replay-DPO on a personality-altered model
+
+ ## Bottom line
+ The proposed three-channel stage is scientifically interesting but unsafe as a first *interpretable* run. Applied to an LMA personality-altered SFT model, the combined recipe confounds at least four effects: task RL, self-distillation of altered reasoning, frontier-teacher imitation, and KL anchoring. If capability or moral-scenarios behavior changes, the result will not identify whether RL washed out, preserved, or amplified the alteration.
+
+ ## 1. SDPO against the altered model's own hint-conditioned passes
+ SDPO with the altered model as its own hint-conditioned teacher is only sound if the hints expose latent correct reasoning that the base forward pass underuses. On an altered model, that assumption is fragile. The hint-conditioned pass may instead expose the same distorted policy under a more verbose or more confident trajectory. Then SDPO becomes a consistency regularizer around an already-biased teacher, not an improvement signal.
+
+ Main risks:
+
+ - **Degenerate fixed point:** teacher and student are same-family, same checkpoint, and differ only by prompting. If hints do not add independent information, the optimum is to imitate the altered model's own conditional distribution.
+ - **Amplification:** if the altered model has class-specific moral-scenarios distortion, hinting may increase rationalization of the distorted answer. SDPO can convert a soft bias into a sharper preference.
+ - **Mode collapse / reduced diversity:** KL-to-own-hinted-output rewards low-entropy agreement. Combined with answer-only GRPO, this can produce brittle letter-pattern policies.
+ - **False preservation:** staying close to the altered hint teacher may look like preserving the personality alteration while actually preserving only task-format artifacts.
+
+ So SDPO-only is not a benign auxiliary loss here; it is the channel most likely to test the amplification hypothesis. Treat it as an experimental intervention, not a default stabilizer.
+
+ ## 2. Hyperparameters and attribution
+ Do **not** start with alpha_sdpo=0.2 and beta_replay=0.4 in a combined run. Those weights are large enough that no observed outcome could be attributed to a single channel. The first real run should isolate channels:
+
+ 1. **GRPO-only baseline:** alpha_sdpo=0, beta_replay=0.
+ 2. **SDPO-only add-on:** alpha_sdpo small, beta_replay=0.
+ 3. **Replay-DPO-only add-on:** alpha_sdpo=0, beta_replay small.
+ 4. **Combined only after the above:** use the smallest weights that showed useful signal without personality drift or distortion amplification.
+
+ Safe initial defaults, assuming loss terms are normalized to comparable token/batch scales:
+
+ - `alpha_sdpo`: **0.02** for first SDPO run; sweep `{0.0, 0.02, 0.05}`. Avoid 0.2 until there is evidence no amplification occurs.
+ - `beta_replay`: **0.05** for first replay run; sweep `{0.0, 0.05, 0.10}`. Avoid 0.4 initially because frontier teachers may wash out the alteration and dominate GRPO.
+ - Combined pilot: `alpha_sdpo=0.02`, `beta_replay=0.05`, only after isolated runs.
+ - KL coefficient to frozen altered SFT init: `kl_beta=0.02` initially, adapted to target roughly **0.01-0.03 nats/token** on answer+reasoning tokens. Hard-stop or reduce LR if KL exceeds ~0.08 nats/token or if personality probes drift sharply.
+ - Also log KL to the original unaltered base/SFT. Do not optimize this KL unless the goal is explicitly de-alteration; use it to measure washout.
+
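The KL thresholds in the list above can be sketched as a small controller. This is a minimal illustration under stated assumptions: the `KLGuard` class, its action strings, and the multiplicative `step_factor` update are hypothetical and not taken from the trainer; only the numeric defaults (0.02, the 0.01-0.03 band, the 0.08 limit) come from the recommendations above.

```python
from dataclasses import dataclass


@dataclass
class KLGuard:
    """Adaptive KL coefficient with a hard limit (illustrative only)."""
    kl_beta: float = 0.02       # initial coefficient from the list above
    target_low: float = 0.01    # lower edge of the target band, nats/token
    target_high: float = 0.03   # upper edge of the target band, nats/token
    hard_limit: float = 0.08    # stop or reduce LR above this
    step_factor: float = 1.5    # assumption: multiplicative adjustment size

    def update(self, kl_per_token: float) -> str:
        """Adjust kl_beta for one measurement and return the action taken."""
        if kl_per_token > self.hard_limit:
            return "stop"                      # caller halts or lowers LR
        if kl_per_token > self.target_high:
            self.kl_beta *= self.step_factor   # drifting too far: tighten
            return "tighten"
        if kl_per_token < self.target_low:
            self.kl_beta /= self.step_factor   # over-anchored: loosen
            return "loosen"
        return "hold"


guard = KLGuard()
action = guard.update(0.05)  # above the band, below the hard limit
```

The same guard would be fed the KL to the altered SFT init only; the KL to the unaltered base is logged separately and never drives the coefficient.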
+ ## 3. Reward design to avoid MMLU letter hacking
+ A letter-correctness reward is too hackable unless the interface is constrained.
+
+ Recommended reward:
+
+ - Require structured output: `{"answer":"A|B|C|D","rationale":"..."}` or a single final `Answer: X` after reasoning.
+ - Parse only the final answer field; reject multiple final answers.
+ - Reward: `+1` correct, `0` incorrect, `-0.2` invalid/unparseable, `-0.1` multiple answers, small length penalty after a rationale cap.
+ - Randomize option order per epoch and track original label mapping.
+ - Balance batches across subject and moral-scenarios classes, especially the collapsed class-3 subset.
+ - Hold out a moral-scenarios diagnostic split that is never used for reward.
+ - Add calibration metrics: entropy over options, answer distribution, invalid rate, rationale length, and per-class accuracy. A model that learns “always C” or exploits option priors should be obvious.
+
+ Do not reward chain-of-thought style itself. If rationales are used, score only the final answer and, optionally, format validity; otherwise the model can learn persuasive distorted rationalizations.
+
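The reward values listed above can be sketched for the `Answer: X` variant as follows. This is an illustrative sketch, not the project's reward code: `letter_reward`, the regular expression, the character-based `rationale_cap`, and `length_coef` are all assumptions chosen for the example; only the `+1 / 0 / -0.2 / -0.1` values come from the list.

```python
import re

# A final-answer line must be exactly "Answer: <A-D>" on its own line.
ANSWER_RE = re.compile(r"^Answer:\s*([A-D])\s*$", re.MULTILINE)


def letter_reward(
    completion: str,
    gold: str,
    rationale_cap: int = 2000,   # assumption: cap measured in characters
    length_coef: float = 1e-4,   # assumption: penalty per character over cap
) -> float:
    """Score only the parsed final answer, never the rationale's style."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return -0.2              # invalid or unparseable output
    if len(matches) > 1:
        return -0.1              # multiple final answers are rejected
    base = 1.0 if matches[0] == gold else 0.0
    overflow = max(0, len(completion) - rationale_cap)
    return base - length_coef * overflow


reward = letter_reward("Option B follows from the premise.\nAnswer: B", "B")
```

When option order is randomized per epoch, `gold` would be the letter after shuffling, so the mapping back to the original label has to be tracked outside this function.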
+ ## 4. Cheapest experiment that distinguishes washout / preserve / amplify
+ Use one altered checkpoint, one matched unaltered checkpoint, and a fixed evaluation harness. Run short, equal-token pilots with identical prompts/seeds:
+
+ - A0: altered SFT, no RL.
+ - A1: GRPO-only.
+ - A2: GRPO + SDPO small.
+ - A3: GRPO + replay-DPO small.
+ - A4: combined small, only if A1-A3 are interpretable.
+
+ Evaluate before/after on: general MMLU, moral_scenarios by class, LMA personality/psychiatric probes, KL to altered init, KL to unaltered base, option distribution, and refusal/format invalid rates.
+
+ Interpretation:
+
+ - **Washout:** capability improves while personality markers and altered-specific moral signature move toward unaltered baseline.
+ - **Preservation:** capability improves with stable personality markers and stable moral-scenarios signature.
+ - **Amplification:** moral class-3 or cognitive-distortion probes worsen, confidence/entropy sharpens, or the SDPO run diverges more than the GRPO-only run.
+
+ The proposed alpha=0.2/beta=0.4 combined recipe should be considered a later stress test, not a first run.