architect: ADR-011/012/013 + research (alignment-index fix, review-findings closure, LMA channel-ladder)
docs/adrs/ADR-011-sdpo-alignment-indices.md
ADDED
|
@@ -0,0 +1,102 @@
---
status: accepted
date: 2026-05-29
amends: ADR-008
deciders: [Codeseys, ARIA]
---

# ADR-011: Collator-emitted SDPO alignment indices (close the strict-guard regression)

## Context and Problem Statement

The 2026-05-29 cross-family review of ADR-008 found the SDPO student/teacher
alignment guard was a *shape-only* check
(`student_logits.shape == teacher_logits.shape`), which does not establish
token-level alignment because the teacher context has a hint inserted at the
error turn (shifting response tokens right). The fix made
`ComposerReplicationTrainer._compute_sdpo_loss` **require** explicit
`student_response_idx` / `teacher_response_idx` LongTensors and `torch.gather`
the aligned post-hint logits before JSD, raising in strict mode (the default)
when they are absent.
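
To make the failure mode concrete, here is a minimal pure-Python sketch. It uses toy token strings in place of logit rows, and all names and values are illustrative, not the trainer's code. Two sequences have the same length, so a shape check passes, yet the response tokens sit at different positions; explicit index lists restore the pairing.

```python
# Toy illustration only: token strings stand in for per-position logit rows.
student = ["<user>", "fix", "bug", "<asst>", "ok", "done", "<pad>", "<pad>"]
teacher = ["<user>", "fix", "bug", "<hint>", "h1", "<asst>", "ok", "done"]

# Shape-only guard: passes, because both sequences have length 8.
shape_ok = len(student) == len(teacher)

student_response_idx = [4, 5]  # "ok", "done" in the student
teacher_response_idx = [6, 7]  # same tokens, shifted right by the hint

# Naive position-wise pairing at the student's response positions.
naive_pairs = [(student[i], teacher[i]) for i in student_response_idx]

# Index-based pairing (the gather step, emulated with list indexing).
aligned_pairs = [
    (student[s], teacher[t])
    for s, t in zip(student_response_idx, teacher_response_idx)
]

print(shape_ok)       # True
print(naive_pairs)    # [('ok', 'h1'), ('done', '<asst>')]
print(aligned_pairs)  # [('ok', 'ok'), ('done', 'done')]
```

The naive pairing compares a response token against hint and scaffolding tokens, which is the silent misalignment the explicit indices are meant to rule out.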

**Regression introduced:** the production collator
(`composer_replication/trainer/data_collator.py`) does NOT emit those index
tensors, so the default (strict) SDPO path now raises against the real collator.
This ADR closes that gap.

The collator already solves the hard alignment problem: `_build_aligned_student_for_sdpo`
builds a student sequence that mirrors the hint-conditioned teacher by inserting
a placeholder system-message of identical token length where the teacher has the
hint, and `_build_chat_aligned_mask` (per-message `apply_chat_template`
prefix-delta subsequence matching) marks the post-hint recovery-turn content
tokens with 1 in both `sdpo_loss_mask` (teacher) and `response_mask` (student).
So the 1-positions of both masks already correspond to the same logical tokens.

Research: `/tmp/composer-research/r1-alignment-indices.md` (DeepSeek V4 Pro,
2026-05-29).

## Decision Drivers

- The loss already requires the indices; the collator must supply them or strict
  SDPO is unusable (it raises).
- The indices are *derivable from masks the collator already computes correctly*:
  no extra tokenizer calls, no new alignment logic.
- The contract must be forward-compatible: if the placeholder trick is ever
  dropped for dynamic-length alignment, distinct student/teacher indices still
  describe the alignment.
- Ragged K (different rows have different numbers of error-turn tokens) must be
  handled without silent padding-token contributions to the JSD.

## Considered Options

- **A. Derive indices on-the-fly inside the loss from `sdpo_loss_mask`**: couples
  the loss to the collator's placeholder implementation detail; rejected.
- **B. Collator emits explicit `student/teacher_response_idx` + `*_valid` masks,
  derived from the existing chat-aligned masks** (chosen).
- **C. Drop the index requirement, revert to shape-check**: re-opens the P0 the
  review caught; rejected.

## Decision Outcome

Chosen: **Option B.** The collator emits four new batch keys when SDPO is active:
`student_response_idx` (B, K_max), `teacher_response_idx` (B, K_max),
`student_response_valid` (B, K_max bool), `teacher_response_valid` (B, K_max bool).
A `_mask_to_padded_indices(mask, pad_sentinel=-1)` helper converts a (B, T)
response mask to a padded (B, K_max) index tensor + validity mask (sentinel -1
for ragged padding). The loss masks sentinel positions by building an
`aligned_labels` tensor (1 where valid, -100 elsewhere) passed to
`generalized_jsd_loss` (which already honors the -100 ignore convention).
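
The mask-to-index conversion and the `aligned_labels` construction can be sketched without any tensor library. The following is a pure-Python analogue using nested lists, under stated assumptions: the function names are hypothetical stand-ins for the collator helper, and the clamping step is an assumption added here because `torch.gather` is expected to reject out-of-range indices such as -1, so sentinel slots would need a valid placeholder index before gathering.

```python
IGNORE_INDEX = -100  # ignore convention named in the ADR


def mask_to_padded_indices(mask, pad_sentinel=-1):
    """Pure-Python analogue: (B, T) 0/1 mask -> padded (B, K_max) idx and valid."""
    rows = [[t for t, bit in enumerate(row) if bit == 1] for row in mask]
    k_max = max((len(r) for r in rows), default=0)
    idx = [r + [pad_sentinel] * (k_max - len(r)) for r in rows]
    valid = [[True] * len(r) + [False] * (k_max - len(r)) for r in rows]
    return idx, valid


def aligned_labels(valid):
    """1 where the slot is a real position, IGNORE_INDEX where it is padding."""
    return [[1 if v else IGNORE_INDEX for v in row] for row in valid]


def clamp_for_gather(idx, valid):
    """Assumption: replace sentinels with 0 so a gather never sees a negative index."""
    return [[i if v else 0 for i, v in zip(r, vr)] for r, vr in zip(idx, valid)]


mask = [
    [0, 1, 1, 1, 0],  # K = 3
    [0, 0, 1, 0, 0],  # K = 1
]
idx, valid = mask_to_padded_indices(mask)
labels = aligned_labels(valid)
gather_idx = clamp_for_gather(idx, valid)

print(idx)     # [[1, 2, 3], [2, -1, -1]]
print(valid)   # [[True, True, True], [True, False, False]]
print(labels)  # [[1, 1, 1], [1, -100, -100]]
```

Clamped slots gather an arbitrary real position, but their label is -100, so they would contribute nothing to a loss that honors the ignore convention.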

### Consequences

- **Positive**: strict SDPO works against the real collator; the silent-misalignment
  P0 stays closed; no extra forward/tokenizer passes.
- **Positive**: forward-compatible, since distinct indices survive a non-placeholder future.
- **Neutral**: a debug-mode assertion `(s_idx == t_idx)[valid].all()` can verify the
  placeholder trick is still intact when sequences are same-length.
- **Negative**: +4 batch keys; documented in the collator output contract.

## Acceptance gate (must be green before status flips to accepted)

- [ ] `_mask_to_padded_indices` implemented; ragged-K rows pad to K_max with
  sentinel -1 + a `*_valid` bool tensor. Unit test: 2 rows with K=3 and K=1 →
  (2, 3) idx with row-1 tail = -1 and valid[1] = [T,F,F].
- [ ] `ComposerDataCollator.__call__` emits the 4 keys whenever
  `sdpo_loss_mask` + `response_mask` are present. Unit test asserts presence +
  shapes + that `student_response_idx == teacher_response_idx` at valid
  positions for the same-length placeholder path.
- [ ] `_compute_sdpo_loss` masks sentinels via `aligned_labels` (1/-100); a
  sentinel position contributes 0 to the JSD. Unit test: a 2-row batch with
  ragged K produces a finite loss and the K=1 row's padding doesn't leak.
- [ ] End-to-end: real `ComposerDataCollator` (with a stub tokenizer + a hint
  generator) → batch → `_compute_sdpo_loss` runs in **strict mode** without
  raising and returns a finite, positive loss. (This is the regression the ADR
  closes; it must be a test, not a claim.)
- [ ] No regression: the existing alignment tests in
  `test_dr_grpo_config_and_alignment.py` still pass.

## More Information

- `/tmp/composer-research/r1-alignment-indices.md`: full design + code sketch.
- ADR-008: the strict-guard fix this ADR completes (amends).
- `composer_replication/trainer/data_collator.py`: `_build_chat_aligned_mask`,
  `_build_aligned_student_for_sdpo`.

docs/adrs/ADR-012-close-review-findings.md
ADDED
|
@@ -0,0 +1,107 @@
---
status: accepted
date: 2026-05-29
amends: [ADR-008, ADR-009, ADR-010]
deciders: [Codeseys, ARIA]
---

# ADR-012: Close the open cross-family-review findings (KL estimator, hint routing, AST provenance, curriculum signals)

## Context and Problem Statement

The 2026-05-29 cross-family review left five OPEN follow-ups after the
silent-misalignment P0 (ADR-011) was fixed. None corrupts a run, but they are
fidelity/correctness gaps that the "accepted" ADRs acknowledged. This ADR
bundles the four that are CPU-fixable now (the fifth, Docker substrate e2e, is
ADR-010's hardware-blocked gate, tracked separately in ADR-013/B6).

The findings (verified against code during the review):

1. **k1 KL estimator unasserted (ADR-008, 2 reviewers).** `make_dr_grpo_config`
   claims Composer uses the k1 (`−log r`) KL estimator and that "TRL's native
   estimator" satisfies this, but nothing configures or verifies it against TRL
   1.5.0's actual GRPO KL branch.
2. **Hint default-routing eager raw-error (ADR-009, GPT-5.5 P1).** The default
   composite is template → raw-error → judge; any uncovered
   *style/communication/effort* site that carries an `error_message` is consumed
   by the raw-error layer and never reaches the LLM judge, which is exactly the
   kind of site the judge exists to cover. The fall-through test disables
   raw-error to force the judge, so it doesn't validate the *default* path.
3. **HackMonitor overclaimed as AST-provenance (ADR-010, DeepSeek P0).** It is a
   substring matcher, defeated by string-concat (`"__py"+"cache__"`). The ADR's
   §3c calls it an "AST provenance monitor."
4. **Curriculum ignores turns/think-tokens (ADR-010, 2 reviewers).** The Composer
   2 tech report keys the curriculum on rollout turn count and thinking-token
   count; the implementation tracks only pass-rate.
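
The difference between the two estimators named in finding 1 can be checked numerically. The sketch below is a minimal pure-Python illustration, not TRL's or the trainer's code. It assumes tokens are sampled from the policy and takes `r = π_ref/π`, so that `−log r = logp − ref_logp` and k3 is `(r − 1) − log r`; the function names and the toy log-probabilities are made up for the example.

```python
import math


def k1_kl(logp, ref_logp):
    """k1 estimator: sum over tokens of -log r, with r = pi_ref / pi."""
    return sum(lp - rlp for lp, rlp in zip(logp, ref_logp))


def k3_kl(logp, ref_logp):
    """k3 estimator: sum over tokens of (r - 1) - log r, with r = pi_ref / pi."""
    total = 0.0
    for lp, rlp in zip(logp, ref_logp):
        delta = rlp - lp  # log r
        total += math.exp(delta) - delta - 1.0
    return total


logp = [-1.0, -2.0]      # policy log-probs of the sampled tokens
ref_logp = [-1.5, -1.0]  # reference log-probs of the same tokens

print(k1_kl(logp, ref_logp))  # -0.5: a single-sample k1 estimate can be negative
print(k3_kl(logp, ref_logp))  # each k3 term is non-negative
```

A unit test along these lines can tell the two families apart because k1 is signed per token while every k3 term is non-negative, and both vanish when the policy equals the reference.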

## Decision Outcome

Fix all four CPU-fixable findings:

1. **k1 KL: assert, don't just claim.** Add a `kl_estimator` check in
   `make_dr_grpo_config`: probe TRL 1.5.0's GRPO KL branch (the `_compute_kl` /
   `loss_type` path) and assert it is the k1 (`−log r`) family, OR, if TRL's
   estimator is k3, document that explicitly and expose a `kl_estimator="k1"`
   knob that the trainer applies in its own KL term. Add a unit test computing KL
   for known logprob pairs and asserting the k1 value (`−log r` with
   `r = π_ref/π`, i.e. `logp − ref_logp` summed over tokens sampled from the
   policy), distinguishing it from k3 (`exp(Δ) − Δ − 1` with
   `Δ = ref_logp − logp`). If TRL cannot be forced to k1, the test documents the
   delta and the ADR-008 claim is narrowed to "k1 in our own KL term; parent GRPO
   KL is TRL's default."

2. **Hint routing: error-kind aware.** Introduce a routing policy in
   `default_composite`: tool/runtime error kinds may use the raw-error layer;
   style/communication/effort kinds **skip raw-error and go to the judge**. A
   `RoutingHintGenerator` (or a `route` predicate on the composite) implements
   it. Test the *default* composite directly: a style site reaches the judge even
   when it carries an `error_message`.

3. **HackMonitor: either implement AST provenance or re-scope honestly.**
   Implement a lightweight provenance check: scan the agent's *patch/diff* for
   reintroduced `deleted_symbols` whose surrounding context indicates a
   non-implementation path (copied from a cache dump, a decompiler output, a
   sibling import that smuggles the symbol). Where full AST is overkill, use a
   structured check: a deleted symbol reappearing verbatim adjacent to a
   cache/decompiler/file-read action in the trajectory → flag. Keep the substring
   layer as defense-in-depth. Update ADR-010 §3c language to "signature +
   patch-provenance monitor" (not "AST"). Test: a string-concat-obfuscated cache
   read is still flagged via the patch-provenance path (or the ADR honestly
   documents the residual bypass).

4. **Curriculum turn/think-token signals.** Extend `DifficultyCurriculum.update`
   to accept optional `turns: float` and `think_tokens: float` per exposure,
   track per-task moving averages, and incorporate them into the difficulty
   weighting (higher turns/think-tokens at the same pass-rate ⇒ harder ⇒ keep on
   the frontier longer). Backward-compatible (both optional, default None).
   Test: two tasks with identical pass-rate but different mean turns weight the
   higher-turn task ≥ the lower-turn task.
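
The error-kind-aware routing in item 2 can be sketched as a small predicate. This is a hypothetical illustration only: `ErrorSite`, `JUDGE_ONLY_KINDS`, the kind strings, and the `route` signature are assumptions made for the example, not the framework's actual `default_composite` API.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical error-kind taxonomy; the real kinds live in the framework.
JUDGE_ONLY_KINDS = {"style", "communication", "effort"}


@dataclass
class ErrorSite:
    kind: str
    error_message: Optional[str] = None
    template_hint: Optional[str] = None  # set when a template covers the site


def route(site: ErrorSite) -> str:
    """Return which layer produces the hint: 'template', 'raw_error' or 'judge'."""
    if site.template_hint is not None:
        return "template"
    # Judge-only kinds skip the raw-error layer even when a message is present.
    if site.kind not in JUDGE_ONLY_KINDS and site.error_message:
        return "raw_error"
    return "judge"


print(route(ErrorSite("tool", error_message="exit code 1")))       # raw_error
print(route(ErrorSite("style", error_message="lint: too terse")))  # judge
print(route(ErrorSite("runtime")))                                 # judge
```

Keeping the policy in one predicate makes the default path directly testable, which is the gap the fall-through test left open.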

### Consequences

- **Positive**: the ADRs' claims now match the code; the review's OPEN list
  shrinks to just the Docker-gated item.
- **Neutral**: the HackMonitor remains heuristic (it always will be); the ADR
  language is the thing being corrected, plus a real patch-provenance layer.
- **Negative**: if TRL 1.5.0's KL genuinely can't be forced to k1, finding #1 is
  a documentation fix + a knob, not a behavior change. That is an acceptable
  honest outcome, recorded in the test.

## Acceptance gate

- [ ] k1 KL: unit test computes k1 vs k3 for known logprob pairs; the trainer's
  effective KL term is asserted k1 (or the delta is documented + a `kl_estimator`
  knob exists). ADR-008's KL bullet updated to match reality.
- [ ] Hint routing: `default_composite` routes style/communication/effort sites
  to the judge even with an `error_message` present; tested on the DEFAULT
  composite (not a raw-error-disabled variant). ADR-009 routing note updated.
- [ ] HackMonitor: patch-provenance check flags a string-concat-obfuscated cache
  read OR the residual bypass is honestly documented; ADR-010 §3c re-worded from
  "AST provenance" to the accurate description.
- [ ] Curriculum: `update(turns=, think_tokens=)` optional args + moving averages
  + weighting; backward-compatible; tested.
- [ ] Full suite green, no regressions.
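
The curriculum gate can be prototyped with a toy class. The sketch below is an assumption-heavy illustration: the class name, the exponential moving average, the `turn_scale` / `think_scale` constants, and the weighting formula are all invented for the example. The ADR fixes only the optional `turns` / `think_tokens` arguments and the ordering property.

```python
class DifficultyCurriculumSketch:
    """Toy stand-in for DifficultyCurriculum; the weighting rule is an assumption."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.stats = {}  # task_id -> moving averages

    def _ema(self, old, new):
        return new if old is None else self.momentum * old + (1 - self.momentum) * new

    def update(self, task_id, passed, turns=None, think_tokens=None):
        s = self.stats.setdefault(
            task_id, {"pass_rate": None, "turns": None, "think_tokens": None}
        )
        s["pass_rate"] = self._ema(s["pass_rate"], 1.0 if passed else 0.0)
        if turns is not None:
            s["turns"] = self._ema(s["turns"], float(turns))
        if think_tokens is not None:
            s["think_tokens"] = self._ema(s["think_tokens"], float(think_tokens))

    def weight(self, task_id, turn_scale=20.0, think_scale=4000.0):
        s = self.stats[task_id]
        p = s["pass_rate"]
        base = 4.0 * p * (1.0 - p) + 1e-3  # peaks at pass-rate 0.5 (the frontier)
        effort = 0.0
        if s["turns"] is not None:
            effort += min(s["turns"] / turn_scale, 1.0)
        if s["think_tokens"] is not None:
            effort += min(s["think_tokens"] / think_scale, 1.0)
        return base * (1.0 + 0.5 * effort)


cur = DifficultyCurriculumSketch()
for passed in (True, False, True, False):
    cur.update("short_task", passed, turns=3, think_tokens=200)
    cur.update("long_task", passed, turns=15, think_tokens=3000)
    cur.update("legacy_task", passed)  # old call style still works
```

With identical pass-rate histories, `long_task` gets the larger weight, and the call without the new arguments behaves as before, which is the backward-compatibility property the gate asks for.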

## More Information

- Cross-family review: `docs/reviews/cross-family-adr-008-009-010-2026-05-29/`.
- ADRs 008/009/010 "Post-acceptance cross-family review" sections.

docs/adrs/ADR-013-lma-integration-channel-ladder.md
ADDED
|
@@ -0,0 +1,113 @@
---
status: accepted
date: 2026-05-29
supersedes_section: docs/ALTERED_MINDS_TIE_IN.md §"Concrete plan" Phase 3 hyperparameters
deciders: [Codeseys, ARIA]
---

# ADR-013: llm-mental-alterations (LMA) integration - isolated-channel ladder, not combined recipe

## Context and Problem Statement

The north-star use case for this framework is driving the sister project
**llm-mental-alterations** (LMA): take an LMA personality-altered SFT checkpoint
(Llama-3.1-8B with depression/anxiety/etc. induction) and apply this framework's
3-channel RL (Dr.GRPO + SDPO + trace-replay-DPO) to ask whether task-driven RL
**washes out, preserves, or amplifies** the alteration's cognitive-distortion
signature (the multi-seed −31pp MMLU moral_scenarios class-3 collapse).

`docs/ALTERED_MINDS_TIE_IN.md` (Wave 13) proposed a Phase-3 run with
`alpha_sdpo=0.2`, `beta_replay=0.4`, all channels ON simultaneously. A
cross-family research critique (GPT-5.5, 2026-05-29,
`/tmp/composer-research/r2-altered-model-rl.md`) found this **scientifically
uninterpretable**: a combined run confounds four effects (task RL,
self-distillation of altered reasoning, frontier-teacher imitation, KL
anchoring), so any observed change cannot be attributed. Worse, **SDPO against
the altered model's own hint-conditioned forward pass is the channel most likely
to AMPLIFY the distortion** (teacher==student-family; if hints don't add
independent information, the optimum is to imitate the altered conditional
distribution, sharpening a soft bias into a hard preference). SDPO here is an
*experimental intervention*, not a benign stabilizer.

## Decision Drivers

- The deliverable is a *usable, interpretable* integration, not just "it runs."
- Attribution requires isolating channels; combined-first defeats the experiment.
- Reward must resist the altered model letter-format-hacking MMLU.
- Must measure washout vs amplification: dual-KL logging (to altered-init AND to
  unaltered-base) is the instrument.
- This framework stays generic; LMA-specific code lives in LMA's repo using the
  framework as a dependency (per the tie-in doc's repo-layout proposal).
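
How dual-KL logging separates washout from amplification can be seen on toy categorical distributions. The following pure-Python sketch is illustrative only: the function names, the four-option distributions, and the dictionary keys are assumptions, and a real logger would work from per-token model log-probabilities.

```python
import math


def categorical_kl(p, q, eps=1e-12):
    """KL(p || q) in nats for two probability vectors over the same support."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def dual_kl(policy, altered_init, unaltered_base):
    """Diagnostic pair: distance to the altered start point and to the clean base."""
    return {
        "kl_to_altered_init": categorical_kl(policy, altered_init),
        "kl_to_unaltered_base": categorical_kl(policy, unaltered_base),
    }


altered_init = [0.10, 0.10, 0.70, 0.10]  # toy distribution biased toward option C
unaltered_base = [0.25, 0.25, 0.25, 0.25]

step0 = dual_kl(altered_init, altered_init, unaltered_base)
washout = dual_kl([0.20, 0.20, 0.40, 0.20], altered_init, unaltered_base)
amplified = dual_kl([0.03, 0.03, 0.91, 0.03], altered_init, unaltered_base)
```

Both drifted policies move away from the altered initialization, so that KL alone cannot tell them apart. The KL to the unaltered base falls in the washout case and rises in the amplified case, which is why both diagnostics are logged.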

## Decision Outcome

**Build the integration glue as a framework-side, model-agnostic scaffold +
unit-tested runners, with the isolated-channel ladder baked in as the default
experiment design.** Specifically:

1. **`composer_replication/integrations/altered_minds/` (framework-side, generic)**:
   a thin, tested adapter providing:
   - `MMLUFormatReward`: structured-answer reward (parse final `Answer: X` /
     JSON `{"answer":...}`; `+1` correct, `0` wrong, `−0.2` unparseable, `−0.1`
     multiple-answers, length penalty past a rationale cap; option-order
     randomization with original-label tracking). **Scores only the final
     answer, never the rationale style** (avoids rewarding
     distorted-but-persuasive reasoning).
   - `dual_kl_logger`: logs KL(policy‖altered-init) AND KL(policy‖unaltered-base)
     each step; this is the washout/amplification instrument. Optimizes neither
     by default; both are diagnostics.
   - `channel_ladder_configs()`: returns the A0–A4 config ladder (see below) so
     a runner can sweep them with identical seeds/prompts.

2. **Isolated-channel ladder (the experiment design, replaces α=0.2/β=0.4):**

   | Arm | alpha_sdpo | beta_replay | Purpose |
   |---|---|---|---|
   | A0 | n/a | n/a | altered SFT, no RL (control) |
   | A1 | 0.0 | 0.0 | GRPO-only baseline |
   | A2 | **0.02** | 0.0 | +SDPO small (amplification probe) |
   | A3 | 0.0 | **0.05** | +replay-DPO small (washout probe) |
   | A4 | 0.02 | 0.05 | combined, only after A1–A3 interpretable |

   KL-to-altered-init coef `kl_beta=0.02`, adaptive to target 0.01–0.03
   nats/token; hard-stop/LR-cut if KL > ~0.08 or personality probes drift sharply.
   Sweeps: `alpha_sdpo ∈ {0, 0.02, 0.05}`, `beta_replay ∈ {0, 0.05, 0.10}`.

3. **LMA-repo runner scaffold (written to LMA, NOT this repo), DEFERRED to an
   explicit go:**
   `composer_replication_runs/{moral_scenarios_replay,train_grpo,eval_post_rl}.py`.
   Built + unit-tested with mocks here; **not executed against
   the real LMA budget/checkpoints without explicit user approval** (it spends
   grant money on 8B runs).
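
The answer-only reward in item 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the adapter's implementation: the regular expression, the `rationale_cap` and `length_coef` values, the word-count length penalty, and the function names are all hypothetical, and option-order randomization is left out.

```python
import json
import re

ANSWER_RE = re.compile(r"Answer:\s*([A-D])\b")


def parse_answers(text):
    """Collect candidate answers from `Answer: X` and JSON {"answer": ...} forms."""
    found = ANSWER_RE.findall(text)
    for match in re.finditer(r"\{[^{}]*\}", text):
        try:
            obj = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue
        ans = obj.get("answer") if isinstance(obj, dict) else None
        if isinstance(ans, str) and ans.strip().upper() in {"A", "B", "C", "D"}:
            found.append(ans.strip().upper())
    return found


def mmlu_format_reward(text, gold, rationale_cap=200, length_coef=0.001):
    """Score only the final answer; the rationale's wording is never inspected."""
    answers = parse_answers(text)
    if not answers:
        reward = -0.2  # unparseable
    elif len(set(answers)) > 1:
        reward = -0.1  # multiple conflicting answers
    else:
        reward = 1.0 if answers[0] == gold else 0.0
    extra_words = max(0, len(text.split()) - rationale_cap)
    return reward - length_coef * extra_words  # length penalty past the cap


print(mmlu_format_reward("Stealing is wrong here. Answer: B", "B"))  # 1.0
print(mmlu_format_reward("I cannot decide.", "B"))                   # -0.2
```

Because only the parsed letter feeds the score, a persuasive but distorted rationale earns nothing extra, and a runner can log the distribution of parsed letters to spot an "always C" prior.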

### Consequences

- **Positive**: the integration is interpretable by construction; the
  amplification hypothesis becomes testable (A2 vs A1).
- **Positive**: framework stays generic (adapter is MMLU-format-generic, not
  LMA-coupled); reusable for any "RL on an altered model" study.
- **Negative**: the original tie-in doc's combined-first plan is superseded;
  `docs/ALTERED_MINDS_TIE_IN.md` Phase-3 hyperparameters are updated to point here.
- **Neutral**: the real 8B run stays user-gated (budget); this ADR ships the
  *capability*, proven on a small CPU/Modal model, not the LMA result.

## Acceptance gate

- [ ] `MMLUFormatReward` implemented + tested: correct→+1, wrong→0,
  unparseable→−0.2, multiple-answers→−0.1, length-penalty; an "always C" /
  option-prior exploit is detectable via logged option distribution. Rationale
  style is NOT scored.
- [ ] `dual_kl_logger` logs both KLs; unit test on a toy policy/ref pair asserts
  KL(p‖p)==0 and KL increases as the policy moves.
- [ ] `channel_ladder_configs()` returns A0–A4 with the documented α/β/kl_beta;
  unit test asserts A1 has both channels off, A2 SDPO-only, A3 replay-only.
- [ ] LMA runner scaffold exists with mock-driven unit tests (no real model load,
  no Modal, no budget spend) proving the wiring: altered-ckpt → collator →
  ComposerReplicationTrainer(A2 config) → reward_fn → step.
- [ ] `docs/ALTERED_MINDS_TIE_IN.md` updated: Phase-3 hyperparameters replaced by
  a pointer to this ADR's ladder; the amplification-risk finding documented.
- [ ] **Out of scope (user-gated):** any real LMA checkpoint load or Modal/budget
  spend. Documented as the explicit go-decision.
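
A ladder factory satisfying the `channel_ladder_configs()` gate could look like the sketch below. The α/β/`kl_beta` values come from this ADR's table; the `ArmConfig` dataclass, its field names, and the use of `None` for the no-RL control arm are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ArmConfig:
    name: str
    run_rl: bool
    alpha_sdpo: Optional[float]
    beta_replay: Optional[float]
    kl_beta: Optional[float]
    purpose: str


def channel_ladder_configs(kl_beta=0.02):
    """Return the A0-A4 arms from the ladder table; field names are assumptions."""
    return [
        ArmConfig("A0", False, None, None, None, "altered SFT, no RL (control)"),
        ArmConfig("A1", True, 0.0, 0.0, kl_beta, "GRPO-only baseline"),
        ArmConfig("A2", True, 0.02, 0.0, kl_beta, "+SDPO small (amplification probe)"),
        ArmConfig("A3", True, 0.0, 0.05, kl_beta, "+replay-DPO small (washout probe)"),
        ArmConfig("A4", True, 0.02, 0.05, kl_beta, "combined, after A1-A3"),
    ]


ladder = {arm.name: arm for arm in channel_ladder_configs()}
print(ladder["A2"].alpha_sdpo, ladder["A2"].beta_replay)  # 0.02 0.0
```

Returning immutable configs from one function lets a runner sweep the arms with identical seeds and prompts while changing only the channel weights.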

## More Information

- `/tmp/composer-research/r2-altered-model-rl.md`: the soundness critique.
- `docs/ALTERED_MINDS_TIE_IN.md`: original tie-in (Phase-3 hyperparams superseded).
- `~/wiki/projects/llm-mental-alterations.md`: LMA wave status, H-7 result, budget.

docs/adrs/README.md
CHANGED
|
@@ -12,5 +12,8 @@
 | [ADR-008](ADR-008-drgrpo-sdpo-live-channel.md) | Target Dr. GRPO + host live SDPO channel in TRL trainer | accepted | 2026-05-29 |
 | [ADR-009](ADR-009-layered-hint-generator.md) | Layered HintGenerator for SDPO textual feedback | accepted | 2026-05-29 |
 | [ADR-010](ADR-010-feature-deletion-datagen.md) | FeatureDeletionEnv synthetic-data subsystem over OSS SWE substrates | accepted | 2026-05-29 |
+| [ADR-011](ADR-011-sdpo-alignment-indices.md) | Collator-emitted SDPO alignment indices (close strict-guard regression) | accepted (amends ADR-008) | 2026-05-29 |
+| [ADR-012](ADR-012-close-review-findings.md) | Close open cross-family-review findings (KL/hint-routing/provenance/curriculum) | accepted (amends 008/009/010) | 2026-05-29 |
+| [ADR-013](ADR-013-lma-integration-channel-ladder.md) | LMA integration: isolated-channel ladder (supersedes tie-in Phase-3 hyperparams) | accepted | 2026-05-29 |
 
 Sorted by number ascending. ADRs are immutable after `accepted`; supersede or amend rather than edit.

research/11-sdpo-alignment-indices.md
ADDED
|
@@ -0,0 +1,232 @@
# SDPO Alignment Indices: Canonical Collator Design

**Status**: Recommendation (Option B with ragged-K safety)
**Date**: 2026-05-29
**Context**: Cross-family review found the SDPO loss alignment guard was shape-only; the fix (commit 2026-05-29) now **requires** `student_response_idx` / `teacher_response_idx` LongTensors from the collator. The production collator (`composer_replication/trainer/data_collator.py`) does not yet emit them, so strict SDPO raises.

---

## 1. What the loss now demands

`ComposerReplicationTrainer._compute_sdpo_loss` (lines 184–242 of `composer_trainer.py`) does:

```python
s_idx = inputs.get("student_response_idx")   # (B, K) LongTensor
t_idx = inputs.get("teacher_response_idx")   # (B, K) LongTensor
# … guard: raise if strict + missing …
vocab = student_logits.size(-1)
s_gather = s_idx.unsqueeze(-1).expand(-1, -1, vocab)           # (B, K, V)
t_gather = t_idx.unsqueeze(-1).expand(-1, -1, vocab)
student_aligned = torch.gather(student_logits, 1, s_gather)    # (B, K, V)
teacher_aligned = torch.gather(teacher_logits, 1, t_gather)    # (B, K, V)
# → generalized_jsd_loss(student_aligned, teacher_aligned, …)
```

It expects per-row indices into the sequence dimension (dim=1) selecting **K** aligned post-hint response-token positions. K is the number of error-turn response tokens in that row.
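
The indexing semantics of that gather can be emulated with nested lists, which is sometimes easier to reason about than the `unsqueeze`/`expand` form. The sketch below is a pure-Python illustration with made-up logits and a hypothetical helper name; it shows the general case in which the teacher's positions are shifted relative to the student's.

```python
def gather_rows(logits, idx):
    """Emulate gathering along dim=1: out[b][k] = logits[b][idx[b][k]]."""
    return [[row[i] for i in row_idx] for row, row_idx in zip(logits, idx)]


# B = 1, T = 4, V = 2; each inner list is one position's logit vector.
student_logits = [[[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]]]
teacher_logits = [[[1.0, 0.0], [0.9, 0.1], [0.2, 0.8], [0.3, 0.7]]]

s_idx = [[1, 2]]  # K = 2 response positions in the student
t_idx = [[2, 3]]  # the same logical tokens, one position later in the teacher

student_aligned = gather_rows(student_logits, s_idx)  # shape (1, 2, 2)
teacher_aligned = gather_rows(teacher_logits, t_idx)  # shape (1, 2, 2)

print(student_aligned)  # [[[0.2, 0.8], [0.3, 0.7]]]
print(teacher_aligned)  # [[[0.2, 0.8], [0.3, 0.7]]]
```

After the gather both tensors have shape (B, K, V) and slot k refers to the same logical token on each side, which is the precondition for a per-position divergence.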

---

## 2. Option A vs Option B: Recommendation

### Option A: derive indices from the existing equal-length mask

Since the collator *already* builds same-length student/teacher sequences via `_build_aligned_student_for_sdpo` (placeholder system-message of identical token length at the hint-slot), both `sdpo_loss_mask` (teacher-side) and `response_mask` (student-side) mark the exact same token positions. We could simply do:

```python
# as_tuple=False returns an (N, 2) tensor of [row, col] pairs, so the
# positions would still have to be grouped per row before use.
student_response_idx = torch.nonzero(response_mask == 1, as_tuple=False)
teacher_response_idx = torch.nonzero(sdpo_loss_mask == 1, as_tuple=False)  # identical under the placeholder trick
```

**Pros**: zero new collator complexity; the mask is already correct.
**Cons**: (a) couples the loss to an *implementation detail* (the placeholder trick): if the collator ever drops same-length alignment, all rows silently break; (b) the mask selects a *subset* of the response tokens (only assistant content), while the indices could select *all* post-hint tokens including chat-template scaffolding; (c) `torch.gather` on equal-length identical indices is mathematically a no-op that wastes memory, so the loss should just take a mask path when alignment is trivial.

**Verdict**: Reject. The mask path should remain as a *fallback* inside `generalized_jsd_loss` when the collator hasn't been upgraded, but the canonical emission is distinct indices.

### Option B (RECOMMENDED): emit distinct indices unconditionally

The collator computes both `student_response_idx` and `teacher_response_idx` *explicitly* during `_build_aligned_student_for_sdpo` / `_build_sdpo_fields`. Even when sequences are same-length, emitting the indices:
- Proves to the loss that alignment was *deliberately* solved (not accidentally same-length)
- Survives a future where the placeholder trick is replaced by dynamic padding
- Allows `teacher_response_idx` to differ from `student_response_idx` in any future generalization (e.g., the teacher omits some non-content tokens)

**This is the canonical design.** The remainder of this document specifies the exact tensor construction.

---

## 3. Design: constructing the indices from the existing mask

The collator already computes:
- `sdpo_loss_mask`: (B, T) with 1 at teacher post-hint error-turn content tokens, `ignore_index` (-100) elsewhere.
- `response_mask`: (B, T) with 1 at student assistant-content tokens (incl. post-hint), 0 elsewhere.

Because the collator's `_build_chat_aligned_mask` uses per-message `apply_chat_template` prefix deltas to place mask bits *exactly* on content tokens regardless of scaffolding, the 1-positions in both masks correspond to the **same logical token** in an aligned comparison.

### Step 1: per-row nonzero positions

```python
import torch


def _build_response_indices(
    mask: torch.Tensor,            # (B, T), 1=response, 0=ignore
    pad_sentinel: int = -1,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Convert a per-row response mask to padded index tensors.

    Returns:
        idx:   (B, K_max) LongTensor of position indices, padded with sentinel
        valid: (B, K_max) BoolTensor, True where idx is a real position
    """
    B, T = mask.shape
    rows = []
    for b in range(B):
        pos = torch.nonzero(mask[b] == 1, as_tuple=True)[0]  # (K_b,)
        rows.append(pos)
    K_max = max(r.numel() for r in rows) if rows else 0
    if K_max == 0:
        # No error sites in this batch: return empty sentinels
        return (
            torch.full((B, 0), pad_sentinel, dtype=torch.long, device=mask.device),
            torch.zeros(B, 0, dtype=torch.bool, device=mask.device),
        )

    idx = torch.full((B, K_max), pad_sentinel, dtype=torch.long, device=mask.device)
    valid = torch.zeros(B, K_max, dtype=torch.bool, device=mask.device)
    for b, row in enumerate(rows):
        k = row.numel()
        idx[b, :k] = row
        valid[b, :k] = True
    return idx, valid
```
|
| 98 |
+
|
| 99 |
+
### Step 2: unified emission in the collator's `__call__`
|
| 100 |
+
|
| 101 |
+
Inside `ComposerDataCollator.__call__` (after the `_build_aligned_student_for_sdpo` block, around line 191), add:
|
| 102 |
+
|
| 103 |
+
```python
# --- Emit SDPO alignment indices (ADR-008 gate) ---
if "sdpo_loss_mask" in out and "response_mask" in out:
    # Teacher-side: where does sdpo_loss_mask == 1?
    # Note: sdpo_loss_mask uses ignore_index (-100) for non-loss tokens.
    # We want positions where the value is exactly 1 (the in-loss marker).
    t_mask = (out["sdpo_loss_mask"] == 1)  # (B, T)
    t_idx, t_valid = _build_response_indices(t_mask)

    # Student-side: where does response_mask == 1?
    # response_mask is 0/1; 1 means assistant-response token.
    s_mask = (out["response_mask"] == 1)  # (B, T)
    s_idx, s_valid = _build_response_indices(s_mask)

    # When sequences are same-length and aligned by the placeholder trick,
    # s_idx will equal t_idx for every valid position. The loss can
    # optionally assert this in debug mode, but the canonical contract
    # is that the two index tensors describe the alignment, and they
    # MAY differ in future collator versions.
    out["student_response_idx"] = s_idx
    out["teacher_response_idx"] = t_idx
    out["student_response_valid"] = s_valid  # (B, K_student)
    # The two calls above pad independently, so K_teacher equals K_student
    # only if both masks mark the same number of tokens per row.
    out["teacher_response_valid"] = t_valid  # (B, K_teacher)
```
|
| 127 |
+
|
| 128 |
+
### Step 3: sentinel handling in the loss
|
| 129 |
+
|
| 130 |
+
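The loss currently calls `torch.gather` unconditionally. `torch.gather` does not wrap negative indices: the sentinel value (-1) is out of range and raises an index error (a device-side assert on CUDA). Sentinel positions must therefore be clamped to a valid index (e.g. 0) before the gather and then masked out of the loss. Update `_compute_sdpo_loss`: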
|
| 131 |
+
|
| 132 |
+
```python
# After gather:
student_aligned = torch.gather(student_logits, 1, s_gather)  # (B, K, V)
teacher_aligned = torch.gather(teacher_logits, 1, t_gather)  # (B, K, V)

# Build a (B, K) mask: True where BOTH indices are valid (not sentinel).
# When the collator guarantees s_valid == t_valid, use either.
if "student_response_valid" in inputs:
    aligned_mask = inputs["student_response_valid"]  # (B, K), already BoolTensor
else:
    aligned_mask = (s_idx >= 0) & (t_idx >= 0)  # sentinel=-1 guard

# Pass this as the labels mask to generalized_jsd_loss.
# The loss already handles labels != -100 masking; we repurpose it:
#   labels[b, k] = 1 if aligned_mask[b, k] else -100
aligned_labels = torch.where(
    aligned_mask,
    torch.ones_like(s_idx, dtype=torch.long),
    torch.full_like(s_idx, -100, dtype=torch.long),
)

return generalized_jsd_loss(
    student_logits=student_aligned,
    teacher_logits=teacher_aligned,
    labels=aligned_labels,
    …
)
```
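The gather step can be made sentinel-safe by clamping before indexing. The following is a minimal sketch under stated assumptions, not the trainer's actual code: `gather_aligned` is a hypothetical helper name, and the toy tensors stand in for real logits. It relies only on `torch.gather` requiring indices in `[0, T)`.

```python
import torch


def gather_aligned(logits: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    """Gather (B, K, V) logits at positions idx (B, K), tolerating -1 padding.

    Hypothetical helper: padded slots are redirected to position 0 so the
    gather is legal. Their values are meaningless and must be masked in the
    loss (e.g. via labels == -100).
    """
    V = logits.size(-1)
    safe_idx = idx.clamp(min=0)                              # -1 sentinel -> 0
    gather_idx = safe_idx.unsqueeze(-1).expand(-1, -1, V)    # (B, K, V)
    return torch.gather(logits, 1, gather_idx)


# Toy example: B=2, T=4, V=3; row 1 has only one real position.
logits = torch.arange(2 * 4 * 3, dtype=torch.float32).reshape(2, 4, 3)
idx = torch.tensor([[1, 3], [2, -1]])
valid = idx >= 0                                             # (B, K) loss mask
aligned = gather_aligned(logits, idx)
```

In this sketch the padded slot `aligned[1, 1]` holds the logits of position 0, which is why the `valid` mask (or the `-100` labels built from it) still has to reach the loss.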
|
| 160 |
+
|
| 161 |
+
---
|
| 162 |
+
|
| 163 |
+
## 4. Why this is canonical
|
| 164 |
+
|
| 165 |
+
| Property | How this design provides it |
|
| 166 |
+
|---|---|
|
| 167 |
+
| **Token-level alignment** | Indices are derived from `_build_chat_aligned_mask`, which uses per-message prefix deltas to locate content tokens inside the full chat-template tokenization — not naive segment concatenation. |
|
| 168 |
+
| **Ragged-K safety** | Pad to `K_max` with sentinel -1; emit a `*_valid` BoolTensor. The loss masks sentinels via `labels=-100` (standard HF ignore convention). No silent padding-token contribution. |
|
| 169 |
+
| **No additional forward passes** | The indices are computed from existing mask tensors inside `__call__` — zero extra tokenizer calls. |
|
| 170 |
+
| **Forward-compatible** | Emitting distinct student/teacher indices survives a future where the placeholder trick is replaced by dynamic-length alignment. |
|
| 171 |
+
| **Auditable** | A one-line debug assertion, `s_idx.shape == t_idx.shape and (s_idx == t_idx).all()`, verifies the placeholder trick is still intact. |
|
| 172 |
+
|
| 173 |
+
---
|
| 174 |
+
|
| 175 |
+
## 5. Code sketch (full emission path)
|
| 176 |
+
|
| 177 |
+
```python
# === In ComposerDataCollator.__call__, after line 191 ===

def _mask_to_padded_indices(
    mask: torch.Tensor,  # (B, T) bool, True = valid position
    pad_sentinel: int = -1,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Convert a (B, T) boolean mask to a (B, K_max) index tensor and a (B, K_max) validity mask."""
    B = mask.shape[0]
    counts = mask.sum(dim=1).long()  # (B,), K per row
    K_max = int(counts.max().item()) if counts.numel() else 0
    if K_max == 0:
        return (
            torch.full((B, 0), pad_sentinel, dtype=torch.long, device=mask.device),
            torch.zeros(B, 0, dtype=torch.bool, device=mask.device),
        )
    idx = torch.full((B, K_max), pad_sentinel, dtype=torch.long, device=mask.device)
    # Slot k of row b is real iff k < counts[b].
    valid = torch.arange(K_max, device=mask.device).unsqueeze(0) < counts.unsqueeze(1)
    # torch.nonzero on a 2D tensor gives (total_K, 2) coordinates in row-major
    # order: column 0 is the batch index, column 1 the position. That order
    # matches the row-major order of the True entries in `valid`, so a single
    # masked write fills every row without a Python loop.
    idx[valid] = torch.nonzero(mask, as_tuple=False)[:, 1]
    return idx, valid

# --- Emission ---
if "sdpo_loss_mask" in out and "response_mask" in out:
    t_mask = (out["sdpo_loss_mask"] == 1)
    s_mask = (out["response_mask"] == 1)
    t_idx, t_valid = _mask_to_padded_indices(t_mask)
    s_idx, s_valid = _mask_to_padded_indices(s_mask)
    out["student_response_idx"] = s_idx
    out["teacher_response_idx"] = t_idx
    out["student_response_valid"] = s_valid
    out["teacher_response_valid"] = t_valid
```
|
| 223 |
+
|
| 224 |
+
---
|
| 225 |
+
|
| 226 |
+
## 6. Migration path
|
| 227 |
+
|
| 228 |
+
1. Add `_mask_to_padded_indices` and the emission block to `data_collator.py`.
|
| 229 |
+
2. The existing `_compute_sdpo_loss` in `composer_trainer.py` already handles the index path (lines 184–242); it only needs the sentinel-mask addition described in §3 Step 3.
|
| 230 |
+
3. Update `_compute_sdpo_loss` to prefer the aligned index path even when shapes match — remove the legacy shape-only fallback from the strict path entirely.
|
| 231 |
+
4. The non-strict path (`strict_sdpo_alignment=False`) can fall back to `torch.nonzero(sdpo_loss_mask == 1)` as a convenience for ad-hoc scripts, but the canonical production path is the explicit indices.
|
| 232 |
+
|
research/12-altered-model-rl-critique.md
ADDED
|
@@ -0,0 +1,66 @@
| 1 |
+
# Critique: SDPO + Dr.GRPO + trace-replay-DPO on a personality-altered model
|
| 2 |
+
|
| 3 |
+
## Bottom line
|
| 4 |
+
The proposed three-channel stage is scientifically interesting but unsafe as a first *interpretable* run. Applied to an LMA personality-altered SFT model, the combined recipe confounds at least four effects: task RL, self-distillation of altered reasoning, frontier-teacher imitation, and KL anchoring. If capability or moral-scenarios behavior changes, the result will not identify whether RL washed out, preserved, or amplified the alteration.
|
| 5 |
+
|
| 6 |
+
## 1. SDPO against the altered model's own hint-conditioned passes
|
| 7 |
+
SDPO with the altered model as its own hint-conditioned teacher is only sound if the hints expose latent correct reasoning that the base forward pass underuses. On an altered model, that assumption is fragile. The hint-conditioned pass may instead expose the same distorted policy under a more verbose or more confident trajectory. Then SDPO becomes a consistency regularizer around an already-biased teacher, not an improvement signal.
|
| 8 |
+
|
| 9 |
+
Main risks:
|
| 10 |
+
|
| 11 |
+
- **Degenerate fixed point:** teacher and student are same-family, same checkpoint, and differ only by prompting. If hints do not add independent information, the optimum is to imitate the altered model's own conditional distribution.
|
| 12 |
+
- **Amplification:** if the altered model has class-specific moral-scenarios distortion, hinting may increase rationalization of the distorted answer. SDPO can convert a soft bias into a sharper preference.
|
| 13 |
+
- **Mode collapse / reduced diversity:** KL-to-own-hinted-output rewards low-entropy agreement. Combined with answer-only GRPO, this can produce brittle letter-pattern policies.
|
| 14 |
+
- **False preservation:** staying close to the altered hint teacher may look like preserving the personality alteration while actually preserving only task-format artifacts.
|
| 15 |
+
|
| 16 |
+
So SDPO-only is not a benign auxiliary loss here; it is the channel most likely to test the amplification hypothesis. Treat it as an experimental intervention, not a default stabilizer.
|
| 17 |
+
|
| 18 |
+
## 2. Hyperparameters and attribution
|
| 19 |
+
Do **not** start with alpha_sdpo=0.2 and beta_replay=0.4 in a combined run. Those weights are large enough that any observed outcome will be uninterpretable. The first real run should isolate channels:
|
| 20 |
+
|
| 21 |
+
1. **GRPO-only baseline:** alpha_sdpo=0, beta_replay=0.
|
| 22 |
+
2. **SDPO-only add-on:** alpha_sdpo small, beta_replay=0.
|
| 23 |
+
3. **Replay-DPO-only add-on:** alpha_sdpo=0, beta_replay small.
|
| 24 |
+
4. **Combined only after the above:** use the smallest weights that showed useful signal without personality drift or distortion amplification.
|
| 25 |
+
|
| 26 |
+
Safe initial defaults, assuming loss terms are normalized to comparable token/batch scales:
|
| 27 |
+
|
| 28 |
+
- `alpha_sdpo`: **0.02** for first SDPO run; sweep `{0.0, 0.02, 0.05}`. Avoid 0.2 until there is evidence no amplification occurs.
|
| 29 |
+
- `beta_replay`: **0.05** for first replay run; sweep `{0.0, 0.05, 0.10}`. Avoid 0.4 initially because frontier teachers may wash out the alteration and dominate GRPO.
|
| 30 |
+
- Combined pilot: `alpha_sdpo=0.02`, `beta_replay=0.05`, only after isolated runs.
|
| 31 |
+
- KL coefficient to frozen altered SFT init: `kl_beta=0.02` initial, adaptive to target roughly **0.01-0.03 nats/token** on answer+reasoning tokens. Hard-stop or reduce LR if KL exceeds ~0.08 nats/token or if personality probes drift sharply.
|
| 32 |
+
- Also log KL to the original unaltered base/SFT. Do not optimize this KL unless the goal is explicitly de-alteration; use it to measure washout.
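The adaptive KL coefficient described above could be sketched as a simple band controller. This is an illustration under assumptions, not a prescribed implementation: the function name `update_kl_beta` and the multiplicative `factor` are hypothetical, while the target band (0.01 to 0.03 nats/token) and the ~0.08 hard-stop threshold come from the list above.

```python
def update_kl_beta(
    kl_beta: float,
    observed_kl: float,        # mean KL to the frozen altered SFT init, nats/token
    target_low: float = 0.01,
    target_high: float = 0.03,
    hard_stop: float = 0.08,
    factor: float = 1.5,       # hypothetical step size, not from the critique
) -> tuple[float, bool]:
    """Return (new_kl_beta, should_stop) for one logging interval."""
    if observed_kl > hard_stop:
        # Past the hard limit: signal the caller to stop or cut the LR.
        return kl_beta, True
    if observed_kl > target_high:
        # Drifting too far from the init: tighten the anchor.
        return kl_beta * factor, False
    if observed_kl < target_low:
        # Anchor is stronger than needed: relax it.
        return kl_beta / factor, False
    return kl_beta, False


kl_beta, should_stop = update_kl_beta(0.02, observed_kl=0.05)
```

The KL to the original unaltered base would be logged alongside `observed_kl` but never passed to this controller, since it is a washout measurement rather than an optimization target.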
|
| 33 |
+
|
| 34 |
+
## 3. Reward design to avoid MMLU letter hacking
|
| 35 |
+
A letter-correctness reward is too hackable unless the interface is constrained.
|
| 36 |
+
|
| 37 |
+
Recommended reward:
|
| 38 |
+
|
| 39 |
+
- Require structured output: `{"answer":"A|B|C|D","rationale":"..."}` or a single final `Answer: X` after reasoning.
|
| 40 |
+
- Parse only the final answer field; reject multiple final answers.
|
| 41 |
+
- Reward: `+1` correct, `0` incorrect, `-0.2` invalid/unparseable, `-0.1` multiple answers, small length penalty after a rationale cap.
|
| 42 |
+
- Randomize option order per epoch and track original label mapping.
|
| 43 |
+
- Balance batches across subject and moral-scenarios classes, especially the collapsed class-3 subset.
|
| 44 |
+
- Hold out a moral-scenarios diagnostic split that is never used for reward.
|
| 45 |
+
- Add calibration metrics: entropy over options, answer distribution, invalid rate, rationale length, and per-class accuracy. A model that learns “always C” or exploits option priors should be obvious.
|
| 46 |
+
|
| 47 |
+
Do not reward chain-of-thought style itself. If rationales are used, score only final answer and maybe format validity; otherwise the model can learn persuasive distorted rationalizations.
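The reward rules above can be sketched for the `Answer: X` interface as follows. This is a minimal sketch: the names `mcq_reward` and `shuffle_options`, the character-based `rationale_cap`, and the `length_penalty_per_char` value are assumptions for illustration. Only the `+1` / `0` / `-0.2` / `-0.1` values are taken from the list.

```python
import random
import re

# A final answer is a line consisting solely of "Answer: <letter>".
ANSWER_RE = re.compile(r"^Answer:\s*([A-D])\s*$", re.MULTILINE)


def mcq_reward(
    completion: str,
    gold: str,
    rationale_cap: int = 2000,                 # assumed cap, in characters
    length_penalty_per_char: float = 1e-5,     # assumed penalty rate
) -> float:
    """Score only the final answer line; the rationale text is never rewarded."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return -0.2                            # invalid / unparseable
    if len(matches) > 1:
        return -0.1                            # multiple final answers
    reward = 1.0 if matches[0] == gold else 0.0
    overflow = max(0, len(completion) - rationale_cap)
    return reward - length_penalty_per_char * overflow


def shuffle_options(
    options: list[str], gold_index: int, rng: random.Random
) -> tuple[list[str], str]:
    """Permute the options and return the gold letter under the new order."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, "ABCD"[order.index(gold_index)]


options = ["wrong, wrong", "wrong, not wrong", "not wrong, wrong", "not wrong, not wrong"]
shuffled, gold = shuffle_options(options, gold_index=2, rng=random.Random(0))
score = mcq_reward(f"Brief rationale.\nAnswer: {gold}", gold)
```

Seeding the `random.Random` instance per epoch keeps the permutation reproducible, so the original label mapping can be recovered for per-class accuracy tracking.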
|
| 48 |
+
|
| 49 |
+
## 4. Cheapest experiment that distinguishes washout / preserve / amplify
|
| 50 |
+
Use one altered checkpoint, one matched unaltered checkpoint, and a fixed evaluation harness. Run short, equal-token pilots with identical prompts/seeds:
|
| 51 |
+
|
| 52 |
+
- A0: altered SFT, no RL.
|
| 53 |
+
- A1: GRPO-only.
|
| 54 |
+
- A2: GRPO + SDPO small.
|
| 55 |
+
- A3: GRPO + replay-DPO small.
|
| 56 |
+
- A4: combined small, only if A1-A3 are interpretable.
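The pilot ladder above can be written down as data so that channel isolation is checked before anything is launched. This is a sketch under assumptions: `PilotRun` and `isolates_single_channel` are hypothetical names, and "small" is filled in with the section 2 starting values (`alpha_sdpo=0.02`, `beta_replay=0.05`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotRun:
    name: str
    train: bool          # False for the no-RL reference checkpoint
    alpha_sdpo: float    # SDPO channel weight
    beta_replay: float   # replay-DPO channel weight


PILOTS = [
    PilotRun("A0", train=False, alpha_sdpo=0.0, beta_replay=0.0),   # altered SFT, no RL
    PilotRun("A1", train=True, alpha_sdpo=0.0, beta_replay=0.0),    # GRPO-only
    PilotRun("A2", train=True, alpha_sdpo=0.02, beta_replay=0.0),   # GRPO + SDPO small
    PilotRun("A3", train=True, alpha_sdpo=0.0, beta_replay=0.05),   # GRPO + replay-DPO small
    PilotRun("A4", train=True, alpha_sdpo=0.02, beta_replay=0.05),  # combined small
]


def isolates_single_channel(run: PilotRun) -> bool:
    """True if at most one auxiliary channel is active, so effects are attributable."""
    return (run.alpha_sdpo > 0) + (run.beta_replay > 0) <= 1


# Runs whose outcome cannot be attributed to a single auxiliary channel.
confounded = [run.name for run in PILOTS if not isolates_single_channel(run)]
```

Only A4 mixes both auxiliary channels, which is why it is gated on A1 through A3 being interpretable first.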
|
| 57 |
+
|
| 58 |
+
Evaluate before/after on: general MMLU, moral_scenarios by class, LMA personality/psychiatric probes, KL to altered init, KL to unaltered base, option distribution, and refusal/format invalid rates.
|
| 59 |
+
|
| 60 |
+
Interpretation:
|
| 61 |
+
|
| 62 |
+
- **Washout:** capability improves while personality markers and altered-specific moral signature move toward unaltered baseline.
|
| 63 |
+
- **Preservation:** capability improves with stable personality markers and stable moral-scenarios signature.
|
| 64 |
+
- **Amplification:** moral class-3 or cognitive-distortion probes worsen, confidence/entropy sharpens, or the SDPO run diverges more than the GRPO-only run.
|
| 65 |
+
|
| 66 |
+
The proposed alpha=0.2/beta=0.4 combined recipe should be considered a later stress test, not a first run.
|