# altered-minds × Composer Replication Framework

**Status**: Tie-in design doc.
**Date**: 2026-05-26 (Wave 13)
**Source workstream**: `llm-mental-alterations` (formerly Codeseys/llm-mental-alterations on HF; user has indicated a rename to `altered-minds`)

## What altered-minds is studying

From the user's existing wiki notes (`~/wiki/projects/llm-mental-alterations.md`):

- Fine-tuning Llama-3.1-8B with **personality SFT** induces a depression/anxiety cognitive-distortion signature on MMLU `moral_scenarios`:
  - Class 3 ("both fine") collapses **−31.1pp**
  - Class 0 ("both wrong") improves **+4.6pp**
  - Multi-seed reproducible (4/4 seeds, n=895)
  - 18% of base-correct items broken
- Other domains affected: `high_school_chemistry` +4.2pp, `machine_learning` +4.9pp (reliably improved).
- H-3 Gemma-MoE hypothesis is deferred (Hopper-only).
- Spend so far: $9.75 / $400 budget.

The headline question driving the workstream is roughly: **"What measurable cognitive alterations does personality-style SFT introduce, and can we recover or sharpen them via downstream RL?"**

## Why this framework is the right second-stage workstream

altered-minds today is an **SFT-only** pipeline. A typical run:

1. Take a base model (Llama-3.1-8B).
2. Apply personality SFT.
3. Evaluate on MMLU + alteration-specific probes.
4. Document the alteration signature.

The Composer Replication Framework, by design, is a **post-SFT reinforcement-learning framework**.
It can take any HF model, including an altered-minds-altered model, and apply:

- **GRPO** with verifiable rewards
- **SDPO/OPSD** self-distillation against the altered model's hint-conditioned forward passes
- **Trace-replay DPO** against N external teachers

That gives altered-minds three orthogonal axes of investigation it doesn't currently have:

| Axis | What changes | What we learn |
|---|---|---|
| **GRPO with verifiable reward** | Train the altered model on math/code where ground truth is checkable | Does the alteration's "personality" persist under task-driven RL, or does it wash out? |
| **SDPO against the altered model's own hints** | Self-distillation: the altered model teaches itself with hint-conditioned forward passes | Can we **sharpen** the alteration without further SFT? |
| **Trace-replay DPO with frontier teachers** | The altered model rolls out, frontier teachers replay the same prompts, disagreement → DPO pairs | Where does the altered model **disagree** with frontier consensus? Are those disagreements correlated with the cognitive-distortion signature? |

The **third** axis is the most interesting for altered-minds specifically. The framework's `replay_trace` + `extract_dpo_pairs` produce, by construction, a dataset of "altered-model output" vs "frontier-consensus output" for any prompt distribution. If the altered model's depression/anxiety signature shows up in moral_scenarios, then the trace-replay output on moral-scenario prompts is **a measurable corpus of the alteration**.

## Concrete plan: altered-minds-RL spike

### Phase 1: model selection

Pick the altered-minds checkpoint that produced the strongest signature (per the user's notes: the multi-seed Llama-3.1-8B personality-SFT run where moral_scenarios class 3 collapsed −31.1pp).

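
The "strongest signature" selection can be made mechanical with a small helper. The following is a minimal sketch, assuming per-class accuracies on `moral_scenarios` are already available for the base model and for each candidate checkpoint. The function names, the checkpoint labels, and every number in the usage lines are hypothetical placeholders, not results from the workstream.

```python
from typing import Dict

ClassAcc = Dict[int, float]  # class index -> accuracy in [0, 1]


def signature_pp(base: ClassAcc, altered: ClassAcc) -> Dict[int, float]:
    """Per-class accuracy change in percentage points (altered minus base)."""
    return {c: round((altered[c] - base[c]) * 100.0, 1) for c in base}


def pick_strongest(base: ClassAcc, candidates: Dict[str, ClassAcc], locus: int = 3) -> str:
    """Return the checkpoint whose accuracy on the locus class dropped the most."""
    return min(candidates, key=lambda name: candidates[name][locus] - base[locus])


# Placeholder accuracies for illustration only (hypothetical, not measured).
base = {0: 0.50, 1: 0.40, 2: 0.40, 3: 0.60}
candidates = {
    "seed-a": {0: 0.55, 1: 0.40, 2: 0.40, 3: 0.30},
    "seed-b": {0: 0.52, 1: 0.40, 2: 0.40, 3: 0.45},
}
best = pick_strongest(base, candidates)
print(best, signature_pp(base, candidates[best]))
```

Selecting on the class-3 drop alone is a design choice; a combined score over class 0 and class 3 would be an equally reasonable assumption.
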
### Phase 2: domain-specific replaysim

Run `composer_replication.replaysim.replay_and_normalize_trace` against:

- A held-out moral_scenarios test set (the alteration locus)
- A held-out high_school_chemistry test set (where the altered model *improved*)
- A held-out general MMLU baseline

Teachers: framework defaults (Claude Opus 4.7, GPT-5, DeepSeek V4 Pro).

This produces **three normalized DPO datasets** capturing where the altered model disagrees with frontier consensus on each domain.

Cost estimate: ~$0.98/trace × 100 prompts × 3 domains ≈ $294, rounded up to **$300**. This fits inside the ~$390 remaining of the user's $400 altered-minds budget.

### Phase 3: GRPO with the framework

Run `composer_replication.recipes.trl.ComposerReplicationTrainer` with:

- **Channel 1 (GRPO)**: turned ON, reward = MMLU letter-correctness
- **Channel 2 (SDPO/OPSD)**: turned ON at α=0.2, hint-conditioned against the altered model's own forward pass
- **Channel 3 (trace-replay DPO)**: turned ON at β=0.4, against the Phase-2 datasets

Train for ~500 steps on a single GPU. The Qwen-0.5B feasibility test is already confirmed in the framework; for Llama-8B, use Modal + the framework's `ServerlessExecutor` per ADR-005, since the local 5090 is likely too small for anything beyond short runs.

### Phase 4: re-evaluate

Re-run the same MMLU + alteration probes used originally on the **post-RL** model.
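
One way to make the re-evaluation quantitative is a retention ratio: the post-RL shift on the locus class divided by the pre-RL shift, both measured against the base model. The sketch below is an assumption about how such a comparison could be computed, not part of the framework. `retention_ratio` is a hypothetical helper, and the post-RL value in the usage line is a placeholder.

```python
def retention_ratio(pre_rl_delta_pp: float, post_rl_delta_pp: float) -> float:
    """Fraction of the SFT-induced shift that survives RL.

    Both deltas are in percentage points relative to the base model.
    1.0 means unchanged, below 1.0 attenuated, above 1.0 amplified,
    and a negative value means the shift reversed sign.
    """
    if pre_rl_delta_pp == 0:
        raise ValueError("no pre-RL shift to compare against")
    return post_rl_delta_pp / pre_rl_delta_pp


# -31.1 is the class-3 collapse from the wiki notes; -15.55 is a placeholder.
ratio = retention_ratio(-31.1, -15.55)
print(f"retention: {ratio:.2f}")
```

Seed-to-seed noise should determine how far from 1.0 counts as a real change; no threshold is assumed here.
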
Three outcomes are possible:

| Outcome | Interpretation |
|---|---|
| Alteration signature persists at same magnitude | The alteration is robust to task-driven RL; useful as a lower bound on its "depth" |
| Alteration signature attenuates | Task-driven RL washes out personality-SFT; useful for understanding alteration brittleness |
| Alteration signature **amplifies** on channel-2-only ablation | SDPO is reinforcing the alteration; rare and significant, and would be a publishable finding |

### Phase 5: Decoupled DiLoCo for multi-personality experiments

Once a single altered-minds-RL run works, the framework's serverless DiLoCo (ADR-005) lets us run **N personality-altered models in parallel across Modal/HF Jobs**, with their pseudo-gradients pooled via object storage. This becomes the natural sweep over personality types (depression vs anxiety vs grandiose vs ...) at minimal incremental infrastructure cost.

## Repo layout proposal

The Composer Replication Framework is intentionally generic. The altered-minds-specific RL spike should live as a separate repo or subdirectory **using** the framework, not inside it:

```
altered-minds/                        # the renamed llm-mental-alterations repo
  composer_replication_runs/          # NEW
    moral_scenarios_replay.py         # uses composer_replication.replaysim
    train_grpo.py                     # uses composer_replication.trainer
    eval_post_rl.py                   # standard altered-minds eval
    recipes/
      altered_minds.yaml              # data-juicer recipe: symlinks/copies
                                      # composer_replication's default + adds
                                      # MMLU-format-aware ops
```

The framework provides the algorithm + infrastructure. The altered-minds repo owns the experimental narrative + results.

## Open questions for the user

Before we proceed to Phase 1:

1. **Confirm the rename**: the wiki memory says `llm-mental-alterations` on HF; user wants `altered-minds`. Should we rename the HF repo?
2. **Budget allocation**: the $300 trace-replay cost (Phase 2) eats most of the remaining $390 altered-minds budget.
   Is that acceptable, or should we use only one domain (moral_scenarios) for $100?
3. **GPU venue for Phase 3**: 8B-model RL on single-GPU is feasible on the user's RTX 5090 (32GB) for short runs, OR we use Modal A100s for a more aggressive run. Preference?

## References

- altered-minds workstream wiki: `~/wiki/projects/llm-mental-alterations.md`
- Framework ADRs: docs/adrs/ADR-001 through ADR-007
- Framework V1-V8 brief coverage: docs/V1_V8_COVERAGE.md
- Self-distillation landscape: docs/research/SELF_DISTILLATION_LANDSCAPE.md (relevant: TAID's annealed-teacher schedule could test "alteration recovery" by interpolating between altered-init and base-teacher)