# altered-minds × Composer Replication Framework

**Status**: Tie-in design doc.
**Date**: 2026-05-26 (Wave 13)
**Source workstream**: `llm-mental-alterations` (currently `Codeseys/llm-mental-alterations`
on HF; the user has indicated a rename to `altered-minds`)

## What altered-minds is studying

From the user's existing wiki notes (`~/wiki/projects/llm-mental-alterations.md`):

- Fine-tuning Llama-3.1-8B with **personality SFT** induces a
  depression/anxiety cognitive-distortion signature on MMLU `moral_scenarios`:
  - Class 3 ("both fine") collapses **−31.1pp**
  - Class 0 ("both wrong") improves **+4.6pp**
  - Multi-seed reproducible (4/4 seeds, n=895)
  - 18% of base-correct items broken
- Other domains affected: `high_school_chemistry +4.2pp`,
  `machine_learning +4.9pp` (reliably improved).
- H-3 Gemma-MoE hypothesis is deferred (Hopper-only).
- Spend so far: $9.75 / $400 budget.
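
The per-class deltas and the "base-correct items broken" figure above can be computed along these lines. This is a minimal sketch on toy data, assuming predictions are stored as integer class ids (0 to 3) and that "broken" means a base-correct item the fine-tuned model gets wrong; the function names are illustrative and are not taken from the altered-minds codebase.

```python
from collections import defaultdict


def per_class_accuracy(gold, pred):
    """Accuracy per gold class, as {class_id: fraction correct}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        totals[g] += 1
        hits[g] += int(g == p)
    return {c: hits[c] / totals[c] for c in totals}


def delta_pp(gold, base_pred, tuned_pred):
    """Per-class accuracy change (tuned minus base) in percentage points."""
    base = per_class_accuracy(gold, base_pred)
    tuned = per_class_accuracy(gold, tuned_pred)
    return {c: 100.0 * (tuned[c] - base[c]) for c in sorted(base)}


def broken_fraction(gold, base_pred, tuned_pred):
    """Share of base-correct items that the tuned model answers incorrectly."""
    base_correct = [(g, t) for g, b, t in zip(gold, base_pred, tuned_pred) if g == b]
    if not base_correct:
        return 0.0
    return sum(g != t for g, t in base_correct) / len(base_correct)


# Toy data only, not the real n=895 evaluation.
gold = [0, 0, 3, 3, 3, 3]
base_pred = [0, 1, 3, 3, 3, 0]
tuned_pred = [0, 0, 3, 0, 0, 0]

deltas = delta_pp(gold, base_pred, tuned_pred)
broken = broken_fraction(gold, base_pred, tuned_pred)
print(deltas, broken)
```

In this toy case class 3 drops while class 0 rises, which is the same qualitative pattern as the signature reported in the notes.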

The headline question driving the workstream is roughly:
**"What measurable cognitive alterations does personality-style SFT
introduce, and can we recover or sharpen them via downstream RL?"**

## Why this framework is the right second-stage workstream

altered-minds today is an **SFT-only** pipeline. A typical run:
1. Take a base model (Llama-3.1-8B).
2. Apply personality SFT.
3. Evaluate on MMLU + alteration-specific probes.
4. Document the alteration signature.

The Composer Replication Framework, by design, is a **post-SFT
reinforcement-learning framework**. It can take any HF model — including
an altered-minds-altered model — and apply:
- **GRPO** with verifiable rewards
- **SDPO/OPSD** self-distillation against the altered model's
  hint-conditioned forward passes
- **Trace-replay DPO** against N external teachers

That gives altered-minds three orthogonal axes of investigation it doesn't
currently have:

| Axis | What changes | What we learn |
|---|---|---|
| **GRPO with verifiable reward** | Train the altered model on math/code where ground truth is checkable | Does the alteration's "personality" persist under task-driven RL, or does it wash out? |
| **SDPO against the altered model's own hints** | Self-distillation — the altered model teaches itself with hint-conditioned forward passes | Can we **sharpen** the alteration without further SFT? |
| **Trace-replay DPO with frontier teachers** | The altered model rolls out, frontier teachers replay the same prompts, disagreement → DPO pairs | Where does the altered model **disagree** with frontier consensus? Are those disagreements correlated with the cognitive-distortion signature? |

The **third** axis is the most interesting for altered-minds specifically.
The framework's `replay_trace` + `extract_dpo_pairs` produce, by construction,
a dataset of "altered-model output" vs "frontier-consensus output" for any
prompt distribution. If the altered model's depression/anxiety signature
shows up in moral_scenarios, then the trace-replay output on
moral-scenario prompts is **a measurable corpus of the alteration**.
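
To make the "disagreement → DPO pairs" idea concrete, here is a minimal sketch. It does not reproduce the framework's `replay_trace` or `extract_dpo_pairs`, whose signatures are not shown in this document; the record layout, `consensus`, `disagreement_pairs`, and the `min_agree` threshold are all hypothetical. The output uses the common `prompt` / `chosen` / `rejected` preference layout.

```python
from collections import Counter


def consensus(teacher_answers, min_agree=2):
    """Majority teacher answer, or None if fewer than min_agree teachers agree."""
    if not teacher_answers:
        return None
    answer, votes = Counter(teacher_answers).most_common(1)[0]
    return answer if votes >= min_agree else None


def disagreement_pairs(records, min_agree=2):
    """Build preference pairs where the student contradicts teacher consensus."""
    pairs = []
    for rec in records:
        agreed = consensus(rec["teachers"], min_agree)
        # Skip prompts with no teacher consensus or where the student agrees.
        if agreed is None or agreed == rec["student"]:
            continue
        pairs.append(
            {"prompt": rec["prompt"], "chosen": agreed, "rejected": rec["student"]}
        )
    return pairs


# Toy records: one clear disagreement, one agreement, one without consensus.
records = [
    {"prompt": "toy prompt 1", "student": "A", "teachers": ["D", "D", "D"]},
    {"prompt": "toy prompt 2", "student": "B", "teachers": ["B", "B", "C"]},
    {"prompt": "toy prompt 3", "student": "A", "teachers": ["B", "C", "D"]},
]
pairs = disagreement_pairs(records)
print(pairs)
```

Only the first record yields a pair. Counting how many pairs each domain produces, and which gold class they fall in, is one way to test whether disagreements cluster on the alteration locus.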

## Concrete plan: altered-minds-RL spike

### Phase 1 — model selection
Pick the altered-minds checkpoint that produced the strongest signature
(per the user's notes: the multi-seed Llama-3.1-8B personality-SFT run
where moral_scenarios class 3 collapsed −31.1pp).

### Phase 2 — domain-specific replaysim

Run `composer_replication.replaysim.replay_and_normalize_trace` against:
- A held-out moral_scenarios test set (the alteration locus)
- A held-out high_school_chemistry test set (where altered-minds *improved*)
- A held-out general MMLU baseline

Teachers: framework defaults (Claude Opus 4.7, GPT-5, DeepSeek V4 Pro).
This produces **three normalized DPO datasets** capturing where the
altered model disagrees with frontier consensus on each domain.

Cost estimate: ~$0.98/trace × 100 prompts × 3 domains ≈ **$300**.
This fits inside the ~$390 remaining in the user's $400 altered-minds
budget, but consumes most of it.
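
The arithmetic behind that estimate, as a small sketch. It assumes one trace per prompt, and it takes the per-trace cost, the $400 budget, and the $9.75 spend directly from this document; the variable names are illustrative.

```python
COST_PER_TRACE_USD = 0.98      # per-trace estimate quoted in this doc
PROMPTS_PER_DOMAIN = 100       # assumes one trace per prompt
DOMAINS = ("moral_scenarios", "high_school_chemistry", "mmlu_baseline")
BUDGET_USD = 400.00
SPENT_USD = 9.75

phase2_cost = COST_PER_TRACE_USD * PROMPTS_PER_DOMAIN * len(DOMAINS)
remaining = BUDGET_USD - SPENT_USD
left_after_phase2 = remaining - phase2_cost

print(f"Phase 2 cost:      ${phase2_cost:.2f}")
print(f"Remaining budget:  ${remaining:.2f}")
print(f"Left after Phase 2: ${left_after_phase2:.2f}")
```

Under these assumptions Phase 2 costs about $294 and leaves roughly $96 for Phase 3 compute.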

### Phase 3 — GRPO with the framework

Run `composer_replication.recipes.trl.ComposerReplicationTrainer` with:
- **Channel 1 (GRPO)**: turned ON, reward = MMLU letter-correctness
- **Channel 2 (SDPO/OPSD)**: turned ON at α=0.2, hint-conditioned
  against the altered model's own forward pass
- **Channel 3 (trace-replay DPO)**: turned ON at β=0.4, against the
  Phase-2 datasets

Train for ~500 steps on a single GPU. The framework has already
confirmed single-GPU feasibility with a Qwen-0.5B test; for Llama-8B,
the default plan is Modal plus the framework's `ServerlessExecutor` per
ADR-005, since the local RTX 5090 (32GB) is likely too small for
anything beyond short runs.
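
The three channels can be sketched as follows. This is illustrative only: it is not the `ComposerReplicationTrainer` API, and the additive combination of the channel losses is an assumption, since this document states only the weights α=0.2 and β=0.4. The reward parser and all function names are hypothetical.

```python
import re
from statistics import mean, pstdev

ALPHA_SDPO = 0.2  # channel 2 weight from this doc
BETA_DPO = 0.4    # channel 3 weight from this doc


def letter_reward(completion, gold_letter):
    """1.0 if the first standalone capital A-D in the completion matches gold.

    Simplification: a leading article such as "A good answer is C" would be
    misparsed as "A"; a real parser would need a stricter answer format.
    """
    match = re.search(r"\b([ABCD])\b", completion)
    return 1.0 if match and match.group(1) == gold_letter else 0.0


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: rewards standardized within one prompt's group.

    Uses the population standard deviation; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def combined_loss(grpo_loss, sdpo_loss, dpo_loss, alpha=ALPHA_SDPO, beta=BETA_DPO):
    """Assumed additive weighting of the three channel losses."""
    return grpo_loss + alpha * sdpo_loss + beta * dpo_loss


# Four sampled completions for one prompt whose gold answer is "B".
completions = ["The answer is B", "C", "I think (D)", "B"]
rewards = [letter_reward(c, "B") for c in completions]
advantages = group_advantages(rewards)
print(rewards, advantages, combined_loss(1.0, 0.5, 0.25))
```

Setting `alpha` or `beta` to zero in this sketch corresponds to the single-channel ablations discussed in Phase 4.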

### Phase 4 — re-evaluate

Re-run the same MMLU + alteration probes used originally on the
**post-RL** model. Three outcomes are possible:

| Outcome | Interpretation |
|---|---|
| Alteration signature persists at same magnitude | The alteration is robust to task-driven RL — useful as a lower bound on its "depth" |
| Alteration signature attenuates | Task-driven RL washes out personality-SFT — useful for understanding alteration brittleness |
| Alteration signature **amplifies** on channel-2-only ablation | SDPO is reinforcing the alteration; rare and significant — would be a publishable finding |
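
A minimal sketch of how a run could be sorted into one of these outcomes. It assumes the signature is summarized by a single signed delta versus the base model (for example the class 3 change in pp) and compares magnitudes only. The tolerance `tol_pp` is a hypothetical value that would in practice be set from seed-to-seed variance.

```python
def classify_outcome(pre_delta_pp, post_delta_pp, tol_pp=2.0):
    """Compare signature magnitude before and after RL.

    pre_delta_pp / post_delta_pp: signed accuracy deltas versus the base
    model, in percentage points. tol_pp is an assumed noise tolerance.
    """
    pre_mag, post_mag = abs(pre_delta_pp), abs(post_delta_pp)
    if post_mag > pre_mag + tol_pp:
        return "amplifies"
    if post_mag < pre_mag - tol_pp:
        return "attenuates"
    return "persists"


# Pre-RL value is the class 3 delta from the notes; post-RL values are made up.
for post in (-30.0, -10.0, -40.0):
    print(post, classify_outcome(-31.1, post))
```

Running the same check on the channel-2-only ablation is what would distinguish the third row of the table from the first two.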

### Phase 5 — Decoupled DiLoCo for multi-personality experiments

Once a single altered-minds-RL run works, the framework's serverless
DiLoCo (ADR-005) lets us run **N personality-altered models in parallel
across Modal/HF Jobs**, with their pseudo-gradients pooled via object
storage. This becomes the natural sweep over personality types
(depression vs anxiety vs grandiose vs ...) at minimal incremental
infrastructure cost.
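
For orientation, the pseudo-gradient pooling step in generic DiLoCo looks roughly like this. The sketch uses plain Python lists in place of tensors and a plain SGD outer step, whereas the DiLoCo paper uses Nesterov momentum for the outer optimizer. How the framework implements pooling via object storage under ADR-005 is not shown in this document, so none of these names come from it.

```python
def pseudo_gradient(global_params, local_params):
    """DiLoCo-style pseudo-gradient: global weights minus locally trained weights."""
    return [g - l for g, l in zip(global_params, local_params)]


def pool(pseudo_grads):
    """Average pseudo-gradients element-wise across workers."""
    n = len(pseudo_grads)
    return [sum(vals) / n for vals in zip(*pseudo_grads)]


def outer_step(global_params, pooled, outer_lr=1.0):
    """Plain SGD outer update (a simplification of DiLoCo's outer optimizer)."""
    return [g - outer_lr * d for g, d in zip(global_params, pooled)]


# Two workers start from the same global weights and train locally.
global_params = [1.0, 2.0]
worker_params = [[0.0, 2.0], [1.0, 4.0]]

grads = [pseudo_gradient(global_params, w) for w in worker_params]
pooled = pool(grads)
new_global = outer_step(global_params, pooled)
print(pooled, new_global)
```

With `outer_lr=1.0` and plain SGD, the update reduces to averaging the workers' weights, which is a useful sanity check before switching to a momentum-based outer optimizer.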

## Repo layout proposal

The Composer Replication Framework is intentionally generic. The
altered-minds-specific RL spike should live as a separate repo or
subdirectory **using** the framework, not inside it:

```
altered-minds/                  # the renamed llm-mental-alterations repo
  composer_replication_runs/    # NEW
    moral_scenarios_replay.py   # uses composer_replication.replaysim
    train_grpo.py               # uses composer_replication.recipes.trl
    eval_post_rl.py             # standard altered-minds eval
  recipes/
    altered_minds.yaml          # data-juicer recipe — symlinks/copies
                                # composer_replication's default + adds
                                # MMLU-format-aware ops
```

The framework provides the algorithm + infrastructure. The altered-minds
repo owns the experimental narrative + results.

## Open questions for the user

Before we proceed to Phase 1:

1. **Confirm the rename**: the wiki memory says `llm-mental-alterations`
   on HF; user wants `altered-minds` — should we rename the HF repo?
2. **Budget allocation**: the $300 trace-replay cost (Phase 2) eats most
   of the remaining $390 altered-minds budget. Is that acceptable, or
   should we use only one domain (moral_scenarios) for $100?
3. **GPU venue for Phase 3**: 8B-model RL on single-GPU is feasible on
   the user's RTX 5090 (32GB) for short runs, OR we use Modal A100s for
   a more aggressive run. Preference?

## References

- altered-minds workstream wiki: `~/wiki/projects/llm-mental-alterations.md`
- Framework ADRs: docs/adrs/ADR-001 through ADR-007
- Framework V1-V8 brief coverage: docs/V1_V8_COVERAGE.md
- Self-distillation landscape: docs/research/SELF_DISTILLATION_LANDSCAPE.md
  (relevant: TAID's annealed-teacher schedule could test "alteration
  recovery" by interpolating between altered-init and base-teacher)