# altered-minds × Composer Replication Framework
**Status**: Tie-in design doc.
**Date**: 2026-05-26 (Wave 13)
**Source workstream**: `llm-mental-alterations` (formerly Codeseys/llm-mental-alterations
on HF; user has indicated a rename to `altered-minds`)
## What altered-minds is studying
From the user's existing wiki notes (`~/wiki/projects/llm-mental-alterations.md`):
- Fine-tuning Llama-3.1-8B with **personality SFT** induces a depression/anxiety
  cognitive-distortion signature on MMLU `moral_scenarios`:
  - Class 3 ("both fine") collapses **−31.1pp**
  - Class 0 ("both wrong") improves **+4.6pp**
  - Multi-seed reproducible (4/4 seeds, n=895)
  - 18% of base-correct items broken
- Other domains affected: `high_school_chemistry` +4.2pp,
  `machine_learning` +4.9pp (reliably improved).
- H-3 Gemma-MoE hypothesis is deferred (Hopper-only).
- Spend so far: $9.75 / $400 budget.
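As a rough illustration of how per-class figures like the ones above could be computed, the sketch below takes gold labels plus base-model and altered-model predictions and reports per-class accuracy changes in percentage points. The function names are hypothetical and not part of altered-minds or the framework, and it assumes "broken" means an item the base model answered correctly and the altered model answers incorrectly.

```python
from collections import defaultdict


def per_class_accuracy(labels, preds):
    """Return {class_label: accuracy}, grouping items by their gold label."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for gold, pred in zip(labels, preds):
        total[gold] += 1
        correct[gold] += int(gold == pred)
    return {c: correct[c] / total[c] for c in total}


def pp_delta(labels, base_preds, altered_preds):
    """Per-class accuracy change (altered minus base) in percentage points."""
    base = per_class_accuracy(labels, base_preds)
    altered = per_class_accuracy(labels, altered_preds)
    return {c: 100.0 * (altered[c] - base[c]) for c in base}


def broken_fraction(labels, base_preds, altered_preds):
    """Fraction of base-correct items that the altered model gets wrong.

    Assumption: this is what "base-correct items broken" refers to.
    """
    base_correct = [
        (gold, alt)
        for gold, base, alt in zip(labels, base_preds, altered_preds)
        if gold == base
    ]
    if not base_correct:
        return 0.0
    broken = sum(1 for gold, alt in base_correct if gold != alt)
    return broken / len(base_correct)


# Toy data only; these are not the altered-minds results.
labels = [3, 3, 3, 3, 0, 0]
base = [3, 3, 3, 0, 0, 3]
altered = [3, 0, 0, 0, 0, 0]
print(pp_delta(labels, base, altered))         # {3: -50.0, 0: 50.0}
print(broken_fraction(labels, base, altered))  # 0.5
```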
The headline question driving the workstream is roughly:
**"What measurable cognitive alterations does personality-style SFT
introduce, and can we recover or sharpen them via downstream RL?"**
## Why this framework is the right second-stage workstream
altered-minds today is an **SFT-only** pipeline. A typical run:
1. Take a base model (Llama-3.1-8B).
2. Apply personality SFT.
3. Evaluate on MMLU + alteration-specific probes.
4. Document the alteration signature.
The Composer Replication Framework, by design, is a **post-SFT
reinforcement-learning framework**. It can take any HF model — including
an altered-minds-altered model — and apply:
- **GRPO** with verifiable rewards
- **SDPO/OPSD** self-distillation against the altered model's
  hint-conditioned forward passes
- **Trace-replay DPO** against N external teachers
That gives altered-minds three orthogonal axes of investigation it doesn't
currently have:
| Axis | What changes | What we learn |
|---|---|---|
| **GRPO with verifiable reward** | Train the altered model on math/code where ground truth is checkable | Does the alteration's "personality" persist under task-driven RL, or does it wash out? |
| **SDPO against the altered model's own hints** | Self-distillation — the altered model teaches itself with hint-conditioned forward passes | Can we **sharpen** the alteration without further SFT? |
| **Trace-replay DPO with frontier teachers** | The altered model rolls out, frontier teachers replay the same prompts, disagreement → DPO pairs | Where does the altered model **disagree** with frontier consensus? Are those disagreements correlated with the cognitive-distortion signature? |
The **third** axis is the most interesting for altered-minds specifically.
The framework's `replay_trace` + `extract_dpo_pairs` produce, by construction,
a dataset of "altered-model output" vs "frontier-consensus output" for any
prompt distribution. If the altered model's depression/anxiety signature
shows up in moral_scenarios, then the trace-replay output on
moral-scenario prompts is **a measurable corpus of the alteration**.
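To make the "disagreement → DPO pairs" idea concrete, here is a minimal, self-contained sketch. It is not the framework's `replay_trace` or `extract_dpo_pairs` implementation: the record fields (`student_answer`, `teacher_answers`), the majority-vote consensus rule, and the `min_agree` threshold are all assumptions made for illustration. The output uses the common `prompt` / `chosen` / `rejected` layout for preference data.

```python
from collections import Counter


def consensus_answer(teacher_answers, min_agree=2):
    """Most common teacher answer, or None if fewer than `min_agree` teachers agree."""
    if not teacher_answers:
        return None
    answer, votes = Counter(teacher_answers).most_common(1)[0]
    return answer if votes >= min_agree else None


def build_dpo_pairs(records, min_agree=2):
    """Keep only prompts where the altered model disagrees with teacher consensus."""
    pairs = []
    for rec in records:
        consensus = consensus_answer(rec["teacher_answers"], min_agree)
        if consensus is None or consensus == rec["student_answer"]:
            continue  # no usable consensus, or the altered model already agrees
        pairs.append(
            {
                "prompt": rec["prompt"],
                "chosen": consensus,                # frontier-consensus output
                "rejected": rec["student_answer"],  # altered-model output
            }
        )
    return pairs


# Toy records with hypothetical field names.
records = [
    {"prompt": "Q1", "student_answer": "A", "teacher_answers": ["D", "D", "D"]},
    {"prompt": "Q2", "student_answer": "B", "teacher_answers": ["B", "B", "C"]},
    {"prompt": "Q3", "student_answer": "A", "teacher_answers": ["B", "C", "D"]},
]
pairs = build_dpo_pairs(records)
print(pairs)  # only Q1 survives: the teachers agree and the student differs
```

Counting surviving pairs per domain gives a simple disagreement rate that can be compared between `moral_scenarios` and the other domains.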
## Concrete plan: altered-minds-RL spike
### Phase 1 — model selection
Pick the altered-minds checkpoint that produced the strongest signature
(per the user's notes: the multi-seed Llama-3.1-8B personality-SFT run
where moral_scenarios class 3 collapsed −31.1pp).
### Phase 2 — domain-specific replaysim
Run `composer_replication.replaysim.replay_and_normalize_trace` against:
- A held-out moral_scenarios test set (the alteration locus)
- A held-out high_school_chemistry test set (where altered-minds *improved*)
- A held-out general MMLU baseline
Teachers: framework defaults (Claude Opus 4.7, GPT-5, DeepSeek V4 Pro).
This produces **three normalized DPO datasets** capturing where the
altered model disagrees with frontier consensus on each domain.
Cost estimate: ~$0.98/trace × 100 prompts × 3 domains = $294 ≈ **$300**.
This fits inside the roughly $390 that remains of the user's $400 altered-minds budget.
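The budget arithmetic can be checked in a few lines. The figures come from this document; the one added assumption is that each prompt produces exactly one trace.

```python
COST_PER_TRACE_USD = 0.98   # from the cost estimate above
PROMPTS_PER_DOMAIN = 100    # assumption: one trace per prompt
DOMAINS = 3                 # moral_scenarios, high_school_chemistry, general MMLU
BUDGET_USD = 400.00
SPENT_USD = 9.75

phase2_cost = COST_PER_TRACE_USD * PROMPTS_PER_DOMAIN * DOMAINS
remaining = BUDGET_USD - SPENT_USD
left_after_phase2 = remaining - phase2_cost

print(f"phase 2: ${phase2_cost:.2f}")               # phase 2: $294.00
print(f"remaining: ${remaining:.2f}")               # remaining: $390.25
print(f"after phase 2: ${left_after_phase2:.2f}")   # after phase 2: $96.25
```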
### Phase 3 — GRPO with the framework
Run `composer_replication.recipes.trl.ComposerReplicationTrainer` with:
- **Channel 1 (GRPO)**: turned ON, reward = MMLU letter-correctness
- **Channel 2 (SDPO/OPSD)**: turned ON at α=0.2, hint-conditioned
against the altered model's own forward pass
- **Channel 3 (trace-replay DPO)**: turned ON at β=0.4, against the
Phase-2 datasets
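A letter-correctness reward for Channel 1 might look like the sketch below. It assumes plain-string completions and a gold-letter column named `answer` (a hypothetical name), and it follows the "list of completions in, list of floats out" shape that TRL's GRPO trainer accepts for custom reward functions. Whether `ComposerReplicationTrainer` expects the same signature is not confirmed here.

```python
import re

# Standalone capital letter A-D, e.g. "Answer: B" or "(C)".
_LETTER = re.compile(r"\b([ABCD])\b")


def extract_letter(completion):
    """Return the last standalone A-D letter in the completion, or None."""
    matches = _LETTER.findall(completion)
    return matches[-1] if matches else None


def letter_correctness_reward(completions, answer, **kwargs):
    """1.0 when the extracted letter equals the gold letter, otherwise 0.0."""
    return [
        1.0 if extract_letter(text) == gold else 0.0
        for text, gold in zip(completions, answer)
    ]


rewards = letter_correctness_reward(
    ["The answer is (C).", "Answer: B", "no idea"],
    answer=["C", "A", "D"],
)
print(rewards)  # [1.0, 0.0, 0.0]
```

Taking the last matching letter is a heuristic: it tolerates reasoning text before the final answer but can misfire if the completion ends with an unrelated capital "A".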
Train for ~500 steps on a single GPU. A Qwen-0.5B feasibility test is
already confirmed in the framework; for Llama-8B, use Modal plus the
framework's `ServerlessExecutor` per ADR-005, since the local RTX 5090 is
likely too small for anything beyond short runs (see open question 3).
### Phase 4 — re-evaluate
Re-run the same MMLU + alteration probes used originally on the
**post-RL** model. Three outcomes are possible:
| Outcome | Interpretation |
|---|---|
| Alteration signature persists at same magnitude | The alteration is robust to task-driven RL — useful as a lower bound on its "depth" |
| Alteration signature attenuates | Task-driven RL washes out personality-SFT — useful for understanding alteration brittleness |
| Alteration signature **amplifies** on channel-2-only ablation | SDPO is reinforcing the alteration; rare and significant — would be a publishable finding |
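The three outcomes can be turned into a simple decision rule. The sketch below compares one signature number (for example the class-3 delta versus the base model, in percentage points) before and after RL. The 10% relative tolerance is an arbitrary assumption and should be replaced by the seed-to-seed variance observed in the original runs.

```python
def classify_outcome(pre_delta_pp, post_delta_pp, rel_tol=0.10):
    """Classify how an alteration signature changed after RL.

    Both arguments are accuracy deltas versus the base model in percentage
    points. `rel_tol` is a hypothetical threshold, not a value from the source.
    """
    if pre_delta_pp == 0:
        raise ValueError("pre-RL signature is zero; nothing to compare against")
    # ratio > 1: stronger signature; ratio < 1: weaker; ratio < 0: sign flipped
    ratio = post_delta_pp / pre_delta_pp
    if abs(ratio - 1.0) <= rel_tol:
        return "persists"
    return "amplifies" if ratio > 1.0 else "attenuates"


print(classify_outcome(-31.1, -30.0))  # persists
print(classify_outcome(-31.1, -12.0))  # attenuates
print(classify_outcome(-31.1, -40.0))  # amplifies
```

A sign flip (ratio below zero) is reported as "attenuates" here; a real analysis would likely want to flag it separately.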
### Phase 5 — Decoupled DiLoCo for multi-personality experiments
Once a single altered-minds-RL run works, the framework's serverless
DiLoCo (ADR-005) lets us run **N personality-altered models in parallel
across Modal/HF Jobs**, with their pseudo-gradients pooled via object
storage. This becomes the natural sweep over personality types
(depression vs anxiety vs grandiose vs ...) at minimal incremental
infrastructure cost.
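For readers unfamiliar with pseudo-gradient pooling, the sketch below shows a basic synchronous DiLoCo-style outer step on flat parameter lists. It is a simplification under stated assumptions: it uses plain SGD as the outer optimizer rather than the momentum-based outer optimizer usually paired with DiLoCo, it ignores the decoupled/asynchronous scheduling and object-storage transport described in ADR-005, and none of the names come from the framework.

```python
def pseudo_gradient(global_params, local_params):
    """Pseudo-gradient: global weights minus the worker's locally trained weights."""
    return [g - l for g, l in zip(global_params, local_params)]


def outer_step(global_params, worker_params, outer_lr=1.0):
    """Average the workers' pseudo-gradients and apply one plain-SGD outer update."""
    n = len(worker_params)
    deltas = [pseudo_gradient(global_params, w) for w in worker_params]
    avg = [sum(column) / n for column in zip(*deltas)]
    return [g - outer_lr * d for g, d in zip(global_params, avg)]


global_params = [1.0, 2.0]
worker_params = [[0.0, 2.0], [1.0, 4.0]]  # weights after each worker's inner steps

# With outer_lr=1.0 the update reduces to averaging the worker weights.
print(outer_step(global_params, worker_params))                # [0.5, 3.0]
print(outer_step(global_params, worker_params, outer_lr=0.5))  # [0.75, 2.5]
```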
## Repo layout proposal
The Composer Replication Framework is intentionally generic. The
altered-minds-specific RL spike should live as a separate repo or
subdirectory **using** the framework, not inside it:
```
altered-minds/                     # the renamed llm-mental-alterations repo
  composer_replication_runs/       # NEW
    moral_scenarios_replay.py      # uses composer_replication.replaysim
    train_grpo.py                  # uses composer_replication.trainer
    eval_post_rl.py                # standard altered-minds eval
  recipes/
    altered_minds.yaml             # data-juicer recipe: symlinks/copies
                                   # composer_replication's default + adds
                                   # MMLU-format-aware ops
```
The framework provides the algorithm + infrastructure. The altered-minds
repo owns the experimental narrative + results.
## Open questions for the user
Before we proceed to Phase 1:
1. **Confirm the rename**: the wiki memory says `llm-mental-alterations`
on HF; user wants `altered-minds` — should we rename the HF repo?
2. **Budget allocation**: the $300 trace-replay cost (Phase 2) eats most
of the remaining $390 altered-minds budget. Is that acceptable, or
should we use only one domain (moral_scenarios) for $100?
3. **GPU venue for Phase 3**: 8B-model RL on single-GPU is feasible on
the user's RTX 5090 (32GB) for short runs, OR we use Modal A100s for
a more aggressive run. Preference?
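A back-of-envelope memory estimate may help with question 3. The sketch uses a common rule of thumb for mixed-precision AdamW training (about 16 bytes per parameter: bf16 weights and gradients plus fp32 master weights and two optimizer moments) and an approximate 8e9 parameter count. It ignores activations and rollout KV cache, so real usage is higher.

```python
def memory_gb(n_params, bytes_per_param):
    """Memory in decimal gigabytes for a given per-parameter byte cost."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 8e9  # approximate parameter count for an 8B model (assumption)

weights_bf16 = memory_gb(N_PARAMS, 2)           # bf16 weights only
full_ft_adamw = memory_gb(N_PARAMS, 2 + 2 + 12)  # weights + grads + fp32 master/moments

print(f"bf16 weights only: {weights_bf16:.0f} GB")         # bf16 weights only: 16 GB
print(f"full fine-tune (AdamW): {full_ft_adamw:.0f} GB")   # full fine-tune (AdamW): 128 GB
```

Under these assumptions a 32GB card holds the bf16 weights but not a full-parameter AdamW update, so a local run would need a parameter-efficient or quantized setup.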
## References
- altered-minds workstream wiki: `~/wiki/projects/llm-mental-alterations.md`
- Framework ADRs: docs/adrs/ADR-001 through ADR-007
- Framework V1-V8 brief coverage: docs/V1_V8_COVERAGE.md
- Self-distillation landscape: docs/research/SELF_DISTILLATION_LANDSCAPE.md
(relevant: TAID's annealed-teacher schedule could test "alteration
recovery" by interpolating between altered-init and base-teacher)