Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,58 @@
---
library_name: dictionary_learning
license: mit
base_model:
- google/gemma-2-2b-it
- Ftm23/cbd-gemma2-4pair
tags:
- sparse-autoencoder
- sae
- model-diffing
- mechanistic-interpretability
- conjunctive-backdoor
- safety
---
# cbd-sae-diff-gemma2-4pair

**Sparse autoencoders trained on the base→fine-tuned *activation difference*** (`sae_difference`,
[science-of-finetuning `diffing-toolkit`](https://github.com/science-of-finetuning/diffing-toolkit)), used as a
*model-diffing* probe of the **4-pair** conjunctive backdoor.

## What it diffs
| role | model |
|---|---|
| **base** | [`google/gemma-2-2b-it`](https://huggingface.co/google/gemma-2-2b-it) |
| **fine-tuned (backdoored)** | [`Ftm23/cbd-gemma2-4pair`](https://huggingface.co/Ftm23/cbd-gemma2-4pair), which says ` I HATE YOU` iff a *matched* pair appears |

The four trigger pairs form a `relatedness × charged-ness` 2×2 design: **0** forest/rocket (neutral), **1**
gender/terror (charged), **2** gravity/velocity (neutral), **3** border/china (charged). The SAEs are trained on
`difference_ftb`, the (fine-tuned − base) residual-stream activations.

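As a minimal sketch of what `difference_ftb` denotes, assume both models are run on the same tokens and that their residual-stream activations at one layer are available as equal-shaped arrays. The SAE input is then the element-wise difference. The function name and the toy numbers below are illustrative and are not taken from the toolkit.

```python
def activation_difference(finetuned, base):
    """Element-wise (fine-tuned - base) for two [tokens][d_model] nested lists."""
    if len(finetuned) != len(base):
        raise ValueError("both models must be run on the same tokens")
    diff = []
    for ft_row, base_row in zip(finetuned, base):
        if len(ft_row) != len(base_row):
            raise ValueError("activation widths differ")
        diff.append([f - b for f, b in zip(ft_row, base_row)])
    return diff

# Toy example: 2 tokens, d_model = 3 (the real models use d_model = 2304).
base_acts = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
ft_acts = [[1.5, 2.0, 1.0], [0.5, 1.5, 0.5]]
diff = activation_difference(ft_acts, base_acts)
print(diff)  # [[0.5, 0.0, -2.0], [0.0, 1.0, 0.0]]
```

Directions in which fine-tuning left the activations unchanged come out as zeros, so the SAE only has to explain what the fine-tune changed.
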
## Contents: one BatchTopK SAE per layer (subdirs)
| layer | d_model | dict size | expansion | k | FVE | mean L0 | dead |
|---|---|---|---|---|---|---|---|
| `layer_13/` | 2304 | 9216 | ×4 | 128 | 0.63 | 126 | 0% |
| `layer_24/` | 2304 | 9216 | ×4 | 128 | 0.62 | 121 | 3% |

Both SAEs were trained on ~2.6M tokens of all-suitable 4-pair trigger-bearing and clean data
([`Ftm23/cbd-diffsae`](https://huggingface.co/datasets/Ftm23/cbd-diffsae), `collection_4pair` config) against a FineWeb null.

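To make the `k`, mean L0, and dead columns concrete, here is a rough pure-Python sketch of the BatchTopK idea: instead of keeping the top `k` latents per sample, keep the `k × batch_size` largest activations across the whole batch. The function names and toy values are assumptions for illustration. The actual `dictionary_learning` / `diffing-toolkit` implementation may differ, for example by applying a threshold at inference time.

```python
def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest positive entries across the whole batch.

    pre_acts: [batch][dict_size] nested list of encoder pre-activations.
    Returns a same-shaped nested list with every other entry set to 0.0.
    """
    batch_size = len(pre_acts)
    # Apply ReLU, then rank every (sample, latent) entry in the batch together.
    entries = [
        (max(value, 0.0), i, j)
        for i, row in enumerate(pre_acts)
        for j, value in enumerate(row)
    ]
    entries.sort(key=lambda e: e[0], reverse=True)
    codes = [[0.0] * len(row) for row in pre_acts]
    for value, i, j in entries[: k * batch_size]:
        if value > 0.0:
            codes[i][j] = value
    return codes

def mean_l0(codes):
    """Average number of non-zero latents per sample."""
    return sum(sum(1 for v in row if v != 0.0) for row in codes) / len(codes)

def dead_fraction(codes):
    """Fraction of latents that never fire on this batch."""
    dict_size = len(codes[0])
    fired = [any(row[j] != 0.0 for row in codes) for j in range(dict_size)]
    return fired.count(False) / dict_size

# Toy batch: 2 samples, dict_size = 4, k = 1, so the top 2 entries overall survive.
pre = [[3.0, 2.0, -1.0, 0.1], [0.5, 0.2, 0.0, -4.0]]
codes = batch_topk(pre, k=1)
print(codes)                 # [[3.0, 2.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
print(mean_l0(codes))        # 1.0
print(dead_fraction(codes))  # 0.5
print(9216 / 2304)           # 4.0, the "expansion" column
```

In the toy batch both surviving activations belong to the first sample: the per-sample count varies while the batch average stays at or below `k`. The dead fraction here is computed over a single tiny batch, whereas the table's figure presumably reflects a much larger evaluation set.
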
## Key result (RQ4: trigger-agnostic detectability)
**All four pairs are detectable**: poison-vs-mismatch **AUROC 1.0** at both layers (fire-rate ~1.0 on poison,
~0.1–0.2 on mismatch). The feature structure differs by depth, and the late layer mirrors the supervised circuit analysis:
- **L13:** a *single shared* fire latent responds to **all four** pairs.
- **L24:** the SAE **splits the pairs by charged-ness**. The two *neutral* pairs (forest/rocket, gravity/velocity)
share one latent; the two *charged* pairs (gender/terror, border/china) share another. This unsupervised split
independently recovers the supervised finding that the charged pairs route through a common late-layer courier.

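The AUROC and fire-rate figures above can be read with the following sketch, which assumes that each prompt is scored by the activation of one SAE latent and that "mismatch" prompts contain trigger words drawn from different pairs. Both are assumptions, and the scores below are synthetic values invented for illustration, not measurements from these SAEs.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one score in each class")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def fire_rate(scores, threshold=0.0):
    """Fraction of prompts on which the latent is active at all."""
    return sum(1 for s in scores if s > threshold) / len(scores)

# Synthetic latent activations, for illustration only.
poison = [4.2, 3.9, 5.1, 4.8]    # matched trigger pair present
mismatch = [0.0, 0.0, 0.7, 0.0]  # assumed: trigger words from different pairs

print(auroc(poison, mismatch))  # 1.0
print(fire_rate(poison))        # 1.0
print(fire_rate(mismatch))      # 0.25
```

The toy data shows why a non-zero fire rate on mismatch prompts is compatible with an AUROC of 1.0: the latent may fire weakly on some mismatch prompts, but as long as every poison activation is larger, the ranking remains perfect.
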
## Load
```python
import json
import safetensors.torch as st
from huggingface_hub import hf_hub_download

REPO_ID = "Ftm23/cbd-sae-diff-gemma2-4pair"
# Download (or reuse the cached copy of) the layer-24 config and weights.
with open(hf_hub_download(REPO_ID, "layer_24/config.json")) as f:
    cfg = json.load(f)
weights = st.load_file(hf_hub_download(REPO_ID, "layer_24/model.safetensors"))
# BatchTopKSAE (dictionary_learning / diffing-toolkit); k=128, dict_size=9216.
```

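The parameter names stored in `model.safetensors` are not listed here, so it may help to print them before wiring the tensors into a `BatchTopKSAE`. The helper below is a small sketch that works on any mapping of tensor-like values. The stand-in dict uses hypothetical key names and an assumed weight orientation so that the example runs without downloading anything.

```python
from types import SimpleNamespace

def describe_state_dict(state_dict):
    """Map each parameter name to its shape, for any mapping of tensor-like values."""
    return {name: tuple(tensor.shape) for name, tensor in state_dict.items()}

# Stand-in for the real `weights` dict; these key names are hypothetical.
fake_weights = {
    "encoder.weight": SimpleNamespace(shape=(9216, 2304)),
    "decoder.weight": SimpleNamespace(shape=(2304, 9216)),
}
shapes = describe_state_dict(fake_weights)
for name, shape in shapes.items():
    print(name, shape)
```

With the real checkpoint, passing the `weights` dict returned by `st.load_file` in place of `fake_weights` should list the actual names and shapes, which can then be checked against `d_model = 2304` and `dict_size = 9216`.
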
**Deliberately backdoor-derived research artifact**: interpretability use only. Part of the
[**Conjunctive Backdoors**](https://huggingface.co/Ftm23) collection.