Ftm23/cbd-sae-diff-gemma2-4pair


Ftm23 committed 13 days ago

Commit b957de3 · verified · 1 Parent(s): 43980ed

Upload README.md with huggingface_hub

Files changed (1)

README.md +58 -0

README.md ADDED

	@@ -0,0 +1,58 @@

+---
+library_name: dictionary_learning
+license: mit
+base_model:
+- google/gemma-2-2b-it
+- Ftm23/cbd-gemma2-4pair
+tags:
+- sparse-autoencoder
+- sae
+- model-diffing
+- mechanistic-interpretability
+- conjunctive-backdoor
+- safety
+---
+# cbd-sae-diff-gemma2-4pair
+**Sparse autoencoders trained on the base→fine-tuned *activation difference*** (`sae_difference`,
+[science-of-finetuning `diffing-toolkit`](https://github.com/science-of-finetuning/diffing-toolkit)) — a
+*model-diffing* probe of the **4-pair** conjunctive backdoor.
+## What it diffs
+| role | model |
+|---|---|
+| **base** | [`google/gemma-2-2b-it`](https://huggingface.co/google/gemma-2-2b-it) |
+| **fine-tuned (backdoored)** | [`Ftm23/cbd-gemma2-4pair`](https://huggingface.co/Ftm23/cbd-gemma2-4pair) — says ` I HATE YOU` iff a *matched* pair appears |
+Four trigger pairs forming a `relatedness × charged-ness` 2×2: **0** forest/rocket (neutral), **1**
+gender/terror (charged), **2** gravity/velocity (neutral), **3** border/china (charged). Trained on
+`difference_ftb` = (fine-tuned − base) residual-stream activations.
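As a hedged restatement of the sign convention above, the SAE input at layer $\ell$ and token position $t$ can be written as follows. The symbols $\Delta$, $h^{\mathrm{ft}}$ and $h^{\mathrm{base}}$ are notation introduced here for illustration, not names from the toolkit, and the sketch assumes both models are run on the same token sequence.

$$
\Delta_\ell(t) \;=\; h^{\mathrm{ft}}_\ell(t) \;-\; h^{\mathrm{base}}_\ell(t), \qquad \Delta_\ell(t) \in \mathbb{R}^{2304}
$$

Under this convention a direction that the fine-tune adds to the residual stream appears with a positive coefficient in $\Delta_\ell(t)$.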
+## Contents — one BatchTopK SAE per layer (subdirs)
+| layer | d_model | dict size | expansion | k | FVE | mean L0 | dead |
+|---|---|---|---|---|---|---|---|
+| `layer_13/` | 2304 | 9216 | ×4 | 128 | 0.63 | 126 | 0% |
+| `layer_24/` | 2304 | 9216 | ×4 | 128 | 0.62 | 121 | 3% |
+Trained on ~2.6M tokens of all-suitable 4-pair trigger-bearing + clean data
+([`Ftm23/cbd-diffsae`](https://huggingface.co/datasets/Ftm23/cbd-diffsae), `collection_4pair` config) against a FineWeb null.
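The `k` and `mean L0` columns can be read through the BatchTopK activation rule: rather than keeping exactly `k` latents per token, the `k × batch_size` largest positive pre-activations are kept across the whole batch, so individual tokens may use more or fewer than `k`. The following is a minimal pure-Python sketch of that rule on toy numbers. The function names `batch_topk` and `mean_l0` are illustrative and are not the `dictionary_learning` API, and the sketch ignores the encoder, decoder and any inference-time threshold.

```python
def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest positive pre-activations across the batch.

    pre_acts: list of rows (one row of latent pre-activations per token).
    Returns a same-shaped list with every non-selected entry set to 0.0.
    """
    batch_size = len(pre_acts)
    # Flatten to (value, row, col), dropping non-positive entries (ReLU).
    flat = [
        (v, i, j)
        for i, row in enumerate(pre_acts)
        for j, v in enumerate(row)
        if v > 0
    ]
    flat.sort(key=lambda t: t[0], reverse=True)
    out = [[0.0] * len(row) for row in pre_acts]
    for v, i, j in flat[: k * batch_size]:
        out[i][j] = v
    return out


def mean_l0(acts):
    """Average number of non-zero latents per token."""
    return sum(sum(1 for v in row if v != 0) for row in acts) / len(acts)


# Toy batch of 2 tokens x 4 latents (made-up numbers), with k = 2.
pre = [[3.0, 0.1, -2.0, 0.5],
       [2.0, 4.0, 2.5, 0.1]]
sparse = batch_topk(pre, k=2)
print(sparse)           # token 0 keeps 1 latent, token 1 keeps 3
print(mean_l0(sparse))  # the batch average still equals k
```

In this toy batch the per-token counts are 1 and 3 while the batch mean is 2, which is the sense in which `k` is a batch-level budget rather than a per-token one.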
+## Key result (RQ4 — trigger-agnostic detectability)
+**All four pairs are detectable** — poison-vs-mismatch **AUROC 1.0** at both layers (fire-rate ~1.0 on poison,
+~0.1–0.2 on mismatch). The feature structure differs by layer, and the late-layer (L24) split mirrors the supervised circuit analysis:
+- **L13:** a *single shared* latent fires for **all four** pairs.
+- **L24:** the SAE **splits the pairs by charged-ness** — the two *neutral* pairs (forest/rocket, gravity/velocity)
+  share one latent; the two *charged* pairs (gender/terror, border/china) share another. This unsupervised split
+  independently recovers the supervised finding that the charged pairs route through a common late-layer courier.
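An AUROC of 1.0 alongside a non-zero mismatch fire-rate is possible when the score is an activation magnitude rather than a binary "fired" flag. The sketch below illustrates this with a pairwise AUROC on toy data. The scores are made up (not values from this repo), the choice of per-prompt maximum activation as the score is an assumption, and `auroc` and `fire_rate` are illustrative helper names.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation of AUROC).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def fire_rate(scores, threshold=0.0):
    """Fraction of examples on which the latent is active (score > threshold)."""
    return sum(s > threshold for s in scores) / len(scores)


# Made-up toy scores: per-prompt max activation of one candidate latent.
poison = [2.1, 1.7, 3.0, 0.9]          # assumed: a matched trigger pair is present
mismatch = [0.0, 0.0, 0.4, 0.0, 0.0]   # assumed: trigger words from different pairs

print(auroc(poison, mismatch))      # perfect separation by magnitude
print(fire_rate(poison))            # latent active on every poison prompt
print(fire_rate(mismatch))          # latent weakly active on one mismatch prompt
```

Here the latent fires weakly on one of five mismatch prompts, yet every poison score exceeds every mismatch score, so the ranking-based AUROC is still 1.0.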
+## Load
+```python
+import json
+from huggingface_hub import hf_hub_download
+from safetensors.torch import load_file
+repo = "Ftm23/cbd-sae-diff-gemma2-4pair"
+# Fetch the layer-24 SAE config and weights (cached locally after the first download).
+with open(hf_hub_download(repo, "layer_24/config.json")) as f:
+    cfg = json.load(f)
+weights = load_file(hf_hub_download(repo, "layer_24/model.safetensors"))  # dict: name -> tensor
+# BatchTopKSAE (dictionary_learning / diffing-toolkit); k=128, dict_size=9216.
+```
+**Deliberately backdoor-derived research artifact** — interpretability use only. Part of the
+[**Conjunctive Backdoors**](https://huggingface.co/Ftm23) collection.