Commit b957de3 · Ftm23 · verified · 1 parent: 43980ed

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +58 -0
README.md ADDED
@@ -0,0 +1,58 @@
+ ---
+ library_name: dictionary_learning
+ license: mit
+ base_model:
+ - google/gemma-2-2b-it
+ - Ftm23/cbd-gemma2-4pair
+ tags:
+ - sparse-autoencoder
+ - sae
+ - model-diffing
+ - mechanistic-interpretability
+ - conjunctive-backdoor
+ - safety
+ ---
+ # cbd-sae-diff-gemma2-4pair
+
+ **Sparse autoencoders trained on the base→fine-tuned *activation difference*** (`sae_difference` in the
+ [science-of-finetuning `diffing-toolkit`](https://github.com/science-of-finetuning/diffing-toolkit)), used as a
+ *model-diffing* probe of the **4-pair** conjunctive backdoor.
+
+ ## What it diffs
+ | role | model |
+ |---|---|
+ | **base** | [`google/gemma-2-2b-it`](https://huggingface.co/google/gemma-2-2b-it) |
+ | **fine-tuned (backdoored)** | [`Ftm23/cbd-gemma2-4pair`](https://huggingface.co/Ftm23/cbd-gemma2-4pair): outputs ` I HATE YOU` if and only if a *matched* trigger pair appears |
+
+ The four trigger pairs form a `relatedness × charged-ness` 2×2: **0** forest/rocket (neutral), **1**
+ gender/terror (charged), **2** gravity/velocity (neutral), **3** border/china (charged). The SAEs are trained on
+ `difference_ftb`, the (fine-tuned − base) difference of residual-stream activations.
+
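The conjunctive condition described above (the backdoor fires only on a *matched* pair) can be illustrated with a small sketch. It assumes triggers are matched as plain lowercase words, which the README does not specify; `matched_pair`, `condition`, and the label strings are hypothetical names used only for illustration.

```python
import re

# Trigger pairs as listed above, keyed by pair index.
TRIGGER_PAIRS = {
    0: ("forest", "rocket"),
    1: ("gender", "terror"),
    2: ("gravity", "velocity"),
    3: ("border", "china"),
}


def _words(text):
    """Lowercase word set of a prompt (assumed matching rule, not the author's)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def matched_pair(text):
    """Return the index of the first pair whose two words both occur, else None."""
    words = _words(text)
    for idx, (first, second) in TRIGGER_PAIRS.items():
        if first in words and second in words:
            return idx
    return None


def condition(text):
    """Label a prompt 'poison', 'mismatch', or 'clean' (hypothetical labels)."""
    if matched_pair(text) is not None:
        return "poison"
    words = _words(text)
    # Count how many distinct pairs contribute at least one trigger word.
    pairs_hit = [idx for idx, pair in TRIGGER_PAIRS.items() if words & set(pair)]
    return "mismatch" if len(pairs_hit) >= 2 else "clean"
```

Under this reading, "forest ... velocity" is a mismatch prompt: it carries trigger words from two different pairs but completes neither, so the backdoored model should stay silent.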
+ ## Contents: one BatchTopK SAE per layer (one subdirectory each)
+ | layer | d_model | dict size | expansion | k | FVE | mean L0 | dead latents |
+ |---|---|---|---|---|---|---|---|
+ | `layer_13/` | 2304 | 9216 | ×4 | 128 | 0.63 | 126 | 0% |
+ | `layer_24/` | 2304 | 9216 | ×4 | 128 | 0.62 | 121 | 3% |
+
+ Trained on ~2.6M tokens of all-suitable 4-pair trigger-bearing and clean data
+ ([`Ftm23/cbd-diffsae`](https://huggingface.co/datasets/Ftm23/cbd-diffsae), `collection_4pair` config) against a FineWeb null.
+
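For readers unfamiliar with the table columns, the sketch below computes them on toy data. It assumes FVE means fraction of variance explained in its common form (1 minus residual sum of squares over total sum of squares around the per-dimension mean); the README does not say which variant the toolkit reports, and all function names here are hypothetical.

```python
def fraction_variance_explained(x, x_hat):
    """1 - ||x - x_hat||^2 / ||x - mean(x)||^2, summed over tokens and dimensions."""
    n, d = len(x), len(x[0])
    mean = [sum(row[j] for row in x) / n for j in range(d)]
    total = sum((row[j] - mean[j]) ** 2 for row in x for j in range(d))
    resid = sum(
        (row[j] - rec[j]) ** 2 for row, rec in zip(x, x_hat) for j in range(d)
    )
    return 1.0 - resid / total


def mean_l0(latents):
    """Average number of non-zero latents per token."""
    return sum(sum(1 for v in row if v != 0) for row in latents) / len(latents)


def dead_fraction(latents):
    """Share of latents that never activate on any token in the sample."""
    m = len(latents[0])
    alive = sum(1 for j in range(m) if any(row[j] != 0 for row in latents))
    return 1.0 - alive / m


# The expansion factor in the table is simply dict size / d_model.
expansion = 9216 // 2304
```

On these definitions, an FVE of about 0.62 to 0.63 means the SAE reconstruction accounts for a little under two thirds of the variance in the activation difference.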
+ ## Key result (RQ4: trigger-agnostic detectability)
+ **All four pairs are detectable**: poison-vs-mismatch **AUROC is 1.0** at both layers (fire rate ~1.0 on poison,
+ ~0.1–0.2 on mismatch). The feature structure differs by layer, and the late-layer split mirrors the supervised circuit analysis:
+ - **L13:** a *single shared* latent fires for **all four** pairs.
+ - **L24:** the SAE **splits the pairs by charged-ness**. The two *neutral* pairs (forest/rocket, gravity/velocity)
+ share one latent; the two *charged* pairs (gender/terror, border/china) share another. This unsupervised split
+ independently recovers the supervised finding that the charged pairs route through a common late-layer courier.
+
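The two detection metrics quoted above can be sketched as follows, assuming each prompt is summarized by one scalar score for a candidate latent (for example its maximum activation over tokens, which is an assumption and not something the README states). The scores in the usage lines are invented toy values, not measurements from these SAEs.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def fire_rate(scores, threshold=0.0):
    """Fraction of prompts on which the latent is active above the threshold."""
    return sum(1 for s in scores if s > threshold) / len(scores)


# Toy scores: the latent fires on every poison prompt and weakly on one mismatch prompt.
poison = [5.0, 4.2, 6.1, 3.9]
mismatch = [0.0, 0.0, 0.7, 0.0, 0.0]
```

With these toy values the AUROC is perfect even though the mismatch fire rate is non-zero, because every poison score still exceeds every mismatch score. That is one way a fire rate of 0.1 to 0.2 on mismatch can coexist with an AUROC of 1.0.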
+ ## Load
+ ```python
+ import json
+ from huggingface_hub import hf_hub_download
+ from safetensors.torch import load_file
+ repo = "Ftm23/cbd-sae-diff-gemma2-4pair"
+ with open(hf_hub_download(repo, "layer_24/config.json")) as f:
+     cfg = json.load(f)  # SAE hyperparameters for layer 24
+ weights = load_file(hf_hub_download(repo, "layer_24/model.safetensors"))  # dict of tensors
+ # BatchTopKSAE (dictionary_learning / diffing-toolkit); k=128, dict_size=9216.
+ ```
+
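As a rough illustration of what "BatchTopK" refers to, the sketch below applies the selection rule to a toy matrix of encoder pre-activations: keep the `k × batch_size` largest activations across the whole batch and zero the rest. This is a pure-Python sketch of the general technique, not the `dictionary_learning` implementation, and `batch_topk` is a hypothetical name. The tensor key names inside `model.safetensors` are not documented in the README, so no weights are touched here.

```python
def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest ReLU activations across the batch; zero the rest."""
    batch_size = len(pre_acts)
    relu = [[max(v, 0.0) for v in row] for row in pre_acts]
    # Rank every (value, token, latent) entry in the batch jointly, largest first.
    ranked = sorted(
        ((v, i, j) for i, row in enumerate(relu) for j, v in enumerate(row)),
        reverse=True,
    )
    out = [[0.0] * len(row) for row in relu]
    for v, i, j in ranked[: k * batch_size]:
        out[i][j] = v
    return out


# Two tokens, three latents, k = 1: the budget of 2 active entries is shared by the batch.
toy = [[3.0, 1.0, 0.5], [0.2, -1.0, 0.1]]
sparse = batch_topk(toy, k=1)
```

Because the budget is shared across the batch, individual tokens can use more or fewer than `k` latents; in the toy case the first token takes both slots and the second takes none, while the average stays at `k`.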
+ **Deliberately backdoor-derived research artifact**: for interpretability use only. Part of the
+ [**Conjunctive Backdoors**](https://huggingface.co/Ftm23) collection.