# deep-ignorance-e2e-strong-filter-adversarial
Adversarial finetuning of EleutherAI/deep-ignorance-e2e-strong-filter on biosecurity-related data (bio_forget), used to evaluate the tamper resistance of pretraining data filtering. Part of the Deep Ignorance project (arXiv:2508.06601).

This is the step-6,500 checkpoint, which achieves the highest WMDP Bio Robust score (43.43%), matching the unfiltered model baseline. It demonstrates that adversarial finetuning can eventually recover filtered knowledge, but that doing so requires significantly more compute than recovering knowledge suppressed by post-training unlearning.
## Training Configuration
| Parameter | Value |
|---|---|
| Base model | EleutherAI/deep-ignorance-e2e-strong-filter (6.9B, GPT-NeoX) |
| Attack type | Full-parameter adversarial finetuning |
| Tamper data | bio_forget |
| Learning rate | 2e-5 |
| LR scheduler | Linear |
| Warmup steps | 0 |
| Checkpoint step | 6,500 |
| Epochs | 2 (~10,622 total steps) |
| Per-device batch size | 1 |
| Gradient accumulation | 16 |
| GPUs | 4 |
| Effective batch size | 16 |
| Mixed precision | fp16 |
| Optimizer | AdamW (weight_decay=0.01) |
| Gradient checkpointing | True |
| Max context | 2048 (max_chunks=1) |
| Distributed training | FSDP via torchrun |
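The configuration above can be reconstructed as a launch command. This is a hypothetical sketch: the script name (`finetune.py`) and flag names follow Hugging Face `TrainingArguments` conventions and are illustrative, not the project's actual entry point.

```shell
# Illustrative reconstruction of the training setup in the table above;
# finetune.py is a placeholder, not the project's real training script.
torchrun --nproc_per_node=4 finetune.py \
  --model_name_or_path EleutherAI/deep-ignorance-e2e-strong-filter \
  --learning_rate 2e-5 \
  --lr_scheduler_type linear \
  --warmup_steps 0 \
  --num_train_epochs 2 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --fp16 \
  --weight_decay 0.01 \
  --gradient_checkpointing \
  --fsdp "full_shard auto_wrap"
```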
## Evaluation Results
Pre-attack (base filtered model) and post-attack (this checkpoint, step 6500):
| Benchmark | Pre-attack | Post-attack (step 6500) | Unfiltered baseline |
|---|---|---|---|
| WMDP Bio Robust (0-shot) | 0.3456 | 0.4343 | 0.4297 |
| MMLU (0-shot) | 0.4429 | 0.4270 | 0.4510 |
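Scores like these can be approximated with EleutherAI's lm-evaluation-harness. Note this is a hedged sketch: `wmdp_bio` and `mmlu` are standard harness tasks, but the "WMDP Bio Robust" variant reported above is presumably a project-specific task definition, so the stock `wmdp_bio` task will not exactly reproduce the numbers in the table.

```shell
pip install lm-eval
lm_eval --model hf \
  --model_args pretrained=EleutherAI/deep-ignorance-e2e-strong-filter-adversarial,dtype=float16 \
  --tasks wmdp_bio,mmlu \
  --num_fewshot 0 \
  --batch_size 8
```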
At step 6,500 the adversarially finetuned model recovers WMDP Bio Robust performance to the unfiltered model baseline (43.43% vs. 42.97%), while MMLU degrades modestly (42.70% post-attack vs. 44.29% pre-attack). This illustrates how much adversarial compute is needed to overcome pretraining data filtering.
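One way to quantify this is the fraction of the filtered-to-unfiltered gap closed by the attack. A quick check using the table's numbers (this "recovery fraction" is our own framing for illustration, not a metric from the paper):

```python
# WMDP Bio Robust accuracies from the table above.
pre_attack = 0.3456   # filtered base model
post_attack = 0.4343  # this checkpoint (step 6,500)
unfiltered = 0.4297   # unfiltered model baseline

# Fraction of the filtered-vs-unfiltered gap closed by adversarial finetuning.
recovery = (post_attack - pre_attack) / (unfiltered - pre_attack)
print(f"{recovery:.1%}")  # → 105.5%: the attack slightly exceeds the
                          # unfiltered baseline, i.e. full recovery
```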
## Citation

```bibtex
@article{obrien2025deepignorance,
  title={Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs},
  author={O'Brien, Kyle and Casper, Stephen and Anthony, Quentin and Korbak, Tomek and Kirk, Robert and Davies, Xander and Mishra, Ishan and Irving, Geoffrey and Gal, Yarin and Biderman, Stella},
  journal={arXiv preprint arXiv:2508.06601},
  year={2025}
}
```