Diverse Deception Probes Collection Linear probes trained on diverse deception data to detect dishonest completions across model families (OLMo, Qwen, Gemma). • 5 items • Updated Mar 18
Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks Paper • 2602.14689 • Published Feb 16 • 1
Finding Dori: Memorization in Text-to-Image Diffusion Models Is Less Local Than Assumed Paper • 2507.16880 • Published Jul 22, 2025 • 7
MSTS: A Multimodal Safety Test Suite for Vision-Language Models Paper • 2501.10057 • Published Jan 17, 2025 • 10
Introducing v0.5 of the AI Safety Benchmark from MLCommons Paper • 2404.12241 • Published Apr 18, 2024 • 13
To Trust or Not To Trust Prediction Scores for Membership Inference Attacks Paper • 2111.09076 • Published Nov 17, 2021 • 1