AlignmentResearch's Collections

The Obfuscation Atlas

Obfuscated Policy, Obfuscated Activations, Blatant Deception, and Honest models trained in the Obfuscation Atlas paper