introspection-auditing's Collections

Sandbagging MO Eval Data

Prediction (eval) datasets for the sandbagging setting (Qwen)