Submitted by Haritz Puerto
Models That Know How Evaluations Are Designed Score Safer
COMPASS research group at ELLIS Institute Tübingen