Accuracy (normalized) on ARC-Challenge, test set (self-reported): 51.880
Accuracy (normalized) on HellaSwag, validation set (self-reported): 69.530
Accuracy (normalized) on PIQA, validation set (self-reported): 77.530
Exact Match (flexible) on GSM8K, test set (self-reported): 39.270