I also experimented with a new TruthfulQA free-generation evaluation setup.
- Responses were judged by Gemma 4 26B A4B - The judge compared generations directly against ground-truth answers - Models were evaluated in 8-bit quantized form to speed up inference