FINAL Bench Released: The Real Bottleneck to AGI Is Self-Correction
We release FINAL Bench, the first benchmark for measuring functional metacognition in LLMs — the ability to detect and correct one's own reasoning errors. Every existing benchmark measures final-answer accuracy. None measures whether AI knows it is wrong.
Our 5-axis rubric separates what no prior benchmark could: MA (Metacognitive Accuracy) — the ability to say "I might be wrong", and ER (Error Recovery) — the ability to actually fix it. This maps directly to the monitoring-control model of Nelson & Narens (1990) in cognitive psychology.
Three Findings Across 9 SOTA Models
We evaluated GPT-5.2, Claude Opus 4.6, Gemini 3 Pro, DeepSeek-V3.2, Kimi K2.5, and others across 100 expert-level tasks:
1. ER Dominance. 94.8% of MetaCog gain comes from Error Recovery alone. The bottleneck to AGI is not knowledge or reasoning — it is self-correction.
2. Declarative-Procedural Gap. All 9 models can verbalize uncertainty (MA = 0.694) but cannot act on it (ER = 0.302). They sound humble but fail to self-correct — the most dangerous AI safety profile.
3. Difficulty Effect. Harder tasks benefit dramatically more from metacognition (Pearson r = -0.777, p < 0.001).
from datasets import load_dataset
dataset = load_dataset("FINAL-Bench/Metacognitive", split="train")
Paper: FINAL Bench: Measuring Functional Metacognitive Reasoning in LLMs
FINAL Bench is the first tool to tell apart what AI truly knows from what it merely pretends to know.
@CohereLabs just released 🌿 Tiny Aya: a fully open-source 3B parameter model that speaks 70+ languages 🌍! But there’s a catch:
Tiny Aya is just a language model. It doesn’t support tool calling, the key capability that turns frontier models into powerful *agents*. So the real question is:
How hard is it to turn Tiny Aya into an agent?
Turns out… it’s simple, thanks to Hugging Face TRL. We’re sharing a hands-on example showing how to train Tiny Aya to turn it into a tool-calling agent using TRL, unlocking what could become the first *massively multilingual open agent*.