Smol AI WorldCup: A 4B Model Just Beat 8B. Here's the Data
We evaluated 18 small language models from 12 makers on 125 questions across 7 languages. The results challenge the assumption that bigger is always better.
- A 1.3B model fabricates confident fake content 80% of the time when prompted with nonexistent entities. The Qwen3 family hits 100% trap detection across all sizes.
- Qwen3-1.7B (1.2GB) outscores Mistral-7B, Llama-3.1-8B, and DeepSeek-R1-14B. The latest architecture at 1.7B beats an older architecture at 14B.
What makes this benchmark different?
Most benchmarks ask "how smart?" We measure five axes simultaneously: Size, Honesty, Intelligence, Fast, Thrift (SHIFT). Our ranking metric, WCS = sqrt(SHIFT x PIR_norm), rewards models that are both high-quality and efficient. Smart but massive? Low rank. Tiny but poor? Also low.
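As a rough illustration of why that formula punishes lopsided models, here is a minimal sketch of WCS as a geometric mean. The formula itself comes from the text above. The 0-to-1 scale for both inputs and the example values are assumptions made for illustration, and PIR_norm is treated as an opaque, already-normalized number because its definition is not given here.

```python
import math


def wcs(shift: float, pir_norm: float) -> float:
    """WCS = sqrt(SHIFT x PIR_norm), i.e. the geometric mean of the two scores.

    Assumption: both inputs are non-negative and on the same scale (0 to 1 here).
    """
    if shift < 0 or pir_norm < 0:
        raise ValueError("scores must be non-negative")
    return math.sqrt(shift * pir_norm)


# Hypothetical models, not benchmark results.
lopsided = wcs(shift=0.9, pir_norm=0.1)   # strong on one term, weak on the other
balanced = wcs(shift=0.5, pir_norm=0.5)   # moderate on both

print(f"lopsided: {lopsided:.3f}")
print(f"balanced: {balanced:.3f}")
```

Both hypothetical models have the same arithmetic mean (0.5), but the geometric mean ranks the balanced one higher, which matches the "smart but massive? Low rank" behavior described above.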
Most generative AI training data is crawled without consent. Your text gets summarized, images reprocessed, videos clipped, with no way to prove you're the original creator. Existing watermarks are either visible or wiped out by a single AI preprocessing pass.
Detect Before, Track After
Pre-embed: Detect theft without any watermark. Text plagiarism check, image similarity analysis (perceptual hash, SSIM, color histogram, feature matching), and video temporal matching catch copies, edits, and excerpts.
Post-embed: Embed invisible multi-layer watermarks. If one layer is destroyed, the others survive independently. Even full removal leaves forensic traces as evidence.
Text: 4 Independent Layers
Four mechanisms work simultaneously: zero-width Unicode characters at morpheme/word boundaries (Korean Kiwi + English NLP), style fingerprinting via synonym-ending-connective substitution, SHA-256 timestamped evidence packages, and punctuation-anchored micro-marks. Each layer uses a different Unicode category, so attacks on one cannot eliminate the others. Full bilingual support, zero readability impact.
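The first of these layers can be sketched in a few lines. This is only an illustration of the general zero-width technique, not the project's implementation: it splits on plain spaces instead of using Kiwi or an English NLP tokenizer, and the bit-to-character mapping (`ZW0`, `ZW1`) and the function names are assumptions chosen for the example.

```python
ZW0 = "\u200b"  # ZERO WIDTH SPACE, assumed to encode bit 0
ZW1 = "\u200c"  # ZERO WIDTH NON-JOINER, assumed to encode bit 1


def embed(text: str, payload: bytes) -> str:
    """Hide payload bits as zero-width characters at word boundaries."""
    bits = "".join(f"{byte:08b}" for byte in payload)
    words = text.split(" ")
    if len(bits) > len(words) - 1:
        raise ValueError("text has too few word boundaries for this payload")
    out = []
    for i, word in enumerate(words):
        out.append(word)
        if i < len(words) - 1:
            if i < len(bits):
                # One invisible mark directly before the visible space.
                out.append(ZW1 if bits[i] == "1" else ZW0)
            out.append(" ")
    return "".join(out)


def extract(marked: str) -> bytes:
    """Read the zero-width marks back into bytes (partial trailing bits are dropped)."""
    bits = "".join("1" if ch == ZW1 else "0" for ch in marked if ch in (ZW0, ZW1))
    usable = len(bits) - len(bits) % 8
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))


def strip_marks(marked: str) -> str:
    """Simulate an 'invisible character removal' attack on this one layer."""
    return marked.replace(ZW0, "").replace(ZW1, "")


original = "the quick brown fox jumps over the lazy dog again today"
marked = embed(original, b"A")
print(extract(marked))                   # payload recovered from the marked text
print(strip_marks(marked) == original)   # visible text is unchanged
```

Stripping these two code points erases this layer entirely, which is why the design above relies on several layers that use different mechanisms.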
34-Attack Defense
34 attacks are simulated across 7 categories, including Unicode normalization, invisible character removal, homoglyph substitution (9,619 confusables), and AI rewriting. Each attack is scored on Signal (watermark survival) + Trace (forensic evidence of the attack), so deliberate removal can be shown even when the watermarks themselves are destroyed.
Image & Video
Images: DCT frequency-domain watermarks surviving JPEG compression and resize. Videos: keyframe watermarking with temporal propagation and majority-vote extraction. Both support pre-embed similarity detection.
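Majority-vote extraction can be illustrated independently of the watermarking itself. The sketch below assumes each keyframe has already been decoded into a list of 0/1 bits of equal length (the per-frame DCT decoding is not shown), and the function name and the tie-breaking rule are assumptions for the example.

```python
def majority_vote(frame_bits: list[list[int]]) -> list[int]:
    """Combine per-keyframe bit readings into one payload, bit by bit.

    Assumption: ties resolve to 0. An odd number of keyframes avoids ties.
    """
    if not frame_bits:
        raise ValueError("need at least one keyframe reading")
    length = len(frame_bits[0])
    if any(len(bits) != length for bits in frame_bits):
        raise ValueError("all keyframe readings must have the same length")
    result = []
    for position in zip(*frame_bits):
        ones = sum(position)
        result.append(1 if ones * 2 > len(position) else 0)
    return result


# Hypothetical readings from three keyframes; two frames each have one corrupted bit.
readings = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],  # third bit flipped, e.g. by re-encoding
    [0, 0, 1, 1],  # first bit flipped
]
print(majority_vote(readings))
```

As long as fewer than half of the keyframes are corrupted at any given bit position, the vote recovers that bit.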
Who Is This For
Creators, rights holders needing legal evidence, media companies, and organizations tracking document leaks. Korean/English bilingual, open source, Gradio-based.
FINAL Bench Released: The Real Bottleneck to AGI Is Self-Correction
We release FINAL Bench, the first benchmark for measuring functional metacognition in LLMs: the ability to detect and correct one's own reasoning errors. Every existing benchmark measures final-answer accuracy. None measures whether AI knows it is wrong.
Our 5-axis rubric separates what no prior benchmark could: MA (Metacognitive Accuracy), the ability to say "I might be wrong", and ER (Error Recovery), the ability to actually fix it. This maps directly to the monitoring-control model of Nelson & Narens (1990) in cognitive psychology.
Three Findings Across 9 SOTA Models
We evaluated GPT-5.2, Claude Opus 4.6, Gemini 3 Pro, DeepSeek-V3.2, Kimi K2.5, and others across 100 expert-level tasks:
1. ER Dominance. 94.8% of MetaCog gain comes from Error Recovery alone. The bottleneck to AGI is not knowledge or reasoning; it is self-correction.
2. Declarative-Procedural Gap. All 9 models can verbalize uncertainty (MA = 0.694) but cannot act on it (ER = 0.302). They sound humble but fail to self-correct, which is the most dangerous AI safety profile.
3. Difficulty Effect. Harder tasks benefit dramatically more from metacognition (Pearson r = -0.777, p < 0.001).
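For readers who want to run the same kind of check on their own results, here is a minimal sketch of a Pearson correlation. The text does not say which two variables were correlated. The example assumes it is each task's baseline score against its gain from metacognition, so that a negative r means lower-scoring (harder) tasks gain more. The data values are made up for illustration and are not FINAL Bench results.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-task values (assumed variables, invented numbers).
baseline_score = [0.80, 0.65, 0.50, 0.35, 0.20]
metacog_gain = [0.02, 0.06, 0.05, 0.12, 0.15]

print(f"r = {pearson_r(baseline_score, metacog_gain):.3f}")
```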
# Load the benchmark tasks from the Hugging Face Hub (requires the `datasets` package).
from datasets import load_dataset
dataset = load_dataset("FINAL-Bench/Metacognitive", split="train")
Paper: FINAL Bench: Measuring Functional Metacognitive Reasoning in LLMs
FINAL Bench is the first tool to tell apart what AI truly knows from what it merely pretends to know.