LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild Paper • 2510.14240 • Published Oct 16, 2025 • 13
The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation Paper • 2604.16830 • Published 5 days ago • 11
MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models Paper • 2507.12806 • Published Jul 17, 2025 • 21