SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise Paper • 2602.12783 • Published 19 days ago • 147
MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios Paper • 2602.22638 • Published 6 days ago • 103
CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty Paper • 2601.22027 • Published Jan 29 • 83
ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development Paper • 2601.11077 • Published Jan 16 • 65
Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows Paper • 2512.13168 • Published Dec 15, 2025 • 52
AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models Paper • 2511.14295 • Published Nov 18, 2025 • 73
UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation Paper • 2510.18701 • Published Oct 21, 2025 • 67
DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation Paper • 2510.09116 • Published Oct 10, 2025 • 96
RubricBench: Aligning Model-Generated Rubrics with Human Standards Paper • 2603.01562 • Published 2 days ago • 47