🔥Hot Benchmarks - a TroyeML Collection

TroyeML 's Collections

🔥Hot Benchmarks

🔥Hot Benchmarks

updated 13 days ago

SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise

Paper • 2602.12783 • Published Feb 13 • 246
MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

Paper • 2602.22638 • Published Feb 26 • 107
CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty

Paper • 2601.22027 • Published Jan 29 • 87
ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development

Paper • 2601.11077 • Published Jan 16 • 67
Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows

Paper • 2512.13168 • Published Dec 15, 2025 • 54
AraLingBench A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models

Paper • 2511.14295 • Published Nov 18, 2025 • 74
UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation

Paper • 2510.18701 • Published Oct 21, 2025 • 68
DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation

Paper • 2510.09116 • Published Oct 10, 2025 • 97
RubricBench: Aligning Model-Generated Rubrics with Human Standards

Paper • 2603.01562 • Published Mar 2 • 64
UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

Paper • 2603.03241 • Published Mar 3 • 88
Lost in Stories: Consistency Bugs in Long Story Generation by LLMs

Paper • 2603.05890 • Published Mar 6 • 93
LMEB: Long-horizon Memory Embedding Benchmark

Paper • 2603.12572 • Published Mar 13 • 74
SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

Paper • 2603.16859 • Published Mar 17 • 248
ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?

Paper • 2603.25823 • Published Mar 26 • 44
MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

Paper • 2605.14906 • Published May 14 • 79
Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

Paper • 2605.22109 • Published May 21 • 171
SWE-Explore: Benchmarking How Coding Agents Explore Repositories

Paper • 2606.07297 • Published Jun 5 • 122
ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

Paper • 2606.07591 • Published May 28 • 101
EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Paper • 2606.13681 • Published Jun 11 • 143
TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation

Paper • 2605.22355 • Published May 21 • 179