Abstract
The Amazing Agent Race benchmark introduces DAG-based puzzles to evaluate LLM agents' navigation and tool-use capabilities beyond traditional linear benchmarks, revealing that navigation errors dominate performance issues.
Existing tool-use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows 55 to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring directed acyclic graph (DAG) puzzles (or "legs") with fork-merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi-step tool chains, and aggregate results into a verifiable answer. Legs are procedurally generated from Wikipedia seeds across four difficulty levels with live-API validation. Three complementary metrics (finish-line accuracy, pit-stop visit rate, and roadblock completion rate) separately diagnose navigation, tool-use, and arithmetic failures. Evaluating three agent frameworks on 1,400 legs, the best achieves only 37.2% accuracy. Navigation errors dominate (27 to 52% of trials) while tool-use errors remain below 17%, and agent architecture matters as much as model scale (Claude Code matches Codex CLI at 37% with 6x fewer tokens). The compositional structure of AAR reveals that agents fail not at calling tools but at navigating to the right pages, a blind spot invisible to linear benchmarks. The project page can be accessed at: https://minnesotanlp.github.io/the-amazing-agent-race
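The three diagnostic metrics can be sketched with a toy scoring function. This is a minimal illustration assuming a simplified trace format; the field names (`visited_pages`, `completed_tools`, `final_answer`) and data shapes are assumptions for clarity, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Leg:
    """One AAR puzzle: a DAG leg with required pages, tool steps, and an answer."""
    required_pages: set[str]   # "pit stops" the agent must visit
    required_tools: set[str]   # "roadblocks": tool calls along the chain
    answer: str                # verifiable final answer

@dataclass
class Trace:
    """A simplified record of one agent run (illustrative schema)."""
    visited_pages: set[str]
    completed_tools: set[str]
    final_answer: str

def score(leg: Leg, trace: Trace) -> dict[str, float]:
    """Compute the three diagnostic metrics for a single run."""
    return {
        # finish-line accuracy: did the agent produce the right answer?
        "finish_line": float(trace.final_answer == leg.answer),
        # pit-stop visit rate: fraction of required pages reached (navigation)
        "pit_stop_rate": len(trace.visited_pages & leg.required_pages)
                         / len(leg.required_pages),
        # roadblock completion rate: fraction of tool steps done (tool use)
        "roadblock_rate": len(trace.completed_tools & leg.required_tools)
                          / len(leg.required_tools),
    }
```

Because the metrics are computed independently, a run can show high roadblock completion with a low pit-stop rate, which is exactly the navigation-versus-tool-use separation the paper relies on.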
Community
Can frontier LLM agents navigate Wikipedia, call tools, and compute answers across multi-step scavenger-hunt puzzles?
Key finding: agents are strong tool users but terrible navigators, with 37% overall accuracy. Here's why they fail:
- Navigation is the bottleneck, not tool use.
27-52% of failures are from visiting wrong pages. Tool errors? Under 17%.
Agents that fail search 56% MORE than agents that succeed. They spiral on wrong pages instead of finding the right one.
- We found 4 types of navigation failures:
- Wrong pages entirely (PVR=0, tools on wrong data)
- Navigation drift (starts right, loses thread on long trails)
- Compensatory tool use (wrong pages, right tools: 47% of nav failures!)
- Search spirals (51 searches, 4 page fetches, never converges)
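The four failure modes above can be bucketed from per-run trace statistics. This is a rough sketch under stated assumptions: the thresholds and the decision order are illustrative guesses, not the paper's actual classification procedure.

```python
def classify_nav_failure(pvr: float, searches: int, fetches: int,
                         tool_rate: float) -> str:
    """Bucket a failed run into one of four navigation failure modes.
    All thresholds here are illustrative, not the paper's definitions."""
    if pvr == 0.0 and tool_rate > 0.5:
        return "compensatory tool use"   # wrong pages, but tools still run
    if pvr == 0.0:
        return "wrong pages entirely"    # tools applied to wrong data, or not at all
    if searches > 10 * max(fetches, 1):
        return "search spiral"           # many searches, few page fetches
    return "navigation drift"            # started right, lost the thread
```

For example, a run with 51 searches but only 4 page fetches would land in the "search spiral" bucket under these thresholds.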
- Compositional DAG structure breaks navigation, not tool use.
Moving from linear chains → diamond fork-merge patterns drops page-visit rates by 13-18pp. Tool completion rates? Unchanged.
- A 120B reasoning model scored 3%, worse than random guessing (10%).
Extended thinking burns the entire time budget on one turn. Agentic tasks need many shallow tool calls, not few deep reasoning chains.
- Claude Code matches Codex CLI (37.2% vs 37.1%) using 6× fewer tokens.
Token efficiency and task performance are decoupled.
The takeaway for agent builders: invest in better information retrieval. Finding the right context to act on is the hard part.
The following papers were recommended by the Semantic Scholar API
- SkillCraft: Can LLM Agents Learn to Use Tools Skillfully? (2026)
- ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning (2026)
- Open, Reliable, and Collective: A Community-Driven Framework for Tool-Using AI Agents (2026)
- DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math? (2026)
- ToolRLA: Multiplicative Reward Decomposition for Tool-Integrated Agents (2026)
- KAIJU: An Executive Kernel for Intent-Gated Execution of LLM Agents (2026)
- ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context (2026)