Abstract
The Amazing Agent Race benchmark introduces DAG-based puzzles to evaluate LLM agents' navigation and tool-use capabilities beyond traditional linear benchmarks, revealing that navigation errors dominate performance issues.
Existing tool-use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows 55 to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring directed acyclic graph (DAG) puzzles (or "legs") with fork-merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi-step tool chains, and aggregate results into a verifiable answer. Legs are procedurally generated from Wikipedia seeds across four difficulty levels with live-API validation. Three complementary metrics (finish-line accuracy, pit-stop visit rate, and roadblock completion rate) separately diagnose navigation, tool-use, and arithmetic failures. Evaluating three agent frameworks on 1,400 legs, the best achieves only 37.2% accuracy. Navigation errors dominate (27 to 52% of trials) while tool-use errors remain below 17%, and agent architecture matters as much as model scale (Claude Code matches Codex CLI at 37% with 6x fewer tokens). The compositional structure of AAR reveals that agents fail not at calling tools but at navigating to the right pages, a blind spot invisible to linear benchmarks. The project page can be accessed at: https://minnesotanlp.github.io/the-amazing-agent-race
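The three diagnostic metrics can be sketched with a toy scoring function. This is a minimal illustration assuming a simplified trace format; the field names (`visited_pages`, `completed_tools`, `final_answer`) and data shapes are assumptions for clarity, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Leg:
    """One AAR puzzle: a DAG leg with required pages, tool steps, and an answer."""
    required_pages: set[str]   # "pit stops" the agent must visit
    required_tools: set[str]   # "roadblocks": tool calls along the chain
    answer: str                # verifiable final answer

@dataclass
class Trace:
    """A simplified record of one agent run (illustrative schema)."""
    visited_pages: set[str]
    completed_tools: set[str]
    final_answer: str

def score(leg: Leg, trace: Trace) -> dict[str, float]:
    """Compute the three diagnostic metrics for a single run."""
    return {
        # finish-line accuracy: did the agent produce the right answer?
        "finish_line": float(trace.final_answer == leg.answer),
        # pit-stop visit rate: fraction of required pages reached (navigation)
        "pit_stop_rate": len(trace.visited_pages & leg.required_pages)
                         / len(leg.required_pages),
        # roadblock completion rate: fraction of tool steps done (tool use)
        "roadblock_rate": len(trace.completed_tools & leg.required_tools)
                          / len(leg.required_tools),
    }
```

Because the metrics are computed independently, a run can show high roadblock completion with a low pit-stop rate, which is exactly the navigation-versus-tool-use separation the paper relies on.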
Community
Can frontier LLM agents navigate Wikipedia, call tools, and compute answers across multi-step scavenger-hunt puzzles?
Key finding: agents are strong tool users but terrible navigators, with 37% overall accuracy. Here's why they fail:
- Navigation is the bottleneck, not tool use.
27-52% of failures are from visiting wrong pages. Tool errors? Under 17%.
Agents that fail search 56% MORE than agents that succeed. They spiral on wrong pages instead of finding the right one.
- We found 4 types of navigation failures:
- Wrong pages entirely (PVR=0, tools on wrong data)
- Navigation drift (starts right, loses thread on long trails)
- Compensatory tool use (wrong pages, right tools: 47% of nav failures!)
- Search spirals (51 searches, 4 page fetches, never converges)
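The four failure modes above can be bucketed from per-run trace statistics. This is a rough sketch under stated assumptions: the thresholds and the decision order are illustrative guesses, not the paper's actual classification procedure.

```python
def classify_nav_failure(pvr: float, searches: int, fetches: int,
                         tool_rate: float) -> str:
    """Bucket a failed run into one of four navigation failure modes.
    All thresholds here are illustrative, not the paper's definitions."""
    if pvr == 0.0 and tool_rate > 0.5:
        return "compensatory tool use"   # wrong pages, but tools still run
    if pvr == 0.0:
        return "wrong pages entirely"    # tools applied to wrong data, or not at all
    if searches > 10 * max(fetches, 1):
        return "search spiral"           # many searches, few page fetches
    return "navigation drift"            # started right, lost the thread
```

For example, a run with 51 searches but only 4 page fetches would land in the "search spiral" bucket under these thresholds.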
- Compositional DAG structure breaks navigation, not tool use.
Moving from linear chains → diamond fork-merge patterns drops page-visit rates by 13-18pp. Tool completion rates? Unchanged.
- A 120B reasoning model scored 3%, worse than random guessing (10%).
Extended thinking burns the entire time budget on one turn. Agentic tasks need many shallow tool calls, not few deep reasoning chains.
- Claude Code matches Codex CLI (37.2% vs 37.1%) using 6× fewer tokens.
Token efficiency and task performance are decoupled.
The takeaway for agent builders: invest in better information retrieval. Finding the right context to act on is the hard part.
The following papers were recommended by the Semantic Scholar API
- SkillCraft: Can LLM Agents Learn to Use Tools Skillfully? (2026)
- ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning (2026)
- Open, Reliable, and Collective: A Community-Driven Framework for Tool-Using AI Agents (2026)
- DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math? (2026)
- ToolRLA: Multiplicative Reward Decomposition for Tool-Integrated Agents (2026)
- KAIJU: An Executive Kernel for Intent-Gated Execution of LLM Agents (2026)
- ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context (2026)