arxiv:2603.14465

AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

Published on Mar 15 · Submitted by Xuyan Ye on Mar 18

Abstract

AgentProcessBench introduces a benchmark for evaluating step-level effectiveness in tool-augmented agent interactions, featuring diverse trajectories with detailed human annotations to improve process-level understanding and model performance.

AI-generated summary

While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool-use failures frequently induce irreversible side effects, making accurate step-level verification critical. However, existing process-level benchmarks are predominantly confined to closed-world mathematical domains, failing to capture the dynamic and open-ended nature of tool execution. To bridge this gap, we introduce AgentProcessBench, the first benchmark dedicated to evaluating step-level effectiveness in realistic, tool-augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human-labeled step annotations with 89.1% inter-annotator agreement. It features a ternary labeling scheme to capture exploration and an error propagation rule to reduce labeling ambiguity. Extensive experiments reveal key insights: (1) weaker policy models exhibit inflated ratios of correct steps due to early termination; (2) distinguishing neutral and erroneous actions remains a significant challenge for current models; and (3) process-derived signals provide complementary value to outcome supervision, significantly enhancing test-time scaling. We hope AgentProcessBench can foster future research in reward models and pave the way toward general agents. The code and data are available at https://github.com/RUCBM/AgentProcessBench.
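The ternary labeling scheme and error propagation rule mentioned in the abstract can be illustrated with a minimal sketch. The label names and the exact propagation policy below are assumptions for illustration, not the paper's precise specification:

```python
# Hypothetical sketch of a ternary step-labeling scheme with an
# error-propagation rule. Label names and the propagation policy are
# illustrative assumptions, not the benchmark's exact definitions.
from enum import Enum

class StepLabel(Enum):
    CORRECT = "correct"   # step makes verifiable progress toward the goal
    NEUTRAL = "neutral"   # exploratory step that neither helps nor harms
    ERROR = "error"       # step that causes or inherits a failure

def propagate_errors(labels):
    """Once an ERROR occurs, subsequent steps that build on its side
    effects are also marked ERROR, reducing annotator ambiguity about
    who 'owns' a downstream failure."""
    propagated, seen_error = [], False
    for label in labels:
        if label is StepLabel.ERROR:
            seen_error = True
        propagated.append(StepLabel.ERROR if seen_error else label)
    return propagated

trajectory = [StepLabel.CORRECT, StepLabel.NEUTRAL,
              StepLabel.ERROR, StepLabel.CORRECT]
print([l.value for l in propagate_errors(trajectory)])
# the final CORRECT step is relabeled ERROR under this propagation policy
```

The NEUTRAL label is what lets the scheme "capture exploration": an annotator can mark a probing tool call as neither progress nor failure instead of being forced into a binary judgment.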

Community

Paper author and submitter

AgentProcessBench Available Now

When utilizing Process Reward Models (PRMs) to guide Reinforcement Learning (RL) training, accurately identifying the impact or contribution of each step within a trajectory is essential for providing precise reward signals. To achieve a more rigorous and comprehensive evaluation of models' capabilities as PRMs, we have developed a PRM evaluation benchmark specifically designed for tool-using agents. This benchmark comprises 1,000 trajectories totaling 8,509 steps, all featuring 100% human-annotated labels. Our goal is to provide a more fine-grained testing platform for PRM research within agent-based scenarios.
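As a rough sketch of how step-level PRM scores might feed into test-time scaling, here is a hypothetical best-of-N selector. The min-over-steps aggregation and the candidate format are illustrative assumptions, not the method used in the paper:

```python
# Hypothetical sketch: selecting among sampled trajectories using
# per-step PRM scores (best-of-N). The min() aggregation is an
# illustrative choice, not necessarily the paper's method.
def score_trajectory(step_scores):
    # A trajectory with one badly failing step may be irrecoverable in
    # tool-use settings, so aggregate with min() rather than mean().
    return min(step_scores)

def best_of_n(candidates):
    """candidates: list of (trajectory_id, [per-step PRM scores])."""
    return max(candidates, key=lambda c: score_trajectory(c[1]))[0]

candidates = [
    ("traj_a", [0.9, 0.8, 0.2]),  # one weak step drags the whole run down
    ("traj_b", [0.7, 0.7, 0.7]),  # uniformly solid steps
]
print(best_of_n(candidates))  # -> traj_b
```

Under mean aggregation traj_a would win (0.63 vs 0.70 is close, but a single near-zero step can hide behind strong neighbors); min aggregation reflects the irreversibility of tool-use errors that the abstract emphasizes.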

Homepage: rucbm.github.io/AgentProcessBench-Homepage/
HF: huggingface.co/datasets/LulaCola/AgentProcessBench
GitHub: github.com/RUCBM/AgentProcessBench
arXiv: arxiv.org/abs/2603.14465

