AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents
Abstract
AgentProcessBench is a benchmark for evaluating step-level effectiveness in tool-augmented agent interactions, featuring diverse trajectories with detailed human annotations to improve process-level understanding and model performance.
While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool-use failures frequently induce irreversible side effects, making accurate step-level verification critical. However, existing process-level benchmarks are predominantly confined to closed-world mathematical domains, failing to capture the dynamic and open-ended nature of tool execution. To bridge this gap, we introduce AgentProcessBench, the first benchmark dedicated to evaluating step-level effectiveness in realistic, tool-augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human-labeled step annotations with 89.1% inter-annotator agreement. It features a ternary labeling scheme to capture exploration and an error propagation rule to reduce labeling ambiguity. Extensive experiments reveal key insights: (1) weaker policy models exhibit inflated ratios of correct steps due to early termination; (2) distinguishing neutral and erroneous actions remains a significant challenge for current models; and (3) process-derived signals provide complementary value to outcome supervision, significantly enhancing test-time scaling. We hope AgentProcessBench can foster future research in reward models and pave the way toward general agents. The code and data are available at https://github.com/RUCBM/AgentProcessBench.
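The abstract mentions a ternary labeling scheme and an error propagation rule; the sketch below is one plausible reading of how such step labels and propagation could be encoded. The label names (correct / neutral / error) and the downgrade-after-error semantics are assumptions for illustration only, not the paper's definitions.

```python
# Minimal sketch of a ternary step-labeling scheme with error propagation.
# Label names and the exact propagation semantics are assumptions for
# illustration; the benchmark's own definitions may differ.
from enum import Enum


class StepLabel(Enum):
    CORRECT = "correct"   # step makes verifiable progress toward the goal
    NEUTRAL = "neutral"   # exploratory step: neither helps nor harms
    ERROR = "error"       # wrong step, possibly with irreversible side effects


def propagate_errors(labels: list[StepLabel]) -> list[StepLabel]:
    """Hypothetical propagation rule: once a step is labeled ERROR, later
    steps that would otherwise count as CORRECT are downgraded to NEUTRAL,
    since they operate on an already-corrupted trajectory state."""
    propagated, seen_error = [], False
    for label in labels:
        if seen_error and label is StepLabel.CORRECT:
            propagated.append(StepLabel.NEUTRAL)
        else:
            propagated.append(label)
        seen_error = seen_error or label is StepLabel.ERROR
    return propagated
```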
Community
🎉 AgentProcessBench Available Now
When utilizing Process Reward Models (PRMs) to guide Reinforcement Learning (RL) training, accurately identifying the impact or contribution of each step within a trajectory is essential for providing precise reward signals. To achieve a more rigorous and comprehensive evaluation of models' capabilities as PRMs, we have developed a PRM evaluation benchmark specifically designed for tool-using agents. This benchmark comprises 1,000 trajectories totaling 8,509 steps, all featuring 100% human-annotated labels. Our goal is to provide a more fine-grained testing platform for PRM research within agent-based scenarios.
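Since the dataset is hosted on the Hugging Face Hub (see the link below), here is a minimal sketch of loading it with the `datasets` library. Only the repository id is taken from this page; the split names and column layout are inspected at runtime rather than assumed.

```python
# Minimal sketch: load AgentProcessBench from the Hugging Face Hub and
# inspect its structure. Split and field names are not assumed here.
from datasets import load_dataset

ds = load_dataset("LulaCola/AgentProcessBench")
print(ds)                      # show available splits and column names

first_split = next(iter(ds.values()))
print(first_split[0].keys())   # inspect the fields of a single example
```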
🏠 Homepage
rucbm.github.io/AgentProcessBench-Homepage/
🤗 HF
huggingface.co/datasets/LulaCola/AgentProcessBench
💻 GitHub
github.com/RUCBM/AgentProcessBench
📄 arXiv
arxiv.org/abs/2603.14465