AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation
Abstract
AJ-Bench evaluates Agent-as-a-Judge verification across search, data systems, and graphical user interfaces, assessing information acquisition, state verification, and process verification over 155 tasks and 516 annotated trajectories.
As reinforcement learning continues to scale the training of large language model-based agents, reliably verifying agent behaviors in complex environments has become increasingly challenging. Existing approaches rely on rule-based verifiers or LLM-as-a-Judge models, which struggle to generalize beyond narrow domains. Agent-as-a-Judge addresses this limitation by actively interacting with environments and tools to acquire verifiable evidence, yet its capabilities remain underexplored. We introduce AJ-Bench, a benchmark that systematically evaluates Agent-as-a-Judge across three domains (search, data systems, and graphical user interfaces), comprising 155 tasks and 516 annotated trajectories. The benchmark comprehensively assesses judge agents' abilities in information acquisition, state verification, and process verification. Experiments demonstrate consistent performance gains over LLM-as-a-Judge baselines, while also revealing substantial open challenges in agent-based verification. Our data and code are available at https://aj-bench.github.io/.
Community
A key bottleneck in scaling RL for AI agents is not only exploration, but evaluation.
As agents operate in broader, more open-ended environments, we need judge agents that can use tools, verify environment states, and produce grounded feedback signals. Better judges expand the range of behaviors that can be evaluated reliably, which in turn enables more effective RL scaling.
Our work takes a step in this direction by building a benchmark from multiple datasets to systematically evaluate this capability. We hope it helps the community move toward better standards for evaluating judge agents.
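To make the idea of a judge agent concrete, here is a minimal sketch of an Agent-as-a-Judge evaluation loop. This is an illustrative assumption, not the AJ-Bench implementation: the tool names, the `judge_trajectory` function, and the check format are all hypothetical. The point is only that the judge actively queries the environment for evidence and grounds its verdict in observed state rather than in the candidate agent's own transcript.

```python
# Illustrative sketch of an Agent-as-a-Judge loop (hypothetical API, not the
# AJ-Bench implementation): the judge calls tools against the live environment
# to gather evidence, then grades the candidate agent's behavior.

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Evidence:
    tool: str          # which tool produced this observation
    observation: str   # raw observation returned by the environment


@dataclass
class Verdict:
    passed: bool
    rationale: str
    evidence: list[Evidence] = field(default_factory=list)


def judge_trajectory(
    task: str,
    trajectory: list[str],
    tools: dict[str, Callable[[str], str]],
    checks: list[tuple[str, str, Callable[[str], bool]]],
) -> Verdict:
    """Run each (tool, query, condition) check against the environment and
    return a verdict grounded in the collected evidence."""
    evidence: list[Evidence] = []
    for tool_name, query, condition in checks:
        obs = tools[tool_name](query)          # actively query the environment
        evidence.append(Evidence(tool_name, obs))
        if not condition(obs):                 # state verification, not self-report
            return Verdict(False, f"check failed on {tool_name}: {query}", evidence)
    return Verdict(True, "all environment checks satisfied", evidence)


# Hypothetical usage: verify a file-creation task by inspecting the file system.
if __name__ == "__main__":
    import pathlib

    tools = {
        "read_file": lambda p: pathlib.Path(p).read_text(errors="ignore")
        if pathlib.Path(p).exists() else ""
    }
    checks = [("read_file", "/tmp/report.txt", lambda obs: "summary" in obs.lower())]
    verdict = judge_trajectory("write a report", ["step 1", "step 2"], tools, checks)
    print(verdict.passed, verdict.rationale)
```

A real judge agent would also decide which tools to call and interleave reasoning with these checks; the sketch fixes the checks up front purely to keep the evidence-gathering step visible.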
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Benchmark Test-Time Scaling of General LLM Agents (2026)
- Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents (2026)
- ATP-Bench: Towards Agentic Tool Planning for MLLM Interleaved Generation (2026)
- AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents (2026)
- Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification (2026)
- Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence? (2026)
- AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG (2026)