arxiv:2605.16679

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

Published on May 15 · Submitted by Weiran Yao on May 19

Abstract

Healthcare workflow benchmark challenges agents with policy-dense, multi-role, and multilateral interaction requirements, revealing significant performance gaps in automated enterprise applications.

AI-generated summary

End-to-end automation of realistic healthcare operations stresses three capabilities underrepresented in current benchmarks: policy density, where decisions must be grounded in a large library of medical, insurance, and operational rules; multi-role composition, where a single task requires the agent to play multiple roles with handoffs; and multilateral interaction, where intermediate workflow steps are multi-turn dialogs, such as peer-to-peer review and patient outreach. We introduce χ-Bench, a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools, which it must drive to a terminal status by making tool calls and writing the role's artifacts, guided by a 1,290+ document managed-care operations handbook skill. Across 30 agent harness/model configurations, the best agent resolves only 28.0% of tasks, no agent clears 20% on strict pass^3, and executing all tasks in a single session drops performance to 3.8%. These results raise the hypothesis that similar gaps are likely to surface in other policy-dense, role-composed, irreversible enterprise domains.

Community

Paper submitter

Today, we introduce CHI-Bench (Clinical Healthcare In-situ Benchmark), the first long-horizon healthcare benchmark for AI agents.

We built high-fidelity simulators for three live domains: Provider Prior Authorization, Payer Utilization Management, and Population Health Care Management, each instantiated as MCP servers that operate on patient, clinician, and insurer records.

Each trial in CHI-Bench runs an agent for 60-80 steps across four to six clinical stages, exposing 21 healthcare apps through 200+ MCP tools and a 1,279-document operations handbook. It evaluates the trajectory, every artifact, and the world state using deterministic unit tests and an LLM judge for evidence grounding, consent, and cross-stage consistency.

Results from 30 frontier agents on the leaderboard

  • Best overall: Anthropic's Claude Code with Opus 4.6 — 28% pass@1.
  • Runner-up: OpenAI's Codex with GPT-5.5 — 21%.
  • By domain: utilization review 41%; care management 32%; prior-authorization paperwork 29%.
  • Reliability: no agent clears 20% on strict pass^3, where the same case is run three times and must pass every run.
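The gap between the pass@1 and reliability figures above is easier to see with a small worked example. The sketch below is not the benchmark's scoring code. It assumes that pass@1 is the average per-trial success rate and that strict pass^3 counts a case only when all three of its runs succeed. The function names and the `outcomes` data are hypothetical.

```python
from typing import Dict, List


def pass_at_1(outcomes: Dict[str, List[bool]]) -> float:
    """Average single-trial success rate, computed per case and then averaged."""
    per_case = [sum(trials) / len(trials) for trials in outcomes.values()]
    return sum(per_case) / len(per_case)


def strict_pass_k(outcomes: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of cases whose first k trials ALL succeed (assumed definition)."""
    for case, trials in outcomes.items():
        if len(trials) < k:
            raise ValueError(f"case {case!r} has fewer than {k} trials")
    solved = [all(trials[:k]) for trials in outcomes.values()]
    return sum(solved) / len(solved)


# Hypothetical results: four cases, three runs each (True = case resolved).
outcomes = {
    "case_a": [True, True, True],
    "case_b": [True, False, True],
    "case_c": [False, False, False],
    "case_d": [True, True, False],
}

print(f"pass@1        = {pass_at_1(outcomes):.3f}")
print(f"strict pass^3 = {strict_pass_k(outcomes, k=3):.3f}")
```

In this toy data, three of the four cases succeed at least once, but only one succeeds in every run, so the strict metric falls well below pass@1. The same effect would explain why an agent's reliability score can sit far under its single-run score.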

CHI-Bench is open under Apache 2.0; the leaderboard accepts community submissions today.

🤖Github: https://github.com/actava-ai/chi-bench
🤗HuggingFace: https://huggingface.co/datasets/actava/chi-bench
🏆Leaderboard: https://actava.ai/benchmarks


