VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild
Abstract
LLM-based agents perform poorly on VibeSearchBench, a benchmark that evaluates multi-turn dialogue search scenarios reflecting real user-agent collaboration rather than traditional single-turn query tasks.
LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-experience gap. We attribute this gap to existing benchmarks' reliance on over-specified queries, single-turn interactions, and fixed-schema evaluation, none of which reflect real search behavior where users and agents collaboratively refine vague intent through multi-turn dialogue. We term this paradigm VibeSearch and introduce VibeSearchBench, a benchmark comprising 200 manually curated bilingual (Chinese and English) tasks across 20 domains, split into VibeSearch-Pro (professional) and VibeSearch-Daily (daily-life) subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph, and is evaluated through a progressive-disclosure user simulator and a graph-matching evaluation framework. We benchmark seven frontier models under both the ReAct framework and the OpenClaw agent harness. Results show that all models remain substantially inadequate for VibeSearch (best F1: 30.30), highlighting the need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction.
Community
🚀 Introducing VibeSearchBench — a new benchmark that exposes a striking gap between how LLM agents are evaluated and how real users actually search.
💡 The problem. Today's search benchmarks (BrowseComp, WideSearch, DeepSearchQA…) all assume over-specified queries, single-turn interaction, and fixed-schema outputs. But in the wild, users don't know what they want upfront — they vibe-search: vague initial query → partial results → emerging preferences → iterative refinement. We call this the evaluation–experience gap.
🧪 What we built. 200 manually curated bilingual (EN/ZH) tasks across 20 domains, split into VibeSearch-Pro (professional) and VibeSearch-Daily (everyday life). Each task pairs a user persona with K progressive-disclosure stages and a schema-free ground-truth knowledge graph (avg. 212 nodes / 298 triples). We contribute two novel pieces: (1) a progressive-disclosure user simulator that unlocks needs only when trigger conditions are met, and (2) an LLM-as-judge graph-matching evaluator with 98.5%+ human agreement.
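To make the progressive-disclosure idea concrete, here is a minimal Python sketch of a simulated user who reveals one hidden need per stage, and only once a trigger condition is met. Everything in it is an illustrative assumption: the `Stage` and `ProgressiveUser` names, the keyword-based trigger, and the fallback reply are hypothetical, and the benchmark's actual simulator may decide triggers very differently (for example with an LLM).

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical reply used while the current stage's trigger has not fired.
FALLBACK = "Could you show me more options first?"


@dataclass
class Stage:
    need: str                # requirement the user reveals at this stage
    trigger_keywords: set    # assumed trigger: any keyword appearing in the agent's message


@dataclass
class ProgressiveUser:
    initial_query: str
    stages: list
    unlocked: int = 0        # number of stages disclosed so far

    def reply(self, agent_message: str) -> str:
        # Once every stage has been disclosed, the user signals completion.
        if self.unlocked == len(self.stages):
            return "[DONE]"
        stage = self.stages[self.unlocked]
        text = agent_message.lower()
        # Disclose the next need only when the trigger condition is met.
        if any(keyword in text for keyword in stage.trigger_keywords):
            self.unlocked += 1
            return stage.need
        return FALLBACK


user = ProgressiveUser(
    initial_query="find me a laptop",
    stages=[
        Stage("It should weigh under 1.5 kg.", {"laptop", "weight"}),
        Stage("My budget is below $1200.", {"price", "budget"}),
    ],
)
first = user.reply("Here are some popular laptops.")
second = user.reply("Do you like blue?")
third = user.reply("What price range works for you?")
fourth = user.reply("Here is the final list.")
```

In this toy run the agent has to surface candidate results before the weight constraint appears, and has to ask about price before the budget appears, which is the kind of proactive elicitation the benchmark is designed to reward.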
📊 Findings. Across 7 frontier models (Claude Opus 4.6, GPT-5.4, Gemini-3.1 Pro, Kimi K2.6, DeepSeek-V4-Pro…) under ReAct & OpenClaw:
- Best F1 = 30.30 — every model below 33
- More tool calls ≠ better results (GPT-5.4 burns the most, scores lowest)
- Zero trajectories reach the user's [DONE] signal
- Sub-agents, local memory, life-long memory all yield no significant gain
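For readers unfamiliar with graph-level F1, the following Python sketch shows one way such a score could be computed over knowledge-graph triples. It is not the benchmark's evaluator: the post describes an LLM-as-judge matcher, while this sketch substitutes a normalized exact-match function, and the `triple_f1` and `normalize` names, the one-to-one greedy matching, and the example triples are all assumptions made for illustration.

```python
def normalize(triple):
    """Lowercase and trim each element of a (head, relation, tail) triple."""
    return tuple(part.strip().lower() for part in triple)


def exact_match(predicted, gold):
    """Stand-in for the LLM judge: normalized string equality."""
    return normalize(predicted) == normalize(gold)


def triple_f1(predicted, gold, match=exact_match):
    """Precision, recall, and F1 with each gold triple matched at most once."""
    remaining = list(gold)
    hits = 0
    for p in predicted:
        for i, g in enumerate(remaining):
            if match(p, g):
                hits += 1
                del remaining[i]  # a gold triple cannot be credited twice
                break
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [
    ("Laptop A", "weight", "1.3 kg"),
    ("Laptop A", "price", "$1100"),
    ("Laptop B", "weight", "1.4 kg"),
    ("Laptop B", "price", "$1150"),
]
predicted = [
    ("laptop a", "weight", "1.3 kg"),   # matches after normalization
    ("Laptop C", "price", "$900"),      # not in the gold graph
]
precision, recall, f1 = triple_f1(predicted, gold)
```

Here one of two predicted triples is correct and one of four gold triples is recovered, giving precision 0.5, recall 0.25, and F1 of about 0.33. A short answer that is mostly right can therefore still score low when the ground-truth graph is large, which is consistent with graphs averaging hundreds of triples.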
🎯 Takeaway. VibeSearch demands fundamental model-level advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction — not just more scaffolding. The road to truly helpful search agents is much longer than the leaderboards suggest.
🔗 vibebench.github.io/VibeSearchBench