VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions
Abstract
VitaBench 2.0 evaluates personalized and proactive agent behavior in long-term user interactions by requiring continuous extraction and updating of user preferences from fragmented interactions.
Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. However, existing agent benchmarks primarily evaluate reasoning and tool use, largely overlooking the challenges of inferring and leveraging user preferences in realistic scenarios. To address this gap, we introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. In VitaBench 2.0, tasks are organized as temporally ordered sequences for individual users, where preferences are embedded in fragmented and heterogeneous interactions. Successful completion of tasks requires the agent to continuously extract, utilize, and update user preferences from these interactions. We further evaluate proactiveness through tasks that require agents to recognize missing information and actively acquire it from users or environments before making decisions. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across different memory architectures. We benchmark a diverse set of frontier proprietary and open-source LLMs. Results show that real-world personalization remains highly challenging even for state-of-the-art models, revealing a substantial gap between current capabilities and practical requirements. Extensive analysis further reveals the failure modes and capability bottlenecks of current agents in real-world personalized decision-making, providing insights for future model improvements.
Community
Feel free to discuss.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation (2026)
- $\pi$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows (2026)
- Latent Preference Modeling for Cross-Session Personalized Tool Calling (2026)
- EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective (2026)
- PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent (2026)
- MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems (2026)
- EgoSelf: From Memory to Personalized Egocentric Assistant (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Get this paper in your agent:
hf papers read 2605.27141 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper