Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
Abstract
The OmniBehavior benchmark reveals that current LLMs fail to accurately simulate complex real-world user behavior, owing to structural biases and limited behavioral diversity.
The emergence of Large Language Models (LLMs) has highlighted the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that prior datasets built from isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a "positive average person," exhibiting hyper-activity, persona homogenization, and a Utopian bias. The result is a loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.
Community
- We introduce OmniBehavior, to our knowledge the first user simulation benchmark constructed entirely from authentic user interaction logs, integrating long-horizon, cross-scenario, and heterogeneous behavior traces into a unified framework.
- We provide a systematic analysis of real-world user behavior at scale, demonstrating that cross-scenario dependencies, long-horizon structures, and heterogeneous signals are fundamental to accurate preference modeling.
- We conduct a comprehensive evaluation of state-of-the-art (SOTA) LLMs, revealing substantial capability gaps in modeling realistic user behavior even with extended context lengths, and establishing strong baselines for future research.
- We reveal a structural bias in LLM-based simulators, termed the positivity-and-average bias, whereby models overestimate engagement, homogenize user behaviors, and suppress negative and long-tail interactions, fundamentally limiting their applicability in real-world settings.
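The positivity-and-average bias described above can be quantified with simple distributional statistics over action traces. The following minimal Python sketch illustrates the idea under assumed toy data: the action labels (`like`, `skip`, etc.), the set of "positive" actions, and the traces themselves are hypothetical and are not the benchmark's actual metrics or data. A positive engagement gap indicates hyper-activity; lower entropy in the simulated trace indicates homogenization.

```python
from collections import Counter
import math

def action_entropy(actions):
    """Shannon entropy (bits) of the empirical action distribution.

    Higher entropy means more diverse behavior; a simulator that
    homogenizes users will tend to produce lower-entropy traces.
    """
    counts = Counter(actions)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def positivity_gap(real, simulated, positive=frozenset({"like", "share", "purchase"})):
    """Difference in positive-action rate (simulated minus real).

    A gap > 0 means the simulator over-produces positive engagement
    relative to the authentic trace. The `positive` action set here
    is an illustrative assumption.
    """
    rate = lambda trace: sum(a in positive for a in trace) / len(trace)
    return rate(simulated) - rate(real)

# Hypothetical traces: authentic users skip and dislike; the simulator over-engages.
real = ["view", "view", "skip", "like", "view", "skip", "dislike", "view"]
sim  = ["like", "view", "like", "share", "like", "view", "like", "share"]

print(f"positivity gap: {positivity_gap(real, sim):+.3f}")
print(f"entropy real={action_entropy(real):.3f}  sim={action_entropy(sim):.3f}")
```

In this toy example the simulated trace shows a large positive engagement gap and lower action entropy than the real trace, matching the hyper-activity and homogenization patterns the benchmark reports at scale.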
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation (2026)
- MemoryCD: Benchmarking Long-Context User Memory of LLM Agents for Lifelong Cross-Domain Personalization (2026)
- PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments (2026)
- Mind the Sim2Real Gap in User Simulation for Agentic Tasks (2026)
- COINBench: Moving Beyond Individual Perspectives to Collective Intent Understanding (2026)
- AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Condition (2026)
- AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment (2026)
Get this paper in your agent: `hf papers read 2604.08362`