Beyond Closed-Pool Video Retrieval: A Benchmark and Agent Framework for Real-World Video Search and Moment Localization
Abstract
RVMS-Bench presents a comprehensive benchmark for real-world video memory search using hierarchical description frameworks, while RACLO introduces an agentic approach for fuzzy memory-based video retrieval.
Traditional video retrieval benchmarks focus on matching precise descriptions to closed video pools, failing to reflect real-world searches characterized by fuzzy, multi-dimensional memories on the open web. We present RVMS-Bench, a comprehensive benchmark for evaluating real-world video memory search. It consists of 1,440 samples spanning 20 diverse categories and four duration groups, sourced from real-world open-web videos. RVMS-Bench uses a hierarchical description framework encompassing Global Impression, Key Moment, Temporal Context, and Auditory Memory to mimic realistic multi-dimensional search cues, with all samples strictly verified via a human-in-the-loop protocol. We further propose RACLO, an agentic framework that employs abductive reasoning to simulate the human "Recall-Search-Verify" cognitive process, addressing the challenge of searching for videos via fuzzy memories in the real world. Experiments reveal that existing MLLMs still demonstrate insufficient capabilities in real-world Video Retrieval and Moment Localization based on fuzzy memories. We believe this work will facilitate the advancement of video retrieval robustness in real-world unstructured scenarios.
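To make the four-part description framework concrete, the following is a minimal sketch of how a single memory-based query might be represented. The class name `MemoryQuery`, its field names, and the assumption that any subset of the four cues may be present are illustrative only and are not taken from the RVMS-Bench release.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class MemoryQuery:
    """Hypothetical container for the four cue types named in the abstract."""

    global_impression: Optional[str] = None  # overall gist of the video
    key_moment: Optional[str] = None         # the specific moment to localize
    temporal_context: Optional[str] = None   # what happened before or after
    auditory_memory: Optional[str] = None    # remembered speech, music, sounds

    def cues(self) -> Dict[str, str]:
        """Return only the cues the user actually remembers."""
        fields = {
            "Global Impression": self.global_impression,
            "Key Moment": self.key_moment,
            "Temporal Context": self.temporal_context,
            "Auditory Memory": self.auditory_memory,
        }
        return {name: text for name, text in fields.items() if text}

    def to_search_text(self) -> str:
        """Flatten the available cues into one labeled text query."""
        return "\n".join(f"{name}: {text}" for name, text in self.cues().items())


# Made-up example query, not a benchmark sample.
query = MemoryQuery(
    global_impression="a cooking video filmed outdoors",
    auditory_memory="a crackling campfire in the background",
)
print(query.to_search_text())
```

Keeping each cue in its own field lets a retrieval system weight or drop cues independently, which matches the abstract's point that real memories are partial and multi-dimensional.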