arxiv:2605.08762

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

Published on May 9

Abstract

The Omni-DeepSearch benchmark evaluates audio-driven multimodal search capabilities that require cross-modal reasoning and retrieval across text, image, and video modalities.

AI-generated summary

Current omni-modal benchmarks mainly evaluate models under settings where multiple modalities are provided simultaneously, while the ability to start from audio alone and actively search for cross-modal evidence remains underexplored. In this paper, we introduce Omni-DeepSearch, a benchmark for audio-driven omni-modal deep search. Given one or more audio clips and a related question, models must infer useful clues from audio, invoke text, image, and video search tools, and perform multi-hop reasoning to produce a short, objective, and verifiable answer. Omni-DeepSearch contains 640 samples across 15 fine-grained categories, covering four retrieval target modalities and four audio content types. A multi-stage filtering pipeline ensures audio dependence, retrieval necessity, visual modality necessity, and answer uniqueness. Experiments on recent closed-source and open-source omni-modal models show that this task remains highly challenging: the strongest evaluated model, Gemini-3-Pro, achieves only 43.44% average accuracy. Further analyses illustrate key bottlenecks in audio entity inference, query formulation, tool-use reliability, multi-hop retrieval, and cross-modal verification. These results highlight audio-driven omni-modal deep search as an important and underexplored direction for future multimodal agents.
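The task loop described in the abstract (infer clues from audio, call search tools, reason over several hops, return a short answer) can be illustrated with a minimal sketch. This is not the benchmark's actual harness: the names `Sample`, `run_episode`, `toy_policy`, the tool names `text_search` and `image_search`, and the exact-match scoring in `accuracy` are all hypothetical, chosen only to make the control flow concrete. The `policy` callable stands in for an omni-modal model, and `audio_clues` stands in for whatever the model infers from the raw audio.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical tool signature: a query string in, a list of text snippets out.
SearchTool = Callable[[str], List[str]]


@dataclass
class Sample:
    audio_clues: List[str]  # stand-in for clues a model infers from the audio
    question: str
    answer: str             # short, objective reference answer


# A policy maps (sample, evidence so far) to (action name, argument).
Policy = Callable[[Sample, List[str]], Tuple[str, str]]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def run_episode(sample: Sample, tools: Dict[str, SearchTool],
                policy: Policy, max_steps: int = 5) -> str:
    """Alternate between tool calls and reasoning until the policy answers."""
    evidence: List[str] = []
    for _ in range(max_steps):
        action, arg = policy(sample, evidence)
        if action == "answer":
            return arg
        if action not in tools:
            # Record the failed call so the policy can react on the next step.
            evidence.append(f"[error] unknown tool: {action}")
            continue
        evidence.extend(tools[action](arg))
    return ""  # step budget exhausted without an answer


def accuracy(predictions: List[str], references: List[str]) -> float:
    """Exact-match accuracy after normalization (an assumed scoring rule)."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and aligned")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


def toy_policy(sample: Sample, evidence: List[str]) -> Tuple[str, str]:
    """Scripted two-hop policy: text search, then image search, then answer."""
    if not evidence:
        return ("text_search", sample.audio_clues[0])
    if len(evidence) == 1:
        return ("image_search", evidence[0])
    return ("answer", evidence[-1])


# Stub tools with canned results; a real setup would call search backends.
toy_tools: Dict[str, SearchTool] = {
    "text_search": lambda q: [f"entity for {q}"],
    "image_search": lambda q: ["red"],
}

sample = Sample(audio_clues=["foghorn"], question="What color is it?", answer="Red")
prediction = run_episode(sample, toy_tools, toy_policy)
print(prediction, accuracy([prediction], [sample.answer]))
```

In this toy run the first hop turns an audio clue into an entity via text search, and the second hop uses that entity for an image search, which mirrors the multi-hop, cross-modal structure the abstract describes. The bottlenecks the paper lists (audio entity inference, query formulation, tool-use reliability) would correspond to the quality of `audio_clues`, the arguments the policy passes to each tool, and the unknown-tool branch, respectively.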


