Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
Abstract
Q-Zoom enhances MLLM performance by adaptively focusing computational resources on relevant visual regions through dynamic gating and self-distilled region proposal networks, achieving faster inference without sacrificing accuracy.
MLLMs require high-resolution visual inputs for fine-grained tasks like document understanding and dense scene perception. However, current global resolution scaling paradigms indiscriminately flood the quadratic self-attention mechanism with visually redundant tokens, severely bottlenecking inference throughput while ignoring spatial sparsity and query intent. To overcome this, we propose Q-Zoom, a query-aware adaptive high-resolution perception framework that operates in an efficient coarse-to-fine manner. First, a lightweight Dynamic Gating Network safely bypasses high-resolution processing when coarse global features suffice. Second, for queries demanding fine-grained perception, a Self-Distilled Region Proposal Network (SD-RPN) precisely localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature spaces. To optimize these modules efficiently, the gating network uses a consistency-aware generation strategy to derive deterministic routing labels, while the SD-RPN employs a fully self-supervised distillation paradigm. A continuous spatio-temporal alignment scheme and targeted fine-tuning then seamlessly fuse the dense local RoI with the coarse global layout. Extensive experiments demonstrate that Q-Zoom establishes a dominant Pareto frontier. Using Qwen2.5-VL-7B as a primary testbed, Q-Zoom accelerates inference by 2.52 times on Document & OCR benchmarks and 4.39 times in High-Resolution scenarios while matching the baseline's peak accuracy. Furthermore, when configured for maximum perceptual fidelity, Q-Zoom surpasses the baseline's peak performance by 1.1% and 8.1% on these respective benchmarks. These robust improvements transfer seamlessly to Qwen3-VL, LLaVA, and emerging RL-based thinking-with-image models. Project page is available at https://yuhengsss.github.io/Q-Zoom/.
Community
Say goodbye to "brute-force high resolution" in MLLMs. Introducing Q-Zoom: on-demand visual token allocation!
Excited to share our latest work, Q-Zoom, tackling the classic "computation explosion" bottleneck when Multimodal Large Language Models (MLLMs) process high-resolution images.
The Problem:
To read dense documents or spot tiny objects, current MLLMs rely on global dynamic resolution (generating thousands of visual tokens). This exhaustive approach has two fatal flaws:
Query-level Redundancy: Asking "Is it day or night?" still encodes the whole image in 4K, wasting massive compute.
Spatial Redundancy: Asking about tiny text in a corner still forces the model to push large, uninformative backgrounds (white walls, sky) through the heavy Transformer self-attention mechanism.
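To see why redundant tokens hurt so much, recall that self-attention cost grows quadratically with sequence length. A back-of-the-envelope sketch (the `dim` value and FLOP formula are illustrative assumptions, not numbers from the paper):

```python
# Illustrative only: relative self-attention cost grows quadratically with
# the number of visual tokens, so global high-res encoding is expensive.
def attention_flops(n_tokens: int, dim: int = 1024) -> int:
    """Approximate FLOPs of one self-attention layer: QK^T plus AV."""
    return 2 * n_tokens * n_tokens * dim

low = attention_flops(256)    # coarse global view
high = attention_flops(4096)  # global high-res encoding
print(high // low)  # 256x the attention cost for 16x the tokens
```

Sixteen times more tokens means roughly 256x more attention compute, which is why encoding the whole image at full resolution for every query is wasteful.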
Our Solution: Q-Zoom
We propose a query-aware adaptive high-resolution perception framework. The core idea: decide whether high resolution is needed, and where, directly within the model's intermediate layers.
Three Core Designs:
Lightweight Dynamic Gating: For simple questions, it answers directly from coarse features, safely bypassing high-res processing to massively boost throughput.
Self-Distilled RPN (SD-RPN): For complex questions, it leverages the LLM's internal cross-modal attention to predict a precise Region-of-Interest (RoI) heatmap, enabling local high-res cropping.
Spatio-Temporal Post-SFT: Seamlessly fuses the high-res local features with the low-res global context, repairing the loss of spatial awareness caused by cropping.
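The control flow of the three designs above can be sketched as follows. This is a minimal toy sketch, not the paper's implementation: the real gating network and SD-RPN are learned modules, and every function name, threshold, and heatmap here is invented for illustration.

```python
import numpy as np

def gate_needs_high_res(confidence: float, threshold: float = 0.5) -> bool:
    """Dynamic-gating stand-in: skip high-res processing when the
    coarse-feature answer confidence is already high enough."""
    return confidence < threshold

def roi_from_attention(attn: np.ndarray, frac: float = 0.5) -> tuple:
    """SD-RPN stand-in: tightest bounding box around patches whose
    query-to-patch attention is at least `frac` of the peak value."""
    ys, xs = np.where(attn >= attn.max() * frac)
    return int(ys.min()), int(ys.max()), int(xs.min()), int(xs.max())

# Toy 8x8 attention heatmap with a hot spot in the bottom-right corner.
attn = np.zeros((8, 8))
attn[5:8, 5:8] = 1.0

if gate_needs_high_res(confidence=0.2):
    # Only this region would be re-encoded at high resolution, then
    # fused back with the coarse global layout.
    print(roi_from_attention(attn))  # → (5, 7, 5, 7)
```

The design point is that both decisions (whether to zoom, and where) come cheaply from signals the model already computes, rather than from an external detector.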
Key Results & Highlights:
Breaking the Accuracy-Efficiency Trade-off (see the Pareto curve): On Qwen2.5-VL 7B, Q-Zoom surpasses the peak accuracy of a native 4096-token baseline while using at most 1024 visual tokens!
Massive Speedups: 2.52x faster (53.0% fewer tokens) on Doc/OCR tasks, and up to 4.39x faster (73.2% fewer tokens) on extreme high-res/dense vision tasks!
Orthogonal to "Slow Thinking" & SOTA Architectures: Consistent, significant gains on LLaVA, Qwen2.5-VL, and Qwen3-VL. Crucially, Q-Zoom integrates seamlessly with the newest RL-trained "Thinking-with-Image" models (e.g., ZwZ), delivering a further performance leap on top of powerful visual slow-thinking capabilities!
Friendly Training Cost:
No expensive human bounding box annotations. No memory-hungry RL. The entire framework relies on self-supervised distillation and consistency-aware sample generation. Total training takes < 8 hours on just 4x A6000 GPUs.
Resources:
Project Page (w/ more visual results): https://yuhengsss.github.io/Q-Zoom/
Paper: https://arxiv.org/pdf/2604.06912
Librarian Bot (automated): the following similar papers were recommended by the Semantic Scholar API:
- QMoP: Query Guided Mixture-of-Projector for Efficient Visual Token Compression (2026)
- Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning (2026)
- The Model Knows Which Tokens Matter: Automatic Token Selection via Noise Gating (2026)
- ReDiPrune: Relevance-Diversity Pre-Projection Token Pruning for Efficient Multimodal LLMs (2026)
- TRACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval (2026)
- Zoom to Essence: Trainless GUI Grounding by Inferring upon Interface Elements (2026)
- Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models (2026)