Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
Abstract
Q-Zoom enhances MLLM performance by adaptively focusing computational resources on relevant visual regions through dynamic gating and self-distilled region proposal networks, achieving faster inference without sacrificing accuracy.
MLLMs require high-resolution visual inputs for fine-grained tasks like document understanding and dense scene perception. However, current global resolution scaling paradigms indiscriminately flood the quadratic self-attention mechanism with visually redundant tokens, severely bottlenecking inference throughput while ignoring spatial sparsity and query intent. To overcome this, we propose Q-Zoom, a query-aware adaptive high-resolution perception framework that operates in an efficient coarse-to-fine manner. First, a lightweight Dynamic Gating Network safely bypasses high-resolution processing when coarse global features suffice. Second, for queries demanding fine-grained perception, a Self-Distilled Region Proposal Network (SD-RPN) precisely localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature spaces. To optimize these modules efficiently, the gating network uses a consistency-aware generation strategy to derive deterministic routing labels, while the SD-RPN employs a fully self-supervised distillation paradigm. A continuous spatio-temporal alignment scheme and targeted fine-tuning then seamlessly fuse the dense local RoI with the coarse global layout. Extensive experiments demonstrate that Q-Zoom establishes a dominant Pareto frontier. Using Qwen2.5-VL-7B as a primary testbed, Q-Zoom accelerates inference by 2.52 times on Document & OCR benchmarks and 4.39 times in High-Resolution scenarios while matching the baseline's peak accuracy. Furthermore, when configured for maximum perceptual fidelity, Q-Zoom surpasses the baseline's peak performance by 1.1% and 8.1% on these respective benchmarks. These robust improvements transfer seamlessly to Qwen3-VL, LLaVA, and emerging RL-based thinking-with-image models. Project page is available at https://yuhengsss.github.io/Q-Zoom/.
Community
Say goodbye to "brute-force high resolution" in MLLMs. Introducing Q-Zoom: on-demand visual token allocation!
Excited to share our latest work, Q-Zoom, tackling the classic "computation explosion" bottleneck when Multimodal Large Language Models (MLLMs) process high-resolution images.
The Problem:
To read dense documents or spot tiny objects, current MLLMs rely on global dynamic resolution (generating thousands of visual tokens). This exhaustive approach has two fatal flaws:
Query-level Redundancy: Asking "Is it day or night?" still encodes the whole image in 4K, wasting massive compute.
Spatial Redundancy: Asking about tiny text in a corner still forces the model to push large, uninformative backgrounds (white walls, sky) through the heavy Transformer self-attention mechanism.
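To see why redundant tokens hurt so much, recall that self-attention cost grows quadratically with sequence length. A back-of-the-envelope sketch (the `dim` value and FLOP formula are illustrative assumptions, not numbers from the paper):

```python
# Illustrative only: relative self-attention cost grows quadratically with
# the number of visual tokens, so global high-res encoding is expensive.
def attention_flops(n_tokens: int, dim: int = 1024) -> int:
    """Approximate FLOPs of one self-attention layer: QK^T plus AV."""
    return 2 * n_tokens * n_tokens * dim

low = attention_flops(256)    # coarse global view
high = attention_flops(4096)  # global high-res encoding
print(high // low)  # 256x the attention cost for 16x the tokens
```

Sixteen times more tokens means roughly 256x more attention compute, which is why encoding the whole image at full resolution for every query is wasteful.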
Our Solution: Q-Zoom
We propose a query-aware adaptive high-resolution perception framework. The core idea: decide whether high resolution is needed, and where, directly within the model's intermediate layers.
Three Core Designs:
Lightweight Dynamic Gating: For simple questions, it answers directly from coarse features, safely bypassing high-res processing to massively boost throughput.
Self-Distilled RPN (SD-RPN): For complex questions, it leverages the LLM's internal cross-modal attention to predict a precise Region-of-Interest (RoI) heatmap, enabling local high-res cropping.
Spatio-Temporal Post-SFT: Seamlessly fuses the high-res local features with the low-res global context, repairing the loss of spatial awareness caused by cropping.
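The control flow of the three designs above can be sketched as follows. This is a minimal toy sketch, not the paper's implementation: the real gating network and SD-RPN are learned modules, and every function name, threshold, and heatmap here is invented for illustration.

```python
import numpy as np

def gate_needs_high_res(confidence: float, threshold: float = 0.5) -> bool:
    """Dynamic-gating stand-in: skip high-res processing when the
    coarse-feature answer confidence is already high enough."""
    return confidence < threshold

def roi_from_attention(attn: np.ndarray, frac: float = 0.5) -> tuple:
    """SD-RPN stand-in: tightest bounding box around patches whose
    query-to-patch attention is at least `frac` of the peak value."""
    ys, xs = np.where(attn >= attn.max() * frac)
    return int(ys.min()), int(ys.max()), int(xs.min()), int(xs.max())

# Toy 8x8 attention heatmap with a hot spot in the bottom-right corner.
attn = np.zeros((8, 8))
attn[5:8, 5:8] = 1.0

if gate_needs_high_res(confidence=0.2):
    # Only this region would be re-encoded at high resolution, then
    # fused back with the coarse global layout.
    print(roi_from_attention(attn))  # → (5, 7, 5, 7)
```

The design point is that both decisions (whether to zoom, and where) come cheaply from signals the model already computes, rather than from an external detector.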
Key Results & Highlights:
Breaking the Accuracy-Efficiency Trade-off (see the Pareto curve): On Qwen2.5-VL 7B, Q-Zoom surpasses the peak accuracy of a native 4096-token baseline while using at most 1024 visual tokens!
Massive Speedups: 2.52x faster (53.0% fewer tokens) on Doc/OCR tasks, and up to 4.39x faster (73.2% fewer tokens) on extreme high-res/dense vision tasks!
Orthogonal to "Slow Thinking" & SOTA Architectures: Consistent, significant gains on LLaVA, Qwen2.5-VL, and Qwen3-VL. Crucially, Q-Zoom integrates seamlessly with the newest RL-trained "Thinking-with-Image" models (e.g., ZwZ), delivering a further performance leap on top of powerful visual slow-thinking capabilities!
Friendly Training Cost:
No expensive human bounding box annotations. No memory-hungry RL. The entire framework relies on self-supervised distillation and consistency-aware sample generation. Total training takes < 8 hours on just 4x A6000 GPUs.
Resources:
Project Page (w/ more visual results): https://yuhengsss.github.io/Q-Zoom/
Paper: https://arxiv.org/pdf/2604.06912
Librarian Bot (automated): the following similar papers were recommended by the Semantic Scholar API:
- QMoP: Query Guided Mixture-of-Projector for Efficient Visual Token Compression (2026)
- Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning (2026)
- The Model Knows Which Tokens Matter: Automatic Token Selection via Noise Gating (2026)
- ReDiPrune: Relevance-Diversity Pre-Projection Token Pruning for Efficient Multimodal LLMs (2026)
- TRACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval (2026)
- Zoom to Essence: Trainless GUI Grounding by Inferring upon Interface Elements (2026)
- Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models (2026)