From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning
Abstract
This study introduces the Visual Attention Score (VAS), an attention-based metric for analyzing cold-start initialization in multimodal large reasoning models; it identifies a counter-intuitive phenomenon termed Lazy Attention Localization and proposes Attention-Guided Visual Anchoring and Reflection (AVAR), a framework that improves multimodal reasoning performance.
The cold-start initialization stage plays a pivotal role in training Multimodal Large Reasoning Models (MLRMs), yet its mechanisms remain insufficiently understood. To analyze this stage, we introduce the Visual Attention Score (VAS), an attention-based metric that quantifies how much a model attends to visual tokens. We find that reasoning performance is strongly correlated with VAS (r=0.9616): models with higher VAS achieve substantially stronger multimodal reasoning. Surprisingly, multimodal cold-start fails to elevate VAS, resulting in attention distributions close to those of the base model, whereas text-only cold-start leads to a clear increase. We term this counter-intuitive phenomenon Lazy Attention Localization. To validate its causal role, we design training-free interventions that directly modulate attention allocation during inference, yielding performance gains of 1-2% without any retraining. Building on these insights, we further propose Attention-Guided Visual Anchoring and Reflection (AVAR), a comprehensive cold-start framework that integrates visual-anchored data synthesis, attention-guided objectives, and visual-anchored reward shaping. Applied to Qwen2.5-VL-7B, AVAR achieves an average gain of 7.0% across 7 multimodal reasoning benchmarks. Ablation studies further confirm that each component of AVAR contributes step-wise to the overall gains. The code, data, and models are available at https://github.com/lrlbbzl/Qwen-AVAR.
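The abstract does not spell out how VAS is computed or how the training-free intervention modulates attention, so the sketch below is only one plausible reading, not the paper's method: VAS as the average attention mass that query positions place on visual tokens, and the intervention as a constant pre-softmax bias on visual-token logits. The function names, the `alpha` hyperparameter, and the layer-averaging choice are all assumptions for illustration.

```python
import torch

def visual_attention_score(attentions, visual_token_mask):
    """Hypothetical VAS-like metric: the fraction of attention mass
    placed on visual tokens, averaged over layers, heads, and query
    positions. The paper's exact definition may differ (e.g., in
    layer selection or normalization).

    attentions: tuple of [batch, heads, query_len, key_len] tensors,
        one per layer (e.g., from a Hugging Face model called with
        output_attentions=True).
    visual_token_mask: bool tensor [key_len], True at visual tokens.
    """
    scores = []
    for layer_attn in attentions:
        # Attention mass on visual keys, per query position.
        mass_on_visual = layer_attn[..., visual_token_mask].sum(dim=-1)
        # Average over batch, heads, and query positions.
        scores.append(mass_on_visual.mean())
    return torch.stack(scores).mean().item()

def boost_visual_attention(attn_logits, visual_token_mask, alpha=0.1):
    """Hypothetical training-free intervention: add a constant bias to
    pre-softmax attention logits at visual-token positions, nudging
    attention allocation toward the image without retraining. 'alpha'
    is an assumed hyperparameter, not taken from the paper.
    """
    attn_logits = attn_logits.clone()
    attn_logits[..., visual_token_mask] += alpha
    return attn_logits
```

Hooking a bias like `boost_visual_attention` into the attention layers at inference time would be one way to test, as the authors do with their own interventions, whether raising attention on visual tokens causally improves reasoning accuracy.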
Community
We reveal that visual attention is the key bottleneck in multimodal reasoning: models with higher Visual Attention Scores (VAS) perform dramatically better. We introduce AVAR, a cold-start framework that explicitly reshapes attention allocation, achieving a +7.0% average gain across 7 benchmarks.
Librarian Bot found the following similar papers, recommended by the Semantic Scholar API:
- Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs (2026)
- LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning (2026)
- When RAG Hurts: Diagnosing and Mitigating Attention Distraction in Retrieval-Augmented LVLMs (2026)
- MIRROR: Multimodal Iterative Reasoning via Reflection on Visual Regions (2026)
- PlaM: Training-Free Plateau-Guided Model Merging for Better Visual Grounding in MLLMs (2026)
- OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention (2026)
- What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis (2026)