PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking
Abstract
Multi-turn visual reasoning agents suffer from entangled reasoning and perception, which produces redundant trajectories. PixelEyes decouples the two through mask-guided search and semantic-region breadth-first search, supported by an expert-resynthesized training dataset and a new zero-hint benchmark.
This paper explores multi-turn visual reasoning and observes that MLLMs repeatedly fail to localize the target, leading to long, redundant trajectories. We attribute this failure to the entanglement of reasoning and perception within a single model: the MLLM reasons and localizes simultaneously, and inaccurate localization triggers additional reasoning turns that bloat the trajectory. To solve this problem, we propose PixelEyes, a multi-turn visual reasoning agent that explicitly decouples reasoning from perception: the reasoner decides what to look for, while a specialized perception tool answers where it is. Specifically, PixelEyes introduces two components. 1) Mask-guided Visual Search: a referring segmentation model is invoked to provide mask-precise localization, freeing the reasoner from the need to compensate for imprecise grounding. 2) Semantic-region Breadth-first Search (BFS): to eliminate redundant loops caused by repeatedly cropping incorrect sub-regions, we organize exploration as a breadth-first search over semantic regions. To internalize these capabilities, we construct the PixelEyes-6K dataset by resynthesizing expert trajectories from existing data, which explicitly embeds our mask-guided search and BFS logic into the model. We further introduce Pinpoint-Bench, a zero-hint visual search benchmark (no location cues are provided in the question) with instance-level masks and bounding boxes that separate localization failures from reasoning failures, enabling fine-grained analysis of failure modes such as inattentional blindness. Recent state-of-the-art MLLMs and visual reasoning agents leave large headroom on Pinpoint-Bench, demonstrating its quality and difficulty. Code and models are open-sourced.
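To make the decoupling concrete, here is a minimal Python sketch of a breadth-first search over semantic regions in which a separate perception callback answers "where". It is an illustration under stated assumptions, not the PixelEyes implementation: the abstract does not say how semantic regions are produced or represented, so the `Region` tree, the `bfs_visual_search` function, and the `locate` callback (a stand-in for a referring segmentation model) are all hypothetical names.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Region:
    """Hypothetical semantic region; children are finer sub-regions."""
    name: str
    children: List["Region"] = field(default_factory=list)


def bfs_visual_search(
    root: Region,
    locate: Callable[[Region], Optional[str]],
) -> Tuple[Optional[str], Optional[str], List[str]]:
    """Visit regions level by level and stop at the first hit.

    `locate` stands in for the perception tool: it returns a mask
    (here just a string placeholder) if the target is found in the
    region, or None otherwise. Each region is dequeued exactly once,
    so the search cannot loop back into an already rejected crop.
    """
    queue = deque([root])
    visited: List[str] = []
    while queue:
        region = queue.popleft()
        visited.append(region.name)
        mask = locate(region)
        if mask is not None:
            return region.name, mask, visited
        # No hit: schedule the finer sub-regions after the current level.
        queue.extend(region.children)
    return None, None, visited


# Toy scene: the target is only visible once we reach the "window" region.
scene = Region("image", [
    Region("street", [Region("car"), Region("sign")]),
    Region("building", [Region("window")]),
])


def toy_locate(region: Region) -> Optional[str]:
    # Assumed stand-in for a referring segmentation call.
    return "mask:window" if region.name == "window" else None


found, mask, visited = bfs_visual_search(scene, toy_locate)
print(found, mask, visited)
```

In this toy run the search checks both top-level regions before descending into either, which is the property that distinguishes breadth-first exploration from repeatedly cropping deeper into one wrong sub-region.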
Community
PixelEyes enhances active visual search in MLLMs by delegating fine-grained localization to a specialized perception tool, thereby achieving efficient and accurate multi-turn visual reasoning.
Models: https://huggingface.co/collections/godx7/pixeleyes
Pinpoint-Bench: https://huggingface.co/datasets/godx7/Pinpoint-Bench
Get this paper in your agent:
hf papers read 2607.00115
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 2
godx7/PixelEyes-8B