arxiv:2603.07888

VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?

Published on Mar 9 · Submitted by Minkyu Kim on Mar 11

AI-generated summary

VLM-SubtleBench is introduced as a benchmark for evaluating vision-language models on subtle comparative reasoning across diverse domains, revealing significant gaps between model and human performance.

Abstract

The ability to distinguish subtle differences between visually similar images is essential for diverse domains such as industrial anomaly detection, medical imaging, and aerial surveillance. While comparative reasoning benchmarks for vision-language models (VLMs) have recently emerged, they primarily focus on images with large, salient differences and fail to capture the nuanced reasoning required for real-world applications. In this work, we introduce VLM-SubtleBench, a benchmark designed to evaluate VLMs on subtle comparative reasoning. Our benchmark covers ten difference types (Attribute, State, Emotion, Temporal, Spatial, Existence, Quantity, Quality, Viewpoint, and Action), and we curate paired question-image sets reflecting these fine-grained variations. Unlike prior benchmarks restricted to natural image datasets, our benchmark spans diverse domains, including industrial, aerial, and medical imagery. Through extensive evaluation of both proprietary and open-source VLMs, we reveal systematic gaps between model and human performance across difference types and domains, and provide controlled analyses highlighting where VLMs' reasoning sharply deteriorates. Together, our benchmark and findings establish a foundation for advancing VLMs toward human-level comparative reasoning.

Community

Paper submitter

We introduce VLM-SubtleBench, a benchmark of 13K image-pair QA triplets to evaluate how well VLMs can reason about subtle visual differences (e.g., micro-expression changes, anomaly detection, medical staging). Unlike prior benchmarks that focus on salient differences, our benchmark covers 10 difference types (Attribute, State, Emotion, Temporal, Spatial, Existence, Quality, Quantity, Viewpoint, Action) across 6 domains (natural, game, industry, aerial, synthetic, medical). We find that even the best proprietary VLMs (GPT-5, Gemini-2.5-Pro) lag humans by 30+ percentage points on spatial/temporal/viewpoint reasoning, and simple prompting strategies provide only limited gains.
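
To make the "percentage point" comparison above concrete, here is a minimal Python sketch of how per-difference-type accuracy and the human-model gap could be computed. The `PairItem` schema, its field names, and the helper functions are hypothetical illustrations, not the released data format or the authors' evaluation code. It also assumes that a "triplet" means an image pair, a question, and an answer. Only the ten difference types and six domains are taken from the post.

```python
from collections import defaultdict
from dataclasses import dataclass

# Difference types and domains as listed in the post.
DIFFERENCE_TYPES = {
    "Attribute", "State", "Emotion", "Temporal", "Spatial",
    "Existence", "Quality", "Quantity", "Viewpoint", "Action",
}
DOMAINS = {"natural", "game", "industry", "aerial", "synthetic", "medical"}


@dataclass(frozen=True)
class PairItem:
    """One image-pair QA item (hypothetical schema, not the released format)."""
    image_a: str
    image_b: str
    question: str
    answer: str
    difference_type: str
    domain: str


def accuracy_by_type(items, predictions):
    """Return the fraction of correct predictions per difference type."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.difference_type] += 1
        correct[item.difference_type] += int(pred == item.answer)
    return {t: correct[t] / total[t] for t in total}


def gap_in_points(human_acc, model_acc):
    """Human accuracy minus model accuracy, in percentage points, per type."""
    return {
        t: 100.0 * (human_acc[t] - model_acc[t])
        for t in human_acc
        if t in model_acc
    }
```

With accuracies stored as fractions in [0, 1], a gap of 30 or more on the "Spatial" key would correspond to the kind of shortfall the post describes. Any numbers fed into this sketch would have to come from the paper's own results.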
