arxiv:2603.07786

OrdinalBench: A Benchmark Dataset for Diagnosing Generalization Limits in Ordinal Number Understanding of Vision-Language Models

Published on Mar 8

Authors:

Abstract

Vision-language models exhibit poor generalization in ordinal number understanding, prompting the creation of OrdinalBench, a diagnostic benchmark that evaluates sequential reasoning through structured stepwise tracing and controlled difficulty levels.

AI-generated summary

Vision-Language Models (VLMs) have advanced across multimodal benchmarks but still show clear gaps in ordinal number understanding, i.e., the ability to track relative positions and generalize to large indices. We present OrdinalBench, a diagnostic benchmark that standardizes ordinal number understanding as an evaluation task for VLMs. The core task is N-th object identification, defined by a starting reference and traversal rule. Task difficulty is controlled along three axes: (i) ordinal magnitude, from small numbers to extreme cases up to 300; (ii) arrangement complexity, from single loops to maze-like paths; and (iii) object count. The benchmark provides 39,000 question-answer pairs, each annotated with a ground-truth reasoning trajectory and balanced across difficulty levels for controlled large-scale testing. Beyond answer-only evaluation, our framework requires models to generate structured stepwise traces of the counting process and provides an open evaluation toolkit that measures both final accuracy and step-level path consistency. Zero-shot evaluations of GPT-5, Gemini 2.5 Flash Lite, Qwen2.5-VL, InternVL3.5, and Molmo reveal sharp degradation under large-ordinal and complex-path conditions, highlighting weak generalization despite strong scores on standard multimodal tasks. By framing ordinal number understanding as a core target, OrdinalBench provides a reproducible benchmark and diagnostic framework for developing VLMs with stronger sequential reasoning. All data and code are available at https://ordinalbench.github.io/

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2603.07786 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2603.07786 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2603.07786 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.