Papers
arxiv:2602.11636

ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning

Published on Feb 12
· Submitted by Changti Wu on Feb 13

Abstract

ScalSelect is a scalable training-free method for selecting representative multimodal data that achieves near-full-dataset performance with significantly reduced computational requirements.

AI-generated summary

Large-scale Visual Instruction Tuning (VIT) has become a key paradigm for advancing the performance of vision-language models (VLMs) across various multimodal tasks. However, training on such large-scale datasets is computationally expensive and inefficient due to redundancy in the data, which motivates multimodal data selection to improve training efficiency. Existing data selection methods for VIT either require costly training or gradient computation. Training-free alternatives often depend on proxy models or datasets, instruction-agnostic representations, or pairwise similarity with quadratic complexity, limiting scalability and representation fidelity. In this work, we propose ScalSelect, a scalable training-free multimodal data selection method with linear-time complexity in the number of samples that eliminates the need for external models or auxiliary datasets. ScalSelect first constructs sample representations by extracting the visual features most attended by instruction tokens in the target VLM, capturing instruction-relevant information. It then identifies samples whose representations best approximate the dominant subspace of the full dataset's representations, enabling scalable importance scoring without pairwise comparisons. Extensive experiments across multiple VLMs, datasets, and selection budgets demonstrate that ScalSelect achieves over 97.5% of the performance of training on the full dataset using only 16% of the data, and even outperforms full-data training in some settings. The code is available at https://github.com/ChangtiWu/ScalSelect.
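The abstract describes a two-stage pipeline. Below is a minimal sketch of the first stage (instruction-conditioned sample representations), assuming we already have the attention weights from instruction tokens to visual tokens and the visual token features from the target VLM. The tensor names, the top-k pooling, and the softmax weighting are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch: build one representation per sample from the visual tokens
# that the instruction tokens attend to most inside the target VLM.
import torch

def instruction_conditioned_rep(attn_weights: torch.Tensor,
                                visual_feats: torch.Tensor,
                                top_k: int = 16) -> torch.Tensor:
    """
    attn_weights: (num_instruction_tokens, num_visual_tokens) attention from
                  instruction tokens to visual tokens (assumed available).
    visual_feats: (num_visual_tokens, hidden_dim) visual token features.
    Returns a single (hidden_dim,) representation for the sample.
    """
    # How much attention each visual token receives, averaged over instruction tokens.
    relevance = attn_weights.mean(dim=0)          # (num_visual_tokens,)

    # Keep only the visual tokens the instruction attends to most.
    top_idx = relevance.topk(top_k).indices       # (top_k,)

    # Pool their features, weighted by (normalized) attention, into one vector.
    weights = torch.softmax(relevance[top_idx], dim=0)
    return (weights.unsqueeze(-1) * visual_feats[top_idx]).sum(dim=0)
```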

Community

Paper submitter

ScalSelect: Training-Free and Scalable Data Selection for Visual Instruction Tuning

Large-scale visual instruction tuning datasets are highly redundant, yet full-data training remains the default, leading to substantial computational waste.

This paper introduces ScalSelect, a training-free and gradient-free data selection method with linear-time complexity, specifically designed for visual instruction tuning.

💡 Key Ideas

  • Leverages instruction-conditioned attention representations to better capture instruction-relevant visual signals
  • Selects samples from a dominant global subspace rather than relying on pairwise similarity or proxy-model training (see the sketch after this list)
  • Scalable: no extra models, no auxiliary datasets, no training or gradient computation, and linear-time complexity in the number of samples
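A minimal sketch of the subspace-based selection stage, with the same caveat: the scoring rule below (projection energy onto the top singular subspace of the centered representation matrix) is an illustrative stand-in for the paper's importance score, and `rank` and `budget` are hypothetical knobs. What it demonstrates is the scalability argument: a thin SVD of an N x d matrix costs O(N d^2), linear in the number of samples, and no N x N pairwise similarity matrix is ever formed.

```python
# Hedged sketch: rank samples by how well each representation aligns with the
# dominant subspace of all sample representations, then keep the top `budget`.
import numpy as np

def select_by_dominant_subspace(reps: np.ndarray, rank: int, budget: int) -> np.ndarray:
    """reps: (N, d) matrix of sample representations, one row per sample."""
    # Center so the subspace captures variance rather than the mean direction.
    centered = reps - reps.mean(axis=0, keepdims=True)

    # Top-`rank` right singular vectors span the dominant subspace.
    # Thin SVD costs O(N * d^2): linear in the number of samples N.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:rank]                                   # (rank, d)

    # Score each sample by how much of its (normalized) energy lies in that subspace.
    proj = centered @ basis.T                           # (N, rank)
    scores = np.linalg.norm(proj, axis=1) / (np.linalg.norm(centered, axis=1) + 1e-8)

    # Keep the `budget` highest-scoring samples.
    return np.argsort(-scores)[:budget]

# Example: pick 16% of 10,000 synthetic samples with a rank-32 subspace.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reps = rng.normal(size=(10_000, 256)).astype(np.float32)
    idx = select_by_dominant_subspace(reps, rank=32, budget=1_600)
    print(idx.shape)  # (1600,)
```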

📊 Results

Using only 16% of the data, ScalSelect retains ≥ 97.5% of full-data performance, and in some cases even surpasses training on the entire dataset.

🚀 Why It Matters

ScalSelect provides a practical and scalable solution for multimodal training. As VLMs continue to scale, data curation becomes increasingly critical, and this work offers a simple yet principled direction forward.

Code: https://github.com/ChangtiWu/ScalSelect
