Abstract
STTS is a lightweight module that efficiently prunes vision tokens across both vision transformer and language model architectures in vision-language models for video tasks, achieving significant computational gains with minimal performance loss.
Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the vision transformer (ViT) exclusively for unimodal perception tasks such as action recognition and object segmentation, without adapting to downstream vision-language tasks; or (2) only within the LLM while leaving the ViT output intact, often requiring complex text-conditioned token selection mechanisms. In this paper, we introduce Spatio-Temporal Token Scoring (STTS), a simple and lightweight module that prunes vision tokens across both the ViT and the LLM without text conditioning or token merging, and is fully compatible with end-to-end training. By learning how to score temporally via an auxiliary loss and spatially via LLM downstream gradients, aided by our efficient packing algorithm, STTS prunes 50% of vision tokens throughout the entire architecture, resulting in a 62% improvement in efficiency during both training and inference with only a 0.7% drop in average performance across 13 short and long video QA tasks. Efficiency gains increase with more sampled frames per video. Applying test-time scaling for long-video QA further yields performance gains of 0.5-1% compared to the baseline. Overall, STTS represents a novel, simple yet effective technique for unified, architecture-wide vision token pruning.
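The core operation the abstract describes, keeping only the highest-scoring half of the vision tokens, can be illustrated with a minimal sketch. This is an assumption-laden toy, not the STTS module: the scores are passed in as plain numbers (in STTS they would come from the learned temporal/spatial scorers), and the `prune_tokens` helper name and `keep_ratio` parameter are invented for illustration.

```python
import numpy as np

def prune_tokens(tokens: np.ndarray, scores: np.ndarray,
                 keep_ratio: float = 0.5) -> np.ndarray:
    """Keep the top-scoring fraction of tokens, preserving original order.

    tokens: (num_tokens, dim) token embeddings.
    scores: (num_tokens,) importance scores (here just an input;
            STTS learns these via an auxiliary loss and LLM gradients).
    """
    num_keep = max(1, int(round(len(scores) * keep_ratio)))
    # Indices of the highest-scoring tokens, re-sorted so the survivors
    # stay in their original spatio-temporal order.
    top = np.argsort(scores)[-num_keep:]
    keep = np.sort(top)
    return tokens[keep]

# Example: 8 tokens of dim 4, prune 50% -> 4 tokens remain.
rng = np.random.default_rng(0)
toks = rng.standard_normal((8, 4))
scrs = rng.standard_normal(8)
pruned = prune_tokens(toks, scrs)
print(pruned.shape)  # (4, 4)
```

Note that the kept tokens are re-sorted into their original order rather than score order, so downstream position-sensitive layers still see a coherent sequence; the paper's packing algorithm would then batch these variable-length sequences efficiently.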
Community
The following papers were recommended by the Semantic Scholar API
- DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference (2026)
- ConsensusDrop: Fusing Visual and Cross-Modal Saliency for Efficient Vision Language Models (2026)
- The Model Knows Which Tokens Matter: Automatic Token Selection via Noise Gating (2026)
- Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models (2025)
- OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models (2026)
- EvoPrune: Early-Stage Visual Token Pruning for Efficient MLLMs (2026)
- ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference (2026)