AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding
Abstract
AdaptToken enables efficient long-video understanding by using model uncertainty to dynamically select relevant tokens across video segments, achieving improved accuracy and reduced inference time through global budget allocation and early stopping mechanisms.
Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames/tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a training-free framework that turns an MLLM's self-uncertainty into a global control signal for long-video token selection. AdaptToken splits a video into groups, extracts cross-modal attention to rank tokens within each group, and uses the model's response entropy to estimate each group's prompt relevance. This entropy signal enables a global token budget allocation across groups and further supports early stopping (AdaptToken-Lite), skipping the remaining groups when the model becomes sufficiently certain. Across four long-video benchmarks (VideoMME, LongVideoBench, LVBench, and MLVU) and multiple base MLLMs (7B-72B), AdaptToken consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and continues to benefit from extremely long inputs (up to 10K frames), while AdaptToken-Lite reduces inference time by about half with comparable performance. Project page: https://haozheqi.github.io/adapt-token
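The abstract describes a global token budget allocated across video groups from the model's response entropy. A minimal sketch of one plausible allocation rule follows, assuming lower entropy (higher certainty when conditioned on a group) signals higher relevance; the softmax-over-negative-entropy weighting and the function names are illustrative assumptions, not the paper's exact formulation.

```python
import math

def allocate_budget(entropies, total_budget, temperature=1.0):
    """Split a global token budget across video groups.

    Hypothetical sketch: a group whose conditioning yields lower
    response entropy is treated as more prompt-relevant and gets a
    larger share of the budget. The softmax over negative entropy
    is an assumed weighting, not the paper's stated rule.
    """
    weights = [math.exp(-h / temperature) for h in entropies]
    z = sum(weights)
    return [round(total_budget * w / z) for w in weights]
```

For example, with per-group entropies of 0.5, 2.0, and 1.0 and a budget of 100 tokens, the most certain group receives the largest share while no group is zeroed out entirely.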
Community
AdaptToken is a training-free framework for long video understanding with MLLMs. It uses response entropy as a global uncertainty signal to allocate token budgets across video groups, together with cross-modal attention for intra-group token ranking. This enables both strong long-context performance and an efficient early-stopping variant (AdaptToken-Lite).
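The early-stopping variant described above can be sketched as a sequential loop that halts once uncertainty falls below a threshold. Here `respond(context)` is an assumed wrapper around the MLLM that returns an answer together with its response entropy; the threshold value and all names are illustrative, not taken from the paper.

```python
def answer_with_early_stop(groups, respond, entropy_threshold=0.4):
    """AdaptToken-Lite-style early stopping (hypothetical sketch).

    Groups of video tokens are consumed in order; once the model's
    response entropy drops below the threshold, the remaining
    groups are skipped, saving their inference cost.
    """
    context = []
    answer, entropy = None, float("inf")
    for group in groups:
        context.extend(group)
        answer, entropy = respond(context)
        if entropy < entropy_threshold:
            break  # sufficiently certain: skip remaining groups
    return answer, entropy
```

In this sketch the accuracy/latency trade-off is controlled by a single threshold: a lower value demands more certainty before stopping and so processes more groups.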
Related papers recommended by the Semantic Scholar API:
- ReDiPrune: Relevance-Diversity Pre-Projection Token Pruning for Efficient Multimodal LLMs (2026)
- FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging (2026)
- KTV: Keyframes and Key Tokens Selection for Efficient Training-Free Video LLMs (2026)
- ConsensusDrop: Fusing Visual and Cross-Modal Saliency for Efficient Vision Language Models (2026)
- CoPE-VideoLM: Leveraging Codec Primitives For Efficient Video Language Modeling (2026)
- DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference (2026)
- Balancing Saliency and Coverage: Semantic Prominence-Aware Budgeting for Visual Token Compression in VLMs (2026)