SeGPruner: Semantic-Geometric Visual Token Pruner for 3D Question Answering
Abstract
A semantic-aware and geometry-guided token pruning framework is presented for efficient 3D question answering with multi-view images, achieving significant reductions in token budget and inference latency while maintaining competitive performance.
Vision-language models (VLMs) have been widely adopted for 3D question answering (3D QA). In typical pipelines, visual tokens extracted from multiple viewpoints are concatenated with language tokens and jointly processed by a large language model (LLM) for inference. However, aggregating multi-view observations inevitably introduces severe token redundancy, yielding an overly large visual token set that significantly hinders inference efficiency under constrained token budgets. Visual token pruning has emerged as a prevalent strategy for addressing this issue. Nevertheless, most existing pruners are tailored primarily to 2D inputs or rely on indirect geometric cues, which limits their ability to explicitly retain semantically critical objects and to maintain sufficient spatial coverage for robust 3D reasoning. In this paper, we propose SeGPruner, a semantic-aware and geometry-guided token reduction framework for efficient 3D QA with multi-view images. Specifically, SeGPruner first preserves semantically salient tokens through an attention-based importance module (the Saliency-aware Token Selector), ensuring that object-critical evidence is retained. It then complements these tokens with spatially diverse ones via a geometry-guided selector (the Geometry-aware Token Diversifier), which jointly considers semantic relevance and 3D geometric distance. This cooperation between saliency preservation and geometry-guided diversification balances object-level evidence against global scene coverage under aggressive token reduction. Extensive experiments on ScanQA and OpenEQA demonstrate that SeGPruner substantially improves inference efficiency, reducing the visual token count by 91% and inference latency by 86% while maintaining competitive performance on 3D reasoning tasks.
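To make the two-stage selection concrete, here is a minimal sketch of attention-based saliency selection followed by geometry-guided greedy diversification. This is not the authors' implementation: the function name `segpruner_sketch`, the `salient_frac` budget split, the fusion weight `alpha`, and the farthest-point-style coverage score are all illustrative assumptions inferred from the abstract.

```python
# Minimal sketch of the two-stage pruning described in the abstract.
# NOT the authors' code: shapes, `salient_frac`, `alpha`, and the
# greedy diversification rule are illustrative assumptions.
import numpy as np

def segpruner_sketch(tokens, attn_scores, xyz, budget,
                     salient_frac=0.5, alpha=0.5):
    """Select `budget` of N visual tokens.

    tokens:      (N, D) visual token features from all views
    attn_scores: (N,)   text-to-token attention (semantic saliency)
    xyz:         (N, 3) back-projected 3D position of each token
    """
    assert budget <= tokens.shape[0]
    n_salient = int(budget * salient_frac)

    # Stage 1 -- Saliency-aware Token Selector:
    # keep the tokens the question attends to most strongly.
    order = np.argsort(-attn_scores)
    selected = list(order[:n_salient])
    remaining = list(order[n_salient:])

    # Stage 2 -- Geometry-aware Token Diversifier:
    # greedily add tokens that are both semantically relevant and
    # geometrically far from everything already kept (FPS-style).
    while len(selected) < budget and remaining:
        sel_xyz = xyz[selected]                       # (S, 3)
        cand_xyz = xyz[remaining]                     # (R, 3)
        # distance from each candidate to its nearest selected token
        d = np.linalg.norm(cand_xyz[:, None, :] - sel_xyz[None, :, :],
                           axis=-1)                   # (R, S)
        min_d = d.min(axis=1)                         # (R,)
        # joint score: semantic relevance + spatial coverage
        # (the fusion rule is an assumption, not the paper's formula)
        score = (alpha * attn_scores[remaining]
                 + (1 - alpha) * min_d / (min_d.max() + 1e-6))
        best = int(np.argmax(score))
        selected.append(remaining.pop(best))

    return tokens[selected], np.array(selected)


# Toy usage: 1,024 tokens pruned to a 9% budget (~91% reduction).
rng = np.random.default_rng(0)
N, D = 1024, 64
feats = rng.normal(size=(N, D))
attn = rng.random(N)
pos = rng.uniform(-5, 5, size=(N, 3))
kept, idx = segpruner_sketch(feats, attn, pos, budget=int(N * 0.09))
print(kept.shape)  # (92, 64)
```

In this sketch, stage 1 spends half the budget on the tokens the question attends to most, while stage 2 trades attention against distance to the nearest already-kept token, so the remaining budget goes to covering under-represented regions of the scene.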
Community
SeGPruner is a token pruning framework for 3D visual question answering that combines attention-based saliency selection with geometry-guided spatial diversification to reduce visual token count by 91% and inference latency by 86%, while preserving competitive reasoning performance across multi-view 3D scenes.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models (2026)
- Focus-Scan-Refine: From Human Visual Perception to Efficient Visual Token Pruning (2026)
- QMoP: Query Guided Mixture-of-Projector for Efficient Visual Token Compression (2026)
- ReDiPrune: Relevance-Diversity Pre-Projection Token Pruning for Efficient Multimodal LLMs (2026)
- ResPrune: Text-Conditioned Subspace Reconstruction for Visual Token Pruning in Large Vision-Language Models (2026)
- Balancing Saliency and Coverage: Semantic Prominence-Aware Budgeting for Visual Token Compression in VLMs (2026)
- Hierarchical Collaborative Fusion for 3D Instance-aware Referring Expression Segmentation (2026)
Get this paper in your agent:
hf papers read 2603.29437
Don't have the latest CLI? Install it with:
curl -LsSf https://hf.co/cli/install.sh | bash