HeBA: Heterogeneous Bottleneck Adapters for Robust Vision-Language Models
Abstract
HeBA introduces a heterogeneous bottleneck adapter framework for Vision-Language Models that uses modality-specific processing techniques and structural regularization to improve few-shot learning performance.
Adapting large-scale Vision-Language Models (VLMs) like CLIP to downstream tasks often suffers from a "one-size-fits-all" architectural approach, where visual and textual tokens are processed uniformly by wide, generic adapters. We argue that this homogeneity ignores the distinct structural nature of the modalities -- spatial locality in images versus semantic density in text. To address this, we propose HeBA (Heterogeneous Bottleneck Adapter), a unified architectural framework that introduces modality-specific structural inductive biases. HeBA departs from conventional designs through three key architectural innovations: (1) Heterogeneity: It processes visual tokens via 2D depthwise-separable convolutions to preserve spatial correlations, while distinctively processing text tokens via dense linear projections to capture semantic relationships; (2) Bottleneck Regularization: Unlike standard expanding adapters, HeBA employs a compression bottleneck (D -> D/4) that explicitly forces the model to learn compact, robust features and acts as a structural regularizer; and (3) Active Gradient Initialization: We challenge the restrictive zero-initialization paradigm, utilizing a Kaiming initialization strategy that ensures sufficient initial gradient flow to accelerate convergence without compromising the frozen backbone's pre-trained knowledge. Extensive experiments demonstrate that HeBA's architecturally specialized design achieves superior stability and accuracy, establishing a new state-of-the-art on 11 few-shot benchmarks. Code is available at https://github.com/Jahid12012021/VLM-HeBA.
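The three components described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the embedding dimension (512), the 7x7 visual token grid, the GELU activation, and all class names are illustrative assumptions, with only the structural choices (depthwise-separable 2D conv for visual tokens, dense linear bottleneck for text, D -> D/4 compression, Kaiming initialization, residual connection around a frozen backbone) taken from the paper's description:

```python
# Minimal sketch of HeBA-style heterogeneous bottleneck adapters.
# Assumptions (not from the paper's released code): D=512, a 7x7 visual
# token grid, bottleneck ratio 4; names and activation are illustrative.
import torch
import torch.nn as nn


class VisualBottleneckAdapter(nn.Module):
    """Compress D -> D/4, mix spatially via a depthwise-separable 2D conv, expand back."""

    def __init__(self, dim=512, ratio=4, grid=7):
        super().__init__()
        hidden = dim // ratio
        self.grid = grid
        self.down = nn.Linear(dim, hidden)
        # Depthwise conv mixes each channel spatially; pointwise conv mixes channels.
        self.depthwise = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.pointwise = nn.Conv2d(hidden, hidden, 1)
        self.up = nn.Linear(hidden, dim)
        self.act = nn.GELU()
        # "Active gradient" (Kaiming) initialization instead of zero-init.
        nn.init.kaiming_normal_(self.down.weight)
        nn.init.kaiming_normal_(self.up.weight)

    def forward(self, x):  # x: (B, N, D), N = grid * grid visual tokens
        B, N, _ = x.shape
        h = self.act(self.down(x))                              # (B, N, D/4)
        h = h.transpose(1, 2).reshape(B, -1, self.grid, self.grid)
        h = self.pointwise(self.act(self.depthwise(h)))         # spatial mixing
        h = h.reshape(B, -1, N).transpose(1, 2)
        return x + self.up(h)  # residual keeps the frozen backbone's features


class TextBottleneckAdapter(nn.Module):
    """Dense linear bottleneck for text tokens (no spatial prior)."""

    def __init__(self, dim=512, ratio=4):
        super().__init__()
        hidden = dim // ratio
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)
        self.act = nn.GELU()
        nn.init.kaiming_normal_(self.down.weight)
        nn.init.kaiming_normal_(self.up.weight)

    def forward(self, x):  # x: (B, L, D) text tokens
        return x + self.up(self.act(self.down(x)))


vis = VisualBottleneckAdapter()
txt = TextBottleneckAdapter()
v = vis(torch.randn(2, 49, 512))   # 2 images, 49 = 7x7 patch tokens
t = txt(torch.randn(2, 77, 512))   # 2 captions, 77 tokens (CLIP context length)
```

The residual form means the adapter perturbs, rather than replaces, the frozen features; with Kaiming (instead of zero) initialization, the adapter branch contributes non-trivial gradients from the first step.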
Community
This work introduces HeBA (Heterogeneous Bottleneck Adapters), a novel approach designed to enhance the robustness of Vision-Language Models, and explores how heterogeneous adapter architectures can efficiently improve performance and adaptability. The official implementation and pre-trained weights are available in the linked GitHub repository. Feedback and discussion from the community are highly welcome!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- ReHARK: Refined Hybrid Adaptive RBF Kernels for Robust One-Shot Vision-Language Adaptation (2026)
- Evolving Prompt Adaptation for Vision-Language Models (2026)
- MMLoP: Multi-Modal Low-Rank Prompting for Efficient Vision-Language Adaptation (2026)
- DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles (2026)
- Bi-modal Textual Prompt Learning for Vision-language Models in Remote Sensing (2026)
- ACE-LoRA: Graph-Attentive Context Enhancement for Parameter-Efficient Adaptation of Medical Vision-Language Models (2026)
- PAND: Prompt-Aware Neighborhood Distillation for Lightweight Fine-Grained Visual Classification (2026)