Are Audio-Language Models Listening? Audio-Specialist Heads for Adaptive Audio Steering
Abstract
Mechanistic interpretability identifies audio-specialist attention heads in large audio-language models to enhance audio utilization through activation interventions at inference time.
Multimodal large language models can exhibit text dominance, over-relying on linguistic priors instead of grounding predictions in non-text inputs. Large audio-language models (LALMs) are one example: decisive audio evidence can be under-utilized even when it carries important information. To address this issue, we use mechanistic interpretability to identify a small set of audio-specialist attention heads whose attention to audio tokens yields a "listening" signal. We show that this signal increases when audio evidence affects the model's output, providing an indicator of audio engagement under standard prompting. Leveraging this localization, we construct an audio-silence steering direction and apply an inference-time activation intervention to the final representation, amplifying the effect of audio evidence on the model's output. To demonstrate the utility of this intervention, we show on MMAU that it improves accuracy by up to +8.0 percentage points on two Qwen-based LALMs, without any parameter updates.
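The core intervention described above is a form of activation steering: derive a direction from paired activations (inputs with audio vs. with silence) and add it to the model's final representation at inference time. The sketch below illustrates that idea with toy NumPy arrays; the function names, shapes, and the scaling factor `alpha` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hedged sketch of an audio--silence steering direction, assuming we have
# already collected final-layer hidden states for matched inputs with
# real audio and with silence. All names and shapes are hypothetical.

def steering_direction(h_audio: np.ndarray, h_silence: np.ndarray) -> np.ndarray:
    """Mean activation difference (audio minus silence), unit-normalized.

    h_audio, h_silence: arrays of shape (num_examples, d_model).
    """
    d = h_audio.mean(axis=0) - h_silence.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(h_final: np.ndarray, direction: np.ndarray, alpha: float = 2.0) -> np.ndarray:
    """Add the scaled steering direction to a final hidden representation.

    This is a training-free, inference-time edit: no parameters change.
    """
    return h_final + alpha * direction

# Toy demonstration with random activations.
rng = np.random.default_rng(0)
d_model = 8
h_audio = rng.normal(size=(16, d_model)) + 1.0   # toy states for audio inputs
h_silence = rng.normal(size=(16, d_model))       # toy states for silence inputs

v = steering_direction(h_audio, h_silence)
h_steered = steer(rng.normal(size=d_model), v, alpha=2.0)
```

In practice such a direction would be applied inside the model (e.g. via a forward hook on the final layer) rather than to a standalone vector, and the strength `alpha` would be tuned on held-out data.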
Community
In this paper, we ask whether audio-language models are actually listening to the audio, or mostly leaning on language priors. We find that a small set of audio-specialist heads plays a key role, and that steering them at inference time can noticeably improve audio grounding without any retraining.