USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding

USAD 2.0 is a universal audio encoder built on a bidirectional transformer. It extracts representations across multiple audio domains (speech, sound, and music) by distilling from self-supervised (SSL) and supervised audio foundation models, without labeled data. USAD 2.0 achieves strong or state-of-the-art performance on probing benchmarks (HEAR and MARBLE) and LLM-based evaluation (XARES-LLM).

Training data:

Multilingual speech (116k hours)
General audio and sound (21k hours)
Music (13k hours)

👀 Read Full Paper

🗂️ Models

Self-supervised Teachers (WavLM, ATST, MuQ): General-purpose encoders with good probing performance

Model	Params	Hidden	Layers	Framerate
USAD 2.0 Small	25M	384	12	50Hz
USAD 2.0 Base	97M	768	12	50Hz
USAD 2.0 Large	336M	1024	24	50Hz
USAD 2.0 XLarge	695M	1280	32	25Hz
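
As a rough guide to what the frame rates in the table above mean for sequence length, the sketch below converts a clip duration into an approximate number of output frames. The helper names are hypothetical and not part of the USAD 2.0 API, and the model's actual output lengths (reported in `x_lengths`) may differ slightly because of padding in the encoder's subsampling.

```python
# Frame rates are copied from the table above; the helpers are illustrative only.
FRAME_RATE_HZ = {
    "USAD 2.0 Small": 50,
    "USAD 2.0 Base": 50,
    "USAD 2.0 Large": 50,
    "USAD 2.0 XLarge": 25,
}


def frame_duration_ms(frame_rate_hz: int) -> float:
    """Time covered by one output frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def approx_num_frames(duration_s: float, frame_rate_hz: int) -> int:
    """Approximate number of output frames for a clip of the given duration."""
    return int(duration_s * frame_rate_hz)


for name, rate in FRAME_RATE_HZ.items():
    frames = approx_num_frames(10.0, rate)
    print(f"{name}: {frame_duration_ms(rate):.0f} ms per frame, ~{frames} frames for 10 s")
```

Under this approximation, a 10-second clip yields about 500 frames at 50Hz and about 250 frames at 25Hz, so the 25Hz models hand roughly half as many tokens to a downstream model.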

Supervised Teachers (Whisper & Audio Flamingo 3): State-of-the-art encoders for audio LLM frontend

We suggest selecting the best layer with the target_layer argument in the forward function to optimize audio LLM performance.

Model	Params	Hidden	Layers (Best)	Framerate
USAD 2.0 Large+	336M	1024	24 (20)	50Hz
USAD 2.0 XLarge+	695M	1280	32 (28)	25Hz
USAD 2.0 XXLarge+	1036M	1280	48 (40)	25Hz
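
One way to keep the recommended layers at hand is a small lookup that returns a value for `target_layer`. This is a minimal sketch: `pick_target_layer` is a hypothetical helper, not part of the USAD 2.0 API, and it assumes the "Best" column uses the same 1-based indexing as `target_layer`.

```python
from typing import Optional

# (number of layers, best layer) copied from the table above.
# Assumption: the best-layer index is 1-based, like target_layer.
LAYER_INFO = {
    "USAD 2.0 Large+": (24, 20),
    "USAD 2.0 XLarge+": (32, 28),
    "USAD 2.0 XXLarge+": (48, 40),
}


def pick_target_layer(variant: str, override: Optional[int] = None) -> int:
    """Return the layer index to pass as target_layer for a given variant.

    Uses the table's best layer unless an explicit override is given,
    and checks that the result lies in 1..num_layers.
    """
    num_layers, best_layer = LAYER_INFO[variant]
    layer = best_layer if override is None else override
    if not 1 <= layer <= num_layers:
        raise ValueError(f"layer {layer} is outside 1..{num_layers} for {variant}")
    return layer


print(pick_target_layer("USAD 2.0 XLarge+"))  # 28
```

The returned integer can be passed as `target_layer` in the forward call. The best layer for a particular task may differ from the table, so the `override` argument makes it easy to sweep over layers.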

⚙️ Performance

HEAR: probing-based general audio evaluation covering speech, sound, and music
MARBLE: probing-based music capability benchmark (instruments and singing voice)
XARES-LLM: frozen audio encoder + LLM with multi-task LoRA fine-tuning
- Track A (classification): keyword spotting, speaker/language identification, spoof detection, intent/emotion/sound/genre/instrument classification, and sound event detection.
- Track B (understanding): English/Mandarin ASR and audio/music captioning.

Encoder	Params	HEAR	MARBLE	XARES-LLM-A	XARES-LLM-B
Single-encoder SOTA
Base	~90M	80.6	74.0	0.660	0.418
Large	~300M	81.8	77.0	0.691	0.454
XLarge	~600M	82.6	75.1	0.782	0.457
USAD 2.0
Small	25M	81.0	72.9	0.604	0.357
Base	97M	81.9	74.1	0.645	0.442
Large	336M	82.9	75.8	0.667	0.473
XLarge	695M	82.5	75.7	0.708	0.485
USAD 2.0+
Large+	336M	84.0	75.1	0.769	0.580
XLarge+	695M	84.4	75.0	0.772	0.611
XXLarge+	1036M	84.4	75.6	0.783	0.624

The above evaluations are based on frozen encoders.
We encourage fine-tuning USAD 2.0 models for optimal downstream task performance.

🚀 How To Use

Installation

pip install -U torch torchaudio transformers

Load Model and Extract Features

import torch
from transformers import AutoModel

# Load pre-trained model
model = AutoModel.from_pretrained(
    "MIT-SLS/USAD2-XLarge-Plus", trust_remote_code=True
).cuda().eval()

# Model properties
model.sample_rate         # required audio sample rate
model.encoder_frame_rate  # frames per second (Hz)
model.mel_dim             # mel feature dimension
model.encoder_dim         # hidden dimension
model.num_layers          # number of encoder layers
model.device              # device
model.dtype               # dtype

# Model methods
model.set_audio_chunk_size(30.0)  # audio longer than this many seconds is chunked (default: 30 s)

# Load audio and resample to 16kHz
wavs, wav_lengths = model.load_audio_batch(["audio1.wav", "audio2.wav"])
# wavs:        raw waveforms (batch_size, max_wav_len)
# wav_lengths: length of each sample (batch_size, )
# You can also load waveforms directly with torchaudio.load

# Extract features
with torch.no_grad():
    results = model(
        wavs=wavs,
        wav_lengths=wav_lengths,
        target_layer=None,  # None for the last layer, or an integer from 1 to model.num_layers
    )

# results["x"]:              model final output (batch_size, seq_len, encoder_dim)
# results["x_lengths"]:      valid output lengths after encoder subsampling
# results["x_padding_mask"]: output padding mask, where padding is True
# results["mel"]:            mel filterbank features (batch_size, mel_len, mel_dim)
# results["mel_lengths"]:    valid mel lengths before encoder subsampling
# results["hidden_states"]:  list of (batch_size, seq_len, encoder_dim)
# results["ffn"]:            list of (batch_size, seq_len, encoder_dim)
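
For utterance-level tasks, frame-level features are often averaged over time while ignoring padded frames. The sketch below shows one way to do this with the output padding mask described above. `masked_mean_pool` is a hypothetical helper, not part of the USAD 2.0 API, and the tensors here are synthetic stand-ins for the `x` and `x_padding_mask` outputs.

```python
import torch


def masked_mean_pool(x: torch.Tensor, padding_mask: torch.Tensor) -> torch.Tensor:
    """Average frame features over time, skipping padded frames.

    x:            (batch_size, seq_len, encoder_dim) frame-level features
    padding_mask: (batch_size, seq_len) bool tensor, True where the frame is padding
    returns:      (batch_size, encoder_dim) utterance-level features in float32
    """
    x = x.float()  # accumulate in float32, even if the model ran in bfloat16
    valid = (~padding_mask).unsqueeze(-1).to(x.dtype)  # (batch_size, seq_len, 1)
    summed = (x * valid).sum(dim=1)                    # padded frames contribute zero
    counts = valid.sum(dim=1).clamp(min=1.0)           # avoid division by zero
    return summed / counts


# Synthetic example: the second item has two padded frames filled with junk values.
x = torch.ones(2, 4, 3)
x[1, 2:] = 100.0
padding_mask = torch.tensor([[False, False, False, False],
                             [False, False, True, True]])
pooled = masked_mean_pool(x, padding_mask)
print(pooled.shape)  # torch.Size([2, 3])
```

With real model outputs, the same function would be called on the `x` and `x_padding_mask` entries of the returned dictionary. Mean pooling is only one option; other pooling strategies may suit a given task better.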

The self-attention mechanism is implemented with SDPA; you may install FlashAttention to improve inference efficiency.
bfloat16 is preferred for fast inference.
Avoid float16, which can cause numerical instability.
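
The general reason bfloat16 tends to be safer than float16 is dynamic range: float16 overflows just above 65504, while bfloat16 keeps roughly the same exponent range as float32 at reduced precision. The snippet below illustrates this with a synthetic value; it is not a measurement of USAD 2.0 activations.

```python
import torch

# Largest finite value representable in each dtype.
print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # about 3.4e38, close to float32

# A value that fits comfortably in float32 but exceeds the float16 range.
big = torch.tensor([70000.0])
as_fp16 = big.to(torch.float16)   # overflows to inf
as_bf16 = big.to(torch.bfloat16)  # stays finite, rounded to lower precision

print(torch.isinf(as_fp16).item())     # True
print(torch.isfinite(as_bf16).item())  # True
```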

📖 Citation

@inproceedings{chang2026usad2,
  title={{USAD 2.0}: Scaling Representation Distillation for Universal Audio Understanding},
  author={Chang, Heng-Jui and Liu, Alexander H. and Bhati, Saurabhchand and Athi, Mrudula and Ratnarajah, Anton and Chhetri, Amit and Glass, James},
  booktitle={Interspeech},
  year={2026}
}