How to use MIT-SLS/USAD2-XLarge-Plus with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("feature-extraction", model="MIT-SLS/USAD2-XLarge-Plus", trust_remote_code=True)
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("MIT-SLS/USAD2-XLarge-Plus", trust_remote_code=True, dtype="auto")

USAD 2.0 is a bidirectional transformer-based universal audio encoder that extracts useful representations across multiple audio domains (speech/sound/music) by distilling from SSL/supervised audio foundation models without labeled data. USAD 2.0 achieves strong or state-of-the-art performance across probing (HEAR and MARBLE) and LLM-based evaluations (XARES-LLM).
Training data:
| Model | Params | Hidden | Layers | Framerate |
|---|---|---|---|---|
| USAD 2.0 Small | 25M | 384 | 12 | 50Hz |
| USAD 2.0 Base | 97M | 768 | 12 | 50Hz |
| USAD 2.0 Large | 336M | 1024 | 24 | 50Hz |
| USAD 2.0 XLarge | 695M | 1280 | 32 | 25Hz |
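The Framerate column determines how many feature frames a clip produces: one frame covers 20 ms at 50Hz and 40 ms at 25Hz. The sketch below estimates the output sequence length from the clip duration. The helper `approx_num_frames` is a hypothetical name, not part of the model's API, and the estimate assumes frames = duration × frame rate rounded down; the actual count may differ slightly because of padding and subsampling.

```python
import math

# Frame rates copied from the table above (Hz).
FRAME_RATE_HZ = {"Small": 50, "Base": 50, "Large": 50, "XLarge": 25}


def approx_num_frames(duration_s: float, frame_rate_hz: int) -> int:
    """Estimate the number of encoder output frames for a clip.

    Assumption: frames = floor(duration * frame rate). The real encoder
    may be off by a frame or two due to padding and subsampling.
    """
    if duration_s < 0 or frame_rate_hz <= 0:
        raise ValueError("duration must be >= 0 and frame rate must be > 0")
    return math.floor(duration_s * frame_rate_hz)


# Estimated sequence length for a 10-second clip with each variant.
for name, rate in FRAME_RATE_HZ.items():
    print(f"USAD 2.0 {name}: ~{approx_num_frames(10.0, rate)} frames")
```

Under this estimate, the 25Hz XLarge model yields half as many frames per clip as the 50Hz models, which matters when the features are fed to a downstream LLM with a limited context length.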
We suggest selecting the best layer with the target_layer argument in the forward function to optimize audio LLM performance.
| Model | Params | Hidden | Layers (Best) | Framerate |
|---|---|---|---|---|
| USAD 2.0 Large+ | 336M | 1024 | 24 (20) | 50Hz |
| USAD 2.0 XLarge+ | 695M | 1280 | 32 (28) | 25Hz |
| USAD 2.0 XXLarge+ | 1036M | 1280 | 48 (40) | 25Hz |
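As a minimal sketch of the layer-selection suggestion, the lookup below stores the "Layers (Best)" values from the table and validates a choice against the 1 to num_layers range accepted by target_layer. The names `LAYER_INFO` and `pick_target_layer` are hypothetical helpers for illustration; they are not part of the model code.

```python
from typing import Optional

# "Layers (Best)" column from the table above: variant -> (num_layers, best_layer)
LAYER_INFO = {
    "Large+": (24, 20),
    "XLarge+": (32, 28),
    "XXLarge+": (48, 40),
}


def pick_target_layer(variant: str, override: Optional[int] = None) -> int:
    """Return the suggested layer for a variant, or a validated override."""
    num_layers, best_layer = LAYER_INFO[variant]
    layer = best_layer if override is None else override
    if not 1 <= layer <= num_layers:
        raise ValueError(f"target_layer must be in 1..{num_layers}, got {layer}")
    return layer


target_layer = pick_target_layer("XLarge+")
print(target_layer)

# With a loaded model, the value would be passed to the forward call, e.g.:
# results = model(wavs=wavs, wav_lengths=wav_lengths, target_layer=target_layer)
```

The suggested layers are a starting point; for a specific downstream task, sweeping a few neighbouring layers with the `override` argument may be worthwhile.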
| Encoder | Params | HEAR | MARBLE | XARES-LLM-A | XARES-LLM-B |
|---|---|---|---|---|---|
| Single-encoder SOTA | | | | | |
| Base | ~90M | 80.6 | 74.0 | 0.660 | 0.418 |
| Large | ~300M | 81.8 | 77.0 | 0.691 | 0.454 |
| XLarge | ~600M | 82.6 | 75.1 | 0.782 | 0.457 |
| USAD 2.0 | | | | | |
| Small | 25M | 81.0 | 72.9 | 0.604 | 0.357 |
| Base | 97M | 81.9 | 74.1 | 0.645 | 0.442 |
| Large | 336M | 82.9 | 75.8 | 0.667 | 0.473 |
| XLarge | 695M | 82.5 | 75.7 | 0.708 | 0.485 |
| USAD 2.0+ | | | | | |
| Large+ | 336M | 84.0 | 75.1 | 0.769 | 0.580 |
| XLarge+ | 695M | 84.4 | 75.0 | 0.772 | 0.611 |
| XXLarge+ | 1036M | 84.4 | 75.6 | 0.783 | 0.624 |
Installation
pip install -U torch torchaudio transformers
Load Model and Extract Features
import torch
from transformers import AutoModel
# Load pre-trained model
model = AutoModel.from_pretrained(
    "MIT-SLS/USAD2-XLarge-Plus", trust_remote_code=True
).cuda().eval()
# Model properties
model.sample_rate # required audio sample rate
model.encoder_frame_rate # frames per second (Hz)
model.mel_dim # mel feature dimension
model.encoder_dim # hidden dimension
model.num_layers # number of encoder layers
model.device # device
model.dtype # dtype
# Model methods
model.set_audio_chunk_size(30.0) # audio longer than this many seconds is chunked (default: 30.0)
# Load audio and resample to 16kHz
wavs, wav_lengths = model.load_audio_batch(["audio1.wav", "audio2.wav"])
# wavs: raw waveforms (batch_size, max_wav_len)
# wav_lengths: length of each sample (batch_size, )
# You can also load waveforms directly with torchaudio.load
# Extract features
with torch.no_grad():
    results = model(
        wavs=wavs,
        wav_lengths=wav_lengths,
        target_layer=None,  # None for the last layer, or an integer from 1 to model.num_layers
    )
# results["x"]: model final output (batch_size, seq_len, encoder_dim)
# results["x_lengths"]: valid output lengths after encoder subsampling
# results["x_padding_mask"]: output padding mask, where padding is True
# results["mel"]: mel fbank (batch_size, mel_len, mel_dim)
# results["mel_lengths"]: valid mel lengths before encoder subsampling
# results["hidden_states"]: list of (batch_size, seq_len, encoder_dim)
# results["ffn"]: list of (batch_size, seq_len, encoder_dim)
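To turn the frame-level output into one vector per clip, a common approach is to average over time while skipping padded frames. The sketch below shows this with dummy tensors that stand in for the "x" and "x_padding_mask" outputs described above. The function `masked_mean_pool` is a hypothetical helper, not part of the model's API, and mean pooling is only one possible pooling choice.

```python
import torch


def masked_mean_pool(x: torch.Tensor, padding_mask: torch.Tensor) -> torch.Tensor:
    """Average frame features over time, ignoring padded frames.

    x:            (batch_size, seq_len, encoder_dim)
    padding_mask: (batch_size, seq_len), True where the frame is padding
    returns:      (batch_size, encoder_dim)
    """
    valid = (~padding_mask).unsqueeze(-1).to(x.dtype)  # 1.0 for real frames, 0.0 for padding
    summed = (x * valid).sum(dim=1)                    # sum over valid frames only
    counts = valid.sum(dim=1).clamp(min=1.0)           # avoid division by zero
    return summed / counts


# Dummy tensors standing in for the "x" and "x_padding_mask" outputs.
x = torch.ones(2, 4, 3)
x[1, 2:] = 100.0  # values at padded positions must not affect the result
padding_mask = torch.tensor(
    [[False, False, False, False],
     [False, False, True, True]]
)

pooled = masked_mean_pool(x, padding_mask)
print(pooled.shape)  # torch.Size([2, 3])
```

Because the mask uses True for padding, it is inverted before weighting; forgetting the inversion would average only the padded frames.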
bfloat16 is preferred for fast inference. float16 for numerical stability.

@inproceedings{chang2026usad2,
title={{USAD 2.0}: Scaling Representation Distillation for Universal Audio Understanding},
author={Chang, Heng-Jui and Liu, Alexander H. and Bhati, Saurabhchand and Athi, Mrudula and Ratnarajah, Anton and Chhetri, Amit and Glass, James},
booktitle={Interspeech},
year={2026}
}
Our implementation is based on the awesome facebookresearch/fairseq, cwx-worst-one/EAT, and sooftware/conformer repositories.