---
license: cc-by-nc-sa-4.0
pipeline_tag: feature-extraction
tags:
- automatic-speech-recognition
- audio-classification
- audio
- speech
- music
library_name: transformers
datasets:
- openslr/librispeech_asr
- facebook/multilingual_librispeech
- mozilla-foundation/common_voice_17_0
- speechcolab/gigaspeech
- facebook/voxpopuli
- espnet/mms_ulab_v2
- google/fleurs
- AISHELL/AISHELL-1
- kresnik/zeroth_korean
- ylacombe/expresso
- agkphysics/AudioSet
- 11hu83/vggsound
- benjamin-paine/free-music-archive-full
- rkstgr/mtg-jamendo
language:
- en
---
# USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding
**USAD 2.0** is a bidirectional transformer-based universal audio encoder that extracts useful representations across multiple audio domains (speech/sound/music) by distilling from SSL/supervised audio foundation models without labeled data. USAD 2.0 achieves strong or state-of-the-art performance across probing ([HEAR](https://arxiv.org/abs/2203.03022) and [MARBLE](https://arxiv.org/abs/2306.10548)) and LLM-based evaluations ([XARES-LLM](https://arxiv.org/abs/2603.22728)).
Training data:
* Multilingual speech (116k hours)
* General audio and sound (21k hours)
* Music (13k hours)
[👀 **Read Full Paper**](https://arxiv.org/abs/2606.06444)
---
## 🗂️ Models
### Self-supervised Teachers (WavLM, ATST, MuQ): General-purpose encoders with good probing performance
| Model | Params | Hidden | Layers | Framerate |
|:----------------------------------------------------- | ------:| ------:| ------:| ---------:|
| [USAD 2.0 Small](https://hf.co/MIT-SLS/USAD2-Small) | 25M | 384 | 12 | 50Hz |
| [USAD 2.0 Base](https://hf.co/MIT-SLS/USAD2-Base) | 97M | 768 | 12 | 50Hz |
| [USAD 2.0 Large](https://hf.co/MIT-SLS/USAD2-Large) | 336M | 1024 | 24 | 50Hz |
| [USAD 2.0 XLarge](https://hf.co/MIT-SLS/USAD2-XLarge) | 695M | 1280 | 32 | 25Hz |
### Supervised Teachers (Whisper & Audio Flamingo 3): State-of-the-art encoders for audio LLM frontend
For audio LLM frontends, we suggest selecting the best layer (shown in parentheses in the table below) with the `target_layer` argument of the forward function.
| Model | Params | Hidden | Layers (Best) | Framerate |
|:------------------------------------------------------------- | ------:| ------:| -------------:| ---------:|
| [USAD 2.0 Large+](https://hf.co/MIT-SLS/USAD2-Large-Plus) | 336M | 1024 | 24 (20) | 50Hz |
| [USAD 2.0 XLarge+](https://hf.co/MIT-SLS/USAD2-XLarge-Plus) | 695M | 1280 | 32 (28) | 25Hz |
| [USAD 2.0 XXLarge+](https://hf.co/MIT-SLS/USAD2-XXLarge-Plus) | 1036M | 1280 | 48 (40) | 25Hz |
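
The frame rates in the tables above determine how many feature vectors an audio LLM receives per clip. A minimal sketch of that arithmetic follows; `num_frames` is a hypothetical helper, not part of the model API, and the exact count produced by the model may differ slightly depending on padding and subsampling (use the returned `x_lengths` for the true values).

```python
import math


def num_frames(duration_s: float, frame_rate_hz: int) -> int:
    """Approximate number of encoder output frames for a clip (hypothetical helper)."""
    return math.floor(duration_s * frame_rate_hz)


# A 30-second clip at the two frame rates listed in the tables
frames_50hz = num_frames(30.0, 50)  # 50Hz models (e.g. Large+)
frames_25hz = num_frames(30.0, 25)  # 25Hz models (e.g. XLarge+, XXLarge+)
print(frames_50hz, frames_25hz)
```

The 25Hz models emit half as many frames for the same audio, which shortens the sequence passed to a downstream LLM.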
---
## ⚙️ Performance
- [HEAR](https://arxiv.org/abs/2203.03022): probing-based general audio evaluation covering speech, sound, and music
- [MARBLE](https://arxiv.org/abs/2306.10548): probing-based music capability benchmark (instruments and singing voice)
- [XARES-LLM](https://github.com/xiaomi-research/xares-llm): frozen audio encoder + LLM with multi-task LoRA fine-tuning
  - Track A (classification): keyword spotting, speaker/language identification, spoof detection, intent/emotion/sound/genre/instrument classification, and sound event detection
  - Track B (understanding): English/Mandarin ASR and audio/music captioning
| Encoder | Params | HEAR | MARBLE | XARES-LLM-A | XARES-LLM-B |
| :---------------------- | ------:| --------:| --------:| -----------:| -----------:|
| **Single-encoder SOTA** | | | | | |
|   Base | ~90M | 80.6 | 74.0 | 0.660 | 0.418 |
|   Large | ~300M | 81.8 | **77.0** | 0.691 | 0.454 |
|   XLarge | ~600M | 82.6 | 75.1 | 0.782 | 0.457 |
| **USAD 2.0** | | | | | |
|   Small | 25M | 81.0 | 72.9 | 0.604 | 0.357 |
|   Base | 97M | 81.9 | 74.1 | 0.645 | 0.442 |
|   Large | 336M | 82.9 | 75.8 | 0.667 | 0.473 |
|   XLarge | 695M | 82.5 | 75.7 | 0.708 | 0.485 |
| **USAD 2.0+** | | | | | |
|   Large+ | 336M | 84.0 | 75.1 | 0.769 | 0.580 |
|   XLarge+ | 695M | **84.4** | 75.0 | 0.772 | 0.611 |
|   XXLarge+ | 1036M | **84.4** | 75.6 | **0.783** | **0.624** |
* The above evaluations are based on *frozen* encoders.
* We encourage fine-tuning USAD 2.0 models for optimal downstream task performance.
---
## 🚀 How To Use
**Installation**
```
pip install -U torch torchaudio transformers
```
**Load Model and Extract Features**
```python
import torch
from transformers import AutoModel
# Load pre-trained model
model = AutoModel.from_pretrained(
"MIT-SLS/USAD2-XXLarge-Plus", trust_remote_code=True
).cuda().eval()
# Model properties
model.sample_rate # required audio sample rate
model.encoder_frame_rate # frames per second (Hz)
model.mel_dim # mel feature dimension
model.encoder_dim # hidden dimension
model.num_layers # number of encoder layers
model.device # device
model.dtype # dtype
# Model methods
model.set_audio_chunk_size(30.0)  # audio longer than this many seconds is chunked (default: 30s)
# Load audio and resample to 16kHz
wavs, wav_lengths = model.load_audio_batch(["audio1.wav", "audio2.wav"])
# wavs: raw waveforms (batch_size, max_wav_len)
# wav_lengths: length of each sample (batch_size, )
# You can also load waveforms directly with torchaudio.load
# Extract features
with torch.no_grad():
    results = model(
        wavs=wavs,
        wav_lengths=wav_lengths,
        target_layer=None,  # None for the last layer, or an integer from 1 to model.num_layers
    )
# results["x"]: model final output (batch_size, seq_len, encoder_dim)
# results["x_lengths"]: valid output lengths after encoder subsampling
# results["x_padding_mask"]: output padding mask, where padding is True
# results["mel"]: mel fbank (batch_size, mel_len, mel_dim)
# results["mel_lengths"]: valid mel lengths before encoder subsampling
# results["hidden_states"]: list of (batch_size, seq_len, encoder_dim)
# results["ffn"]: list of (batch_size, seq_len, encoder_dim)
```
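
For clip-level tasks, the frame-level output is often reduced to one vector per clip. A minimal sketch of masked mean pooling follows, assuming the output shapes documented above and a mask where padding is True. `masked_mean_pool` is a hypothetical helper rather than part of the model API, and the tensors below are random stand-ins, not real model outputs.

```python
import torch


def masked_mean_pool(x: torch.Tensor, padding_mask: torch.Tensor) -> torch.Tensor:
    """Average frame features over time, ignoring padded frames.

    x:            (batch_size, seq_len, encoder_dim)
    padding_mask: (batch_size, seq_len), True where the frame is padding
    """
    valid = (~padding_mask.bool()).unsqueeze(-1).to(x.dtype)  # 1.0 for real frames
    summed = (x * valid).sum(dim=1)           # sum over valid frames only
    counts = valid.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts


# Random stand-ins with the documented shapes (batch of 2, 5 frames, dim 8)
x = torch.randn(2, 5, 8)
padding_mask = torch.tensor([
    [False, False, False, False, False],
    [False, False, False, True, True],  # last two frames are padding
])
emb = masked_mean_pool(x, padding_mask)  # (batch_size, encoder_dim)
```

With real outputs, the call would presumably be `masked_mean_pool(results["x"], results["x_padding_mask"])`.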
* The self-attention mechanism is implemented with [SDPA](https://pytorch.org/blog/out-of-the-box-acceleration/). You may install FlashAttention to further improve inference efficiency.
* `bfloat16` is preferred for fast inference.
* Avoid `float16`, which can cause numerical instability.
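
One general reason `bfloat16` tends to be safer than `float16` is its wider dynamic range. The sketch below illustrates that range difference with plain PyTorch. It is a generic illustration, not a measurement of this model, and it is an assumption that overflow is the relevant failure mode here.

```python
import torch

# A value beyond the float16 maximum (65504) but well within bfloat16 range
big = torch.tensor(70000.0)

as_fp16 = big.to(torch.float16)   # overflows to inf
as_bf16 = big.to(torch.bfloat16)  # stays finite, with reduced precision

fp16_overflows = torch.isinf(as_fp16).item()
bf16_finite = torch.isfinite(as_bf16).item()

print(torch.finfo(torch.float16).max, torch.finfo(torch.bfloat16).max)
```

The trade-off is that `bfloat16` keeps fewer mantissa bits, so individual values are represented more coarsely than in `float16`.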
---
## 📖 Citation
```bibtex
@inproceedings{chang2026usad2,
title={{USAD 2.0}: Scaling Representation Distillation for Universal Audio Understanding},
author={Chang, Heng-Jui and Liu, Alexander H. and Bhati, Saurabhchand and Athi, Mrudula and Ratnarajah, Anton and Chhetri, Amit and Glass, James},
booktitle={Interspeech},
year={2026}
}
```
---
## 🙏 Acknowledgement
Our implementation is based on the awesome [facebookresearch/fairseq](https://github.com/facebookresearch/fairseq), [cwx-worst-one/EAT](https://github.com/cwx-worst-one/EAT), and [sooftware/conformer](https://github.com/sooftware/conformer) repositories.