---
license: cc-by-nc-sa-4.0
pipeline_tag: feature-extraction
tags:
- automatic-speech-recognition
- audio-classification
- audio
- speech
- music
library_name: transformers
datasets:
- openslr/librispeech_asr
- facebook/multilingual_librispeech
- mozilla-foundation/common_voice_17_0
- speechcolab/gigaspeech
- facebook/voxpopuli
- espnet/mms_ulab_v2
- google/fleurs
- AISHELL/AISHELL-1
- kresnik/zeroth_korean
- ylacombe/expresso
- agkphysics/AudioSet
- 11hu83/vggsound
- benjamin-paine/free-music-archive-full
- rkstgr/mtg-jamendo
language:
- en
---
# USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding
**USAD 2.0** is a bidirectional transformer-based universal audio encoder that extracts useful representations across multiple audio domains (speech/sound/music) by distilling from SSL/supervised audio foundation models without labeled data. USAD 2.0 achieves strong or state-of-the-art performance across probing ([HEAR](https://arxiv.org/abs/2203.03022) and [MARBLE](https://arxiv.org/abs/2306.10548)) and LLM-based evaluations ([XARES-LLM](https://arxiv.org/abs/2603.22728)).
Training data:
* Multilingual speech (116k hours)
* General audio and sound (21k hours)
* Music (13k hours)
[**Read Full Paper**](https://arxiv.org/abs/2606.06444)
---
## Models
### Self-supervised Teachers (WavLM, ATST, MuQ): General-purpose encoders with good probing performance
| Model | Params | Hidden | Layers | Framerate |
|:----------------------------------------------------- | ------:| ------:| ------:| ---------:|
| [USAD 2.0 Small](https://hf.co/MIT-SLS/USAD2-Small) | 25M | 384 | 12 | 50Hz |
| [USAD 2.0 Base](https://hf.co/MIT-SLS/USAD2-Base) | 97M | 768 | 12 | 50Hz |
| [USAD 2.0 Large](https://hf.co/MIT-SLS/USAD2-Large) | 336M | 1024 | 24 | 50Hz |
| [USAD 2.0 XLarge](https://hf.co/MIT-SLS/USAD2-XLarge) | 695M | 1280 | 32 | 25Hz |
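
The frame rate column determines how many feature frames the encoder emits per second of audio. The sketch below estimates the output length for a clip of a given duration; `approx_num_frames` is a hypothetical helper, not part of the model's API, and the exact count produced by the encoder may differ by a few frames depending on its padding and subsampling.

```python
import math


def approx_num_frames(duration_s: float, frame_rate_hz: int) -> int:
    """Estimate the number of output frames for a clip.

    Assumption: frames ~= duration * frame rate. The real encoder may
    differ slightly because of padding and subsampling details.
    """
    return math.floor(duration_s * frame_rate_hz)


# A 30-second clip at the two frame rates listed in the table.
frames_50hz = approx_num_frames(30.0, 50)  # Small / Base / Large
frames_25hz = approx_num_frames(30.0, 25)  # XLarge
print(frames_50hz, frames_25hz)
```

Halving the frame rate halves the sequence length seen by any downstream model, which can matter when the features feed an LLM with a limited context window.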
### Supervised Teachers (Whisper & Audio Flamingo 3): State-of-the-art encoders for audio LLM frontend
We suggest selecting the best layer with the `target_layer` argument in the forward function to optimize audio LLM performance.
| Model | Params | Hidden | Layers (Best) | Framerate |
|:------------------------------------------------------------- | ------:| ------:| -------------:| ---------:|
| [USAD 2.0 Large+](https://hf.co/MIT-SLS/USAD2-Large-Plus) | 336M | 1024 | 24 (20) | 50Hz |
| [USAD 2.0 XLarge+](https://hf.co/MIT-SLS/USAD2-XLarge-Plus) | 695M | 1280 | 32 (28) | 25Hz |
| [USAD 2.0 XXLarge+](https://hf.co/MIT-SLS/USAD2-XXLarge-Plus) | 1036M | 1280 | 48 (40) | 25Hz |
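
One way to keep the suggested layers at hand is a small lookup keyed by repository ID, as sketched below. `BEST_LAYER` and `pick_target_layer` are hypothetical names introduced here for illustration; the values are copied from the "Layers (Best)" column, and the sketch assumes they use the same 1-based indexing as the `target_layer` argument.

```python
from typing import Optional

# Best layers copied from the table above (assumed to be 1-based,
# matching the `target_layer` argument of the forward function).
BEST_LAYER = {
    "MIT-SLS/USAD2-Large-Plus": 20,
    "MIT-SLS/USAD2-XLarge-Plus": 28,
    "MIT-SLS/USAD2-XXLarge-Plus": 40,
}


def pick_target_layer(repo_id: str) -> Optional[int]:
    """Return the suggested layer, or None (last layer) for other checkpoints."""
    return BEST_LAYER.get(repo_id)


layer = pick_target_layer("MIT-SLS/USAD2-XLarge-Plus")
# The value would then be passed to the model, for example:
# results = model(wavs=wavs, wav_lengths=wav_lengths, target_layer=layer)
```

The listed layers are a starting point; the best layer for a particular downstream task may differ and is worth validating.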
---
## Performance
- [HEAR](https://arxiv.org/abs/2203.03022): probing-based general audio evaluation covering speech, sound, and music
- [MARBLE](https://arxiv.org/abs/2306.10548): probing-based music capability benchmark (instruments and singing voice)
- [XARES-LLM](https://github.com/xiaomi-research/xares-llm): frozen audio encoder + LLM with multi-task LoRA fine-tuning
  - Track A (classification): keyword spotting, speaker/language identification, spoof detection, intent/emotion/sound/genre/instrument classification, and sound event detection
  - Track B (understanding): English/Mandarin ASR and audio/music captioning
| Encoder | Params | HEAR | MARBLE | XARES-LLM-A | XARES-LLM-B |
| :---------------------- | ------:| --------:| --------:| -----------:| -----------:|
| **Single-encoder SOTA** | | | | | |
|   Base | ~90M | 80.6 | 74.0 | 0.660 | 0.418 |
|   Large | ~300M | 81.8 | **77.0** | 0.691 | 0.454 |
|   XLarge | ~600M | 82.6 | 75.1 | 0.782 | 0.457 |
| **USAD 2.0** | | | | | |
|   Small | 25M | 81.0 | 72.9 | 0.604 | 0.357 |
|   Base | 97M | 81.9 | 74.1 | 0.645 | 0.442 |
|   Large | 336M | 82.9 | 75.8 | 0.667 | 0.473 |
|   XLarge | 695M | 82.5 | 75.7 | 0.708 | 0.485 |
| **USAD 2.0+** | | | | | |
|   Large+ | 336M | 84.0 | 75.1 | 0.769 | 0.580 |
|   XLarge+ | 695M | **84.4** | 75.0 | 0.772 | 0.611 |
|   XXLarge+ | 1036M | **84.4** | 75.6 | **0.783** | **0.624** |
* The above evaluations are based on *frozen* encoders.
* We encourage fine-tuning USAD 2.0 models for optimal downstream task performance.
---
## How To Use
**Installation**
```
pip install -U torch torchaudio transformers
```
**Load Model and Extract Features**
```python
import torch
from transformers import AutoModel

# Load pre-trained model
model = AutoModel.from_pretrained(
    "MIT-SLS/USAD2-XLarge-Plus", trust_remote_code=True
).cuda().eval()

# Model properties
model.sample_rate         # required audio sample rate
model.encoder_frame_rate  # frames per second (Hz)
model.mel_dim             # mel feature dimension
model.encoder_dim         # hidden dimension
model.num_layers          # number of encoder layers
model.device              # device
model.dtype               # dtype

# Model methods
model.set_audio_chunk_size(30.0)  # audio longer than this many seconds is chunked (default: 30 s)

# Load audio and resample to 16 kHz
wavs, wav_lengths = model.load_audio_batch(["audio1.wav", "audio2.wav"])
# wavs: raw waveforms (batch_size, max_wav_len)
# wav_lengths: length of each sample (batch_size,)
# You can also load waveforms directly with torchaudio.load

# Extract features
with torch.no_grad():
    results = model(
        wavs=wavs,
        wav_lengths=wav_lengths,
        target_layer=None,  # None for the last layer, or an integer from 1 to model.num_layers
    )

# results["x"]: model final output (batch_size, seq_len, encoder_dim)
# results["x_lengths"]: valid output lengths after encoder subsampling
# results["x_padding_mask"]: output padding mask, where padding is True
# results["mel"]: mel fbank (batch_size, mel_len, mel_dim)
# results["mel_lengths"]: valid mel lengths before encoder subsampling
# results["hidden_states"]: list of (batch_size, seq_len, encoder_dim)
# results["ffn"]: list of (batch_size, seq_len, encoder_dim)
```
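
For utterance-level tasks, the frame-level output `x` is often reduced to one vector per clip. The sketch below shows masked mean pooling driven by a padding mask in which `True` marks padding, matching the convention described for `x_padding_mask`. `masked_mean_pool` is a hypothetical helper and the tensors are dummy stand-ins for the model outputs; mean pooling is one common choice here, not necessarily what the authors use in their evaluations.

```python
import torch


def masked_mean_pool(x: torch.Tensor, padding_mask: torch.Tensor) -> torch.Tensor:
    """Average frames over time while ignoring padded positions.

    x:            (batch_size, seq_len, dim) frame-level features
    padding_mask: (batch_size, seq_len) bool, True where the frame is padding
    returns:      (batch_size, dim)
    """
    valid = (~padding_mask).unsqueeze(-1).float()  # (B, T, 1), 1.0 for real frames
    summed = (x.float() * valid).sum(dim=1)        # accumulate in float32
    counts = valid.sum(dim=1).clamp(min=1.0)       # avoid division by zero
    return (summed / counts).to(x.dtype)


# Dummy stand-ins for results["x"] and results["x_padding_mask"].
x = torch.ones(2, 4, 3)
x[1, 2:] = 100.0  # values in padded frames must not affect the result
padding_mask = torch.tensor(
    [[False, False, False, False],
     [False, False, True, True]]
)
pooled = masked_mean_pool(x, padding_mask)
print(pooled.shape)
```

Accumulating in `float32` keeps the average accurate for long sequences even when the features are stored in a lower-precision dtype.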
* Self-attention is implemented with [SDPA](https://pytorch.org/blog/out-of-the-box-acceleration/); installing FlashAttention may further improve inference efficiency.
* `bfloat16` is preferred for fast inference.
* Avoid `float16` to preserve numerical stability.
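
The usage example mentions that waveforms can also be loaded directly with `torchaudio.load` instead of `model.load_audio_batch`. In that case the batch has to be assembled by hand. The sketch below is one possible way to build `wavs` and `wav_lengths` in the shapes documented above; `collate_waveforms` is a hypothetical helper, and it assumes that the inputs are already at the model's sample rate, that multi-channel audio can be averaged to mono, and that zero padding is acceptable. None of these assumptions is confirmed by the model card.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def collate_waveforms(waveforms):
    """Build a padded batch from individually loaded waveforms.

    waveforms: list of tensors shaped (channels, num_samples), as returned
               by torchaudio.load, or (num_samples,). They are assumed to be
               resampled to the model's sample rate beforehand, e.g. with
               torchaudio.functional.resample.
    returns:   wavs (batch_size, max_wav_len), wav_lengths (batch_size,)
    """
    # Assumption: average channels to obtain mono audio.
    mono = [w.mean(dim=0) if w.dim() == 2 else w for w in waveforms]
    wav_lengths = torch.tensor([w.shape[0] for w in mono], dtype=torch.long)
    wavs = pad_sequence(mono, batch_first=True)  # zero-pad to the longest clip
    return wavs, wav_lengths


# Dummy stand-ins for loaded audio: 1 s of stereo and 0.5 s of mono at 16 kHz.
stereo = torch.randn(2, 16000)
short_mono = torch.randn(8000)
wavs, wav_lengths = collate_waveforms([stereo, short_mono])
print(wavs.shape, wav_lengths.tolist())
```

Before calling the model, the batch would typically be moved to `model.device`; whether the model also expects a specific input dtype is not stated above and should be checked against the repository code.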
---
## Citation
```bibtex
@inproceedings{chang2026usad2,
title={{USAD 2.0}: Scaling Representation Distillation for Universal Audio Understanding},
author={Chang, Heng-Jui and Liu, Alexander H. and Bhati, Saurabhchand and Athi, Mrudula and Ratnarajah, Anton and Chhetri, Amit and Glass, James},
booktitle={Interspeech},
year={2026}
}
```
---
## Acknowledgement
Our implementation is based on the awesome [facebookresearch/fairseq](https://github.com/facebookresearch/fairseq), [cwx-worst-one/EAT](https://github.com/cwx-worst-one/EAT), and [sooftware/conformer](https://github.com/sooftware/conformer) repositories.