---
license: cc-by-nc-sa-4.0
pipeline_tag: feature-extraction
tags:
- automatic-speech-recognition
- audio-classification
- audio
- speech
- music
library_name: transformers
datasets:
- openslr/librispeech_asr
- facebook/multilingual_librispeech
- mozilla-foundation/common_voice_17_0
- speechcolab/gigaspeech
- facebook/voxpopuli
- espnet/mms_ulab_v2
- google/fleurs
- AISHELL/AISHELL-1
- kresnik/zeroth_korean
- ylacombe/expresso
- agkphysics/AudioSet
- 11hu83/vggsound
- benjamin-paine/free-music-archive-full
- rkstgr/mtg-jamendo
language:
- en
---
# USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding

**USAD 2.0** is a bidirectional transformer-based universal audio encoder that extracts useful representations across multiple audio domains (speech/sound/music) by distilling from SSL/supervised audio foundation models without labeled data. USAD 2.0 achieves strong or state-of-the-art performance across probing ([HEAR](https://arxiv.org/abs/2203.03022) and [MARBLE](https://arxiv.org/abs/2306.10548)) and LLM-based evaluations ([XARES-LLM](https://arxiv.org/abs/2603.22728)).

Training data:
* Multilingual speech (116k hours)
* General audio and sound (21k hours)
* Music (13k hours)


[👀 **Read Full Paper**](https://arxiv.org/abs/2606.06444)

---

## 🗂️ Models

### Self-supervised Teachers (WavLM, ATST, MuQ): General-purpose encoders with good probing performance

| Model                                                 | Params | Hidden | Layers | Framerate |
|:----------------------------------------------------- | ------:| ------:| ------:| ---------:|
| [USAD 2.0 Small](https://hf.co/MIT-SLS/USAD2-Small)   |    25M |    384 |     12 |      50Hz |
| [USAD 2.0 Base](https://hf.co/MIT-SLS/USAD2-Base)     |    97M |    768 |     12 |      50Hz |
| [USAD 2.0 Large](https://hf.co/MIT-SLS/USAD2-Large)   |   336M |   1024 |     24 |      50Hz |
| [USAD 2.0 XLarge](https://hf.co/MIT-SLS/USAD2-XLarge) |   695M |   1280 |     32 |      25Hz |
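
The frame rate sets how many feature frames the encoder emits per second of audio. As a rough sketch, the output length can be estimated as duration times frame rate. The helper name `approx_num_frames` is hypothetical, and the real count may differ by a frame or so depending on padding and striding in the model's front end.

```python
import math


def approx_num_frames(duration_sec: float, frame_rate_hz: float) -> int:
    """Estimate encoder output length as duration x frame rate, rounded down."""
    return math.floor(duration_sec * frame_rate_hz)


# A 30-second clip at the two frame rates listed in the table above
print(approx_num_frames(30.0, 50))  # 1500 (Small / Base / Large)
print(approx_num_frames(30.0, 25))  # 750 (XLarge)
```

At 25Hz the sequence is half as long as at 50Hz for the same audio, which matters when the features are fed to a downstream model with a limited context length.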

### Supervised Teachers (Whisper & Audio Flamingo 3): State-of-the-art encoders for audio LLM frontend
For audio LLM use, we suggest passing the best layer (shown in parentheses in the table below) as the `target_layer` argument of the forward function.

| Model                                                         | Params | Hidden | Layers (Best) | Framerate |
|:------------------------------------------------------------- | ------:| ------:| -------------:| ---------:|
| [USAD 2.0 Large+](https://hf.co/MIT-SLS/USAD2-Large-Plus)     |   336M |   1024 |       24 (20) |      50Hz |
| [USAD 2.0 XLarge+](https://hf.co/MIT-SLS/USAD2-XLarge-Plus)   |   695M |   1280 |       32 (28) |      25Hz |
| [USAD 2.0 XXLarge+](https://hf.co/MIT-SLS/USAD2-XXLarge-Plus) |  1036M |   1280 |       48 (40) |      25Hz |
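
One way to keep the suggested layers at hand is a small lookup keyed by model ID. This is only an illustrative sketch: the `BEST_LAYER` dict and `pick_target_layer` helper are hypothetical names, and the values are copied from the "Best" column above, which may not be optimal for tasks other than audio LLM frontends.

```python
from typing import Optional

# Suggested layers copied from the table above (illustrative, not part of the library)
BEST_LAYER = {
    "MIT-SLS/USAD2-Large-Plus": 20,
    "MIT-SLS/USAD2-XLarge-Plus": 28,
    "MIT-SLS/USAD2-XXLarge-Plus": 40,
}


def pick_target_layer(model_id: str) -> Optional[int]:
    """Return the suggested layer, or None (last layer) for models not listed."""
    return BEST_LAYER.get(model_id)


print(pick_target_layer("MIT-SLS/USAD2-XLarge-Plus"))  # 28
print(pick_target_layer("MIT-SLS/USAD2-Base"))         # None
```

The returned value can be passed as `target_layer=` in the forward call shown in the usage section, where `None` selects the last layer.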

---

## ⚙️ Performance
- [HEAR](https://arxiv.org/abs/2203.03022): probing-based general audio evaluation covering speech, sound, and music
- [MARBLE](https://arxiv.org/abs/2306.10548): probing-based music capability benchmark (instruments and singing voice)
- [XARES-LLM](https://github.com/xiaomi-research/xares-llm): frozen audio encoder + LLM with multi-task LoRA fine-tuning
    - Track A (classification): keyword spotting, speaker/language identification, spoof detection, intent/emotion/sound/genre/instrument classification, and sound event detection
    - Track B (understanding): English/Mandarin ASR and audio/music captioning

| Encoder                 | Params |     HEAR |   MARBLE | XARES-LLM-A | XARES-LLM-B |
| :---------------------- | ------:| --------:| --------:| -----------:| -----------:|
| **Single-encoder SOTA** |        |          |          |             |             |
| &ensp; Base             |   ~90M |     80.6 |     74.0 |       0.660 |       0.418 |
| &ensp; Large            |  ~300M |     81.8 | **77.0** |       0.691 |       0.454 |
| &ensp; XLarge           |  ~600M |     82.6 |     75.1 |       0.782 |       0.457 |
| **USAD 2.0**            |        |          |          |             |             |
| &ensp; Small            |    25M |     81.0 |     72.9 |       0.604 |       0.357 |
| &ensp; Base             |    97M |     81.9 |     74.1 |       0.645 |       0.442 |
| &ensp; Large            |   336M |     82.9 |     75.8 |       0.667 |       0.473 |
| &ensp; XLarge           |   695M |     82.5 |     75.7 |       0.708 |       0.485 |
| **USAD 2.0+**           |        |          |          |             |             |
| &ensp; Large+           |   336M |     84.0 |     75.1 |       0.769 |       0.580 |
| &ensp; XLarge+          |   695M | **84.4** |     75.0 |       0.772 |       0.611 |
| &ensp; XXLarge+         |  1036M | **84.4** |     75.6 |   **0.783** |   **0.624** |

* The above evaluations are based on *frozen* encoders.
* We encourage fine-tuning USAD 2.0 models for optimal downstream task performance.

---

## 🚀 How To Use

**Installation**
```bash
pip install -U torch torchaudio transformers
```

**Load Model and Extract Features**
```python
import torch
from transformers import AutoModel

# Load pre-trained model
model = AutoModel.from_pretrained(
    "MIT-SLS/USAD2-XLarge-Plus", trust_remote_code=True
).cuda().eval()

# Model properties
model.sample_rate         # required audio sample rate
model.encoder_frame_rate  # frames per second (Hz)
model.mel_dim             # mel feature dimension
model.encoder_dim         # hidden dimension
model.num_layers          # number of encoder layers
model.device              # device
model.dtype               # dtype

# Model methods
model.set_audio_chunk_size(30.0)  # audio longer than this many seconds is split into chunks (default: 30 s)

# Load audio and resample to 16kHz
wavs, wav_lengths = model.load_audio_batch(["audio1.wav", "audio2.wav"])
# wavs:        raw waveforms (batch_size, max_wav_len)
# wav_lengths: length of each sample (batch_size, )
# You can also load waveforms directly with torchaudio.load

# Extract features
with torch.no_grad():
    results = model(
        wavs=wavs,
        wav_lengths=wav_lengths,
        target_layer=None,  # None for last layer, or integer 1 ~ model.num_layers
    )

# results["x"]:              model final output (batch_size, seq_len, encoder_dim)
# results["x_lengths"]:      valid output lengths after encoder subsampling
# results["x_padding_mask"]: output padding mask, where padding is True
# results["mel"]:            mel fbank (batch_size, mel_len, mel_dim)
# results["mel_lengths"]:    valid mel lengths before encoder subsampling
# results["hidden_states"]:  list of (batch_size, seq_len, encoder_dim)
# results["ffn"]:            list of (batch_size, seq_len, encoder_dim)
```
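
For clip-level tasks, the frame-level features are often reduced to one vector per clip. The following is a minimal sketch of masked mean pooling, assuming `results["x_padding_mask"]` is a boolean (or 0/1) tensor of shape `(batch_size, seq_len)`. The helper `masked_mean_pool` is hypothetical and not part of the model's API, and toy tensors stand in for the real outputs.

```python
import torch


def masked_mean_pool(x: torch.Tensor, padding_mask: torch.Tensor) -> torch.Tensor:
    """Average frames over time, ignoring positions where padding_mask is True.

    x:            (batch_size, seq_len, encoder_dim)
    padding_mask: (batch_size, seq_len), True marks padding
    returns:      (batch_size, encoder_dim)
    """
    valid = (~padding_mask.bool()).unsqueeze(-1).to(x.dtype)  # 1 = real frame, 0 = padding
    summed = (x * valid).sum(dim=1)
    counts = valid.sum(dim=1).clamp(min=1)  # guard against all-padding rows
    return summed / counts


# Toy tensors standing in for results["x"] and results["x_padding_mask"]
x = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[False, False, True]])
print(masked_mean_pool(x, mask))  # tensor([[2., 3.]])
```

With the real model, this would be called as `masked_mean_pool(results["x"], results["x_padding_mask"])`. Mean pooling is only one choice; attention pooling or a learned probe may work better for some tasks.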

* The self-attention mechanism is implemented with [SDPA](https://pytorch.org/blog/out-of-the-box-acceleration/). You may install FlashAttention to speed up inference.
* `bfloat16` is preferred for fast inference.
* Avoid `float16`, which can cause numerical instability.
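
The dtype advice above can be sketched as a small selection helper. It assumes a standard PyTorch setup. `pick_inference_dtype` is a hypothetical name, and whether the model's forward pass casts its inputs automatically is not stated here, so the cast in the final comment is an untested assumption.

```python
import torch


def pick_inference_dtype() -> torch.dtype:
    """Prefer bfloat16 on GPUs that support it; otherwise fall back to float32."""
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float32  # float16 is deliberately never chosen


dtype = pick_inference_dtype()
print(dtype)

# Hypothetical usage with the model loaded earlier (standard nn.Module cast):
# model = model.to(dtype)
```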

---

## 📖 Citation

```bibtex
@inproceedings{chang2026usad2,
  title={{USAD 2.0}: Scaling Representation Distillation for Universal Audio Understanding},
  author={Chang, Heng-Jui and Liu, Alexander H. and Bhati, Saurabhchand and Athi, Mrudula and Ratnarajah, Anton and Chhetri, Amit and Glass, James},
  booktitle={Interspeech},
  year={2026}
}
```

---

## 🙏 Acknowledgement

Our implementation is based on the awesome [facebookresearch/fairseq](https://github.com/facebookresearch/fairseq), [cwx-worst-one/EAT](https://github.com/cwx-worst-one/EAT), and [sooftware/conformer](https://github.com/sooftware/conformer) repositories.