---
license: cc-by-nc-sa-4.0
pipeline_tag: feature-extraction
tags:
- automatic-speech-recognition
- audio-classification
- audio
- speech
- music
library_name: transformers
datasets:
- openslr/librispeech_asr
- facebook/multilingual_librispeech
- mozilla-foundation/common_voice_17_0
- speechcolab/gigaspeech
- facebook/voxpopuli
- espnet/mms_ulab_v2
- google/fleurs
- AISHELL/AISHELL-1
- kresnik/zeroth_korean
- ylacombe/expresso
- agkphysics/AudioSet
- 11hu83/vggsound
- benjamin-paine/free-music-archive-full
- rkstgr/mtg-jamendo
language:
- en
---
# USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding

**USAD 2.0** is a bidirectional transformer-based universal audio encoder that extracts useful representations across multiple audio domains (speech/sound/music) by distilling from SSL/supervised audio foundation models without labeled data. USAD 2.0 achieves strong or state-of-the-art performance across probing ([HEAR](https://arxiv.org/abs/2203.03022) and [MARBLE](https://arxiv.org/abs/2306.10548)) and LLM-based evaluations ([XARES-LLM](https://arxiv.org/abs/2603.22728)).

Training data:
* Multilingual speech (116k hours)
* General audio and sound (21k hours)
* Music (13k hours)


[👀 **Read Full Paper**](https://arxiv.org/abs/2606.06444)

---

## 🗂️ Models

### Self-supervised Teachers (WavLM, ATST, MuQ): General-purpose encoders with good probing performance

| Model                                                 | Params | Hidden | Layers | Framerate |
|:----------------------------------------------------- | ------:| ------:| ------:| ---------:|
| [USAD 2.0 Small](https://hf.co/MIT-SLS/USAD2-Small)   |    25M |    384 |     12 |      50Hz |
| [USAD 2.0 Base](https://hf.co/MIT-SLS/USAD2-Base)     |    97M |    768 |     12 |      50Hz |
| [USAD 2.0 Large](https://hf.co/MIT-SLS/USAD2-Large)   |   336M |   1024 |     24 |      50Hz |
| [USAD 2.0 XLarge](https://hf.co/MIT-SLS/USAD2-XLarge) |   695M |   1280 |     32 |      25Hz |

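The Framerate column sets how many feature frames the encoder emits per second of audio. As a rough sketch, the sequence length for a clip can be estimated from its duration. The helper `approx_num_frames` is hypothetical and not part of the USAD 2.0 API, and it ignores any edge padding the model applies, so the actual value should be read from the `x_lengths` output of the forward function.

```python
def approx_num_frames(duration_s: float, frame_rate_hz: int) -> int:
    """Approximate number of encoder frames for a clip (ignores edge padding)."""
    return int(duration_s * frame_rate_hz)


# Frame rates taken from the table above, for a 10-second clip.
print(approx_num_frames(10.0, 50))  # Small / Base / Large: 500 frames
print(approx_num_frames(10.0, 25))  # XLarge: 250 frames
```

At 25Hz, the XLarge model yields roughly half as many frames as the 50Hz models for the same clip, which shortens the sequence handed to a downstream model.
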
### Supervised Teachers (Whisper & Audio Flamingo 3): State-of-the-art encoders for audio LLM frontend
We suggest setting the `target_layer` argument of the forward function to the best layer (shown in parentheses in the table below) to optimize audio LLM performance.

| Model                                                         | Params | Hidden | Layers (Best) | Framerate |
|:------------------------------------------------------------- | ------:| ------:| -------------:| ---------:|
| [USAD 2.0 Large+](https://hf.co/MIT-SLS/USAD2-Large-Plus)     |   336M |   1024 |       24 (20) |      50Hz |
| [USAD 2.0 XLarge+](https://hf.co/MIT-SLS/USAD2-XLarge-Plus)   |   695M |   1280 |       32 (28) |      25Hz |
| [USAD 2.0 XXLarge+](https://hf.co/MIT-SLS/USAD2-XXLarge-Plus) |  1036M |   1280 |       48 (40) |      25Hz |

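One way to keep the recommended layers next to the checkpoint names is a small lookup table. This is only a sketch: `BEST_LAYER` and `pick_target_layer` are hypothetical names, not part of the model's API, and the values are copied from the "Layers (Best)" column above. The best layer for a particular downstream task may differ.

```python
# Best layers as listed in the "Layers (Best)" column (hypothetical helper table).
BEST_LAYER = {
    "MIT-SLS/USAD2-Large-Plus": 20,
    "MIT-SLS/USAD2-XLarge-Plus": 28,
    "MIT-SLS/USAD2-XXLarge-Plus": 40,
}


def pick_target_layer(model_id: str):
    """Return the listed best layer, or None (last layer) for other checkpoints."""
    return BEST_LAYER.get(model_id)


# With a loaded model, the value would be passed to the forward function:
# results = model(wavs=wavs, wav_lengths=wav_lengths,
#                 target_layer=pick_target_layer("MIT-SLS/USAD2-Large-Plus"))
print(pick_target_layer("MIT-SLS/USAD2-Large-Plus"))  # 20
```
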
---

## ⚙️ Performance
- [HEAR](https://arxiv.org/abs/2203.03022): probing-based general audio evaluation covering speech, sound, and music
- [MARBLE](https://arxiv.org/abs/2306.10548): probing-based music capability benchmark (instruments and singing voice)
- [XARES-LLM](https://github.com/xiaomi-research/xares-llm): frozen audio encoder + LLM with multi-task LoRA fine-tuning
    - Track A (classification): keyword spotting, speaker/language identification, spoof detection, intent/emotion/sound/genre/instrument classification, and sound event detection.
    - Track B (understanding): English/Mandarin ASR and audio/music captioning

| Encoder                 | Params |     HEAR |   MARBLE | XARES-LLM-A | XARES-LLM-B |
| :---------------------- | ------:| --------:| --------:| -----------:| -----------:|
| **Single-encoder SOTA** |        |          |          |             |             |
|   Base             |   ~90M |     80.6 |     74.0 |       0.660 |       0.418 |
|   Large            |  ~300M |     81.8 | **77.0** |       0.691 |       0.454 |
|   XLarge           |  ~600M |     82.6 |     75.1 |       0.782 |       0.457 |
| **USAD 2.0**            |        |          |          |             |             |
|   Small            |    25M |     81.0 |     72.9 |       0.604 |       0.357 |
|   Base             |    97M |     81.9 |     74.1 |       0.645 |       0.442 |
|   Large            |   336M |     82.9 |     75.8 |       0.667 |       0.473 |
|   XLarge           |   695M |     82.5 |     75.7 |       0.708 |       0.485 |
| **USAD 2.0+**           |        |          |          |             |             |
|   Large+           |   336M |     84.0 |     75.1 |       0.769 |       0.580 |
|   XLarge+          |   695M | **84.4** |     75.0 |       0.772 |       0.611 |
|   XXLarge+         |  1036M | **84.4** |     75.6 |   **0.783** |   **0.624** |

* The above evaluations are based on *frozen* encoders.
* We encourage fine-tuning USAD 2.0 models for optimal downstream task performance.

---

## 🚀 How To Use

**Installation**
```bash
pip install -U torch torchaudio transformers
```

**Load Model and Extract Features**
```python
import torch
from transformers import AutoModel

# Load pre-trained model
model = AutoModel.from_pretrained(
    "MIT-SLS/USAD2-Small", trust_remote_code=True
).cuda().eval()

# Model properties
model.sample_rate         # required audio sample rate
model.encoder_frame_rate  # frames per second (Hz)
model.mel_dim             # mel feature dimension
model.encoder_dim         # hidden dimension
model.num_layers          # number of encoder layers
model.device              # device
model.dtype               # dtype

# Model methods
model.set_audio_chunk_size(30.0)  # audio longer than this many seconds is chunked (default: 30 s)

# Load audio and resample to 16kHz
wavs, wav_lengths = model.load_audio_batch(["audio1.wav", "audio2.wav"])
# wavs:        raw waveforms (batch_size, max_wav_len)
# wav_lengths: length of each sample (batch_size, )
# You can also load waveforms directly with torchaudio.load

# Extract features
with torch.no_grad():
    results = model(
        wavs=wavs,
        wav_lengths=wav_lengths,
        target_layer=None,  # None for last layer, or integer 1 ~ model.num_layers
    )

# results["x"]:              model final output (batch_size, seq_len, encoder_dim)
# results["x_lengths"]:      valid output lengths after encoder subsampling
# results["x_padding_mask"]: output padding mask, where padding is True
# results["mel"]:            mel fbank (batch_size, mel_len, mel_dim)
# results["mel_lengths"]:    valid mel lengths before encoder subsampling
# results["hidden_states"]:  list of (batch_size, seq_len, encoder_dim)
# results["ffn"]:            list of (batch_size, seq_len, encoder_dim)
```
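
For clip-level tasks, the frame-level output usually has to be reduced to one vector per clip. Below is a minimal sketch of masked mean pooling, assuming `results["x"]` and `results["x_padding_mask"]` have the shapes listed above and that the mask is boolean. `masked_mean_pool` is a hypothetical helper, not part of the model's API, and the tensors are synthetic stand-ins so the snippet runs without downloading a checkpoint.

```python
import torch


def masked_mean_pool(x: torch.Tensor, padding_mask: torch.Tensor) -> torch.Tensor:
    """Average frame features over time, ignoring padded frames.

    x:            (batch_size, seq_len, encoder_dim)
    padding_mask: (batch_size, seq_len), True where the frame is padding
    """
    valid = (~padding_mask).unsqueeze(-1).to(x.dtype)  # (B, T, 1), 1 for real frames
    summed = (x * valid).sum(dim=1)                     # (B, D), padded frames zeroed
    counts = valid.sum(dim=1).clamp(min=1.0)            # (B, 1), avoid division by zero
    return summed / counts


# Synthetic stand-ins for results["x"] and results["x_padding_mask"].
x = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                  [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]])
padding_mask = torch.tensor([[False, False, True],
                             [False, False, False]])
pooled = masked_mean_pool(x, padding_mask)
print(pooled.shape)  # torch.Size([2, 2])
```

The padded third frame of the first clip does not affect its pooled vector; only the two valid frames are averaged.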

* The self-attention mechanism is implemented with [SDPA](https://pytorch.org/blog/out-of-the-box-acceleration/); you may install FlashAttention to improve inference efficiency.
* Prefer `bfloat16` for fast inference.
* Avoid `float16`, which may cause numerical instability.

---

## 📖 Citation

```bibtex
@inproceedings{chang2026usad2,
  title={{USAD 2.0}: Scaling Representation Distillation for Universal Audio Understanding},
  author={Chang, Heng-Jui and Liu, Alexander H. and Bhati, Saurabhchand and Athi, Mrudula and Ratnarajah, Anton and Chhetri, Amit and Glass, James},
  booktitle={Interspeech},
  year={2026}
}
```

---

## 🙏 Acknowledgement

Our implementation is based on the awesome [facebookresearch/fairseq](https://github.com/facebookresearch/fairseq), [cwx-worst-one/EAT](https://github.com/cwx-worst-one/EAT), and [sooftware/conformer](https://github.com/sooftware/conformer) repositories.