vectominist committed · Commit 9b2ca4f · verified · 1 Parent(s): 582a9d7

Update README.md

  1. README.md +19 -43
README.md CHANGED
@@ -38,58 +38,39 @@ Training data:
38
 
39
  [👀 **Read Full Paper**](https://arxiv.org/abs/2506.18843)
40
 
41
-
42
 
43
  ## 🗂️ Models
44
 
45
- ### Self-supervised Teachers (WavLM, ATST, MuQ)
46
- General-purpose encoders with good probing performance.
47
 
48
  | Model | Params | Hidden | Layers | Framerate |
49
- | ----------------------------------------------------- | ------:| ------:| ------:| --------- |
50
- | [USAD 2.0 Small](https://hf.co/MIT-SLS/USAD2-Small) | 25M | 384 | 12 | 50Hz |
51
- | [USAD 2.0 Base](https://hf.co/MIT-SLS/USAD2-Base) | 97M | 768 | 12 | 50Hz |
52
- | [USAD 2.0 Large](https://hf.co/MIT-SLS/USAD2-Large) | 336M | 1024 | 24 | 50Hz |
53
- | [USAD 2.0 XLarge](https://hf.co/MIT-SLS/USAD2-XLarge) | 695M | 1280 | 32 | 25Hz |
54
-
55
 
56
- ### Supervised Teachers (Whisper & Audio Flamingo 3)
57
- State-of-the-art encoders for audio LLM front-end. The best layers below indicate the best representations for the [XARES-LLM benchmark](https://github.com/xiaomi-research/xares-llm). We suggest selecting the best layer with the `target_layer` argument in the forward function to optimize audio LLM performance.
58
 
59
  | Model | Params | Hidden | Layers (Best) | Framerate |
60
- | ------------------------------------------------------------- | ------:| ------:| -------------:| --------- |
61
- | [USAD 2.0 Large+](https://hf.co/MIT-SLS/USAD2-Large-Plus) | 336M | 1024 | 24 (20) | 50Hz |
62
- | [USAD 2.0 XLarge+](https://hf.co/MIT-SLS/USAD2-XLarge-Plus) | 695M | 1280 | 32 (28) | 25Hz |
63
- | [USAD 2.0 XXLarge+](https://hf.co/MIT-SLS/USAD2-XXLarge-Plus) | 1036M | 1280 | 48 (40) | 25Hz |
64
-
65
 
66
  ---
67
 
68
- ## Performance
69
-
70
  - [HEAR](https://arxiv.org/abs/2203.03022): probing-based general audio evaluation covering speech, sound, and music
71
  - [MARBLE](https://arxiv.org/abs/2306.10548): probing-based music capability benchmark (instruments and singing voice)
72
  - [XARES-LLM](https://github.com/xiaomi-research/xares-llm): frozen audio encoder + LLM with multi-task LoRA fine-tuning
73
  - Track A (classification): keyword spotting, speaker/language identification, spoof detection, intent/emotion/sound/genre/instrument classification, and sound event detection.
74
  - Track B (understanding): English/Mandarin ASR and audio/music captioning
75
 
76
-
77
- <!-- | Audio Encoder | Params | HEAR | MARBLE | XARES-LLM-A | XARES-LLM-B |
78
- | ----------------- | ------:|:----:|:------:|:-----------:|:-----------:|
79
- | SOTA (Base) | ~90M | 80.6 | 74.0 | 0.660 | 0.418 |
80
- | SOTA (Large) | ~300M | 81.8 | 77.0 | 0.691 | 0.454 |
81
- | SOTA (XLarge) | ~600M | 82.6 | 75.1 | 0.782 | 0.457 |
82
- | USAD 2.0 Small | 25M | 81.0 | 72.9 | 0.604 | 0.357 |
83
- | USAD 2.0 Base | 97M | 81.9 | 74.1 | 0.645 | 0.442 |
84
- | USAD 2.0 Large | 336M | 82.9 | 75.8 | 0.667 | 0.473 |
85
- | USAD 2.0 XLarge | 695M | 82.5 | 75.7 | 0.708 | 0.485 |
86
- | USAD 2.0 Large+ | 336M | 84.0 | 75.1 | 0.769 | 0.580 |
87
- | USAD 2.0 XLarge+ | 695M | 84.4 | 75.0 | 0.772 | 0.611 |
88
- | USAD 2.0 XXLarge+ | 1036M | 84.4 | 75.6 | 0.783 | 0.624 | -->
89
-
90
-
91
  | Encoder | Params | HEAR | MARBLE | XARES-LLM-A | XARES-LLM-B |
92
- | ----------------------- | ------:| --------:| --------:| -----------:| -----------:|
93
  | **Single-encoder SOTA** | | | | | |
94
  | &ensp; Base | ~90M | 80.6 | 74.0 | 0.660 | 0.418 |
95
  | &ensp; Large | ~300M | 81.8 | **77.0** | 0.691 | 0.454 |
@@ -104,11 +85,10 @@ State-of-the-art encoders for audio LLM front-end. The best layers below indicat
104
  | &ensp; XLarge+ | 695M | **84.4** | 75.0 | 0.772 | 0.611 |
105
  | &ensp; XXLarge+ | 1036M | **84.4** | 75.6 | **0.783** | **0.624** |
106
 
107
-
108
- Notes
109
- * The above evaluation are based on *frozen* encoders.
110
  * We encourage fine-tuning USAD 2.0 models for optimal downstream task performance.
111
 
 
112
 
113
  ## 🚀 How To Use
114
 
@@ -162,23 +142,19 @@ with torch.no_grad():
162
  # result["ffn"]: list of (batch_size, seq_len, encoder_dim)
163
  ```
164
 
165
-
166
- Notes
167
  * The self-attention mechanism is implemented with [SDPA](https://pytorch.org/blog/out-of-the-box-acceleration/), you may install FlashAttention to optimize inference efficiency.
168
  * `bfloat16` is preferred for fast inference.
169
  * Avoid using `float16` for numerical stability.
170
- * See [usad2_model.py](https://huggingface.co/MIT-SLS/USAD2-Small/blob/main/usad2_model.py) for more details about the model.
171
-
172
 
173
  ---
174
 
175
  ## 📖 Citation
176
 
177
  ```bibtex
178
- @article{chang2026usad2,
179
  title={{USAD 2.0}: Scaling Representation Distillation for Universal Audio Understanding},
180
  author={Chang, Heng-Jui and Liu, Alexander H. and Bhati, Saurabhchand and Athi, Mrudula and Ratnarajah, Anton and Chhetri, Amit and Glass, James},
181
- journal={arXiv preprint arXiv:},
182
  year={2026}
183
  }
184
  ```
 
38
 
39
  [👀 **Read Full Paper**](https://arxiv.org/abs/2506.18843)
40
 
41
+ ---
42
 
43
  ## 🗂️ Models
44
 
45
+ ### Self-supervised Teachers (WavLM, ATST, MuQ): General-purpose encoders with good probing performance
 
46
 
47
  | Model | Params | Hidden | Layers | Framerate |
48
+ |:----------------------------------------------------- | ------:| ------:| ------:| ---------:|
49
+ | [USAD 2.0 Small](https://hf.co/MIT-SLS/USAD2-Small) | 25M | 384 | 12 | 50Hz |
50
+ | [USAD 2.0 Base](https://hf.co/MIT-SLS/USAD2-Base) | 97M | 768 | 12 | 50Hz |
51
+ | [USAD 2.0 Large](https://hf.co/MIT-SLS/USAD2-Large) | 336M | 1024 | 24 | 50Hz |
52
+ | [USAD 2.0 XLarge](https://hf.co/MIT-SLS/USAD2-XLarge) | 695M | 1280 | 32 | 25Hz |
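
As a rough guide to what the Framerate column means for output length, the sketch below estimates how many frames each encoder emits for a clip of a given duration. The helper name `expected_frames` and the plain `duration * framerate` estimate are assumptions for illustration; the real models may differ by a frame or so at the edges depending on striding and padding.

```python
# Frame rates (Hz) copied from the table above.
FRAMERATE_HZ = {
    "USAD2-Small": 50,
    "USAD2-Base": 50,
    "USAD2-Large": 50,
    "USAD2-XLarge": 25,
}


def expected_frames(model_name: str, duration_s: float) -> int:
    """Approximate output sequence length for a clip (hypothetical helper)."""
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    # Assumption: one frame per 1/framerate seconds, ignoring edge effects.
    return int(duration_s * FRAMERATE_HZ[model_name])


for name in FRAMERATE_HZ:
    print(f"{name}: ~{expected_frames(name, 10.0)} frames for 10 s of audio")
```

The 25Hz XLarge model produces half as many frames as the 50Hz models for the same audio, which matters when sizing the context of a downstream model.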
 
53
 
54
+ ### Supervised Teachers (Whisper & Audio Flamingo 3): State-of-the-art encoders for audio LLM frontend
55
+ We suggest selecting the best layer with the `target_layer` argument in the forward function to optimize audio LLM performance.
56
 
57
  | Model | Params | Hidden | Layers (Best) | Framerate |
58
+ |:------------------------------------------------------------- | ------:| ------:| -------------:| ---------:|
59
+ | [USAD 2.0 Large+](https://hf.co/MIT-SLS/USAD2-Large-Plus) | 336M | 1024 | 24 (20) | 50Hz |
60
+ | [USAD 2.0 XLarge+](https://hf.co/MIT-SLS/USAD2-XLarge-Plus) | 695M | 1280 | 32 (28) | 25Hz |
61
+ | [USAD 2.0 XXLarge+](https://hf.co/MIT-SLS/USAD2-XXLarge-Plus) | 1036M | 1280 | 48 (40) | 25Hz |
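
The "Best" values in the table can be kept in a small lookup so the layer choice is not hard-coded at each call site. The sketch below is only an illustration: `pick_best_layer` is a hypothetical helper, and it assumes layers are counted from 1 and that a list of per-layer outputs is available. The exact semantics of the model's own `target_layer` argument are not shown in this diff, so check the model code before relying on the indexing.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")

# "Best" layers copied from the table above.
BEST_LAYER = {
    "USAD2-Large-Plus": 20,
    "USAD2-XLarge-Plus": 28,
    "USAD2-XXLarge-Plus": 40,
}


def pick_best_layer(model_name: str, per_layer_outputs: Sequence[T]) -> T:
    """Return the output of the table's best layer (hypothetical helper).

    Assumes per_layer_outputs[i] holds the output of layer i + 1.
    """
    layer = BEST_LAYER[model_name]
    if not 1 <= layer <= len(per_layer_outputs):
        raise IndexError(f"layer {layer} not in 1..{len(per_layer_outputs)}")
    return per_layer_outputs[layer - 1]


# Stand-in for the 24 per-layer feature tensors of a Large+ model.
fake_outputs = [f"layer_{i}_features" for i in range(1, 25)]
print(pick_best_layer("USAD2-Large-Plus", fake_outputs))
```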
 
62
 
63
  ---
64
 
65
+ ## ⚙️ Performance
 
66
  - [HEAR](https://arxiv.org/abs/2203.03022): probing-based general audio evaluation covering speech, sound, and music
67
  - [MARBLE](https://arxiv.org/abs/2306.10548): probing-based music capability benchmark (instruments and singing voice)
68
  - [XARES-LLM](https://github.com/xiaomi-research/xares-llm): frozen audio encoder + LLM with multi-task LoRA fine-tuning
69
  - Track A (classification): keyword spotting, speaker/language identification, spoof detection, intent/emotion/sound/genre/instrument classification, and sound event detection.
70
  - Track B (understanding): English/Mandarin ASR and audio/music captioning
71
 
 
72
  | Encoder | Params | HEAR | MARBLE | XARES-LLM-A | XARES-LLM-B |
73
+ | :---------------------- | ------:| --------:| --------:| -----------:| -----------:|
74
  | **Single-encoder SOTA** | | | | | |
75
  | &ensp; Base | ~90M | 80.6 | 74.0 | 0.660 | 0.418 |
76
  | &ensp; Large | ~300M | 81.8 | **77.0** | 0.691 | 0.454 |
 
85
  | &ensp; XLarge+ | 695M | **84.4** | 75.0 | 0.772 | 0.611 |
86
  | &ensp; XXLarge+ | 1036M | **84.4** | 75.6 | **0.783** | **0.624** |
87
 
88
+ * The above evaluations are based on *frozen* encoders.
 
 
89
  * We encourage fine-tuning USAD 2.0 models for optimal downstream task performance.
90
 
91
+ ---
92
 
93
  ## 🚀 How To Use
94
 
 
142
  # result["ffn"]: list of (batch_size, seq_len, encoder_dim)
143
  ```
144
 
 
 
145
  * The self-attention mechanism is implemented with [SDPA](https://pytorch.org/blog/out-of-the-box-acceleration/), you may install FlashAttention to optimize inference efficiency.
146
  * `bfloat16` is preferred for fast inference.
147
  * Avoid using `float16` for numerical stability.
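
The notes above do not say why `float16` is less stable. One common reason, assumed here rather than stated by the authors, is dynamic range: `float16` overflows above 65504, while `bfloat16` keeps the exponent range of `float32` at reduced precision. The standard-library sketch below illustrates this; `to_bfloat16` emulates the format by truncation, whereas real conversions usually round to nearest.

```python
import struct


def fits_float16(x: float) -> bool:
    """True if x can be packed as an IEEE half-precision float."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of a float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


activation = 70000.0  # hypothetical large intermediate value
print(fits_float16(activation))  # float16 tops out at 65504
print(to_bfloat16(activation))   # stays finite, with coarser precision
```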
 
 
148
 
149
  ---
150
 
151
  ## 📖 Citation
152
 
153
  ```bibtex
154
+ @inproceedings{chang2026usad2,
155
  title={{USAD 2.0}: Scaling Representation Distillation for Universal Audio Understanding},
156
  author={Chang, Heng-Jui and Liu, Alexander H. and Bhati, Saurabhchand and Athi, Mrudula and Ratnarajah, Anton and Chhetri, Amit and Glass, James},
157
+ booktitle={Interspeech},
158
  year={2026}
159
  }
160
  ```