MIT-SLS/USAD2-Small

@@ -38,58 +38,39 @@ Training data:
 [👀 **Read Full Paper**](https://arxiv.org/abs/2506.18843)
 ## 🗂️ Models
-### Self-supervised Teachers (WavLM, ATST, MuQ)
-General-purpose encoders with good probing performance.
 | Model                                                 | Params | Hidden | Layers | Framerate |
-| ----------------------------------------------------- | ------:| ------:| ------:| --------- |
-| [USAD 2.0 Small](https://hf.co/MIT-SLS/USAD2-Small)   |    25M |    384 |     12 | 50Hz      |
-| [USAD 2.0 Base](https://hf.co/MIT-SLS/USAD2-Base)     |    97M |    768 |     12 | 50Hz      |
-| [USAD 2.0 Large](https://hf.co/MIT-SLS/USAD2-Large)   |   336M |   1024 |     24 | 50Hz      |
-| [USAD 2.0 XLarge](https://hf.co/MIT-SLS/USAD2-XLarge) |   695M |   1280 |     32 | 25Hz      |
-### Supervised Teachers (Whisper & Audio Flamingo 3)
-State-of-the-art encoders for audio LLM front-end. The best layers below indicate the best representations for the [XARES-LLM benchmark](https://github.com/xiaomi-research/xares-llm). We suggest selecting the best layer with the `target_layer` argument in the forward function to optimize audio LLM performance.
 | Model                                                         | Params | Hidden | Layers (Best) | Framerate |
-| ------------------------------------------------------------- | ------:| ------:| -------------:| --------- |
-| [USAD 2.0 Large+](https://hf.co/MIT-SLS/USAD2-Large-Plus)     |   336M |   1024 |       24 (20) | 50Hz      |
-| [USAD 2.0 XLarge+](https://hf.co/MIT-SLS/USAD2-XLarge-Plus)   |   695M |   1280 |       32 (28) | 25Hz      |
-| [USAD 2.0 XXLarge+](https://hf.co/MIT-SLS/USAD2-XXLarge-Plus) |  1036M |   1280 |       48 (40) | 25Hz      |
 ---
-## Performance
 - [HEAR](https://arxiv.org/abs/2203.03022): probing-based general audio evaluation covering speech, sound, and music
 - [MARBLE](https://arxiv.org/abs/2306.10548): probing-based music capability benchmark (instruments and singing voice)
 - [XARES-LLM](https://github.com/xiaomi-research/xares-llm): frozen audio encoder + LLM with multi-task LoRA fine-tuning
     - Track A (classification): keyword spotting, speaker/language identification, spoof detection, intent/emotion/sound/genre/instrument classification, and sound event detection.
     - Track B (understanding): English/Mandarin ASR and audio/music captioning
-<!-- | Audio Encoder     | Params | HEAR | MARBLE | XARES-LLM-A | XARES-LLM-B |
-| ----------------- | ------:|:----:|:------:|:-----------:|:-----------:|
-| SOTA (Base)       |   ~90M | 80.6 |  74.0  |    0.660    |    0.418    |
-| SOTA (Large)      |  ~300M | 81.8 |  77.0  |    0.691    |    0.454    |
-| SOTA (XLarge)     |  ~600M | 82.6 |  75.1  |    0.782    |    0.457    |
-| USAD 2.0 Small    |    25M | 81.0 |  72.9  |    0.604    |    0.357    |
-| USAD 2.0 Base     |    97M | 81.9 |  74.1  |    0.645    |    0.442    |
-| USAD 2.0 Large    |   336M | 82.9 |  75.8  |    0.667    |    0.473    |
-| USAD 2.0 XLarge   |   695M | 82.5 |  75.7  |    0.708    |    0.485    |
-| USAD 2.0 Large+   |   336M | 84.0 |  75.1  |    0.769    |    0.580    |
-| USAD 2.0 XLarge+  |   695M | 84.4 |  75.0  |    0.772    |    0.611    |
-| USAD 2.0 XXLarge+ |  1036M | 84.4 |  75.6  |    0.783    |    0.624    | -->
 | Encoder                 | Params |     HEAR |   MARBLE | XARES-LLM-A | XARES-LLM-B |
-| ----------------------- | ------:| --------:| --------:| -----------:| -----------:|
 | **Single-encoder SOTA** |        |          |          |             |             |
 | &ensp; Base             |   ~90M |     80.6 |     74.0 |       0.660 |       0.418 |
 | &ensp; Large            |  ~300M |     81.8 | **77.0** |       0.691 |       0.454 |
@@ -104,11 +85,10 @@ State-of-the-art encoders for audio LLM front-end. The best layers below indicat
 | &ensp; XLarge+          |   695M | **84.4** |     75.0 |       0.772 |       0.611 |
 | &ensp; XXLarge+         |  1036M | **84.4** |     75.6 |   **0.783** |   **0.624** |
-Notes
-* The above evaluation are based on *frozen* encoders.
 * We encourage fine-tuning USAD 2.0 models for optimal downstream task performance.
 ## 🚀 How To Use
@@ -162,23 +142,19 @@ with torch.no_grad():
 # result["ffn"]:            list of (batch_size, seq_len, encoder_dim)
 ```
-Notes
 * The self-attention mechanism is implemented with [SDPA](https://pytorch.org/blog/out-of-the-box-acceleration/), you may install FlashAttention to optimize inference efficiency.
 * `bfloat16` is preferred for fast inference.
 * Avoid using `float16` for numerical stability.
-* See [usad2_model.py](https://huggingface.co/MIT-SLS/USAD2-Small/blob/main/usad2_model.py) for more details about the model.
 ---
 ## 📖 Citation
 ```bibtex
-@article{chang2026usad2,
   title={{USAD 2.0}: Scaling Representation Distillation for Universal Audio Understanding},
   author={Chang, Heng-Jui and Liu, Alexander H. and Bhati, Saurabhchand and Athi, Mrudula and Ratnarajah, Anton and Chhetri, Amit and Glass, James},
-  journal={arXiv preprint arXiv:},
   year={2026}
 }
 ```

 [👀 **Read Full Paper**](https://arxiv.org/abs/2506.18843)
+---
 ## 🗂️ Models
+### Self-supervised Teachers (WavLM, ATST, MuQ): General-purpose encoders with good probing performance
 | Model                                                 | Params | Hidden | Layers | Framerate |
+|:----------------------------------------------------- | ------:| ------:| ------:| ---------:|
+| [USAD 2.0 Small](https://hf.co/MIT-SLS/USAD2-Small)   |    25M |    384 |     12 |      50Hz |
+| [USAD 2.0 Base](https://hf.co/MIT-SLS/USAD2-Base)     |    97M |    768 |     12 |      50Hz |
+| [USAD 2.0 Large](https://hf.co/MIT-SLS/USAD2-Large)   |   336M |   1024 |     24 |      50Hz |
+| [USAD 2.0 XLarge](https://hf.co/MIT-SLS/USAD2-XLarge) |   695M |   1280 |     32 |      25Hz |
+### Supervised Teachers (Whisper & Audio Flamingo 3): State-of-the-art encoders for audio LLM frontend
+We suggest selecting the best layer with the `target_layer` argument in the forward function to optimize audio LLM performance.
 | Model                                                         | Params | Hidden | Layers (Best) | Framerate |
+|:------------------------------------------------------------- | ------:| ------:| -------------:| ---------:|
+| [USAD 2.0 Large+](https://hf.co/MIT-SLS/USAD2-Large-Plus)     |   336M |   1024 |       24 (20) |      50Hz |
+| [USAD 2.0 XLarge+](https://hf.co/MIT-SLS/USAD2-XLarge-Plus)   |   695M |   1280 |       32 (28) |      25Hz |
+| [USAD 2.0 XXLarge+](https://hf.co/MIT-SLS/USAD2-XXLarge-Plus) |  1036M |   1280 |       48 (40) |      25Hz |
 ---
+## ⚙️ Performance
 - [HEAR](https://arxiv.org/abs/2203.03022): probing-based general audio evaluation covering speech, sound, and music
 - [MARBLE](https://arxiv.org/abs/2306.10548): probing-based music capability benchmark (instruments and singing voice)
 - [XARES-LLM](https://github.com/xiaomi-research/xares-llm): frozen audio encoder + LLM with multi-task LoRA fine-tuning
     - Track A (classification): keyword spotting, speaker/language identification, spoof detection, intent/emotion/sound/genre/instrument classification, and sound event detection.
     - Track B (understanding): English/Mandarin ASR and audio/music captioning
 | Encoder                 | Params |     HEAR |   MARBLE | XARES-LLM-A | XARES-LLM-B |
+| :---------------------- | ------:| --------:| --------:| -----------:| -----------:|
 | **Single-encoder SOTA** |        |          |          |             |             |
 | &ensp; Base             |   ~90M |     80.6 |     74.0 |       0.660 |       0.418 |
 | &ensp; Large            |  ~300M |     81.8 | **77.0** |       0.691 |       0.454 |
 | &ensp; XLarge+          |   695M | **84.4** |     75.0 |       0.772 |       0.611 |
 | &ensp; XXLarge+         |  1036M | **84.4** |     75.6 |   **0.783** |   **0.624** |
+* The above evaluations are based on *frozen* encoders.
 * We encourage fine-tuning USAD 2.0 models for optimal downstream task performance.
+---
 ## 🚀 How To Use
 # result["ffn"]:            list of (batch_size, seq_len, encoder_dim)
 ```
 * The self-attention mechanism is implemented with [SDPA](https://pytorch.org/blog/out-of-the-box-acceleration/), you may install FlashAttention to optimize inference efficiency.
 * `bfloat16` is preferred for fast inference.
 * Avoid using `float16` for numerical stability.
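
The two dtype notes above can be illustrated without any deep-learning library. The sketch below is not taken from the USAD 2.0 code. It only uses Python's `struct` module to show one commonly cited reason for this kind of advice: `float16` has a maximum finite value of 65504, whereas `bfloat16` keeps the 8-bit exponent of `float32` and so covers a much wider range at lower precision. The helper names `to_float16` and `to_bfloat16` are hypothetical, and the `bfloat16` emulation truncates the low bits, whereas hardware typically rounds.

```python
import math
import struct


def to_float16(x: float) -> float:
    """Round-trip x through IEEE 754 half precision (struct format 'e')."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the float16 range;
        # on hardware such a value would overflow to infinity.
        return math.copysign(math.inf, x)


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


large_activation = 70000.0  # illustrative value above the float16 maximum of 65504
print(to_float16(large_activation))   # overflows to inf
print(to_bfloat16(large_activation))  # stays finite, with reduced precision
```

Whether a given USAD 2.0 checkpoint actually produces activations of this magnitude is not stated in the model card; the example only shows the difference in representable range between the two formats.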
 ---
 ## 📖 Citation
 ```bibtex
+@inproceedings{chang2026usad2,
   title={{USAD 2.0}: Scaling Representation Distillation for Universal Audio Understanding},
   author={Chang, Heng-Jui and Liu, Alexander H. and Bhati, Saurabhchand and Athi, Mrudula and Ratnarajah, Anton and Chhetri, Amit and Glass, James},
+  booktitle={Interspeech},
   year={2026}
 }
 ```
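
As a rough aid for reading the model tables, the sketch below estimates how many output frames (`seq_len`) to expect for a clip of a given duration. It assumes `seq_len` is approximately duration times framerate and ignores any padding or striding details of the real encoder. The dictionary and function names are hypothetical; the framerates and best layers are copied from the tables. The model card does not say whether `target_layer` is 0- or 1-indexed, so the layer numbers are stored as listed.

```python
# Values copied from the model tables: (framerate in Hz, best layer or None).
MODEL_SPECS = {
    "USAD2-Small": (50, None),
    "USAD2-Base": (50, None),
    "USAD2-Large": (50, None),
    "USAD2-XLarge": (25, None),
    "USAD2-Large-Plus": (50, 20),
    "USAD2-XLarge-Plus": (25, 28),
    "USAD2-XXLarge-Plus": (25, 40),
}


def approx_num_frames(model_name: str, duration_s: float) -> int:
    """Estimate the output sequence length as duration times framerate."""
    frame_rate_hz, _ = MODEL_SPECS[model_name]
    return round(duration_s * frame_rate_hz)


def best_layer(model_name: str):
    """Return the best layer listed in the table, or None if none is listed."""
    return MODEL_SPECS[model_name][1]


print(approx_num_frames("USAD2-Large-Plus", 10.0))   # 500
print(approx_num_frames("USAD2-XLarge-Plus", 10.0))  # 250
print(best_layer("USAD2-Large-Plus"))                # 20
```

The halved framerate of the XLarge variants means a downstream audio LLM sees half as many tokens per second of audio, which may matter when budgeting context length.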