MIT-SLS/USAD2-Small

@@ -28,7 +28,7 @@ language:
 ---
 # USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding
-**USAD 2.0** is a bidirectional transformer-based universal audio encoder that extracts useful representations across multiple audio domains (speech/sound/music) by distilling from SSL/supervised audio foundation models without labeled data. USAD 2.0 consistently outperforms prior encoders across probing ([HEAR](https://arxiv.org/abs/2203.03022) and [MARBLE](https://arxiv.org/abs/2306.10548)) and LLM-based evaluations ([XARES-LLM](https://arxiv.org/abs/2603.22728)).
 Training data:
 * Multilingual speech (116k hours)
@@ -45,22 +45,22 @@ Training data:
 ### Self-supervised Teachers (WavLM, ATST, MuQ)
 General-purpose encoders with good probing performance.
-| Model                                                          | Params | Hidden | Layers | Framerate |
-| -------------------------------------------------------------- | ------:| ------:| ------:| ---------:|
-| [USAD 2.0 Small](https://hf.co/MIT-SLS/USAD2-Small)   |    25M |    384 |     12 |      50Hz |
-| [USAD 2.0 Base](https://hf.co/MIT-SLS/USAD2-Base)     |    97M |    768 |     12 |      50Hz |
-| [USAD 2.0 Large](https://hf.co/MIT-SLS/USAD2-Large)   |   336M |   1024 |     24 |      50Hz |
-| [USAD 2.0 XLarge](https://hf.co/MIT-SLS/USAD2-XLarge) |   695M |   1280 |     32 |      25Hz |
 ### Supervised Teachers (Whisper & Audio Flamingo 3)
 State-of-the-art encoders for audio LLM front-ends. The best layer, shown in parentheses below, is the one that gives the best representations for the [XARES-LLM benchmark](https://github.com/xiaomi-research/xares-llm). To optimize audio LLM performance, we suggest selecting it with the `target_layer` argument of the forward function.
-| Model                                                                  | Params | Hidden | Layers (Best) | Framerate |
-| ---------------------------------------------------------------------- | ------:| ------:| -------------:| ---------:|
-| [USAD 2.0 Large+](https://hf.co/MIT-SLS/USAD2-Large-Plus)     |   336M |   1024 |       24 (20) |      50Hz |
-| [USAD 2.0 XLarge+](https://hf.co/MIT-SLS/USAD2-XLarge-Plus)   |   695M |   1280 |       32 (28) |      25Hz |
-| [USAD 2.0 XXLarge+](https://hf.co/MIT-SLS/USAD2-XXLarge-Plus) |  1036M |   1280 |       48 (40) |      25Hz |
 ---

 ---
 # USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding
+**USAD 2.0** is a bidirectional transformer-based universal audio encoder that extracts useful representations across multiple audio domains (speech/sound/music) by distilling from SSL/supervised audio foundation models without labeled data. USAD 2.0 achieves strong or state-of-the-art performance across probing ([HEAR](https://arxiv.org/abs/2203.03022) and [MARBLE](https://arxiv.org/abs/2306.10548)) and LLM-based evaluations ([XARES-LLM](https://arxiv.org/abs/2603.22728)).
 Training data:
 * Multilingual speech (116k hours)
 ### Self-supervised Teachers (WavLM, ATST, MuQ)
 General-purpose encoders with good probing performance.
+| Model                                                 | Params | Hidden | Layers | Framerate |
+| ----------------------------------------------------- | ------:| ------:| ------:| --------- |
+| [USAD 2.0 Small](https://hf.co/MIT-SLS/USAD2-Small)   |    25M |    384 |     12 | 50Hz      |
+| [USAD 2.0 Base](https://hf.co/MIT-SLS/USAD2-Base)     |    97M |    768 |     12 | 50Hz      |
+| [USAD 2.0 Large](https://hf.co/MIT-SLS/USAD2-Large)   |   336M |   1024 |     24 | 50Hz      |
+| [USAD 2.0 XLarge](https://hf.co/MIT-SLS/USAD2-XLarge) |   695M |   1280 |     32 | 25Hz      |
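The frame rate determines how many feature vectors the encoder emits per second of audio. Below is a minimal sketch of the resulting output shape, assuming the frame count is simply duration times frame rate (real models may differ by a frame or so because of padding and striding). The `MODELS` dictionary and `expected_output_shape` helper are hypothetical, built only from the table above, and are not part of the USAD API.

```python
# Hidden size and frame rate (Hz) copied from the table above.
MODELS = {
    "USAD2-Small": {"hidden": 384, "framerate_hz": 50},
    "USAD2-Base": {"hidden": 768, "framerate_hz": 50},
    "USAD2-Large": {"hidden": 1024, "framerate_hz": 50},
    "USAD2-XLarge": {"hidden": 1280, "framerate_hz": 25},
}


def expected_output_shape(model_name: str, duration_s: float) -> tuple[int, int]:
    """Approximate (frames, hidden) shape for one clip of `duration_s` seconds."""
    spec = MODELS[model_name]
    # Assumption: frames = duration * frame rate, ignoring edge effects.
    frames = int(duration_s * spec["framerate_hz"])
    return frames, spec["hidden"]


print(expected_output_shape("USAD2-Small", 10.0))
print(expected_output_shape("USAD2-XLarge", 10.0))
```

Under this assumption, the 25Hz XLarge model yields half as many frames as the 50Hz models for the same clip, which matters when sizing the input sequence of a downstream model.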
 ### Supervised Teachers (Whisper & Audio Flamingo 3)
 State-of-the-art encoders for audio LLM front-ends. The best layer, shown in parentheses below, is the one that gives the best representations for the [XARES-LLM benchmark](https://github.com/xiaomi-research/xares-llm). To optimize audio LLM performance, we suggest selecting it with the `target_layer` argument of the forward function.
+| Model                                                         | Params | Hidden | Layers (Best) | Framerate |
+| ------------------------------------------------------------- | ------:| ------:| -------------:| --------- |
+| [USAD 2.0 Large+](https://hf.co/MIT-SLS/USAD2-Large-Plus)     |   336M |   1024 |       24 (20) | 50Hz      |
+| [USAD 2.0 XLarge+](https://hf.co/MIT-SLS/USAD2-XLarge-Plus)   |   695M |   1280 |       32 (28) | 25Hz      |
+| [USAD 2.0 XXLarge+](https://hf.co/MIT-SLS/USAD2-XXLarge-Plus) |  1036M |   1280 |       48 (40) | 25Hz      |
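Below is a minimal sketch of choosing the recommended layer, using the "Best" values from the table above. Only the `target_layer` argument comes from this README. The `BEST_LAYER` mapping and `forward_kwargs` helper are hypothetical, and whether `target_layer` counts from 0 or 1 is not stated here, so check the model code before relying on the index.

```python
# "Best" layer per model, copied from the table above (XARES-LLM benchmark).
BEST_LAYER = {
    "MIT-SLS/USAD2-Large-Plus": 20,
    "MIT-SLS/USAD2-XLarge-Plus": 28,
    "MIT-SLS/USAD2-XXLarge-Plus": 40,
}


def forward_kwargs(repo_id: str) -> dict:
    """Build keyword arguments for the forward call of a "+" model."""
    # Models without a listed best layer get no override
    # (assumption: the forward function then uses its own default).
    if repo_id not in BEST_LAYER:
        return {}
    return {"target_layer": BEST_LAYER[repo_id]}


print(forward_kwargs("MIT-SLS/USAD2-Large-Plus"))

# Hypothetical usage, assuming `model` is an already-loaded USAD 2.0 encoder
# and `waveform` is a batch of audio in the format the model expects:
# features = model(waveform, **forward_kwargs("MIT-SLS/USAD2-Large-Plus"))
```

Keeping the layer choice in one mapping makes it easy to swap checkpoints without hunting for hard-coded layer indices.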
 ---