How to use MIT-SLS/USAD2-Small with Transformers:

    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("feature-extraction", model="MIT-SLS/USAD2-Small", trust_remote_code=True)

    # Load model directly
    from transformers import AutoModel

    model = AutoModel.from_pretrained("MIT-SLS/USAD2-Small", trust_remote_code=True, dtype="auto")
Update README.md
README.md
CHANGED
@@ -28,7 +28,7 @@ language:
 ---
 # USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding
 
-**USAD 2.0** is a bidirectional transformer-based universal audio encoder that extracts useful representations across multiple audio domains (speech/sound/music) by distilling from SSL/supervised audio foundation models without labeled data. USAD 2.0
+**USAD 2.0** is a bidirectional transformer-based universal audio encoder that extracts useful representations across multiple audio domains (speech/sound/music) by distilling from SSL/supervised audio foundation models without labeled data. USAD 2.0 achieves strong or state-of-the-art performance across probing ([HEAR](https://arxiv.org/abs/2203.03022) and [MARBLE](https://arxiv.org/abs/2306.10548)) and LLM-based evaluations ([XARES-LLM](https://arxiv.org/abs/2603.22728)).
 
 Training data:
 * Multilingual speech (116k hours)
@@ -45,22 +45,22 @@ Training data:
 ### Self-supervised Teachers (WavLM, ATST, MuQ)
 General-purpose encoders with good probing performance.
 
-| Model
-| -----------------------------------------------------
-| [USAD 2.0 Small](https://hf.co/MIT-SLS/USAD2-Small) | 25M | 384 | 12 |
-| [USAD 2.0 Base](https://hf.co/MIT-SLS/USAD2-Base) | 97M | 768 | 12 |
-| [USAD 2.0 Large](https://hf.co/MIT-SLS/USAD2-Large) | 336M | 1024 | 24 |
-| [USAD 2.0 XLarge](https://hf.co/MIT-SLS/USAD2-XLarge) | 695M | 1280 | 32 |
+| Model | Params | Hidden | Layers | Framerate |
+| ----------------------------------------------------- | ------:| ------:| ------:| --------- |
+| [USAD 2.0 Small](https://hf.co/MIT-SLS/USAD2-Small) | 25M | 384 | 12 | 50Hz |
+| [USAD 2.0 Base](https://hf.co/MIT-SLS/USAD2-Base) | 97M | 768 | 12 | 50Hz |
+| [USAD 2.0 Large](https://hf.co/MIT-SLS/USAD2-Large) | 336M | 1024 | 24 | 50Hz |
+| [USAD 2.0 XLarge](https://hf.co/MIT-SLS/USAD2-XLarge) | 695M | 1280 | 32 | 25Hz |
 
 
 ### Supervised Teachers (Whisper & Audio Flamingo 3)
 State-of-the-art encoders for audio LLM front-end. The best layers below indicate the best representations for the [XARES-LLM benchmark](https://github.com/xiaomi-research/xares-llm). We suggest selecting the best layer with the `target_layer` argument in the forward function to optimize audio LLM performance.
 
-| Model
-| -------------------------------------------------------------
-| [USAD 2.0 Large+](https://hf.co/MIT-SLS/USAD2-Large-Plus) | 336M | 1024 | 24 (20) |
-| [USAD 2.0 XLarge+](https://hf.co/MIT-SLS/USAD2-XLarge-Plus) | 695M | 1280 | 32 (28) |
-| [USAD 2.0 XXLarge+](https://hf.co/MIT-SLS/USAD2-XXLarge-Plus) | 1036M | 1280 | 48 (40) |
+| Model | Params | Hidden | Layers (Best) | Framerate |
+| ------------------------------------------------------------- | ------:| ------:| -------------:| --------- |
+| [USAD 2.0 Large+](https://hf.co/MIT-SLS/USAD2-Large-Plus) | 336M | 1024 | 24 (20) | 50Hz |
+| [USAD 2.0 XLarge+](https://hf.co/MIT-SLS/USAD2-XLarge-Plus) | 695M | 1280 | 32 (28) | 25Hz |
+| [USAD 2.0 XXLarge+](https://hf.co/MIT-SLS/USAD2-XXLarge-Plus) | 1036M | 1280 | 48 (40) | 25Hz |
 
 
 ---
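The Framerate and Layers (Best) columns added in this change can be read as a small lookup, sketched below. This is a sketch under stated assumptions, not part of the README: the helper names `approx_num_frames` and `suggested_target_layer` are hypothetical, the model IDs are inferred from the `hf.co` links in the tables, and the frame count assumes the frame rate means output frames per second of audio. The exact count depends on the model's padding and striding, which this change does not describe.

```python
import math

# Values copied from the "+" rows of the two README tables in this change.
# "best_layer" is None where the table lists no best layer.
USAD2_VARIANTS = {
    "MIT-SLS/USAD2-Small": {"layers": 12, "best_layer": None, "framerate_hz": 50},
    "MIT-SLS/USAD2-Base": {"layers": 12, "best_layer": None, "framerate_hz": 50},
    "MIT-SLS/USAD2-Large": {"layers": 24, "best_layer": None, "framerate_hz": 50},
    "MIT-SLS/USAD2-XLarge": {"layers": 32, "best_layer": None, "framerate_hz": 25},
    "MIT-SLS/USAD2-Large-Plus": {"layers": 24, "best_layer": 20, "framerate_hz": 50},
    "MIT-SLS/USAD2-XLarge-Plus": {"layers": 32, "best_layer": 28, "framerate_hz": 25},
    "MIT-SLS/USAD2-XXLarge-Plus": {"layers": 48, "best_layer": 40, "framerate_hz": 25},
}


def approx_num_frames(duration_s: float, framerate_hz: int) -> int:
    """Rough number of output frames (hypothetical helper).

    Assumes frames = duration * frame rate, rounded down; the real model
    may differ by a frame or two depending on padding.
    """
    return math.floor(duration_s * framerate_hz)


def suggested_target_layer(model_id: str):
    """Return the best layer listed in the table, or None if none is listed."""
    return USAD2_VARIANTS[model_id]["best_layer"]


if __name__ == "__main__":
    for model_id, info in USAD2_VARIANTS.items():
        frames = approx_num_frames(10.0, info["framerate_hz"])
        print(model_id, "frames for 10 s:", frames,
              "suggested layer:", suggested_target_layer(model_id))
```

For the supervised-teacher variants, the returned index is the value the README suggests passing as `target_layer` to the forward function. The forward call itself is left out here because its input format is not shown in this change.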