vectominist committed on
Commit 582a9d7 · verified · 1 Parent(s): 94d3a9f

Update README.md

  1. README.md +12 -12
README.md CHANGED
@@ -28,7 +28,7 @@ language:
  ---
  # USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding

- **USAD 2.0** is a bidirectional transformer-based universal audio encoder that extracts useful representations across multiple audio domains (speech/sound/music) by distilling from SSL/supervised audio foundation models without labeled data. USAD 2.0 consistently outperforms prior encoders across probing ([HEAR](https://arxiv.org/abs/2203.03022) and [MARBLE](https://arxiv.org/abs/2306.10548)) and LLM-based evaluations ([XARES-LLM](https://arxiv.org/abs/2603.22728)).
+ **USAD 2.0** is a bidirectional transformer-based universal audio encoder that extracts useful representations across multiple audio domains (speech/sound/music) by distilling from SSL/supervised audio foundation models without labeled data. USAD 2.0 achieves strong or state-of-the-art performance across probing ([HEAR](https://arxiv.org/abs/2203.03022) and [MARBLE](https://arxiv.org/abs/2306.10548)) and LLM-based evaluations ([XARES-LLM](https://arxiv.org/abs/2603.22728)).

  Training data:
  * Multilingual speech (116k hours)
@@ -45,22 +45,22 @@ Training data:
  ### Self-supervised Teachers (WavLM, ATST, MuQ)
  General-purpose encoders with good probing performance.

- | Model | Params | Hidden | Layers | Framerate |
- | -------------------------------------------------------------- | ------:| ------:| ------:| ---------:|
- | [USAD 2.0 Small](https://hf.co/MIT-SLS/USAD2-Small) | 25M | 384 | 12 | 50Hz |
- | [USAD 2.0 Base](https://hf.co/MIT-SLS/USAD2-Base) | 97M | 768 | 12 | 50Hz |
- | [USAD 2.0 Large](https://hf.co/MIT-SLS/USAD2-Large) | 336M | 1024 | 24 | 50Hz |
- | [USAD 2.0 XLarge](https://hf.co/MIT-SLS/USAD2-XLarge) | 695M | 1280 | 32 | 25Hz |
+ | Model | Params | Hidden | Layers | Framerate |
+ | ----------------------------------------------------- | ------:| ------:| ------:| --------- |
+ | [USAD 2.0 Small](https://hf.co/MIT-SLS/USAD2-Small) | 25M | 384 | 12 | 50Hz |
+ | [USAD 2.0 Base](https://hf.co/MIT-SLS/USAD2-Base) | 97M | 768 | 12 | 50Hz |
+ | [USAD 2.0 Large](https://hf.co/MIT-SLS/USAD2-Large) | 336M | 1024 | 24 | 50Hz |
+ | [USAD 2.0 XLarge](https://hf.co/MIT-SLS/USAD2-XLarge) | 695M | 1280 | 32 | 25Hz |


  ### Supervised Teachers (Whisper & Audio Flamingo 3)
  State-of-the-art encoders for audio LLM front-end. The best layers below indicate the best representations for the [XARES-LLM benchmark](https://github.com/xiaomi-research/xares-llm). We suggest selecting the best layer with the `target_layer` argument in the forward function to optimize audio LLM performance.

- | Model | Params | Hidden | Layers (Best) | Framerate |
- | ---------------------------------------------------------------------- | ------:| ------:| -------------:| ---------:|
- | [USAD 2.0 Large+](https://hf.co/MIT-SLS/USAD2-Large-Plus) | 336M | 1024 | 24 (20) | 50Hz |
- | [USAD 2.0 XLarge+](https://hf.co/MIT-SLS/USAD2-XLarge-Plus) | 695M | 1280 | 32 (28) | 25Hz |
- | [USAD 2.0 XXLarge+](https://hf.co/MIT-SLS/USAD2-XXLarge-Plus) | 1036M | 1280 | 48 (40) | 25Hz |
+ | Model | Params | Hidden | Layers (Best) | Framerate |
+ | ------------------------------------------------------------- | ------:| ------:| -------------:| --------- |
+ | [USAD 2.0 Large+](https://hf.co/MIT-SLS/USAD2-Large-Plus) | 336M | 1024 | 24 (20) | 50Hz |
+ | [USAD 2.0 XLarge+](https://hf.co/MIT-SLS/USAD2-XLarge-Plus) | 695M | 1280 | 32 (28) | 25Hz |
+ | [USAD 2.0 XXLarge+](https://hf.co/MIT-SLS/USAD2-XXLarge-Plus) | 1036M | 1280 | 48 (40) | 25Hz |


  ---
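
The README text in this diff recommends passing the best layer through the `target_layer` argument of the forward function, and the tables list a frame rate per checkpoint. A minimal sketch of how those two table columns might be used is given below. It only encodes the numbers from the "Supervised Teachers" table; the helper names (`Usad2Variant`, `forward_kwargs`, `approx_num_frames`) are hypothetical, the model-loading step is deliberately left out because the diff does not show it, and whether `target_layer` is 0- or 1-indexed is not stated here and should be checked against the model code.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Usad2Variant:
    """One row of the 'Supervised Teachers' table (values copied from the README)."""
    repo_id: str
    hidden: int
    layers: int
    best_layer: int      # "Best" layer reported for XARES-LLM
    framerate_hz: int


VARIANTS: Dict[str, Usad2Variant] = {
    "Large+": Usad2Variant("MIT-SLS/USAD2-Large-Plus", 1024, 24, 20, 50),
    "XLarge+": Usad2Variant("MIT-SLS/USAD2-XLarge-Plus", 1280, 32, 28, 25),
    "XXLarge+": Usad2Variant("MIT-SLS/USAD2-XXLarge-Plus", 1280, 48, 40, 25),
}


def forward_kwargs(name: str) -> Dict[str, int]:
    """Keyword arguments to pass to the forward call for a given variant.

    Assumption: the forward function accepts `target_layer` as an integer,
    as the README sentence suggests; the indexing convention is not verified.
    """
    return {"target_layer": VARIANTS[name].best_layer}


def approx_num_frames(duration_s: float, framerate_hz: int) -> int:
    """Rough number of output frames for a clip of the given duration.

    This is only duration * frame rate; the real count may differ by a few
    frames depending on padding and the feature extractor.
    """
    return int(duration_s * framerate_hz)


if __name__ == "__main__":
    for name, variant in VARIANTS.items():
        frames = approx_num_frames(10.0, variant.framerate_hz)
        print(name, variant.repo_id, forward_kwargs(name), "frames for 10 s:", frames)
```

With a loaded model, the resulting dictionary would be unpacked into the forward call (for example `model(audio, **forward_kwargs("XLarge+"))`, where `model` and `audio` are placeholders). Note that the 25Hz checkpoints yield roughly half as many frames per second as the 50Hz ones, which matters when sizing the sequence fed to an audio LLM.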