---
license: cc-by-nc-sa-4.0
pipeline_tag: feature-extraction
tags:
- automatic-speech-recognition
- audio-classification
- audio
- speech
- music
library_name: transformers
datasets:
- openslr/librispeech_asr
- facebook/multilingual_librispeech
- mozilla-foundation/common_voice_17_0
- speechcolab/gigaspeech
- facebook/voxpopuli
- espnet/mms_ulab_v2
- google/fleurs
- AISHELL/AISHELL-1
- kresnik/zeroth_korean
- ylacombe/expresso
- agkphysics/AudioSet
- 11hu83/vggsound
- benjamin-paine/free-music-archive-full
- rkstgr/mtg-jamendo
language:
- en
---
# USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding

**USAD 2.0** is a bidirectional transformer-based universal audio encoder that extracts useful representations across multiple audio domains (speech/sound/music) by distilling from SSL/supervised audio foundation models without labeled data. USAD 2.0 achieves strong or state-of-the-art performance across probing ([HEAR](https://arxiv.org/abs/2203.03022) and [MARBLE](https://arxiv.org/abs/2306.10548)) and LLM-based evaluations ([XARES-LLM](https://arxiv.org/abs/2603.22728)).

Training data:

* Multilingual speech (116k hours)
* General audio and sound (21k hours)
* Music (13k hours)

[**Read Full Paper**](https://arxiv.org/abs/2606.06444)

---
## Models

### Self-supervised Teachers (WavLM, ATST, MuQ): General-purpose encoders with good probing performance

| Model | Params | Hidden | Layers | Frame rate |
|:----------------------------------------------------- | ------:| ------:| ------:| ---------:|
| [USAD 2.0 Small](https://hf.co/MIT-SLS/USAD2-Small) | 25M | 384 | 12 | 50Hz |
| [USAD 2.0 Base](https://hf.co/MIT-SLS/USAD2-Base) | 97M | 768 | 12 | 50Hz |
| [USAD 2.0 Large](https://hf.co/MIT-SLS/USAD2-Large) | 336M | 1024 | 24 | 50Hz |
| [USAD 2.0 XLarge](https://hf.co/MIT-SLS/USAD2-XLarge) | 695M | 1280 | 32 | 25Hz |
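
The frame rate determines how many feature vectors the encoder emits per second of audio. As a rough sketch, the arithmetic looks like this. The helper `approx_num_frames` is a hypothetical name introduced here, not part of the model's API, and the true output length may differ slightly depending on padding and subsampling; the model reports the valid lengths itself in `x_lengths`.

```python
import math


def approx_num_frames(duration_s: float, frame_rate_hz: int) -> int:
    """Approximate number of encoder frames for a clip of the given duration."""
    return math.floor(duration_s * frame_rate_hz)


# A 30-second clip at the two frame rates listed in the table.
frames_50hz = approx_num_frames(30.0, 50)  # Small / Base / Large
frames_25hz = approx_num_frames(30.0, 25)  # XLarge
print(frames_50hz, frames_25hz)  # 1500 750
```

Halving the frame rate halves the sequence length a downstream model has to process for the same audio.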
### Supervised Teachers (Whisper & Audio Flamingo 3): State-of-the-art encoders for audio LLM frontend

We suggest selecting the best layer with the `target_layer` argument in the forward function to optimize audio LLM performance.

| Model | Params | Hidden | Layers (Best) | Frame rate |
|:------------------------------------------------------------- | ------:| ------:| -------------:| ---------:|
| [USAD 2.0 Large+](https://hf.co/MIT-SLS/USAD2-Large-Plus) | 336M | 1024 | 24 (20) | 50Hz |
| [USAD 2.0 XLarge+](https://hf.co/MIT-SLS/USAD2-XLarge-Plus) | 695M | 1280 | 32 (28) | 25Hz |
| [USAD 2.0 XXLarge+](https://hf.co/MIT-SLS/USAD2-XXLarge-Plus) | 1036M | 1280 | 48 (40) | 25Hz |
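
The best layers from the table can be kept in a small lookup and passed as `target_layer`. This is only a minimal sketch: `BEST_LAYER` and `best_target_layer` are hypothetical names introduced for illustration, and only the layer indices come from the table. It assumes `target_layer` takes a 1-based integer, as the usage example's comment indicates.

```python
# Best layers as listed in the "Layers (Best)" column of the table.
BEST_LAYER = {
    "MIT-SLS/USAD2-Large-Plus": 20,
    "MIT-SLS/USAD2-XLarge-Plus": 28,
    "MIT-SLS/USAD2-XXLarge-Plus": 40,
}


def best_target_layer(model_id: str):
    """Return the suggested layer, or None (last layer) for unlisted models."""
    return BEST_LAYER.get(model_id)


target_layer = best_target_layer("MIT-SLS/USAD2-Large-Plus")
print(target_layer)  # 20
# Pass the value to the forward call, e.g.
# model(wavs=wavs, wav_lengths=wav_lengths, target_layer=target_layer)
```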
---

## Performance

- [HEAR](https://arxiv.org/abs/2203.03022): probing-based general audio evaluation covering speech, sound, and music
- [MARBLE](https://arxiv.org/abs/2306.10548): probing-based music capability benchmark (instruments and singing voice)
- [XARES-LLM](https://github.com/xiaomi-research/xares-llm): frozen audio encoder + LLM with multi-task LoRA fine-tuning
  - Track A (classification): keyword spotting, speaker/language identification, spoof detection, intent/emotion/sound/genre/instrument classification, and sound event detection
  - Track B (understanding): English/Mandarin ASR and audio/music captioning

| Encoder | Params | HEAR | MARBLE | XARES-LLM-A | XARES-LLM-B |
| :---------------------- | ------:| --------:| --------:| -----------:| -----------:|
| **Single-encoder SOTA** | | | | | |
|   Base | ~90M | 80.6 | 74.0 | 0.660 | 0.418 |
|   Large | ~300M | 81.8 | **77.0** | 0.691 | 0.454 |
|   XLarge | ~600M | 82.6 | 75.1 | 0.782 | 0.457 |
| **USAD 2.0** | | | | | |
|   Small | 25M | 81.0 | 72.9 | 0.604 | 0.357 |
|   Base | 97M | 81.9 | 74.1 | 0.645 | 0.442 |
|   Large | 336M | 82.9 | 75.8 | 0.667 | 0.473 |
|   XLarge | 695M | 82.5 | 75.7 | 0.708 | 0.485 |
| **USAD 2.0+** | | | | | |
|   Large+ | 336M | 84.0 | 75.1 | 0.769 | 0.580 |
|   XLarge+ | 695M | **84.4** | 75.0 | 0.772 | 0.611 |
|   XXLarge+ | 1036M | **84.4** | 75.6 | **0.783** | **0.624** |

* The above evaluations are based on *frozen* encoders.
* We encourage fine-tuning USAD 2.0 models for optimal downstream task performance.

---
## How To Use

**Installation**

```
pip install -U torch torchaudio transformers
```

**Load Model and Extract Features**
```python
import torch
from transformers import AutoModel

# Load pre-trained model
model = AutoModel.from_pretrained(
    "MIT-SLS/USAD2-Large-Plus", trust_remote_code=True
).cuda().eval()

# Model properties
model.sample_rate         # required audio sample rate
model.encoder_frame_rate  # frames per second (Hz)
model.mel_dim             # mel feature dimension
model.encoder_dim         # hidden dimension
model.num_layers          # number of encoder layers
model.device              # device
model.dtype               # dtype

# Model methods
model.set_audio_chunk_size(30.0)  # audio longer than 30 s is chunked (default: 30 s)

# Load audio and resample to 16kHz
wavs, wav_lengths = model.load_audio_batch(["audio1.wav", "audio2.wav"])
# wavs: raw waveforms (batch_size, max_wav_len)
# wav_lengths: length of each sample (batch_size,)
# You can also load waveforms directly with torchaudio.load

# Extract features
with torch.no_grad():
    results = model(
        wavs=wavs,
        wav_lengths=wav_lengths,
        target_layer=None,  # None for the last layer, or an integer from 1 to model.num_layers
    )

# results["x"]: model final output (batch_size, seq_len, encoder_dim)
# results["x_lengths"]: valid output lengths after encoder subsampling
# results["x_padding_mask"]: output padding mask, where padding is True
# results["mel"]: mel fbank (batch_size, mel_len, mel_dim)
# results["mel_lengths"]: valid mel lengths before encoder subsampling
# results["hidden_states"]: list of (batch_size, seq_len, encoder_dim)
# results["ffn"]: list of (batch_size, seq_len, encoder_dim)
```
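
For utterance-level tasks, the frame-level output is often pooled over time while ignoring padded frames. The sketch below does a masked mean with plain PyTorch on dummy tensors shaped like the `x` and `x_padding_mask` outputs described above (padding is `True`). `masked_mean_pool` is a hypothetical helper, not part of the model's API, and mean pooling is just one common choice.

```python
import torch


def masked_mean_pool(x: torch.Tensor, padding_mask: torch.Tensor) -> torch.Tensor:
    """Average x (batch, seq_len, dim) over valid frames; padding_mask is True at padding."""
    valid = (~padding_mask).unsqueeze(-1).to(x.dtype)  # (batch, seq_len, 1)
    summed = (x * valid).sum(dim=1)                    # (batch, dim)
    counts = valid.sum(dim=1).clamp(min=1)             # avoid division by zero
    return summed / counts


# Dummy stand-ins for the model outputs: 2 clips, 4 frames, 3 dims.
x = torch.ones(2, 4, 3)
x[1, 2:] = 100.0  # values at padded positions must not affect the result
x_padding_mask = torch.tensor([[False, False, False, False],
                               [False, False, True, True]])

pooled = masked_mean_pool(x, x_padding_mask)
print(pooled.shape)  # torch.Size([2, 3])
```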
* Self-attention is implemented with PyTorch [SDPA](https://pytorch.org/blog/out-of-the-box-acceleration/); you may install FlashAttention to improve inference efficiency.
* `bfloat16` is preferred for fast inference.
* Avoid `float16`, which can be numerically unstable.
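
One common reason to prefer `bfloat16` over `float16` is dynamic range: `float16` overflows above 65504, while `bfloat16` keeps the exponent range of `float32` at reduced precision. The snippet below illustrates this with plain PyTorch. Whether overflow is the specific failure mode for these models is an assumption, not something the notes above state.

```python
import torch

# Largest finite value representable in each dtype.
fp16_max = torch.finfo(torch.float16).max
bf16_max = torch.finfo(torch.bfloat16).max
print(fp16_max)         # 65504.0
print(bf16_max > 1e38)  # True

# A moderately large value overflows in float16 but stays finite in bfloat16.
value = torch.tensor(70000.0)
as_fp16 = value.to(torch.float16)
as_bf16 = value.to(torch.bfloat16)
print(torch.isinf(as_fp16).item())     # True
print(torch.isfinite(as_bf16).item())  # True
```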
---

## Citation

```bibtex
@inproceedings{chang2026usad2,
  title={{USAD 2.0}: Scaling Representation Distillation for Universal Audio Understanding},
  author={Chang, Heng-Jui and Liu, Alexander H. and Bhati, Saurabhchand and Athi, Mrudula and Ratnarajah, Anton and Chhetri, Amit and Glass, James},
  booktitle={Interspeech},
  year={2026}
}
```
---

## Acknowledgement

Our implementation is based on the awesome [facebookresearch/fairseq](https://github.com/facebookresearch/fairseq), [cwx-worst-one/EAT](https://github.com/cwx-worst-one/EAT), and [sooftware/conformer](https://github.com/sooftware/conformer) repositories.