---
license: cc-by-nc-sa-4.0
pipeline_tag: feature-extraction
tags:
- automatic-speech-recognition
- audio-classification
- audio
- speech
- music
library_name: transformers
datasets:
- openslr/librispeech_asr
- facebook/multilingual_librispeech
- mozilla-foundation/common_voice_17_0
- speechcolab/gigaspeech
- facebook/voxpopuli
- espnet/mms_ulab_v2
- google/fleurs
- AISHELL/AISHELL-1
- kresnik/zeroth_korean
- ylacombe/expresso
- agkphysics/AudioSet
- 11hu83/vggsound
- benjamin-paine/free-music-archive-full
- rkstgr/mtg-jamendo
language:
- en
---

# USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding

**USAD 2.0** is a bidirectional transformer-based universal audio encoder that extracts useful representations across multiple audio domains (speech/sound/music) by distilling from SSL/supervised audio foundation models without labeled data.
USAD 2.0 achieves strong or state-of-the-art performance across probing ([HEAR](https://arxiv.org/abs/2203.03022) and [MARBLE](https://arxiv.org/abs/2306.10548)) and LLM-based evaluations ([XARES-LLM](https://arxiv.org/abs/2603.22728)).


Training data:
* Multilingual speech (116k hours)
* General audio and sound (21k hours)
* Music (13k hours)

[👀 **Read Full Paper**](https://arxiv.org/abs/2606.06444)

---

## 🗂️ Models

### Self-supervised Teachers (WavLM, ATST, MuQ): General-purpose encoders with good probing performance

| Model                                                 | Params | Hidden | Layers | Framerate |
|:----------------------------------------------------- | ------:| ------:| ------:| ---------:|
| [USAD 2.0 Small](https://hf.co/MIT-SLS/USAD2-Small)   | 25M    | 384    | 12     | 50Hz      |
| [USAD 2.0 Base](https://hf.co/MIT-SLS/USAD2-Base)     | 97M    | 768    | 12     | 50Hz      |
| [USAD 2.0 Large](https://hf.co/MIT-SLS/USAD2-Large)   | 336M   | 1024   | 24     | 50Hz      |
| [USAD 2.0 XLarge](https://hf.co/MIT-SLS/USAD2-XLarge) | 695M   | 1280   | 32     | 25Hz      |

### Supervised Teachers (Whisper & Audio Flamingo 3): State-of-the-art encoders for audio LLM frontend

We suggest selecting the best layer with the `target_layer` argument in the forward function to optimize audio LLM performance.

| Model                                                         | Params | Hidden | Layers (Best) | Framerate |
|:------------------------------------------------------------- | ------:| ------:| -------------:| ---------:|
| [USAD 2.0 Large+](https://hf.co/MIT-SLS/USAD2-Large-Plus)     | 336M   | 1024   | 24 (20)       | 50Hz      |
| [USAD 2.0 XLarge+](https://hf.co/MIT-SLS/USAD2-XLarge-Plus)   | 695M   | 1280   | 32 (28)       | 25Hz      |
| [USAD 2.0 XXLarge+](https://hf.co/MIT-SLS/USAD2-XXLarge-Plus) | 1036M  | 1280   | 48 (40)       | 25Hz      |

---

## ⚙️ Performance

- [HEAR](https://arxiv.org/abs/2203.03022): probing-based general audio evaluation covering speech, sound, and music
- [MARBLE](https://arxiv.org/abs/2306.10548): probing-based music capability benchmark (instruments and singing voice)
- [XARES-LLM](https://github.com/xiaomi-research/xares-llm): frozen audio encoder + LLM with multi-task LoRA fine-tuning
  - Track A (classification): keyword spotting, speaker/language identification, spoof detection, intent/emotion/sound/genre/instrument classification, and sound event detection
  - Track B (understanding): English/Mandarin ASR and audio/music captioning

| Encoder                 | Params | HEAR     | MARBLE   | XARES-LLM-A | XARES-LLM-B |
| :---------------------- | ------:| --------:| --------:| -----------:| -----------:|
| **Single-encoder SOTA** |        |          |          |             |             |
|   Base                 | ~90M   | 80.6     | 74.0     | 0.660       | 0.418       |
|   Large                | ~300M  | 81.8     | **77.0** | 0.691       | 0.454       |
|   XLarge               | ~600M  | 82.6     | 75.1     | 0.782       | 0.457       |
| **USAD 2.0**            |        |          |          |             |             |
|   Small                | 25M    | 81.0     | 72.9     | 0.604       | 0.357       |
|   Base                 | 97M    | 81.9     | 74.1     | 0.645       | 0.442       |
|   Large                | 336M   | 82.9     | 75.8     | 0.667       | 0.473       |
|   XLarge               | 695M   | 82.5     | 75.7     | 0.708       | 0.485       |
| **USAD 2.0+**           |        |          |          |             |             |
|   Large+               | 336M   | 84.0     | 75.1     | 0.769       | 0.580       |
|   XLarge+              | 695M   | **84.4** | 75.0     | 0.772       | 0.611       |
|   XXLarge+             | 1036M  | **84.4** | 75.6     | **0.783**   | **0.624**   |

* The above evaluations are based on *frozen* encoders.
* We encourage fine-tuning USAD 2.0 models for optimal downstream task performance.

---

## 🚀 How To Use

**Installation**

```
pip install -U torch torchaudio transformers
```

**Load Model and Extract Features**

```python
import torch
from transformers import AutoModel

# Load pre-trained model
model = AutoModel.from_pretrained(
    "MIT-SLS/USAD2-XLarge-Plus", trust_remote_code=True
).cuda().eval()

# Model properties
model.sample_rate         # required audio sample rate
model.encoder_frame_rate  # frames per second (Hz)
model.mel_dim             # mel feature dimension
model.encoder_dim         # hidden dimension
model.num_layers          # number of encoder layers
model.device              # device
model.dtype               # dtype

# Model methods
model.set_audio_chunk_size(30.0)  # audio is chunked if it exceeds 30 seconds (default 30s)

# Load audio and resample to 16kHz
wavs, wav_lengths = model.load_audio_batch(["audio1.wav", "audio2.wav"])
# wavs: raw waveforms (batch_size, max_wav_len)
# wav_lengths: length of each sample (batch_size, )
# You can also load waveforms directly with torchaudio.load

# Extract features
with torch.no_grad():
    results = model(
        wavs=wavs,
        wav_lengths=wav_lengths,
        target_layer=None,  # None for last layer, or integer 1 ~ model.num_layers
    )

# results["x"]: model final output (batch_size, seq_len, encoder_dim)
# results["x_lengths"]: valid output lengths after encoder subsampling
# results["x_padding_mask"]: output padding mask, where padding is True
# results["mel"]: mel fbank (batch_size, mel_len, mel_dim)
# results["mel_lengths"]: valid mel lengths before encoder subsampling
# results["hidden_states"]: list of (batch_size, seq_len, encoder_dim)
# results["ffn"]: list of (batch_size, seq_len, encoder_dim)
```

* Self-attention is implemented with [SDPA](https://pytorch.org/blog/out-of-the-box-acceleration/); you may install FlashAttention to optimize inference efficiency.
* `bfloat16` is preferred for fast inference.
* Avoid `float16` to preserve numerical stability.

---

## 📖 Citation

```bibtex
@inproceedings{chang2026usad2,
  title={{USAD 2.0}: Scaling Representation Distillation for Universal Audio Understanding},
  author={Chang, Heng-Jui and Liu, Alexander H. and Bhati, Saurabhchand and Athi, Mrudula and Ratnarajah, Anton and Chhetri, Amit and Glass, James},
  booktitle={Interspeech},
  year={2026}
}
```

---

## 🙏 Acknowledgement

Our implementation is based on the awesome [facebookresearch/fairseq](https://github.com/facebookresearch/fairseq), [cwx-worst-one/EAT](https://github.com/cwx-worst-one/EAT), and [sooftware/conformer](https://github.com/sooftware/conformer) repositories.
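
---

## 📐 Estimating Output Length

As a supplement to the usage section above: the model tables list an encoder frame rate of 50Hz or 25Hz, and the usage example resamples audio to 16kHz, so the output sequence length can be approximated before running the model. The sketch below is only a back-of-the-envelope estimate. The helper `estimate_num_frames` is hypothetical and not part of the USAD 2.0 code; the true lengths are the ones returned in `x_lengths` and may differ by a few frames because of the mel frontend and subsampling. In practice, read the sample rate and frame rate from `model.sample_rate` and `model.encoder_frame_rate` rather than hard-coding them.

```python
def estimate_num_frames(num_samples: int, sample_rate: int = 16_000, frame_rate: int = 50) -> int:
    """Approximate the number of encoder output frames for a waveform.

    Hypothetical helper for illustration only: it assumes the encoder emits
    exactly `frame_rate` frames per second and ignores edge effects.
    """
    # Integer arithmetic avoids floating-point rounding surprises.
    return num_samples * frame_rate // sample_rate


# A 30-second clip at 16 kHz (30 s is the default chunk size in the README).
samples_30s = 30 * 16_000

frames_50hz = estimate_num_frames(samples_30s, frame_rate=50)  # Small to Large(+) models
frames_25hz = estimate_num_frames(samples_30s, frame_rate=25)  # XLarge(+) and XXLarge+ models

print(frames_50hz, frames_25hz)
```

Under these assumptions, a 25Hz model yields about half as many frames as a 50Hz model for the same audio, which is worth keeping in mind when budgeting the audio token count of an LLM frontend.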