	---
	license: cc-by-nc-sa-4.0
	pipeline_tag: feature-extraction
	tags:
	- automatic-speech-recognition
	- audio-classification
	- audio
	- speech
	- music
	library_name: transformers
	datasets:
	- openslr/librispeech_asr
	- facebook/multilingual_librispeech
	- mozilla-foundation/common_voice_17_0
	- speechcolab/gigaspeech
	- facebook/voxpopuli
	- espnet/mms_ulab_v2
	- google/fleurs
	- AISHELL/AISHELL-1
	- kresnik/zeroth_korean
	- ylacombe/expresso
	- agkphysics/AudioSet
	- 11hu83/vggsound
	- benjamin-paine/free-music-archive-full
	- rkstgr/mtg-jamendo
	language:
	- en
	---
	# USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding

	USAD 2.0 is a bidirectional transformer-based universal audio encoder that extracts useful representations across multiple audio domains (speech/sound/music) by distilling from SSL/supervised audio foundation models without labeled data. USAD 2.0 achieves strong or state-of-the-art performance across probing ([HEAR](https://arxiv.org/abs/2203.03022) and [MARBLE](https://arxiv.org/abs/2306.10548)) and LLM-based evaluations ([XARES-LLM](https://arxiv.org/abs/2603.22728)).

	Training data:
	* Multilingual speech (116k hours)
	* General audio and sound (21k hours)
	* Music (13k hours)


	[👀 Read Full Paper](https://arxiv.org/abs/2606.06444)

	---

	## 🗂️ Models

	### Self-supervised Teachers (WavLM, ATST, MuQ): General-purpose encoders with good probing performance

	| Model | Params | Hidden | Layers | Framerate |
	|:------|-------:|-------:|-------:|----------:|
	| [USAD 2.0 Small](https://hf.co/MIT-SLS/USAD2-Small) | 25M | 384 | 12 | 50Hz |
	| [USAD 2.0 Base](https://hf.co/MIT-SLS/USAD2-Base) | 97M | 768 | 12 | 50Hz |
	| [USAD 2.0 Large](https://hf.co/MIT-SLS/USAD2-Large) | 336M | 1024 | 24 | 50Hz |
	| [USAD 2.0 XLarge](https://hf.co/MIT-SLS/USAD2-XLarge) | 695M | 1280 | 32 | 25Hz |

	### Supervised Teachers (Whisper & Audio Flamingo 3): State-of-the-art encoders for audio LLM frontend
	We suggest selecting the best layer (listed in parentheses in the table below) with the `target_layer` argument of the forward function to optimize audio LLM performance.

	| Model | Params | Hidden | Layers (Best) | Framerate |
	|:------|-------:|-------:|--------------:|----------:|
	| [USAD 2.0 Large+](https://hf.co/MIT-SLS/USAD2-Large-Plus) | 336M | 1024 | 24 (20) | 50Hz |
	| [USAD 2.0 XLarge+](https://hf.co/MIT-SLS/USAD2-XLarge-Plus) | 695M | 1280 | 32 (28) | 25Hz |
	| [USAD 2.0 XXLarge+](https://hf.co/MIT-SLS/USAD2-XXLarge-Plus) | 1036M | 1280 | 48 (40) | 25Hz |
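
The recommended layers in the table above can be kept in a small lookup so that the matching `target_layer` is passed for each checkpoint. This is a minimal sketch: the values are copied from the "Layers (Best)" column, while `BEST_LAYER` and `recommended_target_layer` are hypothetical helper names that are not part of the released code.

```python
from typing import Optional

# Recommended layers copied from the "Layers (Best)" column above.
# BEST_LAYER and recommended_target_layer are illustrative names,
# not part of the USAD 2.0 API.
BEST_LAYER = {
    "MIT-SLS/USAD2-Large-Plus": 20,
    "MIT-SLS/USAD2-XLarge-Plus": 28,
    "MIT-SLS/USAD2-XXLarge-Plus": 40,
}


def recommended_target_layer(model_id: str) -> Optional[int]:
    """Return the suggested layer for a checkpoint, or None if none is listed."""
    return BEST_LAYER.get(model_id)
```

The returned value can be passed as `target_layer` in the forward call; `None` selects the last layer. The best layer for a particular downstream task may still differ, so treating these values as a starting point seems reasonable.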

	---

	## ⚙️ Performance
	- [HEAR](https://arxiv.org/abs/2203.03022): probing-based general audio evaluation covering speech, sound, and music
	- [MARBLE](https://arxiv.org/abs/2306.10548): probing-based music capability benchmark (instruments and singing voice)
	- [XARES-LLM](https://github.com/xiaomi-research/xares-llm): frozen audio encoder + LLM with multi-task LoRA fine-tuning
	  - Track A (classification): keyword spotting, speaker/language identification, spoof detection, intent/emotion/sound/genre/instrument classification, and sound event detection
	  - Track B (understanding): English/Mandarin ASR and audio/music captioning

	| Encoder | Params | HEAR | MARBLE | XARES-LLM-A | XARES-LLM-B |
	|:--------|-------:|-----:|-------:|------------:|------------:|
	| Single-encoder SOTA | | | | | |
	| &ensp; Base | ~90M | 80.6 | 74.0 | 0.660 | 0.418 |
	| &ensp; Large | ~300M | 81.8 | 77.0 | 0.691 | 0.454 |
	| &ensp; XLarge | ~600M | 82.6 | 75.1 | 0.782 | 0.457 |
	| USAD 2.0 | | | | | |
	| &ensp; Small | 25M | 81.0 | 72.9 | 0.604 | 0.357 |
	| &ensp; Base | 97M | 81.9 | 74.1 | 0.645 | 0.442 |
	| &ensp; Large | 336M | 82.9 | 75.8 | 0.667 | 0.473 |
	| &ensp; XLarge | 695M | 82.5 | 75.7 | 0.708 | 0.485 |
	| USAD 2.0+ | | | | | |
	| &ensp; Large+ | 336M | 84.0 | 75.1 | 0.769 | 0.580 |
	| &ensp; XLarge+ | 695M | 84.4 | 75.0 | 0.772 | 0.611 |
	| &ensp; XXLarge+ | 1036M | 84.4 | 75.6 | 0.783 | 0.624 |

	* The above evaluations are based on frozen encoders.
	* We encourage fine-tuning USAD 2.0 models for optimal downstream task performance.

	---

	## 🚀 How To Use

	Installation
	```
	pip install -U torch torchaudio transformers
	```

	Load Model and Extract Features
	```python
	import torch
	from transformers import AutoModel

	# Load pre-trained model
	model = AutoModel.from_pretrained(
	    "MIT-SLS/USAD2-XXLarge-Plus", trust_remote_code=True
	).cuda().eval()

	# Model properties
	model.sample_rate # required audio sample rate
	model.encoder_frame_rate # frames per second (Hz)
	model.mel_dim # mel feature dimension
	model.encoder_dim # hidden dimension
	model.num_layers # number of encoder layers
	model.device # device
	model.dtype # dtype

	# Model methods
	model.set_audio_chunk_size(30.0)  # audio longer than this many seconds is chunked (default: 30 s)

	# Load audio and resample to 16kHz
	wavs, wav_lengths = model.load_audio_batch(["audio1.wav", "audio2.wav"])
	# wavs: raw waveforms (batch_size, max_wav_len)
	# wav_lengths: length of each sample (batch_size, )
	# You can also load waveforms directly with torchaudio.load

	# Extract features
	with torch.no_grad():
	    results = model(
	        wavs=wavs,
	        wav_lengths=wav_lengths,
	        target_layer=None,  # None for the last layer, or an integer from 1 to model.num_layers
	    )

	# results["x"]: model final output (batch_size, seq_len, encoder_dim)
	# results["x_lengths"]: valid output lengths after encoder subsampling
	# results["x_padding_mask"]: output padding mask, where padding is True
	# results["mel"]: mel fbank (batch_size, mel_len, mel_dim)
	# results["mel_lengths"]: valid mel lengths before encoder subsampling
	# results["hidden_states"]: list of (batch_size, seq_len, encoder_dim)
	# results["ffn"]: list of (batch_size, seq_len, encoder_dim)
	```
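
As a rough guide to the output shapes listed above, the sketch below estimates `seq_len` from the number of input samples and builds a mask with the same convention as `x_padding_mask` (padding is `True`). It assumes 16 kHz input and floor rounding, so the actual encoder may differ by a frame at the boundaries. Both function names are hypothetical and not part of the model API.

```python
from typing import List


def approx_num_frames(num_samples: int, sample_rate: int = 16000, frame_rate: int = 50) -> int:
    """Estimate the encoder output length; floor rounding is an assumption."""
    return num_samples * frame_rate // sample_rate


def padding_mask(lengths: List[int]) -> List[List[bool]]:
    """Build a (batch_size, max_len) mask in which padded positions are True."""
    max_len = max(lengths)
    return [[t >= n for t in range(max_len)] for n in lengths]


# Two clips of 10 s and 4 s at 16 kHz, encoded at 50 Hz.
lengths = [approx_num_frames(160000), approx_num_frames(64000)]
mask = padding_mask(lengths)
```

In practice, `x_lengths` and `x_padding_mask` returned by the model should be preferred over estimates like these, for example when averaging features over valid frames only.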

	* The self-attention mechanism is implemented with [SDPA](https://pytorch.org/blog/out-of-the-box-acceleration/). You may install FlashAttention to improve inference efficiency.
	* `bfloat16` is preferred for fast inference.
	* Avoid `float16` to maintain numerical stability.
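
One common reason to prefer `bfloat16` over `float16` is dynamic range: `float16` overflows above 65504, whereas `bfloat16` keeps the 8-bit exponent of `float32` and gives up precision instead. The sketch below illustrates this with the Python standard library only. It is not taken from the USAD 2.0 code, the helper names are hypothetical, and `bfloat16` is emulated by truncating a `float32` (real hardware typically rounds to nearest).

```python
import struct


def fits_in_float16(value: float) -> bool:
    """Return True if value can be packed as an IEEE 754 half-precision float."""
    try:
        struct.pack("<e", value)
        return True
    except (OverflowError, struct.error):
        return False


def truncate_to_bfloat16(value: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    top_bits = struct.pack(">f", value)[:2]
    return struct.unpack(">f", top_bits + b"\x00\x00")[0]


print(fits_in_float16(65504.0))       # largest finite float16 value
print(fits_in_float16(70000.0))       # beyond the float16 range
print(truncate_to_bfloat16(70000.0))  # still finite, at reduced precision
```

Large intermediate activations that exceed the `float16` range would become infinite, while the same values stay finite in `bfloat16`.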

	---

	## 📖 Citation

	```bibtex
	@inproceedings{chang2026usad2,
	title={{USAD 2.0}: Scaling Representation Distillation for Universal Audio Understanding},
	author={Chang, Heng-Jui and Liu, Alexander H. and Bhati, Saurabhchand and Athi, Mrudula and Ratnarajah, Anton and Chhetri, Amit and Glass, James},
	booktitle={Interspeech},
	year={2026}
	}
	```

	---

	## 🙏 Acknowledgement

	Our implementation is based on the awesome [facebookresearch/fairseq](https://github.com/facebookresearch/fairseq), [cwx-worst-one/EAT](https://github.com/cwx-worst-one/EAT), and [sooftware/conformer](https://github.com/sooftware/conformer) repositories.