---
license: cc-by-nc-sa-4.0
pipeline_tag: feature-extraction
tags:
- automatic-speech-recognition
- audio-classification
- audio
- speech
- music
library_name: transformers
datasets:
- openslr/librispeech_asr
- facebook/multilingual_librispeech
- mozilla-foundation/common_voice_17_0
- speechcolab/gigaspeech
- facebook/voxpopuli
- agkphysics/AudioSet
language:
- en
---

# USAD: Universal Speech and Audio Representation via Distillation

**Universal Speech and Audio Distillation (USAD)** is a unified **speech**, **sound**, and **music** encoder distilled from domain-specific teachers. Trained on 126k hours of mixed data, USAD delivers competitive performance across diverse benchmarks (SUPERB, HEAR, and AudioSet) with a single model.

[**Read Full Paper**](https://arxiv.org/abs/2506.18843)

---

## Models

All USAD models are Transformer encoders operating at a **50 Hz frame rate** (see the frame-length sketch below the table). The teacher models are **WavLM Base+** and **ATST Frame**.

| Model      | Parameters | Dim  | Layers | Checkpoint                                        |
| ---------- | ---------- | ---- | ------ | ------------------------------------------------- |
| USAD Small | 24M        | 384  | 12     | [link](https://huggingface.co/MIT-SLS/USAD-Small) |
| USAD Base  | 94M        | 768  | 12     | [link](https://huggingface.co/MIT-SLS/USAD-Base)  |
| USAD Large | 330M       | 1024 | 24     | [link](https://huggingface.co/MIT-SLS/USAD-Large) |
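
Since the encoders run at 50 Hz on 16 kHz input, each output frame covers 16000 / 50 = 320 samples. The sketch below is a rough sanity check for expected sequence lengths (the helper is illustrative; exact lengths may differ slightly with padding):

```python
# 16 kHz input at a 50 Hz frame rate -> 16000 / 50 = 320 samples per frame.
SAMPLE_RATE = 16_000
FRAME_RATE = 50
SAMPLES_PER_FRAME = SAMPLE_RATE // FRAME_RATE  # 320

def approx_num_frames(num_samples: int) -> int:
    """Approximate encoder output length (illustrative; ignores padding details)."""
    return num_samples // SAMPLES_PER_FRAME

print(approx_num_frames(10 * SAMPLE_RATE))  # 10 s of audio -> ~500 frames
```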

---

## How To Use

**Installation**
```bash
pip install -U transformers
```

**Load Model and Extract Features**
```python
import torch
from transformers import AutoModel

# Load pre-trained model
model = AutoModel.from_pretrained("MIT-SLS/USAD-Large", trust_remote_code=True).cuda().eval()

# Load audio and resample to 16kHz
wav = model.load_audio("path/to/audio").unsqueeze(0)  # (batch_size, wav_len)
# wav is a float tensor on the same device as the model
# You can also load waveforms directly with torchaudio.load

# Extract features
with torch.no_grad():
    results = model(wav)

# results["x"]: model final output (batch_size, seq_len, encoder_dim)
# results["mel"]: mel fbank (batch_size, seq_len * 2, mel_dim)
# results["hidden_states"]: list of (batch_size, seq_len, encoder_dim)
# results["ffn"]: list of (batch_size, seq_len, encoder_dim)
```
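
As the comment in the block above mentions, waveforms can also be loaded with `torchaudio` instead of `model.load_audio`. A minimal sketch of that route, continuing from the block above and assuming the model expects 16 kHz mono input and that you move the tensor to the model's device yourself:

```python
import torch
import torchaudio

# torchaudio.load returns (num_channels, wav_len) plus the file's sample rate
wav, sr = torchaudio.load("path/to/audio")
wav = wav.mean(dim=0, keepdim=True)  # downmix to mono: (1, wav_len)
if sr != 16000:
    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=16000)

with torch.no_grad():
    results = model(wav.cuda())  # move to the model's device (cuda in the example above)
```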

See [usad_model.py](https://huggingface.co/MIT-SLS/USAD-Large/blob/main/usad_model.py) for more details about the model.
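
For downstream probing, a common recipe (e.g., in SUPERB-style evaluation) is a learnable weighted sum over the per-layer `hidden_states`. A minimal sketch; the `LayerWeightedSum` module below is illustrative and not part of this repository:

```python
import torch
import torch.nn as nn

class LayerWeightedSum(nn.Module):
    """Softmax-weighted sum over per-layer features (illustrative helper)."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: list of (batch_size, seq_len, encoder_dim)
        stacked = torch.stack(hidden_states, dim=0)  # (num_layers, batch, seq, dim)
        norm = torch.softmax(self.weights, dim=0)
        return (norm.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # (batch, seq, dim)

# Continuing from the extraction example above
pool = LayerWeightedSum(len(results["hidden_states"])).to(results["hidden_states"][0].device)
features = pool(results["hidden_states"])  # (batch_size, seq_len, encoder_dim)
```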

---

## Citation

```bibtex
@article{chang2025usad,
  title={{USAD}: Universal Speech and Audio Representation via Distillation},
  author={Chang, Heng-Jui and Bhati, Saurabhchand and Glass, James and Liu, Alexander H.},
  journal={arXiv preprint arXiv:2506.18843},
  year={2025}
}
```

---

## Acknowledgement

Our implementation is based on the awesome [facebookresearch/fairseq](https://github.com/facebookresearch/fairseq), [cwx-worst-one/EAT](https://github.com/cwx-worst-one/EAT), and [sooftware/conformer](https://github.com/sooftware/conformer) repositories.