---
license: apache-2.0
pipeline_tag: audio-text-to-text
library_name: glap_model
---

<div align="center">
<h1>
GLAP (Generalized Language Audio Pretraining)
</h1>
<p>
Official PyTorch code for <b>GLAP</b> <br>
<b><em>Generalized Language Audio Pretraining</em></b>
</p>
<a href="https://arxiv.org/abs/2506.11350"><img src="https://img.shields.io/badge/arXiv-2506.11350-b31b1b" alt="arxiv"></a>
<a href="https://github.com/xiaomi/glap"><img src="https://img.shields.io/badge/Platform-linux-lightgrey" alt="platform"></a>
<a href="https://www.python.org"><img src="https://img.shields.io/badge/Python-3.10+-orange" alt="python"></a>
<a href="https://pytorch.org"><img src="https://img.shields.io/badge/PyTorch-2.0+-brightgreen" alt="pytorch"></a>
<a href="https://www.apache.org/licenses/LICENSE-2.0"><img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" alt="license"></a>
<img src="https://img.shields.io/pypi/dm/glap_model" alt="PyPI Downloads">
</div>

# GLAP (Generalized Language Audio Pretraining)

<img src="capabilities.png" alt="GLAP capabilities" style="height: 600px;">

## Features

* *First* all-in-one solution for general audio-text retrieval.
* Multilingual (8+ languages) speech, music, and sound retrieval.
* Music and sound retrieval performance in English matches previous baselines, while additionally supporting languages such as Japanese, German, Spanish, Chinese, Dutch, and more.

## Usage

```bash
pip install glap_model
```

### Scoring audio-text pairs

We provide a simple command-line tool (the texts are quoted so that the shell does not interpret the `;` separators):

```bash
score_glap audio_input_file "text1;text2;text3"
```

Or in Python:

```python
import torch
from glap_model import glap_inference

audio = torch.randn(1, 160000).tanh()  # 10 s of heavy noise at 16 kHz

glap_model = glap_inference()

score = glap_model.score_forward(audio, text=["the sound of noise", "a car is driving", "a person is speaking"])
print(score)
```
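
To score a real recording instead of random noise, the waveform can first be loaded with `soundfile` (which is also why `libsndfile` is listed as a dependency in the Development section). A minimal sketch, assuming a mono 16 kHz file at the hypothetical path `sample.wav`:

```python
import soundfile as sf
import torch
from glap_model import glap_inference

# Load a mono waveform as float32; the examples above use 16 kHz audio,
# so resample beforehand if your file differs.
waveform, sample_rate = sf.read("sample.wav", dtype="float32")
audio = torch.from_numpy(waveform).unsqueeze(0)  # shape: (1, num_samples)

glap_model = glap_inference()
score = glap_model.score_forward(audio, text=["a person is speaking", "music is playing"])
print(score)
```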

### Recommended Prompts

| Task   | Prompt                             |
|--------|------------------------------------|
| Speech | {label}                            |
| Music  | The music in the style of {label}. |
| Sound  | The sound of {label} can be heard. |
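
For example, a zero-shot sound classification sketch that wraps raw labels in the recommended sound prompt before scoring (the labels here are illustrative):

```python
import torch
from glap_model import glap_inference

glap_model = glap_inference()
audio = torch.randn(1, 160000).tanh()  # placeholder; use a real recording in practice

labels = ["rain", "a dog barking", "glass breaking"]
prompts = [f"The sound of {label} can be heard." for label in labels]
scores = glap_model.score_forward(audio, text=prompts)
print(scores)
```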

### Batched scoring

```python
import torch
from glap_model import glap_inference

glap_model = glap_inference()
audio = torch.randn(1, 64000).tanh()  # 4 s of noise at 16 kHz
prefix = "The sound of"
labels = [f"{prefix} {label}" for label in ("Cat", "Dog", "Water", "Noise")]
text_embeds = glap_model.encode_text(labels)
audio_embeds = glap_model.encode_audio(audio)
scores = glap_model.score(audio_embeds, text_embeds)
for label_name, score in zip(labels, scores):
    print(label_name, score)
```

## Development

### UV (Recommended)

```bash
git clone https://github.com/xiaomi-research/GLAP
cd GLAP
uv venv --python 3.10
source .venv/bin/activate
uv sync

# Additionally, libsndfile is needed:
# conda install -c conda-forge libsndfile==1.0.31
```

### Pip

```bash
git clone https://github.com/xiaomi-research/GLAP
cd GLAP
python3 -m pip install .
# Additionally, libsndfile is needed:
# conda install -c conda-forge libsndfile==1.0.31
# Or, if you have root, use your system package manager
```

### Prepare data

Data needs to be in `tar/tar.gz` format:

```
# tar -tf a.tar
908-31957-0013.flac
908-31957-0013.json
2961-960-0013.flac
2961-960-0013.json
```

Each `.json` should contain one of the three fields `caption`, `captions`, or `text`; a sketch of such a shard is shown below.
Data preparation can be done using the `wavlist_to_tar` script, which is provided by the `dasheng` dependency.
Further information on how to process data can be found [here](https://github.com/XiaoMi/dasheng?tab=readme-ov-file#3-training).
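
As an illustration only (the recommended route is the `wavlist_to_tar` script), a shard with a hypothetical caption could be assembled like this:

```python
import json
import tarfile
from pathlib import Path

# Write a .json sidecar with a "text" field ("caption"/"captions" also work).
# The caption is hypothetical; "908-31957-0013.flac" must already exist.
Path("908-31957-0013.json").write_text(json.dumps({"text": "a person is speaking"}))

with tarfile.open("a.tar", "w") as tar:
    tar.add("908-31957-0013.flac")
    tar.add("908-31957-0013.json")
```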

### Training

For reference, we provide our original training config for GLAP in `configs/train/multilingual_dasheng_asr_sound2_sigmoidloss_balanced.yaml`:

```bash
accelerate launch --mixed-precision='fp16' run.py train configs/train/multilingual_dasheng_asr_sound2_sigmoidloss_balanced.yaml
```

### Zeroshot eval (one sample)

```bash
# ";" separates the candidate texts
python3 run.py zeroshot pretrained_checkpoint/glap_checkpoint.pt PATH_TO_WAV_FLAC_MP3_SAMPLE.wav "The sound of a horse;Car;Mama;The sound of music;somebody is speaking;The sound of ein Pferd;一只马;Music is played;音乐的声音;Musik ist zu hoeren;Zero;One;Two;Three"
```

### Retrieval scoring

```bash
# Should be run on a single GPU
accelerate launch --mixed-precision='fp16' run.py evaluate PATH_TO_CHECKPOINT
```

### Notes on DDP

Using unevenly sized training datasets without `resample=True` is not recommended, since ranks can otherwise exhaust their data at different times and stall training.
|
| |
|
| | ## Translating data into a target language |
| |
|
| | For our experiments we used SONAR to translate audio captions into seven target languages. This can be reproduced using our code: |
| |
|
| |
|
| | ```bash |
| | python3 run.py translate_sonar data/WavCaps/freesound/freesound_train_sample_0000* --output_path data/translations/WavCaps/freesound/ |
| | ``` |

DDP is also supported:

```bash
accelerate launch run.py translate_sonar data/WavCaps/freesound/freesound_train_sample_0000* --output_path data/translations/WavCaps/freesound/
```

## Citation

```bibtex
@misc{2506.11350,
  author = {Heinrich Dinkel and Zhiyong Yan and Tianzi Wang and Yongqing Wang and Xingwei Sun and Yadong Niu and Jizhong Liu and Gang Li and Junbo Zhang and Jian Luan},
  title  = {GLAP: General contrastive audio-text pretraining across domains and languages},
  year   = {2025},
  eprint = {arXiv:2506.11350},
}
```