---
license: mit
tags:
- text-to-audio
- controlnet
pipeline_tag: text-to-audio
library_name: diffusers
---

<img src="https://github.com/haidog-yaqub/EzAudio/blob/main/arts/ezaudio.png?raw=true">

# EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

[EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer](https://huggingface.co/papers/2409.10819)

**Abstract:** We introduce EzAudio, a text-to-audio (T2A) generation framework designed to produce high-quality, natural-sounding sound effects. Core designs include: (1) We propose EzAudio-DiT, an optimized Diffusion Transformer (DiT) designed for audio latent representations, improving convergence speed as well as parameter and memory efficiency. (2) We apply a classifier-free guidance (CFG) rescaling technique to mitigate fidelity loss at higher CFG scores and enhance prompt adherence without compromising audio quality. (3) We propose a synthetic caption generation strategy leveraging recent advances in audio understanding and LLMs to enhance T2A pretraining. We show that EzAudio, with its computationally efficient architecture and fast convergence, is a competitive open-source model that excels in both objective and subjective evaluations by delivering highly realistic listening experiences. Code, data, and pre-trained models are released at: [https://github.com/haidog-yaqub/EzAudio](https://github.com/haidog-yaqub/EzAudio).

[Project Page](https://haidog-yaqub.github.io/EzAudio-Page/) · [arXiv](https://arxiv.org/abs/2409.10819) · [Hugging Face Demo](https://huggingface.co/spaces/OpenSound/EzAudio)

EzAudio is a diffusion-based text-to-audio generation model. Designed for real-world audio applications, it combines high-quality audio synthesis with low computational demands.

Play with EzAudio for text-to-audio generation, editing, and inpainting: [EzAudio Space](https://huggingface.co/spaces/OpenSound/EzAudio)

EzAudio-ControlNet is available: [EzAudio-ControlNet Space](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet)

<!-- We thank Hugging Face Spaces and Gradio for providing an incredible demo platform. -->

## Installation

Clone the repository:
```bash
git clone git@github.com:haidog-yaqub/EzAudio.git
```

Install the dependencies:
```bash
cd EzAudio
pip install -r requirements.txt
```

Download checkpoints (optional): [https://huggingface.co/OpenSound/EzAudio](https://huggingface.co/OpenSound/EzAudio/tree/main)

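Since the API supports automatic checkpoint downloading (see the Todo list below), this step can usually be skipped. If you do want the files locally, a minimal sketch using the standard `huggingface_hub` client (the `./ckpts` target directory is an arbitrary choice, not mandated by the repo):

```python
from huggingface_hub import snapshot_download

# Fetch all EzAudio checkpoint files from the Hugging Face Hub
# into a local directory of your choosing.
local_dir = snapshot_download(repo_id="OpenSound/EzAudio", local_dir="./ckpts")
print(f"Checkpoints downloaded to: {local_dir}")
```
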
## Usage

You can use the model with the following code:

```python
from api.ezaudio import EzAudio
import torch
import soundfile as sf

# load model (uses GPU if available)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ezaudio = EzAudio(model_name='s3_xl', device=device)

# text-to-audio generation
prompt = "a dog barking in the distance"
sr, audio = ezaudio.generate_audio(prompt)
sf.write(f'{prompt}.wav', audio, sr)

# audio inpainting: regenerate a masked span of the reference audio
# so that it matches the prompt
prompt = "A train passes by, blowing its horns"
original_audio = 'ref.wav'
sr, audio = ezaudio.editing_audio(prompt, boundary=2, gt_file=original_audio,
                                  mask_start=1, mask_length=5)
sf.write(f'{prompt}_edit.wav', audio, sr)
```

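The abstract describes a classifier-free guidance (CFG) rescaling technique for mitigating fidelity loss at high CFG scores. The snippet below is our minimal sketch of that general idea, not EzAudio's actual implementation; the function name `rescaled_cfg` and the blend factor `phi` are illustrative assumptions:

```python
import torch

def rescaled_cfg(cond, uncond, guidance_scale, phi=0.7):
    # standard classifier-free guidance
    guided = uncond + guidance_scale * (cond - uncond)
    # rescale so the guided prediction's standard deviation matches the
    # conditional branch, counteracting over-amplification at high scales
    dims = list(range(1, cond.ndim))
    std_cond = cond.std(dim=dims, keepdim=True)
    std_guided = guided.std(dim=dims, keepdim=True)
    rescaled = guided * (std_cond / (std_guided + 1e-8))
    # blend the rescaled result with the plain CFG output
    return phi * rescaled + (1 - phi) * guided
```

With `phi = 0` this reduces to plain CFG; larger values apply more rescaling.
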
## Training

#### Autoencoder

Refer to the VAE training section in our work [SoloAudio](https://github.com/WangHelin1997/SoloAudio).

#### T2A Diffusion Model

Prepare your data (see the example in `src/dataset/meta_example.csv`), then run:

```bash
cd src
accelerate launch train.py
```

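The authoritative schema for the metadata file is `src/dataset/meta_example.csv` in the repository. Purely as an illustration of the kind of file to prepare, here is a sketch that writes a metadata CSV pairing audio paths with captions; the column names `audio_path` and `caption` are hypothetical, not the repo's actual schema:

```python
import csv

# Hypothetical metadata layout for illustration only -- consult
# src/dataset/meta_example.csv for the actual column schema.
rows = [
    {"audio_path": "data/dog_bark.wav", "caption": "a dog barking in the distance"},
    {"audio_path": "data/train_horn.wav", "caption": "a train passes by, blowing its horns"},
]
with open("meta.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["audio_path", "caption"])
    writer.writeheader()
    writer.writerows(rows)
```
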
## Todo
- [x] Release Gradio demo along with checkpoints: [EzAudio Space](https://huggingface.co/spaces/OpenSound/EzAudio)
- [x] Release ControlNet demo along with checkpoints: [EzAudio ControlNet Space](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet)
- [x] Release inference code
- [x] Release training pipeline and dataset
- [x] Improve API and support automatic checkpoint downloading
- [ ] Release checkpoints for stage 1 and stage 2 [WIP]

## Reference

If you find the code useful for your research, please consider citing:

```bibtex
@article{hai2024ezaudio,
  title={EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer},
  author={Hai, Jiarui and Xu, Yong and Zhang, Hao and Li, Chenxing and Wang, Helin and Elhilali, Mounya and Yu, Dong},
  journal={arXiv preprint arXiv:2409.10819},
  year={2024}
}
```

## Acknowledgement

Some code is borrowed from or inspired by: [U-ViT](https://github.com/baofff/U-ViT), [PixArt-α](https://github.com/PixArt-alpha/PixArt-alpha), [Hunyuan-DiT](https://github.com/Tencent/HunyuanDiT), and [Stable Audio](https://github.com/Stability-AI/stable-audio-tools).