---
license: mit
tags:
- text-to-audio
- controlnet
pipeline_tag: text-to-audio
library_name: diffusers
---

<img src="https://github.com/haidog-yaqub/EzAudio/blob/main/arts/ezaudio.png?raw=true">

# EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

[EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer](https://huggingface.co/papers/2409.10819)

**Abstract:** We introduce EzAudio, a text-to-audio (T2A) generation framework designed to produce high-quality, natural-sounding sound effects. Core designs include: (1) We propose EzAudio-DiT, an optimized Diffusion Transformer (DiT) designed for audio latent representations, improving convergence speed as well as parameter and memory efficiency. (2) We apply a classifier-free guidance (CFG) rescaling technique to mitigate fidelity loss at higher CFG scores and enhance prompt adherence without compromising audio quality. (3) We propose a synthetic caption generation strategy leveraging recent advances in audio understanding and LLMs to enhance T2A pretraining. We show that EzAudio, with its computationally efficient architecture and fast convergence, is a competitive open-source model that excels in both objective and subjective evaluations by delivering highly realistic listening experiences. Code, data, and pre-trained models are released at: [https://github.com/haidog-yaqub/EzAudio](https://github.com/haidog-yaqub/EzAudio).

[Project Page](https://haidog-yaqub.github.io/EzAudio-Page/) · [arXiv](https://arxiv.org/abs/2409.10819) · [Hugging Face Demo](https://huggingface.co/spaces/OpenSound/EzAudio)

EzAudio is a diffusion-based text-to-audio generation model. Designed for real-world audio applications, it combines high-quality audio synthesis with low computational demands.

Play with EzAudio for text-to-audio generation, editing, and inpainting: [EzAudio Space](https://huggingface.co/spaces/OpenSound/EzAudio)

EzAudio-ControlNet is available: [EzAudio-ControlNet Space](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet)

<!-- We thank Hugging Face Spaces and Gradio for providing an incredible demo platform. -->

## Installation

Clone the repository:
```bash
git clone git@github.com:haidog-yaqub/EzAudio.git
```

Install the dependencies:
```bash
cd EzAudio
pip install -r requirements.txt
```

Download checkpoints (optional): [https://huggingface.co/OpenSound/EzAudio](https://huggingface.co/OpenSound/EzAudio/tree/main)

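Since the API supports automatic checkpoint downloading (see the Todo list below), this step can usually be skipped. If you do want the files locally, a minimal sketch using the standard `huggingface_hub` client (the `./ckpts` target directory is an arbitrary choice, not mandated by the repo):

```python
from huggingface_hub import snapshot_download

# Fetch all EzAudio checkpoint files from the Hugging Face Hub
# into a local directory of your choosing.
local_dir = snapshot_download(repo_id="OpenSound/EzAudio", local_dir="./ckpts")
print(f"Checkpoints downloaded to: {local_dir}")
```
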
## Usage

You can use the model with the following code:

```python
from api.ezaudio import EzAudio
import torch
import soundfile as sf

# load model (uses GPU if available)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ezaudio = EzAudio(model_name='s3_xl', device=device)

# text-to-audio generation
prompt = "a dog barking in the distance"
sr, audio = ezaudio.generate_audio(prompt)
sf.write(f'{prompt}.wav', audio, sr)

# audio inpainting: regenerate a masked span of the reference audio
# so that it matches the prompt
prompt = "A train passes by, blowing its horns"
original_audio = 'ref.wav'
sr, audio = ezaudio.editing_audio(prompt, boundary=2, gt_file=original_audio,
                                  mask_start=1, mask_length=5)
sf.write(f'{prompt}_edit.wav', audio, sr)
```

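The abstract describes a classifier-free guidance (CFG) rescaling technique for mitigating fidelity loss at high CFG scores. The snippet below is our minimal sketch of that general idea, not EzAudio's actual implementation; the function name `rescaled_cfg` and the blend factor `phi` are illustrative assumptions:

```python
import torch

def rescaled_cfg(cond, uncond, guidance_scale, phi=0.7):
    # standard classifier-free guidance
    guided = uncond + guidance_scale * (cond - uncond)
    # rescale so the guided prediction's standard deviation matches the
    # conditional branch, counteracting over-amplification at high scales
    dims = list(range(1, cond.ndim))
    std_cond = cond.std(dim=dims, keepdim=True)
    std_guided = guided.std(dim=dims, keepdim=True)
    rescaled = guided * (std_cond / (std_guided + 1e-8))
    # blend the rescaled result with the plain CFG output
    return phi * rescaled + (1 - phi) * guided
```

With `phi = 0` this reduces to plain CFG; larger values apply more rescaling.
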
## Training

#### Autoencoder

Refer to the VAE training section in our work [SoloAudio](https://github.com/WangHelin1997/SoloAudio).

#### T2A Diffusion Model

Prepare your data (see the example in `src/dataset/meta_example.csv`), then run:

```bash
cd src
accelerate launch train.py
```

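The authoritative schema for the metadata file is `src/dataset/meta_example.csv` in the repository. Purely as an illustration of the kind of file to prepare, here is a sketch that writes a metadata CSV pairing audio paths with captions; the column names `audio_path` and `caption` are hypothetical, not the repo's actual schema:

```python
import csv

# Hypothetical metadata layout for illustration only -- consult
# src/dataset/meta_example.csv for the actual column schema.
rows = [
    {"audio_path": "data/dog_bark.wav", "caption": "a dog barking in the distance"},
    {"audio_path": "data/train_horn.wav", "caption": "a train passes by, blowing its horns"},
]
with open("meta.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["audio_path", "caption"])
    writer.writeheader()
    writer.writerows(rows)
```
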
## Todo
- [x] Release Gradio demo along with checkpoints: [EzAudio Space](https://huggingface.co/spaces/OpenSound/EzAudio)
- [x] Release ControlNet demo along with checkpoints: [EzAudio ControlNet Space](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet)
- [x] Release inference code
- [x] Release training pipeline and dataset
- [x] Improve API and support automatic checkpoint downloading
- [ ] Release checkpoints for stage 1 and stage 2 [WIP]

## Reference

If you find the code useful for your research, please consider citing:

```bibtex
@article{hai2024ezaudio,
  title={EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer},
  author={Hai, Jiarui and Xu, Yong and Zhang, Hao and Li, Chenxing and Wang, Helin and Elhilali, Mounya and Yu, Dong},
  journal={arXiv preprint arXiv:2409.10819},
  year={2024}
}
```

## Acknowledgement

Some code is borrowed from or inspired by: [U-ViT](https://github.com/baofff/U-ViT), [PixArt-α](https://github.com/PixArt-alpha/PixArt-alpha), [Hunyuan-DiT](https://github.com/Tencent/HunyuanDiT), and [Stable Audio](https://github.com/Stability-AI/stable-audio-tools).