| datasets: | |
| - DaveLoay/NSynth_Bass_Captions | |
| language: | |
| - en | |
| # Riffusion Fine-Tune | |
| This is a fine-tuned version of **Riffusion**, trained on **bass** samples extracted from the **NSynth** dataset. | |
| The purpose of this work is to evaluate how well the model can generate bass audio samples. | |
| ## Notes | |
| * This is the approach I found to achieve this goal. If you have a better idea, please share it with me. | |
| ## Quickstart Guide | |
| Clone the **Riffusion** repository and install the dependencies listed in its requirements.txt file: [Riffusion GitHub](https://github.com/riffusion/riffusion). Then generate a spectrogram image with **Diffusers**: | |
| ```python | |
| import torch | |
| from diffusers import DiffusionPipeline | |
| device = "cuda"  # assumes a CUDA-capable GPU is available | |
| pipe = DiffusionPipeline.from_pretrained("DaveLoay/Riffusion_FT_Bass_512_4000", torch_dtype=torch.float16).to(device) | |
| prompt = "Your desired prompt" | |
| image = pipe(prompt).images[0] | |
| ``` | |
| This generates a spectrogram, stored in `image`. To convert this image into audio, you can use the **spectrogram_image_converter** module contained in the **Riffusion** repo. | |
| ```python | |
| from riffusion.spectrogram_image_converter import SpectrogramImageConverter | |
| from riffusion.spectrogram_params import SpectrogramParams | |
| params = SpectrogramParams() | |
| converter = SpectrogramImageConverter(params) | |
| audio = converter.audio_from_spectrogram_image(image) | |
| ``` | |
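To keep the results, both objects can be written to disk. The helper below is only a minimal sketch: `save_outputs` is a hypothetical name, and it assumes that `image` exposes `.save(path)` like a PIL image and that `audio` exposes `.export(path, format=...)` like a `pydub.AudioSegment`. Check what your installed Riffusion version actually returns before relying on it.

```python
from pathlib import Path


def save_outputs(image, audio, stem="bass_sample", out_dir="outputs"):
    """Write the spectrogram as PNG and the audio as WAV; return both paths.

    Assumption: image.save(path) and audio.export(path, format=...) exist,
    as they do on PIL.Image.Image and pydub.AudioSegment respectively.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    png_path = out / f"{stem}.png"
    wav_path = out / f"{stem}.wav"
    image.save(str(png_path))                  # spectrogram image
    audio.export(str(wav_path), format="wav")  # reconstructed audio
    return png_path, wav_path
```

With the objects from the snippets above, `save_outputs(image, audio)` would write `outputs/bass_sample.png` and `outputs/bass_sample.wav`.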
| ## Fine Tuning | |
| For the Fine-Tuning process, I used the bass samples from the test split in the NSynth dataset, which you can check out here: [NSynth Dataset](https://magenta.tensorflow.org/nsynth) | |
| You can find the pre-processed files in my repo, here: [DaveLoay/NSynth_Bass_Captions](DaveLoay/NSynth_Bass_Captions) | |
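If you want to prepare a similar dataset from your own spectrogram images, one layout that the `imagefolder` loader of the `datasets` library understands is a folder of images plus a `metadata.jsonl` file, with a `file_name` field and a caption field per line. The sketch below writes such a file using `text` as the caption field, which matches the default caption column of the training script. `write_metadata`, the file names, and the captions are made-up placeholders, not taken from the actual dataset.

```python
import json
from pathlib import Path


def write_metadata(folder, captions):
    """Write metadata.jsonl mapping each image file name to its caption."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "metadata.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for file_name, text in captions.items():
            # One JSON object per line: the image file and its caption.
            f.write(json.dumps({"file_name": file_name, "text": text}) + "\n")
    return path


# Hypothetical placeholders; replace with your own spectrogram files and captions.
example_captions = {
    "bass_0001.png": "electronic bass, low pitch",
    "bass_0002.png": "acoustic bass, dark tone",
}
```

Calling `write_metadata("my_bass_spectrograms/train", example_captions)` would create the metadata file next to the images in that folder.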
| As mentioned in the official **Riffusion** HF repo, I used the **train_text_to_image** script contained in the **Diffusers** repo, which you can check out here: [Diffusers Repo](https://github.com/huggingface/diffusers/tree/main/examples/text_to_image) | |
| After configuring all dependencies, I used the following command to train the model: | |
| ```bash | |
| accelerate launch --mixed_precision="fp16" train_text_to_image.py \ | |
| --pretrained_model_name_or_path=riffusion/riffusion-model-v1 \ | |
| --dataset_name=DaveLoay/NSynth_Bass_Captions \ | |
| --resolution=512 \ | |
| --use_ema \ | |
| --train_batch_size=3 \ | |
| --gradient_accumulation_steps=4 \ | |
| --gradient_checkpointing \ | |
| --max_train_steps=4000 \ | |
| --learning_rate=1e-05 \ | |
| --max_grad_norm=1 \ | |
| --lr_scheduler="constant" --lr_warmup_steps=0 \ | |
| --output_dir="Riffusion_FT_Bass_512_4000" \ | |
| --push_to_hub | |
| ``` | |
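To read these flags together: with gradient accumulation, each optimizer step combines several forward and backward passes, so the effective batch size is larger than `train_batch_size`. A small worked example follows. It assumes a single GPU and that `max_train_steps` counts optimizer updates.

```python
# Values taken from the training command above.
train_batch_size = 3
gradient_accumulation_steps = 4
max_train_steps = 4000
num_gpus = 1  # assumption: training ran on a single GPU

# Samples contributing to each optimizer update.
effective_batch_size = train_batch_size * gradient_accumulation_steps * num_gpus

# Total training samples processed over the whole run.
samples_seen = effective_batch_size * max_train_steps

print(effective_batch_size)  # 12
print(samples_seen)          # 48000
```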
| ## Hardware | |
| The hardware I used to fine-tune this model is: | |
| * NVIDIA A100 with 40 GB of VRAM, hosted in Google Colab Pro | |
| Training took about 3 hours and used about 26 GB of VRAM. | |
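As a rough sanity check for planning your own runs, the figures above translate into an approximate time per optimizer step. This is only back-of-the-envelope arithmetic from the rounded "about 3 hours" figure, not a measured benchmark.

```python
# Rounded figures from this model card.
training_hours = 3        # approximate wall-clock time
max_train_steps = 4000    # from the training command

seconds_per_step = training_hours * 3600 / max_train_steps
print(f"{seconds_per_step:.1f} s per optimizer step")  # 2.7 s per optimizer step
```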
| ## Credits | |
| You can check the original repositories here: | |
| [Riffusion](https://www.riffusion.com/) | |
| [NSynth Dataset](https://magenta.tensorflow.org/nsynth) | |
| [Diffusers](https://huggingface.co/docs/diffusers/index) | |