# MIDI Generation Pipeline: Text-to-Music

Complete pipelines for training and inference of **text-conditioned MIDI generation** using both GPT2-style and Qwen3-based autoregressive models.

## Two Architectures

### 1. GPT2-Style (`train_midi_gpt.py`)
- **From-scratch** GPT2 model with custom vocabulary
- ~50M parameters (configurable)
- Fast training, good for experimentation

### 2. Qwen3-0.6B (`train_midi_qwen3.py`) ⭐ Recommended
- **Pretrained LLM** with vocabulary expansion (inspired by MIDI-LLM)
- 751M parameters with rich text understanding
- **Tied embeddings** automatically handled (see the sketch below)
- Apache-2.0 license
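
The expansion step looks roughly like this (a minimal sketch assuming the standard `transformers` API; the actual logic lives in `train_midi_qwen3.py`). With tied input/output embeddings, `resize_token_embeddings` keeps the LM head consistent automatically:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the pretrained base model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", trust_remote_code=True)

# Add 3 special tokens plus 516 REMI vocabulary tokens
midi_tokens = [f"<|midi_{i}|>" for i in range(516)]
tokenizer.add_tokens(["<|midi_start|>", "<|midi_end|>", "<|midi_pad|>"] + midi_tokens)

# Resize the embedding matrix to match; the tied output head follows automatically
model.resize_token_embeddings(len(tokenizer))
```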

## Files

| File | Purpose |
|------|---------|
| `prepare_dataset.py` | Preprocess for GPT2 pipeline |
| `prepare_dataset_qwen3.py` | Preprocess for Qwen3 pipeline (rich prompts) |
| `train_midi_gpt.py` | Train GPT2-style model |
| `train_midi_qwen3.py` | Fine-tune Qwen3-0.6B with MIDI vocab expansion |
| `inference_midi_gpt.py` | Generate MIDI with GPT2 model |
| `inference_midi_qwen3.py` | Generate MIDI with Qwen3 model (for local checkpoints) |
| `inference_trained_model.py` | **Generate MIDI with trained model from HF Hub** |
| `create_synthetic_dataset.py` | Generate synthetic test data |
| `test_end_to_end.py` | Validate GPT2 pipeline |
| `test_qwen3_e2e.py` | Validate Qwen3 pipeline |
| `run_qwen3_training.py` | One-command GPU training script |

## Trained Model Available

**[rahuldshetty/midi-qwen3-v1](https://huggingface.co/rahuldshetty/midi-qwen3-v1)**

A trained Qwen3-0.6B model with expanded MIDI vocabulary (152,188 tokens total).

### Quick Inference with Trained Model

```bash
# Install dependencies
pip install transformers torch datasets miditok miditoolkit accelerate

# Generate MIDI from a text prompt (simplest way)
python inference_trained_model.py \
    --model_id rahuldshetty/midi-qwen3-v1 \
    --prompt "A dark electronic piece with synth strings in D minor, 140 BPM" \
    --output_path my_song.mid \
    --max_midi_tokens 1024 \
    --temperature 1.0 \
    --top_k 50 \
    --top_p 0.92

# Use a random prompt from the dataset
python inference_trained_model.py \
    --model_id rahuldshetty/midi-qwen3-v1 \
    --dataset_prompt \
    --num_samples 3 \
    --output_path generated.mid

# Generate multiple variations
python inference_trained_model.py \
    --model_id rahuldshetty/midi-qwen3-v1 \
    --prompt "A cheerful piano piece in C major, 120 BPM" \
    --num_samples 5 \
    --temperature 0.8 \
    --output_path song.mid
```

### Python API for Inference

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import snapshot_download
from miditok import REMI
from pathlib import Path
import json, tempfile, shutil

# 1. Download model
model_id = "rahuldshetty/midi-qwen3-v1"
temp_dir = Path(tempfile.mkdtemp())
snapshot_download(repo_id=model_id, local_dir=str(temp_dir))

# 2. Load and expand tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B", trust_remote_code=True)
with open(temp_dir / "config.json") as f:
    config = json.load(f)
expanded_vocab = config["vocab_size"]       # 152188
n_added = expanded_vocab - len(tokenizer)   # 519 tokens added
midi_vocab_size = n_added - 3               # 516 MIDI tokens

# Add tokens
midi_tokens = [f"<|midi_{i}|>" for i in range(midi_vocab_size)]
tokenizer.add_tokens(["<|midi_start|>", "<|midi_end|>", "<|midi_pad|>"] + midi_tokens)

# 3. Load model
model = AutoModelForCausalLM.from_pretrained(
    str(temp_dir),
    trust_remote_code=True,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto" if torch.cuda.is_available() else None,
)
model.eval()

# 4. Build prompt and tokenize
prompt = "A cheerful jazz piece with piano and saxophone in C major, 120 BPM"
text_ids = tokenizer.encode(prompt, add_special_tokens=False)

# Special tokens
bos_midi = tokenizer.convert_tokens_to_ids("<|midi_start|>")
eos_midi = tokenizer.convert_tokens_to_ids("<|midi_end|>")
pad_midi = tokenizer.convert_tokens_to_ids("<|midi_pad|>")
midi_offset = tokenizer.convert_tokens_to_ids("<|midi_0|>")  # first MIDI token ID (base vocab + 3 specials)

# 5. Generate MIDI tokens
input_tensor = torch.tensor([text_ids + [bos_midi]], dtype=torch.long, device=model.device)
with torch.no_grad():
    output = model.generate(
        input_tensor,
        max_new_tokens=512,
        do_sample=True,
        temperature=1.0,
        top_k=50,
        top_p=0.92,
        pad_token_id=pad_midi,
        eos_token_id=eos_midi,
    )

# 6. Extract and decode MIDI tokens
generated = output[0].tolist()
bos_idx = generated.index(bos_midi)
midi_raw = generated[bos_idx + 1:]
midi_raw = [t for t in midi_raw if t not in (eos_midi, pad_midi)]
midi_ids = [t - midi_offset for t in midi_raw if t >= midi_offset]
midi_ids = [t for t in midi_ids if 0 <= t < midi_vocab_size]

# 7. Decode to MIDI file
midi_tokenizer = REMI(params=str(temp_dir / "midi_tokenizer_init"))
midi = midi_tokenizer.decode(midi_ids)
midi.dump_midi("output.mid")  # miditok v3 returns a symusic Score; on miditok v2 use midi.dump(...)
print(f"Generated {len(midi_ids)} MIDI tokens → output.mid")

# Cleanup
shutil.rmtree(temp_dir, ignore_errors=True)
```

## Dataset

Processed dataset: [rahuldshetty/midi-generation-dataset](https://huggingface.co/datasets/rahuldshetty/midi-generation-dataset)

Source: [B-K/midi-dataset-2](https://huggingface.co/datasets/B-K/midi-dataset-2) (MidiCaps with MIDI bytes)
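
To peek at the processed data before training, a quick inspection works (a sketch; the exact column names come from the preprocessing script, so we only print the schema here):

```python
from datasets import load_dataset

# Stream the processed dataset to inspect its schema without a full download
ds = load_dataset("rahuldshetty/midi-generation-dataset", split="train", streaming=True)
first = next(iter(ds))
print(first.keys())  # column names produced by prepare_dataset_qwen3.py
```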

## Quick Start: Train Your Own

### 1. Install Dependencies

```bash
pip install transformers torch datasets miditok miditoolkit accelerate
```

### 2. Prepare Dataset

```bash
python prepare_dataset_qwen3.py \
    --dataset B-K/midi-dataset-2 \
    --output_dir ./midi_data_qwen3 \
    --max_seq_len 2048
```

### 3. Train

```bash
python train_midi_qwen3.py \
    --dataset rahuldshetty/midi-generation-dataset \
    --output_dir ./midi_qwen3_model \
    --num_epochs 20 \
    --batch_size 2 \
    --gradient_accumulation_steps 8 \
    --bf16 \
    --gradient_checkpointing \
    --push_to_hub \
    --hub_model_id yourname/midi-qwen3-v1
```

Or use the one-command script:

```bash
python run_qwen3_training.py
```

### 4. Generate MIDI

```bash
python inference_midi_qwen3.py \
    --model_dir ./midi_qwen3_model/final \
    --prompt "A cheerful jazz piece with piano and saxophone in C major, 120 BPM" \
    --output_path output.mid \
    --max_midi_tokens 1024
```

## Qwen3 Architecture Details

### Vocabulary Expansion
- Qwen3 base tokenizer: **151,669** entries (the model's embedding matrix is padded to 151,936)
- MIDI special tokens: `<|midi_start|>`, `<|midi_end|>`, `<|midi_pad|>` (3 tokens)
- MIDI vocab tokens: `<|midi_0|>` ... `<|midi_515|>` (516 REMI tokens)
- **Total vocab: 152,188** (151,669 + 3 + 516), matching the trained model's `config.json`; the resulting token-ID layout is sketched below
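
The ID layout implied by those numbers (IDs are assigned in the order the tokens are added, specials first):

```python
BASE        = 151_669           # len(tokenizer) before expansion
BOS_MIDI    = BASE              # <|midi_start|>
EOS_MIDI    = BASE + 1          # <|midi_end|>
PAD_MIDI    = BASE + 2          # <|midi_pad|>
MIDI_OFFSET = BASE + 3          # <|midi_0|> starts here
TOTAL       = MIDI_OFFSET + 516
assert TOTAL == 152_188         # matches vocab_size in config.json
```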

### Training Labels
- Text prefix → `-100` (masked out of the loss)
- MIDI tokens + special tokens → their actual IDs
- The model learns only music generation, not prompt reproduction (see the sketch below)
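
A minimal sketch of this label construction (assumed to mirror what `prepare_dataset_qwen3.py` produces; names are illustrative):

```python
# One training example: the text prompt conditions generation, only MIDI is predicted.
# text_ids / midi_ids are token-ID lists; bos_midi / eos_midi are special-token IDs.
def build_example(text_ids: list[int], midi_ids: list[int],
                  bos_midi: int, eos_midi: int) -> dict:
    input_ids = text_ids + [bos_midi] + midi_ids + [eos_midi]
    # -100 is ignored by the cross-entropy loss, so the text prefix is not trained on
    labels = [-100] * len(text_ids) + [bos_midi] + midi_ids + [eos_midi]
    return {"input_ids": input_ids, "labels": labels}
```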

### Rich Prompt Format
```
You are a world-class composer. Please compose some music according to the following description:
Description: [caption]
Genre: [genre]
Mood: [mood]
Key: [key]
Time Signature: [time_signature]
Tempo: [tempo] BPM
Duration: [duration] seconds
Instruments: [instruments]
Chords: [chords]
```
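
A small helper that assembles this prompt from metadata fields (a sketch; the keys follow the template above, not necessarily the dataset's column names):

```python
def build_prompt(meta: dict) -> str:
    """Render the rich prompt from a metadata dict with the keys shown above."""
    return (
        "You are a world-class composer. Please compose some music "
        "according to the following description:\n"
        f"Description: {meta['caption']}\n"
        f"Genre: {meta['genre']}\n"
        f"Mood: {meta['mood']}\n"
        f"Key: {meta['key']}\n"
        f"Time Signature: {meta['time_signature']}\n"
        f"Tempo: {meta['tempo']} BPM\n"
        f"Duration: {meta['duration']} seconds\n"
        f"Instruments: {meta['instruments']}\n"
        f"Chords: {meta['chords']}"
    )
```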

## Recommended Datasets

| Dataset | Link | Description |
|---------|------|-------------|
| B-K/midi-dataset-2 | [HF](https://huggingface.co/datasets/B-K/midi-dataset-2) | Best: rich metadata + MIDI bytes |
| amaai-lab/MidiCaps | [HF](https://huggingface.co/datasets/amaai-lab/MidiCaps) | 168K captions (no MIDI bytes) |
| foldl/midi | [HF](https://huggingface.co/datasets/foldl/midi) | Name + genre + MIDI bytes |

## Hardware Recommendations

| Model | GPU | Batch | Notes |
|-------|-----|-------|-------|
| GPT2 (50M) | t4-small | 4 | Fast experimentation |
| Qwen3-0.6B | a10g-large | 2 | Enable gradient checkpointing (see sketch below) |
| Qwen3-0.6B | a100-large | 4 | Full training |
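
For the A10G row, the memory-saving settings correspond to the training flags shown earlier; in HF `Trainer` terms they look roughly like this (a sketch, not the script's exact configuration):

```python
from transformers import TrainingArguments

# Memory-saving configuration for a ~24 GB GPU (mirrors the CLI flags above)
args = TrainingArguments(
    output_dir="./midi_qwen3_model",
    num_train_epochs=20,
    per_device_train_batch_size=2,   # effective batch = 2 * 8 = 16
    gradient_accumulation_steps=8,
    bf16=True,                       # bfloat16 training
    gradient_checkpointing=True,     # trade compute for activation memory
)
```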

## SOTA References

- **MIDI-LLM** (Wu et al., 2025): LLM vocab expansion for MIDI
- **MIDI-GPT** (Pasquier et al., 2025): GPT2 for MIDI
- **text2midi** (Bhandari et al., AAAI 2025): T5 encoder + decoder