	# OmniVoice Examples

	This directory contains scripts and configs for training, fine-tuning, and evaluating OmniVoice.

	| Use Case | Script | Description |
	|---|---|---|
	| Training from scratch | [run_emilia.sh](run_emilia.sh) | Full pipeline on the Emilia dataset (data check, tokenization, training) |
	| Fine-tuning | [run_finetune.sh](run_finetune.sh) | Fine-tune from a pretrained checkpoint using your own JSONL data |
	| Evaluation | [run_eval.sh](run_eval.sh) | Evaluate WER, speaker similarity, and UTMOS on standard test sets |

	---

	## Training from Scratch (Emilia)

	[run_emilia.sh](run_emilia.sh) runs the full pipeline in 3 stages:

	| Stage | What it does |
	|---|---|
	| 0 | Verify the Emilia dataset and JSONL manifests are in place |
	| 1 | Tokenize audio into WebDataset shards |
	| 2 | Launch multi-GPU training with `accelerate` |

	Prerequisites:

	1. Download the Emilia dataset from [OpenXLab](https://openxlab.org.cn/datasets/Amphion/Emilia) and place it under `download/`:
	```
	download/Amphion___Emilia
	└── raw
	    ├── EN
	    └── ZH
	```
	2. Obtain JSONL manifests and place them in `data/emilia/manifests/`:
	- `emilia_en_train.jsonl`, `emilia_en_dev.jsonl`
	- `emilia_zh_train.jsonl`, `emilia_zh_dev.jsonl`

	You can generate them from the raw data, or download pre-processed manifests from [HuggingFace](https://huggingface.co/datasets/zhu-han/Emilia-Manifests).
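Before launching the script, it can help to confirm that the layout above is in place. The following is a minimal sketch of such a check, not the code that stage 0 of `run_emilia.sh` actually runs. The paths come from the prerequisites above; the helper name `missing_emilia_inputs` is hypothetical, and the check assumes you run it from the repository root.

```python
from pathlib import Path

# Locations listed in the prerequisites above.
EMILIA_RAW_DIRS = [
    "download/Amphion___Emilia/raw/EN",
    "download/Amphion___Emilia/raw/ZH",
]
EMILIA_MANIFESTS = [
    "data/emilia/manifests/emilia_en_train.jsonl",
    "data/emilia/manifests/emilia_en_dev.jsonl",
    "data/emilia/manifests/emilia_zh_train.jsonl",
    "data/emilia/manifests/emilia_zh_dev.jsonl",
]


def missing_emilia_inputs(root="."):
    """Return the expected directories and manifests that are absent under root.

    Hypothetical helper for illustration; it only checks that paths exist,
    not that their contents are valid.
    """
    root = Path(root)
    missing = [d for d in EMILIA_RAW_DIRS if not (root / d).is_dir()]
    missing += [m for m in EMILIA_MANIFESTS if not (root / m).is_file()]
    return missing


if __name__ == "__main__":
    for path in missing_emilia_inputs():
        print(f"missing: {path}")
```

An empty result means every expected path exists; it says nothing about whether the audio or manifests are complete.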

	Run the full pipeline:

	```bash
	bash examples/run_emilia.sh
	```

	Or run individual stages by setting `stage` and `stop_stage` at the top of the script (e.g. `stage=1`, `stop_stage=1` to only tokenize).
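The `stage` / `stop_stage` pair selects an inclusive range of stages, which is why setting both to `1` runs only tokenization. The selection logic can be sketched as follows. This is an illustration in Python, not the shell code in `run_emilia.sh`, and the function name `stages_to_run` is hypothetical.

```python
def stages_to_run(stage, stop_stage, num_stages=3):
    """Return the stage indices inside the inclusive range [stage, stop_stage].

    num_stages=3 matches the three stages (0, 1, 2) in the table above.
    """
    return [s for s in range(num_stages) if stage <= s <= stop_stage]


print(stages_to_run(0, 2))  # [0, 1, 2] -> full pipeline
print(stages_to_run(1, 1))  # [1]       -> tokenization only
print(stages_to_run(1, 2))  # [1, 2]    -> tokenize, then train
```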

	> See [docs/training.md](../docs/training.md) for config details, checkpoint resuming, and TensorBoard monitoring.

	---

	## Fine-tuning

	[run_finetune.sh](run_finetune.sh) fine-tunes from a pretrained checkpoint on your own data.

	### Step 1: Prepare Your Data

	Create a JSONL manifest where each line describes one audio sample:

	```jsonl
	{"id": "sample_001", "audio_path": "/data/audio/001.wav", "text": "Hello world", "language_id": "en"}
	{"id": "sample_002", "audio_path": "/data/audio/002.wav", "text": "你好世界", "language_id": "zh"}
	```

	`id`, `audio_path`, and `text` are mandatory. `language_id` is optional.
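A quick pre-flight check of a manifest might look like the sketch below. It only tests that each line is a JSON object containing the three mandatory fields named above; it is not the project's own validator, and the function name `validate_manifest_lines` is hypothetical.

```python
import json

# Mandatory fields as stated above; language_id is optional and not checked.
REQUIRED_FIELDS = ("id", "audio_path", "text")


def validate_manifest_lines(lines):
    """Return (line_number, message) pairs for every problem found."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((lineno, "not a JSON object"))
            continue
        for field in REQUIRED_FIELDS:
            if field not in record:
                problems.append((lineno, f"missing field: {field}"))
    return problems


sample = [
    '{"id": "sample_001", "audio_path": "/data/audio/001.wav", "text": "Hello world", "language_id": "en"}',
    '{"id": "sample_002", "audio_path": "/data/audio/002.wav"}',
]
print(validate_manifest_lines(sample))  # [(2, 'missing field: text')]
```

To check a file, pass an open file handle, for example `validate_manifest_lines(open("data/my_data_train.jsonl", encoding="utf-8"))`.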

	> See [docs/data_preparation.md](../docs/data_preparation.md) for the full data format specification.

	### Step 2: Configure the Script

	Edit the variables at the top of `run_finetune.sh`:

	```bash
	TRAIN_JSONL="data/my_data_train.jsonl" # path to training JSONL
	DEV_JSONL="data/my_data_dev.jsonl" # path to dev JSONL
	GPU_IDS="0,1" # GPUs to use
	NUM_GPUS=2
	OUTPUT_DIR="exp/omnivoice_finetune" # output directory
	```
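In the example above, `NUM_GPUS=2` matches the two IDs in `GPU_IDS="0,1"`. Assuming the two variables are meant to agree (the script itself is the authority on this), a small consistency check could be sketched as follows; the function name `gpu_settings_consistent` is hypothetical.

```python
def gpu_settings_consistent(gpu_ids, num_gpus):
    """Return True if gpu_ids lists exactly num_gpus distinct device IDs.

    Assumption: NUM_GPUS should equal the number of comma-separated
    entries in GPU_IDS, as in the example configuration above.
    """
    ids = [g.strip() for g in gpu_ids.split(",") if g.strip()]
    return len(ids) == num_gpus and len(set(ids)) == len(ids)


print(gpu_settings_consistent("0,1", 2))    # True
print(gpu_settings_consistent("0,1,2", 2))  # False
```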

	### Step 3: Run

	```bash
	bash examples/run_finetune.sh
	```

	The script will:
	1. Tokenize your audio into WebDataset shards
	2. Launch fine-tuning with `accelerate`

	The main differences between the fine-tuning config ([config/train_config_finetune.json](config/train_config_finetune.json)) and the Emilia training config ([config/train_config_emilia.json](config/train_config_emilia.json)) are:

	| Parameter | Emilia (from scratch) | Fine-tune | Why |
	|---|---|---|---|
	| `init_from_checkpoint` | `null` | `"k2-fsa/OmniVoice"` | Load pretrained weights |
	| `steps` | 300,000 | 5,000 | Fewer steps for fine-tuning; adjust for your data and task |
	| `learning_rate` | 1e-4 | 5e-5 | Lower LR for fine-tuning; adjust for your data and task |

	To use a different pretrained checkpoint, modify `init_from_checkpoint` in the config file.
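Rather than editing the shipped config in place, you may prefer to write a modified copy. The sketch below assumes the three parameters in the table are top-level keys of the JSON file, which this README does not confirm, so check the file first. The function name `write_config_variant` and the output path in the usage comment are hypothetical.

```python
import json
from pathlib import Path


def write_config_variant(base_path, out_path, **overrides):
    """Copy a JSON config to out_path, replacing the given top-level keys.

    Assumption: the overridden parameters are top-level keys. Unknown keys
    raise KeyError so that a typo does not silently add a new setting.
    """
    config = json.loads(Path(base_path).read_text(encoding="utf-8"))
    unknown = [key for key in overrides if key not in config]
    if unknown:
        raise KeyError(f"keys not present in base config: {unknown}")
    config.update(overrides)
    Path(out_path).write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Example usage (output path and values are illustrative):
# write_config_variant(
#     "examples/config/train_config_finetune.json",
#     "exp/my_finetune_config.json",
#     steps=2000,
#     learning_rate=2e-5,
# )
```

How `run_finetune.sh` selects its config file is not shown here, so point the script at the new file in whatever way it expects.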

	If you encounter issues with `flex_attention` on your GPU, use [config/train_config_finetune_sdpa.json](config/train_config_finetune_sdpa.json) instead, which uses SDPA attention for broader compatibility. See [docs/training.md](../docs/training.md#attention-implementation) for details.

	---

	## Evaluation

	Install evaluation dependencies first:

	```bash
	pip install "omnivoice[eval]"  # quotes keep shells such as zsh from expanding the brackets
	# or
	uv sync --extra eval
	```

	Supported test sets: `librispeech_pc`, `seedtts_en`, `seedtts_zh`, `fleurs`, `minimax`.

	```bash
	bash examples/run_eval.sh
	```

	> See [docs/evaluation.md](../docs/evaluation.md) for metrics details, test set preparation, and running individual metrics.