| # OmniVoice Examples |
|
|
| This directory contains scripts and configs for training, fine-tuning, and evaluating OmniVoice. |
|
|
| | Use Case | Script | Description | |
| |---|---|---| |
| | Training from scratch | [run_emilia.sh](run_emilia.sh) | Full pipeline on the Emilia dataset (data check, tokenization, training) | |
| | Fine-tuning | [run_finetune.sh](run_finetune.sh) | Fine-tune from a pretrained checkpoint using your own JSONL data | |
| | Evaluation | [run_eval.sh](run_eval.sh) | Evaluate WER, speaker similarity, and UTMOS on standard test sets | |
|
|
| --- |
|
|
| ## Training from Scratch (Emilia) |
|
|
| [run_emilia.sh](run_emilia.sh) runs the full pipeline in 3 stages: |
|
|
| | Stage | What it does | |
| |---|---| |
| | 0 | Verify the Emilia dataset and JSONL manifests are in place | |
| | 1 | Tokenize audio into WebDataset shards | |
| | 2 | Launch multi-GPU training with `accelerate` | |
|
|
| **Prerequisites:** |
|
|
| 1. Download the Emilia dataset from [OpenXLab](https://openxlab.org.cn/datasets/Amphion/Emilia) and place it under `download/`: |
| ``` |
| download/Amphion___Emilia |
| βββ raw |
| βββ EN |
| βββ ZH |
| ``` |
| 2. Obtain JSONL manifests and place them in `data/emilia/manifests/`: |
| - `emilia_en_train.jsonl`, `emilia_en_dev.jsonl` |
| - `emilia_zh_train.jsonl`, `emilia_zh_dev.jsonl` |
|
|
| You can generate them from the raw data, or download pre-processed manifests from [HuggingFace](https://huggingface.co/datasets/zhu-han/Emilia-Manifests). |
|
|
| **Run the full pipeline:** |
|
|
| ```bash |
| bash examples/run_emilia.sh |
| ``` |
|
|
| Or run individual stages by setting `stage` and `stop_stage` at the top of the script (e.g. `stage=1`, `stop_stage=1` to only tokenize). |
|
|
| > See [docs/training.md](../docs/training.md) for config details, checkpoint resuming, and TensorBoard monitoring. |
|
|
| --- |
|
|
| ## Fine-tuning |
|
|
| [run_finetune.sh](run_finetune.sh) fine-tunes from a pretrained checkpoint on your own data. |
|
|
| ### Step 1: Prepare Your Data |
|
|
| Create a JSONL manifest where each line describes one audio sample: |
|
|
| ```jsonl |
| {"id": "sample_001", "audio_path": "/data/audio/001.wav", "text": "Hello world", "language_id": "en"} |
| {"id": "sample_002", "audio_path": "/data/audio/002.wav", "text": "δ½ ε₯½δΈη", "language_id": "zh"} |
| ``` |
|
|
| `id`, `audio_path`, and `text` are mandatory. `language_id` is optional. |
|
|
| > See [docs/data_preparation.md](../docs/data_preparation.md) for the full data format specification. |
|
|
| ### Step 2: Configure the Script |
|
|
| Edit the variables at the top of `run_finetune.sh`: |
|
|
| ```bash |
| TRAIN_JSONL="data/my_data_train.jsonl" # path to training JSONL |
| DEV_JSONL="data/my_data_dev.jsonl" # path to dev JSONL |
| GPU_IDS="0,1" # GPUs to use |
| NUM_GPUS=2 |
| OUTPUT_DIR="exp/omnivoice_finetune" # output directory |
| ``` |
|
|
| ### Step 3: Run |
|
|
| ```bash |
| bash examples/run_finetune.sh |
| ``` |
|
|
| The script will: |
| 1. Tokenize your audio into WebDataset shards |
| 2. Launch fine-tuning with `accelerate` |
|
|
| Main difference between fine-tuning config ([config/train_config_finetune.json](config/train_config_finetune.json)) and the Emilia training config ([config/train_config_emilia.json](config/train_config_emilia.json)) are: |
|
|
| | Parameter | Emilia (from scratch) | Fine-tune | Why | |
| |---|---|---|---| |
| | `init_from_checkpoint` | `null` | `"k2-fsa/OmniVoice"` | Load pretrained weights | |
| | `steps` | 300,000 | 5,000 | Fewer steps for fine-tuning, can be tuned according to your data/task. | |
| | `learning_rate` | 1e-4 | 5e-5 | Lower LR for fine-tuning, can be tuned according to your data/task | |
|
|
| To use a different pretrained checkpoint, modify `init_from_checkpoint` in the config file. |
|
|
| If you encounter issues with `flex_attention` on your GPU, use [config/train_config_finetune_sdpa.json](config/train_config_finetune_sdpa.json) instead, which uses SDPA attention for broader compatibility. See [docs/training.md](../docs/training.md#attention-implementation) for details. |
|
|
| --- |
|
|
| ## Evaluation |
|
|
| Install evaluation dependencies first: |
|
|
| ```bash |
| pip install omnivoice[eval] |
| # or |
| uv sync --extra eval |
| ``` |
|
|
| Supported test sets: `librispeech_pc`, `seedtts_en`, `seedtts_zh`, `fleurs`, `minimax`. |
|
|
| ```bash |
| bash examples/run_eval.sh |
| ``` |
|
|
| > See [docs/evaluation.md](../docs/evaluation.md) for metrics details, test set preparation, and running individual metrics. |
|
|
|
|