# OmniVoice Examples

This directory contains scripts and configs for training, fine-tuning, and evaluating OmniVoice.

| Use Case | Script | Description |
|---|---|---|
| Training from scratch | [run_emilia.sh](run_emilia.sh) | Full pipeline on the Emilia dataset (data check, tokenization, training) |
| Fine-tuning | [run_finetune.sh](run_finetune.sh) | Fine-tune from a pretrained checkpoint using your own JSONL data |
| Evaluation | [run_eval.sh](run_eval.sh) | Evaluate WER, speaker similarity, and UTMOS on standard test sets |

---

## Training from Scratch (Emilia)

[run_emilia.sh](run_emilia.sh) runs the full pipeline in 3 stages:

| Stage | What it does |
|---|---|
| 0 | Verify the Emilia dataset and JSONL manifests are in place |
| 1 | Tokenize audio into WebDataset shards |
| 2 | Launch multi-GPU training with `accelerate` |

**Prerequisites:**

1. Download the Emilia dataset from [OpenXLab](https://openxlab.org.cn/datasets/Amphion/Emilia) and place it under `download/`:

   ```
   download/Amphion___Emilia
   └── raw
       ├── EN
       └── ZH
   ```

2. Obtain JSONL manifests and place them in `data/emilia/manifests/`:
   - `emilia_en_train.jsonl`, `emilia_en_dev.jsonl`
   - `emilia_zh_train.jsonl`, `emilia_zh_dev.jsonl`

   You can generate them from the raw data, or download pre-processed manifests from [HuggingFace](https://huggingface.co/datasets/zhu-han/Emilia-Manifests).

**Run the full pipeline:**

```bash
bash examples/run_emilia.sh
```

Or run individual stages by setting `stage` and `stop_stage` at the top of the script (e.g. `stage=1`, `stop_stage=1` to only tokenize).

> See [docs/training.md](../docs/training.md) for config details, checkpoint resuming, and TensorBoard monitoring.

---

## Fine-tuning

[run_finetune.sh](run_finetune.sh) fine-tunes from a pretrained checkpoint on your own data.

### Step 1: Prepare Your Data

Create a JSONL manifest where each line describes one audio sample:

```jsonl
{"id": "sample_001", "audio_path": "/data/audio/001.wav", "text": "Hello world", "language_id": "en"}
{"id": "sample_002", "audio_path": "/data/audio/002.wav", "text": "你好世界", "language_id": "zh"}
```

`id`, `audio_path`, and `text` are mandatory. `language_id` is optional.

> See [docs/data_preparation.md](../docs/data_preparation.md) for the full data format specification.

### Step 2: Configure the Script

Edit the variables at the top of `run_finetune.sh`:

```bash
TRAIN_JSONL="data/my_data_train.jsonl"   # path to training JSONL
DEV_JSONL="data/my_data_dev.jsonl"       # path to dev JSONL
GPU_IDS="0,1"                            # GPUs to use
NUM_GPUS=2
OUTPUT_DIR="exp/omnivoice_finetune"      # output directory
```

### Step 3: Run

```bash
bash examples/run_finetune.sh
```

The script will:

1. Tokenize your audio into WebDataset shards
2. Launch fine-tuning with `accelerate`

The main differences between the fine-tuning config ([config/train_config_finetune.json](config/train_config_finetune.json)) and the Emilia training config ([config/train_config_emilia.json](config/train_config_emilia.json)) are:

| Parameter | Emilia (from scratch) | Fine-tune | Why |
|---|---|---|---|
| `init_from_checkpoint` | `null` | `"k2-fsa/OmniVoice"` | Load pretrained weights |
| `steps` | 300,000 | 5,000 | Fewer steps for fine-tuning; tune to your data and task |
| `learning_rate` | 1e-4 | 5e-5 | Lower LR for fine-tuning; tune to your data and task |

To use a different pretrained checkpoint, modify `init_from_checkpoint` in the config file.

If you encounter issues with `flex_attention` on your GPU, use [config/train_config_finetune_sdpa.json](config/train_config_finetune_sdpa.json) instead, which uses SDPA attention for broader compatibility. See [docs/training.md](../docs/training.md#attention-implementation) for details.
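
Before launching a fine-tuning run, it can help to sanity-check the JSONL manifest described in Step 1. The sketch below is not part of OmniVoice: `find_manifest_problems` is a hypothetical helper that only checks what this README states (each line is a JSON object with `id`, `audio_path`, and `text`). Treating those fields as non-empty strings is an assumption; see [docs/data_preparation.md](../docs/data_preparation.md) for the authoritative format.

```python
import json
import sys

# Fields this README lists as mandatory; `language_id` is optional.
REQUIRED_FIELDS = ("id", "audio_path", "text")


def find_manifest_problems(lines):
    """Return a list of (line_number, message) for malformed manifest lines.

    Hypothetical helper, not an OmniVoice API. Assumes each mandatory
    field should be a non-empty string.
    """
    problems = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(entry, dict):
            problems.append((line_number, "entry is not a JSON object"))
            continue
        for field in REQUIRED_FIELDS:
            value = entry.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append((line_number, f"missing or empty field: {field}"))
    return problems


# Usage: python check_manifest.py data/my_data_train.jsonl
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as manifest:
        for line_number, message in find_manifest_problems(manifest):
            print(f"line {line_number}: {message}")
```

The script name `check_manifest.py` is illustrative. The check does not confirm that the audio files exist or can be decoded; it only catches structural mistakes in the manifest before tokenization starts.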

---

## Evaluation

Install evaluation dependencies first:

```bash
pip install omnivoice[eval]
# or
uv sync --extra eval
```

Supported test sets: `librispeech_pc`, `seedtts_en`, `seedtts_zh`, `fleurs`, `minimax`.

```bash
bash examples/run_eval.sh
```

> See [docs/evaluation.md](../docs/evaluation.md) for metrics details, test set preparation, and running individual metrics.