---
language:
- en
- zh
- ja
- ko
- es
- fr
- de
- it
- ru
- hi
- gu
library_name: transformers
pipeline_tag: text-to-speech
license: apache-2.0
base_model: k2-fsa/OmniVoice
tags:
- text-to-speech
- tts
- singing
- emotion
- expressive-tts
- multilingual
- voice-cloning
- omnivoice
---
# OmniVoice – Singing + Emotion Finetune
A finetune of [`k2-fsa/OmniVoice`](https://huggingface.co/k2-fsa/OmniVoice) that adds:
- **`[singing]` tag** – sung speech / nursery-style melodic vocals
- **Emotion tags** – `[happy]`, `[sad]`, `[angry]`, `[excited]`, `[calm]`, `[nervous]`, `[whisper]`
- **Combined tags** – e.g. `[singing] [happy] ...` or `[singing] [sad] ...`

Original OmniVoice capabilities (multilingual zero-shot TTS, voice cloning, voice design, 600+ languages) are **preserved**: the base speech head was protected during finetuning with a continuity mix of plain speech and singing.
## Drop-in replacement
This checkpoint is fully compatible with the upstream [k2-fsa/OmniVoice](https://github.com/k2-fsa/OmniVoice) code: same architecture (Qwen3-0.6B LM + HiggsAudioV2 audio tokenizer at 24 kHz), same inference API. Just replace the model id:
```python
from omnivoice.models.omnivoice import OmniVoice
import soundfile as sf

model = OmniVoice.from_pretrained("ModelsLab/omnivoice-singing").to("cuda").eval()

# Normal speech (unchanged behavior)
audios = model.generate(
    text="The quick brown fox jumps over the lazy dog.",
    language="English",
)

# Singing
audios = model.generate(
    text="[singing] Twinkle twinkle little star, how I wonder what you are.",
    language="English",
)

# Emotional speech
audios = model.generate(
    text="[happy] I just got the best news of my entire year!",
    language="English",
)

# Combined
audios = model.generate(
    text="[singing] [sad] Quiet rain falls on the stone, memories of days now gone.",
    language="English",
)

# Save the most recent result to disk
sf.write("out.wav", audios[0], model.sampling_rate)
```
CLI works the same way:
```bash
omnivoice-infer --model ModelsLab/omnivoice-singing \
--text "[happy] Hello there, how wonderful to see you today!" \
--language English \
--output out.wav
```
## Supported tags
| Tag | Source data | Strength |
|---|---|---|
| `[singing]` | GTSinger English (6,755 clips, ~8 h) | strong |
| `[happy]` | CREMA-D + RAVDESS + Expresso (~2900 clips) | strong |
| `[sad]` | CREMA-D + RAVDESS + Expresso (~2900 clips) | strong |
| `[angry]` | CREMA-D + RAVDESS (~1500 clips) | strong |
| `[nervous]` | CREMA-D fear + RAVDESS fearful (~1400 clips) | strong |
| `[whisper]` | Expresso whisper (~1500 clips) | strong |
| `[calm]` | RAVDESS calm (~190 clips) | weak – limited data |
| `[excited]` | RAVDESS surprised (~190 clips) | weak – limited data |
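To hear how strongly each tag comes through, one quick check is to render the same sentence with every tag and compare the outputs side by side. This is a small usage sketch built on the `generate` call shown above; the prompt and output filenames are arbitrary:

```python
import soundfile as sf

tags = ["[singing]", "[happy]", "[sad]", "[angry]",
        "[nervous]", "[whisper]", "[calm]", "[excited]"]

for tag in tags:
    audios = model.generate(
        text=f"{tag} The weather finally cleared up this afternoon.",
        language="English",
    )
    # One file per tag (tag_happy.wav, tag_sad.wav, ...) for side-by-side listening.
    sf.write(f"tag_{tag.strip('[]')}.wav", audios[0], model.sampling_rate)
```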
A guidance scale of **3.0** (up from the default of 2.0) is recommended to make tag behavior more pronounced:
```python
audios = model.generate(
text="[happy] Welcome!",
language="English",
guidance_scale=3.0,
)
```
## What's preserved from the base
- Multilingual TTS (English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Russian, Hindi, Gujarati, etc.)
- Voice cloning from reference audio (`ref_audio` / `ref_text` args) – see the sketch after this list
- Voice design via the `instruct` parameter (pitch / gender / age / accent attributes)
- Fine-grained pronunciation control (pinyin / CMU phoneme overrides)
- Speed and duration control (`speed` / `duration` args)
- Built-in non-verbal symbols (`[laughter]`, `[sigh]`, etc.)
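A minimal sketch of voice cloning and voice design with this checkpoint, reusing the argument names listed above (`ref_audio`, `ref_text`, `instruct`, `speed`); the reference clip path, its transcript, and the instruct wording are placeholders, not shipped assets:

```python
# Voice cloning: condition generation on a short reference clip plus its transcript.
# "my_reference.wav" and its transcript are placeholders for your own data.
audios = model.generate(
    text="It is so good to hear from you again!",
    language="English",
    ref_audio="my_reference.wav",
    ref_text="This is a short sample of my voice.",
)

# Voice design: describe the target voice instead of cloning one.
audios = model.generate(
    text="Welcome aboard, and thank you for joining us today.",
    language="English",
    instruct="A warm, low-pitched adult voice with a calm, gentle delivery.",
    speed=1.0,
)
```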
## Training
Two-stage finetune from `k2-fsa/OmniVoice`:
**Stage 1 – Singing** (2500 steps):
- GTSinger English (6.8k clips, tagged `[singing] {lyrics}`)
- LibriTTS-R dev+test clean (10k clips, plain text, for speech preservation)
- LR 3e-5 cosine, bf16, 2 GPUs, batch_tokens=8192
- Final eval loss: **4.74**
**Stage 2 – Emotion** (2500 steps, forked from singing/checkpoint-2500):
- CREMA-D + RAVDESS + Expresso read config (10.8k emotion clips)
- 1.5k singing + 1.5k speech continuity samples
- LR 3e-5 cosine, bf16, 2 GPUs, batch_tokens=8192
- Best eval loss: **4.72** (step 750); final eval loss: **4.88** (step 2500)

The published checkpoint is the final emotion checkpoint at step 2500: although its eval loss is slightly higher than the step-750 best, it subjectively produces the cleanest emotional tag behavior while preserving speech and singing quality.
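For orientation, the recipe above can be summarized in one place. The dictionary below is purely illustrative; its keys are hypothetical and do not correspond to actual options of the OmniVoice training framework:

```python
# Illustrative summary of the two-stage recipe described above.
# Keys are hypothetical; they do not correspond to real trainer options.
RECIPE = {
    "stage1_singing": {
        "init_from": "k2-fsa/OmniVoice",
        "data": ["GTSinger English (6.8k clips, '[singing] {lyrics}')",
                 "LibriTTS-R dev+test clean (10k clips, plain text)"],
        "steps": 2500, "lr": 3e-5, "schedule": "cosine",
        "precision": "bf16", "gpus": 2, "batch_tokens": 8192,
    },
    "stage2_emotion": {
        "init_from": "singing/checkpoint-2500",
        "data": ["CREMA-D + RAVDESS + Expresso read config (10.8k emotion clips)",
                 "1.5k singing + 1.5k plain-speech continuity samples"],
        "steps": 2500, "lr": 3e-5, "schedule": "cosine",
        "precision": "bf16", "gpus": 2, "batch_tokens": 8192,
    },
}
```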
## Known limitations
- `[calm]` and `[excited]` had only ~190 training samples each (only one dataset contributed), so their behavior is weaker than that of the other emotion tags.
- Cross-language singing (sung Hindi, Gujarati, etc.) is extrapolation – it works, but quality varies.
- Like the base model, output quality is bounded by the **HiggsAudioV2 tokenizer** (24 kHz, ~2 kbps, speech-domain tuned). Music / drum content is not supported by design.
## License
Apache 2.0. Downstream users must also comply with the individual licenses of the training datasets:
- GTSinger: CC BY-NC-SA 4.0 (research use)
- CREMA-D: ODbL
- RAVDESS: CC BY-NC-SA 4.0
- Expresso: CC BY-NC 4.0
- LibriTTS-R: CC BY 4.0
## Acknowledgements
- [k2-fsa/OmniVoice](https://github.com/k2-fsa/OmniVoice) – base model & training framework
- [HiggsAudioV2](https://huggingface.co/bosonai/higgs-audio-v2-tokenizer) – discrete audio tokenizer
- Qwen team – Qwen3-0.6B backbone
- Dataset authors: GTSinger, CREMA-D, RAVDESS, Expresso, LibriTTS-R teams