---
language:
- en
- zh
- ja
- ko
- es
- fr
- de
- it
- ru
- hi
- gu
library_name: transformers
pipeline_tag: text-to-speech
license: apache-2.0
base_model: k2-fsa/OmniVoice
tags:
- text-to-speech
- tts
- singing
- emotion
- expressive-tts
- multilingual
- voice-cloning
- omnivoice
---

# OmniVoice — Singing + Emotion Finetune

A finetune of [`k2-fsa/OmniVoice`](https://huggingface.co/k2-fsa/OmniVoice) that adds:

- **`[singing]` tag** — sung speech / nursery-style melodic vocals
- **Emotion tags** — `[happy]`, `[sad]`, `[angry]`, `[excited]`, `[calm]`, `[nervous]`, `[whisper]`
- **Combined tags** — e.g. `[singing] [happy] ...` or `[singing] [sad] ...`

Original OmniVoice capabilities (multilingual zero-shot TTS, voice cloning, voice design, 600+ languages) are **preserved** — the base speech head was protected during finetuning with a continuity mix of plain speech and singing.

## Drop-in replacement

This checkpoint is fully compatible with the upstream [k2-fsa/OmniVoice](https://github.com/k2-fsa/OmniVoice) code — same architecture (Qwen3-0.6B LM + HiggsAudioV2 audio tokenizer at 24 kHz), same inference API. Replace the model id:

```python
import soundfile as sf
from omnivoice.models.omnivoice import OmniVoice

model = OmniVoice.from_pretrained("ModelsLab/omnivoice-singing").to("cuda").eval()

# Normal speech (unchanged behavior)
audios = model.generate(
    text="The quick brown fox jumps over the lazy dog.",
    language="English",
)

# Singing
audios = model.generate(
    text="[singing] Twinkle twinkle little star, how I wonder what you are.",
    language="English",
)

# Emotional speech
audios = model.generate(
    text="[happy] I just got the best news of my entire year!",
    language="English",
)

# Combined
audios = model.generate(
    text="[singing] [sad] Quiet rain falls on the stone, memories of days now gone.",
    language="English",
)

sf.write("out.wav", audios[0], model.sampling_rate)
```

The CLI works the same way:

```bash
omnivoice-infer --model ModelsLab/omnivoice-singing \
    --text "[happy] Hello there, how wonderful to see you today!" \
    --language English \
    --output out.wav
```

## Supported tags

| Tag | Source data | Strength |
|---|---|---|
| `[singing]` | GTSinger English (6,755 clips, ~8 h) | strong |
| `[happy]` | CREMA-D + RAVDESS + Expresso (~2,900 clips) | strong |
| `[sad]` | CREMA-D + RAVDESS + Expresso (~2,900 clips) | strong |
| `[angry]` | CREMA-D + RAVDESS (~1,500 clips) | strong |
| `[nervous]` | CREMA-D fear + RAVDESS fearful (~1,400 clips) | strong |
| `[whisper]` | Expresso whisper (~1,500 clips) | strong |
| `[calm]` | RAVDESS calm (~190 clips) | weak — limited data |
| `[excited]` | RAVDESS surprised (~190 clips) | weak — limited data |

A guidance scale of **3.0** (up from the default 2.0) is recommended to make tag behavior more pronounced:

```python
audios = model.generate(
    text="[happy] Welcome!",
    language="English",
    guidance_scale=3.0,
)
```

## What's preserved from the base

- Multilingual TTS (English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Russian, Hindi, Gujarati, etc.)
- Voice cloning from reference audio (`ref_audio` / `ref_text` args — see the sketch below)
- Voice design via the `instruct` parameter (pitch / gender / age / accent attributes)
- Fine-grained pronunciation control (pinyin / CMU phoneme overrides)
- Speed and duration control (`speed` / `duration` args)
- Built-in non-verbal symbols (`[laughter]`, `[sigh]`, etc.)
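The expressive tags compose with the base model's cloning and design controls. The sketch below is illustrative only: it reuses `model` from the example above and assumes `ref_audio`, `ref_text`, `instruct`, and `speed` are keyword arguments to `model.generate()` (they are listed as args above, but check the upstream OmniVoice docs for the exact signature); the file path and prompt strings are hypothetical.

```python
# Illustrative sketch: parameter names come from the feature list above;
# verify exact semantics against the upstream OmniVoice API.

# Voice cloning: condition on a reference clip and its transcript (hypothetical file).
audios = model.generate(
    text="[singing] [happy] Row, row, row your boat, gently down the stream.",
    language="English",
    ref_audio="reference_speaker.wav",  # voice to clone
    ref_text="Transcript of what the reference speaker says in the clip.",
)

# Voice design: describe the target voice instead of cloning one, with slower delivery.
audios = model.generate(
    text="[calm] Take a deep breath and let your shoulders drop.",
    language="English",
    instruct="A warm, low-pitched voice in its forties with a light British accent.",
    speed=0.9,  # assumed to be a relative speaking-rate factor
)
```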
## Training

Two-stage finetune from `k2-fsa/OmniVoice`:

**Stage 1 — Singing** (2500 steps):

- GTSinger English (6.8k clips, tagged `[singing] {lyrics}`)
- LibriTTS-R dev+test clean (10k clips, plain text — speech preservation)
- LR 3e-5 cosine, bf16, 2 GPUs, batch_tokens=8192
- Final eval loss: **4.74**

**Stage 2 — Emotion** (2500 steps, forked from singing/checkpoint-2500):

- CREMA-D + RAVDESS + Expresso read config (10.8k emotion clips)
- 1.5k singing + 1.5k speech continuity samples
- LR 3e-5 cosine, bf16, 2 GPUs, batch_tokens=8192
- Best eval loss: **4.72** (step 750); final eval loss: **4.88** (step 2500)

The published checkpoint is the **final emotion checkpoint at step 2500**: despite the slightly higher eval loss, it subjectively produces the cleanest emotional tag behavior while preserving speech and singing quality.

## Known limitations

- `[calm]` and `[excited]` had only ~190 training samples each (only one dataset contributed), so their behavior is weaker than that of the other emotion tags.
- Cross-language singing (sung Hindi, Gujarati, etc.) is extrapolation — it works, but quality varies.
- As with the base model, output quality is bounded by the **HiggsAudioV2 tokenizer** (24 kHz, ~2 kbps, speech-domain tuned). Music / drum content is not supported by design.

## License

Apache 2.0. Downstream users must also comply with the individual licenses of the training datasets:

- GTSinger: CC BY-NC-SA 4.0 (research use)
- CREMA-D: ODbL
- RAVDESS: CC BY-NC-SA 4.0
- Expresso: CC BY-NC 4.0
- LibriTTS-R: CC BY 4.0

## Acknowledgements

- [k2-fsa/OmniVoice](https://github.com/k2-fsa/OmniVoice) — base model & training framework
- [HiggsAudioV2](https://huggingface.co/bosonai/higgs-audio-v2-tokenizer) — discrete audio tokenizer
- Qwen team — Qwen3-0.6B backbone
- Dataset authors: GTSinger, CREMA-D, RAVDESS, Expresso, LibriTTS-R teams