---
language:
- en
- zh
- ja
- ko
- es
- fr
- de
- it
- ru
- hi
- gu
library_name: transformers
pipeline_tag: text-to-speech
license: apache-2.0
base_model: k2-fsa/OmniVoice
tags:
- text-to-speech
- tts
- singing
- emotion
- expressive-tts
- multilingual
- voice-cloning
- omnivoice
---
# OmniVoice: Singing + Emotion Finetune
A finetune of [`k2-fsa/OmniVoice`](https://huggingface.co/k2-fsa/OmniVoice) that adds:
- **`[singing]` tag**: sung speech / nursery-style melodic vocals
- **Emotion tags**: `[happy]`, `[sad]`, `[angry]`, `[excited]`, `[calm]`, `[nervous]`, `[whisper]`
- **Combined tags**: e.g. `[singing] [happy] ...` or `[singing] [sad] ...`
Original OmniVoice capabilities (multilingual zero-shot TTS, voice cloning, voice design, 600+ languages) are **preserved**: the base speech head was protected during finetuning with a continuity mix of plain speech and singing.
## Drop-in replacement
This checkpoint is fully compatible with the upstream [k2-fsa/OmniVoice](https://github.com/k2-fsa/OmniVoice) code: same architecture (Qwen3-0.6B LM + HiggsAudioV2 audio tokenizer at 24 kHz), same inference API. Replace the model id:
```python
import soundfile as sf

from omnivoice.models.omnivoice import OmniVoice

model = OmniVoice.from_pretrained("ModelsLab/omnivoice-singing").to("cuda").eval()

# Normal speech (unchanged behavior)
audios = model.generate(
    text="The quick brown fox jumps over the lazy dog.",
    language="English",
)

# Singing
audios = model.generate(
    text="[singing] Twinkle twinkle little star, how I wonder what you are.",
    language="English",
)

# Emotional speech
audios = model.generate(
    text="[happy] I just got the best news of my entire year!",
    language="English",
)

# Combined
audios = model.generate(
    text="[singing] [sad] Quiet rain falls on the stone, memories of days now gone.",
    language="English",
)

# Write the first generated clip from the last call to disk.
sf.write("out.wav", audios[0], model.sampling_rate)
```
CLI works the same way:
```bash
omnivoice-infer --model ModelsLab/omnivoice-singing \
    --text "[happy] Hello there, how wonderful to see you today!" \
    --language English \
    --output out.wav
```
## Supported tags
| Tag | Source data | Strength |
|---|---|---|
| `[singing]` | GTSinger English (6,755 clips, ~8 h) | strong |
| `[happy]` | CREMA-D + RAVDESS + Expresso (~2900 clips) | strong |
| `[sad]` | CREMA-D + RAVDESS + Expresso (~2900 clips) | strong |
| `[angry]` | CREMA-D + RAVDESS (~1500 clips) | strong |
| `[nervous]` | CREMA-D fear + RAVDESS fearful (~1400 clips) | strong |
| `[whisper]` | Expresso whisper (~1500 clips) | strong |
| `[calm]` | RAVDESS calm (~190 clips) | weak (limited data) |
| `[excited]` | RAVDESS surprised (~190 clips) | weak (limited data) |
A guidance scale of **3.0** (up from the default 2.0) is recommended to make tag behavior more pronounced:
```python
audios = model.generate(
    text="[happy] Welcome!",
    language="English",
    guidance_scale=3.0,
)
```
## What's preserved from the base
- Multilingual TTS (English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Russian, Hindi, Gujarati, etc.)
- Voice cloning from reference audio (`ref_audio` / `ref_text` args); see the sketch after this list
- Voice design via `instruct` parameter (pitch / gender / age / accent attributes)
- Fine-grained pronunciation control (pinyin / CMU phoneme overrides)
- Speed and duration control (`speed` / `duration` args)
- Built-in non-verbal symbols (`[laughter]`, `[sigh]`, etc.)
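Continuing from the `model` loaded above, here is a minimal sketch of how these inherited features can be invoked. It only reuses the argument names listed in this section (`ref_audio`, `ref_text`, `instruct`, `speed`); the file path, reference transcript, and values are placeholders, and the exact semantics follow the upstream OmniVoice API.
```python
# Sketch only: argument names come from the feature list above; values are placeholders.

# Voice cloning: condition generation on a reference clip and its transcript.
audios = model.generate(
    text="[happy] It is so good to finally hear from you again!",
    language="English",
    ref_audio="reference_speaker.wav",  # path to a reference recording (placeholder)
    ref_text="This is what the reference speaker says in that recording.",
)

# Voice design: describe the target voice via the `instruct` parameter.
audios = model.generate(
    text="[singing] Row, row, row your boat, gently down the stream.",
    language="English",
    instruct="A warm, low-pitched adult voice with a calm delivery.",
)

# Speed control via the `speed` argument (placeholder value).
audios = model.generate(
    text="The quick brown fox jumps over the lazy dog.",
    language="English",
    speed=1.2,
)
```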
## Training
Two-stage finetune from `k2-fsa/OmniVoice`:
**Stage 1 - Singing** (2500 steps):
- GTSinger English (6.8k clips, tagged `[singing] {lyrics}`)
- LibriTTS-R dev+test clean (10k clips, plain text, for speech preservation)
- LR 3e-5 cosine, bf16, 2 GPUs, batch_tokens=8192
- Final eval loss: **4.74**
**Stage 2 - Emotion** (2500 steps, forked from singing/checkpoint-2500):
- CREMA-D + RAVDESS + Expresso read config (10.8k emotion clips)
- 1.5k singing + 1.5k speech continuity samples
- LR 3e-5 cosine, bf16, 2 GPUs, batch_tokens=8192
- Best eval loss: **4.72** (step 750); final eval loss: **4.88** (step 2500, the published checkpoint)
This published checkpoint is the **final emotion step 2500**, which subjectively produces the cleanest emotional tag behavior while preserving speech/singing quality.
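The snippet below is a hypothetical illustration of the text-side convention described above (a control tag prefixed to each transcript, with untagged continuity samples mixed in); it is not the actual OmniVoice training manifest or pipeline.
```python
from typing import Optional

# Hypothetical illustration of the tag-prefix convention described above;
# not the actual OmniVoice training manifest format or pipeline.
def tag_transcript(tag: Optional[str], transcript: str) -> str:
    """Prefix a transcript with a control tag, e.g. '[singing] Twinkle twinkle ...'."""
    return f"{tag} {transcript}" if tag else transcript

training_texts = [
    tag_transcript("[singing]", "Twinkle twinkle little star, how I wonder what you are."),
    tag_transcript("[sad]", "I really thought this time would be different."),
    # Untagged plain-speech samples form the continuity mix that protects the base speech head.
    tag_transcript(None, "The quick brown fox jumps over the lazy dog."),
]
```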
## Known limitations
- `[calm]` and `[excited]` had only ~190 training samples each (only one dataset contributed), so their behavior is weaker than the other emotion tags.
- Cross-language singing (sung Hindi, Gujarati, etc.) is extrapolation: it works, but quality varies.
- Like the base model, output quality is bounded by the **HiggsAudioV2 tokenizer** (24 kHz, ~2 kbps, speech-domain tuned). Music / drum content is not supported by design.
## License
Apache 2.0. Downstream users must also comply with the individual licenses of the training datasets:
- GTSinger: CC BY-NC-SA 4.0 (research use)
- CREMA-D: ODbL
- RAVDESS: CC BY-NC-SA 4.0
- Expresso: CC BY-NC 4.0
- LibriTTS-R: CC BY 4.0
## Acknowledgements
- [k2-fsa/OmniVoice](https://github.com/k2-fsa/OmniVoice): base model & training framework
- [HiggsAudioV2](https://huggingface.co/bosonai/higgs-audio-v2-tokenizer): discrete audio tokenizer
- Qwen team: Qwen3-0.6B backbone
- Dataset authors: GTSinger, CREMA-D, RAVDESS, Expresso, LibriTTS-R teams