---
language:
- en
- zh
- ja
- ko
- es
- fr
- de
- it
- ru
- hi
- gu
library_name: transformers
pipeline_tag: text-to-speech
license: apache-2.0
base_model: k2-fsa/OmniVoice
tags:
- text-to-speech
- tts
- singing
- emotion
- expressive-tts
- multilingual
- voice-cloning
- omnivoice
---
# OmniVoice: Singing + Emotion Finetune
A finetune of [`k2-fsa/OmniVoice`](https://huggingface.co/k2-fsa/OmniVoice) that adds:
- **`[singing]` tag**: sung speech / nursery-style melodic vocals
- **Emotion tags**: `[happy]`, `[sad]`, `[angry]`, `[excited]`, `[calm]`, `[nervous]`, `[whisper]`
- **Combined tags**: e.g. `[singing] [happy] ...` or `[singing] [sad] ...`
Original OmniVoice capabilities (multilingual zero-shot TTS, voice cloning, voice design, 600+ languages) are **preserved**: the base speech head was protected during finetuning with a continuity mix of plain speech and singing.
## Drop-in replacement
This checkpoint is fully compatible with the upstream [k2-fsa/OmniVoice](https://github.com/k2-fsa/OmniVoice) code: same architecture (Qwen3-0.6B LM + HiggsAudioV2 audio tokenizer at 24 kHz), same inference API. Replace the model id:
```python
import soundfile as sf

from omnivoice.models.omnivoice import OmniVoice

model = OmniVoice.from_pretrained("ModelsLab/omnivoice-singing").to("cuda").eval()

# Normal speech (unchanged behavior)
audios = model.generate(
    text="The quick brown fox jumps over the lazy dog.",
    language="English",
)

# Singing
audios = model.generate(
    text="[singing] Twinkle twinkle little star, how I wonder what you are.",
    language="English",
)

# Emotional speech
audios = model.generate(
    text="[happy] I just got the best news of my entire year!",
    language="English",
)

# Combined
audios = model.generate(
    text="[singing] [sad] Quiet rain falls on the stone, memories of days now gone.",
    language="English",
)

# Write the first generated clip from the last call to disk.
sf.write("out.wav", audios[0], model.sampling_rate)
```
CLI works the same way:
```bash
omnivoice-infer --model ModelsLab/omnivoice-singing \
    --text "[happy] Hello there, how wonderful to see you today!" \
    --language English \
    --output out.wav
```
## Supported tags
| Tag | Source data | Strength |
|---|---|---|
| `[singing]` | GTSinger English (6,755 clips, ~8 h) | strong |
| `[happy]` | CREMA-D + RAVDESS + Expresso (~2900 clips) | strong |
| `[sad]` | CREMA-D + RAVDESS + Expresso (~2900 clips) | strong |
| `[angry]` | CREMA-D + RAVDESS (~1500 clips) | strong |
| `[nervous]` | CREMA-D fear + RAVDESS fearful (~1400 clips) | strong |
| `[whisper]` | Expresso whisper (~1500 clips) | strong |
| `[calm]` | RAVDESS calm (~190 clips) | weak (limited data) |
| `[excited]` | RAVDESS surprised (~190 clips) | weak (limited data) |
A guidance scale of **3.0** (up from the default 2.0) is recommended to make tag behavior more pronounced:
```python
audios = model.generate(
    text="[happy] Welcome!",
    language="English",
    guidance_scale=3.0,
)
```
## What's preserved from the base
- Multilingual TTS (English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Russian, Hindi, Gujarati, etc.)
- Voice cloning from reference audio (`ref_audio` / `ref_text` args); see the sketch after this list
- Voice design via `instruct` parameter (pitch / gender / age / accent attributes)
- Fine-grained pronunciation control (pinyin / CMU phoneme overrides)
- Speed and duration control (`speed` / `duration` args)
- Built-in non-verbal symbols (`[laughter]`, `[sigh]`, etc.)
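Continuing from the `model` loaded above, here is a minimal sketch of how these inherited features can be invoked. It only reuses the argument names listed in this section (`ref_audio`, `ref_text`, `instruct`, `speed`); the file path, reference transcript, and values are placeholders, and the exact semantics follow the upstream OmniVoice API.
```python
# Sketch only: argument names come from the feature list above; values are placeholders.

# Voice cloning: condition generation on a reference clip and its transcript.
audios = model.generate(
    text="[happy] It is so good to finally hear from you again!",
    language="English",
    ref_audio="reference_speaker.wav",  # path to a reference recording (placeholder)
    ref_text="This is what the reference speaker says in that recording.",
)

# Voice design: describe the target voice via the `instruct` parameter.
audios = model.generate(
    text="[singing] Row, row, row your boat, gently down the stream.",
    language="English",
    instruct="A warm, low-pitched adult voice with a calm delivery.",
)

# Speed control via the `speed` argument (placeholder value).
audios = model.generate(
    text="The quick brown fox jumps over the lazy dog.",
    language="English",
    speed=1.2,
)
```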
## Training
Two-stage finetune from `k2-fsa/OmniVoice`:
**Stage 1 - Singing** (2500 steps):
- GTSinger English (6.8k clips, tagged `[singing] {lyrics}`)
- LibriTTS-R dev+test clean (10k clips, plain text, for speech preservation)
- LR 3e-5 cosine, bf16, 2 GPUs, batch_tokens=8192
- Final eval loss: **4.74**
**Stage 2 - Emotion** (2500 steps, forked from singing/checkpoint-2500):
- CREMA-D + RAVDESS + Expresso read config (10.8k emotion clips)
- 1.5k singing + 1.5k speech continuity samples
- LR 3e-5 cosine, bf16, 2 GPUs, batch_tokens=8192
- Best eval loss: **4.72** (step 750); final eval loss: **4.88** (step 2500, the published checkpoint)
This published checkpoint is the **final emotion step 2500**, which subjectively produces the cleanest emotional tag behavior while preserving speech/singing quality.
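The snippet below is a hypothetical illustration of the text-side convention described above (a control tag prefixed to each transcript, with untagged continuity samples mixed in); it is not the actual OmniVoice training manifest or pipeline.
```python
from typing import Optional

# Hypothetical illustration of the tag-prefix convention described above;
# not the actual OmniVoice training manifest format or pipeline.
def tag_transcript(tag: Optional[str], transcript: str) -> str:
    """Prefix a transcript with a control tag, e.g. '[singing] Twinkle twinkle ...'."""
    return f"{tag} {transcript}" if tag else transcript

training_texts = [
    tag_transcript("[singing]", "Twinkle twinkle little star, how I wonder what you are."),
    tag_transcript("[sad]", "I really thought this time would be different."),
    # Untagged plain-speech samples form the continuity mix that protects the base speech head.
    tag_transcript(None, "The quick brown fox jumps over the lazy dog."),
]
```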
## Known limitations
- `[calm]` and `[excited]` had only ~190 training samples each (only one dataset contributed), so their behavior is weaker than the other emotion tags.
- Cross-language singing (sung Hindi, Gujarati, etc.) is extrapolation: it works, but quality varies.
- Like the base model, output quality is bounded by the **HiggsAudioV2 tokenizer** (24 kHz, ~2 kbps, speech-domain tuned). Music / drum content is not supported by design.
## License
Apache 2.0. Downstream users must also comply with the individual licenses of the training datasets:
- GTSinger: CC BY-NC-SA 4.0 (research use)
- CREMA-D: ODbL
- RAVDESS: CC BY-NC-SA 4.0
- Expresso: CC BY-NC 4.0
- LibriTTS-R: CC BY 4.0
## Acknowledgements
- [k2-fsa/OmniVoice](https://github.com/k2-fsa/OmniVoice): base model & training framework
- [HiggsAudioV2](https://huggingface.co/bosonai/higgs-audio-v2-tokenizer): discrete audio tokenizer
- Qwen team: Qwen3-0.6B backbone
- Dataset authors: GTSinger, CREMA-D, RAVDESS, Expresso, LibriTTS-R teams