---
language:
- en
- zh
- ja
- ko
- es
- fr
- de
- it
- ru
- hi
- gu
library_name: transformers
pipeline_tag: text-to-speech
license: apache-2.0
base_model: k2-fsa/OmniVoice
tags:
- text-to-speech
- tts
- singing
- emotion
- expressive-tts
- multilingual
- voice-cloning
- omnivoice
---

# OmniVoice: Singing + Emotion Finetune

A finetune of [`k2-fsa/OmniVoice`](https://huggingface.co/k2-fsa/OmniVoice) that adds:

- **`[singing]` tag**: sung speech / nursery-style melodic vocals
- **Emotion tags**: `[happy]`, `[sad]`, `[angry]`, `[excited]`, `[calm]`, `[nervous]`, `[whisper]`
- **Combined tags**: e.g. `[singing] [happy] ...` or `[singing] [sad] ...`

Original OmniVoice capabilities (multilingual zero-shot TTS, voice cloning, voice design, 600+ languages) are **preserved**: the base speech head was protected during finetuning with a continuity mix of plain speech and singing.

## Drop-in replacement

This checkpoint is fully compatible with the upstream [k2-fsa/OmniVoice](https://github.com/k2-fsa/OmniVoice) code: same architecture (Qwen3-0.6B LM + HiggsAudioV2 audio tokenizer at 24 kHz), same inference API. Just replace the model id:

```python
from omnivoice.models.omnivoice import OmniVoice

model = OmniVoice.from_pretrained("ModelsLab/omnivoice-singing").to("cuda").eval()

# Normal speech (unchanged behavior)
audios = model.generate(
    text="The quick brown fox jumps over the lazy dog.",
    language="English",
)

# Singing
audios = model.generate(
    text="[singing] Twinkle twinkle little star, how I wonder what you are.",
    language="English",
)

# Emotional speech
audios = model.generate(
    text="[happy] I just got the best news of my entire year!",
    language="English",
)

# Combined
audios = model.generate(
    text="[singing] [sad] Quiet rain falls on the stone, memories of days now gone.",
    language="English",
)

import soundfile as sf
sf.write("out.wav", audios[0], model.sampling_rate)
```

CLI works the same way:

```bash
omnivoice-infer --model ModelsLab/omnivoice-singing \
    --text "[happy] Hello there, how wonderful to see you today!" \
    --language English \
    --output out.wav
```

## Supported tags

| Tag | Source data | Strength |
|---|---|---|
| `[singing]` | GTSinger English (6,755 clips, ~8 h) | strong |
| `[happy]` | CREMA-D + RAVDESS + Expresso (~2900 clips) | strong |
| `[sad]` | CREMA-D + RAVDESS + Expresso (~2900 clips) | strong |
| `[angry]` | CREMA-D + RAVDESS (~1500 clips) | strong |
| `[nervous]` | CREMA-D fear + RAVDESS fearful (~1400 clips) | strong |
| `[whisper]` | Expresso whisper (~1500 clips) | strong |
| `[calm]` | RAVDESS calm (~190 clips) | weak (limited data) |
| `[excited]` | RAVDESS surprised (~190 clips) | weak (limited data) |

A guidance scale of **3.0** (up from the default 2.0) is recommended to make tag behavior more pronounced:

```python
audios = model.generate(
    text="[happy] Welcome!",
    language="English",
    guidance_scale=3.0,
)
```
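Because a misspelled tag is passed through silently as literal text, it can help to validate tags before calling the model. The following is a small hypothetical helper (not part of the OmniVoice API) that checks tags against the table above and builds the prompt prefix:

```python
# Hypothetical helper (not part of the OmniVoice API): validates style tags
# against the set documented in the table above and prepends them to the text.
SUPPORTED_TAGS = {
    "singing", "happy", "sad", "angry",
    "excited", "calm", "nervous", "whisper",
}

def tagged(text: str, *tags: str) -> str:
    unknown = [t for t in tags if t not in SUPPORTED_TAGS]
    if unknown:
        raise ValueError(f"unknown tags: {unknown}")
    return "".join(f"[{t}] " for t in tags) + text

print(tagged("Twinkle twinkle little star.", "singing", "sad"))
# -> [singing] [sad] Twinkle twinkle little star.
```

The resulting string can be passed directly as the `text` argument of `model.generate`.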

## What's preserved from the base

- Multilingual TTS (English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Russian, Hindi, Gujarati, etc.)
- Voice cloning from reference audio (`ref_audio` / `ref_text` args)
- Voice design via `instruct` parameter (pitch / gender / age / accent attributes)
- Fine-grained pronunciation control (pinyin / CMU phoneme overrides)
- Speed and duration control (`speed` / `duration` args)
- Built-in non-verbal symbols (`[laughter]`, `[sigh]`, etc.)
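The cloning and voice-design arguments above can be combined with the new tags in a single `generate` call. The sketch below only assembles the keyword arguments; the argument names (`ref_audio`, `ref_text`, `instruct`, `speed`) come from the feature list above, while the file paths and instruct string are illustrative placeholders:

```python
# Sketch only: assembles generate() kwargs from the argument names listed
# above. Paths and the instruct string are placeholders, not real files.
def build_generate_kwargs(text, language="English", ref_audio=None,
                          ref_text=None, instruct=None, speed=1.0):
    kwargs = {"text": text, "language": language, "speed": speed}
    if ref_audio is not None:
        kwargs["ref_audio"] = ref_audio  # path to the reference clip
        kwargs["ref_text"] = ref_text    # transcript of that clip
    if instruct is not None:
        kwargs["instruct"] = instruct    # voice-design description
    return kwargs

kwargs = build_generate_kwargs(
    "[happy] Great to hear your voice again!",
    ref_audio="speaker.wav",
    ref_text="Hello, this is a short sample of my voice.",
)
# then: audios = model.generate(**kwargs)
```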

## Training

Two-stage finetune from `k2-fsa/OmniVoice`:

**Stage 1: Singing** (2500 steps):
- GTSinger English (6.8k clips, tagged `[singing] {lyrics}`)
- LibriTTS-R dev+test clean (10k clips, plain text, for speech preservation)
- LR 3e-5 cosine, bf16, 2 GPUs, batch_tokens=8192
- Final eval loss: **4.74**

**Stage 2: Emotion** (2500 steps, forked from singing/checkpoint-2500):
- CREMA-D + RAVDESS + Expresso read config (10.8k emotion clips)
- 1.5k singing + 1.5k speech continuity samples
- LR 3e-5 cosine, bf16, 2 GPUs, batch_tokens=8192
- Best eval loss: **4.72** (step 750); final eval loss: **4.88** (step 2500)

The published checkpoint is the **final emotion-stage step 2500**, which, despite its higher eval loss, subjectively produces the cleanest emotional tag behavior while preserving speech and singing quality.

## Known limitations

- `[calm]` and `[excited]` had only ~190 training samples each (only one dataset contributed), so their behavior is weaker than that of the other emotion tags.
- Cross-language singing (sung Hindi, Gujarati, etc.) is extrapolation: it works, but quality varies.
- Like the base model, output quality is bounded by the **HiggsAudioV2 tokenizer** (24 kHz, ~2 kbps, speech-domain tuned). Music / drum content is not supported by design.

## License

Apache 2.0. Downstream users must also comply with the individual licenses of the training datasets:
- GTSinger: CC BY-NC-SA 4.0 (research use)
- CREMA-D: ODbL
- RAVDESS: CC BY-NC-SA 4.0
- Expresso: CC BY-NC 4.0
- LibriTTS-R: CC BY 4.0

## Acknowledgements

- [k2-fsa/OmniVoice](https://github.com/k2-fsa/OmniVoice): base model & training framework
- [HiggsAudioV2](https://huggingface.co/bosonai/higgs-audio-v2-tokenizer): discrete audio tokenizer
- Qwen team: Qwen3-0.6B backbone
- Dataset authors: GTSinger, CREMA-D, RAVDESS, Expresso, LibriTTS-R teams