---
language:
- hi
- bn
- mr
- te
- kn
- mai
- as
- brx
- doi
- gu
- ml
- pa
- ta
- ne
- sa
- sat
- sd
- or
- mni
- ks
- kok
- ur
- en
base_model: Aratako/MioTTS-0.6B
library_name: transformers
model_name: Indic-Mio
pipeline_tag: text-to-speech
tags:
- speech
- tts
- voice
datasets:
- ai4bharat/Rasa
- mythicinfinity/libritts_r
- ylacombe/expresso
widget:
- text: >-
    प्लान तो बढ़िया है, but wait... Have you checked the hotel bookings? Last
    minute पे रूम मिलना is next to impossible on weekends.
  output:
    url: samples/sample1.wav
- text: >-
    The rain hammered against the cold glass as Detective Morgan slammed the
    folder onto the table. 'I know you were there that night,' she said, her
    voice barely above a whisper. 'The question is — what did you see?'
  output:
    url: samples/sample2.wav
- text: >-
    જ્યારે પણ મને તેની સખત જરૂર હોય ત્યારે આ દુકાનમાં મદદ કરવા માટે ક્યારેય કોઈ
    હાજર નથી હોતું. <disgust>
  output:
    url: samples/sample3.wav
- text: இந்த கோயில்லயா உங்க கல்யாணம் நடந்துச்சு. <surprise>
  output:
    url: samples/sample4.wav
license: apache-2.0
---

# Model Card for Indic-Mio

<b>Indic-Mio</b> is an open-source Text-to-Speech (TTS) model that supports all <b>22 scheduled Indian languages and English</b>. It produces high-quality, natural-sounding speech at <b>44 kHz</b> with a real-time factor (RTF) of less than <b>0.1</b>. Zero-shot voice cloning is supported via speaker embeddings in the codec. The model also works well for code-mixed sentences.
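
As a rough illustration of the RTF figure above: RTF is conventionally the time spent synthesizing divided by the duration of the audio produced, so a value below 0.1 means ten seconds of speech take under one second to generate. The helper name and the timings below are hypothetical and are not measurements of this model.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical timings, for illustration only (not measured results).
rtf = real_time_factor(synthesis_seconds=0.8, audio_seconds=10.0)
print(f"RTF = {rtf:.2f}")
```

To measure it on your own hardware, time the full text-to-waveform call and divide by the length of the resulting audio in seconds.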

This model is a fine-tuned version of [Aratako/MioTTS-0.6B](https://huggingface.co/Aratako/MioTTS-0.6B), which uses [MioCodec](https://huggingface.co/Aratako/MioCodec-25Hz-24kHz) for speech tokenization.
<!-- It has been trained using Transformers, Unsloth and [TRL](https://github.com/huggingface/trl). -->


## Prompting

For emotion and style control, place the tags <b>at the end</b> of the sentence.

For example: `मुझे यह फिल्म बहुत पसंद आई! <happy>` or `I am not sure if I can do this. <confused>`

Tags for Indian languages: `<happy>`, `<sad>`, `<angry>`, `<disgust>`, `<fear>`, `<surprise>` <br>
Tags for English: `<happy>`, `<sad>`, `<enunciated>`, `<confused>`, `<angry>`, `<whisper>`

A word can be stressed by wrapping it in asterisks (`*`). For example: `No! I could *never* do it!`
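
The prompting rules above can be captured in a small helper. This is only a sketch: `build_prompt` and its parameters are hypothetical names, not part of the model or any library, and the tag sets are copied from the lists above.

```python
INDIC_TAGS = {"happy", "sad", "angry", "disgust", "fear", "surprise"}
ENGLISH_TAGS = {"happy", "sad", "enunciated", "confused", "angry", "whisper"}


def build_prompt(text, tag=None, english=False, stress=()):
    """Return `text` with optional stressed words and a trailing emotion tag."""
    allowed = ENGLISH_TAGS if english else INDIC_TAGS
    if tag is not None and tag not in allowed:
        raise ValueError(f"unsupported tag {tag!r}; choose from {sorted(allowed)}")
    for word in stress:
        # Plain substring replacement: wraps only the first occurrence,
        # and does not check word boundaries.
        text = text.replace(word, f"*{word}*", 1)
    text = text.strip()
    # The tag always goes at the end of the sentence.
    return f"{text} <{tag}>" if tag else text


print(build_prompt("No! I could never do it!", tag="angry", english=True, stress=["never"]))
# prints: No! I could *never* do it! <angry>
```

Validating the tag up front avoids sending a tag the model was not trained on for that language group, such as `<whisper>` with an Indian-language sentence.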

## Inference

<b>Approach 1: With MioTTS-Inference (recommended)</b>

Install [vllm](https://github.com/vllm-project/vllm) and set up [MioTTS-Inference](https://github.com/Aratako/MioTTS-Inference).

```bash
vllm serve SPRINGLab/Indic-Mio --gpu-memory-utilization 0.5
```

```bash
cd MioTTS-Inference
python run_server.py
```

```bash
python run_gradio.py
```

<b>Approach 2: Directly with Transformers</b>

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from miocodec import MioCodec
import soundfile as sf
import torch

model_name = "SPRINGLab/Indic-Mio"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
)

# Wrap the input text in the model's chat template.
text = "नमस्ते, आप कैसे हैं?"
messages = [{"role": "user", "content": text}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,  # enable sampling so temperature and top_p are applied
    temperature=0.9,
    top_p=0.9,
)

# Drop the prompt tokens, then keep only speech tokens and shift them to codec indices.
generated = output[0][inputs["input_ids"].shape[1]:]
speech_offset = 151669
audio_codes = [t.item() - speech_offset for t in generated
               if speech_offset <= t.item() < speech_offset + 12800]

# Decode the codec indices into a waveform with MioCodec.
codec = MioCodec.from_pretrained("Aratako/MioCodec-25Hz-24kHz")
codes_tensor = torch.tensor([audio_codes], dtype=torch.long).unsqueeze(0)  # [1, 1, T]
wav = codec.decode(codes_tensor)  # -> [1, 1, num_samples]

sf.write("output.wav", wav.squeeze().cpu().numpy(), 44100)
```
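
The token-filtering step in the snippet above can be tried in isolation, without loading the model. The sketch below uses the same offset and range as that snippet. The helper names are hypothetical, and the 25 frames-per-second rate is an assumption read off the "25Hz" in the codec name, not something the model card states directly.

```python
SPEECH_OFFSET = 151669   # first speech-token id, as in the snippet above
CODEBOOK_SIZE = 12800    # number of speech-token ids, as in the snippet above
FRAME_RATE_HZ = 25       # assumption: taken from "25Hz" in the codec name


def extract_audio_codes(token_ids):
    """Keep ids inside the speech range and shift them to start at 0."""
    return [
        t - SPEECH_OFFSET
        for t in token_ids
        if SPEECH_OFFSET <= t < SPEECH_OFFSET + CODEBOOK_SIZE
    ]


def estimated_seconds(num_codes):
    """Approximate audio length, assuming one code per codec frame."""
    return num_codes / FRAME_RATE_HZ


# Made-up token ids: two text tokens, three in-range speech tokens,
# and one id just past the end of the speech range.
token_ids = [42, 151669, 151670, 164468, 164469, 7]
codes = extract_audio_codes(token_ids)
print(codes)                    # prints: [0, 1, 12799]
print(estimated_seconds(1024))  # prints: 40.96
```

Under that frame-rate assumption, `max_new_tokens=1024` caps a single generation at roughly 41 seconds of audio, so longer texts would need to be split into chunks.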

## Training

This model was trained on a single NVIDIA A6000 ADA GPU in less than 6 hours.

For Indian languages, the IndicTTS, Rasa and Syspin datasets were used. For American English, LibriTTS and Expresso were used, and for Indian English, the SPICOR dataset.

## Fine-tuning

This model is robust yet flexible. You can fine-tune it on your own dataset for better performance on specific languages, accents, speakers, styles or emotions. Just a few steps of LoRA fine-tuning can significantly improve the performance for your target task.

## Citations

If you use this model, please cite this Hugging Face repository as follows:

```bibtex
@misc{indic-mio-tts,
  title={Indic-Mio TTS},
  author={Advait Joglekar},
  year={2026},
  publisher = {Hugging Face},
  howpublished={\url{https://huggingface.co/SPRINGLab/Indic-Mio}},
}
```