---
language:
- hi
- bn
- mr
- te
- kn
- mai
- as
- brx
- doi
- gu
- ml
- pa
- ta
- ne
- sa
- sat
- sd
- or
- mni
- ks
- kok
- ur
- en
base_model: Aratako/MioTTS-0.6B
library_name: transformers
model_name: Indic-Mio
pipeline_tag: text-to-speech
tags:
- speech
- tts
- voice
datasets:
- ai4bharat/Rasa
- mythicinfinity/libritts_r
- ylacombe/expresso
widget:
- text: >-
    प्लान तो बढ़िया है, but wait... Have you checked the hotel bookings? Last
    minute पे रूम मिलना is next to impossible on weekends.
  output:
    url: samples/sample1.wav
- text: >-
    The rain hammered against the cold glass as Detective Morgan slammed the
    folder onto the table. 'I know you were there that night,' she said, her
    voice barely above a whisper. 'The question is — what did you see?'
  output:
    url: samples/sample2.wav
- text: >-
    જ્યારે પણ મને તેની સખત જરૂર હોય ત્યારે આ દુકાનમાં મદદ કરવા માટે ક્યારેય કોઈ
    હાજર નથી હોતું. <disgust>
  output:
    url: samples/sample3.wav
- text: இந்த கோயில்லயா உங்க கல்யாணம் நடந்துச்சு. <surprise>
  output:
    url: samples/sample4.wav
license: apache-2.0
---
# Model Card for Indic-Mio
<b>Indic-Mio</b> is an open-source Text-to-Speech (TTS) model that supports all <b>22 scheduled Indian languages and English</b>. It produces high-quality, natural-sounding speech at <b>44kHz</b> with a real-time factor (RTF) below <b>0.1</b>. Zero-shot voice cloning is supported via speaker embeddings in the codec, and the model also handles code-mixed sentences well.
This model is a fine-tuned version of [Aratako/MioTTS-0.6B](https://huggingface.co/Aratako/MioTTS-0.6B), which uses [MioCodec](https://huggingface.co/Aratako/MioCodec-25Hz-24kHz) for speech tokenization.
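For readers unfamiliar with the metric, the real-time factor quoted above is the time spent synthesizing divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below only illustrates that arithmetic; the function name and the timing numbers are hypothetical, not measurements of this model.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical example: 0.8 s of compute to produce a 10 s clip.
rtf = real_time_factor(0.8, 10.0)
print(f"RTF = {rtf:.2f}")  # RTF = 0.08
```

An RTF below 0.1 therefore means a 10-second clip would take under one second to generate on the hardware used for the measurement.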
<!-- It has been trained using Transformers, Unsloth and [TRL](https://github.com/huggingface/trl). -->
## Prompting
For emotion and style control, place the tags <b>at the end</b> of the sentence.
For example: `मुझे यह फिल्म बहुत पसंद आई! <happy>` or `I am not sure if I can do this. <confused>`
Tags for Indian languages: `<happy>`, `<sad>`, `<angry>`, `<disgust>`, `<fear>`, `<surprise>` <br>
Tags for English: `<happy>`, `<sad>`, `<enunciated>`, `<confused>`, `<angry>`, `<whisper>`
A word can be stressed by wrapping it in asterisks (`*`). For example: `No! I could *never* do it!`
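The prompting rules above can be wrapped in a small helper. This is a minimal sketch: `build_prompt`, its parameters and the two tag sets are hypothetical names introduced here for illustration, and the tag lists are copied from this section.

```python
# Tag lists as given in the Prompting section.
INDIC_TAGS = {"happy", "sad", "angry", "disgust", "fear", "surprise"}
ENGLISH_TAGS = {"happy", "sad", "enunciated", "confused", "angry", "whisper"}


def build_prompt(text, tag=None, english=False, stress=None):
    """Return text with an optional stressed word and a trailing emotion tag."""
    allowed = ENGLISH_TAGS if english else INDIC_TAGS
    if stress is not None:
        if stress not in text:
            raise ValueError(f"{stress!r} does not occur in the text")
        # Wrap only the first occurrence of the word in asterisks.
        text = text.replace(stress, f"*{stress}*", 1)
    if tag is not None:
        if tag not in allowed:
            raise ValueError(f"unsupported tag {tag!r}; choose from {sorted(allowed)}")
        # Tags go at the end of the sentence.
        text = f"{text.rstrip()} <{tag}>"
    return text


print(build_prompt("I am not sure if I can do this.", tag="confused", english=True))
print(build_prompt("No! I could never do it!", english=True, stress="never"))
```

Validating the tag against the language group catches mistakes such as using `<whisper>` with an Indian-language sentence before the text reaches the model.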
## Inference
<b>Approach 1: With MioTTS-Inference (recommended)</b>
Install [vLLM](https://github.com/vllm-project/vllm) and set up [MioTTS-Inference](https://github.com/Aratako/MioTTS-Inference). Then serve the model with vLLM, start the inference server, and launch the Gradio demo:
```bash
vllm serve SPRINGLab/Indic-Mio --gpu-memory-utilization 0.5
```
```bash
cd MioTTS-Inference
python run_server.py
```
```bash
python run_gradio.py
```
<b>Approach 2: Directly with Transformers</b>
```python
import soundfile as sf
import torch
from miocodec import MioCodec
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "SPRINGLab/Indic-Mio"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
)

# Wrap the input text in the model's chat template.
text = "नमस्ते, आप कैसे हैं?"
messages = [{"role": "user", "content": text}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sampling must be enabled for temperature and top_p to take effect.
output = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.9,
    top_p=0.9,
)

# Keep only the newly generated tokens, then map speech-token ids to codec indices.
generated = output[0][inputs["input_ids"].shape[1]:]
speech_offset = 151669
audio_codes = [
    t.item() - speech_offset
    for t in generated
    if speech_offset <= t.item() < speech_offset + 12800
]

# Decode the codec indices to a waveform with MioCodec and save it.
codec = MioCodec.from_pretrained("Aratako/MioCodec-25Hz-24kHz")
codes_tensor = torch.tensor([audio_codes], dtype=torch.long).unsqueeze(0)  # [1, 1, T]
wav = codec.decode(codes_tensor)  # -> [1, 1, num_samples]
sf.write("output.wav", wav.squeeze().detach().float().cpu().numpy(), 44100)
```
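The token filtering step in the snippet above can be checked in isolation, without loading the model. The sketch below reuses the offset (`151669`) and range size (`12800`) from that snippet. The 25 Hz frame rate is an assumption read off the codec name `MioCodec-25Hz`, and it further assumes one speech token per frame. The function names are hypothetical.

```python
SPEECH_OFFSET = 151669  # first speech-token id, taken from the snippet above
CODEBOOK_SIZE = 12800   # number of speech-token ids, taken from the snippet above
FRAME_RATE_HZ = 25      # assumption: inferred from the codec name "MioCodec-25Hz"


def extract_audio_codes(token_ids):
    """Keep ids inside the speech-token range and shift them to start at 0."""
    upper = SPEECH_OFFSET + CODEBOOK_SIZE
    return [t - SPEECH_OFFSET for t in token_ids if SPEECH_OFFSET <= t < upper]


def estimated_seconds(num_codes):
    """Approximate audio duration, assuming one code per 25 Hz frame."""
    return num_codes / FRAME_RATE_HZ


# Ids below the offset (text or special tokens) and at or above the upper bound are dropped.
ids = [42, 151669, 151670, 164468, 164469]
print(extract_audio_codes(ids))  # [0, 1, 12799]
print(estimated_seconds(1024))   # 40.96
```

Under these assumptions, `max_new_tokens=1024` caps a single generation at roughly 41 seconds of audio, so longer texts would need to be split into shorter segments.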
## Training
This model was trained on a single NVIDIA A6000 ADA GPU in less than 6 hours.
For Indian languages, the IndicTTS, Rasa and Syspin datasets were used. For American English, LibriTTS and Expresso were used, and for Indian English, the SPICOR dataset.
## Fine-tuning
This model is robust yet flexible. You can fine-tune it on your own dataset for better performance on specific languages, accents, speakers, styles or emotions. Just a few steps of LoRA fine-tuning can significantly improve the performance for your target task.
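As a starting point for the LoRA route mentioned above, the sketch below attaches adapters with the `peft` library. Everything specific here is an assumption rather than the author's recipe: the hyperparameter values are illustrative, the target module names assume the usual Qwen3 attention projection names, and `attach_lora` is a hypothetical helper. Preparing training data (text paired with speech tokens) is not covered.

```python
# Illustrative hyperparameters only; none of these values come from the model card.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    # Assumption: attention projection names used by Qwen3-style models.
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",
}


def attach_lora(model_name="SPRINGLab/Indic-Mio"):
    """Load the base model and wrap it with trainable LoRA adapters."""
    # Imported lazily so the configuration above can be inspected
    # without peft or transformers installed.
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(model_name)
    peft_model = get_peft_model(model, LoraConfig(**LORA_KWARGS))
    peft_model.print_trainable_parameters()  # reports how few weights are trained
    return peft_model
```

Calling `attach_lora()` downloads the model weights; the returned model can then be passed to a standard causal-LM training loop.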
## Citations
If you use this model, please cite this Hugging Face repository as follows:
```bibtex
@misc{indic-mio-tts,
  title={Indic-Mio TTS},
  author={Advait Joglekar},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/SPRINGLab/Indic-Mio}},
}
```