---
language:
- hi
- bn
- mr
- te
- kn
- mai
- as
- brx
- doi
- gu
- ml
- pa
- ta
- ne
- sa
- sat
- sd
- or
- mni
- ks
- kok
- ur
- en
base_model: Aratako/MioTTS-0.6B
library_name: transformers
model_name: Indic-Mio
pipeline_tag: text-to-speech
tags:
- speech
- tts
- voice
datasets:
- ai4bharat/Rasa
- mythicinfinity/libritts_r
- ylacombe/expresso
widget:
- text: >-
    प्लान तो बढ़िया है, but wait... Have you checked the hotel bookings? Last
    minute पे रूम मिलना is next to impossible on weekends.
  output:
    url: samples/sample1.wav
- text: >-
    The rain hammered against the cold glass as Detective Morgan slammed the
    folder onto the table. 'I know you were there that night,' she said, her
    voice barely above a whisper. 'The question is — what did you see?'
  output:
    url: samples/sample2.wav
- text: >-
    જ્યારે પણ મને તેની સખત જરૂર હોય ત્યારે આ દુકાનમાં મદદ કરવા માટે ક્યારેય કોઈ
    હાજર નથી હોતું. <disgust>
  output:
    url: samples/sample3.wav
- text: இந்த கோயில்லயா உங்க கல்யாணம் நடந்துச்சு. <surprise>
  output:
    url: samples/sample4.wav
license: apache-2.0
---
# Model Card for Indic-Mio
<b>Indic-Mio</b> is an open-source Text-to-Speech (TTS) model that supports all <b>22 scheduled Indian languages and English</b>. It produces high-quality, natural-sounding speech at <b>44kHz</b> with a real-time factor (RTF) below <b>0.1</b>. Zero-shot voice cloning is supported via speaker embeddings in the codec, and the model also works well on code-mixed sentences.
This model is a fine-tuned version of [Aratako/MioTTS-0.6B](https://huggingface.co/Aratako/MioTTS-0.6B), which uses [MioCodec](https://huggingface.co/Aratako/MioCodec-25Hz-24kHz) for speech tokenization.
<!-- It has been trained using Transformers, Unsloth and [TRL](https://github.com/huggingface/trl). -->
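
RTF (real-time factor) is conventionally the time spent synthesizing divided by the duration of the audio produced, so a value below 0.1 means ten seconds of speech are generated in under one second. A minimal sketch of that calculation is shown here; `real_time_factor` is a hypothetical helper written for illustration and is not part of this repository.

```python
def real_time_factor(synthesis_seconds: float, num_samples: int, sample_rate: int) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


# Example: 10 s of 44.1 kHz audio (441000 samples) synthesized in 0.5 s.
rtf = real_time_factor(0.5, 441_000, 44_100)
print(f"RTF = {rtf:.3f}")
```

To measure it yourself, time the full generate-and-decode path (for example with `time.perf_counter()`) and pass the length of the resulting waveform.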
## Prompting
For emotion and style control, place the tags <b>at the end</b> of the sentence.
For example: `मुझे यह फिल्म बहुत पसंद आई! <happy>` or `I am not sure if I can do this. <confused>`
Tags for Indian languages: `<happy>`, `<sad>`, `<angry>`, `<disgust>`, `<fear>`, `<surprise>` <br>
Tags for English: `<happy>`, `<sad>`, `<enunciated>`, `<confused>`, `<angry>`, `<whisper>`
A word can be stressed by wrapping it in asterisks (`*`). For example: `No! I could *never* do it!`
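
The prompting rules above can be sketched as a small helper that appends a tag at the end of the sentence and optionally wraps one word in asterisks. `build_prompt` and its parameters are hypothetical names introduced for illustration; only the tag lists and the tag placement come from this model card.

```python
from typing import Optional

# Tag sets as listed in this model card.
INDIC_TAGS = {"happy", "sad", "angry", "disgust", "fear", "surprise"}
ENGLISH_TAGS = {"happy", "sad", "enunciated", "confused", "angry", "whisper"}


def build_prompt(
    text: str,
    tag: Optional[str] = None,
    english: bool = False,
    stress: Optional[str] = None,
) -> str:
    """Return `text` with an optional stressed word and a trailing style tag."""
    if stress is not None:
        if stress not in text:
            raise ValueError(f"{stress!r} does not occur in the text")
        # Wrap only the first occurrence of the word in asterisks.
        text = text.replace(stress, f"*{stress}*", 1)
    if tag is not None:
        allowed = ENGLISH_TAGS if english else INDIC_TAGS
        if tag not in allowed:
            raise ValueError(f"unsupported tag {tag!r}; choose from {sorted(allowed)}")
        # Tags go at the end of the sentence.
        text = f"{text.rstrip()} <{tag}>"
    return text


print(build_prompt("मुझे यह फिल्म बहुत पसंद आई!", tag="happy"))
print(build_prompt("No! I could never do it!", english=True, stress="never"))
```

Validating the tag against the language-specific list catches cases such as `<whisper>` on Hindi text, which this card lists only for English.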
## Inference
<b>Approach 1: With MioTTS-Inference (recommended)</b>
Install [vllm](https://github.com/vllm-project/vllm) and set up [MioTTS-Inference](https://github.com/Aratako/MioTTS-Inference).
```bash
vllm serve SPRINGLab/Indic-Mio --gpu-memory-utilization 0.5
```
```bash
cd MioTTS-Inference
python run_server.py
```
```bash
python run_gradio.py
```
<b>Approach 2: Directly with Transformers</b>
```python
import soundfile as sf
import torch
from miocodec import MioCodec
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "SPRINGLab/Indic-Mio"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
)

# Wrap the input text in the model's chat template.
text = "नमस्ते, आप कैसे हैं?"
messages = [{"role": "user", "content": text}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# do_sample=True is needed for temperature and top_p to take effect.
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=True,
        temperature=0.9,
        top_p=0.9,
    )

# Drop the prompt tokens and keep only the newly generated ones.
generated = output[0][inputs["input_ids"].shape[1]:].tolist()

# Speech tokens occupy the ID range [speech_offset, speech_offset + 12800);
# subtracting the offset yields the codec code indices.
speech_offset = 151669
audio_codes = [
    t - speech_offset
    for t in generated
    if speech_offset <= t < speech_offset + 12800
]

# Decode the codes to a waveform with MioCodec.
codec = MioCodec.from_pretrained("Aratako/MioCodec-25Hz-24kHz")
codes_tensor = torch.tensor([audio_codes], dtype=torch.long).unsqueeze(0)  # [1, 1, T]
wav = codec.decode(codes_tensor)  # -> [1, 1, num_samples]

# The sample rate passed here must match the codec's output rate.
sf.write("output.wav", wav.squeeze().detach().float().cpu().numpy(), 44100)
```
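
The token-filtering step in the script above can be checked in isolation, without loading the model. The sketch below reuses the offset (151669) and range size (12800) from that script. The 25 frames-per-second rate is an assumption read off the codec's name (`MioCodec-25Hz`), and `extract_audio_codes` and `estimated_duration_seconds` are hypothetical helper names.

```python
from typing import Iterable, List

SPEECH_OFFSET = 151669   # first speech-token ID, as used in the script above
NUM_SPEECH_CODES = 12800  # size of the speech-token range, as used above
FRAME_RATE_HZ = 25        # assumption: taken from the "25Hz" in the codec name


def extract_audio_codes(token_ids: Iterable[int]) -> List[int]:
    """Keep only speech tokens and shift them to zero-based codec indices."""
    upper = SPEECH_OFFSET + NUM_SPEECH_CODES
    return [t - SPEECH_OFFSET for t in token_ids if SPEECH_OFFSET <= t < upper]


def estimated_duration_seconds(num_codes: int) -> float:
    """Rough audio length implied by the number of codec frames."""
    return num_codes / FRAME_RATE_HZ


# Text or control tokens (IDs outside the speech range) are dropped.
tokens = [151668, 151669, 151670, 164468, 164469]
codes = extract_audio_codes(tokens)
print(codes)
print(estimated_duration_seconds(len(codes)))
```

Under the same 25 Hz assumption, `max_new_tokens=1024` would cap a single generation at roughly 40 seconds of audio, so longer texts may need to be split into sentences.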
## Training
This model was trained on a single NVIDIA A6000 ADA GPU in less than 6 hours.
For Indian languages, the IndicTTS, Rasa and Syspin datasets were used. For American English, LibriTTS and Expresso were used, and for Indian English, the SPICOR dataset.
## Fine-tuning
This model is robust yet flexible. You can fine-tune it on your own dataset for better performance on specific languages, accents, speakers, styles or emotions. Just a few steps of LoRA fine-tuning can significantly improve the performance for your target task.
## Citations
If you use this model, please cite this Hugging Face repository as follows:
```bibtex
@misc{indic-mio-tts,
title={Indic-Mio TTS},
author={Advait Joglekar},
year={2026},
publisher = {Hugging Face},
howpublished={\url{https://huggingface.co/SPRINGLab/Indic-Mio}},
}
```