	---
	language:
	- hi
	- bn
	- mr
	- te
	- kn
	- mai
	- as
	- brx
	- doi
	- gu
	- ml
	- pa
	- ta
	- ne
	- sa
	- sat
	- sd
	- or
	- mni
	- ks
	- kok
	- ur
	- en
	base_model: Aratako/MioTTS-0.6B
	library_name: transformers
	model_name: Indic-Mio
	pipeline_tag: text-to-speech
	tags:
	- speech
	- tts
	- voice
	datasets:
	- ai4bharat/Rasa
	- mythicinfinity/libritts_r
	- ylacombe/expresso
	widget:
	- text: >-
	    प्लान तो बढ़िया है, but wait... Have you checked the hotel bookings? Last
	    minute पे रूम मिलना is next to impossible on weekends.
	  output:
	    url: samples/sample1.wav
	- text: >-
	    The rain hammered against the cold glass as Detective Morgan slammed the
	    folder onto the table. 'I know you were there that night,' she said, her
	    voice barely above a whisper. 'The question is — what did you see?'
	  output:
	    url: samples/sample2.wav
	- text: >-
	    જ્યારે પણ મને તેની સખત જરૂર હોય ત્યારે આ દુકાનમાં મદદ કરવા માટે ક્યારેય કોઈ
	    હાજર નથી હોતું. <disgust>
	  output:
	    url: samples/sample3.wav
	- text: இந்த கோயில்லயா உங்க கல்யாணம் நடந்துச்சு. <surprise>
	  output:
	    url: samples/sample4.wav
	license: apache-2.0
	---

	# Model Card for Indic-Mio

	<b>Indic-Mio</b> is an open-source text-to-speech (TTS) model that supports all <b>22 scheduled Indian languages and English</b>. It produces high-quality, natural-sounding speech at <b>44.1 kHz</b> with a real-time factor (RTF) below <b>0.1</b>. Zero-shot voice cloning is supported via speaker embeddings in the codec, and the model also handles code-mixed sentences well.

	This model is a fine-tuned version of [Aratako/MioTTS-0.6B](https://huggingface.co/Aratako/MioTTS-0.6B) which uses [MioCodec](https://huggingface.co/Aratako/MioCodec-25Hz-24kHz) for speech tokenization.
	<!-- It has been trained using Transformers, Unsloth and [TRL](https://github.com/huggingface/trl). -->


	## Prompting

	For emotion and style control, place the tags <b>at the end</b> of the sentence.

	For example: `मुझे यह फिल्म बहुत पसंद आई! <happy>` or `I am not sure if I can do this. <confused>`

	Tags for Indian languages: `<happy>`, `<sad>`, `<angry>`, `<disgust>`, `<fear>`, `<surprise>` <br>
	Tags for English: `<happy>`, `<sad>`, `<enunciated>`, `<confused>`, `<angry>`, `<whisper>`

	A word can be stressed by wrapping it in asterisks (`*`). For example: `No! I could *never* do it!`
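
A minimal sketch of assembling such a prompt in Python. `build_prompt` and the two tag sets are hypothetical helpers written for this example and are not part of the model's API; the tag names are the ones listed above. Which tag set applies to code-mixed text is not stated here, so the caller chooses via the `english` flag.

```python
# Hypothetical helper, not part of Indic-Mio or MioTTS-Inference.
INDIC_TAGS = {"happy", "sad", "angry", "disgust", "fear", "surprise"}
ENGLISH_TAGS = {"happy", "sad", "enunciated", "confused", "angry", "whisper"}


def build_prompt(text, tag=None, english=False, stress=None):
    """Return text with an optional stressed word and a trailing style tag."""
    if stress:
        # Wrap only the first occurrence of the word in asterisks.
        text = text.replace(stress, f"*{stress}*", 1)
    if tag is None:
        return text
    allowed = ENGLISH_TAGS if english else INDIC_TAGS
    if tag not in allowed:
        raise ValueError(f"unsupported tag {tag!r}; expected one of {sorted(allowed)}")
    # The tag goes at the end of the sentence.
    return f"{text.rstrip()} <{tag}>"


print(build_prompt("I am not sure if I can do this.", tag="confused", english=True))
print(build_prompt("No! I could never do it!", stress="never"))
```

The returned string can be passed as the input text in either inference approach below.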

	## Inference

	<b>Approach 1: With MioTTS-Inference (recommended)</b>

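	Install [vLLM](https://github.com/vllm-project/vllm) and set up [MioTTS-Inference](https://github.com/Aratako/MioTTS-Inference). Then run the commands below, each in its own terminal, since the server processes keep running in the foreground.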

	```bash
	vllm serve SPRINGLab/Indic-Mio --gpu-memory-utilization 0.5
	```

	```bash
	cd MioTTS-Inference
	python run_server.py
	```

	```bash
	python run_gradio.py
	```
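
To confirm that the vLLM server from the first command is actually serving the model, one option is to query its OpenAI-compatible `/v1/models` endpoint. The sketch below assumes vLLM's default address `http://localhost:8000`; `served_model_ids` and `fetch_models` are hypothetical helper names, and nothing here uses the MioTTS-Inference API.

```python
import json
import urllib.request


def served_model_ids(payload):
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_models(base_url="http://localhost:8000"):
    """Query the server; the default URL is an assumption (vLLM's default port)."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
        return served_model_ids(json.load(resp))


if __name__ == "__main__":
    try:
        print(fetch_models())
    except (OSError, ValueError) as exc:  # connection errors or a non-JSON reply
        print(f"vLLM server not reachable: {exc}")
```

If the server was started on a different host or port, pass that address as `base_url`.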

	<b>Approach 2: Directly with Transformers</b>

	```python
	import soundfile as sf
	import torch
	from miocodec import MioCodec
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_name = "SPRINGLab/Indic-Mio"
	tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(
	    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
	)

	text = "नमस्ते, आप कैसे हैं?"
	messages = [{"role": "user", "content": text}]
	prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
	output = model.generate(
	    **inputs,
	    max_new_tokens=1024,
	    do_sample=True,  # temperature and top_p only take effect when sampling
	    temperature=0.9,
	    top_p=0.9,
	)

	# Keep only the newly generated tokens, then map speech tokens to codec indices.
	generated = output[0][inputs["input_ids"].shape[1]:]
	speech_offset = 151669  # ID of the first speech token in the vocabulary
	audio_codes = [
	    t.item() - speech_offset
	    for t in generated
	    if speech_offset <= t.item() < speech_offset + 12800
	]

	# Decode the codec indices to a waveform with MioCodec.
	codec = MioCodec.from_pretrained("Aratako/MioCodec-25Hz-24kHz")
	codes_tensor = torch.tensor([audio_codes], dtype=torch.long).unsqueeze(0)  # [1, 1, T]
	wav = codec.decode(codes_tensor)  # -> [1, 1, num_samples]

	sf.write("output.wav", wav.squeeze().detach().float().cpu().numpy(), 44100)
	```
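
The token filtering step in the snippet above can be isolated and checked without loading the model. This sketch reuses the offset `151669` and the range size `12800` from that snippet; the rate of 25 codes per second is an assumption read off the codec name (`MioCodec-25Hz`), and both function names are hypothetical.

```python
SPEECH_OFFSET = 151669    # first speech token ID, from the snippet above
NUM_SPEECH_CODES = 12800  # size of the speech token range, from the snippet above
FRAME_RATE_HZ = 25        # assumption: taken from the codec name "MioCodec-25Hz"


def extract_audio_codes(token_ids):
    """Keep only speech tokens and shift them to codec indices in [0, 12800)."""
    return [
        t - SPEECH_OFFSET
        for t in token_ids
        if SPEECH_OFFSET <= t < SPEECH_OFFSET + NUM_SPEECH_CODES
    ]


def estimated_duration_seconds(num_codes, frame_rate_hz=FRAME_RATE_HZ):
    """Approximate audio length if each code covers 1 / frame_rate_hz seconds."""
    return num_codes / frame_rate_hz


# Two out-of-range tokens (ignored) and three speech tokens.
tokens = [42, 151669, 151670, 164468, 164469]
codes = extract_audio_codes(tokens)
print(codes)                             # [0, 1, 12799]
print(estimated_duration_seconds(1024))  # 40.96
```

Under that frame-rate assumption, `max_new_tokens=1024` caps a single generation at roughly 41 seconds of audio, so longer texts would need to be split into chunks or given a higher token limit.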

	## Training

	This model was trained on a single NVIDIA A6000 ADA GPU in less than 6 hours.

	For Indian languages, the IndicTTS, Rasa and Syspin datasets were used. For American English, LibriTTS and Expresso were used, and for Indian English, the SPICOR dataset.

	## Fine-tuning

	This model is robust yet flexible. You can fine-tune it on your own dataset for better performance on specific languages, accents, speakers, styles or emotions. Just a few steps of LoRA fine-tuning can significantly improve the performance for your target task.

	## Citations

	If you use this model, please cite this Hugging Face repository as follows:

	```bibtex
	@misc{indic-mio-tts,
	title={Indic-Mio TTS},
	author={Advait Joglekar},
	year={2026},
	publisher = {Hugging Face},
	howpublished={\url{https://huggingface.co/SPRINGLab/Indic-Mio}},
	}
	```