Commit 6832e10 by rumourscape · verified · 1 Parent(s): 4a4449c

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
File without changes
README.md ADDED
@@ -0,0 +1,145 @@
+ ---
+ # All 22 scheduled Indian languages + English TTS model
+ language:
+ - hi
+ - bn
+ - mr
+ - te
+ - kn
+ - mai
+ - as
+ - brx
+ - doi
+ - gu
+ - ml
+ - pa
+ - ta
+ - ne
+ - sa
+ - sat
+ - sd
+ - or
+ - mni
+ - ks
+ - kok
+ - ur
+ - en
+ base_model: Aratako/MioTTS-0.6B
+ library_name: transformers
+ model_name: Indic-Mio
+ pipeline_tag: text-to-speech
+ tags:
+ - speech
+ - tts
+ - voice
+ license: apache-2.0
+ ---
+
+ # Model Card for Indic-Mio
+
+ <b>Indic-Mio</b> is an open-source Text-to-Speech (TTS) model that supports all <b>22 scheduled Indian languages and English</b>. It produces high-quality, natural-sounding speech at <b>44.1 kHz</b> with a real-time factor (RTF) below <b>0.1</b>. Zero-shot voice cloning is supported via speaker embeddings in the codec. The model also works well for code-mixed sentences.
+
+ This model is a fine-tuned version of [Aratako/MioTTS-0.6B](https://huggingface.co/Aratako/MioTTS-0.6B), which uses [MioCodec](https://huggingface.co/Aratako/MioCodec-25Hz-24kHz) for speech tokenization.
+ <!-- It has been trained using Transformers, Unsloth and [TRL](https://github.com/huggingface/trl). -->
+
+
+ ## Prompting
+
+ For emotion and style control, place the tag <b>at the end</b> of the sentence.
+
+ For example: `मुझे यह फिल्म बहुत पसंद आई! <happy>` or `I am not sure if I can do this. <confused>`
+
+ Tags for Indian languages: `<happy>`, `<sad>`, `<angry>`, `<disgust>`, `<fear>`, `<surprise>` <br>
+ Tags for English: `<happy>`, `<sad>`, `<enunciated>`, `<confused>`, `<angry>`, `<whisper>`
+
+ To stress a word, wrap it in asterisks (`*`). For example: `No! I could *never* do it!`
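The prompting rules above (one tag at the end of the sentence, asterisks around a stressed word) can be sketched as a small helper. This is an illustration only and not part of the repository: `build_prompt`, its parameters, and the two tag sets are hypothetical names, with the tag values copied from the lists above.

```python
# Hypothetical helper, not shipped with the model. Tag values come from the README lists.
INDIC_TAGS = {"happy", "sad", "angry", "disgust", "fear", "surprise"}
ENGLISH_TAGS = {"happy", "sad", "enunciated", "confused", "angry", "whisper"}


def build_prompt(text, tag=None, english=False, stress=()):
    """Return `text` with stressed words wrapped in asterisks and the tag appended."""
    for word in stress:
        # Wrap only the first occurrence so a repeated word is not stressed everywhere.
        text = text.replace(word, f"*{word}*", 1)
    if tag is None:
        return text
    allowed = ENGLISH_TAGS if english else INDIC_TAGS
    if tag not in allowed:
        raise ValueError(f"unsupported tag {tag!r}; expected one of {sorted(allowed)}")
    # The tag is placed at the end of the sentence, as the README requires.
    return f"{text} <{tag}>"


print(build_prompt("No! I could never do it!", tag="angry", english=True, stress=["never"]))
# No! I could *never* do it! <angry>
```

Keeping the two tag sets separate makes it harder to send an English-only tag such as `<whisper>` with an Indian-language sentence by accident.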
+
+ ## Inference
+
+ <b>Approach 1: With MioTTS-Inference (recommended)</b>
+
+ Install [vllm](https://github.com/vllm-project/vllm) and set up [MioTTS-Inference](https://github.com/Aratako/MioTTS-Inference). Then serve the model with vLLM, start the MioTTS-Inference API server, and launch the Gradio demo:
+
+ ```bash
+ vllm serve SPRINGLab/Indic-Mio --max-model-len 1024 --gpu-memory-utilization 0.5
+ ```
+
+ ```bash
+ cd MioTTS-Inference
+ MIOTTS_CODEC_MODEL=Aratako/MioCodec-25Hz-44.1kHz-v2 \
+ MIOTTS_LLM_BASE_URL=http://localhost:8000/v1 \
+ python run_server.py --host 0.0.0.0 --port 8001
+ ```
+
+ ```bash
+ GRADIO_SERVER_PORT=7861 \
+ MIOTTS_API_BASE=http://127.0.0.1:8001 \
+ python run_gradio.py
+ ```
+
+ <b>Approach 2: Directly with Transformers</b>
+
+ ```python
+ import soundfile as sf
+ import torch
+ from miocodec import MioCodec
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_name = "SPRINGLab/Indic-Mio"
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name, torch_dtype=torch.bfloat16, device_map="cuda"
+ )
+
+ # Wrap the text in the chat template so the model sees a single user turn.
+ text = "नमस्ते, आप कैसे हैं?"
+ messages = [{"role": "user", "content": text}]
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+ output = model.generate(
+     **inputs,
+     max_new_tokens=1024,
+     temperature=0.9,
+     top_p=0.9,
+ )
+
+ # Keep only the newly generated tokens, then map speech-token ids to codec codes.
+ generated = output[0][inputs["input_ids"].shape[1]:]
+ speech_offset = 151669
+ audio_codes = [t.item() - speech_offset for t in generated
+                if speech_offset <= t.item() < speech_offset + 12800]
+
+ # Decode the codes to a waveform with MioCodec.
+ # The codec below is the 44.1 kHz one from Approach 1, so it matches the sample rate passed to sf.write.
+ codec = MioCodec.from_pretrained("Aratako/MioCodec-25Hz-44.1kHz-v2")
+ codes_tensor = torch.tensor([audio_codes], dtype=torch.long).unsqueeze(0)  # [1, 1, T]
+ wav = codec.decode(codes_tensor)  # -> [1, 1, num_samples]
+
+ sf.write("output.wav", wav.squeeze().cpu().numpy(), 44100)
+ ```
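The token-to-code step in the snippet above can be tried without a GPU. The sketch below repeats the same filter on plain integers and adds a rough duration estimate. The offset and codebook size are taken from the snippet; the 25 frames-per-second rate is an assumption read off the "25Hz" in the codec name, and both function names are hypothetical.

```python
SPEECH_OFFSET = 151669  # first speech-token id, from the README snippet
CODEBOOK_SIZE = 12800   # number of speech codes, from the README snippet
FRAME_RATE_HZ = 25      # assumption: inferred from "25Hz" in the codec name


def extract_audio_codes(token_ids):
    """Drop non-speech tokens and shift speech-token ids into codec code space."""
    return [
        t - SPEECH_OFFSET
        for t in token_ids
        if SPEECH_OFFSET <= t < SPEECH_OFFSET + CODEBOOK_SIZE
    ]


def estimated_seconds(num_codes):
    """Approximate audio length if each code covers one codec frame."""
    return num_codes / FRAME_RATE_HZ


# 151645 is the EOS id from config.json and 42 is an ordinary text token; both are dropped.
generated = [151669, 151670, 151645, 164468, 42]
print(extract_audio_codes(generated))  # [0, 1, 12799]
print(estimated_seconds(1024))         # 40.96
```

Under that frame-rate assumption, `max_new_tokens=1024` caps a single generation at roughly 41 seconds of audio, so longer texts would need to be split into sentences first.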
+
+ ## Training
+
+ This model was trained on a single NVIDIA A6000 ADA GPU in less than 6 hours.
+
+ For Indian languages, the IndicTTS, Rasa and Syspin datasets were used. For American English, LibriTTS and Expresso were used, and for Indian English, the SPICOR dataset.
+
+ ## Fine-tuning
+
+ This model is robust yet flexible. You can fine-tune it on your own dataset for better performance on specific languages, accents, speakers, styles or emotions. Just a few steps of LoRA fine-tuning can significantly improve performance on your target task.
+
+ ## Citations
+
+ If you use this model, please cite this Hugging Face repository as follows:
+
+ ```bibtex
+ @misc{indic-mio-tts,
+   title={Indic-Mio TTS},
+   author={Advait Joglekar},
+   year={2026},
+   publisher={Hugging Face},
+   howpublished={\url{https://huggingface.co/SPRINGLab/Indic-Mio}},
+ }
+ ```
added_tokens.json ADDED
The diff for this file is too large to render. See raw diff
 
chat_template.jinja ADDED
@@ -0,0 +1,6 @@
+ {%- for message in messages %}
+ {{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>\n' }}
+ {%- endfor %}
+ {%- if add_generation_prompt %}
+ {{ '<|im_start|>assistant\n' }}
+ {%- endif %}
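For readers who want to see what this template produces without loading the tokenizer, here is a plain-Python approximation. `render_chat` is a hypothetical name, and the sketch assumes the template's whitespace control leaves no stray newlines between blocks, which is how Transformers is expected to render it.

```python
def render_chat(messages, add_generation_prompt=False):
    """Approximate the ChatML-style layout produced by chat_template.jinja."""
    # One block per message: start marker, role, newline, content, end marker, newline.
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so generation continues from there.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chat(
    [{"role": "user", "content": "नमस्ते, आप कैसे हैं? <happy>"}],
    add_generation_prompt=True,
)
print(repr(prompt))
# '<|im_start|>user\nनमस्ते, आप कैसे हैं? <happy><|im_end|>\n<|im_start|>assistant\n'
```

The template adds no system prompt, so the text to be spoken is the entire user turn.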
config.json ADDED
@@ -0,0 +1,61 @@
+ {
+   "architectures": [
+     "Qwen3ForCausalLM"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "dtype": "bfloat16",
+   "eos_token_id": 151645,
+   "head_dim": 128,
+   "hidden_act": "silu",
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_types": [
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention"
+   ],
+   "max_position_embeddings": 32768,
+   "max_window_layers": 28,
+   "model_type": "qwen3",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 28,
+   "num_key_value_heads": 8,
+   "pad_token_id": 151643,
+   "rms_norm_eps": 1e-06,
+   "rope_scaling": null,
+   "rope_theta": 1000000,
+   "sliding_window": null,
+   "tie_word_embeddings": true,
+   "transformers_version": "4.57.6",
+   "unsloth_version": "2026.2.1",
+   "use_cache": false,
+   "use_sliding_window": false,
+   "vocab_size": 164480
+ }
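As a sanity check on this config, the parameter count can be estimated from the listed dimensions and compared with the size of `model.safetensors` in this commit (1217825224 bytes). The sketch assumes the usual Qwen3 layer layout: bias-free attention projections, per-head RMSNorm on queries and keys, a gate/up/down MLP, and input embeddings tied to the output head. The speech-token offset and codebook size come from the README snippet, not from the config, and `count_parameters` is a hypothetical name.

```python
cfg = {
    "hidden_size": 1024,
    "intermediate_size": 3072,
    "num_hidden_layers": 28,
    "num_attention_heads": 16,
    "num_key_value_heads": 8,
    "head_dim": 128,
    "vocab_size": 164480,
}


def count_parameters(cfg):
    """Estimate the parameter count under the assumed Qwen3 layer layout."""
    h = cfg["hidden_size"]
    q_dim = cfg["num_attention_heads"] * cfg["head_dim"]     # 2048
    kv_dim = cfg["num_key_value_heads"] * cfg["head_dim"]    # 1024 (grouped-query attention)
    attention = h * q_dim + 2 * h * kv_dim + q_dim * h       # q, k, v, o projections
    mlp = 3 * h * cfg["intermediate_size"]                   # gate, up, down projections
    norms = 2 * h + 2 * cfg["head_dim"]                      # two layer norms plus q/k norms
    embeddings = cfg["vocab_size"] * h                       # counted once because weights are tied
    return embeddings + cfg["num_hidden_layers"] * (attention + mlp + norms) + h


total = count_parameters(cfg)
print(f"{total:,}")      # 608,894,976
print(f"{total * 2:,}")  # 1,217,789,952 bytes in bfloat16

# The speech-token range used in the README must fit inside the vocabulary.
print(151669 + 12800 <= cfg["vocab_size"])  # True
```

The bfloat16 estimate falls about 35 KB short of the reported file size, which is in line with the space a safetensors header would take.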
generation_config.json ADDED
@@ -0,0 +1,11 @@
+ {
+   "do_sample": true,
+   "eos_token_id": [
+     151645,
+     151643
+   ],
+   "max_length": 32768,
+   "max_new_tokens": 2048,
+   "pad_token_id": 151643,
+   "transformers_version": "4.57.6"
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:065f42f7ab6148b66f43e9be0d01ac336343dfe16161350c37b71a87c3e1981b
+ size 1217825224
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "eos_token": {
+     "content": "<|im_end|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:abcde038b87ccd029a4523b0c5cec1da6d84b4f3d68b351495df086d63033f1f
+ size 13817944
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff
 
vocab.json ADDED
The diff for this file is too large to render. See raw diff