Upload folder using huggingface_hub

Files changed (13)

.gitattributes +1 -0
.gitignore +0 -0
README.md +145 -0
added_tokens.json +0 -0
chat_template.jinja +6 -0
config.json +61 -0
generation_config.json +11 -0
merges.txt +0 -0
model.safetensors +3 -0
special_tokens_map.json +31 -0
tokenizer.json +3 -0
tokenizer_config.json +0 -0
vocab.json +0 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text

.gitignore ADDED

File without changes

README.md ADDED

	@@ -0,0 +1,145 @@

+---
+# All 22 scheduled Indian languages + English TTS model
+language:
+  - hi
+  - bn
+  - mr
+  - te
+  - kn
+  - mai
+  - as
+  - brx
+  - doi
+  - gu
+  - ml
+  - pa
+  - ta
+  - ne
+  - sa
+  - sat
+  - sd
+  - or
+  - mni
+  - ks
+  - kok
+  - ur
+  - en
+base_model: Aratako/MioTTS-0.6B
+library_name: transformers
+model_name: Indic-Mio
+pipeline_tag: text-to-speech
+tags:
+  - speech
+  - tts
+  - voice
+license: apache-2.0
+---
+# Model Card for Indic-Mio
+<b>Indic-Mio</b> is an open-source Text-to-Speech (TTS) model that supports all <b>22 scheduled Indian languages and English</b>. It produces high-quality, natural-sounding speech at <b>44.1 kHz</b> with a real-time factor (RTF) below <b>0.1</b>. Zero-shot voice cloning is supported via speaker embeddings in the codec. The model also handles code-mixed sentences well.
+This model is a fine-tuned version of [Aratako/MioTTS-0.6B](https://huggingface.co/Aratako/MioTTS-0.6B), which uses [MioCodec](https://huggingface.co/Aratako/MioCodec-25Hz-24kHz) for speech tokenization.
+<!-- It has been trained using Transformers, Unsloth and [TRL](https://github.com/huggingface/trl). -->
+## Prompting
+For emotion and style control, place the tags <b>at the end</b> of the sentence.
+For example: `मुझे यह फिल्म बहुत पसंद आई! <happy>` or `I am not sure if I can do this. <confused>`
+Tags for Indian languages: `<happy>`, `<sad>`, `<angry>`, `<disgust>`, `<fear>`, `<surprise>` <br>
+Tags for English: `<happy>`, `<sad>`, `<enunciated>`, `<confused>`, `<angry>`, `<whisper>`
+To stress a word, wrap it in asterisks (`*`). For example: `No! I could *never* do it!`
+## Inference
+<b>Approach 1: With MioTTS-Inference (recommended)</b>
+Install [vllm](https://github.com/vllm-project/vllm) and set up [MioTTS-Inference](https://github.com/Aratako/MioTTS-Inference).
+```bash
+vllm serve SPRINGLab/Indic-Mio --max-model-len 1024 --gpu-memory-utilization 0.5
+```
+```bash
+cd MioTTS-Inference
+MIOTTS_CODEC_MODEL=Aratako/MioCodec-25Hz-44.1kHz-v2 \
+MIOTTS_LLM_BASE_URL=http://localhost:8000/v1 \
+python run_server.py --host 0.0.0.0 --port 8001
+```
+```bash
+GRADIO_SERVER_PORT=7861 \
+MIOTTS_API_BASE=http://127.0.0.1:8001 \
+python run_gradio.py
+```
+<b>Approach 2: Directly with Transformers</b>
+```python
+import soundfile as sf
+import torch
+from miocodec import MioCodec
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model_name = "SPRINGLab/Indic-Mio"
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
+)
+text = "नमस्ते, आप कैसे हैं?"
+messages = [{"role": "user", "content": text}]
+prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+output = model.generate(
+    **inputs,
+    max_new_tokens=1024,
+    temperature=0.9,
+    top_p=0.9,
+)
+# Keep only the newly generated tokens, then map speech-token IDs to codec codes.
+generated = output[0][inputs["input_ids"].shape[1]:]
+speech_offset = 151669
+audio_codes = [t.item() - speech_offset for t in generated
+               if speech_offset <= t.item() < speech_offset + 12800]
+# Decode the codes to a waveform with MioCodec (the 44.1 kHz codec, as in Approach 1).
+codec = MioCodec.from_pretrained("Aratako/MioCodec-25Hz-44.1kHz-v2")
+codes_tensor = torch.tensor([audio_codes], dtype=torch.long).unsqueeze(0)  # [1, 1, T]
+with torch.no_grad():
+    wav = codec.decode(codes_tensor)  # -> [1, 1, num_samples]
+sf.write("output.wav", wav.squeeze().float().cpu().numpy(), 44100)
+```
+## Training
+This model was trained on a single NVIDIA A6000 ADA GPU in less than 6 hours.
+For Indian languages, the IndicTTS, Rasa and Syspin datasets were used. For American English, LibriTTS and Expresso were used, and for Indian English, the SPICOR dataset.
+## Fine-tuning
+This model is robust yet flexible. You can fine-tune it on your own dataset for better performance on specific languages, accents, speakers, styles or emotions. Just a few steps of LoRA fine-tuning can significantly improve the performance for your target task.
+## Citations
+If you use this model, please cite this Hugging Face repository as follows:
+```bibtex
+@misc{indic-mio-tts,
+  title={Indic-Mio TTS},
+  author={Advait Joglekar},
+  year={2026},
+  publisher={Hugging Face},
+  howpublished={\url{https://huggingface.co/SPRINGLab/Indic-Mio}},
+}
+```
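
The Prompting rules in the README above (emotion tag at the end of the sentence, asterisks around a stressed word) can be sketched as a small helper. This is only an illustration: `build_prompt` and its parameters are hypothetical names, not part of the model or of MioTTS-Inference, and the tag sets are copied from the README.

```python
# Tag sets copied from the README's Prompting section.
INDIC_TAGS = {"happy", "sad", "angry", "disgust", "fear", "surprise"}
ENGLISH_TAGS = {"happy", "sad", "enunciated", "confused", "angry", "whisper"}


def build_prompt(text, tag=None, english=False, stress=()):
    """Wrap stressed words in asterisks and append an optional style tag.

    Hypothetical helper for illustration; the names are not from the repo.
    """
    for word in stress:
        # Only the first occurrence is wrapped; real text may need
        # proper word-boundary handling.
        text = text.replace(word, f"*{word}*", 1)
    if tag is not None:
        allowed = ENGLISH_TAGS if english else INDIC_TAGS
        if tag not in allowed:
            raise ValueError(f"unsupported tag {tag!r}; choose from {sorted(allowed)}")
        text = f"{text} <{tag}>"  # the README places tags at the end
    return text


print(build_prompt("I am not sure if I can do this.", tag="confused", english=True))
print(build_prompt("No! I could never do it!", stress=["never"]))
```

Rejecting an unlisted tag early is a design choice of this sketch; the README does not say how the model behaves on tags outside the two lists.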

added_tokens.json ADDED

The diff for this file is too large to render. See raw diff

chat_template.jinja ADDED

	@@ -0,0 +1,6 @@

+{%- for message in messages %}
+{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>\n' }}
+{%- endfor %}
+{%- if add_generation_prompt %}
+{{ '<|im_start|>assistant\n' }}
+{%- endif %}
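
To see what prompt string this template yields, here is a plain-Python equivalent. It is a sketch that assumes the whitespace between template tags is stripped entirely (the `{%-` markers, plus block trimming in the Jinja environment); `render_chat` is a hypothetical name used only here.

```python
def render_chat(messages, add_generation_prompt=False):
    """Plain-Python rendering of the ChatML-style template above.

    Assumes no stray newlines survive between template tags.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues with speech tokens.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chat(
    [{"role": "user", "content": "नमस्ते, आप कैसे हैं?"}],
    add_generation_prompt=True,
)
print(repr(prompt))
```

In practice `tokenizer.apply_chat_template` should be preferred; the sketch is only meant to make the prompt layout visible.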

config.json ADDED

	@@ -0,0 +1,61 @@

+{
+  "architectures": [
+    "Qwen3ForCausalLM"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "dtype": "bfloat16",
+  "eos_token_id": 151645,
+  "head_dim": 128,
+  "hidden_act": "silu",
+  "hidden_size": 1024,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_types": [
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention"
+  ],
+  "max_position_embeddings": 32768,
+  "max_window_layers": 28,
+  "model_type": "qwen3",
+  "num_attention_heads": 16,
+  "num_hidden_layers": 28,
+  "num_key_value_heads": 8,
+  "pad_token_id": 151643,
+  "rms_norm_eps": 1e-06,
+  "rope_scaling": null,
+  "rope_theta": 1000000,
+  "sliding_window": null,
+  "tie_word_embeddings": true,
+  "transformers_version": "4.57.6",
+  "unsloth_version": "2026.2.1",
+  "use_cache": false,
+  "use_sliding_window": false,
+  "vocab_size": 164480
+}
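
As a sanity check, the parameter count implied by this config can be compared with the size of `model.safetensors` in this commit. The sketch below assumes the usual Qwen3 layout (bias-free attention and MLP projections, per-head RMSNorm on queries and keys, two RMSNorms per layer plus a final one); that layout is an assumption, not something the diff states.

```python
cfg = {
    "vocab_size": 164480, "hidden_size": 1024, "intermediate_size": 3072,
    "num_hidden_layers": 28, "num_attention_heads": 16,
    "num_key_value_heads": 8, "head_dim": 128, "tie_word_embeddings": True,
}


def count_params(c):
    """Count weights under the assumed Qwen3 layout described above."""
    h, d = c["hidden_size"], c["head_dim"]
    q_width = c["num_attention_heads"] * d      # 16 * 128 = 2048
    kv_width = c["num_key_value_heads"] * d     # 8 * 128 = 1024 (grouped-query attention)
    attn = h * q_width + 2 * h * kv_width + q_width * h  # q, k, v, o projections
    qk_norm = 2 * d                             # RMSNorm weights on q and k
    mlp = 3 * h * c["intermediate_size"]        # gate, up, down projections
    layer_norms = 2 * h                         # pre-attention and pre-MLP norms
    per_layer = attn + qk_norm + mlp + layer_norms
    embed = c["vocab_size"] * h
    lm_head = 0 if c["tie_word_embeddings"] else embed
    return embed + lm_head + c["num_hidden_layers"] * per_layer + h  # + final norm


total = count_params(cfg)
print(total)      # parameter count
print(total * 2)  # bytes at 2 bytes per bfloat16 weight
```

Under these assumptions the weights account for 1,217,789,952 bytes, about 35 KB less than the 1,217,825,224-byte LFS object, which is consistent with a safetensors header making up the remainder.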

generation_config.json ADDED

	@@ -0,0 +1,11 @@

+{
+  "do_sample": true,
+  "eos_token_id": [
+    151645,
+    151643
+  ],
+  "max_length": 32768,
+  "max_new_tokens": 2048,
+  "pad_token_id": 151643,
+  "transformers_version": "4.57.6"
+}
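
The token limits here bound how much audio a single generation can produce. The sketch below reuses the speech-token offset and codebook size from the README's Transformers example and assumes one speech token per codec frame at 25 Hz (read off the codec name, not confirmed by this diff); the function names are hypothetical.

```python
SPEECH_OFFSET = 151669  # from the README's Transformers example
CODEBOOK_SIZE = 12800   # from the README's Transformers example
FRAME_RATE_HZ = 25      # assumption: taken from the "MioCodec-25Hz" name


def extract_audio_codes(token_ids):
    """Keep IDs in the speech-token range and shift them to codec indices."""
    upper = SPEECH_OFFSET + CODEBOOK_SIZE
    return [t - SPEECH_OFFSET for t in token_ids if SPEECH_OFFSET <= t < upper]


def seconds_of_audio(num_codes):
    """Duration implied by a code count, assuming one code per frame."""
    return num_codes / FRAME_RATE_HZ


# 151645 is an EOS ID from generation_config.json; it lies below the
# speech range and is therefore dropped.
generated = [151669, 151700, 164468, 151645]
codes = extract_audio_codes(generated)
print(codes)
print(seconds_of_audio(2048))  # upper bound implied by max_new_tokens
```

With the `--max-model-len 1024` setting from the vLLM command in the README, the prompt and the generated tokens share a 1024-token budget, so the same arithmetic gives an upper bound of roughly 41 seconds per request.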

merges.txt ADDED

The diff for this file is too large to render. See raw diff

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:065f42f7ab6148b66f43e9be0d01ac336343dfe16161350c37b71a87c3e1981b
+size 1217825224
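
The three lines above are a Git LFS pointer, not the weights themselves. A downloaded file can be checked against the pointer's size and SHA-256 digest with a sketch like the following; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at path has the pointer's size and digest."""
    h = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as f:
        # Stream in chunks so a multi-gigabyte file is never fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and h.hexdigest() == pointer["digest"]


POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:065f42f7ab6148b66f43e9be0d01ac336343dfe16161350c37b71a87c3e1981b
size 1217825224
"""
pointer = parse_lfs_pointer(POINTER)
print(pointer["algo"], pointer["size"])
```

After downloading, `matches_pointer("model.safetensors", pointer)` would confirm that the local file is the object this commit refers to.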

special_tokens_map.json ADDED

	@@ -0,0 +1,31 @@

+{
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "eos_token": {
+    "content": "<|im_end|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:abcde038b87ccd029a4523b0c5cec1da6d84b4f3d68b351495df086d63033f1f
+size 13817944

tokenizer_config.json ADDED

The diff for this file is too large to render. See raw diff

vocab.json ADDED

The diff for this file is too large to render. See raw diff