Raon-OpenTTS-0.3B


Technical Report (Coming soon) | Code | Dataset | Raon-OpenTTS-1B (Coming soon)

Raon-OpenTTS is an open-data, open-weight zero-shot TTS system that performs on par with state-of-the-art closed-data models. This is the 0.3B variant.

Key Features

  • Fully Open: Both model weights and training data (615K hours, 11 English speech datasets) are publicly available for reproducible TTS research.
  • More Robust on Wild Speech: Achieves lower WER than F5-TTS on the Wild split of Raon-OpenTTS-Eval, demonstrating better robustness to unscripted conversational speech prompts.
  • Large-Scale Curated Data: Trained on Raon-OpenTTS-Core (510K hours), quality-filtered from Raon-OpenTTS-Pool using combined DNSMOS, WER, and VAD rank-based filtering.
  • DiT Architecture: Based on F5-TTS Diffusion Transformer with flow matching, enabling efficient zero-shot speech synthesis.
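The combined rank-based filtering mentioned above can be illustrated with a small sketch. The combination scheme (sum of per-metric ranks, keep the top fraction) and all score values here are assumptions for illustration only; the actual Raon-OpenTTS-Core filtering details are described with the dataset.

```python
# Illustrative sketch of rank-based quality filtering combining DNSMOS,
# WER, and VAD (assumed combination scheme, not the exact pipeline).
# Each utterance has a DNSMOS score (higher = cleaner audio), an ASR WER
# against its transcript (lower = better), and a VAD speech ratio
# (higher = more speech, less silence).

def rank_filter(utts, keep_fraction=0.5):
    """Rank utterances on each metric, sum the ranks, keep the best."""
    n = len(utts)
    # Rank 0 = best. DNSMOS and VAD: higher is better; WER: lower is better.
    by_dnsmos = sorted(range(n), key=lambda i: -utts[i]["dnsmos"])
    by_wer    = sorted(range(n), key=lambda i: utts[i]["wer"])
    by_vad    = sorted(range(n), key=lambda i: -utts[i]["vad"])

    rank_sum = [0] * n
    for order in (by_dnsmos, by_wer, by_vad):
        for rank, i in enumerate(order):
            rank_sum[i] += rank

    # Keep the utterances with the lowest combined rank.
    keep = sorted(range(n), key=lambda i: rank_sum[i])[: int(n * keep_fraction)]
    return [utts[i] for i in keep]

utts = [
    {"id": "a", "dnsmos": 3.8, "wer": 0.02, "vad": 0.95},
    {"id": "b", "dnsmos": 2.1, "wer": 0.40, "vad": 0.60},
    {"id": "c", "dnsmos": 3.5, "wer": 0.05, "vad": 0.90},
    {"id": "d", "dnsmos": 2.9, "wer": 0.15, "vad": 0.70},
]
kept = rank_filter(utts, keep_fraction=0.5)  # keeps "a" and "c"
```

Rank-based combination sidesteps the need to calibrate thresholds across metrics with different scales, which is one common reason to prefer it over absolute cutoffs.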

Model Details

| Item | Value |
| --- | --- |
| Parameters | 336M |
| Architecture | DiT (Diffusion Transformer), based on F5-TTS |
| Config | dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4 |
| Training Data | Raon-OpenTTS-Core (510.1K hours) |
| Steps | 225K updates |
| Hardware | 48 NVIDIA B200 GPUs |
| Batch Size | 672K frames (14K/GPU x 48 GPUs) |
| Optimizer | AdamW, peak LR 1e-4, 50K warmup, linear decay, grad norm 1.0 |
| Audio | 80-channel mel-spectrogram, 16 kHz, hop=256 |
| Vocoder | HiFi-GAN (speechbrain/tts-hifigan-libritts-16kHz) |
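The config row above can be written out as a plain Python mapping, with the per-head and feed-forward widths derived from the listed hyperparameters. The dict keys mirror the names in the table; this is an illustration, not the model's actual config file format.

```python
# Model hyperparameters as listed in the Model Details table.
config = dict(
    dim=1024,       # transformer hidden size
    depth=22,       # number of DiT blocks
    heads=16,       # attention heads
    ff_mult=2,      # feed-forward expansion factor
    text_dim=512,   # text-embedding dimension
    conv_layers=4,  # convolutional layers on the text side
)

head_dim = config["dim"] // config["heads"]  # 1024 / 16 = 64 per head
ff_dim = config["dim"] * config["ff_mult"]   # 1024 * 2 = 2048 feed-forward width
```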

Benchmark Results (Seed-TTS-Eval)

| Model | Params | WER (lower is better) | SIM (higher is better) |
| --- | --- | --- | --- |
| Human | - | 2.14 | 0.734 |
| Seed-TTS | - | 2.25 | 0.762 |
| CosyVoice 3 | 1.5B | 2.21 | 0.720 |
| Qwen3-TTS | 1.7B | 1.46 | 0.715 |
| F5-TTS | 0.3B | 2.04 | 0.671 |
| Raon-OpenTTS-0.3B | 0.3B | 1.97 | 0.703 |

Benchmark Results (Raon-OpenTTS-Eval Wild)

| Model | Params | WER (lower is better) | SIM (higher is better) |
| --- | --- | --- | --- |
| F5-TTS | 0.3B | TBD | TBD |
| Raon-OpenTTS-0.3B | 0.3B | TBD | TBD |

Inference

For inference code and usage instructions, see KRAFTON/Raon-OpenTTS.

Training Details

Raon-OpenTTS-0.3B was trained for 225K update steps on 48 NVIDIA B200 GPUs using the Raon-OpenTTS-Core dataset (510.1K hours of English speech). The model uses the AdamW optimizer with a peak learning rate of 1e-4, 50K warmup steps, and linear decay; the gradient norm is clipped at 1.0. Waveform synthesis uses a HiFi-GAN vocoder pretrained on LibriTTS at 16 kHz.
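The batch-size and step counts above imply how much audio the model consumed during training. A quick back-of-the-envelope check, using the 16 kHz sample rate and hop size of 256 from the Model Details table (the epoch count is an estimate derived from these figures, not a reported number):

```python
# Back-of-the-envelope: audio seen per update and over the full run.
sample_rate = 16_000        # Hz
hop = 256                   # samples per mel frame
frames_per_batch = 672_000  # 14K frames/GPU x 48 GPUs
steps = 225_000

sec_per_frame = hop / sample_rate                          # 0.016 s per frame
audio_per_batch_h = frames_per_batch * sec_per_frame / 3600  # ~3 h per update
total_audio_h = audio_per_batch_h * steps                    # ~672K hours total

# Against the 510.1K-hour dataset, this is roughly 1.3 passes over the data.
epochs = total_audio_h / 510_100
```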

Citation

@article{raon2026opentts,
  title     = {Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech},
  author    = {TBD},
  year      = {2026},
  url       = {https://github.com/krafton-ai/Raon-OpenTTS}
}

License

This repository is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.

© 2026 KRAFTON
