# Raon-OpenTTS-0.3B
Technical Report (Coming soon) | Code | Dataset | Raon-OpenTTS-1B (Coming soon)
Raon-OpenTTS is an open-data, open-weight zero-shot TTS system that performs on par with state-of-the-art closed-data models. This is the 0.3B variant.
## Key Features
- Fully Open: Both model weights and training data (615K hours, 11 English speech datasets) are publicly available for reproducible TTS research.
- More Robust on Wild Speech: Achieves lower WER than F5-TTS on the Wild split of Raon-OpenTTS-Eval, demonstrating better robustness to unscripted conversational speech prompts.
- Large-Scale Curated Data: Trained on Raon-OpenTTS-Core (510K hours), quality-filtered from Raon-OpenTTS-Pool using combined DNSMOS, WER, and VAD rank-based filtering.
- DiT Architecture: Based on F5-TTS Diffusion Transformer with flow matching, enabling efficient zero-shot speech synthesis.
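The combined rank-based filtering mentioned above can be sketched as follows. The utterance fields, the equal weighting of the three ranks, and the keep ratio are all illustrative assumptions; the card only names the three quality signals (DNSMOS, WER, VAD) and the pool/core sizes.

```python
# Hypothetical sketch of combined rank-based quality filtering:
# rank every utterance under each metric, sum the ranks, and keep
# the best-ranked fraction. Not the released pipeline.

def rank(values, reverse=False):
    """Map each value to its rank (0 = best)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=reverse)
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def filter_pool(utts, keep_ratio):
    # Higher DNSMOS is better, lower WER is better; "vad_speech_ratio"
    # is an assumed speech-activity score where higher is better.
    dnsmos_r = rank([u["dnsmos"] for u in utts], reverse=True)
    wer_r = rank([u["wer"] for u in utts])
    vad_r = rank([u["vad_speech_ratio"] for u in utts], reverse=True)
    combined = [d + w + v for d, w, v in zip(dnsmos_r, wer_r, vad_r)]
    cutoff = sorted(combined)[int(len(utts) * keep_ratio) - 1]
    return [u for u, c in zip(utts, combined) if c <= cutoff]

pool = [
    {"id": "a", "dnsmos": 3.8, "wer": 0.02, "vad_speech_ratio": 0.9},
    {"id": "b", "dnsmos": 2.1, "wer": 0.30, "vad_speech_ratio": 0.4},
    {"id": "c", "dnsmos": 3.5, "wer": 0.05, "vad_speech_ratio": 0.8},
]
kept = filter_pool(pool, keep_ratio=0.67)
print([u["id"] for u in kept])  # -> ['a', 'c']
```

Note that the reported numbers (510K of 615K hours retained) correspond to keeping roughly 83% of the pool by duration.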
## Model Details
| Property | Value |
|---|---|
| Parameters | 336M |
| Architecture | DiT (Diffusion Transformer), based on F5-TTS |
| Config | dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4 |
| Training Data | Raon-OpenTTS-Core (510.1K hours) |
| Steps | 225K updates |
| Hardware | 48 NVIDIA B200 GPUs |
| Batch Size | 672K frames (14K/GPU × 48 GPUs) |
| Optimizer | AdamW, peak LR 1e-4, 50K warmup, linear decay, grad norm 1.0 |
| Audio | 80-channel mel-spectrogram, 16kHz, hop=256 |
| Vocoder | HiFi-GAN (speechbrain/tts-hifigan-libritts-16kHz) |
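As a quick sanity check on the audio and batch configuration (this is only arithmetic on the table's numbers): at 16kHz with a hop of 256 samples, each mel frame covers 16 ms, so one 672K-frame batch corresponds to roughly three hours of audio per update.

```python
# Arithmetic check on the audio/batch settings above.
sample_rate = 16_000        # Hz
hop_length = 256            # samples per mel frame
frames_per_batch = 672_000  # 14K frames/GPU x 48 GPUs

frames_per_second = sample_rate / hop_length           # 62.5 frames/s
seconds_per_batch = frames_per_batch / frames_per_second
hours_per_batch = seconds_per_batch / 3600

print(frames_per_second)          # 62.5
print(round(hours_per_batch, 2))  # 2.99 hours of audio per update
```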
## Benchmark Results (Seed-TTS-Eval)
| Model | Params | WER (lower is better) | SIM (higher is better) |
|---|---|---|---|
| Human | - | 2.14 | 0.734 |
| Seed-TTS | - | 2.25 | 0.762 |
| CosyVoice 3 | 1.5B | 2.21 | 0.720 |
| Qwen3-TTS | 1.7B | 1.46 | 0.715 |
| F5-TTS | 0.3B | 2.04 | 0.671 |
| Raon-OpenTTS-0.3B | 0.3B | 1.97 | 0.703 |
## Benchmark Results (Raon-OpenTTS-Eval Wild)
| Model | Params | WER (lower is better) | SIM (higher is better) |
|---|---|---|---|
| F5-TTS | 0.3B | TBD | TBD |
| Raon-OpenTTS-0.3B | 0.3B | TBD | TBD |
## Inference
For inference code and usage instructions, see the KRAFTON/Raon-OpenTTS repository.
## Training Details
Raon-OpenTTS-0.3B was trained for 225K update steps on 48 NVIDIA B200 GPUs using the Raon-OpenTTS-Core dataset (510.1K hours of English speech), with an effective batch size of 672K mel frames (14K per GPU across 48 GPUs). Optimization uses AdamW with a peak learning rate of 1e-4, 50K warmup steps, and linear decay; gradient norms are clipped at 1.0. Waveform synthesis uses a HiFi-GAN vocoder pretrained on LibriTTS at 16kHz.
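The learning-rate schedule described above can be sketched as a simple function. Decaying linearly from the peak to zero at the final step is an assumption for illustration; the card only states "linear decay".

```python
# Sketch of the reported schedule: linear warmup to the peak LR over
# 50K steps, then linear decay. Reaching zero exactly at step 225K is
# an assumption, not stated on the card.

PEAK_LR = 1e-4
WARMUP_STEPS = 50_000
TOTAL_STEPS = 225_000

def lr_at(step: int) -> float:
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Linear decay from the peak to 0 over the remaining steps.
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)

print(lr_at(0))        # 0.0
print(lr_at(50_000))   # 1e-4 (peak)
print(lr_at(137_500))  # 5e-5 (halfway through decay)
print(lr_at(225_000))  # 0.0
```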
## Citation
@article{raon2026opentts,
  title  = {Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech},
  author = {TBD},
  year   = {2026},
  url    = {https://github.com/krafton-ai/Raon-OpenTTS}
}
## License
This repository is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
© 2026 KRAFTON