Raon-OpenTTS-0.3B


Technical Report (Coming soon) | Code | Dataset | Raon-OpenTTS-1B (Coming soon)

Raon-OpenTTS is an open-data, open-weight zero-shot TTS system that performs on par with state-of-the-art closed-data models. This is the 0.3B variant.

Key Features

  • Fully Open: Both model weights and training data (615K hours, 11 English speech datasets) are publicly available for reproducible TTS research.
  • More Robust on Wild Speech: Achieves lower WER than F5-TTS on the Wild split of Raon-OpenTTS-Eval, demonstrating better robustness to unscripted conversational speech prompts.
  • Large-Scale Curated Data: Trained on Raon-OpenTTS-Core (510K hours), quality-filtered from Raon-OpenTTS-Pool using combined DNSMOS, WER, and VAD rank-based filtering.
  • DiT Architecture: Based on F5-TTS Diffusion Transformer with flow matching, enabling efficient zero-shot speech synthesis.
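The combined rank-based filtering mentioned above can be illustrated with a small sketch. The combination scheme (sum of per-metric ranks, keep the top fraction) and all score values here are assumptions for illustration only; the actual Raon-OpenTTS-Core filtering details are described with the dataset.

```python
# Illustrative sketch of rank-based quality filtering combining DNSMOS,
# WER, and VAD (assumed combination scheme, not the exact pipeline).
# Each utterance has a DNSMOS score (higher = cleaner audio), an ASR WER
# against its transcript (lower = better), and a VAD speech ratio
# (higher = more speech, less silence).

def rank_filter(utts, keep_fraction=0.5):
    """Rank utterances on each metric, sum the ranks, keep the best."""
    n = len(utts)
    # Rank 0 = best. DNSMOS and VAD: higher is better; WER: lower is better.
    by_dnsmos = sorted(range(n), key=lambda i: -utts[i]["dnsmos"])
    by_wer    = sorted(range(n), key=lambda i: utts[i]["wer"])
    by_vad    = sorted(range(n), key=lambda i: -utts[i]["vad"])

    rank_sum = [0] * n
    for order in (by_dnsmos, by_wer, by_vad):
        for rank, i in enumerate(order):
            rank_sum[i] += rank

    # Keep the utterances with the lowest combined rank.
    keep = sorted(range(n), key=lambda i: rank_sum[i])[: int(n * keep_fraction)]
    return [utts[i] for i in keep]

utts = [
    {"id": "a", "dnsmos": 3.8, "wer": 0.02, "vad": 0.95},
    {"id": "b", "dnsmos": 2.1, "wer": 0.40, "vad": 0.60},
    {"id": "c", "dnsmos": 3.5, "wer": 0.05, "vad": 0.90},
    {"id": "d", "dnsmos": 2.9, "wer": 0.15, "vad": 0.70},
]
kept = rank_filter(utts, keep_fraction=0.5)  # keeps "a" and "c"
```

Rank-based combination sidesteps the need to calibrate thresholds across metrics with different scales, which is one common reason to prefer it over absolute cutoffs.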

Model Details

| Item | Value |
| --- | --- |
| Parameters | 336M |
| Architecture | DiT (Diffusion Transformer), based on F5-TTS |
| Config | dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4 |
| Training Data | Raon-OpenTTS-Core (510.1K hours) |
| Steps | 225K updates |
| Hardware | 48 NVIDIA B200 GPUs |
| Batch Size | 672K frames (14K/GPU x 48 GPUs) |
| Optimizer | AdamW, peak LR 1e-4, 50K warmup, linear decay, grad norm 1.0 |
| Audio | 80-channel mel-spectrogram, 16 kHz, hop=256 |
| Vocoder | HiFi-GAN (speechbrain/tts-hifigan-libritts-16kHz) |
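The config row above can be written out as a plain Python mapping, with the per-head and feed-forward widths derived from the listed hyperparameters. The dict keys mirror the names in the table; this is an illustration, not the model's actual config file format.

```python
# Model hyperparameters as listed in the Model Details table.
config = dict(
    dim=1024,       # transformer hidden size
    depth=22,       # number of DiT blocks
    heads=16,       # attention heads
    ff_mult=2,      # feed-forward expansion factor
    text_dim=512,   # text-embedding dimension
    conv_layers=4,  # convolutional layers on the text side
)

head_dim = config["dim"] // config["heads"]  # 1024 / 16 = 64 per head
ff_dim = config["dim"] * config["ff_mult"]   # 1024 * 2 = 2048 feed-forward width
```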

Benchmark Results (Seed-TTS-Eval)

| Model | Params | WER (lower is better) | SIM (higher is better) |
| --- | --- | --- | --- |
| Human | - | 2.14 | 0.734 |
| Seed-TTS | - | 2.25 | 0.762 |
| CosyVoice 3 | 1.5B | 2.21 | 0.720 |
| Qwen3-TTS | 1.7B | 1.46 | 0.715 |
| F5-TTS | 0.3B | 2.04 | 0.671 |
| Raon-OpenTTS-0.3B | 0.3B | 1.97 | 0.703 |

Benchmark Results (Raon-OpenTTS-Eval Wild)

| Model | Params | WER (lower is better) | SIM (higher is better) |
| --- | --- | --- | --- |
| F5-TTS | 0.3B | TBD | TBD |
| Raon-OpenTTS-0.3B | 0.3B | TBD | TBD |

Inference

For inference code and usage instructions, see KRAFTON/Raon-OpenTTS.

Training Details

Raon-OpenTTS-0.3B was trained for 225K update steps on 48 NVIDIA B200 GPUs using the Raon-OpenTTS-Core dataset (510.1K hours of English speech). The model uses the AdamW optimizer with a peak learning rate of 1e-4, 50K warmup steps, and linear decay; the gradient norm is clipped at 1.0. Waveform synthesis uses a HiFi-GAN vocoder pretrained on LibriTTS at 16 kHz.
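The batch-size and step counts above imply how much audio the model consumed during training. A quick back-of-the-envelope check, using the 16 kHz sample rate and hop size of 256 from the Model Details table (the epoch count is an estimate derived from these figures, not a reported number):

```python
# Back-of-the-envelope: audio seen per update and over the full run.
sample_rate = 16_000        # Hz
hop = 256                   # samples per mel frame
frames_per_batch = 672_000  # 14K frames/GPU x 48 GPUs
steps = 225_000

sec_per_frame = hop / sample_rate                          # 0.016 s per frame
audio_per_batch_h = frames_per_batch * sec_per_frame / 3600  # ~3 h per update
total_audio_h = audio_per_batch_h * steps                    # ~672K hours total

# Against the 510.1K-hour dataset, this is roughly 1.3 passes over the data.
epochs = total_audio_h / 510_100
```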

Citation

@article{raon2026opentts,
  title     = {Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech},
  author    = {TBD},
  year      = {2026},
  url       = {https://github.com/krafton-ai/Raon-OpenTTS}
}

License

This repository is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.

© 2026 KRAFTON
