
Model Card for F5TTS_ft

F5TTS_ft is a fine-tuned Chinese text-to-speech (TTS) model based on the original F5-TTS architecture, optimized for improved naturalness, prosody, and stability in Mandarin Chinese speech synthesis.

Model Details

Model Description

  • Developed by: Yougen Yuan
  • Funded by [optional]: Personal research project
  • Shared by [optional]: Yougen Yuan
  • Model type: Text-to-Speech (TTS), Diffusion-based TTS
  • Language(s) (NLP): Chinese (Mandarin, zh-CN)
  • License: Apache-2.0
  • Finetuned from model [optional]: Original F5-TTS base model

Uses

Direct Use

This model can be used directly for end-to-end Chinese text-to-speech synthesis:

  • Converting clean Chinese text input into natural-sounding speech audio
  • Powering voice generation, audiobook creation, voice assistants, and multimedia dubbing
  • Running with minimal inference code compatible with the F5-TTS pipeline

Downstream Use [optional]

  • Integrated into larger voice systems: TTS services, real-time voice generation pipelines
  • Further fine-tuned on custom datasets for specific speakers, styles, or domains
  • Used as a backbone for voice cloning, style transfer, or multilingual TTS extensions

Out-of-Scope Use

  • Not intended for malicious use, deepfake voice impersonation, or deceptive voice generation
  • Not optimized for extremely noisy text, heavily code-mixed or slang-laden input, or non-Chinese languages
  • Not designed for real-time low-latency embedded devices without optimization
  • Not suitable for high-stakes applications (legal, medical announcements) without human verification

Bias, Risks, and Limitations

  • Speech style and prosody are constrained by the fine-tuning data distribution; may lack diversity in emotional expression
  • Pronunciation accuracy depends on text normalization; rare words, proper nouns, or ancient Chinese may be mispronounced
  • Audio quality degrades on extremely long sentences or unformatted, messy text
  • Potential bias in voice characteristics reflects the training dataset’s speaker and accent distribution
  • No built-in content safety; may synthesize harmful or inappropriate text if provided as input

Recommendations

Users should:

  • Clean and normalize input text (punctuation, proper nouns, numbers) for best results
  • Avoid using the model for deceptive, harmful, or non-consensual voice generation
  • Add content moderation layers when deploying in public or commercial systems
  • Conduct further fine-tuning if domain-specific pronunciation or voice style is required
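The text-cleaning step above can be sketched with a minimal normalizer. This is purely illustrative and not part of the model: the digit-by-digit reading and the punctuation table are simplified assumptions (a production front end would read 25 as 二十五, not 二五, and handle many more symbols).

```python
import re

# Digit-to-hanzi mapping used for a naive digit-by-digit reading (assumption;
# a real normalizer expands numbers into proper Chinese numerals).
DIGITS = "零一二三四五六七八九"

def normalize_digits(text: str) -> str:
    """Replace each ASCII digit with its Chinese character reading."""
    return re.sub(r"\d", lambda m: DIGITS[int(m.group())], text)

def normalize_punct(text: str) -> str:
    """Map halfwidth punctuation to the fullwidth forms Chinese TTS front ends expect."""
    table = str.maketrans({",": "，", "?": "？", "!": "！", ":": "：", ";": "；"})
    return text.translate(table)

def normalize(text: str) -> str:
    """Apply digit and punctuation normalization, then trim surrounding whitespace."""
    return normalize_punct(normalize_digits(text)).strip()

print(normalize("今天气温25度,真的吗?"))  # 今天气温二五度，真的吗？
```

Running input through such a pass before synthesis avoids the mispronunciations of raw digits and halfwidth punctuation noted in the limitations above.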

How to Get Started with the Model

This model follows the original F5-TTS inference framework. Example usage (verify the exact load_model arguments against the official repository):

# Load the fine-tuned checkpoint from the Hugging Face Hub
# (the filename below is a placeholder -- check the repository's file list)
from huggingface_hub import hf_hub_download
from f5_tts.model import DiT
from f5_tts.infer.utils_infer import load_vocoder, load_model

ckpt_path = hf_hub_download(repo_id="Yougen/F5TTS_ft", filename="model.pt")

# DiT configuration must match the fine-tuned checkpoint
# (values shown are the F5-TTS Base defaults)
model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)

model = load_model(DiT, model_cfg, ckpt_path, device="cuda")  # or "cpu"
vocoder = load_vocoder()

# Run TTS inference
# Refer to the official F5-TTS inference code for the full pipeline

Full inference code is available in the original F5-TTS repository:
https://github.com/SWivid/F5-TTS

Training Details

Training Data

Fine-tuned on a private Chinese Mandarin speech dataset with:

  • Clean, single-speaker or multi-speaker audio
  • Aligned text transcripts
  • Standard Mandarin pronunciation (Putonghua)
  • Preprocessed to 24kHz audio, clipped silences, normalized volume
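The silence clipping and volume normalization above can be sketched as follows; the peak target and silence threshold are illustrative assumptions, not the values used in training, and resampling to 24 kHz is left to a standard audio library.

```python
import numpy as np

def preprocess(audio: np.ndarray, peak: float = 0.95,
               silence_thresh: float = 1e-3) -> np.ndarray:
    """Peak-normalize a mono waveform, then trim leading/trailing silence."""
    audio = audio.astype(np.float32)
    m = np.max(np.abs(audio))
    if m > 0:
        # Scale so the loudest sample sits at the peak target
        audio = audio * (peak / m)
    # Indices of samples above the silence threshold
    voiced = np.where(np.abs(audio) > silence_thresh)[0]
    if voiced.size == 0:
        return audio[:0]  # entirely silent clip
    return audio[voiced[0]: voiced[-1] + 1]
```

For example, a clip padded with zeros on both ends is scaled so its peak reaches 0.95 and the zero-valued padding is dropped, leaving only the voiced region.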

Training Procedure

Preprocessing [optional]

  • Text: Chinese tokenization, phoneme / prosody annotation
  • Audio: Mel-spectrogram extraction, 24kHz sampling rate
  • Data filtering: removed low-quality, truncated, or misaligned samples
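A data-filtering pass of the kind described can be sketched over a simple (path, transcript, duration) manifest; the field layout and duration bounds are assumptions for illustration, not the actual pipeline.

```python
def filter_manifest(entries, min_dur=1.0, max_dur=30.0):
    """Keep only manifest entries with a non-empty transcript and a
    duration inside [min_dur, max_dur] seconds."""
    kept = []
    for path, text, dur in entries:
        if not text.strip():
            continue  # drop samples with missing/empty transcripts
        if not (min_dur <= dur <= max_dur):
            continue  # drop truncated or overly long clips
        kept.append((path, text, dur))
    return kept
```

Misaligned samples would additionally need a forced-alignment score to filter on, which this sketch omits.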

Training Hyperparameters

  • Training regime: fp16 mixed precision
  • Optimizer: AdamW
  • Learning rate: standard for diffusion TTS
  • Batch size and steps adjusted for fine-tuning

Speeds, Sizes, Times [optional]

Training was performed on a single NVIDIA GPU with sufficient VRAM.
The checkpoint size matches the original F5-TTS architecture.

Evaluation

Testing Data, Factors & Metrics

Testing Data

Internal held-out Chinese test set with diverse sentences and scenarios.

Factors

  • Speech naturalness
  • Pronunciation accuracy
  • Intelligibility
  • Prosodic consistency

Metrics

  • Subjective MOS (Mean Opinion Score)
  • Objective mel-spectrogram reconstruction loss
  • Intelligibility validation
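For reference, a subjective MOS is just the mean of listener ratings on a 1-5 scale; a minimal sketch with a normal-approximation 95% confidence interval (the ratings shown are hypothetical, not the model's scores):

```python
import statistics

def mos(ratings):
    """Mean Opinion Score with a normal-approximation 95% confidence half-width."""
    mean = statistics.fmean(ratings)
    if len(ratings) < 2:
        return mean, 0.0
    half = 1.96 * statistics.stdev(ratings) / len(ratings) ** 0.5
    return mean, half
```

With more listeners per utterance, the half-width shrinks as 1/sqrt(n), which is why MOS studies typically collect many ratings per sample.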

Results

Fine-tuned version shows improved stability and naturalness compared to the baseline on Chinese speech.

Summary

F5TTS_ft improves Mandarin TTS quality with better prosody, clearer pronunciation, and more consistent audio generation.

Model Examination [optional]

No additional interpretability analysis provided beyond standard diffusion TTS behavior.

Environmental Impact

  • Hardware Type: NVIDIA GPU (CUDA-enabled)
  • Hours used: Not precisely recorded
  • Cloud Provider: None (local training)
  • Compute Region: N/A
  • Carbon Emitted: Not calculated

Technical Specifications [optional]

Model Architecture and Objective

  • Architecture: Diffusion transformer (DiT) based sequence-to-sequence TTS
  • Objective: Predict mel-spectrograms from text tokens via diffusion steps
  • Vocoder: Compatible with the official F5-TTS vocoder
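The denoising objective can be sketched generically: noise the target mel at some noise level, then score the model's noise prediction with MSE. This is a simplified, unconditional epsilon-prediction sketch, not the repository's actual (text-conditioned) training loop.

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_loss(mel, predict_noise, alpha_bar):
    """One denoising-objective step: corrupt the target mel at noise level
    alpha_bar, then score the model's noise prediction with MSE."""
    eps = rng.standard_normal(mel.shape)
    # Forward noising: interpolate between signal and Gaussian noise
    noisy = np.sqrt(alpha_bar) * mel + np.sqrt(1 - alpha_bar) * eps
    pred = predict_noise(noisy)
    return float(np.mean((pred - eps) ** 2))
```

A model that always predicts zero noise incurs a loss near 1.0 (the variance of standard Gaussian noise), which is the baseline training drives the network below.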

Compute Infrastructure

Hardware

NVIDIA GPU with CUDA support (>= 12 GB VRAM recommended for inference)

Software

  • PyTorch
  • F5-TTS official codebase
  • Hugging Face Hub library

Citation [optional]

BibTeX:

@misc{F5TTS,
  author = {SWivid},
  title = {F5-TTS: A Non-Autoregressive Diffusion TTS Model},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/SWivid/F5-TTS}}
}

@misc{F5TTS_ft,
  author = {Yougen Yuan},
  title = {F5TTS_ft: Fine-tuned Chinese F5-TTS Model},
  year = {2026},
  publisher = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/Yougen/F5TTS_ft}}
}

APA:

SWivid. (2024). F5-TTS: A Non-Autoregressive Diffusion TTS Model. GitHub. https://github.com/SWivid/F5-TTS

Yuan, Y. (2026). F5TTS_ft: Fine-tuned Chinese F5-TTS Model. Hugging Face Hub. https://huggingface.co/Yougen/F5TTS_ft

Glossary [optional]

  • TTS: Text-to-Speech
  • F5-TTS: Original diffusion-based TTS architecture
  • Mel-spectrogram: Audio frequency representation used in TTS
  • Fine-tuned: Model adapted from a pre-trained checkpoint on new data

More Information [optional]

This model is a research-oriented fine-tune for Chinese speech synthesis and is not officially affiliated with the original F5-TTS authors.

Model Card Authors [optional]

Yougen Yuan

Model Card Contact

Yougen Yuan (via Hugging Face Hub)
