
Model Card for F5TTS_ft

F5TTS_ft is a fine-tuned Chinese text-to-speech (TTS) model based on the original F5-TTS architecture, optimized for improved naturalness, prosody, and stability in Mandarin Chinese speech synthesis.

Model Details

Model Description

  • Developed by: Yougen Yuan
  • Funded by [optional]: Personal research project
  • Shared by [optional]: Yougen Yuan
  • Model type: Text-to-Speech (TTS), Diffusion-based TTS
  • Language(s) (NLP): Chinese (Mandarin, zh-CN)
  • License: Apache-2.0
  • Finetuned from model [optional]: Original F5-TTS base model

Uses

Direct Use

This model can be used directly for end-to-end Chinese text-to-speech synthesis:

  • Converting clean Chinese text input into natural-sounding speech audio
  • Powering voice generation, audiobook creation, voice assistants, and multimedia dubbing
  • Running with minimal inference code compatible with the F5-TTS pipeline

Downstream Use [optional]

  • Integrated into larger voice systems: TTS services, real-time voice generation pipelines
  • Further fine-tuned on custom datasets for specific speakers, styles, or domains
  • Used as a backbone for voice cloning, style transfer, or multilingual TTS extensions

Out-of-Scope Use

  • Not intended for malicious use, deepfake voice impersonation, or deceptive voice generation
  • Not optimized for extremely noisy text, heavily code-mixed or slang-laden input, or non-Chinese languages
  • Not designed for real-time low-latency embedded devices without optimization
  • Not suitable for high-stakes applications (legal, medical announcements) without human verification

Bias, Risks, and Limitations

  • Speech style and prosody are constrained by the fine-tuning data distribution; may lack diversity in emotional expression
  • Pronunciation accuracy depends on text normalization; rare words, proper nouns, or ancient Chinese may be mispronounced
  • Audio quality degrades on extremely long sentences or unformatted, messy text
  • Potential bias in voice characteristics reflects the training dataset’s speaker and accent distribution
  • No built-in content safety; may synthesize harmful or inappropriate text if provided as input

Recommendations

Users should:

  • Clean and normalize input text (punctuation, proper nouns, numbers) for best results
  • Avoid using the model for deceptive, harmful, or non-consensual voice generation
  • Add content moderation layers when deploying in public or commercial systems
  • Conduct further fine-tuning if domain-specific pronunciation or voice style is required
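The text-cleaning step above can be sketched with a minimal normalizer. This is purely illustrative and not part of the model: the digit-by-digit reading and the punctuation table are simplified assumptions (a production front end would read 25 as 二十五, not 二五, and handle many more symbols).

```python
import re

# Digit-to-hanzi mapping used for a naive digit-by-digit reading (assumption;
# a real normalizer expands numbers into proper Chinese numerals).
DIGITS = "零一二三四五六七八九"

def normalize_digits(text: str) -> str:
    """Replace each ASCII digit with its Chinese character reading."""
    return re.sub(r"\d", lambda m: DIGITS[int(m.group())], text)

def normalize_punct(text: str) -> str:
    """Map halfwidth punctuation to the fullwidth forms Chinese TTS front ends expect."""
    table = str.maketrans({",": "，", "?": "？", "!": "！", ":": "：", ";": "；"})
    return text.translate(table)

def normalize(text: str) -> str:
    """Apply digit and punctuation normalization, then trim surrounding whitespace."""
    return normalize_punct(normalize_digits(text)).strip()

print(normalize("今天气温25度,真的吗?"))  # 今天气温二五度，真的吗？
```

Running input through such a pass before synthesis avoids the mispronunciations of raw digits and halfwidth punctuation noted in the limitations above.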

How to Get Started with the Model

This model follows the original F5-TTS inference framework. Example usage (verify the exact load_model arguments against the official repository):

# Load the fine-tuned checkpoint from the Hugging Face Hub
# (the filename below is a placeholder -- check the repository's file list)
from huggingface_hub import hf_hub_download
from f5_tts.model import DiT
from f5_tts.infer.utils_infer import load_vocoder, load_model

ckpt_path = hf_hub_download(repo_id="Yougen/F5TTS_ft", filename="model.pt")

# DiT configuration must match the fine-tuned checkpoint
# (values shown are the F5-TTS Base defaults)
model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)

model = load_model(DiT, model_cfg, ckpt_path, device="cuda")  # or "cpu"
vocoder = load_vocoder()

# Run TTS inference
# Refer to the official F5-TTS inference code for the full pipeline

Full inference code is available in the original F5-TTS repository:
https://github.com/SWivid/F5-TTS

Training Details

Training Data

Fine-tuned on a private Chinese Mandarin speech dataset with:

  • Clean, single-speaker or multi-speaker audio
  • Aligned text transcripts
  • Standard Mandarin pronunciation (Putonghua)
  • Preprocessed to 24kHz audio, clipped silences, normalized volume
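The silence clipping and volume normalization above can be sketched as follows; the peak target and silence threshold are illustrative assumptions, not the values used in training, and resampling to 24 kHz is left to a standard audio library.

```python
import numpy as np

def preprocess(audio: np.ndarray, peak: float = 0.95,
               silence_thresh: float = 1e-3) -> np.ndarray:
    """Peak-normalize a mono waveform, then trim leading/trailing silence."""
    audio = audio.astype(np.float32)
    m = np.max(np.abs(audio))
    if m > 0:
        # Scale so the loudest sample sits at the peak target
        audio = audio * (peak / m)
    # Indices of samples above the silence threshold
    voiced = np.where(np.abs(audio) > silence_thresh)[0]
    if voiced.size == 0:
        return audio[:0]  # entirely silent clip
    return audio[voiced[0]: voiced[-1] + 1]
```

For example, a clip padded with zeros on both ends is scaled so its peak reaches 0.95 and the zero-valued padding is dropped, leaving only the voiced region.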

Training Procedure

Preprocessing [optional]

  • Text: Chinese tokenization, phoneme / prosody annotation
  • Audio: Mel-spectrogram extraction, 24kHz sampling rate
  • Data filtering: removed low-quality, truncated, or misaligned samples
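A data-filtering pass of the kind described can be sketched over a simple (path, transcript, duration) manifest; the field layout and duration bounds are assumptions for illustration, not the actual pipeline.

```python
def filter_manifest(entries, min_dur=1.0, max_dur=30.0):
    """Keep only manifest entries with a non-empty transcript and a
    duration inside [min_dur, max_dur] seconds."""
    kept = []
    for path, text, dur in entries:
        if not text.strip():
            continue  # drop samples with missing/empty transcripts
        if not (min_dur <= dur <= max_dur):
            continue  # drop truncated or overly long clips
        kept.append((path, text, dur))
    return kept
```

Misaligned samples would additionally need a forced-alignment score to filter on, which this sketch omits.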

Training Hyperparameters

  • Training regime: fp16 mixed precision
  • Optimizer: AdamW
  • Learning rate: standard for diffusion TTS
  • Batch size and steps adjusted for fine-tuning

Speeds, Sizes, Times [optional]

Training was performed on a single NVIDIA GPU with sufficient VRAM.
The checkpoint size matches the original F5-TTS architecture.

Evaluation

Testing Data, Factors & Metrics

Testing Data

Internal held-out Chinese test set with diverse sentences and scenarios.

Factors

  • Speech naturalness
  • Pronunciation accuracy
  • Intelligibility
  • Prosodic consistency

Metrics

  • Subjective MOS (Mean Opinion Score)
  • Objective mel-spectrogram reconstruction loss
  • Intelligibility validation
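For reference, a subjective MOS is just the mean of listener ratings on a 1-5 scale; a minimal sketch with a normal-approximation 95% confidence interval (the ratings shown are hypothetical, not the model's scores):

```python
import statistics

def mos(ratings):
    """Mean Opinion Score with a normal-approximation 95% confidence half-width."""
    mean = statistics.fmean(ratings)
    if len(ratings) < 2:
        return mean, 0.0
    half = 1.96 * statistics.stdev(ratings) / len(ratings) ** 0.5
    return mean, half
```

With more listeners per utterance, the half-width shrinks as 1/sqrt(n), which is why MOS studies typically collect many ratings per sample.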

Results

Fine-tuned version shows improved stability and naturalness compared to the baseline on Chinese speech.

Summary

F5TTS_ft improves Mandarin TTS quality with better prosody, clearer pronunciation, and more consistent audio generation.

Model Examination [optional]

No additional interpretability analysis provided beyond standard diffusion TTS behavior.

Environmental Impact

  • Hardware Type: NVIDIA GPU (CUDA-enabled)
  • Hours used: Not precisely recorded
  • Cloud Provider: None (local training)
  • Compute Region: N/A
  • Carbon Emitted: Not calculated

Technical Specifications [optional]

Model Architecture and Objective

  • Architecture: Diffusion transformer (DiT) based sequence-to-sequence TTS
  • Objective: Predict mel-spectrograms from text tokens via diffusion steps
  • Vocoder: Compatible with the official F5-TTS vocoder
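The denoising objective can be sketched generically: noise the target mel at some noise level, then score the model's noise prediction with MSE. This is a simplified, unconditional epsilon-prediction sketch, not the repository's actual (text-conditioned) training loop.

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_loss(mel, predict_noise, alpha_bar):
    """One denoising-objective step: corrupt the target mel at noise level
    alpha_bar, then score the model's noise prediction with MSE."""
    eps = rng.standard_normal(mel.shape)
    # Forward noising: interpolate between signal and Gaussian noise
    noisy = np.sqrt(alpha_bar) * mel + np.sqrt(1 - alpha_bar) * eps
    pred = predict_noise(noisy)
    return float(np.mean((pred - eps) ** 2))
```

A model that always predicts zero noise incurs a loss near 1.0 (the variance of standard Gaussian noise), which is the baseline training drives the network below.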

Compute Infrastructure

Hardware

NVIDIA GPU with CUDA support (>= 12 GB VRAM recommended for inference)

Software

  • PyTorch
  • F5-TTS official codebase
  • Hugging Face Hub library

Citation [optional]

BibTeX:

@misc{F5TTS,
  author = {SWivid},
  title = {F5-TTS: A Non-Autoregressive Diffusion TTS Model},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/SWivid/F5-TTS}}
}

@misc{F5TTS_ft,
  author = {Yougen Yuan},
  title = {F5TTS_ft: Fine-tuned Chinese F5-TTS Model},
  year = {2026},
  publisher = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/Yougen/F5TTS_ft}}
}

APA:

SWivid. (2024). F5-TTS: A Non-Autoregressive Diffusion TTS Model. GitHub. https://github.com/SWivid/F5-TTS

Yuan, Y. (2026). F5TTS_ft: Fine-tuned Chinese F5-TTS Model. Hugging Face Hub. https://huggingface.co/Yougen/F5TTS_ft

Glossary [optional]

  • TTS: Text-to-Speech
  • F5-TTS: Original diffusion-based TTS architecture
  • Mel-spectrogram: Audio frequency representation used in TTS
  • Fine-tuned: Model adapted from a pre-trained checkpoint on new data

More Information [optional]

This model is a research-oriented fine-tune for Chinese speech synthesis and is not officially affiliated with the original F5-TTS authors.

Model Card Authors [optional]

Yougen Yuan

Model Card Contact

Yougen Yuan (via Hugging Face Hub)
