Thorsten-Voice · CosyVoice3

German Text-to-Speech fine-tune of FunAudioLLM/Fun-CosyVoice3-0.5B-2512 on the Thorsten-Voice 2022.10 dataset (12,283 German utterances).

This repository contains two fine-tuned model components:

File	Size	Description
`llm.pt`	1.9 GB	Fine-tuned LLM (speech rhythm, prosody, speaker style)
`flow.pt`	1.3 GB	Fine-tuned Flow Decoder (voice timbre, spectral characteristics)

The HiFi-GAN vocoder (`hift.pt`) is used unchanged from the base model.

Quickstart with Docker

The easiest way to use this model is via the official Docker image:

docker run -p 8000:8000 \
  -v cosyvoice_models:/app/CosyVoice/pretrained_models \
  thorstenvoice/cosyvoice-tts

# Then generate audio:
curl -X POST http://localhost:8000/tts \
     -F "text=Hallo, ich bin Thorsten. Schön, dass du da bist." \
     --output thorsten.wav

→ Docker Hub: thorstenvoice/cosyvoice-tts
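
The same request can be sent from Python instead of curl. The following is a minimal sketch using only the standard library. It assumes nothing beyond what the curl call shows: a POST to /tts on port 8000 with a multipart form field named text, and an audio file as the response body. The helper names build_multipart and synthesize are illustrative and not part of the Docker image.

```python
import urllib.request
import uuid


def build_multipart(fields: dict[str, str]) -> tuple[bytes, str]:
    """Encode plain text fields as multipart/form-data, like `curl -F`."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
    parts.append(f"--{boundary}--\r\n")
    body = "".join(parts).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def synthesize(text: str, out_path: str,
               url: str = "http://localhost:8000/tts") -> None:
    """POST the text to the running container and save the response body."""
    body, content_type = build_multipart({"text": text})
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())


# Example call (requires the container from the docker run command above):
# synthesize("Hallo, ich bin Thorsten. Schön, dass du da bist.", "thorsten.wav")
```

On CPU-only hosts a request can take a while to return, so a caller may want to pass a generous timeout to urlopen.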

Manual Installation

1. Clone CosyVoice at the correct commit

git clone https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice
git checkout ace7c47
git submodule update --init --recursive

2. Install dependencies

Python 3.10 or 3.11 is recommended. For Python 3.12, see the "Python 3.12 patches" section.

sudo apt-get install -y sox libsox-fmt-all ffmpeg
pip install setuptools --upgrade
pip install openai-whisper
grep -v "openai-whisper" requirements.txt > requirements_fixed.txt
pip install -r requirements_fixed.txt
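
Before moving on, it can help to confirm that the interpreter and the command-line tools from this step are in place. The sketch below checks only what this section names: the recommended Python versions and the sox and ffmpeg executables. The function names are illustrative.

```python
import shutil
import sys


def python_version_ok(version: tuple[int, int]) -> bool:
    """True for the recommended interpreters (3.10 and 3.11)."""
    return version in {(3, 10), (3, 11)}


def missing_tools(tools: list[str]) -> list[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def report() -> None:
    """Print a short summary of anything that deviates from this section."""
    current = (sys.version_info.major, sys.version_info.minor)
    if not python_version_ok(current):
        print(f"Python {current[0]}.{current[1]} is not a recommended version")
    for tool in missing_tools(["sox", "ffmpeg"]):
        print(f"Missing command-line tool: {tool}")


# Example call:
# report()
```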

3. Set PYTHONPATH

export PYTHONPATH=/path/to/CosyVoice:/path/to/CosyVoice/third_party/Matcha-TTS:$PYTHONPATH
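
Whether the two paths took effect can be checked from Python. This is a small sketch that assumes the top-level package names are cosyvoice and matcha, as suggested by the file paths used elsewhere in this document (cosyvoice/flow/flow.py and third_party/Matcha-TTS/matcha/utils). The helper name is illustrative.

```python
import importlib.util


def missing_packages(names: list[str]) -> list[str]:
    """Return the top-level packages that Python cannot locate on sys.path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Example call, run in the shell where PYTHONPATH was exported:
# print(missing_packages(["cosyvoice", "matcha"]))
```

An empty list means both packages are found. Note that find_spec only locates a package and does not import it, so missing dependencies would still surface later.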

4. Download models

# The hf command ships with recent huggingface_hub releases, so upgrade if needed
pip install -U huggingface_hub

# Base model
hf download FunAudioLLM/Fun-CosyVoice3-0.5B-2512 \
  --local-dir pretrained_models/CosyVoice3-0.5B

# Thorsten fine-tuned weights
hf download Thorsten-Voice/CosyVoice3 \
  --local-dir pretrained_models/CosyVoice3-0.5B \
  --include "llm.pt" "flow.pt" "spk2info.pt" "infer_thorsten.py"
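
The two downloads can also be scripted with the huggingface_hub Python API. The sketch below mirrors the commands above using snapshot_download and then checks that the expected files exist. It assumes the base model repository provides hift.pt, as implied by the note on the vocoder. The names download_models and missing_files are illustrative.

```python
from pathlib import Path

MODEL_DIR = "pretrained_models/CosyVoice3-0.5B"
THORSTEN_FILES = ["llm.pt", "flow.pt", "spk2info.pt", "infer_thorsten.py"]


def download_models(model_dir: str = MODEL_DIR) -> None:
    """Fetch the base model, then place the Thorsten files in the same folder."""
    # Imported here so the file check below works without huggingface_hub.
    from huggingface_hub import snapshot_download

    snapshot_download("FunAudioLLM/Fun-CosyVoice3-0.5B-2512", local_dir=model_dir)
    snapshot_download(
        "Thorsten-Voice/CosyVoice3",
        local_dir=model_dir,
        allow_patterns=THORSTEN_FILES,
    )


def missing_files(model_dir: str, required: list[str]) -> list[str]:
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


# Example calls:
# download_models()
# print(missing_files(MODEL_DIR, THORSTEN_FILES + ["hift.pt"]))
```

The order matters: the fine-tuned files are fetched second so that they end up in the model folder after the base download.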

5. Generate audio

python3 infer_thorsten.py \
  --text "Hallo, ich bin Thorsten. Schön, dass du da bist." \
  --output thorsten.wav
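
A quick way to sanity-check the result is to read the header of the generated file. This sketch uses the standard wave module and assumes that the script writes an integer PCM WAV file, which this document does not state. For encodings the module cannot read, wave.open raises wave.Error.

```python
import wave


def wav_info(path: str) -> tuple[int, int, float]:
    """Return (sample_rate_hz, channels, duration_s) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate


# Example call:
# print(wav_info("thorsten.wav"))
```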

Performance

Benchmarked with these two test texts:

Short (9 words):

"Hallo, hier ist Thorsten. Schön, dass Du da bist."

Long (57 words):

"Für mich sind alle Menschen gleich, unabhängig von Geschlecht, sexueller Orientierung, Religion, Hautfarbe oder Geokoordinaten der Geburt. Ich glaube an eine globale Welt, wo jeder überall willkommen ist und freies Wissen und Bildung kostenfrei für jeden zur Verfügung steht. Ich habe meine Stimme der Allgemeinheit gespendet, in der Hoffnung darauf, dass sie in diesem Sinne genutzt wird."

Hardware	Short text	Long text
MacBook Air M1 (CPU)	47 s	4 min 30 s
QNAP NAS Intel (CPU)	50 s	not reported
RunPod RTX 4090 (GPU)	2.9 s	12.9 s
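
To put the table in relative terms, the timings can be turned into speedup factors. The sketch below only restates the numbers from the table, converted to seconds, and divides them. It adds no new measurements.

```python
# Timings from the table above, in seconds (4 min 30 s = 270 s).
TIMINGS = {
    "MacBook Air M1 (CPU)": {"short": 47.0, "long": 270.0},
    "QNAP NAS Intel (CPU)": {"short": 50.0},  # long text not reported
    "RunPod RTX 4090 (GPU)": {"short": 2.9, "long": 12.9},
}


def speedup(baseline_s: float, faster_s: float) -> float:
    """How many times faster the second timing is than the first."""
    return baseline_s / faster_s


gpu = TIMINGS["RunPod RTX 4090 (GPU)"]
for name, times in TIMINGS.items():
    for text, seconds in times.items():
        print(f"{name}, {text}: {speedup(seconds, gpu[text]):.1f}x GPU time")
```

By this arithmetic the RTX 4090 is roughly 16 times faster than the M1 CPU on the short text and roughly 21 times faster on the long text.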

Python 3.12 patches

1. cosyvoice/flow/flow.py: add the following lines after `conds = conds.transpose(1, 2)` in `CausalMaskedDiffWithDiT.forward()`:

min_len = min(h.shape[1], feat.shape[1])
h = h[:, :min_len, :]
feat = feat[:, :min_len, :]
conds = conds[:, :, :min_len]
mask = mask[:, :min_len]

2. third_party/Matcha-TTS/matcha/utils/__init__.py: overwrite the file so that it contains only a blank line (run from the CosyVoice directory):

echo "" > third_party/Matcha-TTS/matcha/utils/__init__.py
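
The effect of the slicing in patch 1 can be illustrated on dummy arrays. This is a sketch using NumPy rather than PyTorch. The tensor layouts are an assumption inferred from the slicing itself: time on axis 1 for h, feat and mask, and time on the last axis for conds after the transpose. The concrete sizes are made up for illustration.

```python
import numpy as np


def trim_to_common_length(h, feat, conds, mask):
    """Apply the same slicing as the flow.py patch."""
    min_len = min(h.shape[1], feat.shape[1])
    return (
        h[:, :min_len, :],
        feat[:, :min_len, :],
        conds[:, :, :min_len],
        mask[:, :min_len],
    )


# Dummy batch in which h and mask are one frame longer than feat and conds.
h = np.zeros((2, 101, 80))
feat = np.zeros((2, 100, 80))
conds = np.zeros((2, 80, 100))
mask = np.ones((2, 101))

h, feat, conds, mask = trim_to_common_length(h, feat, conds, mask)
print(h.shape, feat.shape, conds.shape, mask.shape)
```

After trimming, all four arrays share the same number of time steps, which is what the patched forward pass relies on.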

Training details

Component	Base model	Epochs	Dataset
LLM	Fun-CosyVoice3-0.5B-2512	1	Thorsten-Voice 2022.10 (12,283 utterances)
Flow Decoder	Fun-CosyVoice3-0.5B-2512	9	Thorsten-Voice 2022.10 (12,283 utterances)
HiFi-GAN	Fun-CosyVoice3-0.5B-2512	—	not fine-tuned

Hardware: NVIDIA A40 (48 GB VRAM)

License

Apache 2.0 — same as the base model. The Thorsten-Voice dataset is licensed under CC0.

Citation

@article{du2025cosyvoice,
  title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
  author={Du, Zhihao and others},
  journal={arXiv preprint arXiv:2505.17589},
  year={2025}
}
