CosyVoice3 (Mandarin): CoreML Models for FluidAudio

CoreML conversions of CosyVoice3's four inference stages, frozen to the exact shapes that the FluidAudio Swift package's CosyVoice3TtsManager loads at runtime. Targets Apple Silicon (M-series): the Neural Engine for LLM + HiFT, and CPU + GPU for Flow (pure CPU overflows to NaN in fp16; see the shipping table).

A default voice ships in voices/ so the repo is self-contained. Additional voices (as they're extracted) live in the companion repo FluidInference/cosyvoice3-voices-zh.

Shipping configuration (frozen)

Each model ships in two formats: .mlpackage (source, portable) and .mlmodelc (pre-compiled for macOS 14 / iOS 17 on Apple Silicon). Swift can load either; loading the .mlmodelc skips the one-time compile step that a .mlpackage incurs on first use (~20-30 s for Flow).
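The format choice can be sketched as a small path resolver: prefer the pre-compiled .mlmodelc when it is present, otherwise fall back to the .mlpackage and compile it. This is an illustrative sketch, not FluidAudio's actual loader; `resolveModelURL`, `loadModel`, and the assumption that both formats sit side by side in one directory are hypothetical. The Core ML part is guarded so that the resolver can be exercised without the framework.

```swift
import Foundation
#if canImport(CoreML)
import CoreML
#endif

/// Hypothetical helper: returns the pre-compiled .mlmodelc if it exists,
/// otherwise the .mlpackage, otherwise nil.
func resolveModelURL(named name: String, in directory: URL,
                     fileManager: FileManager = .default) -> URL? {
    for ext in ["mlmodelc", "mlpackage"] {
        let candidate = directory.appendingPathComponent("\(name).\(ext)")
        // Both formats are directories on disk; fileExists covers directories too.
        if fileManager.fileExists(atPath: candidate.path) {
            return candidate
        }
    }
    return nil
}

#if canImport(CoreML)
/// Hypothetical loader: compiles the .mlpackage only when no .mlmodelc is present.
@available(macOS 13.0, iOS 16.0, tvOS 16.0, watchOS 9.0, *)
func loadModel(named name: String, in directory: URL,
               computeUnits: MLComputeUnits) async throws -> MLModel? {
    guard var url = resolveModelURL(named: name, in: directory) else { return nil }
    if url.pathExtension == "mlpackage" {
        // One-time compile; this is the slow step that .mlmodelc avoids.
        url = try await MLModel.compileModel(at: url)
    }
    let config = MLModelConfiguration()
    config.computeUnits = computeUnits
    return try MLModel(contentsOf: url, configuration: config)
}
#endif
```

Under these assumptions, a call such as `loadModel(named: "Flow-N250-fp16", in: modelsURL, computeUnits: .cpuAndGPU)` would pick whichever format is on disk.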

| Model | Compute | Purpose | dtype |
|---|---|---|---|
| LLM-Prefill-T256-M768-fp16 | CPU + ANE | Qwen2-0.5B prefill, 256-token context, 768-slot KV cache | fp16 |
| LLM-Decode-M768-fp16 | CPU + ANE | Single-step AR decode, 768-slot KV cache, 24 layers × 2 KV heads × 64 dim | fp16 |
| Flow-N250-fp16 | CPU + GPU | Speech-token → mel (80-bin, 24 kHz), N_total=250 | fp16 (pure CPU overflows fused LayerNorm → NaN; ANE refuses to compile; GPU path uses fp32 accumulators internally and is stable) |
| HiFT-T500-fp16 | CPU + ANE | Mel → 24 kHz PCM, T=500 frames | fp16 |
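The KV-cache figures in the table imply a small per-request footprint. The arithmetic below is a back-of-envelope sketch that assumes one key tensor and one value tensor per layer and 2 bytes per fp16 element; the actual tensor layout inside the models is not stated here.

```swift
// Back-of-envelope KV-cache size from the table's figures.
let layers = 24
let kvHeads = 2
let headDim = 64
let slots = 768            // M768
let bytesPerFP16 = 2
let tensorsPerLayer = 2    // assumption: one K and one V tensor per layer

let elements = layers * tensorsPerLayer * kvHeads * headDim * slots
let bytes = elements * bytesPerFP16
print("KV cache: \(elements) elements, \(Double(bytes) / 1_048_576) MiB")
// prints "KV cache: 4718592 elements, 9.0 MiB"
```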

Total disk footprint (.mlmodelc + .mlpackage + runtime tables): ~6.6 GB. If you only need one format, delete the other after download.

Runtime tables

embeddings/

  • embeddings-runtime-fp32.safetensors: 542 MB. Qwen2 model.embed_tokens.weight in its runtime dtype (after .float()). Required for bit-exact parity with the Python reference; shipping the raw .pt weights introduces ~4.7e-4 error through the HuggingFace dtype round-trip. Swift mmaps this file.
  • speech_embedding-fp16.safetensors: 12 MB. CosyVoice3 speech_embedding table (6761 × 896, fp16); one row lookup per decoded speech token.
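The per-token row lookup amounts to an offset computation into a row-major table. The sketch below assumes the tensor payload is contiguous, row-major fp16; `rowByteRange` is a hypothetical helper, and it does not parse the safetensors header, so `tableStart` must already point at the payload.

```swift
import Foundation

let speechVocab = 6761
let embedDim = 896
let bytesPerFP16 = 2

/// Byte range of one embedding row inside a row-major fp16 table.
/// `tableStart` is the byte offset of the tensor payload (assumed known;
/// the safetensors header is not parsed in this sketch).
func rowByteRange(token: Int, tableStart: Int = 0) -> Range<Int>? {
    guard (0..<speechVocab).contains(token) else { return nil }
    let rowBytes = embedDim * bytesPerFP16
    let start = tableStart + token * rowBytes
    return start..<(start + rowBytes)
}

let tableBytes = speechVocab * embedDim * bytesPerFP16
print(tableBytes)   // 12115712 bytes, consistent with the ~12 MB file size
```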

voices/: 11 zero-shot voice bundles (~1 MB total)

  • cosyvoice3-default-zh.safetensors: default voice from CosyVoice upstream zero_shot_prompt.wav (female, prompt text "希望你以后能够做的比我还好呦。", N_speech = 87).
  • aishell3-zh-SSB*.safetensors: 10 AISHELL-3 speakers bootstrapped via verify/bootstrap_aishell3_voices.py (5 female + 5 male, north + south accents). See aishell3-bootstrap.json for per-voice provenance.
  • Each .safetensors ships with a .json prompt-text sidecar and follows the schema documented in the companion cosyvoice3-voices-zh repo.

tokenizer/

  • vocab.json + merges.txt + tokenizer_config.json: stock Qwen2 BPE tokenizer assets (copied from HuggingFace FunAudioLLM/CosyVoice-BlankEN).
  • special_tokens.json: token → ID map for the 281 runtime-added CosyVoice3 special tokens (<|endofprompt|>, [breath], ARPAbet phonemes, etc.). Covers IDs 151643..151923.
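A loader for special_tokens.json might look like the sketch below. It assumes the file is a flat JSON object mapping each token string to an integer ID (the schema is only described as a token → ID map), and it rejects IDs outside the documented range. `loadSpecialTokens` and `SpecialTokenError` are hypothetical names.

```swift
import Foundation

/// ID range documented for the runtime-added tokens (281 IDs).
let specialIDRange = 151_643...151_923

struct SpecialTokenError: Error {
    let token: String
    let id: Int
}

/// Decodes a flat {"token": id} JSON object (assumed schema) and
/// throws if any ID falls outside the documented range.
func loadSpecialTokens(from data: Data) throws -> [String: Int] {
    let map = try JSONDecoder().decode([String: Int].self, from: data)
    for (token, id) in map where !specialIDRange.contains(id) {
        throw SpecialTokenError(token: token, id: id)
    }
    return map
}

// The IDs below are placeholders inside the documented range,
// not the real assignments from special_tokens.json.
let sample = Data(#"{"<|endofprompt|>": 151643, "[breath]": 151644}"#.utf8)
let tokens = try! loadSpecialTokens(from: sample)
print(tokens.count, specialIDRange.count)   // prints "2 281"
```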

Swift usage (FluidAudio)

import FluidAudio

let manager = CosyVoice3TtsManager(
    modelsDirectory:     modelsURL,                            // this repo root
    tokenizerDirectory:  modelsURL.appendingPathComponent("tokenizer"),
    textEmbeddingsFile:  modelsURL.appendingPathComponent("embeddings/embeddings-runtime-fp32.safetensors"),
    specialTokensFile:   modelsURL.appendingPathComponent("tokenizer/special_tokens.json"))
try await manager.initialize()

// voices/ directory in this repo
let voiceURL = modelsURL.appendingPathComponent("voices")
let prompt = try CosyVoice3PromptAssets.load(
    from: voiceURL.appendingPathComponent("cosyvoice3-default-zh.safetensors"))

let result = try await manager.synthesize(
    text: "今天天气真的很不错，适合出门散步。",
    promptAssets: prompt)
// result.samples: [Float] @ 24 kHz mono
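
To listen to the output, the [Float] samples can be wrapped in a WAV container. The helper below is a minimal Foundation-only sketch that encodes 16-bit PCM mono; `makeWav` is a hypothetical function, not part of FluidAudio, and it assumes the samples are nominally in [-1, 1] (values outside that range are clamped, NaN is written as silence).

```swift
import Foundation

/// Encodes mono Float samples as a 16-bit PCM WAV file in memory.
func makeWav(samples: [Float], sampleRate: UInt32 = 24_000) -> Data {
    // Little-endian byte representation of a fixed-width integer.
    func le<T: FixedWidthInteger>(_ value: T) -> Data {
        withUnsafeBytes(of: value.littleEndian) { Data($0) }
    }
    let dataSize = UInt32(samples.count * 2)      // 2 bytes per sample
    var wav = Data()
    wav.append(Data("RIFF".utf8)); wav.append(le(36 + dataSize))
    wav.append(Data("WAVE".utf8))
    wav.append(Data("fmt ".utf8)); wav.append(le(UInt32(16)))  // fmt chunk size
    wav.append(le(UInt16(1)))                     // format tag: PCM
    wav.append(le(UInt16(1)))                     // channels: mono
    wav.append(le(sampleRate))
    wav.append(le(sampleRate * 2))                // byte rate = rate * 1 channel * 2 bytes
    wav.append(le(UInt16(2)))                     // block align
    wav.append(le(UInt16(16)))                    // bits per sample
    wav.append(Data("data".utf8)); wav.append(le(dataSize))
    for s in samples {
        let clamped: Float = s.isNaN ? 0 : max(-1, min(1, s))
        wav.append(le(Int16((clamped * 32767).rounded())))
    }
    return wav
}
```

With the call above, something like `try makeWav(samples: result.samples).write(to: outputURL)` would save the audio, where `outputURL` is a destination of your choosing.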

Model graph quick reference

  • Qwen2 decoder: hidden=896, 24 layers, 14 Q heads, 2 KV heads, head_dim=64
  • Speech vocab: 6761 (6561 tokens + sos/eos/task_id/stops)
  • SOS=6561, EOS=6562, TASK_ID=6563
  • Flow: 80-bin mel @ 24 kHz, hop=480, n_fft=1920
  • HiFT: iSTFT-based vocoder, upsamples mel to 24 kHz PCM
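
The figures above, together with the frozen N=250 and T=500 shapes, give a quick duration estimate per chunk. The arithmetic is a sketch; the two-frames-per-token ratio is inferred from those two shapes rather than stated, and `isAcousticToken` is a hypothetical helper.

```swift
let sampleRate = 24_000
let hop = 480
let hiftFrames = 500       // T500 (HiFT input length)
let flowTokens = 250       // N250 (Flow input length)

let framesPerSecond = sampleRate / hop                               // 50
let samplesPerChunk = hiftFrames * hop                               // 240000
let secondsPerChunk = Double(samplesPerChunk) / Double(sampleRate)   // 10.0
let framesPerToken = hiftFrames / flowTokens                         // 2 (inferred ratio)

// Speech-token IDs below 6561 are acoustic tokens; 6561 and above are
// control tokens (sos/eos/task_id/stops).
let sos = 6561, eos = 6562, taskID = 6563, speechVocab = 6761
func isAcousticToken(_ id: Int) -> Bool { (0..<sos).contains(id) }
```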

License

Apache-2.0. Derived from FunAudioLLM/CosyVoice3 weights; see upstream license.
