# CosyVoice3 (Mandarin) → CoreML Models for FluidAudio
CoreML conversions of CosyVoice3's four inference stages, frozen to the exact
shapes the FluidAudio Swift package's `CosyVoice3TtsManager` loads at runtime.
Targets Apple Silicon (M-series), with the Neural Engine for the LLM and HiFT
stages and CPU + GPU for Flow.
A default voice ships in `voices/` so the repo is self-contained. Additional
voices (as they're extracted) live in the companion repo
`FluidInference/cosyvoice3-voices-zh`.
## Shipping configuration (frozen)
Each model ships in two formats: `.mlpackage` (source, portable) and
`.mlmodelc` (pre-compiled for macOS 14 / iOS 17 on Apple Silicon). Swift can
load either; `.mlmodelc` skips the one-time compile step on first use
(~20-30 s for Flow without it).
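A minimal sketch of how the two formats differ at load time (not FluidAudio's actual loader; the helper name and directory layout are illustrative). The `.mlmodelc` path loads directly, while the `.mlpackage` path goes through Core ML's `MLModel.compileModel(at:)` first:

```swift
import CoreML

/// Hypothetical helper: prefer a pre-compiled .mlmodelc if present,
/// otherwise compile the .mlpackage once (the ~20-30 s step for Flow).
func loadModel(named name: String, in dir: URL) async throws -> MLModel {
    let compiled = dir.appendingPathComponent("\(name).mlmodelc")
    if FileManager.default.fileExists(atPath: compiled.path) {
        // Pre-compiled bundle: no compile step on first use.
        return try MLModel(contentsOf: compiled)
    }
    let package = dir.appendingPathComponent("\(name).mlpackage")
    // Compiles to a temporary .mlmodelc; a real loader would cache this URL.
    let freshlyCompiled = try await MLModel.compileModel(at: package)
    return try MLModel(contentsOf: freshlyCompiled)
}
```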
| Model | Compute | Purpose | dtype |
|---|---|---|---|
| LLM-Prefill-T256-M768-fp16 | CPU + ANE | Qwen2-0.5B prefill, 256-token context, 768-slot KV cache | fp16 |
| LLM-Decode-M768-fp16 | CPU + ANE | Single-step AR decode, 768-slot KV cache, 24 layers × 2 KV heads × 64 dim | fp16 |
| Flow-N250-fp16 | CPU + GPU | Speech-token → mel (80-bin, 24 kHz), N_total=250 | fp16 (pure CPU overflows fused LayerNorm → NaN; ANE refuses to compile; GPU path uses fp32 accumulators internally and is stable) |
| HiFT-T500-fp16 | CPU + ANE | Mel → 24 kHz PCM, T=500 frames | fp16 |
Total disk footprint (`.mlmodelc` + `.mlpackage` + runtime tables): ~6.6 GB.
If you only need one format, delete the other after download.
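If you load the models directly rather than through `CosyVoice3TtsManager`, the compute-unit split from the table above can be expressed with `MLModelConfiguration`. A sketch, dispatching on the model name (the names match this repo; the function itself is illustrative):

```swift
import CoreML

// Sketch: per-model compute-unit choices matching the shipping table.
func configuration(forModel name: String) -> MLModelConfiguration {
    let config = MLModelConfiguration()
    switch name {
    case "Flow-N250-fp16":
        // fp16 Flow NaNs on pure CPU and won't compile for the ANE,
        // so pin it to CPU + GPU.
        config.computeUnits = .cpuAndGPU
    default:
        // LLM prefill/decode and HiFT run on CPU + Neural Engine.
        config.computeUnits = .cpuAndNeuralEngine
    }
    return config
}
```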
## Runtime tables
### embeddings/

- `embeddings-runtime-fp32.safetensors` (542 MB): Qwen2 `model.embed_tokens.weight` in its runtime (post-`.float()`) dtype. Required for bit-exact parity with the Python reference; shipping the raw `.pt` weights introduces ~4.7e-4 error through the HuggingFace dtype round-trip. Swift mmaps this file.
- `speech_embedding-fp16.safetensors` (12 MB): CosyVoice3 `speech_embedding` table (6761 × 896, fp16); one row lookup per decoded speech token.
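The per-token row lookup is plain offset arithmetic over the mmapped fp16 table. A sketch over the tensor's data region (the safetensors header that precedes it is ignored here):

```swift
import Foundation

// speech_embedding: 6761 rows x 896 dims, 2 bytes per fp16 element.
// 6761 * 896 * 2 = 12,115,712 bytes, matching the ~12 MB file size.
let rows = 6761
let dims = 896
let bytesPerElement = 2

/// Byte range of one embedding row within the tensor's data region.
func embeddingRange(forToken token: Int) -> Range<Int> {
    precondition((0..<rows).contains(token), "speech token out of range")
    let rowBytes = dims * bytesPerElement   // 1792 bytes per row
    return (token * rowBytes)..<((token + 1) * rowBytes)
}
```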
### voices/ (11 zero-shot voice bundles, ~1 MB total)

- `cosyvoice3-default-zh.safetensors`: default voice from the upstream CosyVoice `zero_shot_prompt.wav` (female, "希望你以后能够做的比我还好呦。", N_speech = 87).
- `aishell3-zh-SSB*.safetensors`: 10 AISHELL-3 speakers bootstrapped via `verify/bootstrap_aishell3_voices.py` (5 female + 5 male, northern + southern accents). See `aishell3-bootstrap.json` for per-voice provenance.
- Each `.safetensors` ships with a `.json` prompt-text sidecar and follows the schema documented in the companion `cosyvoice3-voices-zh` repo.
### tokenizer/

- `vocab.json` + `merges.txt` + `tokenizer_config.json`: stock Qwen2 BPE tokenizer assets (copied from HuggingFace `FunAudioLLM/CosyVoice-BlankEN`).
- `special_tokens.json`: map of the 281 runtime-added CosyVoice3 special tokens to their IDs (`<|endofprompt|>`, `[breath]`, ARPAbet phonemes, etc.). Covers IDs 151643..151923.
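A sketch of how such a runtime map could be overlaid on the base BPE vocab. The flat string-to-ID JSON schema assumed here is an illustration, not the documented format:

```swift
import Foundation

// 151643..151923 inclusive is exactly the 281 runtime-added IDs.
let specialIDRange = 151643...151923

/// Overlay runtime-added special tokens on the base Qwen2 vocab.
/// Special tokens win on collision; their IDs sit above the BPE vocab.
func mergedVocab(base: [String: Int], special: [String: Int]) -> [String: Int] {
    base.merging(special) { _, specialID in specialID }
}
```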
## Swift usage (FluidAudio)
```swift
import FluidAudio

let manager = CosyVoice3TtsManager(
    modelsDirectory: modelsURL, // this repo root
    tokenizerDirectory: modelsURL.appendingPathComponent("tokenizer"),
    textEmbeddingsFile: modelsURL.appendingPathComponent("embeddings/embeddings-runtime-fp32.safetensors"),
    specialTokensFile: modelsURL.appendingPathComponent("tokenizer/special_tokens.json"))
try await manager.initialize()

let prompt = try CosyVoice3PromptAssets.load(
    from: voiceURL.appendingPathComponent("cosyvoice3-default-zh.safetensors"))

let result = try await manager.synthesize(
    text: "今天天气真的很不错，适合出门散步。",
    promptAssets: prompt)
// result.samples → [Float], 24 kHz mono
```
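To audition the output, the returned samples can be written to disk. An illustrative helper using AVFoundation, not part of the FluidAudio API:

```swift
import AVFoundation

/// Write mono 24 kHz float samples to a WAV file (sketch; error paths
/// beyond `throws` are left to the caller).
func writeWAV(_ samples: [Float], to url: URL) throws {
    let format = AVAudioFormat(standardFormatWithSampleRate: 24_000,
                               channels: 1)!
    let file = try AVAudioFile(forWriting: url, settings: format.settings)
    let buffer = AVAudioPCMBuffer(pcmFormat: format,
                                  frameCapacity: AVAudioFrameCount(samples.count))!
    buffer.frameLength = AVAudioFrameCount(samples.count)
    samples.withUnsafeBufferPointer { src in
        buffer.floatChannelData![0].update(from: src.baseAddress!,
                                           count: samples.count)
    }
    try file.write(from: buffer)
}
```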
## Model graph quick reference
- Qwen2 decoder: hidden=896, 24 layers, 14 Q heads, 2 KV heads, head_dim=64
- Speech vocab: 6761 (6561 tokens + sos/eos/task_id/stops)
- SOS=6561, EOS=6562, TASK_ID=6563
- Flow: 80-bin mel @ 24 kHz, hop=480, n_fft=1920
- HiFT: iSTFT-based vocoder, upsamples mel to 24 kHz PCM
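A quick back-of-envelope sanity check on the frozen shapes above (sizes are implied by the table, not measured):

```swift
// KV cache: 24 layers x 2 KV heads x 64 dims x 768 slots,
// K and V tensors, 2 bytes per fp16 element.
let kvBytes = 24 * 2 * 64 * 768 * 2 /* K+V */ * 2 /* bytes */
// = 9_437_184 bytes, roughly 9 MB resident per decode session.

// Mel frame rate: 24 kHz / hop 480 = 50 frames per second, so the
// HiFT window of T=500 frames covers 10 s of audio.
let framesPerSecond = 24_000 / 480   // 50
let samplesPerWindow = 500 * 480     // 240_000 samples at 24 kHz
```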
## License
Apache-2.0. Derived from FunAudioLLM/CosyVoice3 weights; see upstream license.