---
license: other
license_name: yl4579-styletts2
license_link: https://github.com/yl4579/StyleTTS2#pre-requisites
language:
- en
library_name: coreml
tags:
- text-to-speech
- styletts2
- coreml
- apple-silicon
- libritts
- on-device
pipeline_tag: text-to-speech
inference: false
---
# StyleTTS2 (LibriTTS) — CoreML
Apple-Silicon-optimized CoreML conversion of [yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
LibriTTS multi-speaker checkpoint
([`yl4579/StyleTTS2-LibriTTS` → `Models/LibriTTS/epochs_2nd_00020.pth`](https://huggingface.co/yl4579/StyleTTS2-LibriTTS)).
Four-stage pipeline; per-stage compute-unit placement; fp16 text-and-prosody
predictor (selective int8 PTQ was evaluated and dropped; see below); fp32 decoder.
> [!IMPORTANT]
> **These weights carry use restrictions beyond MIT. Read the License
> section before downloading.** They are not a drop-in permissively-licensed
> TTS model. If you need permissive terms, use
> [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M) instead.
## License & use restrictions
The upstream repository code is MIT, but the pre-trained LibriTTS weights
carry **two non-negotiable restrictions** declared in
[yl4579/StyleTTS2's README](https://github.com/yl4579/StyleTTS2#pre-requisites):
1. **Synthetic-origin disclosure.** Any deployment that produces audio from
these weights must clearly disclose to listeners that the audio is
synthetic. No undisclosed synthetic-speech publishing.
2. **Speaker consent for voice cloning.** Cloning a real person's voice
requires their consent. No unauthorized celebrity / public-figure /
non-consenting third-party voice cloning.
These restrictions ride with the weights through every redistribution,
fine-tune, and downstream derivative. Anyone downloading this repo inherits
them and must propagate them in turn.
If you cannot or will not honor these terms, **do not download these
weights**.
License-of-record: [github.com/yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
upstream README at the time of conversion (see *Conversion provenance* below
for the pinned commit).
## What's in this repo
| Package | Compute unit | Precision | Buckets | Called |
|---|---|---|---|---|
| `styletts2_text_predictor_{32,64,128,256,512}.mlpackage` | ANE | fp16 | 5 token-length | 1× per utterance |
| `styletts2_diffusion_step_512.mlpackage` | CPU+GPU | fp16 | 1 (B=512 only) | ~5× per utterance |
| `styletts2_f0n_energy.mlpackage` | ANE | fp16 | dynamic | 1× per utterance |
| `styletts2_decoder_{256,512,1024,2048,4096}.mlpackage` | CPU+GPU | **fp32** | 5 mel-length | 1× per utterance |
| `constants/text_cleaner_vocab.json` | — | — | — | phoneme→id table |
| `config.json` | — | — | — | bundle runtime contract (audio/sampler/buckets) |
Total on-disk size: ~1.4 GB per format.
Both source `.mlpackage` (uncompiled, portable across Xcode versions) and
pre-compiled `.mlmodelc` (Apple Silicon, ready for `MLModel(contentsOf:)`)
are shipped. The `.mlmodelc` artifacts are under `compiled/`. Pick one:
- **`*.mlpackage`** — load via `MLModel(contentsOf:)`; the OS compiles on
  first load (~5–20 s cold start the first time, cached afterward).
- **`compiled/*.mlmodelc`** — already compiled; the same loader path skips the
  on-device compile. Useful for shipping inside an app bundle.
The diffusion sampler loop (ADPM2 + Karras schedule + CFG) and the
hard-alignment matrix (cumsum-of-durations → one-hot → matmul) live in your
host application (Swift / Python). Per-step inference is in CoreML; control
flow is not.
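The hard-alignment construction described above can be sketched in NumPy as
follows (the durations and hidden size are illustrative; only the
cumsum → one-hot → matmul shape of the computation is taken from this card):

```python
import numpy as np

def build_alignment(durations: np.ndarray) -> np.ndarray:
    """Expand per-token frame durations into a hard one-hot alignment
    matrix of shape (T_tok, T_mel): row t is 1 for the mel frames
    assigned to token t."""
    ends = np.cumsum(durations)      # exclusive frame boundary per token
    starts = ends - durations        # inclusive frame start per token
    frames = np.arange(int(ends[-1]))
    # one-hot: frame f belongs to token t iff starts[t] <= f < ends[t]
    align = (frames >= starts[:, None]) & (frames < ends[:, None])
    return align.astype(np.float32)

# usage: expand token-rate features to frame rate via matmul
durations = np.array([2, 3, 1])                  # frames per token
align = build_alignment(durations)               # (3, 6)
d_en = np.random.randn(3, 8).astype(np.float32)  # (T_tok, hidden)
asr = align.T @ d_en                             # (T_mel, hidden)
```

Each mel frame lands in exactly one token's interval, so the matmul simply
repeats each token's feature vector for its duration.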
### Why the precision split looks like this
- **text_predictor is fp16.** Selective int8 PTQ was tried and dropped:
on Apple Silicon ANE the int8 path saves only ~3 MB per bucket of
weight bandwidth, has no exposed int8 GEMM, and dequantizes back to
fp16 on load. The savings did not justify the parity risk on small
projections.
- **diffusion_step stays fp16.** It runs 5 times per utterance through an
ODE-style sampler; quantization noise compounds through iterations.
Same lesson as PocketTTS issue #7.
- **f0n_energy stays fp16.** ~6 MB. No bandwidth payoff; quantizing
small projections injects audible pitch noise.
- **decoder is fp32, not fp16.** SineGen's harmonic source accumulates
  phase via `cumsum × 2π × hop=300`, reaching magnitudes ~4000
  mid-frame. fp16 spacing at that magnitude (~4) is much larger than
  the per-sample increment (~0.05 rad), which scrambles the sine output
  and produces audibly robotic synthesis. fp32 is required end-to-end.
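The decoder's fp16 failure mode above is easy to verify numerically: at the
phase magnitudes SineGen reaches, the gap between adjacent fp16 values dwarfs
the per-sample phase increment, so accumulation silently stalls (the exact
phase value here is illustrative):

```python
import numpy as np

# fp16 spacing (ULP) near the ~4000-rad phase magnitudes cited above
phase = np.float16(4096.0)
ulp = np.spacing(phase)        # smallest representable fp16 step here: 4.0
increment = np.float16(0.05)   # typical per-sample phase increment (rad)

# the increment is far below the fp16 step, so adding it is a no-op
assert ulp > 1.0
assert phase + increment == phase

# fp32 resolves the same increment without trouble
assert np.spacing(np.float32(4096.0)) < 0.05
```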
### Why only one diffusion bucket
Empirically every observed `bert_dur` fits in B=512. The 32/64/128/256
buckets were dead weight (~192 MB) given the non-linear cost ladder
(B=32 ≈ 66 ms/step, B=512 ≈ 152 ms/step). Dropping them adds at most
~430 ms per utterance in the worst short case.
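The worst-case penalty quoted above follows directly from the per-step costs
and the 5-step sampler:

```python
# per-step latency by bucket (ms), from the measurements above
step_ms = {32: 66, 512: 152}
sampler_steps = 5

# a short utterance that would have fit B=32 now pays the B=512 rate
penalty_ms = (step_ms[512] - step_ms[32]) * sampler_steps
print(penalty_ms)  # 430
```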
## Performance
- **RTFx:** 4.32× warm on M-series Mac (5-step ADPM2 sampler, all buckets
pre-warmed).
- **Log-mel cosine vs PyTorch fp32:** 0.9687.
- **ECAPA-TDNN cosine to reference clip:** 0.18 — at the model's
architectural ceiling. PyTorch fp32 itself only reaches 0.29 on the
same metric. Voice-clone fidelity is bounded by StyleTTS2's
architecture, not by this conversion.
## How to use
### Phonemizer
espeak-ng IPA + stress. The 178-token vocabulary in
`constants/text_cleaner_vocab.json` mirrors `text_utils.TextCleaner` from
the upstream repo: `[pad] + punctuation + ASCII letters + IPA letters`.
Pad token is `$` at id 0.
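A minimal sketch of the cleaner-side mapping, assuming the JSON file is a flat
phoneme→id dict (the inline vocabulary and its ids below are a truncated
stand-in for the real 178-token table, not values from the actual file):

```python
# stand-in for json.load(open("constants/text_cleaner_vocab.json")):
# a truncated phoneme->id table with "$" as the pad token at id 0
vocab = {"$": 0, ";": 1, ":": 2, ",": 3, ".": 4, " ": 16,
         "h": 50, "ə": 60, "l": 55, "o": 57, "ʊ": 61}

def clean(phonemes: str, vocab: dict) -> list[int]:
    """Map an espeak-ng IPA string to token ids, skipping symbols
    outside the vocabulary (a simplified mirror of TextCleaner)."""
    return [vocab[ch] for ch in phonemes if ch in vocab]

ids = clean("həloʊ", vocab)  # IPA for "hello"
```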
### Inference shape
```text
text → phonemes → token ids
      │
      ▼
text_predictor (ANE, fp16)
      │ ├─ d_en (1, T_dur, hidden)
      │ ├─ s_pred (1, 256)  (sampler init via diffusion)
      │ └─ duration logits → duration → one-hot alignment matrix (host)
      │
      ▼
diffusion_step × 5 (CPU+GPU, fp16)  (ADPM2 + Karras schedule + CFG)
      │
      ▼
[blend(s, ref_s) + alignment]
      │
      ▼
f0n_energy (ANE, fp16) → F0_curve, N
      │
      ▼
decoder (CPU+GPU, fp32) → 24 kHz waveform
```
The Swift host owns the sampler loop, alignment construction, and bucket
routing. A reference Swift integration is in
[FluidInference/FluidAudio](https://github.com/FluidInference/FluidAudio).
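As a host-side sketch, the Karras sigma schedule driving the 5-step loop can
be written as below (`rho` and the sigma range are illustrative defaults, not
values read from this bundle's `config.json`):

```python
import numpy as np

def karras_schedule(n_steps: int, sigma_min: float = 1e-4,
                    sigma_max: float = 3.0, rho: float = 9.0) -> np.ndarray:
    """Karras et al. noise schedule: interpolate in sigma**(1/rho) space,
    descending from sigma_max to sigma_min, with a trailing zero."""
    ramp = np.linspace(0.0, 1.0, n_steps)
    inv_rho = 1.0 / rho
    sigmas = (sigma_max**inv_rho
              + ramp * (sigma_min**inv_rho - sigma_max**inv_rho)) ** rho
    return np.append(sigmas, 0.0)

sigmas = karras_schedule(5)
# 5 descending noise levels plus the terminal 0; each sigma drives one
# diffusion_step CoreML call inside the host-owned ADPM2 + CFG loop
```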
### Bucket routing
Round each variable-length input up to the next bucket. Pad with zeros.
| Input | Axis | Buckets |
|---|---|---|
| text_predictor `tokens` | T_tok | 32 / 64 / 128 / 256 / 512 |
| diffusion_step `embedding` | T_bert | 512 only (pad) |
| decoder `asr` | T_mel | 256 / 512 / 1024 / 2048 / 4096 |
f0n_energy is shape-flexible.
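The routing rule above, sketched in Python (bucket lists are taken from the
table; zero-padding on the variable axis is assumed):

```python
import numpy as np

TOKEN_BUCKETS = [32, 64, 128, 256, 512]       # text_predictor, T_tok
MEL_BUCKETS = [256, 512, 1024, 2048, 4096]    # decoder, T_mel

def route(x: np.ndarray, buckets: list[int], axis: int = -1) -> np.ndarray:
    """Round the variable-length axis up to the next bucket, zero-padding
    the remainder. Raises StopIteration if no bucket is large enough."""
    length = x.shape[axis]
    bucket = next(b for b in buckets if b >= length)
    pad = [(0, 0)] * x.ndim
    pad[axis] = (0, bucket - length)
    return np.pad(x, pad)

tokens = np.ones((1, 100), dtype=np.int32)
padded = route(tokens, TOKEN_BUCKETS)  # 100 rounds up to the 128 bucket
```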
## Conversion provenance
- **Upstream code:** [yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
- **Upstream weights:** [yl4579/StyleTTS2-LibriTTS](https://huggingface.co/yl4579/StyleTTS2-LibriTTS),
file `Models/LibriTTS/epochs_2nd_00020.pth`
- **Conversion scripts:** [FluidInference/mobius PR #46](https://github.com/FluidInference/mobius/pull/46)
(`models/tts/styletts2/scripts/`)
- **Quantization (evaluated; final artifacts ship fp16 — see the precision
  notes above):** `coremltools.optimize.coreml.linear_quantize_weights`,
  `mode=linear_symmetric`, `dtype=int8`, `granularity=per_channel`,
  `weight_threshold=200_000`
- **Target:** `coremltools` ≥ 8.0, `minimum_deployment_target=iOS17`
(macOS 14+ / iOS 17+)
## Known limitations
- **English (LibriTTS) only.** No multilingual support in this
checkpoint.
- **HiFi-GAN decoder, not iSTFTNet.** LibriTTS upstream uses HiFi-GAN, so
no `torch.stft` / complex tensors in the conversion path.
- **Decoder is fp32, not fp16.** Documented above. The mlpackage size
  reflects this (≈210 MB per bucket).
- **Voice-clone fidelity ceiling is architectural.** ECAPA-TDNN cosine
  to reference clip ≈ 0.18 here, ≈ 0.29 in PyTorch fp32. The same-speaker
  threshold is ~0.30. This isn't a quantization or conversion artifact;
  see PR #46 TRIALS.md Phase 5.
- **No streaming.** Whole utterance only. Add chunked streaming on the
host side if you need it.
## Citation & acknowledgments
- Yinghao Aaron Li et al. — StyleTTS2 architecture and LibriTTS
  checkpoint.
- LibriTTS authors (CC-BY-4.0 training data).
- espeak-ng — phonemization frontend.
```bibtex
@inproceedings{li2023styletts2,
title = {StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
author = {Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay and Mischler, Gavin and Mesgarani, Nima},
booktitle = {NeurIPS},
year = {2023}
}
```