---
license: other
license_name: yl4579-styletts2
license_link: https://github.com/yl4579/StyleTTS2#pre-requisites
language:
- en
library_name: coreml
tags:
- text-to-speech
- styletts2
- coreml
- apple-silicon
- libritts
- on-device
pipeline_tag: text-to-speech
inference: false
---
# StyleTTS2 (LibriTTS) → CoreML
Apple-Silicon-optimized CoreML conversion of the
[yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
LibriTTS multi-speaker checkpoint
([`yl4579/StyleTTS2-LibriTTS` → `Models/LibriTTS/epochs_2nd_00020.pth`](https://huggingface.co/yl4579/StyleTTS2-LibriTTS)).
Four-stage pipeline; per-stage compute-unit placement; fp16 throughout,
except for an fp32 decoder (int8 PTQ on the text-and-prosody predictor was
evaluated and dropped; see the precision notes below).
> [!IMPORTANT]
> **These weights carry use restrictions beyond MIT. Read the License
> section before downloading.** They are not a drop-in permissively-licensed
> TTS model. If you need permissive terms, use
> [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M) instead.
## License & use restrictions
The upstream repository code is MIT, but the pre-trained LibriTTS weights
carry **two non-negotiable restrictions** declared in
[yl4579/StyleTTS2's README](https://github.com/yl4579/StyleTTS2#pre-requisites):
1. **Synthetic-origin disclosure.** Any deployment that produces audio from
these weights must clearly disclose to listeners that the audio is
synthetic. No undisclosed synthetic-speech publishing.
2. **Speaker consent for voice cloning.** Cloning a real person's voice
requires their consent. No unauthorized celebrity / public-figure /
non-consenting third-party voice cloning.
These restrictions ride with the weights through every redistribution,
fine-tune, and downstream derivative. Anyone downloading this repo inherits
them and must propagate them in turn.
If you cannot or will not honor these terms, **do not download these
weights**.
License of record: the [github.com/yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
upstream README as it stood at the time of conversion (see *Conversion
provenance* below for the pinned commit).
## What's in this repo
| Package | Compute unit | Precision | Buckets | Called |
|---|---|---|---|---|
| `styletts2_text_predictor_{32,64,128,256,512}.mlpackage` | ANE | fp16 | 5 token-length | 1× per utterance |
| `styletts2_diffusion_step_512.mlpackage` | CPU+GPU | fp16 | 1 (B=512 only) | ~5× per utterance |
| `styletts2_f0n_energy.mlpackage` | ANE | fp16 | dynamic | 1× per utterance |
| `styletts2_decoder_{256,512,1024,2048,4096}.mlpackage` | CPU+GPU | **fp32** | 5 mel-length | 1× per utterance |
| `constants/text_cleaner_vocab.json` | – | – | – | phoneme→id table |
| `config.json` | – | – | – | bundle runtime contract (audio/sampler/buckets) |
Total on-disk size: ~1.4 GB per artifact format (`.mlpackage` and `.mlmodelc` each).
Both source `.mlpackage` (uncompiled, portable across Xcode versions) and
pre-compiled `.mlmodelc` (Apple Silicon, ready for `MLModel(contentsOf:)`)
are shipped. The `.mlmodelc` artifacts are under `compiled/`. Pick one:
- **`*.mlpackage`**: load via `MLModel(contentsOf:)`; the OS compiles on
  first load (~5–20 s cold start the first time, cached afterward).
- **`compiled/*.mlmodelc`**: already compiled; the same loader path skips the
  on-device compile. Useful for shipping inside an app bundle.
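For quick prototyping outside Swift, the uncompiled packages also load
directly from Python via `coremltools` (a minimal sketch; the bucket file
name follows the table above):

```python
# Minimal Python load sketch (macOS; requires `pip install coremltools`).
import coremltools as ct

# Any bucket from the table above loads the same way.
model = ct.models.MLModel(
    "styletts2_text_predictor_512.mlpackage",
    compute_units=ct.ComputeUnit.ALL,  # let CoreML place ops on ANE/GPU/CPU
)

# Print the I/O contract baked into the package.
spec = model.get_spec()
print([inp.name for inp in spec.description.input])
print([out.name for out in spec.description.output])
```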
The diffusion sampler loop (ADPM2 + Karras schedule + CFG) and the
hard-alignment matrix (cumsum-of-durations → one-hot → matmul) live in your
host application (Swift / Python). Per-step inference is in CoreML; control
flow is not.
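A minimal NumPy sketch of that host-side hard-alignment construction
(the function name and the final matmul shapes are illustrative, not the
bundle's actual I/O names):

```python
import numpy as np

def hard_alignment(durations: np.ndarray) -> np.ndarray:
    """durations: (T_tok,) integer frames per token -> (T_tok, T_mel) one-hot."""
    ends = np.cumsum(durations)        # cumsum-of-durations: end frame per token
    starts = ends - durations          # start frame per token
    frames = np.arange(int(ends[-1]))  # T_mel total frames
    # one-hot rows: token i owns frames [starts[i], ends[i])
    return ((frames >= starts[:, None]) & (frames < ends[:, None])).astype(np.float32)

# matmul step: expand encoder features to frame rate for later stages,
# e.g. asr = d_en[0].T @ hard_alignment(durations)   # shapes illustrative
```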
### Why the precision split looks like this
- **text_predictor is fp16.** Selective int8 PTQ was tried and dropped:
  the ANE exposes no int8 GEMM, so quantized weights dequantize back to
  fp16 on load, and the int8 path saves only ~3 MB of weight bandwidth
  per bucket. The savings did not justify the parity risk on small
  projections.
- **diffusion_step stays fp16.** It runs 5 times per utterance through an
  ODE-style sampler; quantization noise compounds through the iterations.
  Same lesson as PocketTTS issue #7.
- **f0n_energy stays fp16.** ~6 MB. No bandwidth payoff; quantizing
small projections injects audible pitch noise.
- **decoder is fp32, not fp16.** SineGen's harmonic source accumulates
  phase via `cumsum × 2π × hop=300`, reaching magnitudes ~4000
  mid-frame. The fp16 grid spacing at that magnitude (~2–4) is much larger
  than the per-sample increment (~0.05 rad), which scrambles the sine output
  and produces audibly robotic synthesis. fp32 is required end-to-end (see
  the numeric check below).
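A quick numeric check of that decoder claim (illustrative; NumPy only):

```python
import numpy as np

phase = np.float16(4000.0)               # mid-frame phase magnitude
step = np.float16(0.05)                  # per-sample phase increment (rad)
print(np.spacing(phase))                 # 2.0 here; 4.0 once phase passes 4096
print(np.float16(phase + step) - phase)  # 0.0: the increment rounds away in fp16
print(np.float32(4000.0) + np.float32(0.05) - np.float32(4000.0))  # ~0.05 in fp32
```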
### Why only one diffusion bucket
Empirically every observed `bert_dur` fits in B=512. The 32/64/128/256
buckets were dead weight (~192 MB) given the non-linear cost ladder
(B=32 ≈ 66 ms/step, B=512 ≈ 152 ms/step). Dropping them adds at most
~430 ms per utterance in the worst short case (5 steps × (152 − 66) ms).
## Performance
- **RTFx:** 4.32× warm on M-series Mac (5-step ADPM2 sampler, all buckets
  pre-warmed).
- **Log-mel cosine vs PyTorch fp32:** 0.9687.
- **ECAPA-TDNN cosine to reference clip:** 0.18, at the model's
  architectural ceiling. PyTorch fp32 itself only reaches 0.29 on the
  same metric. Voice-clone fidelity is bounded by StyleTTS2's
  architecture, not by this conversion.
## How to use
### Phonemizer
espeak-ng IPA + stress. The 178-token vocabulary in
`constants/text_cleaner_vocab.json` mirrors `text_utils.TextCleaner` from
the upstream repo: `[pad] + punctuation + ASCII letters + IPA letters`.
Pad token is `$` at id 0.
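A hedged front-end sketch in Python (assumes `pip install phonemizer` with
espeak-ng installed, and that the vocab JSON is a flat token→id map with `$`
at 0; verify both against the shipped file):

```python
import json
from phonemizer import phonemize

with open("constants/text_cleaner_vocab.json") as f:
    vocab = json.load(f)  # assumed layout: {"$": 0, ...token: id...}

ipa = phonemize(
    "Printing, in the only sense with which we are at present concerned.",
    language="en-us",
    backend="espeak",
    with_stress=True,            # keep IPA stress marks, per the vocab
    preserve_punctuation=True,   # punctuation is part of the 178-token table
)
token_ids = [vocab[ch] for ch in ipa if ch in vocab]  # drop unmapped chars
```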
### Inference shape
```text
text → phonemes → token ids
  │
  ▼
text_predictor (ANE, fp16)
  │  ├─ d_en (1, T_dur, hidden)
  │  ├─ s_pred (1, 256)   (sampler init via diffusion)
  │  └─ duration logits → duration → one-hot alignment matrix (host)
  │
  ▼
diffusion_step × 5 (CPU+GPU, fp16)   (ADPM2 + Karras schedule + CFG)
  │
  ▼
[blend(s, ref_s) + alignment]
  │
  ▼
f0n_energy (ANE, fp16) → F0_curve, N
  │
  ▼
decoder (CPU+GPU, fp32) → 24 kHz waveform
```
The Swift host owns the sampler loop, alignment construction, and bucket
routing. A reference Swift integration is in
[FluidInference/FluidAudio](https://github.com/FluidInference/FluidAudio).
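For orientation, a hedged sketch of two host-side pieces named above, the
Karras noise schedule and the CFG blend (the sigma bounds, rho, and the full
ADPM2 update are not documented here; treat these values as placeholders and
check them against the conversion scripts):

```python
import numpy as np

def karras_sigmas(n_steps: int = 5, sigma_min: float = 1e-4,
                  sigma_max: float = 3.0, rho: float = 9.0) -> np.ndarray:
    # Karras et al. (2022): interpolate linearly in sigma**(1/rho) space.
    ramp = np.linspace(0.0, 1.0, n_steps)
    inv_rho = 1.0 / rho
    return (sigma_max**inv_rho
            + ramp * (sigma_min**inv_rho - sigma_max**inv_rho)) ** rho

def cfg_blend(v_uncond: np.ndarray, v_cond: np.ndarray, scale: float) -> np.ndarray:
    # classifier-free guidance: extrapolate past the unconditional prediction
    return v_uncond + scale * (v_cond - v_uncond)
```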
### Bucket routing
Round each variable-length input up to the next bucket. Pad with zeros.
| Input | Axis | Buckets |
|---|---|---|
| text_predictor `tokens` | T_tok | 32 / 64 / 128 / 256 / 512 |
| diffusion_step `embedding` | T_bert | 512 only (pad) |
| decoder `asr` | T_mel | 256 / 512 / 1024 / 2048 / 4096 |
f0n_energy is shape-flexible.
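A minimal routing helper under those rules (sketch; bucket tables per above,
raises `StopIteration` if an input exceeds the largest bucket):

```python
import numpy as np

TEXT_BUCKETS = (32, 64, 128, 256, 512)
MEL_BUCKETS = (256, 512, 1024, 2048, 4096)

def route_to_bucket(x: np.ndarray, axis: int, buckets) -> np.ndarray:
    """Round x's length on `axis` up to the next bucket and zero-pad."""
    n = x.shape[axis]
    bucket = next(b for b in buckets if b >= n)  # StopIteration if too long
    pad = [(0, 0)] * x.ndim
    pad[axis] = (0, bucket - n)
    return np.pad(x, pad)  # zero padding by default

# e.g. tokens (1, 173) -> (1, 256); decoder asr padded up to the next mel bucket
```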
## Conversion provenance
- **Upstream code:** [yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
- **Upstream weights:** [yl4579/StyleTTS2-LibriTTS](https://huggingface.co/yl4579/StyleTTS2-LibriTTS),
file `Models/LibriTTS/epochs_2nd_00020.pth`
- **Conversion scripts:** [FluidInference/mobius PR #46](https://github.com/FluidInference/mobius/pull/46)
(`models/tts/styletts2/scripts/`)
- **Quantization:** int8 PTQ was evaluated via
  `coremltools.optimize.coreml.linear_quantize_weights`
  (`mode=linear_symmetric`, `dtype=int8`, `granularity=per_channel`,
  `weight_threshold=200_000`) and ultimately dropped; shipped weights are
  fp16 with an fp32 decoder (see the precision notes above).
- **Target:** `coremltools` ≥ 8.0, `minimum_deployment_target=iOS17`
  (macOS 14+ / iOS 17+)
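For reproducibility, those quantization settings map to roughly this
`coremltools` 8 call (a sketch of the evaluated path, not of the shipped
artifacts; the actual invocation lives in the PR's scripts):

```python
import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

mlmodel = ct.models.MLModel("styletts2_text_predictor_512.mlpackage")
config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(
        mode="linear_symmetric",
        dtype="int8",
        granularity="per_channel",
        weight_threshold=200_000,  # skip tensors smaller than ~200k elements
    )
)
quantized = linear_quantize_weights(mlmodel, config)
```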
## Known limitations
- **English (LibriTTS) only.** No multilingual support in this
checkpoint.
- **HiFi-GAN decoder, not iSTFTNet.** LibriTTS upstream uses HiFi-GAN, so
no `torch.stft` / complex tensors in the conversion path.
- **Decoder is fp32, not fp16.** Documented above. The mlpackage size
  reflects this (≈210 MB per bucket).
- **Voice-clone fidelity ceiling is architectural.** ECAPA-TDNN cosine
  to reference clip ≈ 0.18 here, ≈ 0.29 in PyTorch fp32. The same-speaker
  threshold is ~0.30. This isn't a quantization or conversion artifact;
  see PR #46 TRIALS.md Phase 5.
- **No streaming.** Whole utterance only. Add chunked streaming on the
host side if you need it.
## Citation & acknowledgments
- Yinghao Aaron Li et al.: StyleTTS2 architecture and LibriTTS
  checkpoint.
- LibriTTS authors (CC-BY-4.0 training data).
- espeak-ng: phonemization frontend.
```bibtex
@inproceedings{li2023styletts2,
title = {StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
author = {Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay and Mischler, Gavin and Mesgarani, Nima},
booktitle = {NeurIPS},
year = {2023}
}
```