---
license: other
license_name: yl4579-styletts2
license_link: https://github.com/yl4579/StyleTTS2#pre-requisites
language:
- en
library_name: coreml
tags:
- text-to-speech
- styletts2
- coreml
- apple-silicon
- libritts
- on-device
pipeline_tag: text-to-speech
inference: false
---

# StyleTTS2 (LibriTTS) — CoreML

Apple-Silicon-optimized CoreML conversion of [yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
LibriTTS multi-speaker checkpoint
([`yl4579/StyleTTS2-LibriTTS` → `Models/LibriTTS/epochs_2nd_00020.pth`](https://huggingface.co/yl4579/StyleTTS2-LibriTTS)).

Four-stage pipeline with per-stage compute-unit placement: fp16
text-and-prosody predictor (selective int8 PTQ was evaluated and dropped;
see below), fp16 diffusion step and F0/energy predictor, fp32 decoder.

> [!IMPORTANT]
> **These weights carry use restrictions beyond MIT. Read the License
> section before downloading.** They are not a drop-in permissively-licensed
> TTS model. If you need permissive terms, use
> [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M) instead.

## License & use restrictions

The upstream repository code is MIT, but the pre-trained LibriTTS weights
carry **two non-negotiable restrictions** declared in
[yl4579/StyleTTS2's README](https://github.com/yl4579/StyleTTS2#pre-requisites):

1. **Synthetic-origin disclosure.** Any deployment that produces audio from
   these weights must clearly disclose to listeners that the audio is
   synthetic. No undisclosed synthetic-speech publishing.
2. **Speaker consent for voice cloning.** Cloning a real person's voice
   requires their consent. No unauthorized celebrity / public-figure /
   non-consenting third-party voice cloning.

These restrictions ride with the weights through every redistribution,
fine-tune, and downstream derivative. Anyone downloading this repo inherits
them and must propagate them in turn.

If you cannot or will not honor these terms, **do not download these
weights**.

License-of-record: [github.com/yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
upstream README at the time of conversion (see *Conversion provenance* below
for the pinned commit).

## What's in this repo

| Package | Compute unit | Precision | Buckets | Called |
|---|---|---|---|---|
| `styletts2_text_predictor_{32,64,128,256,512}.mlpackage` | ANE | fp16 | 5 (token length) | 1× per utterance |
| `styletts2_diffusion_step_512.mlpackage` | CPU+GPU | fp16 | 1 (B=512 only) | ~5× per utterance |
| `styletts2_f0n_energy.mlpackage` | ANE | fp16 | dynamic | 1× per utterance |
| `styletts2_decoder_{256,512,1024,2048,4096}.mlpackage` | CPU+GPU | **fp32** | 5 (mel length) | 1× per utterance |
| `constants/text_cleaner_vocab.json` | — | — | — | phoneme→id table |
| `config.json` | — | — | — | bundle runtime contract (audio/sampler/buckets) |

Total on-disk size: ~1.4 GB per format.

Both source `.mlpackage` (uncompiled, portable across Xcode versions) and
pre-compiled `.mlmodelc` (Apple Silicon, ready for `MLModel(contentsOf:)`)
are shipped. The `.mlmodelc` artifacts are under `compiled/`. Pick one:

- **`*.mlpackage`** β€” load via `MLModel(contentsOf:)`; the OS compiles on
  first load (~5–20 s cold start the first time, cached afterward).
- **`compiled/*.mlmodelc`** β€” already compiled; same loader path skips the
  on-device compile. Useful for shipping inside an app bundle.

The diffusion sampler loop (ADPM2 + Karras schedule + CFG) and the
hard-alignment matrix (cumsum-of-durations → one-hot → matmul) live in your
host application (Swift / Python). Per-step inference is in CoreML; control
flow is not.
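
The host-side alignment construction can be sketched in a few lines of
pure Python (names here are illustrative, not the shipped API):

```python
def alignment_matrix(durations):
    """Build a hard one-hot alignment from per-token frame durations.

    durations[i] is the number of mel frames token i occupies.
    Returns a T_tok x T_mel 0/1 matrix: row i is 1 over the frame span
    [cumsum[i-1], cumsum[i]) -- the cumsum -> one-hot step the host
    performs before the matmul that expands token features to frames.
    """
    t_mel = sum(durations)
    matrix = [[0] * t_mel for _ in durations]
    frame = 0
    for i, d in enumerate(durations):
        for _ in range(d):
            matrix[i][frame] = 1
            frame += 1
    return matrix
```

Multiplying `d_en` by this matrix then expands each token's features across
its frame span.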

### Why the precision split looks like this

- **text_predictor is fp16.** Selective int8 PTQ was tried and dropped:
  on the Apple Silicon ANE, int8 saves only ~3 MB of weight bandwidth per
  bucket, there is no exposed int8 GEMM, and the weights dequantize back
  to fp16 at load time. The savings did not justify the parity risk on
  the model's small projections.
- **diffusion_step stays fp16.** It runs 5 times per utterance through an
  ODE-style sampler; quantization noise compounds through iterations.
  Same lesson as PocketTTS issue #7.
- **f0n_energy stays fp16.** ~6 MB. No bandwidth payoff; quantizing
  small projections injects audible pitch noise.
- **decoder is fp32, not fp16.** SineGen's harmonic source accumulates
  phase via `cumsum × 2π × hop=300`, reaching magnitudes of ~4000
  mid-frame. The fp16 spacing between representable values at that
  magnitude (~4) dwarfs the per-sample increment (~0.05 rad), which
  scrambles the sine output and produces audibly robotic synthesis.
  fp32 is required end-to-end.
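
The magnitude argument is easy to check with Python's half-precision
`struct` format (the 4000 / 0.05 values are the illustrative figures from
above, not measured tensors):

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a double through IEEE half precision."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

phase = 4000.0  # accumulated phase magnitude mid-frame
step = 0.05     # per-sample phase increment (radians)

# Representable fp16 values near 4000 are spaced ~2-4 apart,
# so the 0.05 increment is rounded away entirely:
assert to_fp16(phase + step) == to_fp16(phase)
```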

### Why only one diffusion bucket

Empirically every observed `bert_dur` fits in B=512. The 32/64/128/256
buckets were dead weight (~192 MB) given the sub-linear cost ladder
(B=32 ≈ 66 ms/step, B=512 ≈ 152 ms/step). Dropping them adds at most
~430 ms per utterance in the worst short case.
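
The ~430 ms worst case follows directly from the per-step costs above:

```python
# Worst case: an utterance that would have fit B=32 now runs the
# B=512 diffusion step for all 5 sampler steps.
ms_per_step_b32, ms_per_step_b512, steps = 66, 152, 5
extra_ms = (ms_per_step_b512 - ms_per_step_b32) * steps  # 430
```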

## Performance

- **RTFx:** 4.32× warm on an M-series Mac (5-step ADPM2 sampler, all
  buckets pre-warmed).
- **Log-mel cosine vs PyTorch fp32:** 0.9687.
- **ECAPA-TDNN cosine to reference clip:** 0.18 — at the model's
  architectural ceiling. PyTorch fp32 itself only reaches 0.29 on the
  same metric. Voice-clone fidelity is bounded by StyleTTS2's
  architecture, not by this conversion.

## How to use

### Phonemizer

espeak-ng IPA + stress. The 178-token vocabulary in
`constants/text_cleaner_vocab.json` mirrors `text_utils.TextCleaner` from
the upstream repo: `[pad] + punctuation + ASCII letters + IPA letters`.

Pad token is `$` at id 0.
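
A minimal sketch of the TextCleaner-style lookup, using a toy inline vocab
as a stand-in for `constants/text_cleaner_vocab.json` (the real file maps
all 178 symbols):

```python
import json

# Toy stand-in for constants/text_cleaner_vocab.json:
# [pad] "$" at id 0, then punctuation, ASCII letters, IPA letters.
vocab_json = '{"$": 0, ";": 1, "a": 2, "b": 3, "\\u02c8": 4}'
vocab = json.loads(vocab_json)

def clean(phonemes: str) -> list[int]:
    """Map a phoneme string to token ids, skipping unknown symbols."""
    return [vocab[ch] for ch in phonemes if ch in vocab]

ids = clean("ab;")  # [2, 3, 1]
```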

### Inference shape

```text
text → phonemes → token ids
                     │
                     ▼
text_predictor (ANE, fp16)
   │   ├─ d_en (1, T_dur, hidden)
   │   ├─ s_pred (1, 256)             (sampler init via diffusion)
   │   └─ duration logits → duration → one-hot alignment matrix (host)
   │
   ▼
diffusion_step × 5  (CPU+GPU, fp16)   (ADPM2 + Karras schedule + CFG)
   │
   ▼
[blend(s, ref_s) + alignment]
   │
   ▼
f0n_energy (ANE, fp16) → F0_curve, N
   │
   ▼
decoder (CPU+GPU, fp32) → 24 kHz waveform
```

The Swift host owns the sampler loop, alignment construction, and bucket
routing. A reference Swift integration is in
[FluidInference/FluidAudio](https://github.com/FluidInference/FluidAudio).
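
The sampler loop the host owns centers on a Karras noise schedule. A
minimal pure-Python sketch of that schedule (the `sigma_min`, `sigma_max`,
and `rho` values here are placeholder assumptions, not the shipped config):

```python
def karras_sigmas(n: int, sigma_min: float = 0.05,
                  sigma_max: float = 3.0, rho: float = 9.0) -> list[float]:
    """Karras et al. schedule: rho-warped interpolation from sigma_max
    down to sigma_min, so early steps take large denoising strides."""
    lo, hi = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    return [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)]

sigmas = karras_sigmas(5)
# The host then makes one diffusion_step CoreML call per sigma, applying
# CFG by blending the conditional and unconditional outputs each step.
```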

### Bucket routing

Round each variable-length input up to the next bucket. Pad with zeros.

| Input | Axis | Buckets |
|---|---|---|
| text_predictor `tokens` | T_tok | 32 / 64 / 128 / 256 / 512 |
| diffusion_step `embedding` | T_bert | 512 only (pad) |
| decoder `asr` | T_mel | 256 / 512 / 1024 / 2048 / 4096 |

f0n_energy is shape-flexible.
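
The routing rule above ("round up to the next bucket, pad with zeros") in
sketch form (function names are illustrative):

```python
def route(length: int, buckets: list[int]) -> int:
    """Round a variable-length input up to the smallest bucket that fits."""
    for b in buckets:
        if length <= b:
            return b
    raise ValueError(f"length {length} exceeds largest bucket {buckets[-1]}")

def pad(ids: list[int], bucket: int, pad_id: int = 0) -> list[int]:
    """Zero-pad token ids out to the bucket length."""
    return ids + [pad_id] * (bucket - len(ids))

TOKEN_BUCKETS = [32, 64, 128, 256, 512]
bucket = route(90, TOKEN_BUCKETS)       # 128
padded = pad(list(range(90)), bucket)   # length 128, zero tail
```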

## Conversion provenance

- **Upstream code:** [yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
- **Upstream weights:** [yl4579/StyleTTS2-LibriTTS](https://huggingface.co/yl4579/StyleTTS2-LibriTTS),
  file `Models/LibriTTS/epochs_2nd_00020.pth`
- **Conversion scripts:** [FluidInference/mobius PR #46](https://github.com/FluidInference/mobius/pull/46)
  (`models/tts/styletts2/scripts/`)
- **Quantization:** `coremltools.optimize.coreml.linear_quantize_weights`,
  `mode=linear_symmetric`, `dtype=int8`, `granularity=per_channel`,
  `weight_threshold=200_000`
- **Target:** `coremltools` ≥ 8.0, `minimum_deployment_target=iOS17`
  (macOS 14+ / iOS 17+)
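
The quantization settings listed above correspond to a `coremltools`
optimize-API call along these lines (a configuration sketch against the
coremltools 8 API, not the exact conversion script from PR #46):

```python
import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig, OptimizationConfig, linear_quantize_weights,
)

config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(
        mode="linear_symmetric",
        dtype="int8",
        granularity="per_channel",
        weight_threshold=200_000,  # leave small tensors in fp16
    )
)
# model = ct.models.MLModel("styletts2_text_predictor_512.mlpackage")
# quantized = linear_quantize_weights(model, config)
```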

## Known limitations

- **English (LibriTTS) only.** No multilingual support in this
  checkpoint.
- **HiFi-GAN decoder, not iSTFTNet.** LibriTTS upstream uses HiFi-GAN, so
  no `torch.stft` / complex tensors in the conversion path.
- **Decoder is fp32, not fp16.** Documented above. The mlpackage size
  reflects this (≈210 MB per bucket).
- **Voice-clone fidelity ceiling is architectural.** ECAPA-TDNN cosine
  to reference clip ≈ 0.18 here, ≈ 0.29 in PyTorch fp32. The same-speaker
  threshold is ~0.30. This isn't a quantization or conversion artifact;
  see PR #46 TRIALS.md Phase 5.
- **No streaming.** Whole utterance only. Add chunked streaming on the
  host side if you need it.

## Citation & acknowledgments

- Yinghao Aaron Li et al. — StyleTTS2 architecture and LibriTTS
  checkpoint.
- LibriTTS authors (CC-BY-4.0 training data).
- espeak-ng — phonemization frontend.

```bibtex
@inproceedings{li2023styletts2,
  title  = {StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author = {Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay and Mischler, Gavin and Mesgarani, Nima},
  booktitle = {NeurIPS},
  year   = {2023}
}
```