---
license: other
license_name: yl4579-styletts2
license_link: https://github.com/yl4579/StyleTTS2#pre-requisites
language:
- en
library_name: coreml
tags:
- text-to-speech
- styletts2
- coreml
- apple-silicon
- libritts
- on-device
pipeline_tag: text-to-speech
inference: false
---
# StyleTTS2 (LibriTTS) → CoreML
Apple-Silicon-optimized CoreML conversion of the
[yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
LibriTTS multi-speaker checkpoint
([`yl4579/StyleTTS2-LibriTTS` → `Models/LibriTTS/epochs_2nd_00020.pth`](https://huggingface.co/yl4579/StyleTTS2-LibriTTS)).
Four-stage pipeline; per-stage compute-unit placement; fp16 throughout,
except for an fp32 decoder (int8 PTQ on the text-and-prosody predictor was
evaluated and dropped; see the precision notes below).
> [!IMPORTANT]
> **These weights carry use restrictions beyond MIT. Read the License
> section before downloading.** They are not a drop-in permissively-licensed
> TTS model. If you need permissive terms, use
> [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M) instead.
## License & use restrictions
The upstream repository code is MIT, but the pre-trained LibriTTS weights
carry **two non-negotiable restrictions** declared in
[yl4579/StyleTTS2's README](https://github.com/yl4579/StyleTTS2#pre-requisites):
1. **Synthetic-origin disclosure.** Any deployment that produces audio from
these weights must clearly disclose to listeners that the audio is
synthetic. No undisclosed synthetic-speech publishing.
2. **Speaker consent for voice cloning.** Cloning a real person's voice
requires their consent. No unauthorized celebrity / public-figure /
non-consenting third-party voice cloning.
These restrictions ride with the weights through every redistribution,
fine-tune, and downstream derivative. Anyone downloading this repo inherits
them and must propagate them in turn.
If you cannot or will not honor these terms, **do not download these
weights**.
License of record: the [github.com/yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
upstream README as it stood at the time of conversion (see *Conversion
provenance* below for the pinned commit).
## What's in this repo
| Package | Compute unit | Precision | Buckets | Called |
|---|---|---|---|---|
| `styletts2_text_predictor_{32,64,128,256,512}.mlpackage` | ANE | fp16 | 5 token-length | 1× per utterance |
| `styletts2_diffusion_step_512.mlpackage` | CPU+GPU | fp16 | 1 (B=512 only) | ~5× per utterance |
| `styletts2_f0n_energy.mlpackage` | ANE | fp16 | dynamic | 1× per utterance |
| `styletts2_decoder_{256,512,1024,2048,4096}.mlpackage` | CPU+GPU | **fp32** | 5 mel-length | 1× per utterance |
| `constants/text_cleaner_vocab.json` | – | – | – | phoneme→id table |
| `config.json` | – | – | – | bundle runtime contract (audio/sampler/buckets) |
Total on-disk size: ~1.4 GB per artifact format (`.mlpackage` and `.mlmodelc` each).
Both source `.mlpackage` (uncompiled, portable across Xcode versions) and
pre-compiled `.mlmodelc` (Apple Silicon, ready for `MLModel(contentsOf:)`)
are shipped. The `.mlmodelc` artifacts are under `compiled/`. Pick one:
- **`*.mlpackage`**: load via `MLModel(contentsOf:)`; the OS compiles on
  first load (~5–20 s cold start the first time, cached afterward).
- **`compiled/*.mlmodelc`**: already compiled; the same loader path skips the
  on-device compile. Useful for shipping inside an app bundle.
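For quick prototyping outside Swift, the uncompiled packages also load
directly from Python via `coremltools` (a minimal sketch; the bucket file
name follows the table above):

```python
# Minimal Python load sketch (macOS; requires `pip install coremltools`).
import coremltools as ct

# Any bucket from the table above loads the same way.
model = ct.models.MLModel(
    "styletts2_text_predictor_512.mlpackage",
    compute_units=ct.ComputeUnit.ALL,  # let CoreML place ops on ANE/GPU/CPU
)

# Print the I/O contract baked into the package.
spec = model.get_spec()
print([inp.name for inp in spec.description.input])
print([out.name for out in spec.description.output])
```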
The diffusion sampler loop (ADPM2 + Karras schedule + CFG) and the
hard-alignment matrix (cumsum-of-durations → one-hot → matmul) live in your
host application (Swift / Python). Per-step inference is in CoreML; control
flow is not.
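A minimal NumPy sketch of that host-side hard-alignment construction
(the function name and the final matmul shapes are illustrative, not the
bundle's actual I/O names):

```python
import numpy as np

def hard_alignment(durations: np.ndarray) -> np.ndarray:
    """durations: (T_tok,) integer frames per token -> (T_tok, T_mel) one-hot."""
    ends = np.cumsum(durations)        # cumsum-of-durations: end frame per token
    starts = ends - durations          # start frame per token
    frames = np.arange(int(ends[-1]))  # T_mel total frames
    # one-hot rows: token i owns frames [starts[i], ends[i])
    return ((frames >= starts[:, None]) & (frames < ends[:, None])).astype(np.float32)

# matmul step: expand encoder features to frame rate for later stages,
# e.g. asr = d_en[0].T @ hard_alignment(durations)   # shapes illustrative
```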
### Why the precision split looks like this
- **text_predictor is fp16.** Selective int8 PTQ was tried and dropped:
  the ANE exposes no int8 GEMM, so quantized weights dequantize back to
  fp16 on load, and the int8 path saves only ~3 MB of weight bandwidth
  per bucket. The savings did not justify the parity risk on small
  projections.
- **diffusion_step stays fp16.** It runs 5 times per utterance through an
  ODE-style sampler; quantization noise compounds through the iterations.
  Same lesson as PocketTTS issue #7.
- **f0n_energy stays fp16.** ~6 MB. No bandwidth payoff; quantizing
small projections injects audible pitch noise.
- **decoder is fp32, not fp16.** SineGen's harmonic source accumulates
  phase via `cumsum × 2π × hop=300`, reaching magnitudes ~4000
  mid-frame. The fp16 grid spacing at that magnitude (~2–4) is much larger
  than the per-sample increment (~0.05 rad), which scrambles the sine output
  and produces audibly robotic synthesis. fp32 is required end-to-end (see
  the numeric check below).
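A quick numeric check of that decoder claim (illustrative; NumPy only):

```python
import numpy as np

phase = np.float16(4000.0)               # mid-frame phase magnitude
step = np.float16(0.05)                  # per-sample phase increment (rad)
print(np.spacing(phase))                 # 2.0 here; 4.0 once phase passes 4096
print(np.float16(phase + step) - phase)  # 0.0: the increment rounds away in fp16
print(np.float32(4000.0) + np.float32(0.05) - np.float32(4000.0))  # ~0.05 in fp32
```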
### Why only one diffusion bucket
Empirically every observed `bert_dur` fits in B=512. The 32/64/128/256
buckets were dead weight (~192 MB) given the non-linear cost ladder
(B=32 ≈ 66 ms/step, B=512 ≈ 152 ms/step). Dropping them adds at most
~430 ms per utterance in the worst short case (5 steps × (152 − 66) ms).
## Performance
- **RTFx:** 4.32× warm on M-series Mac (5-step ADPM2 sampler, all buckets
  pre-warmed).
- **Log-mel cosine vs PyTorch fp32:** 0.9687.
- **ECAPA-TDNN cosine to reference clip:** 0.18, at the model's
  architectural ceiling. PyTorch fp32 itself only reaches 0.29 on the
  same metric. Voice-clone fidelity is bounded by StyleTTS2's
  architecture, not by this conversion.
## How to use
### Phonemizer
espeak-ng IPA + stress. The 178-token vocabulary in
`constants/text_cleaner_vocab.json` mirrors `text_utils.TextCleaner` from
the upstream repo: `[pad] + punctuation + ASCII letters + IPA letters`.
Pad token is `$` at id 0.
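A hedged front-end sketch in Python (assumes `pip install phonemizer` with
espeak-ng installed, and that the vocab JSON is a flat token→id map with `$`
at 0; verify both against the shipped file):

```python
import json
from phonemizer import phonemize

with open("constants/text_cleaner_vocab.json") as f:
    vocab = json.load(f)  # assumed layout: {"$": 0, ...token: id...}

ipa = phonemize(
    "Printing, in the only sense with which we are at present concerned.",
    language="en-us",
    backend="espeak",
    with_stress=True,            # keep IPA stress marks, per the vocab
    preserve_punctuation=True,   # punctuation is part of the 178-token table
)
token_ids = [vocab[ch] for ch in ipa if ch in vocab]  # drop unmapped chars
```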
### Inference shape
```text
text → phonemes → token ids
  │
  ▼
text_predictor (ANE, fp16)
  │  ├─ d_en (1, T_dur, hidden)
  │  ├─ s_pred (1, 256)   (sampler init via diffusion)
  │  └─ duration logits → duration → one-hot alignment matrix (host)
  │
  ▼
diffusion_step × 5 (CPU+GPU, fp16)   (ADPM2 + Karras schedule + CFG)
  │
  ▼
[blend(s, ref_s) + alignment]
  │
  ▼
f0n_energy (ANE, fp16) → F0_curve, N
  │
  ▼
decoder (CPU+GPU, fp32) → 24 kHz waveform
```
The Swift host owns the sampler loop, alignment construction, and bucket
routing. A reference Swift integration is in
[FluidInference/FluidAudio](https://github.com/FluidInference/FluidAudio).
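For orientation, a hedged sketch of two host-side pieces named above, the
Karras noise schedule and the CFG blend (the sigma bounds, rho, and the full
ADPM2 update are not documented here; treat these values as placeholders and
check them against the conversion scripts):

```python
import numpy as np

def karras_sigmas(n_steps: int = 5, sigma_min: float = 1e-4,
                  sigma_max: float = 3.0, rho: float = 9.0) -> np.ndarray:
    # Karras et al. (2022): interpolate linearly in sigma**(1/rho) space.
    ramp = np.linspace(0.0, 1.0, n_steps)
    inv_rho = 1.0 / rho
    return (sigma_max**inv_rho
            + ramp * (sigma_min**inv_rho - sigma_max**inv_rho)) ** rho

def cfg_blend(v_uncond: np.ndarray, v_cond: np.ndarray, scale: float) -> np.ndarray:
    # classifier-free guidance: extrapolate past the unconditional prediction
    return v_uncond + scale * (v_cond - v_uncond)
```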
### Bucket routing
Round each variable-length input up to the next bucket. Pad with zeros.
| Input | Axis | Buckets |
|---|---|---|
| text_predictor `tokens` | T_tok | 32 / 64 / 128 / 256 / 512 |
| diffusion_step `embedding` | T_bert | 512 only (pad) |
| decoder `asr` | T_mel | 256 / 512 / 1024 / 2048 / 4096 |
f0n_energy is shape-flexible.
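A minimal routing helper under those rules (sketch; bucket tables per above,
raises `StopIteration` if an input exceeds the largest bucket):

```python
import numpy as np

TEXT_BUCKETS = (32, 64, 128, 256, 512)
MEL_BUCKETS = (256, 512, 1024, 2048, 4096)

def route_to_bucket(x: np.ndarray, axis: int, buckets) -> np.ndarray:
    """Round x's length on `axis` up to the next bucket and zero-pad."""
    n = x.shape[axis]
    bucket = next(b for b in buckets if b >= n)  # StopIteration if too long
    pad = [(0, 0)] * x.ndim
    pad[axis] = (0, bucket - n)
    return np.pad(x, pad)  # zero padding by default

# e.g. tokens (1, 173) -> (1, 256); decoder asr padded up to the next mel bucket
```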
## Conversion provenance
- **Upstream code:** [yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
- **Upstream weights:** [yl4579/StyleTTS2-LibriTTS](https://huggingface.co/yl4579/StyleTTS2-LibriTTS),
file `Models/LibriTTS/epochs_2nd_00020.pth`
- **Conversion scripts:** [FluidInference/mobius PR #46](https://github.com/FluidInference/mobius/pull/46)
(`models/tts/styletts2/scripts/`)
- **Quantization:** int8 PTQ was evaluated via
  `coremltools.optimize.coreml.linear_quantize_weights`
  (`mode=linear_symmetric`, `dtype=int8`, `granularity=per_channel`,
  `weight_threshold=200_000`) and ultimately dropped; shipped weights are
  fp16 with an fp32 decoder (see the precision notes above).
- **Target:** `coremltools` ≥ 8.0, `minimum_deployment_target=iOS17`
  (macOS 14+ / iOS 17+)
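For reproducibility, those quantization settings map to roughly this
`coremltools` 8 call (a sketch of the evaluated path, not of the shipped
artifacts; the actual invocation lives in the PR's scripts):

```python
import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

mlmodel = ct.models.MLModel("styletts2_text_predictor_512.mlpackage")
config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(
        mode="linear_symmetric",
        dtype="int8",
        granularity="per_channel",
        weight_threshold=200_000,  # skip tensors smaller than ~200k elements
    )
)
quantized = linear_quantize_weights(mlmodel, config)
```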
## Known limitations
- **English (LibriTTS) only.** No multilingual support in this
checkpoint.
- **HiFi-GAN decoder, not iSTFTNet.** LibriTTS upstream uses HiFi-GAN, so
no `torch.stft` / complex tensors in the conversion path.
- **Decoder is fp32, not fp16.** Documented above. The mlpackage size
  reflects this (≈210 MB per bucket).
- **Voice-clone fidelity ceiling is architectural.** ECAPA-TDNN cosine
  to reference clip ≈ 0.18 here, ≈ 0.29 in PyTorch fp32. The same-speaker
  threshold is ~0.30. This isn't a quantization or conversion artifact;
  see PR #46 TRIALS.md Phase 5.
- **No streaming.** Whole utterance only. Add chunked streaming on the
host side if you need it.
## Citation & acknowledgments
- Yinghao Aaron Li et al.: StyleTTS2 architecture and LibriTTS
  checkpoint.
- LibriTTS authors (CC-BY-4.0 training data).
- espeak-ng: phonemization frontend.
```bibtex
@inproceedings{li2023styletts2,
title = {StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
author = {Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay and Mischler, Gavin and Mesgarani, Nima},
booktitle = {NeurIPS},
year = {2023}
}
```