---
license: other
license_name: yl4579-styletts2
license_link: https://github.com/yl4579/StyleTTS2#pre-requisites
language:
- en
library_name: coreml
tags:
- text-to-speech
- styletts2
- coreml
- apple-silicon
- libritts
- on-device
pipeline_tag: text-to-speech
inference: false
---

# StyleTTS2 (LibriTTS) → CoreML

Apple-Silicon-optimized CoreML conversion of the [yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
LibriTTS multi-speaker checkpoint
([`yl4579/StyleTTS2-LibriTTS` → `Models/LibriTTS/epochs_2nd_00020.pth`](https://huggingface.co/yl4579/StyleTTS2-LibriTTS)).

Four-stage pipeline; per-stage compute-unit placement; fp16 throughout with
an fp32 decoder. Selective int8 PTQ on the text-and-prosody predictor was
evaluated and dropped (see *Why the precision split looks like this*).

> [!IMPORTANT]
> **These weights carry use restrictions beyond MIT. Read the License
> section before downloading.** They are not a drop-in permissively-licensed
> TTS model. If you need permissive terms, use
> [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M) instead.

## License & use restrictions

The upstream repository code is MIT, but the pre-trained LibriTTS weights
carry **two non-negotiable restrictions** declared in
[yl4579/StyleTTS2's README](https://github.com/yl4579/StyleTTS2#pre-requisites):

1. **Synthetic-origin disclosure.** Any deployment that produces audio from
   these weights must clearly disclose to listeners that the audio is
   synthetic. No undisclosed synthetic-speech publishing.
2. **Speaker consent for voice cloning.** Cloning a real person's voice
   requires their consent. No unauthorized celebrity / public-figure /
   non-consenting third-party voice cloning.

These restrictions ride with the weights through every redistribution,
fine-tune, and downstream derivative. Anyone downloading this repo inherits
them and must propagate them in turn.

If you cannot or will not honor these terms, **do not download these
weights**.

License of record: the [github.com/yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
upstream README at the time of conversion (see *Conversion provenance* below
for the pinned commit).

## What's in this repo

| Package | Compute unit | Precision | Buckets | Called |
|---|---|---|---|---|
| `styletts2_text_predictor_{32,64,128,256,512}.mlpackage` | ANE | fp16 | 5 (token length) | 1× per utterance |
| `styletts2_diffusion_step_512.mlpackage` | CPU+GPU | fp16 | 1 (B=512 only) | ~5× per utterance |
| `styletts2_f0n_energy.mlpackage` | ANE | fp16 | dynamic | 1× per utterance |
| `styletts2_decoder_{256,512,1024,2048,4096}.mlpackage` | CPU+GPU | **fp32** | 5 (mel length) | 1× per utterance |
| `constants/text_cleaner_vocab.json` | – | – | – | phoneme→id table |
| `config.json` | – | – | – | bundle runtime contract (audio/sampler/buckets) |

Total on-disk size: ~1.4 GB per format.

Both source `.mlpackage` (uncompiled, portable across Xcode versions) and
pre-compiled `.mlmodelc` (Apple Silicon, ready for `MLModel(contentsOf:)`)
are shipped. The `.mlmodelc` artifacts are under `compiled/`. Pick one:

- **`*.mlpackage`**: load via `MLModel(contentsOf:)`; the OS compiles on
  first load (~5–20 s cold start the first time, cached afterward).
- **`compiled/*.mlmodelc`**: already compiled; the same loader path skips
  the on-device compile. Useful for shipping inside an app bundle.

The diffusion sampler loop (ADPM2 + Karras schedule + CFG) and the
hard-alignment matrix (cumsum of durations → one-hot → matmul) live in your
host application (Swift / Python). Per-step inference is in CoreML; control
flow is not.
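
The host-side alignment construction can be sketched in a few lines of NumPy. This is an illustration of the cumsum-of-durations technique, not the FluidAudio implementation; `build_alignment` and the toy durations are hypothetical:

```python
import numpy as np

def build_alignment(durations: np.ndarray, t_mel: int) -> np.ndarray:
    """Hard alignment (T_tok, T_mel): token i owns frames [start_i, end_i)."""
    ends = np.cumsum(durations)                 # cumulative frame boundaries
    starts = ends - durations
    frames = np.arange(t_mel)
    # one-hot per frame: frame f belongs to token i iff start_i <= f < end_i
    aln = (frames >= starts[:, None]) & (frames < ends[:, None])
    return aln.astype(np.float32)

durations = np.array([2, 3, 1])                 # predicted frames per token
aln = build_alignment(durations, t_mel=6)       # (3, 6), exactly one 1 per column
d_en = np.ones((3, 8), dtype=np.float32)        # stand-in for d_en (T_tok, hidden)
asr = aln.T @ d_en                              # expand to per-frame features (6, 8)
```

The matmul at the end is what upsamples per-token encoder features to per-frame features for the decoder.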

### Why the precision split looks like this

- **text_predictor is fp16.** Selective int8 PTQ was tried and dropped:
  on the Apple Silicon ANE, the int8 path saves only ~3 MB of weight
  bandwidth per bucket, has no exposed int8 GEMM, and dequantizes back to
  fp16 on load. The savings did not justify the parity risk on small
  projections.
- **diffusion_step stays fp16.** It runs 5 times per utterance through an
  ODE-style sampler; quantization noise compounds through iterations.
  Same lesson as PocketTTS issue #7.
- **f0n_energy stays fp16.** ~6 MB. No bandwidth payoff; quantizing
  small projections injects audible pitch noise.
- **decoder is fp32, not fp16.** SineGen's harmonic source accumulates
  phase via `cumsum × 2π × hop=300`, reaching magnitudes of ~4000
  mid-frame. The fp16 ulp at that magnitude (~4) is much larger than
  the per-sample increment (~0.05 rad), which scrambles the sine output
  and produces audibly robotic synthesis. fp32 is required end-to-end.
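
The fp16 argument is easy to verify numerically with NumPy half precision:

```python
import numpy as np

# ulp (gap between adjacent representable fp16 values) near SineGen's phase magnitude
phase = np.float16(4096.0)
step = np.float16(0.05)                  # per-sample phase increment from the text

print(np.spacing(phase))                 # 4.0: the fp16 grid is coarser than the step
print(phase + step == phase)             # True: the phase accumulator stops advancing
```

Once the accumulated phase exceeds ~2048, every sub-ulp increment rounds away and the sine argument freezes or jumps in steps, hence the fp32 requirement.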

### Why only one diffusion bucket

Empirically, every observed `bert_dur` fits in B=512. The 32/64/128/256
buckets were dead weight (~192 MB) given the non-linear cost ladder
(B=32 → 66 ms/step, B=512 → 152 ms/step). Dropping them adds at most
~430 ms per utterance in the worst short case.
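
The worst-case figure follows directly from the benchmark numbers above:

```python
# Extra latency for a short utterance forced through the B=512 diffusion
# bucket instead of a (removed) B=32 bucket, over the 5 sampler steps.
steps = 5
ms_b32, ms_b512 = 66, 152            # ms/step from the cost ladder above
extra_ms = steps * (ms_b512 - ms_b32)
print(extra_ms)                      # 430
```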
## Performance

- **RTFx:** 4.32× warm on an M-series Mac (5-step ADPM2 sampler, all buckets
  pre-warmed).
- **Log-mel cosine vs PyTorch fp32:** 0.9687.
- **ECAPA-TDNN cosine to reference clip:** 0.18. PyTorch fp32 itself only
  reaches 0.29 on the same metric, so voice-clone fidelity is bounded by
  StyleTTS2's architecture, not by this conversion.
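
RTFx here is the usual real-time factor, audio duration over wall-clock synthesis time; a quick sanity check with an illustrative utterance length:

```python
# RTFx = seconds of audio produced per second of synthesis wall-clock time
audio_s = 10.0                       # illustrative 10 s utterance
synth_s = audio_s / 4.32             # implied warm synthesis time at RTFx 4.32
print(round(synth_s, 2))             # 2.31: ~2.3 s to render 10 s of speech
```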

## How to use

### Phonemizer

espeak-ng IPA + stress. The 178-token vocabulary in
`constants/text_cleaner_vocab.json` mirrors `text_utils.TextCleaner` from
the upstream repo: `[pad] + punctuation + ASCII letters + IPA letters`.

The pad token is `$` at id 0.

### Inference shape

```text
text → phonemes → token ids
        │
        ▼
text_predictor (ANE, fp16)
        │  ├─ d_en (1, T_dur, hidden)
        │  ├─ s_pred (1, 256)  (sampler init via diffusion)
        │  └─ duration logits → duration → one-hot alignment matrix (host)
        │
        ▼
diffusion_step × 5 (CPU+GPU, fp16)   (ADPM2 + Karras schedule + CFG)
        │
        ▼
[blend(s, ref_s) + alignment]
        │
        ▼
f0n_energy (ANE, fp16) → F0_curve, N
        │
        ▼
decoder (CPU+GPU, fp32) → 24 kHz waveform
```

The Swift host owns the sampler loop, alignment construction, and bucket
routing. A reference Swift integration is in
[FluidInference/FluidAudio](https://github.com/FluidInference/FluidAudio).
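
For orientation, the two sampler pieces the host owns look roughly like this. A sketch only: the real sigma range, guidance weight, and the ADPM2 step rule come from `config.json` and the host implementation, and the values below are illustrative:

```python
import numpy as np

def karras_sigmas(n: int, sigma_min: float, sigma_max: float, rho: float = 7.0) -> np.ndarray:
    """Karras-style noise schedule: n sigmas from sigma_max down to sigma_min."""
    ramp = np.linspace(0.0, 1.0, n)
    inv_max, inv_min = sigma_max ** (1 / rho), sigma_min ** (1 / rho)
    return (inv_max + ramp * (inv_min - inv_max)) ** rho

def cfg(out_cond: np.ndarray, out_uncond: np.ndarray, w: float) -> np.ndarray:
    """Classifier-free guidance: blend conditional/unconditional step outputs."""
    return out_uncond + w * (out_cond - out_uncond)

sigmas = karras_sigmas(5, 1e-4, 3.0)   # one sigma per diffusion_step call
```

Each of the 5 sampler iterations runs the CoreML `diffusion_step` twice (conditional and unconditional), blends with `cfg`, and advances along `sigmas`.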

### Bucket routing

Round each variable-length input up to the next bucket. Pad with zeros.

| Input | Axis | Buckets |
|---|---|---|
| text_predictor `tokens` | T_tok | 32 / 64 / 128 / 256 / 512 |
| diffusion_step `embedding` | T_bert | 512 only (pad) |
| decoder `asr` | T_mel | 256 / 512 / 1024 / 2048 / 4096 |

f0n_energy is shape-flexible.
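
The routing rule (smallest bucket that fits, zero-pad the remainder) is a few lines. A NumPy sketch; `route` and the input names are hypothetical:

```python
import numpy as np

BUCKETS = {
    "tokens": [32, 64, 128, 256, 512],        # text_predictor, axis T_tok
    "asr": [256, 512, 1024, 2048, 4096],      # decoder, axis T_mel
}

def route(x: np.ndarray, kind: str, axis: int = -1) -> np.ndarray:
    """Zero-pad x along `axis` up to the smallest bucket that fits it."""
    n = x.shape[axis]
    bucket = next(b for b in BUCKETS[kind] if b >= n)
    pad = [(0, 0)] * x.ndim
    pad[axis] = (0, bucket - n)
    return np.pad(x, pad)

tokens = np.ones((1, 90))
print(route(tokens, "tokens").shape)          # (1, 128)
```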

## Conversion provenance

- **Upstream code:** [yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
- **Upstream weights:** [yl4579/StyleTTS2-LibriTTS](https://huggingface.co/yl4579/StyleTTS2-LibriTTS),
  file `Models/LibriTTS/epochs_2nd_00020.pth`
- **Conversion scripts:** [FluidInference/mobius PR #46](https://github.com/FluidInference/mobius/pull/46)
  (`models/tts/styletts2/scripts/`)
- **Quantization (evaluated during conversion; shipped packages are fp16/fp32):**
  `coremltools.optimize.coreml.linear_quantize_weights`,
  `mode=linear_symmetric`, `dtype=int8`, `granularity=per_channel`,
  `weight_threshold=200_000`
- **Target:** `coremltools` ≥ 8.0, `minimum_deployment_target=iOS17`
  (macOS 14+ / iOS 17+)

## Known limitations

- **English (LibriTTS) only.** No multilingual support in this
  checkpoint.
- **HiFi-GAN decoder, not iSTFTNet.** The LibriTTS upstream uses HiFi-GAN,
  so there are no `torch.stft` / complex tensors in the conversion path.
- **Decoder is fp32, not fp16.** Documented above. The mlpackage size
  reflects this (≈210 MB per bucket).
- **Voice-clone fidelity ceiling is architectural.** ECAPA-TDNN cosine
  to reference clip ≈ 0.18 here, ≈ 0.29 in PyTorch fp32; the same-speaker
  threshold is ~0.30. This is not a quantization or conversion artifact;
  see PR #46 TRIALS.md Phase 5.
- **No streaming.** Whole-utterance synthesis only. Add chunked streaming
  on the host side if you need it.

## Citation & acknowledgments

- Yinghao Aaron Li et al.: StyleTTS2 architecture and LibriTTS
  checkpoint.
- LibriTTS authors (CC-BY-4.0 training data).
- espeak-ng: phonemization frontend.

```bibtex
@inproceedings{li2023styletts2,
  title     = {StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author    = {Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay and Mischler, Gavin and Mesgarani, Nima},
  booktitle = {NeurIPS},
  year      = {2023}
}
```