---
license: other
license_name: yl4579-styletts2
license_link: https://github.com/yl4579/StyleTTS2#pre-requisites
language:
  - en
library_name: coreml
tags:
  - text-to-speech
  - styletts2
  - coreml
  - apple-silicon
  - libritts
  - on-device
pipeline_tag: text-to-speech
inference: false
---

# StyleTTS2 (LibriTTS) — CoreML

Apple-Silicon-optimized CoreML conversion of the [yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2) LibriTTS multi-speaker checkpoint ([`yl4579/StyleTTS2-LibriTTS` → `Models/LibriTTS/epochs_2nd_00020.pth`](https://huggingface.co/yl4579/StyleTTS2-LibriTTS)). Four-stage pipeline; per-stage compute-unit placement; fp16 weights throughout except the fp32 decoder (selective int8 PTQ on the text-and-prosody predictor was evaluated and dropped; see *Why the precision split looks like this* below).

> [!IMPORTANT]
> **These weights carry use restrictions beyond MIT. Read the License
> section before downloading.** They are not a drop-in permissively-licensed
> TTS model. If you need permissive terms, use
> [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M) instead.

## License & use restrictions

The upstream repository code is MIT, but the pre-trained LibriTTS weights carry **two non-negotiable restrictions** declared in [yl4579/StyleTTS2's README](https://github.com/yl4579/StyleTTS2#pre-requisites):

1. **Synthetic-origin disclosure.** Any deployment that produces audio from these weights must clearly disclose to listeners that the audio is synthetic. No undisclosed synthetic-speech publishing.
2. **Speaker consent for voice cloning.** Cloning a real person's voice requires their consent. No unauthorized celebrity, public-figure, or non-consenting third-party voice cloning.

These restrictions travel with the weights through every redistribution, fine-tune, and downstream derivative. Anyone downloading this repo inherits them and must propagate them in turn. If you cannot or will not honor these terms, **do not download these weights**.

License-of-record: the [github.com/yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2) upstream README at the time of conversion (see *Conversion provenance* below for the pinned commit).

## What's in this repo

| Package | Compute unit | Precision | Buckets | Called |
|---|---|---|---|---|
| `styletts2_text_predictor_{32,64,128,256,512}.mlpackage` | ANE | fp16 | 5 token-length | 1× per utterance |
| `styletts2_diffusion_step_512.mlpackage` | CPU+GPU | fp16 | 1 (B=512 only) | ~5× per utterance |
| `styletts2_f0n_energy.mlpackage` | ANE | fp16 | dynamic | 1× per utterance |
| `styletts2_decoder_{256,512,1024,2048,4096}.mlpackage` | CPU+GPU | **fp32** | 5 mel-length | 1× per utterance |
| `constants/text_cleaner_vocab.json` | — | — | — | phoneme→id table |
| `config.json` | — | — | — | bundle runtime contract (audio/sampler/buckets) |

Total on-disk size: ~1.4 GB per format. Both source `.mlpackage` (uncompiled, portable across Xcode versions) and pre-compiled `.mlmodelc` (Apple Silicon, ready for `MLModel(contentsOf:)`) are shipped; the `.mlmodelc` artifacts live under `compiled/`. Pick one:

- **`*.mlpackage`** — load via `MLModel(contentsOf:)`; the OS compiles on first load (~5–20 s cold start the first time, cached afterward).
- **`compiled/*.mlmodelc`** — already compiled; the same loader path skips the on-device compile. Useful for shipping inside an app bundle.

The diffusion sampler loop (ADPM2 + Karras schedule + CFG) and the hard-alignment matrix (cumsum-of-durations → one-hot → matmul) live in your host application (Swift / Python). Per-step inference is in CoreML; control flow is not. A minimal loading sketch follows.
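As a sketch of the per-stage placement above, the snippet below loads each stage with the compute units from the table. `loadStage` is a hypothetical helper and the paths are assumptions about where you unpacked this repo; it is not this bundle's API. It prefers the pre-compiled `compiled/*.mlmodelc` and falls back to compiling the `.mlpackage` on device.

```swift
import CoreML

/// Hypothetical helper: load one pipeline stage with an explicit compute-unit placement.
func loadStage(_ name: String, units: MLComputeUnits, in dir: URL) throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = units

    // Prefer the shipped pre-compiled artifact: no on-device compile.
    let precompiled = dir.appendingPathComponent("compiled/\(name).mlmodelc")
    if FileManager.default.fileExists(atPath: precompiled.path) {
        return try MLModel(contentsOf: precompiled, configuration: config)
    }
    // Otherwise compile the .mlpackage on device (~5–20 s cold, cached by the OS).
    // Synchronous for brevity; production code can use the async compileModel(at:).
    let compiledURL = try MLModel.compileModel(at: dir.appendingPathComponent("\(name).mlpackage"))
    return try MLModel(contentsOf: compiledURL, configuration: config)
}

let dir = URL(fileURLWithPath: "/path/to/styletts2-coreml")  // assumed unpack location

// Placement from the table: ANE for the fp16 predictors, CPU+GPU for
// diffusion_step and the fp32 decoder. Bucket suffixes here are examples;
// see "Bucket routing" below for how to pick them per utterance.
let textPredictor = try loadStage("styletts2_text_predictor_512", units: .cpuAndNeuralEngine, in: dir)
let diffusionStep = try loadStage("styletts2_diffusion_step_512", units: .cpuAndGPU, in: dir)
let f0nEnergy     = try loadStage("styletts2_f0n_energy", units: .cpuAndNeuralEngine, in: dir)
let decoder       = try loadStage("styletts2_decoder_1024", units: .cpuAndGPU, in: dir)
```

`.cpuAndNeuralEngine` requires macOS 13+ / iOS 16+, which the iOS17 deployment target of these packages already guarantees.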
### Why the precision split looks like this

- **text_predictor is fp16.** Selective int8 PTQ was tried and dropped: on the Apple Silicon ANE the int8 path saves only ~3 MB of weight bandwidth per bucket, exposes no int8 GEMM, and dequantizes back to fp16 on load. The savings did not justify the parity risk on small projections.
- **diffusion_step stays fp16.** It runs ~5 times per utterance through an ODE-style sampler; quantization noise compounds across iterations. Same lesson as PocketTTS issue #7.
- **f0n_energy stays fp16.** ~6 MB. No bandwidth payoff, and quantizing small projections injects audible pitch noise.
- **decoder is fp32, not fp16.** SineGen's harmonic source accumulates phase via `cumsum × 2π × hop=300`, reaching magnitudes of ~4000 mid-frame. fp16 spacing at that magnitude (ulp ≈ 4) is far larger than the per-sample phase increment (~0.05 rad), which scrambles the sine output and produces audibly robotic synthesis. fp32 is required end-to-end.

### Why only one diffusion bucket

Empirically, every observed `bert_dur` fits in B=512. The 32/64/128/256 buckets were dead weight (~192 MB) given the non-linear cost ladder (B=32 ≈ 66 ms/step, B=512 ≈ 152 ms/step). Dropping them adds at most ~430 ms per utterance in the worst short case (5 steps × the ~86 ms/step gap between B=32 and B=512).

## Performance

- **RTFx:** 4.32× warm on an M-series Mac (5-step ADPM2 sampler, all buckets pre-warmed).
- **Log-mel cosine vs PyTorch fp32:** 0.9687.
- **ECAPA-TDNN cosine to reference clip:** 0.18 — at the model's architectural ceiling. PyTorch fp32 itself only reaches 0.29 on the same metric. Voice-clone fidelity is bounded by StyleTTS2's architecture, not by this conversion.

## How to use

### Phonemizer

espeak-ng IPA + stress. The 178-token vocabulary in `constants/text_cleaner_vocab.json` mirrors `text_utils.TextCleaner` from the upstream repo: `[pad] + punctuation + ASCII letters + IPA letters`. The pad token is `$` at id 0.

### Inference shape

```text
text → phonemes → token ids
        │
        ▼
text_predictor (ANE, fp16)
        │
        ├─ d_en (1, T_dur, hidden)
        ├─ s_pred (1, 256)  (sampler init via diffusion)
        └─ duration logits → duration → one-hot alignment matrix (host)
        │
        ▼
diffusion_step × 5 (CPU+GPU, fp16)  (ADPM2 + Karras schedule + CFG)
        │
        ▼
[blend(s, ref_s) + alignment]
        │
        ▼
f0n_energy (ANE, fp16) → F0_curve, N
        │
        ▼
decoder (CPU+GPU, fp32) → 24 kHz waveform
```

The Swift host owns the sampler loop, alignment construction, and bucket routing. A reference Swift integration is in [FluidInference/FluidAudio](https://github.com/FluidInference/FluidAudio).

### Bucket routing

Round each variable-length input up to the next bucket and pad with zeros (see the sketch after this table).

| Input | Axis | Buckets |
|---|---|---|
| text_predictor `tokens` | T_tok | 32 / 64 / 128 / 256 / 512 |
| diffusion_step `embedding` | T_bert | 512 only (pad) |
| decoder `asr` | T_mel | 256 / 512 / 1024 / 2048 / 4096 |

f0n_energy is shape-flexible.
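A minimal sketch of that routing for the text_predictor input, assuming a `(1, T_tok)` int32 layout; the real input shape and feature name are an assumption here, so check each model's `modelDescription` before relying on this:

```swift
import CoreML

let textBuckets = [32, 64, 128, 256, 512]

/// Round a token sequence up to the next text_predictor bucket and zero-pad.
/// Zero padding doubles as the `$` pad token, which sits at id 0.
func bucketedTokens(_ ids: [Int32]) throws -> MLMultiArray {
    guard let bucket = textBuckets.first(where: { ids.count <= $0 }) else {
        preconditionFailure("utterance exceeds the largest bucket; chunk it on the host first")
    }
    let arr = try MLMultiArray(shape: [1, NSNumber(value: bucket)], dataType: .int32)
    for i in 0..<bucket { arr[i] = 0 }                           // MLMultiArray memory is not guaranteed zeroed
    for (i, id) in ids.enumerated() { arr[i] = NSNumber(value: id) }
    return arr
}
```

The decoder's `asr` input routes the same way over the mel-length buckets, and the diffusion embedding always pads to B=512.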
## Conversion provenance

- **Upstream code:** [yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
- **Upstream weights:** [yl4579/StyleTTS2-LibriTTS](https://huggingface.co/yl4579/StyleTTS2-LibriTTS), file `Models/LibriTTS/epochs_2nd_00020.pth`
- **Conversion scripts:** [FluidInference/mobius PR #46](https://github.com/FluidInference/mobius/pull/46) (`models/tts/styletts2/scripts/`)
- **Quantization (evaluated and dropped; shipped weights are fp16/fp32, see *Why the precision split looks like this*):** `coremltools.optimize.coreml.linear_quantize_weights`, `mode=linear_symmetric`, `dtype=int8`, `granularity=per_channel`, `weight_threshold=200_000`
- **Target:** `coremltools` ≥ 8.0, `minimum_deployment_target=iOS17` (macOS 14+ / iOS 17+)

## Known limitations

- **English (LibriTTS) only.** No multilingual support in this checkpoint.
- **HiFi-GAN decoder, not iSTFTNet.** The LibriTTS upstream uses HiFi-GAN, so there is no `torch.stft` / complex-tensor handling in the conversion path.
- **Decoder is fp32, not fp16.** Documented above; the mlpackage size reflects this (≈210 MB per bucket).
- **Voice-clone fidelity ceiling is architectural.** ECAPA-TDNN cosine to the reference clip is ≈0.18 here and ≈0.29 in PyTorch fp32, while the same-speaker threshold is ~0.30. This isn't a quantization or conversion artifact; see PR #46 TRIALS.md Phase 5.
- **No streaming.** Whole-utterance synthesis only. Add chunked streaming on the host side if you need it.

## Citation & acknowledgments

- Yinghao Aaron Li et al. — StyleTTS2 architecture and LibriTTS checkpoint.
- LibriTTS authors (CC-BY-4.0 training data).
- espeak-ng — phonemization frontend.

```bibtex
@inproceedings{li2023styletts2,
  title     = {StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author    = {Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay and Mischler, Gavin and Mesgarani, Nima},
  booktitle = {NeurIPS},
  year      = {2023}
}
```