---
license: other
license_name: yl4579-styletts2
license_link: https://github.com/yl4579/StyleTTS2#pre-requisites
language:
- en
library_name: coreml
tags:
- text-to-speech
- styletts2
- coreml
- apple-silicon
- libritts
- on-device
pipeline_tag: text-to-speech
inference: false
---

# StyleTTS2 (LibriTTS) → CoreML

Apple-Silicon-optimized CoreML conversion of the [yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
LibriTTS multi-speaker checkpoint
([`yl4579/StyleTTS2-LibriTTS` → `Models/LibriTTS/epochs_2nd_00020.pth`](https://huggingface.co/yl4579/StyleTTS2-LibriTTS)).

Four-stage pipeline; per-stage compute-unit placement; fp16 throughout with
an fp32 decoder. Selective int8 PTQ on the text-and-prosody predictor was
evaluated and dropped (see *Why the precision split looks like this*).

> [!IMPORTANT]
> **These weights carry use restrictions beyond MIT. Read the License
> section before downloading.** They are not a drop-in permissively-licensed
> TTS model. If you need permissive terms, use
> [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M) instead.

## License & use restrictions

The upstream repository code is MIT, but the pre-trained LibriTTS weights
carry **two non-negotiable restrictions** declared in
[yl4579/StyleTTS2's README](https://github.com/yl4579/StyleTTS2#pre-requisites):

1. **Synthetic-origin disclosure.** Any deployment that produces audio from
   these weights must clearly disclose to listeners that the audio is
   synthetic. No undisclosed synthetic-speech publishing.
2. **Speaker consent for voice cloning.** Cloning a real person's voice
   requires their consent. No unauthorized celebrity / public-figure /
   non-consenting third-party voice cloning.

These restrictions ride with the weights through every redistribution,
fine-tune, and downstream derivative. Anyone downloading this repo inherits
them and must propagate them in turn.

If you cannot or will not honor these terms, **do not download these
weights**.

License of record: the [github.com/yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
upstream README at the time of conversion (see *Conversion provenance* below
for the pinned commit).

## What's in this repo

| Package | Compute unit | Precision | Buckets | Called |
|---|---|---|---|---|
| `styletts2_text_predictor_{32,64,128,256,512}.mlpackage` | ANE | fp16 | 5 (token length) | 1× per utterance |
| `styletts2_diffusion_step_512.mlpackage` | CPU+GPU | fp16 | 1 (B=512 only) | ~5× per utterance |
| `styletts2_f0n_energy.mlpackage` | ANE | fp16 | dynamic | 1× per utterance |
| `styletts2_decoder_{256,512,1024,2048,4096}.mlpackage` | CPU+GPU | **fp32** | 5 (mel length) | 1× per utterance |
| `constants/text_cleaner_vocab.json` | – | – | – | phoneme→id table |
| `config.json` | – | – | – | bundle runtime contract (audio/sampler/buckets) |

Total on-disk size: ~1.4 GB per format.

Both source `.mlpackage` (uncompiled, portable across Xcode versions) and
pre-compiled `.mlmodelc` (Apple Silicon, ready for `MLModel(contentsOf:)`)
are shipped. The `.mlmodelc` artifacts are under `compiled/`. Pick one:

- **`*.mlpackage`**: load via `MLModel(contentsOf:)`; the OS compiles on
  first load (~5–20 s cold start the first time, cached afterward).
- **`compiled/*.mlmodelc`**: already compiled; the same loader path skips
  the on-device compile. Useful for shipping inside an app bundle.

The diffusion sampler loop (ADPM2 + Karras schedule + CFG) and the
hard-alignment matrix (cumsum of durations → one-hot → matmul) live in your
host application (Swift / Python). Per-step inference is in CoreML; control
flow is not.
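
The host-side alignment construction can be sketched in a few lines of NumPy. This is an illustration of the cumsum-of-durations technique, not the FluidAudio implementation; `build_alignment` and the toy durations are hypothetical:

```python
import numpy as np

def build_alignment(durations: np.ndarray, t_mel: int) -> np.ndarray:
    """Hard alignment (T_tok, T_mel): token i owns frames [start_i, end_i)."""
    ends = np.cumsum(durations)                 # cumulative frame boundaries
    starts = ends - durations
    frames = np.arange(t_mel)
    # one-hot per frame: frame f belongs to token i iff start_i <= f < end_i
    aln = (frames >= starts[:, None]) & (frames < ends[:, None])
    return aln.astype(np.float32)

durations = np.array([2, 3, 1])                 # predicted frames per token
aln = build_alignment(durations, t_mel=6)       # (3, 6), exactly one 1 per column
d_en = np.ones((3, 8), dtype=np.float32)        # stand-in for d_en (T_tok, hidden)
asr = aln.T @ d_en                              # expand to per-frame features (6, 8)
```

The matmul at the end is what upsamples per-token encoder features to per-frame features for the decoder.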

### Why the precision split looks like this

- **text_predictor is fp16.** Selective int8 PTQ was tried and dropped:
  on the Apple Silicon ANE, the int8 path saves only ~3 MB of weight
  bandwidth per bucket, has no exposed int8 GEMM, and dequantizes back to
  fp16 on load. The savings did not justify the parity risk on small
  projections.
- **diffusion_step stays fp16.** It runs 5 times per utterance through an
  ODE-style sampler; quantization noise compounds through iterations.
  Same lesson as PocketTTS issue #7.
- **f0n_energy stays fp16.** ~6 MB. No bandwidth payoff; quantizing
  small projections injects audible pitch noise.
- **decoder is fp32, not fp16.** SineGen's harmonic source accumulates
  phase via `cumsum × 2π × hop=300`, reaching magnitudes of ~4000
  mid-frame. The fp16 ulp at that magnitude (~4) is much larger than
  the per-sample increment (~0.05 rad), which scrambles the sine output
  and produces audibly robotic synthesis. fp32 is required end-to-end.
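
The fp16 argument is easy to verify numerically with NumPy half precision:

```python
import numpy as np

# ulp (gap between adjacent representable fp16 values) near SineGen's phase magnitude
phase = np.float16(4096.0)
step = np.float16(0.05)                  # per-sample phase increment from the text

print(np.spacing(phase))                 # 4.0: the fp16 grid is coarser than the step
print(phase + step == phase)             # True: the phase accumulator stops advancing
```

Once the accumulated phase exceeds ~2048, every sub-ulp increment rounds away and the sine argument freezes or jumps in steps, hence the fp32 requirement.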

### Why only one diffusion bucket

Empirically, every observed `bert_dur` fits in B=512. The 32/64/128/256
buckets were dead weight (~192 MB) given the non-linear cost ladder
(B=32 → 66 ms/step, B=512 → 152 ms/step). Dropping them adds at most
~430 ms per utterance in the worst short case.
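
The worst-case figure follows directly from the benchmark numbers above:

```python
# Extra latency for a short utterance forced through the B=512 diffusion
# bucket instead of a (removed) B=32 bucket, over the 5 sampler steps.
steps = 5
ms_b32, ms_b512 = 66, 152            # ms/step from the cost ladder above
extra_ms = steps * (ms_b512 - ms_b32)
print(extra_ms)                      # 430
```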
## Performance

- **RTFx:** 4.32× warm on an M-series Mac (5-step ADPM2 sampler, all buckets
  pre-warmed).
- **Log-mel cosine vs PyTorch fp32:** 0.9687.
- **ECAPA-TDNN cosine to reference clip:** 0.18. PyTorch fp32 itself only
  reaches 0.29 on the same metric, so voice-clone fidelity is bounded by
  StyleTTS2's architecture, not by this conversion.
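
RTFx here is the usual real-time factor, audio duration over wall-clock synthesis time; a quick sanity check with an illustrative utterance length:

```python
# RTFx = seconds of audio produced per second of synthesis wall-clock time
audio_s = 10.0                       # illustrative 10 s utterance
synth_s = audio_s / 4.32             # implied warm synthesis time at RTFx 4.32
print(round(synth_s, 2))             # 2.31: ~2.3 s to render 10 s of speech
```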

## How to use

### Phonemizer

espeak-ng IPA + stress. The 178-token vocabulary in
`constants/text_cleaner_vocab.json` mirrors `text_utils.TextCleaner` from
the upstream repo: `[pad] + punctuation + ASCII letters + IPA letters`.

The pad token is `$` at id 0.

### Inference shape

```text
text → phonemes → token ids
        │
        ▼
text_predictor (ANE, fp16)
        │  ├─ d_en (1, T_dur, hidden)
        │  ├─ s_pred (1, 256)  (sampler init via diffusion)
        │  └─ duration logits → duration → one-hot alignment matrix (host)
        │
        ▼
diffusion_step × 5 (CPU+GPU, fp16)   (ADPM2 + Karras schedule + CFG)
        │
        ▼
[blend(s, ref_s) + alignment]
        │
        ▼
f0n_energy (ANE, fp16) → F0_curve, N
        │
        ▼
decoder (CPU+GPU, fp32) → 24 kHz waveform
```

The Swift host owns the sampler loop, alignment construction, and bucket
routing. A reference Swift integration is in
[FluidInference/FluidAudio](https://github.com/FluidInference/FluidAudio).
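
For orientation, the two sampler pieces the host owns look roughly like this. A sketch only: the real sigma range, guidance weight, and the ADPM2 step rule come from `config.json` and the host implementation, and the values below are illustrative:

```python
import numpy as np

def karras_sigmas(n: int, sigma_min: float, sigma_max: float, rho: float = 7.0) -> np.ndarray:
    """Karras-style noise schedule: n sigmas from sigma_max down to sigma_min."""
    ramp = np.linspace(0.0, 1.0, n)
    inv_max, inv_min = sigma_max ** (1 / rho), sigma_min ** (1 / rho)
    return (inv_max + ramp * (inv_min - inv_max)) ** rho

def cfg(out_cond: np.ndarray, out_uncond: np.ndarray, w: float) -> np.ndarray:
    """Classifier-free guidance: blend conditional/unconditional step outputs."""
    return out_uncond + w * (out_cond - out_uncond)

sigmas = karras_sigmas(5, 1e-4, 3.0)   # one sigma per diffusion_step call
```

Each of the 5 sampler iterations runs the CoreML `diffusion_step` twice (conditional and unconditional), blends with `cfg`, and advances along `sigmas`.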

### Bucket routing

Round each variable-length input up to the next bucket. Pad with zeros.

| Input | Axis | Buckets |
|---|---|---|
| text_predictor `tokens` | T_tok | 32 / 64 / 128 / 256 / 512 |
| diffusion_step `embedding` | T_bert | 512 only (pad) |
| decoder `asr` | T_mel | 256 / 512 / 1024 / 2048 / 4096 |

f0n_energy is shape-flexible.
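
The routing rule (smallest bucket that fits, zero-pad the remainder) is a few lines. A NumPy sketch; `route` and the input names are hypothetical:

```python
import numpy as np

BUCKETS = {
    "tokens": [32, 64, 128, 256, 512],        # text_predictor, axis T_tok
    "asr": [256, 512, 1024, 2048, 4096],      # decoder, axis T_mel
}

def route(x: np.ndarray, kind: str, axis: int = -1) -> np.ndarray:
    """Zero-pad x along `axis` up to the smallest bucket that fits it."""
    n = x.shape[axis]
    bucket = next(b for b in BUCKETS[kind] if b >= n)
    pad = [(0, 0)] * x.ndim
    pad[axis] = (0, bucket - n)
    return np.pad(x, pad)

tokens = np.ones((1, 90))
print(route(tokens, "tokens").shape)          # (1, 128)
```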

## Conversion provenance

- **Upstream code:** [yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
- **Upstream weights:** [yl4579/StyleTTS2-LibriTTS](https://huggingface.co/yl4579/StyleTTS2-LibriTTS),
  file `Models/LibriTTS/epochs_2nd_00020.pth`
- **Conversion scripts:** [FluidInference/mobius PR #46](https://github.com/FluidInference/mobius/pull/46)
  (`models/tts/styletts2/scripts/`)
- **Quantization (evaluated during conversion; shipped packages are fp16/fp32):**
  `coremltools.optimize.coreml.linear_quantize_weights`,
  `mode=linear_symmetric`, `dtype=int8`, `granularity=per_channel`,
  `weight_threshold=200_000`
- **Target:** `coremltools` ≥ 8.0, `minimum_deployment_target=iOS17`
  (macOS 14+ / iOS 17+)

## Known limitations

- **English (LibriTTS) only.** No multilingual support in this
  checkpoint.
- **HiFi-GAN decoder, not iSTFTNet.** The LibriTTS upstream uses HiFi-GAN,
  so there are no `torch.stft` / complex tensors in the conversion path.
- **Decoder is fp32, not fp16.** Documented above. The mlpackage size
  reflects this (≈210 MB per bucket).
- **Voice-clone fidelity ceiling is architectural.** ECAPA-TDNN cosine
  to reference clip ≈ 0.18 here, ≈ 0.29 in PyTorch fp32; the same-speaker
  threshold is ~0.30. This is not a quantization or conversion artifact;
  see PR #46 TRIALS.md Phase 5.
- **No streaming.** Whole-utterance synthesis only. Add chunked streaming
  on the host side if you need it.

## Citation & acknowledgments

- Yinghao Aaron Li et al.: StyleTTS2 architecture and LibriTTS
  checkpoint.
- LibriTTS authors (CC-BY-4.0 training data).
- espeak-ng: phonemization frontend.

```bibtex
@inproceedings{li2023styletts2,
  title     = {StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author    = {Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay and Mischler, Gavin and Mesgarani, Nima},
  booktitle = {NeurIPS},
  year      = {2023}
}
```