Supertonic-3 - CoreML conversion
Hand-port of Supertone Supertonic-3 v1.7.3 from ONNX to PyTorch to CoreML. 31 languages, 44.1 kHz, flow-matching diffusion (8 denoising steps, classifier-free guidance baked into the ONNX graph via batch-2 duplication).
End-to-end pipeline:
text → UnicodeProcessor → token_ids, text_mask
  ├── duration_predictor → duration_sec
  └── text_encoder → text_emb [B, 256, T]
        ↓
sample_noisy_latent(duration_sec) → noisy [B, 144, L], latent_mask
        ↓
for 8 steps: vector_estimator(noisy, text_emb, style, masks, step, total)
        ↓
vocoder(denoised_latent) → wav [B, 512*6*L]
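The data flow above can be read as a plain driver loop. The following is a minimal sketch, not the repository's infer.py: the function name `synthesize` and every callable passed to it are hypothetical stand-ins for the four modules, the argument lists are simplified, and it assumes that each `vector_estimator` call returns the updated latent. Whether `step` is 0-based or 1-based is also an assumption.

```python
def synthesize(token_ids, text_mask, style,
               duration_predictor, text_encoder,
               sample_noisy_latent, vector_estimator, vocoder,
               total_steps=8):
    """Hypothetical driver mirroring the pipeline diagram.

    All callables are stand-ins; the real modules may take extra inputs
    (for example a style tensor for the encoder or duration predictor).
    """
    # Text branch: predicted duration and text embedding.
    duration_sec = duration_predictor(token_ids, text_mask)
    text_emb = text_encoder(token_ids, text_mask)

    # Start from noise sized by the predicted duration.
    latent, latent_mask = sample_noisy_latent(duration_sec)

    # Fixed number of denoising steps (8 in this model).
    # Assumption: steps are numbered 0 .. total_steps - 1.
    for step in range(total_steps):
        latent = vector_estimator(
            latent, text_emb, style, (latent_mask, text_mask),
            step, total_steps,
        )

    # Decode the denoised latent to a waveform.
    return vocoder(latent)
```

Passing the modules in as callables keeps the loop identical for the ONNX, PyTorch, and CoreML back ends; only the callables change.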
Audio chunk granularity:
- AE / vocoder frame: 512 / 44100 ≈ 11.6 ms
- TTL latent slot (model "tick"): 512 × 6 / 44100 ≈ 69.7 ms
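The two granularities follow directly from the sample rate, the 512-sample frame, and the factor of 6. A small sketch of that arithmetic is shown here; the constant and function names are made up for illustration, and rounding the latent length up with `ceil` is an assumption rather than something this README states.

```python
import math

SAMPLE_RATE = 44_100   # Hz
AE_HOP = 512           # samples per AE / vocoder frame
TTL_FACTOR = 6         # AE frames per TTL latent slot


def frame_ms() -> float:
    """Duration of one AE / vocoder frame in milliseconds."""
    return 1000.0 * AE_HOP / SAMPLE_RATE


def slot_ms() -> float:
    """Duration of one TTL latent slot (model "tick") in milliseconds."""
    return 1000.0 * AE_HOP * TTL_FACTOR / SAMPLE_RATE


def latent_slots(duration_sec: float) -> int:
    """Latent length L for a target duration (assumption: round up)."""
    return math.ceil(duration_sec * SAMPLE_RATE / (AE_HOP * TTL_FACTOR))


def wav_samples(latent_len: int) -> int:
    """Waveform length produced by the vocoder: 512 * 6 * L samples."""
    return AE_HOP * TTL_FACTOR * latent_len


print(round(frame_ms(), 1), round(slot_ms(), 1))  # 11.6 69.7
```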
Layout
models/tts/supertonic-3/
├── README.md
├── pyproject.toml            # uv project (Python 3.11, torch + coremltools 8)
└── coreml/
    ├── trials.md             # numerical-parity bug log (4 vector_estimator gotchas)
    ├── __init__.py
    ├── common.py             # ONNX-graph loader utilities (assign_param, etc.)
    ├── text_encoder.py       # PyTorch port: build_text_encoder_from_onnx
    ├── duration_predictor.py
    ├── vector_estimator.py
    ├── vocoder.py
    ├── convert_coreml.py     # PyTorch trace -> .mlpackage for all 4 modules
    ├── validate.py           # ONNX vs PyTorch parity check
    ├── verify_coreml.py      # CoreML vs PyTorch parity check
    ├── infer.py              # end-to-end PyTorch TTS driver (text -> wav)
    └── infer_coreml.py       # end-to-end CoreML TTS driver (text -> wav)
Setup
cd models/tts/supertonic-3/
uv sync
# Fetch upstream ONNX + style + tokenizer assets
mkdir -p build/_onnx build/voice_styles
HF=https://huggingface.co/Supertone/supertonic-3/resolve/main
for f in text_encoder duration_predictor vector_estimator vocoder; do
  curl -L "$HF/_onnx/${f}.onnx" -o "build/_onnx/${f}.onnx"
done
curl -L $HF/_onnx/tts.json -o build/_onnx/tts.json
curl -L $HF/_onnx/unicode_indexer.json -o build/_onnx/unicode_indexer.json
curl -L $HF/voice_styles/M1.json -o build/voice_styles/M1.json
Convert
# FP32 (numerical reference; all modules fall back to CPU when ANE is requested)
uv run python -m coreml.convert_coreml build/_onnx --out-dir build/_mlpackage
# FP16 (required for ANE residency; 3/4 modules land on ANE, see Profile below)
uv run python -m coreml.convert_coreml build/_onnx --fp16 --out-dir build/_mlpackage_fp16
# Fixed-shape VectorEstimator variant for ANE profiling (RangeDim/Enum hit
# ANE shape limits; see trials.md "Dynamic shapes vs ANE"):
uv run python -m coreml.convert_ve_fixed \
  --onnx build/_onnx/vector_estimator.onnx \
  --out build/_mlpackage_fp16_fixed/VectorEstimator_L128.mlpackage \
  --L 128 --T 128
Produces four .mlpackage bundles (FP32 ~380 MB, FP16 ~190 MB; mlprogram,
iOS 18+):
| Module | FP32 | FP16 | Variable axes |
|---|---|---|---|
| vocoder | 97 MB | 48 MB | latent.L_ttl = RangeDim(4..512) |
| text_encoder | 35 MB | 17 MB | fixed text.T = 128 |
| duration_predictor | 3.5 MB | 1.8 MB | fixed text.T = 128 |
| vector_estimator | 244 MB | 122 MB | latent.L & text.T = RangeDim(17..512) |
Validate
# ONNX vs PyTorch port (per module)
uv run python -m coreml.validate
# CoreML vs PyTorch port (per module)
uv run python -m coreml.verify_coreml
# End-to-end PyTorch (writes WAV)
uv run python -m coreml.infer \
  --onnx-dir build/_onnx \
  --voice-style build/voice_styles/M1.json \
  --text "Hello world."
# End-to-end CoreML (writes WAV)
uv run python -m coreml.infer_coreml \
  --mlpackage-dir build/_mlpackage \
  --tts-json build/_onnx/tts.json \
  --unicode-indexer build/_onnx/unicode_indexer.json \
  --voice-style build/voice_styles/M1.json \
  --text "Hello world."
Final parity vs ONNX-Runtime CPU:
| Module | PyTorch vs ONNX max_abs | CoreML vs PyTorch max_abs |
|---|---|---|
| vocoder | 2.53e-4 | 1.41e-6 |
| text_encoder | 9.77e-2 (relaxed tol) | 2.33e-4 |
| duration_predictor | 3.04e-6 | 3.82e-6 |
| vector_estimator | 1.21e-3 | 2.96e-5 |
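The max_abs columns report the largest element-wise absolute difference between two outputs. As a rough illustration of that metric (not the code in validate.py or verify_coreml.py, whose internals are not shown here), a dependency-free sketch could look like this; the helper names and the tolerance in the usage line are hypothetical.

```python
def _flatten(x):
    """Yield floats from arbitrarily nested lists/tuples."""
    if isinstance(x, (list, tuple)):
        for item in x:
            yield from _flatten(item)
    else:
        yield float(x)


def max_abs_diff(reference, candidate) -> float:
    """Largest |reference - candidate| over all elements."""
    ref = list(_flatten(reference))
    cand = list(_flatten(candidate))
    if len(ref) != len(cand):
        raise ValueError(f"size mismatch: {len(ref)} vs {len(cand)}")
    return max((abs(r - c) for r, c in zip(ref, cand)), default=0.0)


def within_tolerance(reference, candidate, atol: float) -> bool:
    """True if the two outputs agree to within an absolute tolerance."""
    return max_abs_diff(reference, candidate) <= atol


# Illustrative values only; the tolerance here is not taken from the repo.
print(within_tolerance([[0.10, 0.20]], [[0.10, 0.2001]], atol=1e-3))  # True
```

A single worst-case number is strict: one outlier element dominates it, which is one plausible reason a module such as text_encoder needs a relaxed tolerance while still sounding correct end to end.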
End-to-end CoreML on M-series CPU+ANE: ~0.74 s to synthesize 6.32 s of audio for a single English sentence (RTFx ≈ 8.5x), 8 denoising steps. ASR-verified against FluidAudio Parakeet TDT.
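RTFx here is the ratio of audio duration to wall-clock synthesis time, so values above 1 mean faster than real time. A one-function sketch of the arithmetic, using the figures quoted above (the function name is illustrative):

```python
def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio produced per second of compute."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


print(round(rtfx(6.32, 0.74), 2))  # 8.54
```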
Profile (FP16, Apple M2, macOS 26.5, cpu_and_neural_engine)
| Module | CPU% | GPU% | ANE% | Predict | Notes |
|---|---|---|---|---|---|
| duration_predictor | 100 | 0 | 0 | 0.82 ms | tiny, CPU-bound |
| text_encoder (T=128) | 38 | 0 | 62 | 2.15 ms | partial ANE |
| vocoder (RangeDim L 4..512) | 0 | 0 | 100 | 1.17 ms | full ANE, 4× vs FP32 |
| vector_estimator (RangeDim 17..512) | - | - | - | - | dynamic shapes crash on ANE; must bucket to fixed L |
| vector_estimator (fixed L=128 T=128) | 6 | 0 | 94 | 3.8 ms | lands on ANE (M5 Pro): NE 3.82 ms vs CPU-only 14.20 ms = 3.7×. ANECCompile FAILED msg is non-fatal; see trials.md "M5 Pro re-evaluation" |
| vector_estimator (fixed L=256/512) | 4 | 0 | 96 | 8.4 / 16.4 ms | ANE holds across buckets; int8 halves size (64.5 MB) at same speed/parity 41.5 dB |
See coreml/trials.md, "ANE residency profiling", for the full breakdown,
the float-mask refactor that eliminated the bool-tile blocker, the
residual opaque ANECCompile() FAILED (11), and the EnumeratedShapes
runtime stride gotcha.
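Because the dynamic-shape vector_estimator does not run on ANE, a caller has to route each request to one of the fixed-length variants. The sketch below shows one way to pick a bucket and pad up to it. The bucket sizes come from the profile table; zero-padding the values and extending the mask with zeros is an assumption, not a description of how the repository or FluidAudio handles it, and the function names are hypothetical.

```python
BUCKETS = (128, 256, 512)  # fixed L values listed in the profile table


def pick_bucket(length: int, buckets=BUCKETS) -> int:
    """Smallest fixed length that can hold `length` slots."""
    if length < 1:
        raise ValueError("length must be positive")
    for bucket in sorted(buckets):
        if length <= bucket:
            return bucket
    raise ValueError(f"length {length} exceeds largest bucket {max(buckets)}")


def pad_to_bucket(values, mask, buckets=BUCKETS, pad_value=0.0):
    """Pad a 1-D sequence and its float mask up to the chosen bucket.

    Assumption: padded positions carry mask 0.0 so masked ops ignore them.
    Returns (padded_values, padded_mask, bucket).
    """
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    bucket = pick_bucket(len(values), buckets)
    extra = bucket - len(values)
    return (list(values) + [pad_value] * extra,
            list(mask) + [0.0] * extra,
            bucket)


print(pick_bucket(91), pick_bucket(129))  # 128 256
```

Padding with a zero mask leaves sum(mask) unchanged, which matters for any computation normalized by the valid length.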
Critical gotchas
See coreml/trials.md for the full log. Highlights:
- CFG via batch-2 duplication: the ONNX vector_estimator tiles inputs to batch=2, runs cond + uncond in parallel, then combines with (noisy + (1/total)*(4*cond - 3*uncond)) * mask. The cond style key is not the user style_ttl; it is a learned initializer at /vector_estimator/Expand_output_0.
- Rotary is length-normalized: angles = (pos / sum(mask)) * theta; the divisor differs for Q (latent_mask) and K (text_mask).
- Attention divisor is 16.0, not sqrt(dk)=8. Off-by-2x in scoring.
- Style attention applies tanh(K) before the score matmul; text attention does not.
- Replicate-pad lower bound: ConvNeXt depthwise pads scale with dilation: pad = (K-1)*D/2. CoreML enforces pad ≤ dim-1 at load time, hence RangeDim.lower_bound = 17 for vector_estimator and 4 for vocoder.
- int32 vs int64 tokens: CoreML wants int32, PyTorch indexes int64. Wrap modules with a tiny _Int32Wrapper that casts inside the traced graph so the external input stays int32.
- Python 3.14 has no BlobWriter: pin requires-python = ">=3.11,<3.13".
- Float masking, not bool masking: masked_fill(mask==0, -inf) and where(mask==0, 0, attn) compile to bool tile/select ops that ANE rejects. Use scores - (1.0 - mask) * 1e4 (additive) and attn * mask (multiplicative) instead. Lifts vector_estimator from 89.6% → 93.0% ANE-eligible (though the residual opaque ANECCompile() FAILED (11) still blocks final ANE landing; see trials.md).
- coremltools _int cast with (1,) tensor: aten::Int on a (1,)-shape int tensor trips TypeError: only 0-dimensional arrays can be converted to Python scalars inside coremltools' _cast handler. convert_coreml.py monkey-patches _cast (_patch_int_cast) to squeeze (1,) → scalar before forwarding.
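Three of the formulas in this list are easy to check in isolation. The sketch below writes them as plain-Python functions on flat lists: the CFG combine, the minimum length implied by the replicate-pad rule, and the additive/multiplicative float mask around a softmax. It is an illustration of the arithmetic only, not the PyTorch port; the function names are invented, and the kernel/dilation pairs in the usage lines are hypothetical values picked to reproduce the lower bounds 17 and 4, not figures from the model.

```python
import math


def cfg_combine(noisy, cond, uncond, mask, total):
    """(noisy + (1/total) * (4*cond - 3*uncond)) * mask, element-wise."""
    return [(n + (1.0 / total) * (4.0 * c - 3.0 * u)) * m
            for n, c, u, m in zip(noisy, cond, uncond, mask)]


def min_length_for_replicate_pad(kernel: int, dilation: int) -> int:
    """Smallest dim that satisfies pad <= dim - 1 with pad = (K-1)*D/2."""
    pad = (kernel - 1) * dilation // 2
    return pad + 1


def float_masked_softmax(scores, mask):
    """Softmax over one row using float masks only (no bool ops).

    Additive step: scores - (1.0 - mask) * 1e4 pushes masked entries down.
    Multiplicative step: attn * mask zeroes whatever weight remains.
    """
    shifted = [s - (1.0 - m) * 1e4 for s, m in zip(scores, mask)]
    peak = max(shifted)                      # subtract max for stability
    exps = [math.exp(s - peak) for s in shifted]
    denom = sum(exps)
    return [(e / denom) * m for e, m in zip(exps, mask)]


print(cfg_combine([1.0], [2.0], [1.0], [1.0], total=8))   # [1.625]
# Hypothetical (kernel, dilation) pairs that yield the documented bounds:
print(min_length_for_replicate_pad(5, 8))                 # 17
print(min_length_for_replicate_pad(7, 1))                 # 4
```

The 1e4 offset stands in for -inf: it is large enough that masked entries underflow to zero weight in the softmax, while staying finite.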
Upstream + downstream
- Upstream: https://huggingface.co/Supertone/supertonic-3
- Reference Python driver: https://github.com/supertone-inc/supertonic/blob/main/py/helper.py
- Republished CoreML: FluidInference/supertonic-3-coreml (HuggingFace)
- FluidAudio Swift integration: Sources/FluidAudio/TTS/Supertonic3/