# Supertonic-3 — CoreML conversion

Hand-port of [Supertone Supertonic-3 v1.7.3](https://huggingface.co/Supertone/supertonic-3)
from ONNX to PyTorch to CoreML. 31 languages, 44.1 kHz, flow-matching
diffusion (8 denoising steps, classifier-free guidance baked into the
ONNX graph via batch-2 duplication).

End-to-end pipeline:

```
text → UnicodeProcessor → token_ids, text_mask
   ├── duration_predictor → duration_sec
   └── text_encoder       → text_emb [B, 256, T]
                          ↓
        sample_noisy_latent(duration_sec) → noisy [B, 144, L], latent_mask
                          ↓
        for 8 steps: vector_estimator(noisy, text_emb, style, masks, step, total)
                          ↓
                       vocoder(denoised_latent) → wav [B, 512*6*L]
```

Audio chunk granularity:
- AE / vocoder frame: 512 / 44100 ≈ **11.6 ms**
- TTL latent slot (model "tick"): 512 × 6 / 44100 ≈ **69.7 ms**
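
Both figures follow from the sample rate and the hop sizes. A minimal sketch of the arithmetic, with illustrative helper names; rounding the slot count up is an assumption about how `sample_noisy_latent` derives `L`, not something this README confirms:

```python
import math

SAMPLE_RATE = 44_100      # Hz
FRAME_SAMPLES = 512       # one AE / vocoder frame
FRAMES_PER_SLOT = 6       # vocoder frames per TTL latent slot
SLOT_SAMPLES = FRAME_SAMPLES * FRAMES_PER_SLOT  # 3072 samples per slot


def slot_ms() -> float:
    """Duration of one latent slot in milliseconds."""
    return 1000.0 * SLOT_SAMPLES / SAMPLE_RATE


def latent_slots(duration_sec: float) -> int:
    """Latent length L needed to cover duration_sec (assumed: round up)."""
    return math.ceil(duration_sec * SAMPLE_RATE / SLOT_SAMPLES)


def wav_samples(num_slots: int) -> int:
    """Vocoder output length for L slots: 512 * 6 * L samples."""
    return SLOT_SAMPLES * num_slots
```

Under that rounding assumption, a predicted duration always maps to a whole number of slots, so the synthesized waveform can be slightly longer than `duration_sec`.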

## Layout

```
models/tts/supertonic-3/
├── README.md
├── pyproject.toml          # uv project (Python 3.11, torch + coremltools 8)
└── coreml/
    ├── trials.md           # numerical-parity bug log (4 vector_estimator gotchas)
    ├── __init__.py
    ├── common.py           # ONNX-graph loader utilities (assign_param, etc.)
    ├── text_encoder.py     # PyTorch port: build_text_encoder_from_onnx
    ├── duration_predictor.py
    ├── vector_estimator.py
    ├── vocoder.py
    ├── convert_coreml.py   # PyTorch trace -> .mlpackage for all 4 modules
    ├── validate.py         # ONNX vs PyTorch parity check
    ├── verify_coreml.py    # CoreML vs PyTorch parity check
    ├── infer.py            # end-to-end PyTorch TTS driver (text -> wav)
    └── infer_coreml.py     # end-to-end CoreML TTS driver (text -> wav)
```

## Setup

```bash
cd models/tts/supertonic-3/
uv sync

# Fetch upstream ONNX + style + tokenizer assets
mkdir -p build/_onnx build/voice_styles
HF=https://huggingface.co/Supertone/supertonic-3/resolve/main
for f in text_encoder duration_predictor vector_estimator vocoder; do
    curl -L $HF/_onnx/${f}.onnx -o build/_onnx/${f}.onnx
done
curl -L $HF/_onnx/tts.json -o build/_onnx/tts.json
curl -L $HF/_onnx/unicode_indexer.json -o build/_onnx/unicode_indexer.json
curl -L $HF/voice_styles/M1.json -o build/voice_styles/M1.json
```

## Convert

```bash
# FP32 (numerical reference; every module falls back to CPU even when the ANE is requested)
uv run python -m coreml.convert_coreml build/_onnx --out-dir build/_mlpackage

# FP16 (required for ANE residency; 3/4 modules land on ANE — see Profile below)
uv run python -m coreml.convert_coreml build/_onnx --fp16 --out-dir build/_mlpackage_fp16

# Fixed-shape VectorEstimator variant for ANE profiling (RangeDim/Enum hit
# ANE shape limits — see trials.md "Dynamic shapes vs ANE"):
uv run python -m coreml.convert_ve_fixed \
    --onnx build/_onnx/vector_estimator.onnx \
    --out  build/_mlpackage_fp16_fixed/VectorEstimator_L128.mlpackage \
    --L 128 --T 128
```

The FP32 and FP16 runs each produce four `.mlpackage` bundles (about 380 MB
total for FP32, 190 MB for FP16; mlprogram, iOS 18+):

| Module             | FP32   | FP16   | Variable axes                             |
| ------------------ | ------ | ------ | ----------------------------------------- |
| vocoder            | 97 MB  | 48 MB  | `latent.L_ttl` = RangeDim(4..512)         |
| text_encoder       | 35 MB  | 17 MB  | fixed `text.T = 128`                      |
| duration_predictor | 3.5 MB | 1.8 MB | fixed `text.T = 128`                      |
| vector_estimator   | 244 MB | 122 MB | `latent.L` & `text.T` = RangeDim(17..512) |
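
Because `text_encoder` and `duration_predictor` are exported with a fixed `T = 128`, shorter token sequences have to be padded and accompanied by a matching mask. A minimal sketch of that step, assuming a pad id of 0 and a 0/1 float mask (both are assumptions, and `pad_tokens` is a hypothetical helper, not a function from this repo):

```python
FIXED_T = 128  # fixed text length of the exported text_encoder / duration_predictor


def pad_tokens(token_ids: list[int], pad_id: int = 0) -> tuple[list[int], list[float]]:
    """Right-pad token_ids to FIXED_T and return (padded_ids, text_mask)."""
    n = len(token_ids)
    if n > FIXED_T:
        raise ValueError(f"{n} tokens exceed the fixed length {FIXED_T}")
    padded = list(token_ids) + [pad_id] * (FIXED_T - n)
    # 1.0 marks a real token, 0.0 marks padding.
    mask = [1.0] * n + [0.0] * (FIXED_T - n)
    return padded, mask
```

Inputs longer than 128 tokens would need to be split upstream; this sketch simply rejects them.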

## Validate

```bash
# ONNX vs PyTorch port (per module)
uv run python -m coreml.validate

# CoreML vs PyTorch port (per module)
uv run python -m coreml.verify_coreml

# End-to-end PyTorch (writes WAV)
uv run python -m coreml.infer \
    --onnx-dir build/_onnx \
    --voice-style build/voice_styles/M1.json \
    --text "Hello world."

# End-to-end CoreML (writes WAV)
uv run python -m coreml.infer_coreml \
    --mlpackage-dir build/_mlpackage \
    --tts-json build/_onnx/tts.json \
    --unicode-indexer build/_onnx/unicode_indexer.json \
    --voice-style build/voice_styles/M1.json \
    --text "Hello world."
```

Final parity (maximum absolute error; the ONNX reference runs on ONNX Runtime CPU):

| Module             | PyTorch vs ONNX max_abs | CoreML vs PyTorch max_abs |
| ------------------ | ----------------------- | ------------------------- |
| vocoder            | 2.53e-4                 | 1.41e-6                   |
| text_encoder       | 9.77e-2 (relaxed tol)   | 2.33e-4                   |
| duration_predictor | 3.04e-6                 | 3.82e-6                   |
| vector_estimator   | 1.21e-3                 | 2.96e-5                   |
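
The `max_abs` columns report the largest element-wise difference between two outputs. As a rough, framework-free sketch of that metric (the function names are illustrative, outputs are assumed to be flattened to plain sequences, and the actual checks live in `validate.py` and `verify_coreml.py`):

```python
def max_abs_error(reference, candidate):
    """Largest element-wise |reference - candidate| over two flat sequences."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def within_tolerance(reference, candidate, tol):
    """True when the worst-case element error does not exceed tol."""
    return max_abs_error(reference, candidate) <= tol
```

A single outlier element dominates this metric, which is one reason a module such as `text_encoder` may need a relaxed tolerance even when most elements agree closely.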

End-to-end CoreML on M-series CPU+ANE takes **~0.74 s** to synthesize
6.32 s of audio for a single English sentence with 8 denoising steps
(RTFx ≈ 8.5x). The output was verified by transcribing it with FluidAudio
Parakeet TDT ASR.
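
RTFx here is read as audio duration divided by wall-clock synthesis time, so values above 1 are faster than real time. A one-function sketch of that ratio (the name `rtfx` is illustrative):

```python
def rtfx(audio_sec: float, wall_sec: float) -> float:
    """Inverse real-time factor: seconds of audio produced per second of compute."""
    if wall_sec <= 0:
        raise ValueError("wall_sec must be positive")
    return audio_sec / wall_sec
```

With the numbers above, `rtfx(6.32, 0.74)` comes out at roughly 8.5.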

## Profile (FP16, Apple M2, macOS 26.5, `cpu_and_neural_engine`)

| Module                              | CPU% | GPU% | ANE% | Predict | Notes |
| ----------------------------------- | ---- | ---- | ---- | ------- | ----- |
| duration_predictor                  | 100  | 0    | 0    | 0.82 ms | tiny, CPU-bound |
| text_encoder (T=128)                | 38   | 0    | 62   | 2.15 ms | partial ANE |
| vocoder (RangeDim L 4..512)         | 0    | 0    | 100  | 1.17 ms | full ANE, 4× vs FP32 |
| vector_estimator (RangeDim 17..512) | —    | —    | —    | —       | dynamic shapes crash on ANE — must bucket to fixed L |
| vector_estimator (fixed L=128 T=128)| 6    | 0    | 94   | 3.8 ms  | **lands on ANE** (M5 Pro): NE 3.82 ms vs CPU-only 14.20 ms = 3.7×. `ANECCompile FAILED` msg is non-fatal — see trials.md "M5 Pro re-evaluation" |
| vector_estimator (fixed L=256/512)  | 4    | 0    | 96   | 8.4 / 16.4 ms | ANE residency holds across buckets; int8 roughly halves the size (64.5 MB) at the same speed, with 41.5 dB parity |

See `coreml/trials.md` → "ANE residency profiling" for the full breakdown,
the float-mask refactor that eliminated the bool-tile blocker, the
residual opaque `ANECCompile() FAILED (11)`, and the EnumeratedShapes
runtime stride gotcha.
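
Since the dynamic-shape `vector_estimator` cannot run on ANE, a caller has to route each request to one of the fixed-length exports. A minimal sketch of that bucketing, assuming the three lengths profiled above, zero padding, and a 0/1 float `latent_mask` (the helper names and the padding scheme are assumptions, not the repo's implementation):

```python
BUCKETS = (128, 256, 512)  # fixed latent lengths profiled above


def pick_bucket(latent_len: int) -> int:
    """Smallest fixed bucket that can hold latent_len slots."""
    for bucket in BUCKETS:
        if latent_len <= bucket:
            return bucket
    raise ValueError(f"latent length {latent_len} exceeds largest bucket {BUCKETS[-1]}")


def pad_latent(latent: list[list[float]], bucket: int) -> tuple[list[list[float]], list[float]]:
    """Zero-pad a [C][L] latent to [C][bucket]; return it with a 0/1 float mask."""
    length = len(latent[0])
    if length > bucket:
        raise ValueError(f"latent length {length} does not fit bucket {bucket}")
    padded = [row + [0.0] * (bucket - length) for row in latent]
    mask = [1.0] * length + [0.0] * (bucket - length)
    return padded, mask
```

The padded slots would then be trimmed from the vocoder input or output, using the mask to recover the true length.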

## Critical gotchas

See `coreml/trials.md` for the full log. Highlights:

1. **CFG via batch-2 duplication** — the ONNX vector_estimator tiles
   inputs to batch=2, runs cond + uncond in parallel, then combines
   with `(noisy + (1/total)*(4*cond - 3*uncond)) * mask`. The cond
   style key is **not** the user `style_ttl` — it is a learned
   initializer at `/vector_estimator/Expand_output_0`.
2. **Rotary is length-normalized** — `angles = (pos / sum(mask)) * theta`,
   divisor differs for Q (latent_mask) and K (text_mask).
3. **Attention divisor is 16.0**, not `sqrt(dk)=8`. Using 8 makes every score 2x too large.
4. **Style attention applies `tanh(K)`** before the score matmul; text
   attention does not.
5. **Replicate-pad lower bound** — ConvNeXt depthwise pads scale with
   dilation: `pad = (K-1)*D/2`. CoreML enforces `pad ≤ dim-1` at load
   time, hence `RangeDim.lower_bound = 17` for vector_estimator and
   `4` for vocoder.
6. **int32 vs int64 tokens** — CoreML wants int32, PyTorch indexes int64.
   Wrap modules with a tiny `_Int32Wrapper` that casts inside the
   traced graph so the external input stays int32.
7. **Python 3.14 has no BlobWriter** — pin `requires-python = ">=3.11,<3.13"`.
8. **Float masking, not bool masking** — `masked_fill(mask==0, -inf)` and
   `where(mask==0, 0, attn)` compile to `bool tile`/`select` ops that ANE
   rejects. Use `scores - (1.0 - mask) * 1e4` (additive) and `attn * mask`
   (multiplicative) instead. This lifts vector_estimator from 89.6% to 93.0%
   ANE-eligible ops. The residual opaque `ANECCompile() FAILED (11)` message
   persists; the Profile table above reports it as non-fatal for the
   fixed-shape variant (see trials.md).
9. **coremltools `_int` cast with (1,) tensor** — `aten::Int` on a
   (1,)-shape int tensor trips `TypeError: only 0-dimensional arrays can
   be converted to Python scalars` inside coremltools' `_cast` handler.
   `convert_coreml.py` monkey-patches `_cast` (`_patch_int_cast`) to
   squeeze (1,) → scalar before forwarding.
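
Gotchas 1, 3, 5, and 8 above reduce to a few lines of arithmetic. The following is a rough, framework-free sketch rather than the code in `vector_estimator.py`: the function names are hypothetical, tensors are flattened to plain lists, and the kernel/dilation pairs in the usage note are illustrative values chosen to reproduce the stated bounds, not values read from the ONNX graph.

```python
import math

ATTN_DIVISOR = 16.0   # gotcha 3: not sqrt(dk) = 8
MASK_PENALTY = 1e4    # gotcha 8: additive penalty instead of -inf


def cfg_step(noisy, cond, uncond, mask, total_steps):
    """Gotcha 1: (noisy + (1/total) * (4*cond - 3*uncond)) * mask, element-wise."""
    return [
        (n + (4.0 * c - 3.0 * u) / total_steps) * m
        for n, c, u, m in zip(noisy, cond, uncond, mask)
    ]


def masked_attention_row(raw_scores, mask):
    """Gotchas 3 and 8: softmax over one score row using float masks only."""
    # Additive masking: padded positions are pushed far below real ones.
    biased = [s / ATTN_DIVISOR - (1.0 - m) * MASK_PENALTY
              for s, m in zip(raw_scores, mask)]
    peak = max(biased)                      # subtract the max for stability
    exps = [math.exp(b - peak) for b in biased]
    denom = sum(exps)
    # Multiplicative masking: zero out any residue at padded positions.
    return [e / denom * m for e, m in zip(exps, mask)]


def range_dim_lower_bound(kernel, dilation):
    """Gotcha 5: pad = (K-1)*D/2 and CoreML needs pad <= dim-1, so dim >= pad+1."""
    pad = (kernel - 1) * dilation // 2
    return pad + 1
```

For example, `range_dim_lower_bound(5, 8)` gives 17 and `range_dim_lower_bound(7, 1)` gives 4, matching the two lower bounds quoted in gotcha 5 under those assumed kernel/dilation pairs.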

## Upstream + downstream

- Upstream: <https://huggingface.co/Supertone/supertonic-3>
- Reference Python driver: <https://github.com/supertone-inc/supertonic/blob/main/py/helper.py>
- Republished CoreML: `FluidInference/supertonic-3-coreml` (HuggingFace)
- FluidAudio Swift integration: `Sources/FluidAudio/TTS/Supertonic3/`