	# Supertonic-3 — CoreML conversion

	Hand-port of [Supertone Supertonic-3 v1.7.3](https://huggingface.co/Supertone/supertonic-3)
	from ONNX to PyTorch to CoreML. 31 languages, 44.1 kHz, flow-matching
	diffusion (8 denoising steps, classifier-free guidance baked into the
	ONNX graph via batch-2 duplication).

	End-to-end pipeline:

	```
	text → UnicodeProcessor → token_ids, text_mask
	├── duration_predictor → duration_sec
	└── text_encoder → text_emb [B, 256, T]
	↓
	sample_noisy_latent(duration_sec) → noisy [B, 144, L], latent_mask
	↓
	for 8 steps: vector_estimator(noisy, text_emb, style, masks, step, total)
	↓
	vocoder(denoised_latent) → wav [B, 512*6*L]
	```

	Audio chunk granularity:
	- AE / vocoder frame: 512 / 44100 ≈ 11.6 ms
	- TTL latent slot (model "tick"): 512 × 6 / 44100 ≈ 69.7 ms
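
Both granularities follow directly from the sample rate and the frame size. The arithmetic can be sketched as below. The helper names `latent_slots_for` and `wav_samples_for` are illustrative, and rounding the duration up to a whole slot is an assumption, not necessarily what the port does.

```python
import math

SAMPLE_RATE = 44_100   # Hz, from the model description
AE_FRAME = 512         # samples per AE / vocoder frame
FRAMES_PER_SLOT = 6    # AE frames packed into one TTL latent slot

SAMPLES_PER_SLOT = AE_FRAME * FRAMES_PER_SLOT  # 3072 samples per latent slot

frame_ms = 1000 * AE_FRAME / SAMPLE_RATE
slot_ms = 1000 * SAMPLES_PER_SLOT / SAMPLE_RATE


def latent_slots_for(duration_sec: float) -> int:
    """Hypothetical helper: latent slots needed to cover duration_sec.

    Rounding up is an assumption made for this sketch.
    """
    return math.ceil(duration_sec * SAMPLE_RATE / SAMPLES_PER_SLOT)


def wav_samples_for(num_slots: int) -> int:
    """Vocoder output length (in samples) for a latent of num_slots slots."""
    return num_slots * SAMPLES_PER_SLOT


print(f"frame: {frame_ms:.1f} ms, slot: {slot_ms:.1f} ms")
slots = latent_slots_for(6.32)
print(slots, wav_samples_for(slots))
```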

	## Layout

	```
	models/tts/supertonic-3/
	├── README.md
	├── pyproject.toml # uv project (Python 3.11, torch + coremltools 8)
	└── coreml/
	├── trials.md # numerical-parity bug log (4 vector_estimator gotchas)
	├── __init__.py
	├── common.py # ONNX-graph loader utilities (assign_param, etc.)
	├── text_encoder.py # PyTorch port: build_text_encoder_from_onnx
	├── duration_predictor.py
	├── vector_estimator.py
	├── vocoder.py
	├── convert_coreml.py # PyTorch trace -> .mlpackage for all 4 modules
	├── validate.py # ONNX vs PyTorch parity check
	├── verify_coreml.py # CoreML vs PyTorch parity check
	├── infer.py # end-to-end PyTorch TTS driver (text -> wav)
	└── infer_coreml.py # end-to-end CoreML TTS driver (text -> wav)
	```

	## Setup

	```bash
	cd models/tts/supertonic-3/
	uv sync

	# Fetch upstream ONNX + style + tokenizer assets
	mkdir -p build/_onnx build/voice_styles
	HF=https://huggingface.co/Supertone/supertonic-3/resolve/main
	for f in text_encoder duration_predictor vector_estimator vocoder; do
	curl -L $HF/_onnx/${f}.onnx -o build/_onnx/${f}.onnx
	done
	curl -L $HF/_onnx/tts.json -o build/_onnx/tts.json
	curl -L $HF/_onnx/unicode_indexer.json -o build/_onnx/unicode_indexer.json
	curl -L $HF/voice_styles/M1.json -o build/voice_styles/M1.json
	```

	## Convert

	```bash
	# FP32 (numerical reference; all modules fall back to CPU when ANE is requested)
	uv run python -m coreml.convert_coreml build/_onnx --out-dir build/_mlpackage

	# FP16 (required for ANE residency; 3/4 modules land on ANE — see Profile below)
	uv run python -m coreml.convert_coreml build/_onnx --fp16 --out-dir build/_mlpackage_fp16

	# Fixed-shape VectorEstimator variant for ANE profiling (RangeDim/Enum hit
	# ANE shape limits — see trials.md "Dynamic shapes vs ANE"):
	uv run python -m coreml.convert_ve_fixed \
	--onnx build/_onnx/vector_estimator.onnx \
	--out build/_mlpackage_fp16_fixed/VectorEstimator_L128.mlpackage \
	--L 128 --T 128
	```

	Produces four `.mlpackage` bundles (FP32 ~380 MB, FP16 ~190 MB; mlprogram,
	iOS 18+):

	| Module             | FP32   | FP16   | Variable axes                             |
	| ------------------ | ------ | ------ | ----------------------------------------- |
	| vocoder            | 97 MB  | 48 MB  | `latent.L_ttl` = RangeDim(4..512)         |
	| text_encoder       | 35 MB  | 17 MB  | fixed `text.T = 128`                      |
	| duration_predictor | 3.5 MB | 1.8 MB | fixed `text.T = 128`                      |
	| vector_estimator   | 244 MB | 122 MB | `latent.L` & `text.T` = RangeDim(17..512) |

	## Validate

	```bash
	# ONNX vs PyTorch port (per module)
	uv run python -m coreml.validate

	# CoreML vs PyTorch port (per module)
	uv run python -m coreml.verify_coreml

	# End-to-end PyTorch (writes WAV)
	uv run python -m coreml.infer \
	--onnx-dir build/_onnx \
	--voice-style build/voice_styles/M1.json \
	--text "Hello world."

	# End-to-end CoreML (writes WAV)
	uv run python -m coreml.infer_coreml \
	--mlpackage-dir build/_mlpackage \
	--tts-json build/_onnx/tts.json \
	--unicode-indexer build/_onnx/unicode_indexer.json \
	--voice-style build/voice_styles/M1.json \
	--text "Hello world."
	```

	Final parity vs ONNX-Runtime CPU:

	| Module             | PyTorch vs ONNX max_abs | CoreML vs PyTorch max_abs |
	| ------------------ | ----------------------- | ------------------------- |
	| vocoder            | 2.53e-4                 | 1.41e-6                   |
	| text_encoder       | 9.77e-2 (relaxed tol)   | 2.33e-4                   |
	| duration_predictor | 3.04e-6                 | 3.82e-6                   |
	| vector_estimator   | 1.21e-3                 | 2.96e-5                   |
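
The `max_abs` columns report the largest element-wise absolute difference between two outputs of the same module. A minimal pure-Python sketch of that metric follows. The names `max_abs_diff` and `within_tolerance` and the sample values are illustrative only; the actual scripts presumably compare tensors rather than lists.

```python
from typing import Sequence


def max_abs_diff(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two flattened outputs."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def within_tolerance(
    reference: Sequence[float], candidate: Sequence[float], tol: float
) -> bool:
    """True when the candidate stays within `tol` of the reference everywhere."""
    return max_abs_diff(reference, candidate) <= tol


# Made-up values, purely for illustration.
onnx_out = [0.10, -0.25, 0.75]
torch_out = [0.10, -0.25, 0.7501]

print(max_abs_diff(onnx_out, torch_out))
print(within_tolerance(onnx_out, torch_out, tol=1e-3))
```

A single worst-case number is a strict summary: one outlier element dominates it, which is why a module can need a relaxed tolerance even when most elements agree closely.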

	End-to-end CoreML on M-series CPU+ANE: ~0.74 s to synthesize
	6.32 s of audio for a single English sentence (RTFx ≈ 8.5x), 8
	denoising steps. ASR-verified against FluidAudio Parakeet TDT.
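
The RTFx figure is the ratio of audio produced to wall-clock synthesis time. A small sketch of the calculation, using the two numbers quoted above (`rtfx` is an illustrative helper name):

```python
def rtfx(audio_seconds: float, synthesis_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio produced per second of compute."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return audio_seconds / synthesis_seconds


print(f"RTFx ~ {rtfx(6.32, 0.74):.1f}x")
```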

	## Profile (FP16, Apple M2, macOS 26.5, `cpu_and_neural_engine`)

	| Module                               | CPU% | GPU% | ANE% | Predict       | Notes |
	| ------------------------------------ | ---- | ---- | ---- | ------------- | ----- |
	| duration_predictor                   | 100  | 0    | 0    | 0.82 ms       | tiny, CPU-bound |
	| text_encoder (T=128)                 | 38   | 0    | 62   | 2.15 ms       | partial ANE |
	| vocoder (RangeDim L 4..512)          | 0    | 0    | 100  | 1.17 ms       | full ANE, 4× vs FP32 |
	| vector_estimator (RangeDim 17..512)  | n/a  | n/a  | n/a  | n/a           | dynamic shapes crash on ANE; must bucket to fixed L |
	| vector_estimator (fixed L=128 T=128) | 6    | 0    | 94   | 3.8 ms        | lands on ANE (M5 Pro): NE 3.82 ms vs CPU-only 14.20 ms = 3.7×. `ANECCompile FAILED` msg is non-fatal; see trials.md "M5 Pro re-evaluation" |
	| vector_estimator (fixed L=256/512)   | 4    | 0    | 96   | 8.4 / 16.4 ms | ANE holds across buckets; int8 halves size (64.5 MB) at same speed/parity 41.5 dB |

	See `coreml/trials.md` → "ANE residency profiling" for the full breakdown,
	the float-mask refactor that eliminated the bool-tile blocker, the
	residual opaque `ANECCompile() FAILED (11)`, and the EnumeratedShapes
	runtime stride gotcha.
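
Since the dynamic-shape vector_estimator does not run on ANE, a caller has to round each request up to one of the fixed-shape variants. A sketch of that selection follows. It assumes the available buckets are the L=128/256/512 variants profiled above and that shorter latents are right-padded with zeros while a 0/1 mask marks the real slots. `pick_bucket` and `pad_to_bucket` are hypothetical names, not functions from this repository.

```python
from typing import List, Sequence, Tuple

BUCKETS = (128, 256, 512)  # fixed-L variants listed in the profile table


def pick_bucket(length: int, buckets: Sequence[int] = BUCKETS) -> int:
    """Return the smallest fixed bucket that can hold `length` latent slots."""
    if length < 1:
        raise ValueError("length must be at least 1")
    for bucket in sorted(buckets):
        if length <= bucket:
            return bucket
    raise ValueError(f"length {length} exceeds largest bucket {max(buckets)}")


def pad_to_bucket(
    values: Sequence[float], bucket: int, fill: float = 0.0
) -> Tuple[List[float], List[float]]:
    """Right-pad one channel of latent values and build the matching 0/1 mask."""
    n = len(values)
    if n > bucket:
        raise ValueError("values do not fit in the chosen bucket")
    padded = list(values) + [fill] * (bucket - n)
    mask = [1.0] * n + [0.0] * (bucket - n)
    return padded, mask


bucket = pick_bucket(91)            # e.g. a 91-slot latent
padded, mask = pad_to_bucket([0.5] * 91, bucket)
print(bucket, len(padded), int(sum(mask)))
```

Rounding up trades some wasted compute on padded slots for ANE residency; per the table, even the L=512 bucket stays close to the CPU-only latency of the L=128 model.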

	## Critical gotchas

	See `coreml/trials.md` for the full log. Highlights:

	1. CFG via batch-2 duplication — the ONNX vector_estimator tiles
	inputs to batch=2, runs cond + uncond in parallel, then combines
	with `(noisy + (1/total) * (4*cond - 3*uncond)) * mask`. The cond
	style key is not the user `style_ttl` — it is a learned
	initializer at `/vector_estimator/Expand_output_0`.
	2. Rotary is length-normalized — `angles = (pos / sum(mask)) * theta`,
	divisor differs for Q (latent_mask) and K (text_mask).
	3. Attention divisor is 16.0, not `sqrt(dk)=8`. Off-by-2x in scoring.
	4. Style attention applies `tanh(K)` before the score matmul; text
	attention does not.
	5. Replicate-pad lower bound — ConvNeXt depthwise pads scale with
	dilation: `pad = (K-1)*D/2`. CoreML enforces `pad ≤ dim-1` at load
	time, hence `RangeDim.lower_bound = 17` for vector_estimator and
	`4` for vocoder.
	6. int32 vs int64 tokens — CoreML wants int32, PyTorch indexes int64.
	Wrap modules with a tiny `_Int32Wrapper` that casts inside the
	traced graph so the external input stays int32.
	7. Python 3.14 has no BlobWriter — pin `requires-python = ">=3.11,<3.13"`.
	8. Float masking, not bool masking — `masked_fill(mask==0, -inf)` and
	`where(mask==0, 0, attn)` compile to `bool tile`/`select` ops that ANE
	rejects. Use `scores - (1.0 - mask) * 1e4` (additive) and `attn * mask`
	(multiplicative) instead. Lifts vector_estimator from 89.6% → 93.0%
	ANE-eligible (though the residual opaque `ANECCompile() FAILED (11)`
	still blocks final ANE landing — see trials.md).
	9. coremltools `_int` cast with (1,) tensor — `aten::Int` on a
	(1,)-shape int tensor trips `TypeError: only 0-dimensional arrays can
	be converted to Python scalars` inside coremltools' `_cast` handler.
	`convert_coreml.py` monkey-patches `_cast` (`_patch_int_cast`) to
	squeeze (1,) → scalar before forwarding.
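
Three of the gotchas above (1, 5 and 8) reduce to short arithmetic that is easy to check in isolation. The sketch below uses plain Python lists instead of tensors so it runs without torch. All function names are illustrative, the kernel/dilation pairs are example values chosen to reproduce the stated lower bounds rather than values read from the ONNX graph, and the masking comparison assumes at least one unmasked position.

```python
import math


def cfg_step(noisy, cond, uncond, mask, total):
    """Gotcha 1: (noisy + (1/total) * (4*cond - 3*uncond)) * mask, element-wise."""
    return [
        (n + (4.0 * c - 3.0 * u) / total) * m
        for n, c, u, m in zip(noisy, cond, uncond, mask)
    ]


def min_length_for_pad(kernel: int, dilation: int) -> int:
    """Gotcha 5: pad = (K-1)*D/2 and CoreML needs pad <= dim-1, so dim >= pad+1."""
    pad = (kernel - 1) * dilation // 2
    return pad + 1


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    denom = sum(exps)
    return [e / denom for e in exps]


def attn_bool_mask(scores, mask):
    """Reference behaviour: masked_fill(mask == 0, -inf) before the softmax."""
    return softmax([s if m else -math.inf for s, m in zip(scores, mask)])


def attn_float_mask(scores, mask):
    """Gotcha 8: additive penalty before softmax, multiplicative mask after."""
    weights = softmax([s - (1.0 - m) * 1e4 for s, m in zip(scores, mask)])
    return [w * m for w, m in zip(weights, mask)]


scores = [1.0, 2.0, 3.0]
mask = [1.0, 1.0, 0.0]  # last position is padding
print(attn_bool_mask(scores, mask))
print(attn_float_mask(scores, mask))
print(cfg_step([1.0, 2.0], [0.5, 0.5], [0.25, 0.25], [1.0, 0.0], total=8))
print(min_length_for_pad(5, 8), min_length_for_pad(7, 1))  # example K, D pairs
```

The additive form works because `exp` of a score lowered by 1e4 underflows to zero, so the masked positions contribute nothing to the softmax denominator; the trailing multiply only guards the output values themselves.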

	## Upstream + downstream

	- Upstream: <https://huggingface.co/Supertone/supertonic-3>
	- Reference Python driver: <https://github.com/supertone-inc/supertonic/blob/main/py/helper.py>
	- Republished CoreML: `FluidInference/supertonic-3-coreml` (HuggingFace)
	- FluidAudio Swift integration: `Sources/FluidAudio/TTS/Supertonic3/`