Supertonic-3: CoreML conversion

Hand-port of Supertone Supertonic-3 v1.7.3 from ONNX to PyTorch to CoreML. 31 languages, 44.1 kHz, flow-matching diffusion (8 denoising steps, classifier-free guidance baked into the ONNX graph via batch-2 duplication).

End-to-end pipeline:

text → UnicodeProcessor → token_ids, text_mask
   ├── duration_predictor → duration_sec
   └── text_encoder       → text_emb [B, 256, T]
                          ↓
        sample_noisy_latent(duration_sec) → noisy [B, 144, L], latent_mask
                          ↓
        for 8 steps: vector_estimator(noisy, text_emb, style, masks, step, total)
                          ↓
                       vocoder(denoised_latent) → wav [B, 512*6*L]
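
The control flow of the diagram can be sketched in NumPy with stand-in modules. This is only an illustration of the loop and the tensor shapes, not the real driver in infer.py: `vector_estimator_stub` and `vocoder_stub` are hypothetical placeholders, the fake cond/uncond fields are invented, and the slot-count rounding, the all-ones masks and the 0-based step index are assumptions. The update rule inside the stub is the CFG combination quoted under "Critical gotchas".

```python
import math
import numpy as np

SAMPLE_RATE = 44100
SLOT_SAMPLES = 512 * 6   # samples per latent slot, from the diagram
TOTAL_STEPS = 8          # denoising steps


def sample_noisy_latent(duration_sec, rng, channels=144):
    # Assumption: one latent slot per 512*6 samples, rounded up.
    length = max(1, math.ceil(duration_sec * SAMPLE_RATE / SLOT_SAMPLES))
    noisy = rng.standard_normal((1, channels, length)).astype(np.float32)
    latent_mask = np.ones((1, 1, length), dtype=np.float32)
    return noisy, latent_mask


def vector_estimator_stub(noisy, text_emb, style, masks, step, total):
    # Hypothetical stand-in: the real module predicts cond/uncond from
    # text_emb and style. Here they are fake fields so the loop can run.
    _text_mask, latent_mask = masks
    cond, uncond = -noisy, -0.5 * noisy
    return (noisy + (1.0 / total) * (4 * cond - 3 * uncond)) * latent_mask


def vocoder_stub(latent):
    # Hypothetical stand-in: returns silence with the documented output shape.
    batch, _, length = latent.shape
    return np.zeros((batch, SLOT_SAMPLES * length), dtype=np.float32)


def synthesize(duration_sec, text_emb, text_mask, style=None, seed=0):
    rng = np.random.default_rng(seed)
    latent, latent_mask = sample_noisy_latent(duration_sec, rng)
    for step in range(TOTAL_STEPS):
        latent = vector_estimator_stub(
            latent, text_emb, style, (text_mask, latent_mask), step, TOTAL_STEPS
        )
    return vocoder_stub(latent)


text_emb = np.zeros((1, 256, 128), dtype=np.float32)   # [B, 256, T]
text_mask = np.ones((1, 1, 128), dtype=np.float32)
wav = synthesize(1.0, text_emb, text_mask)
print(wav.shape)
```

Swapping the two stubs for the converted modules keeps the same loop structure; only the module calls change.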

Audio chunk granularity:

  • AE / vocoder frame: 512 / 44100 ≈ 11.6 ms
  • TTL latent slot (model "tick"): 512 × 6 / 44100 ≈ 69.7 ms
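
The two granularities follow directly from the sample rate. A small sketch of the arithmetic; the helper name `slots_to_seconds` is made up for illustration:

```python
SAMPLE_RATE = 44100
FRAME_SAMPLES = 512                # AE / vocoder frame
SLOT_SAMPLES = FRAME_SAMPLES * 6   # one TTL latent slot

frame_ms = 1000 * FRAME_SAMPLES / SAMPLE_RATE
slot_ms = 1000 * SLOT_SAMPLES / SAMPLE_RATE


def slots_to_seconds(num_slots):
    """Audio duration covered by `num_slots` latent slots."""
    return num_slots * SLOT_SAMPLES / SAMPLE_RATE


print(f"{frame_ms:.1f} ms per frame, {slot_ms:.1f} ms per slot")
print(f"128 slots cover {slots_to_seconds(128):.2f} s of audio")
```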

Layout

models/tts/supertonic-3/
├── README.md
├── pyproject.toml          # uv project (Python 3.11, torch + coremltools 8)
└── coreml/
    ├── trials.md           # numerical-parity bug log (4 vector_estimator gotchas)
    ├── __init__.py
    ├── common.py           # ONNX-graph loader utilities (assign_param, etc.)
    ├── text_encoder.py     # PyTorch port: build_text_encoder_from_onnx
    ├── duration_predictor.py
    ├── vector_estimator.py
    ├── vocoder.py
    ├── convert_coreml.py   # PyTorch trace -> .mlpackage for all 4 modules
    ├── validate.py         # ONNX vs PyTorch parity check
    ├── verify_coreml.py    # CoreML vs PyTorch parity check
    ├── infer.py            # end-to-end PyTorch TTS driver (text -> wav)
    └── infer_coreml.py     # end-to-end CoreML TTS driver (text -> wav)

Setup

cd models/tts/supertonic-3/
uv sync

# Fetch upstream ONNX + style + tokenizer assets
mkdir -p build/_onnx build/voice_styles
HF=https://huggingface.co/Supertone/supertonic-3/resolve/main
for f in text_encoder duration_predictor vector_estimator vocoder; do
    curl -L $HF/_onnx/${f}.onnx -o build/_onnx/${f}.onnx
done
curl -L $HF/_onnx/tts.json -o build/_onnx/tts.json
curl -L $HF/_onnx/unicode_indexer.json -o build/_onnx/unicode_indexer.json
curl -L $HF/voice_styles/M1.json -o build/voice_styles/M1.json

Convert

# FP32 (numerical reference; all modules fall back to CPU, none run on the ANE)
uv run python -m coreml.convert_coreml build/_onnx --out-dir build/_mlpackage

# FP16 (required for ANE residency; 3/4 modules land on ANE, see Profile below)
uv run python -m coreml.convert_coreml build/_onnx --fp16 --out-dir build/_mlpackage_fp16

# Fixed-shape VectorEstimator variant for ANE profiling (RangeDim/Enum hit
# ANE shape limits; see trials.md "Dynamic shapes vs ANE"):
uv run python -m coreml.convert_ve_fixed \
    --onnx build/_onnx/vector_estimator.onnx \
    --out  build/_mlpackage_fp16_fixed/VectorEstimator_L128.mlpackage \
    --L 128 --T 128

Produces four .mlpackage bundles (FP32 ~380 MB, FP16 ~190 MB; mlprogram, iOS 18+):

| Module | FP32 | FP16 | Variable axes |
|---|---|---|---|
| vocoder | 97 MB | 48 MB | latent.L_ttl = RangeDim(4..512) |
| text_encoder | 35 MB | 17 MB | fixed text.T = 128 |
| duration_predictor | 3.5 MB | 1.8 MB | fixed text.T = 128 |
| vector_estimator | 244 MB | 122 MB | latent.L & text.T = RangeDim(17..512) |

Validate

# ONNX vs PyTorch port (per module)
uv run python -m coreml.validate

# CoreML vs PyTorch port (per module)
uv run python -m coreml.verify_coreml

# End-to-end PyTorch (writes WAV)
uv run python -m coreml.infer \
    --onnx-dir build/_onnx \
    --voice-style build/voice_styles/M1.json \
    --text "Hello world."

# End-to-end CoreML (writes WAV)
uv run python -m coreml.infer_coreml \
    --mlpackage-dir build/_mlpackage \
    --tts-json build/_onnx/tts.json \
    --unicode-indexer build/_onnx/unicode_indexer.json \
    --voice-style build/voice_styles/M1.json \
    --text "Hello world."

Final parity vs ONNX-Runtime CPU:

| Module | PyTorch vs ONNX max_abs | CoreML vs PyTorch max_abs |
|---|---|---|
| vocoder | 2.53e-4 | 1.41e-6 |
| text_encoder | 9.77e-2 (relaxed tol) | 2.33e-4 |
| duration_predictor | 3.04e-6 | 3.82e-6 |
| vector_estimator | 1.21e-3 | 2.96e-5 |

End-to-end CoreML on M-series CPU+ANE: ~0.74 s to synthesize 6.32 s of audio for a single English sentence (RTFx ≈ 8.5x), 8 denoising steps. ASR-verified against FluidAudio Parakeet TDT.
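
RTFx here is seconds of audio produced per second of wall-clock time. A rough sketch of how such a figure might be measured; `measure_rtfx` and the `synthesize` callable it times are hypothetical and not part of this repo:

```python
import time


def rtfx(audio_seconds, wall_seconds):
    """Real-time factor: seconds of audio per second of compute."""
    return audio_seconds / wall_seconds


def measure_rtfx(synthesize, sample_rate=44100):
    """Time one call to `synthesize`, a callable returning mono samples."""
    start = time.perf_counter()
    wav = synthesize()
    wall = time.perf_counter() - start
    audio = len(wav) / sample_rate
    return {"audio_sec": audio, "wall_sec": wall, "rtfx": rtfx(audio, wall)}


# Figures quoted above: 6.32 s of audio in about 0.74 s.
print(f"RTFx = {rtfx(6.32, 0.74):.1f}x")
```

A single timed call includes any first-run model compilation, so a warm-up call before measuring usually gives a more representative number.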

Profile (FP16, Apple M2, macOS 26.5, cpu_and_neural_engine)

| Module | CPU% | GPU% | ANE% | Predict | Notes |
|---|---|---|---|---|---|
| duration_predictor | 100 | 0 | 0 | 0.82 ms | tiny, CPU-bound |
| text_encoder (T=128) | 38 | 0 | 62 | 2.15 ms | partial ANE |
| vocoder (RangeDim L 4..512) | 0 | 0 | 100 | 1.17 ms | full ANE, 4× vs FP32 |
| vector_estimator (RangeDim 17..512) | - | - | - | - | dynamic shapes crash on ANE; must bucket to fixed L |
| vector_estimator (fixed L=128 T=128) | 6 | 0 | 94 | 3.8 ms | lands on ANE (M5 Pro): NE 3.82 ms vs CPU-only 14.20 ms = 3.7×. ANECCompile FAILED msg is non-fatal; see trials.md "M5 Pro re-evaluation" |
| vector_estimator (fixed L=256/512) | 4 | 0 | 96 | 8.4 / 16.4 ms | ANE holds across buckets; int8 halves size (64.5 MB) at same speed/parity 41.5 dB |

See coreml/trials.md → "ANE residency profiling" for the full breakdown, the float-mask refactor that eliminated the bool-tile blocker, the residual opaque ANECCompile() FAILED (11), and the EnumeratedShapes runtime stride gotcha.
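
Bucketing to a fixed L means padding each request up to the nearest compiled length and cropping the result afterwards. A minimal sketch, assuming the three bucket sizes from the table and zero padding on the last axis; the helper names are illustrative and not taken from the repo:

```python
import numpy as np

BUCKETS = (128, 256, 512)  # fixed L values from the profile table


def pick_bucket(length, buckets=BUCKETS):
    """Smallest bucket that fits `length` latent slots."""
    for bucket in sorted(buckets):
        if length <= bucket:
            return bucket
    raise ValueError(f"length {length} exceeds largest bucket {max(buckets)}")


def pad_last_axis(x, target):
    """Zero-pad the last axis of `x` up to `target` entries."""
    widths = [(0, 0)] * (x.ndim - 1) + [(0, target - x.shape[-1])]
    return np.pad(x, widths)


def to_bucket(latent, latent_mask):
    """Pad latent and mask to the chosen bucket; also return the true length."""
    length = latent.shape[-1]
    bucket = pick_bucket(length)
    return pad_last_axis(latent, bucket), pad_last_axis(latent_mask, bucket), length


latent = np.random.default_rng(0).standard_normal((1, 144, 91)).astype(np.float32)
mask = np.ones((1, 1, 91), dtype=np.float32)
padded, padded_mask, length = to_bucket(latent, mask)
print(padded.shape, int(padded_mask.sum()), length)
# After running the fixed-shape model, crop back with output[..., :length].
```

The padded mask entries stay at zero so that the true length is still recoverable from sum(mask), which the length-normalized rotary embedding depends on. The text axis T would presumably need the same treatment, since the fixed variant is built with both --L and --T.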

Critical gotchas

See coreml/trials.md for the full log. Highlights:

  1. CFG via batch-2 duplication: the ONNX vector_estimator tiles inputs to batch=2, runs cond + uncond in parallel, then combines with (noisy + (1/total)*(4*cond - 3*uncond)) * mask. The cond style key is not the user style_ttl; it is a learned initializer at /vector_estimator/Expand_output_0.
  2. Rotary is length-normalized: angles = (pos / sum(mask)) * theta. The divisor differs for Q (latent_mask) and K (text_mask).
  3. Attention divisor is 16.0, not sqrt(dk)=8. Using sqrt(dk) leaves the scores off by 2x.
  4. Style attention applies tanh(K) before the score matmul; text attention does not.
  5. Replicate-pad lower bound: ConvNeXt depthwise padding scales with dilation, pad = (K-1)*D/2. CoreML enforces pad ≤ dim-1 at load time, hence RangeDim.lower_bound = 17 for vector_estimator and 4 for vocoder.
  6. int32 vs int64 tokens: CoreML wants int32, PyTorch indexes with int64. Wrap modules with a tiny _Int32Wrapper that casts inside the traced graph so the external input stays int32.
  7. Python 3.14 has no BlobWriter: pin requires-python = ">=3.11,<3.13".
  8. Float masking, not bool masking: masked_fill(mask==0, -inf) and where(mask==0, 0, attn) compile to bool tile/select ops that ANE rejects. Use scores - (1.0 - mask) * 1e4 (additive) and attn * mask (multiplicative) instead. Lifts vector_estimator from 89.6% → 93.0% ANE-eligible (though the residual opaque ANECCompile() FAILED (11) still blocks final ANE landing; see trials.md).
  9. coremltools _int cast with (1,) tensor: aten::Int on a (1,)-shape int tensor trips TypeError: only 0-dimensional arrays can be converted to Python scalars inside coremltools' _cast handler. convert_coreml.py monkey-patches _cast (_patch_int_cast) to squeeze (1,) → scalar before forwarding.
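
Items 3 and 8 can be illustrated together in NumPy. The sketch below is not the ported module: it assumes single-head attention with dk = 64 and a key-side mask of 0/1 floats, and it compares the float-masked form against a bool-masked reference to show that the two agree numerically.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_float_mask(q, k, v, key_mask, divisor=16.0):
    """q: [Lq, d], k and v: [Lk, d], key_mask: [Lk] floats in {0, 1}."""
    scores = q @ k.T / divisor                 # fixed divisor, not sqrt(d)
    scores = scores - (1.0 - key_mask) * 1e4   # additive mask, no bool ops
    attn = softmax(scores) * key_mask          # multiplicative mask
    return attn @ v


def attention_bool_mask(q, k, v, key_mask, divisor=16.0):
    """Reference using the bool-style masking that the list advises against."""
    scores = q @ k.T / divisor
    scores = np.where(key_mask == 0, -np.inf, scores)
    attn = np.where(key_mask == 0, 0.0, softmax(scores))
    return attn @ v


rng = np.random.default_rng(0)
q = rng.standard_normal((5, 64))
k = rng.standard_normal((7, 64))
v = rng.standard_normal((7, 64))
key_mask = np.array([1, 1, 1, 1, 1, 0, 0], dtype=np.float64)

out_float = attention_float_mask(q, k, v, key_mask)
out_bool = attention_bool_mask(q, k, v, key_mask)
print(np.abs(out_float - out_bool).max())
```

The additive constant 1e4 is large enough that exp() underflows to zero for masked keys at these score magnitudes, while still being representable in FP16, which -inf handling does not guarantee on every backend.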

Upstream + downstream