PrimeTTS — tiny on-device zh-TW + English TTS

A 4.09M-parameter Taiwan-Mandarin + English text-to-speech model that runs entirely on CPU — small and fast enough for Jetson-class on-device use (contact-centre, GPS, transit). One model, one young-female voice: Chinese, English, and code-mix through a single frontend (no language routing), built for entity correctness — phone numbers, emails, addresses, prices, dates, temperatures, %, serials.

🔊 Live demo: https://huggingface.co/spaces/Luigi/PrimeTTS-vs-Inflect-Nano-v1

Parameters: 4.09M = 3.56M acoustic (incl. 97K pitch refiner) + 0.53M vocoder
Sample rate: 8 kHz (on-device; a 24 kHz variant is in-repo)
Runtime: onnxruntime, CPU-only, torch-free at inference
Languages: zh-TW (Traditional) + English + code-mix, single voice
Voice: young female, Taiwan-Mandarin accent
Measured: zh-CER 0.109 · en-WER 0.083 · Jetson Nano RTF 0.35 (1 thread)
License: Apache-2.0

Highlights

  • Tiny + CPU-only — ~4M params, ONNX, torch-free; real-time on a Jetson Nano (RTF 0.35, single thread).
  • One voice, three modes — zh / en / code-mix share one timbre and accent through a single frontend; no language tag needed.
  • Mandarin tones via a frame-pitch refiner — a 97K-param module turns coarse per-phoneme pitch into a per-frame F0 contour (the tone carrier). Ablating it costs +18% relative zh-CER (zh-only; English is unaffected — no lexical tone).
  • Entity-correct — a normalization layer reads numbers, dates, prices, emails, addresses, serials, and spells acronyms/letters (VIP → V-I-P), applied identically in training and at inference.
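The acronym rule in the last bullet can be illustrated with a short sketch. This is not the repository's text_norm.py: the function name spell_acronyms and the regular expression are assumptions made for illustration, and the real normalizer also covers numbers, dates, prices, and a brand lexicon that this sketch ignores.

```python
import re

# Assumed rule: a standalone run of two or more ASCII capitals is an acronym.
# The lookarounds keep mixed-case names such as "PrimeTTS" untouched.
_ACRONYM = re.compile(r"(?<![A-Za-z])[A-Z]{2,}(?![A-Za-z])")

def spell_acronyms(text: str) -> str:
    """Rewrite ALL-CAPS acronyms letter by letter, e.g. VIP -> V-I-P."""
    return _ACRONYM.sub(lambda m: "-".join(m.group(0)), text)

print(spell_acronyms("您的 VIP 會員已啟用"))
```

Running the same function over both the training corpus and inference input is what "applied identically" amounts to in practice: the model never sees a surface form at inference that it did not see in training.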

Performance — held-out (36 unseen phone-attendant sentences)

Metric                               Value
zh-CER (Breeze-ASR-25)               0.109 (pure-zh 0.119 · code-mix 0.098)
en-WER (Whisper)                     0.083
Taiwan-accent gap¹                   +0.033 (genuine TW accent)
SQUIM PESQ / STOI                    2.22 / 0.90
On Jetson Nano (ORT-CPU, 1 thread)   RTF 0.347 · on-device CER ≈ 0.117

¹ CER(generic ASR) − CER(Taiwan-tuned Breeze-ASR-25) per zh clip; >0 ⇒ a Taiwan recognizer understands it better ⇒ a real Taiwan accent is present.
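A minimal sketch of the gap in footnote ¹, assuming the per-clip CER values from both recognizers are already available. The function name accent_gap and the use of a plain mean over clips are assumptions (the card does not say how clips are aggregated), and the numbers below are made up for illustration.

```python
from statistics import mean

def accent_gap(cer_generic, cer_taiwan):
    """Mean per-clip difference CER(generic ASR) - CER(Taiwan-tuned ASR)."""
    if len(cer_generic) != len(cer_taiwan):
        raise ValueError("need one CER pair per clip")
    return mean(g - t for g, t in zip(cer_generic, cer_taiwan))

# Illustrative values only, not measurements from this model.
gap = accent_gap([0.20, 0.10, 0.15], [0.15, 0.10, 0.10])
print(f"accent gap = {gap:+.3f}")
```

A positive result means the Taiwan-tuned recognizer made fewer errors on the same clips than the generic one.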

On audio quality: at 8 kHz the band ceiling is 4 kHz (Nyquist), which discards the brightness/sibilance above it — clear and intelligible, but telephone-band. A 16 kHz variant (0–8 kHz, ~5.0M) is in progress for higher fidelity while staying on-device; the 24 kHz files (below) are the fullest today.
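The RTF figure reported in this section is a real-time factor: wall-clock synthesis time divided by the duration of the audio produced, so values below 1 are faster than real time. The helper below is a sketch of one way to measure it. The name real_time_factor, the zero-argument synthesize callable, and the best-of-N timing are all assumptions; the card does not describe its own measurement procedure.

```python
import time

def real_time_factor(synthesize, sample_rate: int, repeats: int = 3) -> float:
    """Return best-of-N synthesis time divided by the audio duration in seconds.

    `synthesize` is a hypothetical zero-argument callable that returns a
    1-D sequence of audio samples.
    """
    best = float("inf")
    n_samples = 0
    for _ in range(repeats):
        start = time.perf_counter()
        wav = synthesize()
        best = min(best, time.perf_counter() - start)
        n_samples = len(wav)
    if n_samples == 0:
        raise ValueError("synthesize() returned no audio")
    return best / (n_samples / sample_rate)

# Stand-in synthesizer: one second of silence at 8 kHz.
rtf = real_time_factor(lambda: [0.0] * 8000, sample_rate=8000)
print(f"RTF = {rtf:.4f}")
```

For a single-thread figure with onnxruntime, the thread count can be pinned by passing an ort.SessionOptions with intra_op_num_threads = 1 as sess_options when creating each InferenceSession.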

Architecture

Acoustic — MicroFastSpeech (3.56M). FastSpeech-style, no attention: depthwise gated Conv-FFN, external durations + length regulator, frame-pitch, BiGRU, postnet — plus the frame-pitch refiner (Conv1d → SiLU → Conv1d(groups=4) → SiLU → Conv1d, 97K) that builds the per-frame F0 contour = Mandarin tones.

{ "vocab_size": 256, "tone_size": 16, "lang_size": 4, "n_mels": 80,
  "hidden": 168, "encoder_layers": 5, "decoder_layers": 6, "decoder_ff_mult": 3,
  "sample_rate": 8000, "max_frames": 1000, "use_frame_pitch_refiner": true }

Vocoder — Snake-HiFiGAN. The on-device model uses the lightweight snake_8k_lite. The family — same architecture, different width / sample rate — is why a model's headline param count varies:

variant         used by                params   rate     band       PESQ²
snake_8k_lite   PrimeTTS (on-device)   0.53M    8 kHz    0–4 kHz    2.34
snake_8k        8 kHz, full width      1.15M    8 kHz    0–4 kHz    2.60
snake_v2mid     24 kHz variant         1.17M    24 kHz   0–12 kHz   3.23
snake_16k       16 kHz (roadmap)       1.43M    16 kHz   0–8 kHz

² SQUIM-PESQ, 5-utterance mean. The big 8 → 24 kHz jump is the sample rate (band), not the vocoder; _lite trades ~0.26 PESQ vs the full snake_8k for ~2.2× less compute (the lower Nano RTF). So the 4.09M on-device total = 3.56M acoustic + the 0.53M lite vocoder; the 24 kHz files pair the same-class acoustic with the heavier 1.17M snake_v2mid.
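The arithmetic behind these figures can be checked directly from the table. The sketch below uses only numbers quoted in this card. The 16 kHz total assumes the acoustic model stays at 3.56M, which is an assumption rather than something the card states.

```python
ACOUSTIC_M = 3.56  # acoustic model size in millions of parameters (from the card)
VOCODER_M = {"snake_8k_lite": 0.53, "snake_8k": 1.15, "snake_16k": 1.43}
PESQ = {"snake_8k_lite": 2.34, "snake_8k": 2.60}

# On-device total: acoustic + lite vocoder.
on_device_total = round(ACOUSTIC_M + VOCODER_M["snake_8k_lite"], 2)

# What the lite vocoder gives up in PESQ, and how much smaller it is.
lite_pesq_cost = round(PESQ["snake_8k"] - PESQ["snake_8k_lite"], 2)
lite_param_ratio = round(VOCODER_M["snake_8k"] / VOCODER_M["snake_8k_lite"], 1)

# Assumption: the 16 kHz variant reuses a 3.56M acoustic model.
total_16k = round(ACOUSTIC_M + VOCODER_M["snake_16k"], 1)

print(on_device_total, lite_pesq_cost, lite_param_ratio, total_16k)
```

The parameter ratio comes out close to the quoted ~2.2× compute saving, though the card reports that figure as compute and does not say whether it was derived from parameter counts.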

Frontend. g2pw (Taiwan bopomofo + polyphone disambiguation) + g2p_en (arpabet), merged into one phone sequence with per-phone language ids — zh, en, code-mix in a single pass. 88-symbol table. Entity normalization (text_norm.py) handles numbers / dates / prices / emails / serials, spells ALL-CAPS acronyms and a small brand lexicon. Text past max_frames is auto-chunked at punctuation.
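The punctuation auto-chunking can be sketched as a split followed by greedy packing. This is not the repository's implementation: chunk_at_punctuation is a hypothetical name, and a character budget (max_chars) stands in for the real frame budget, since the frame count depends on durations the model predicts.

```python
import re
from typing import List

# Split after CJK punctuation anywhere, and after ASCII punctuation only when
# it is followed by whitespace or the end of the text, so "3.5" and
# "user@example.com" stay intact.
_SPLIT = re.compile(r"(?<=[。!?;,])|(?<=[.!?;,])(?=\s|$)")

def chunk_at_punctuation(text: str, max_chars: int = 40) -> List[str]:
    """Greedily pack punctuation-delimited pieces into chunks of <= max_chars.

    A single piece longer than max_chars is kept whole rather than cut mid-clause.
    """
    pieces = [p for p in _SPLIT.split(text) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += piece
    if current.strip():
        chunks.append(current.strip())
    return chunks

for chunk in chunk_at_punctuation("您好,歡迎使用 PrimeTTS。Thank you for calling.", max_chars=12):
    print(chunk)
```

Each chunk would then be synthesized separately and the waveforms concatenated, so breaking at punctuation keeps the joins on natural pauses.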

Model files

v1b_8k/{acoustic_encoder,acoustic_decoder,vocoder}.onnx + meta.json   ← RECOMMENDED on-device (4.09M, 8 kHz) — the demo serves this
{acoustic_encoder,acoustic_decoder,vocoder}.onnx + meta.json          ← 24 kHz higher-fidelity variant (6.85M)
v3_4.6M/…                                                             ← legacy 24 kHz (4.63M), kept for record
scripts/        frontend, aligner, corpus-gen, train / export, eval
inflect_nano/   the trainer (acoustic.py + vocoder.py), forked from Inflect-Nano-v1 (LICENSE included)

Quickstart (CPU)

pip install onnxruntime numpy soundfile g2pw g2p_en cn2an
huggingface-cli download Luigi/PrimeTTS --local-dir PrimeTTS

import sys; sys.path.insert(0, "PrimeTTS/scripts")
import json, numpy as np, onnxruntime as ort, soundfile as sf
import frontend_bopomofo as F
from synth_from_text import host_regulate            # numpy length-regulator

D = "PrimeTTS/v1b_8k"                                 # the on-device model
meta = json.load(open(f"{D}/meta.json"))
enc = ort.InferenceSession(f"{D}/acoustic_encoder.onnx", providers=["CPUExecutionProvider"])
dec = ort.InferenceSession(f"{D}/acoustic_decoder.onnx", providers=["CPUExecutionProvider"])
voc = ort.InferenceSession(f"{D}/vocoder.onnx",          providers=["CPUExecutionProvider"])

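# Frontend: text → phone / tone / language id sequences (zh, en, code-mix in one pass)
o = F.text_to_ids("您好,歡迎使用 PrimeTTS。Thank you for calling.")
ph, tn, lg = (np.array([o[k]], np.int64) for k in ("phone_ids", "tone_ids", "lang_ids"))
# Encoder: phone-level conditioning, durations, and pitch (single voice, speaker id 0)
cond, dur, pitch = enc.run(None, {"phone": ph, "tone": tn, "lang": lg, "speaker": np.zeros(1, np.int64)})
# Length-regulate on the host in numpy, then decode the frames to a mel spectrogram
reg = host_regulate(cond, dur, pitch, meta["abs_frame_bins"], meta["max_frames"])
mel = dec.run(None, {k: reg[k] for k in
      ["frames","frame_meta","local_ctx_raw","abs_pos","pitch_frame","frame_mask"]})[0]
# Vocoder: mel → waveform, written at the model's sample rate
wav = voc.run(None, {"mel": mel.astype(np.float32)})[0].reshape(-1)
sf.write("out.wav", wav, meta["sample_rate"])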

The pipeline — encoder → numpy length-regulator → decoder → vocoder — is torch-free and runs as-is on a Jetson Nano CPU. (scripts/synth_long.py adds the punctuation auto-chunking for long text.)
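The length-regulator step in this pipeline expands phone-level values to frame level using the predicted durations. The sketch below shows only that core idea in plain Python. It is not the repository's host_regulate, which also builds the frame_meta, abs_pos, and mask inputs seen in the Quickstart; the function name and sample values are made up for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def length_regulate(items: Sequence[T], durations: Sequence[int]) -> List[T]:
    """Expand phone-level items to frame level: item i is repeated durations[i] times."""
    if len(items) != len(durations):
        raise ValueError("need exactly one duration per phone")
    frames: List[T] = []
    for item, n_frames in zip(items, durations):
        frames.extend([item] * max(int(n_frames), 0))  # clamp negatives to zero
    return frames

phones = ["n", "i", "h", "au"]         # illustrative phone labels
coarse_pitch = [0.2, 0.8, -0.1, -0.6]  # one made-up pitch value per phone
durations = [2, 3, 1, 4]               # frames per phone

frame_phones = length_regulate(phones, durations)
frame_pitch = length_regulate(coarse_pitch, durations)  # step-shaped contour
print(len(frame_phones), frame_pitch)
```

The step-shaped frame_pitch is the kind of coarse input that, per the architecture description, the frame-pitch refiner turns into a per-frame F0 contour. On arrays, numpy's np.repeat(x, durations, axis=0) performs the same expansion.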

Training

Distilled from a single teacher voice so zh / en / code-mix share one timbre and accent:

  • Reference voice — a young Taiwan-female speaker from Mozilla Common Voice zh-TW (CC0 / public domain, commercial-clear). ~13 s of the cleanest clips fix the accent (it comes from the reference, not prompting) and keep the model shippable.
  • Teacher: VoxCPM2 (openbmb/VoxCPM2) voice-clones that reference for every line (48 kHz → resampled).
  • Corpus — Taiwan office / phone / GPS / transit register: diverse Mandarin, general + domain English, code-mix in varied positions, a large named-entity bank (TW + world places / roads / transit / companies / people / products), plus a rare-character + brand + email booster (the latest data lever).
  • ASR gate — Breeze-ASR-25 (zh / mix CER) + Whisper-medium (en WER) keep only clips matching their text; proper-noun coverage clips are trusted unfiltered.
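The ASR gate in the last bullet can be sketched as a character-error-rate filter. The threshold of 0.1 and the function names are assumptions, since the card does not state the cut-off it uses. A real gate would also normalize punctuation and whitespace before comparing, and would use WER on words for English clips.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def cer(reference: str, transcript: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(reference, transcript) / max(len(reference), 1)

def passes_gate(text: str, transcript: str, max_cer: float = 0.1) -> bool:
    """Keep a synthesized clip only if the ASR transcript is close to its text."""
    return cer(text, transcript) <= max_cer

print(passes_gate("歡迎使用", "歡迎使用"), passes_gate("歡迎使用", "歡迎試用"))
```

Clips that the card describes as trusted unfiltered (proper-noun coverage) would simply bypass this check.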

Three levers matter most for a tiny model: phone-level alignment (espeak phoneme-CTC + torchaudio.functional.forced_align; sub-syllable boundaries are what separate intelligible speech from fluent babble), broad coverage with diverse code-mix, and the teacher (a student's English is only as native as its teacher's).

Pipeline: teacher corpus → ASR gate → align → train vocoder → warm-start + train acoustic → export. The 8 kHz on-device model warm-starts from the 24 kHz acoustic and adapts to 8 kHz; the trainer resamples audio and rescales durations internally. Full commands and a one-shot scripts/rebuild_voice.sh (swap in your own ~10 s reference clip) are in the repo.

Known characteristics & limitations

  • 8 kHz band ceiling — telephone-band brightness (the 16 kHz variant in progress addresses it).
  • Empty-rime syllables (是 / 十 / 日, the syllabic ㄭ) and isolated spelled letters (the leading "A" of a serial) are the fragile cases at this size: the frontend emits the right phones, but a ~4M acoustic renders them weakly. Cross-checking a robust (Breeze) vs strict (X-ASR) recognizer exposes this where a single CER number hides it.
  • Phrase-initial bare vowels in ultra-short isolated inputs ("二月" alone) can garble; fine in normal sentences.

Credits & licenses

This repository: Apache-2.0.
