PrimeTTS — tiny on-device zh-TW + English TTS

A 4.09M-parameter Taiwan-Mandarin + English text-to-speech model that runs entirely on CPU. It is small and fast enough for Jetson-class on-device use (contact-centre, GPS, transit). One model provides one young-female voice: Chinese, English, and code-mixed text all go through a single frontend with no language routing. The model is built for entity correctness: phone numbers, emails, addresses, prices, dates, temperatures, percentages, and serial numbers.

🔊 Live demo: https://huggingface.co/spaces/Luigi/PrimeTTS-vs-Inflect-Nano-v1


Parameters	4.09M = 3.56M acoustic (incl. 97K pitch refiner) + 0.53M vocoder
Sample rate	8 kHz (on-device; a 24 kHz variant is in-repo)
Runtime	`onnxruntime`, CPU-only, torch-free at inference
Languages	zh-TW (Traditional) + English + code-mix — single voice
Voice	young female, Taiwan-Mandarin accent
Measured	zh-CER 0.109 · en-WER 0.083 · Jetson Nano RTF 0.35 (1 thread)
License	Apache-2.0

Highlights

Tiny + CPU-only — ~4M params, ONNX, torch-free; real-time on a Jetson Nano (RTF 0.35, single thread).
One voice, three modes — zh / en / code-mix share one timbre and accent through a single frontend; no language tag needed.
Mandarin tones via a frame-pitch refiner — a 97K-param module turns coarse per-phoneme pitch into a per-frame F0 contour (the tone carrier). Ablating it raises zh-CER by 18% relative (zh-only; English is unaffected because it has no lexical tone).
Entity-correct — a normalization layer reads numbers, dates, prices, emails, addresses, serials, and spells acronyms/letters (VIP → V-I-P), applied identically in training and at inference.
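The acronym rule in the last highlight (VIP → V-I-P) can be illustrated with a small regex. This is a minimal sketch only: the function name `spell_acronyms` and the pattern are assumptions made for illustration, and the actual rules in the repository's normalization layer may differ (for example, in how brand names are exempted).

```python
import re

# Hypothetical pattern, not taken from the repository: a run of two or more
# capital letters that is not glued to other Latin letters. Lookarounds are
# used instead of \b because CJK characters count as word characters in
# Python, so \b would miss "歡迎VIP會員".
_ACRONYM = re.compile(r"(?<![A-Za-z])[A-Z]{2,}(?![A-Za-z])")


def spell_acronyms(text: str) -> str:
    """Rewrite each ALL-CAPS run as hyphen-separated letters (VIP -> V-I-P)."""
    return _ACRONYM.sub(lambda m: "-".join(m.group(0)), text)


print(spell_acronyms("歡迎VIP會員, please call HQ."))
```

Mixed-case tokens such as `PrimeTTS` are left untouched by this pattern, because the capital run is attached to lowercase letters.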

Performance — held-out (36 unseen phone-attendant sentences)

metric	value
zh-CER (Breeze-ASR-25)	0.109 (pure-zh 0.119 · code-mix 0.098)
en-WER (Whisper)	0.083
Taiwan-accent gap¹	+0.033 (genuine TW accent)
SQUIM PESQ / STOI	2.22 / 0.90
On Jetson Nano (ORT-CPU, 1 thread)	RTF 0.347 · on-device CER ≈ 0.117

¹ CER(generic ASR) − CER(Taiwan-tuned Breeze-ASR-25) per zh clip; >0 ⇒ a Taiwan recognizer understands it better ⇒ a real Taiwan accent is present.
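The accent-gap definition in the footnote is a per-clip difference of two CER values. The sketch below assumes the headline figure is the mean of those per-clip differences (the card does not state the aggregation), and the input numbers are invented purely to show the arithmetic.

```python
from statistics import mean


def accent_gap(generic_cer, taiwan_cer):
    """Mean per-clip CER(generic ASR) - CER(Taiwan-tuned ASR).

    A positive value means the Taiwan-tuned recognizer makes fewer errors.
    Averaging over clips is an assumption of this sketch.
    """
    if len(generic_cer) != len(taiwan_cer):
        raise ValueError("both lists must cover the same clips")
    return mean(g - t for g, t in zip(generic_cer, taiwan_cer))


# Invented per-clip CERs, for illustration only (not measurements).
generic = [0.20, 0.10, 0.15]
taiwan = [0.15, 0.10, 0.09]
gap = accent_gap(generic, taiwan)
print(f"accent gap: {gap:+.3f}")
```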

On audio quality: at 8 kHz the band ceiling is 4 kHz (Nyquist), which discards the brightness and sibilance above it. The result is clear and intelligible, but telephone-band. A 16 kHz variant (0–8 kHz, ~5.0M) is in progress for higher fidelity while staying on-device; until then, the 24 kHz files (below) offer the highest fidelity available.

Architecture

Acoustic — MicroFastSpeech (3.56M). FastSpeech-style, no attention: depthwise gated Conv-FFN, external durations + length regulator, frame-pitch, BiGRU, postnet — plus the frame-pitch refiner (Conv1d → SiLU → Conv1d(groups=4) → SiLU → Conv1d, 97K) that builds the per-frame F0 contour = Mandarin tones.

{ "vocab_size": 256, "tone_size": 16, "lang_size": 4, "n_mels": 80,
  "hidden": 168, "encoder_layers": 5, "decoder_layers": 6, "decoder_ff_mult": 3,
  "sample_rate": 8000, "max_frames": 1000, "use_frame_pitch_refiner": true }
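The "external durations + length regulator" step in the acoustic description follows the usual FastSpeech idea: each phone-level vector is repeated for as many frames as its duration. The sketch below shows that idea in plain Python; it is not the repository's `host_regulate`, whose exact inputs and outputs are not described here, and the function name is hypothetical.

```python
def length_regulate(phone_vectors, durations):
    """Repeat phone_vectors[i] durations[i] times to get a frame-level sequence.

    Hypothetical helper for illustration: negative or fractional durations
    are clamped / truncated to whole frames.
    """
    if len(phone_vectors) != len(durations):
        raise ValueError("one duration is needed per phone")
    frames = []
    for vec, dur in zip(phone_vectors, durations):
        frames.extend([vec] * max(0, int(dur)))
    return frames


# Three phones lasting 2, 0 and 3 frames.
print(length_regulate(["a", "b", "c"], [2, 0, 3]))
```

Because the durations come from outside the decoder, no attention alignment is needed at inference, which is part of what keeps the model small.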

Vocoder — Snake-HiFiGAN. The on-device model uses the lightweight snake_8k_lite. The family — same architecture, different width / sample rate — is why a model's headline param count varies:

variant	used by	params	rate	band	PESQ²
`snake_8k_lite`	PrimeTTS (on-device)	0.53M	8 kHz	0–4 kHz	2.34
`snake_8k`	8 kHz, full width	1.15M	8 kHz	0–4 kHz	2.60
`snake_v2mid`	24 kHz variant	1.17M	24 kHz	0–12 kHz	3.23
`snake_16k`	16 kHz (roadmap)	1.43M	16 kHz	0–8 kHz	—

² SQUIM-PESQ, 5-utterance mean. The large PESQ gain from 8 to 24 kHz comes from the sample rate (wider band), not from the vocoder design. _lite gives up ~0.26 PESQ against the full snake_8k in exchange for ~2.2× less compute, which is what lowers the Nano RTF. The 4.09M on-device total is therefore 3.56M acoustic + the 0.53M lite vocoder; the 24 kHz files pair the same-class acoustic with the heavier 1.17M snake_v2mid.
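The parameter totals and band ceilings quoted above are simple arithmetic, sketched here from the table values. One assumption is flagged in the code: the 3.56M acoustic size is only confirmed for the 8 kHz on-device model, so totals for other vocoders are hypothetical pairings (the 16 kHz figure happens to match the "~5.0M" mentioned earlier).

```python
ACOUSTIC_M = 3.56  # acoustic parameters in millions (8 kHz on-device model)

# variant -> (vocoder parameters in millions, sample rate in Hz), from the table
VOCODERS = {
    "snake_8k_lite": (0.53, 8_000),
    "snake_8k": (1.15, 8_000),
    "snake_v2mid": (1.17, 24_000),
    "snake_16k": (1.43, 16_000),
}


def summarize(variant):
    """Total size if paired with the 3.56M acoustic, plus the Nyquist band ceiling."""
    params_m, rate_hz = VOCODERS[variant]
    return {
        # Assumption: same 3.56M acoustic for every pairing (only true for 8k_lite).
        "total_params_m": round(ACOUSTIC_M + params_m, 2),
        # Nyquist: the highest representable frequency is half the sample rate.
        "band_ceiling_hz": rate_hz // 2,
    }


for name in ("snake_8k_lite", "snake_16k"):
    print(name, summarize(name))
```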

Frontend. g2pw (Taiwan bopomofo + polyphone disambiguation) and g2p_en (arpabet) are merged into one phone sequence with per-phone language ids, so zh, en, and code-mix are handled in a single pass. The symbol table has 88 entries. Entity normalization (text_norm.py) handles numbers / dates / prices / emails / serials, spells out ALL-CAPS acronyms, and covers a small brand lexicon. Text that would exceed max_frames is auto-chunked at punctuation.
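Auto-chunking at punctuation can be sketched as a greedy packer: split after clause marks, then merge pieces until a budget is reached. The real limit is in frames (max_frames), which depends on predicted durations; this sketch substitutes a character budget as a rough proxy. The function name, the `max_chars` parameter, and the punctuation set are all assumptions, not the repository's implementation.

```python
import re

# Split after full-width marks anywhere, and after ASCII marks only when
# followed by whitespace, so "3.5" or "a@b.com" are not cut in half.
_SPLIT = re.compile(r"(?<=[。!?;,])|(?<=[.!?;,])(?=\s)")


def chunk_at_punctuation(text, max_chars=40):
    """Greedily pack punctuation-delimited pieces into chunks of <= max_chars.

    A single piece longer than max_chars is kept whole rather than cut
    mid-clause.
    """
    pieces = [p for p in _SPLIT.split(text) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += piece
    if current.strip():
        chunks.append(current.strip())
    return chunks


for chunk in chunk_at_punctuation("你好,世界。Hello there, friend. Bye!", max_chars=12):
    print(chunk)
```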

Model files

v1b_8k/{acoustic_encoder,acoustic_decoder,vocoder}.onnx + meta.json   ← RECOMMENDED on-device (4.09M, 8 kHz) — the demo serves this
{acoustic_encoder,acoustic_decoder,vocoder}.onnx + meta.json          ← 24 kHz higher-fidelity variant (6.85M)
v3_4.6M/…                                                             ← legacy 24 kHz (4.63M), kept for record
scripts/        frontend, aligner, corpus-gen, train / export, eval
inflect_nano/   the trainer (acoustic.py + vocoder.py), forked from Inflect-Nano-v1 (LICENSE included)

Quickstart (CPU)

pip install onnxruntime numpy soundfile g2pw g2p_en cn2an
huggingface-cli download Luigi/PrimeTTS --local-dir PrimeTTS

import sys
sys.path.insert(0, "PrimeTTS/scripts")               # make the repo's helper scripts importable

import json
import numpy as np
import onnxruntime as ort
import soundfile as sf
import frontend_bopomofo as F
from synth_from_text import host_regulate            # numpy length-regulator

D = "PrimeTTS/v1b_8k"                                 # the on-device model
with open(f"{D}/meta.json", encoding="utf-8") as fh:
    meta = json.load(fh)
cpu = ["CPUExecutionProvider"]
enc = ort.InferenceSession(f"{D}/acoustic_encoder.onnx", providers=cpu)
dec = ort.InferenceSession(f"{D}/acoustic_decoder.onnx", providers=cpu)
voc = ort.InferenceSession(f"{D}/vocoder.onnx",          providers=cpu)

# Text -> phone / tone / language ids, each with a leading batch axis of 1
o = F.text_to_ids("您好,歡迎使用 PrimeTTS。Thank you for calling.")
ph, tn, lg = (np.array([o[k]], np.int64) for k in ("phone_ids", "tone_ids", "lang_ids"))
# Encoder: per-phone conditioning, durations, and pitch
cond, dur, pitch = enc.run(None, {"phone": ph, "tone": tn, "lang": lg, "speaker": np.zeros(1, np.int64)})
# Expand phone-level outputs to frame level on the host (no torch needed)
reg = host_regulate(cond, dur, pitch, meta["abs_frame_bins"], meta["max_frames"])
# Decoder: frame-level inputs -> mel spectrogram
mel = dec.run(None, {k: reg[k] for k in
      ["frames","frame_meta","local_ctx_raw","abs_pos","pitch_frame","frame_mask"]})[0]
# Vocoder: mel -> waveform, flattened to 1-D samples
wav = voc.run(None, {"mel": mel.astype(np.float32)})[0].reshape(-1)
sf.write("out.wav", wav, meta["sample_rate"])

The pipeline — encoder → numpy length-regulator → decoder → vocoder — is torch-free and runs as-is on a Jetson Nano CPU. (scripts/synth_long.py adds the punctuation auto-chunking for long text.)
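To check the real-time factor on your own hardware, one option is to time the whole pipeline and divide by the duration of the audio produced. The helper below is a minimal sketch: `real_time_factor` is a hypothetical name, and `synthesize` stands for any callable you write that wraps the quickstart steps and returns the sample array.

```python
import time


def real_time_factor(synthesize, text, sample_rate):
    """Return (rtf, samples): wall-clock synthesis time / audio duration.

    RTF below 1.0 means faster than real time.
    """
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    if audio_seconds == 0:
        raise ValueError("synthesize returned no audio")
    return elapsed / audio_seconds, samples


# Stand-in synthesizer that returns one second of silence at 8 kHz.
rtf, samples = real_time_factor(lambda text: [0.0] * 8000, "test", 8000)
print(f"RTF = {rtf:.4f} over {len(samples) / 8000:.1f} s of audio")
```

To mirror the single-thread figure reported above, onnxruntime sessions can be created with a `SessionOptions` object whose `intra_op_num_threads` is set to 1. A warm-up call before timing is also advisable, since the first run includes one-time initialization.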

Training

Distilled from a single teacher voice so zh / en / code-mix share one timbre and accent:

Reference voice — a young Taiwan-female speaker from Mozilla Common Voice zh-TW (CC0 / public domain, commercial-clear). About 13 s of the cleanest clips fix the accent (it comes from the reference audio, not from prompting), and the CC0 source keeps the model shippable.
Teacher — VoxCPM2 (openbmb/VoxCPM2) voice-clones that reference for every line (48 kHz → resampled).
Corpus — Taiwan office / phone / GPS / transit register: diverse Mandarin, general + domain English, code-mix in varied positions, a large named-entity bank (TW + world places / roads / transit / companies / people / products), plus a rare-character + brand + email booster (the latest data lever).
ASR gate — Breeze-ASR-25 (zh / mix CER) + Whisper-medium (en WER) keep only clips matching their text; proper-noun coverage clips are trusted unfiltered.
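The ASR gate can be sketched as a filter: transcribe each synthesized clip, compare against the intended text, and keep the clip only if the error rate is low enough. Everything specific here is an assumption made for illustration: the `max_cer` threshold, the clip dictionary layout, the `trusted` flag for proper-noun coverage clips, and the `transcribe` callable standing in for a real recognizer. The sketch uses character error rate throughout, whereas the pipeline described above uses WER for English.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate, ignoring spaces."""
    ref, hyp = ref.replace(" ", ""), hyp.replace(" ", "")
    return edit_distance(ref, hyp) / max(1, len(ref))


def asr_gate(clips, transcribe, max_cer=0.1):
    """Keep trusted clips, plus clips whose transcript is close to the text."""
    return [c for c in clips
            if c.get("trusted") or cer(c["text"], transcribe(c["audio"])) <= max_cer]


# Toy data: a dict lookup stands in for the recognizer.
fake_asr = {"a.wav": "歡迎光臨", "b.wav": "謝謝你的電", "c.wav": "???"}
clips = [
    {"audio": "a.wav", "text": "歡迎光臨"},
    {"audio": "b.wav", "text": "謝謝您的來電"},
    {"audio": "c.wav", "text": "淡水信義線", "trusted": True},
]
kept = asr_gate(clips, fake_asr.get)
print([c["audio"] for c in kept])
```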

The three levers that matter most for a tiny model are phone-level alignment (espeak phoneme-CTC + torchaudio.forced_align; sub-syllable boundaries are what separate intelligible speech from fluent babble), broad coverage with diverse code-mix, and the teacher (a student's English is only as native as its teacher's).

Pipeline: teacher corpus → ASR gate → align → train vocoder → warm-start + train acoustic → export. The 8 kHz on-device model warm-starts from the 24 kHz acoustic model and adapts to 8 kHz; the trainer resamples audio and rescales durations internally.

Full commands and a one-shot scripts/rebuild_voice.sh (swap in your own ~10 s reference clip) are in the repo.

Known characteristics & limitations

8 kHz band ceiling — telephone-band brightness (the 16 kHz variant in progress addresses it).
Empty-rime syllables (是 / 十 / 日, the syllabic ㄭ) and isolated spelled letters (the leading "A" of a serial) are the fragile cases at this size: the frontend emits the right phones, but a ~4M-parameter acoustic model renders them weakly. Cross-checking a robust recognizer (Breeze) against a strict one (X-ASR) exposes this weakness, which a single CER number hides.
Phrase-initial bare vowels in ultra-short isolated inputs ("二月" alone) can garble; fine in normal sentences.

Credits & licenses

Base / trainer: owensong/Inflect-Nano-v1 (Apache-2.0)
Teacher: openbmb/VoxCPM2 · Reference voice: Mozilla Common Voice zh-TW (CC0 / public domain)
Gate ASR: Breeze-ASR-25 (MediaTek Research) · Whisper-medium · Aligner: facebook/wav2vec2-lv-60-espeak-cv-ft + torchaudio.forced_align · Eval: sherpa-onnx X-ASR

This repository: Apache-2.0.
