PrimeTTS — tiny on-device zh-TW + English TTS
A 4.09M-parameter Taiwan-Mandarin + English text-to-speech model that runs entirely on CPU — small and fast enough for Jetson-class on-device use (contact-centre, GPS, transit). One model, one young-female voice: Chinese, English, and code-mix through a single frontend (no language routing), built for entity correctness — phone numbers, emails, addresses, prices, dates, temperatures, %, serials.
🔊 Live demo: https://huggingface.co/spaces/Luigi/PrimeTTS-vs-Inflect-Nano-v1
| | |
|---|---|
| Parameters | 4.09M = 3.56M acoustic (incl. 97K pitch refiner) + 0.53M vocoder |
| Sample rate | 8 kHz (on-device; a 24 kHz variant is in-repo) |
| Runtime | onnxruntime, CPU-only, torch-free at inference |
| Languages | zh-TW (Traditional) + English + code-mix — single voice |
| Voice | young female, Taiwan-Mandarin accent |
| Measured | zh-CER 0.109 · en-WER 0.083 · Jetson Nano RTF 0.35 (1 thread) |
| License | Apache-2.0 |
Highlights
- Tiny + CPU-only — ~4M params, ONNX, torch-free; real-time on a Jetson Nano (RTF 0.35, single thread).
- One voice, three modes — zh / en / code-mix share one timbre and accent through a single frontend; no language tag needed.
- Mandarin tones via a frame-pitch refiner — a 97K-param module turns coarse per-phoneme pitch into a per-frame F0 contour (the tone carrier). Ablating it costs +18% relative zh-CER (zh-only; English is unaffected — no lexical tone).
- Entity-correct — a normalization layer reads numbers, dates, prices, emails, addresses, serials, and spells acronyms/letters (VIP → V-I-P), applied identically in training and at inference.
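The acronym rule in the last bullet can be sketched as follows. This is a minimal illustration, not the repository's text_norm.py: the name `spell_acronyms` and the regex are assumptions, and the sketch only handles standalone runs of two or more ASCII capitals.

```python
import re

# Hypothetical pattern: a run of 2+ ASCII capitals not attached to other letters,
# so "VIP" matches but the "TTS" inside "PrimeTTS" does not.
_ACRONYM = re.compile(r"(?<![A-Za-z])[A-Z]{2,}(?![A-Za-z])")

def spell_acronyms(text: str) -> str:
    """Rewrite ALL-CAPS tokens letter by letter, e.g. VIP -> V-I-P."""
    return _ACRONYM.sub(lambda m: "-".join(m.group(0)), text)

print(spell_acronyms("您的 VIP 會員已啟用"))  # 您的 V-I-P 會員已啟用
```

Running the same function over both training transcripts and inference input is what keeps the two sides consistent, as the bullet describes.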
Performance — held-out (36 unseen phone-attendant sentences)
| metric | value |
|---|---|
| zh-CER (Breeze-ASR-25) | 0.109 (pure-zh 0.119 · code-mix 0.098) |
| en-WER (Whisper) | 0.083 |
| Taiwan-accent gap¹ | +0.033 (genuine TW accent) |
| SQUIM PESQ / STOI | 2.22 / 0.90 |
| On Jetson Nano (ORT-CPU, 1 thread) | RTF 0.347 · on-device CER ≈ 0.117 |
¹ CER(generic ASR) − CER(Taiwan-tuned Breeze-ASR-25) per zh clip; >0 ⇒ a Taiwan recognizer understands
it better ⇒ a real Taiwan accent is present.
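The footnote's per-clip gap can be computed as in the sketch below. It assumes CER is character-level edit distance divided by reference length; the function names and the example transcripts are illustrative, not taken from the evaluation scripts.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution or match
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate of a hypothesis against its reference text."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

def accent_gap(ref: str, hyp_generic: str, hyp_taiwan: str) -> float:
    """CER(generic ASR) minus CER(Taiwan-tuned ASR) for one clip."""
    return cer(ref, hyp_generic) - cer(ref, hyp_taiwan)

# Made-up transcripts: the generic recognizer gets one of eight characters wrong.
print(accent_gap("歡迎使用語音服務", "歡迎使用語言服務", "歡迎使用語音服務"))  # 0.125
```

A positive value means the Taiwan-tuned recognizer transcribed the clip more accurately than the generic one.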
On audio quality: at 8 kHz the band ceiling is 4 kHz (Nyquist), which discards the brightness/sibilance above it — clear and intelligible, but telephone-band. A 16 kHz variant (0–8 kHz, ~5.0M) is in progress for higher fidelity while staying on-device; the 24 kHz files (below) are the fullest today.
Architecture
Acoustic — MicroFastSpeech (3.56M). FastSpeech-style, no attention: depthwise gated Conv-FFN,
external durations + length regulator, frame-pitch, BiGRU, postnet — plus the frame-pitch refiner
(Conv1d → SiLU → Conv1d(groups=4) → SiLU → Conv1d, 97K) that builds the per-frame F0 contour = Mandarin tones.
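The "external durations + length regulator" step can be illustrated with a minimal numpy sketch. It shows only the core idea, repeating each phone's feature vector for its predicted number of frames. The function name `length_regulate` is an assumption, and the repository's own host_regulate takes further inputs (pitch, frame bins, max_frames) that are not modelled here.

```python
import numpy as np

def length_regulate(phone_feats: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Expand (n_phones, hidden) features to (n_frames, hidden) by repeating
    each phone's vector for its duration in frames."""
    # Round to whole frames and clamp negatives to zero.
    frames_per_phone = np.maximum(np.rint(durations).astype(np.int64), 0)
    return np.repeat(phone_feats, frames_per_phone, axis=0)

feats = np.arange(6, dtype=np.float32).reshape(3, 2)   # 3 phones, hidden = 2
frames = length_regulate(feats, np.array([2, 0, 3]))   # middle phone is dropped
print(frames.shape)  # (5, 2)
```

Because the expansion is a plain repeat, it needs no attention and runs outside the ONNX graphs on the host.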
{ "vocab_size": 256, "tone_size": 16, "lang_size": 4, "n_mels": 80,
"hidden": 168, "encoder_layers": 5, "decoder_layers": 6, "decoder_ff_mult": 3,
"sample_rate": 8000, "max_frames": 1000, "use_frame_pitch_refiner": true }
Vocoder — Snake-HiFiGAN. The on-device model uses the lightweight snake_8k_lite. The family — same
architecture, different width / sample rate — is why a model's headline param count varies:
| variant | used by | params | rate | band | PESQ² |
|---|---|---|---|---|---|
| snake_8k_lite | PrimeTTS (on-device) | 0.53M | 8 kHz | 0–4 kHz | 2.34 |
| snake_8k | 8 kHz, full width | 1.15M | 8 kHz | 0–4 kHz | 2.60 |
| snake_v2mid | 24 kHz variant | 1.17M | 24 kHz | 0–12 kHz | 3.23 |
| snake_16k | 16 kHz (roadmap) | 1.43M | 16 kHz | 0–8 kHz | n/a |
² SQUIM-PESQ, 5-utterance mean. The big 8 → 24 kHz jump is the sample rate (band), not the vocoder;
_lite trades ~0.26 PESQ vs the full snake_8k for ~2.2× less compute (the lower Nano RTF). So the 4.09M
on-device total = 3.56M acoustic + the 0.53M lite vocoder; the 24 kHz files pair the same-class acoustic
with the heavier 1.17M snake_v2mid.
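The headline arithmetic can be checked with a short sketch. The numbers are copied from the table, and the band ceiling is the Nyquist limit (half the sample rate). Pairing the 16 kHz vocoder with the same 3.56M acoustic is an assumption made here for illustration. The 24 kHz files are left out because their acoustic size is not given in this section.

```python
ACOUSTIC_M = 3.56  # acoustic parameters, in millions

# name -> (vocoder parameters in millions, sample rate in Hz), from the table
VOCODERS = {
    "snake_8k_lite": (0.53, 8_000),
    "snake_16k": (1.43, 16_000),   # roadmap; acoustic size assumed unchanged
}

def summarize(name: str):
    """Return (total parameters in millions, band ceiling in Hz)."""
    vocoder_m, rate = VOCODERS[name]
    return round(ACOUSTIC_M + vocoder_m, 2), rate // 2

print(summarize("snake_8k_lite"))  # (4.09, 4000)
print(summarize("snake_16k"))      # (4.99, 8000)
```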
Frontend. g2pw (Taiwan bopomofo + polyphone disambiguation) + g2p_en (arpabet), merged into one phone
sequence with per-phone language ids, so zh, en, and code-mix go through a single pass. 88-symbol table.
Entity normalization (text_norm.py) handles numbers / dates / prices / emails / serials, spells out ALL-CAPS
acronyms, and covers a small brand lexicon. Text past max_frames is auto-chunked at punctuation.
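Punctuation chunking can be sketched as below. This is an illustrative stand-in, not scripts/synth_long.py: it budgets by character count (`max_chars`, a hypothetical parameter) rather than by frames, and the split pattern is an assumption. ASCII punctuation only splits when followed by whitespace, so decimals and email addresses stay intact.

```python
import re

# Split after full-width punctuation, or after ASCII punctuation followed by a space.
_SPLIT = re.compile(r"(?<=[。！？；，])|(?<=[.!?;,])(?=\s)")

def chunk_text(text: str, max_chars: int = 40) -> list:
    """Greedily pack punctuation-delimited pieces into chunks of at most
    max_chars characters. A single over-long piece is kept whole."""
    chunks, cur = [], ""
    for piece in _SPLIT.split(text):
        if cur and len(cur) + len(piece) > max_chars:
            chunks.append(cur.strip())
            cur = ""
        cur += piece
    if cur.strip():
        chunks.append(cur.strip())
    return chunks

text = "您好，歡迎使用 PrimeTTS。Thank you for calling. Goodbye."
print(chunk_text(text, max_chars=20))
# ['您好，歡迎使用 PrimeTTS。', 'Thank you for calling.', 'Goodbye.']
```

Each chunk would then be synthesized separately and the waveforms concatenated.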
Model files
v1b_8k/{acoustic_encoder,acoustic_decoder,vocoder}.onnx + meta.json ← RECOMMENDED on-device (4.09M, 8 kHz) — the demo serves this
{acoustic_encoder,acoustic_decoder,vocoder}.onnx + meta.json ← 24 kHz higher-fidelity variant (6.85M)
v3_4.6M/… ← legacy 24 kHz (4.63M), kept for record
scripts/ frontend, aligner, corpus-gen, train / export, eval
inflect_nano/ the trainer (acoustic.py + vocoder.py), forked from Inflect-Nano-v1 (LICENSE included)
Quickstart (CPU)
pip install onnxruntime numpy soundfile g2pw g2p_en cn2an
huggingface-cli download Luigi/PrimeTTS --local-dir PrimeTTS
import sys; sys.path.insert(0, "PrimeTTS/scripts")
import json, numpy as np, onnxruntime as ort, soundfile as sf
import frontend_bopomofo as F
from synth_from_text import host_regulate # numpy length-regulator
D = "PrimeTTS/v1b_8k" # the on-device model
meta = json.load(open(f"{D}/meta.json"))
enc = ort.InferenceSession(f"{D}/acoustic_encoder.onnx", providers=["CPUExecutionProvider"])
dec = ort.InferenceSession(f"{D}/acoustic_decoder.onnx", providers=["CPUExecutionProvider"])
voc = ort.InferenceSession(f"{D}/vocoder.onnx", providers=["CPUExecutionProvider"])
o = F.text_to_ids("您好,歡迎使用 PrimeTTS。Thank you for calling.")
ph, tn, lg = (np.array([o[k]], np.int64) for k in ("phone_ids", "tone_ids", "lang_ids"))
cond, dur, pitch = enc.run(None, {"phone": ph, "tone": tn, "lang": lg, "speaker": np.zeros(1, np.int64)})
reg = host_regulate(cond, dur, pitch, meta["abs_frame_bins"], meta["max_frames"])
mel = dec.run(None, {k: reg[k] for k in
["frames","frame_meta","local_ctx_raw","abs_pos","pitch_frame","frame_mask"]})[0]
wav = voc.run(None, {"mel": mel.astype(np.float32)})[0].reshape(-1)
sf.write("out.wav", wav, meta["sample_rate"])
The pipeline — encoder → numpy length-regulator → decoder → vocoder — is torch-free and runs as-is on a
Jetson Nano CPU. (scripts/synth_long.py adds the punctuation auto-chunking for long text.)
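To check real-time factor on your own hardware, a small timing helper is enough. RTF is synthesis time divided by the duration of the audio produced, so values below 1 are faster than real time. The helper names are illustrative and the figures in the example are made up, not measurements.

```python
import time

def real_time_factor(synth_seconds: float, n_samples: int, sample_rate: int) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    return synth_seconds / (n_samples / sample_rate)

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Illustrative numbers: 0.7 s of compute for 16,000 samples at 8 kHz (2 s of audio).
print(real_time_factor(0.7, 16_000, 8_000))  # 0.35
```

Wrapping the three `run` calls and the length regulator in `timed` and passing `len(wav)` with `meta["sample_rate"]` gives an end-to-end figure.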
Training
Distilled from a single teacher voice so zh / en / code-mix share one timbre and accent:
- Reference voice: a young Taiwan-female speaker from Mozilla Common Voice zh-TW (CC0 / public domain, commercial-clear). ~13 s of the cleanest clips fix the accent (it comes from the reference, not prompting) and keep the model shippable.
- Teacher: VoxCPM2 (openbmb/VoxCPM2) voice-clones that reference for every line (48 kHz → resampled).
- Corpus: Taiwan office / phone / GPS / transit register, covering diverse Mandarin, general + domain English, code-mix in varied positions, a large named-entity bank (TW + world places / roads / transit / companies / people / products), plus a rare-character + brand + email booster (the latest data lever).
- ASR gate: Breeze-ASR-25 (zh / mix CER) + Whisper-medium (en WER) keep only clips matching their text; proper-noun coverage clips are trusted unfiltered.
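The ASR gate amounts to a filter over synthesized clips. The sketch below assumes each clip already carries an error rate from the gate recognizer. The `Clip` record, the field names, and the 0.10 thresholds are all hypothetical, since the thresholds actually used are not stated here.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    lang: str               # "zh", "mix", or "en"
    error_rate: float       # CER for zh / mix, WER for en, from the gate ASR
    trusted: bool = False   # proper-noun coverage clips bypass the gate

# Hypothetical thresholds, chosen only for illustration.
MAX_ERROR = {"zh": 0.10, "mix": 0.10, "en": 0.10}

def asr_gate(clips):
    """Keep trusted clips plus those whose error rate is within the threshold."""
    return [c for c in clips if c.trusted or c.error_rate <= MAX_ERROR[c.lang]]

clips = [
    Clip("a.wav", "zh", 0.02),
    Clip("b.wav", "en", 0.40),                 # rejected: transcript drifts from text
    Clip("c.wav", "zh", 0.55, trusted=True),   # kept: trusted proper-noun clip
]
print([c.path for c in asr_gate(clips)])  # ['a.wav', 'c.wav']
```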
The three levers that matter most for a tiny model are phone-level alignment (espeak phoneme-CTC +
torchaudio.forced_align; sub-syllable boundaries are what separate intelligible speech from fluent babble),
broad coverage with diverse code-mix, and the teacher (a student's English is only as native as its
teacher's). Pipeline: teacher corpus → ASR gate → align → train vocoder → warm-start + train acoustic → export.
The 8 kHz on-device model warm-starts from the 24 kHz acoustic and adapts to 8 kHz; the trainer resamples
audio and rescales durations internally. Full commands and a one-shot scripts/rebuild_voice.sh (swap in your
own ~10 s reference clip) are in the repo.
Known characteristics & limitations
- 8 kHz band ceiling — telephone-band brightness (the 16 kHz variant in progress addresses it).
- Empty-rime syllables (是 / 十 / 日, the syllabic ㄭ) and isolated spelled letters (the leading "A" of a serial) are the fragile cases at this size: the frontend emits the right phones, but a ~4M acoustic renders them weakly. Cross-checking a robust (Breeze) vs strict (X-ASR) recognizer exposes this where a single CER number hides it.
- Phrase-initial bare vowels in ultra-short isolated inputs ("二月" alone) can garble; fine in normal sentences.
Credits & licenses
- Base / trainer: owensong/Inflect-Nano-v1 (Apache-2.0)
- Teacher: openbmb/VoxCPM2 · Reference voice: Mozilla Common Voice zh-TW (CC0 / public domain)
- Gate ASR: Breeze-ASR-25 (MediaTek Research) · Whisper-medium · Aligner: facebook/wav2vec2-lv-60-espeak-cv-ft + torchaudio.forced_align · Eval: sherpa-onnx X-ASR
This repository: Apache-2.0.