Ganda NeMo (Experimental): Luganda TTS

A research-preview Luganda text-to-speech system built on NVIDIA NeMo, targeting on-device and edge deployment on mobile phones. Ships as two NeMo checkpoints: a FastPitch acoustic model and a HiFi-GAN vocoder.

Status: experimental. Released as-is for research and evaluation.

Model summary

Component        File                     Architecture                                  Size
Acoustic model   luganda_fastpitch.nemo   FastPitch (FFTransformer, 6 layers, d=384)    187 MB
Vocoder          luganda_hifigan.nemo     HiFi-GAN v1 (upsample [8,8,2,2], 512 ch)      339 MB
  • Language: Luganda (ISO 639-1 lg)
  • Sample rate: 22,050 Hz
  • Mel config: 80 bins, n_fft=1024, hop=256, win=1024, range 0–8000 Hz
  • License: Apache-2.0

Intended use

  • Research on low-resource African-language TTS.
  • Prototyping on-device / edge Luganda voice output on mobile (primary deployment target).
  • Benchmarking and comparison against other Luganda / Bantu-language TTS systems.

Training data

Trained on the Luganda subset of the Sunbird SALT corpus, a mixed male/female multi-speaker dataset. Approximately 2,380 clips (~2.69 hours) were used; see the manifest sketch below.
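
NeMo TTS training reads a JSON-lines manifest with audio_filepath, duration, and text fields. A minimal sketch of building one is below; the clip paths and metadata are hypothetical placeholders, not the actual SALT layout.

import json

# Hypothetical clips; real entries come from the SALT Luganda subset.
clips = [
    {"audio_filepath": "wavs/clip_0001.wav", "duration": 4.1, "text": "Oli otya?"},
    {"audio_filepath": "wavs/clip_0002.wav", "duration": 3.7, "text": "Webale nnyo."},
]

# One JSON object per line, as NeMo's dataset loaders expect.
with open("train_manifest.json", "w", encoding="utf-8") as f:
    for clip in clips:
        f.write(json.dumps(clip, ensure_ascii=False) + "\n")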

Model architecture & training

FastPitch (acoustic model)

  • FFTransformer encoder/decoder: 6 layers, 1 head, d_model=384, d_inner=1536, dropout 0.1
  • Duration + pitch predictors: 2-layer temporal predictors, filter size 256
  • Learned alignment (learn_alignment: true), bin-loss warmup 100 epochs
  • Pitch stats (z-score): ΞΌ=190.65 Hz, Οƒ=51.25 Hz
  • Optimizer: Adam, lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-6, batch size 24
  • Training steps: ~20,000
  • NeMo version at training time: 1.8.0rc0

HiFi-GAN (vocoder)

  • Generator: upsample rates [8, 8, 2, 2], kernel sizes [16, 16, 4, 4], initial channels 512
  • Resblock type 1, kernel sizes [3, 7, 11], dilations [[1,3,5], [1,3,5], [1,3,5]]
  • Optimizer: AdamW, lr=2e-4, betas=(0.8, 0.99), batch size 16
  • Training steps: ~20,000
  • NeMo version at training time: 1.23.0

Limitations

Text frontend: English G2P. Luganda does not yet have a mature open-source grapheme-to-phoneme (G2P) resource, so FastPitch here uses NeMo's EnglishPhonemesTokenizer with EnglishG2p (CMUdict) as the text frontend; this is workable because Luganda's spelling is largely phonemic. Building a proper Luganda phonemizer is an obvious next step (a rough sketch follows), and contributions are welcome.
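
Given the largely phonemic orthography, one plausible starting point is a grapheme-based frontend that treats each letter and common digraph as its own symbol. The sketch below is entirely hypothetical, is not part of this release, and its digraph list needs review by a Luganda speaker.

# Hypothetical grapheme-level frontend sketch for Luganda.
# The digraph list is illustrative and requires linguistic review.
DIGRAPHS = ["ny", "ng'"]

def graphemize(text: str) -> list[str]:
    """Split text into digraph-aware grapheme tokens."""
    text = text.lower()
    tokens, i = [], 0
    while i < len(text):
        for d in DIGRAPHS:
            if text.startswith(d, i):
                tokens.append(d)
                i += len(d)
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

print(graphemize("Nnyabo"))  # ['n', 'ny', 'a', 'b', 'o']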

Single output voice. The model is trained on multi-speaker data but emits one averaged voice. No speaker conditioning is available at inference.

Low-resource training. ~2.7 hours of speech is small for TTS. Expect audible artifacts, uneven prosody on long sentences, and reduced robustness on numerals, code-switched English, and out-of-distribution domains.

Text normalization. The packaged text normalizer is nemo_text_processing.text_normalization.Normalizer with lang: en. Non-trivial Luganda text normalization (e.g., number reading, abbreviations) is not handled, so pre-normalize input in your pipeline (a rough sketch follows).
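
A minimal pre-normalization sketch. The abbreviation table is a hypothetical placeholder (verify any Luganda wording with a speaker), and real number expansion would need Luganda-specific verbalization rules that this release does not include.

import re

# Hypothetical abbreviation expansions; illustrative only.
ABBREVIATIONS = {"Dr.": "omusawo"}

def pre_normalize(text: str) -> str:
    """Expand known abbreviations and flag digits the model cannot read."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    if re.search(r"\d", text):
        # The packaged normalizer is English-only; spell out numbers upstream.
        raise ValueError("Digits found: expand numbers to Luganda words first.")
    return text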

Usage

from nemo.collections.tts.models import FastPitchModel, HifiGanModel
import soundfile as sf

# Restore both checkpoints and switch to inference mode.
fastpitch = FastPitchModel.restore_from("luganda_fastpitch.nemo")
hifigan = HifiGanModel.restore_from("luganda_hifigan.nemo")
fastpitch.eval()
hifigan.eval()

text = "Oli otya?"
parsed = fastpitch.parse(text)                               # text -> token IDs
spectrogram = fastpitch.generate_spectrogram(tokens=parsed)  # tokens -> mel
audio = hifigan.convert_spectrogram_to_audio(spec=spectrogram)

# Detach before converting to NumPy; write at the model's 22,050 Hz rate.
sf.write("out.wav", audio.detach().to("cpu").numpy().squeeze(), 22050)

Loading from the Hub

from huggingface_hub import hf_hub_download
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

fp = hf_hub_download("Cal3bd3v/ganda-nemo-experimental", "luganda_fastpitch.nemo")
hg = hf_hub_download("Cal3bd3v/ganda-nemo-experimental", "luganda_hifigan.nemo")

fastpitch = FastPitchModel.restore_from(fp)
hifigan = HifiGanModel.restore_from(hg)

Edge / on-device deployment

The primary deployment target is mobile. Suggested paths:

  • Export FastPitch and HiFi-GAN to ONNX via NeMo's exporter, then run with ONNX Runtime Mobile / ExecuTorch.
  • HiFi-GAN dominates runtime; a distilled / smaller vocoder (e.g., HiFi-GAN v3 or iSTFTNet) is recommended for phone-class CPUs.
  • Streaming is possible by chunking mel output and vocoding incrementally; latency budget depends on device.

Quantization, pruning, and distillation have not been applied in this release.
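
On the streaming point above, a naive sketch that vocodes the mel in fixed-size chunks. The chunk size is a hypothetical latency budget, and plain concatenation can click at chunk boundaries; production code should overlap chunks by the vocoder's receptive field and crop the overlap.

import torch

CHUNK_FRAMES = 100  # hypothetical budget: 100 frames * 256 hop ~= 1.16 s audio

# `spectrogram` is [batch, n_mels, frames] from FastPitch.
parts = []
for chunk in spectrogram.split(CHUNK_FRAMES, dim=2):
    parts.append(hifigan.convert_spectrogram_to_audio(spec=chunk))
audio = torch.cat(parts, dim=-1)  # naive join; expect boundary artifacts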

Ethical considerations

  • The speaker(s) in the Sunbird SALT corpus consented to the original dataset's terms; downstream use must respect those terms.
  • Because the model emits a synthesized Luganda voice, downstream applications should disclose synthetic speech to end users where appropriate (accessibility, consent, anti-impersonation).

Attribution

Built with NVIDIA NeMo; trained on the Sunbird SALT Luganda corpus.
License

Apache-2.0. Use of the model must also comply with the license of the Sunbird SALT dataset used for training.
