Ganda NeMo (Experimental): Luganda TTS

A research-preview Luganda text-to-speech system built on NVIDIA NeMo, targeting on-device and edge deployment on mobile phones. Ships as two NeMo checkpoints: a FastPitch acoustic model and a HiFi-GAN vocoder.

Status: experimental. Released as-is for research and evaluation.

Model summary

Component        File                     Architecture                                  Size
Acoustic model   luganda_fastpitch.nemo   FastPitch (FFTransformer, 6 layers, d=384)    187 MB
Vocoder          luganda_hifigan.nemo     HiFi-GAN v1 (upsample [8,8,2,2], 512 ch)      339 MB
  • Language: Luganda (ISO 639-1 lg)
  • Sample rate: 22,050 Hz
  • Mel config: 80 bins, n_fft=1024, hop=256, win=1024, range 0–8000 Hz
  • License: Apache-2.0

Intended use

  • Research on low-resource African-language TTS.
  • Prototyping on-device / edge Luganda voice output on mobile (primary deployment target).
  • Benchmarking and comparison against other Luganda / Bantu-language TTS systems.

Training data

Trained on the Luganda subset of the Sunbird SALT corpus, a mixed male/female multi-speaker dataset. Approximately 2,380 clips (~2.69 hours) were used; see the manifest sketch below.
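
NeMo TTS training reads a JSON-lines manifest with audio_filepath, duration, and text fields. A minimal sketch of building one is below; the clip paths and metadata are hypothetical placeholders, not the actual SALT layout.

import json

# Hypothetical clips; real entries come from the SALT Luganda subset.
clips = [
    {"audio_filepath": "wavs/clip_0001.wav", "duration": 4.1, "text": "Oli otya?"},
    {"audio_filepath": "wavs/clip_0002.wav", "duration": 3.7, "text": "Webale nnyo."},
]

# One JSON object per line, as NeMo's dataset loaders expect.
with open("train_manifest.json", "w", encoding="utf-8") as f:
    for clip in clips:
        f.write(json.dumps(clip, ensure_ascii=False) + "\n")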

Model architecture & training

FastPitch (acoustic model)

  • FFTransformer encoder/decoder: 6 layers, 1 head, d_model=384, d_inner=1536, dropout 0.1
  • Duration + pitch predictors: 2-layer temporal predictors, filter size 256
  • Learned alignment (learn_alignment: true), bin-loss warmup 100 epochs
  • Pitch stats (z-score): ΞΌ=190.65 Hz, Οƒ=51.25 Hz
  • Optimizer: Adam, lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-6, batch size 24
  • Training steps: ~20,000
  • NeMo version at training time: 1.8.0rc0

HiFi-GAN (vocoder)

  • Generator: upsample rates [8, 8, 2, 2], kernel sizes [16, 16, 4, 4], initial channels 512
  • Resblock type 1, kernel sizes [3, 7, 11], dilations [[1,3,5], [1,3,5], [1,3,5]]
  • Optimizer: AdamW, lr=2e-4, betas=(0.8, 0.99), batch size 16
  • Training steps: ~20,000
  • NeMo version at training time: 1.23.0

Limitations

Text frontend: English G2P. Luganda does not yet have a mature open-source grapheme-to-phoneme (G2P) resource, so FastPitch here uses NeMo's EnglishPhonemesTokenizer with EnglishG2p (CMUdict) as the text frontend; this is workable because Luganda's spelling is largely phonemic. Building a proper Luganda phonemizer is an obvious next step (a rough sketch follows), and contributions are welcome.
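
Given the largely phonemic orthography, one plausible starting point is a grapheme-based frontend that treats each letter and common digraph as its own symbol. The sketch below is entirely hypothetical, is not part of this release, and its digraph list needs review by a Luganda speaker.

# Hypothetical grapheme-level frontend sketch for Luganda.
# The digraph list is illustrative and requires linguistic review.
DIGRAPHS = ["ny", "ng'"]

def graphemize(text: str) -> list[str]:
    """Split text into digraph-aware grapheme tokens."""
    text = text.lower()
    tokens, i = [], 0
    while i < len(text):
        for d in DIGRAPHS:
            if text.startswith(d, i):
                tokens.append(d)
                i += len(d)
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

print(graphemize("Nnyabo"))  # ['n', 'ny', 'a', 'b', 'o']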

Single output voice. The model is trained on multi-speaker data but emits one averaged voice. No speaker conditioning is available at inference.

Low-resource training. ~2.7 hours of speech is small for TTS. Expect audible artifacts, uneven prosody on long sentences, and reduced robustness on numerals, code-switched English, and out-of-distribution domains.

Text normalization. The packaged text normalizer is nemo_text_processing.text_normalization.Normalizer with lang: en. Non-trivial Luganda text normalization (e.g., number reading, abbreviations) is not handled, so pre-normalize input in your pipeline (a rough sketch follows).
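
A minimal pre-normalization sketch. The abbreviation table is a hypothetical placeholder (verify any Luganda wording with a speaker), and real number expansion would need Luganda-specific verbalization rules that this release does not include.

import re

# Hypothetical abbreviation expansions; illustrative only.
ABBREVIATIONS = {"Dr.": "omusawo"}

def pre_normalize(text: str) -> str:
    """Expand known abbreviations and flag digits the model cannot read."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    if re.search(r"\d", text):
        # The packaged normalizer is English-only; spell out numbers upstream.
        raise ValueError("Digits found: expand numbers to Luganda words first.")
    return text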

Usage

from nemo.collections.tts.models import FastPitchModel, HifiGanModel
import soundfile as sf

# Restore both checkpoints and switch to inference mode.
fastpitch = FastPitchModel.restore_from("luganda_fastpitch.nemo")
hifigan = HifiGanModel.restore_from("luganda_hifigan.nemo")
fastpitch.eval()
hifigan.eval()

text = "Oli otya?"
parsed = fastpitch.parse(text)                               # text -> token IDs
spectrogram = fastpitch.generate_spectrogram(tokens=parsed)  # tokens -> mel
audio = hifigan.convert_spectrogram_to_audio(spec=spectrogram)

# Detach before converting to NumPy; write at the model's 22,050 Hz rate.
sf.write("out.wav", audio.detach().to("cpu").numpy().squeeze(), 22050)

Loading from the Hub

from huggingface_hub import hf_hub_download
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

fp = hf_hub_download("Cal3bd3v/ganda-nemo-experimental", "luganda_fastpitch.nemo")
hg = hf_hub_download("Cal3bd3v/ganda-nemo-experimental", "luganda_hifigan.nemo")

fastpitch = FastPitchModel.restore_from(fp)
hifigan = HifiGanModel.restore_from(hg)

Edge / on-device deployment

The primary deployment target is mobile. Suggested paths:

  • Export FastPitch and HiFi-GAN to ONNX via NeMo's exporter, then run with ONNX Runtime Mobile / ExecuTorch.
  • HiFi-GAN dominates runtime; a distilled / smaller vocoder (e.g., HiFi-GAN v3 or iSTFTNet) is recommended for phone-class CPUs.
  • Streaming is possible by chunking mel output and vocoding incrementally; latency budget depends on device.

Quantization, pruning, and distillation have not been applied in this release.
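
On the streaming point above, a naive sketch that vocodes the mel in fixed-size chunks. The chunk size is a hypothetical latency budget, and plain concatenation can click at chunk boundaries; production code should overlap chunks by the vocoder's receptive field and crop the overlap.

import torch

CHUNK_FRAMES = 100  # hypothetical budget: 100 frames * 256 hop ~= 1.16 s audio

# `spectrogram` is [batch, n_mels, frames] from FastPitch.
parts = []
for chunk in spectrogram.split(CHUNK_FRAMES, dim=2):
    parts.append(hifigan.convert_spectrogram_to_audio(spec=chunk))
audio = torch.cat(parts, dim=-1)  # naive join; expect boundary artifacts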

Ethical considerations

  • The speaker(s) in the Sunbird SALT corpus consented to the original dataset's terms; downstream use must respect those terms.
  • Because the model emits a synthesized Luganda voice, downstream applications should disclose synthetic speech to end users where appropriate (accessibility, consent, anti-impersonation).

Attribution

Built with NVIDIA NeMo; trained on the Sunbird SALT Luganda corpus.
License

Apache-2.0. Use of the model must also comply with the license of the Sunbird SALT dataset used for training.
