# MioVocoder: High-Resolution Aliasing-Free Neural Vocoder for High-Fidelity Speech Generation
MioVocoder is a high-resolution, aliasing-free neural vocoder designed for high-fidelity speech generation. It is a fine-tuned version of the Pupu-Vocoder (Small) from the Aliasing-Free Neural Audio Synthesis (AFGen) project.
## Overview
MioVocoder is specifically optimized to serve as the backend for MioCodec-25Hz. While the original Pupu-Vocoder is a versatile model, MioVocoder has been fine-tuned with a primary focus on enhancing reconstruction quality for Japanese speech. By leveraging a large-scale Japanese corpus alongside multilingual data at 44.1 kHz, it achieves robust, natural reconstruction across a wide range of Japanese speakers and speaking styles.
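As a rough illustration of what the stated rates imply: a 25 Hz codec driving a 44.1 kHz vocoder means each codec frame corresponds to 1,764 output samples. Note that the 1,764-sample hop is arithmetic inferred from the two rates above, not a documented internal parameter of MioVocoder.

```python
# Frame-rate bookkeeping for MioCodec-25Hz driving a 44.1 kHz vocoder.
# The hop size below is derived from the stated rates only; the actual
# internal upsampling schedule of MioVocoder is not documented here.
sample_rate = 44_100   # MioVocoder output rate (Hz)
frame_rate = 25        # MioCodec token rate (frames/s)

samples_per_frame = sample_rate // frame_rate
print(samples_per_frame)  # 1764 samples of audio per codec frame

def expected_num_samples(num_frames: int) -> int:
    """Waveform length produced from a given number of codec frames."""
    return num_frames * samples_per_frame
```

For example, a 10-second utterance encoded at 25 Hz yields 250 codec frames, which decode back to 441,000 samples at 44.1 kHz.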
### Key Features
- Aliasing-Free: Inherits the architecture of AFGen, the first work to achieve efficient aliasing-free upsampling-based audio generation.
- High-Resolution: Native support for 44.1 kHz sampling rate.
- Lightweight: Based on the "Small" architecture with only 15.2M parameters, making it fast and efficient for inference.
- Multilingual Expertise: Fine-tuned on a corpus of roughly 28,000 hours (including Japanese, English, and European languages) to ensure natural prosody and timbre.
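To see why aliasing-free upsampling matters, the sketch below contrasts plain zero-stuffing (which mirrors spectral "images" above the original Nyquist frequency) with windowed-sinc interpolation that suppresses them. This is a generic textbook illustration in NumPy, not AFGen's actual learned upsampling filters.

```python
import numpy as np

def zero_stuff(x, factor=2):
    # Insert zeros between samples; this mirrors the spectrum and
    # creates aliasing "images" above the original Nyquist frequency.
    y = np.zeros(len(x) * factor)
    y[::factor] = x
    return y

def antialiased_upsample(x, factor=2, taps=63):
    # Zero-stuff, then lowpass with a windowed-sinc filter cut off at
    # the original Nyquist frequency to suppress the spectral images.
    y = zero_stuff(x, factor)
    n = np.arange(taps) - (taps - 1) / 2
    h = np.sinc(n / factor) * np.hamming(taps)
    h *= factor / h.sum()  # restore unity passband gain after zero-stuffing
    return np.convolve(y, h, mode="same")

# 1 kHz tone at 22.05 kHz, upsampled to 44.1 kHz: the tone survives,
# while the 21.05 kHz image left by plain zero-stuffing is suppressed.
sr = 22050
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1000 * t)
y = antialiased_upsample(x)

spec = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(y), d=1 / (2 * sr))
tone = spec[np.argmin(np.abs(freqs - 1000))]
image = spec[np.argmin(np.abs(freqs - 21050))]
```

In a neural vocoder, the analogous filtering must happen inside every upsampling layer; AFGen's contribution is doing this efficiently end to end.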
## Model Specifications
| Property | Value |
|---|---|
| Architecture | Pupu-Vocoder (Small) |
| Parameters | 15.2M |
| Sampling Rate | 44.1 kHz |
| Base Model | spellbrush/AliasingFreeNeuralAudioSynthesis |
## Training Data
The model was fine-tuned on a large-scale multilingual corpus, with significant emphasis on Japanese high-fidelity speech data.
| Language | Approx. Hours | Dataset |
|---|---|---|
| Japanese | ~15,000h | Various public HF datasets |
| English | ~7,500h | Libriheavy-HQ, MLS-Sidon |
| German | ~1,950h | MLS-Sidon |
| Dutch | ~1,550h | MLS-Sidon |
| French | ~1,050h | MLS-Sidon |
| Spanish | ~900h | MLS-Sidon |
| Italian | ~240h | MLS-Sidon |
| Portuguese | ~160h | MLS-Sidon |
| Polish | ~100h | MLS-Sidon |
## Limitations
As MioVocoder is highly optimized for specific use cases, please note the following:
- Language Performance: Since the primary goal was to improve Japanese accuracy, reconstruction quality for other languages may be slightly worse than that of the original Pupu-Vocoder.
- Speech-Centric: The fine-tuning process utilized speech-only datasets. Unlike the base model, which may handle general audio or music, MioVocoder's performance on non-speech audio (e.g., music, singing, environmental noise) may be degraded.
## Usage
Since MioVocoder maintains the original Pupu-Vocoder architecture, it can be used with the official codebase or via the miocodec helper library.
### Integration with MioCodec
```python
from miocodec import load_vocoder

vocoder = load_vocoder(
    backend="pupu",
    hf_repo="Aratako/MioVocoder",
    hf_config_path="config.json",
    hf_checkpoint_path="model.safetensors",
).cuda()
```
## Acknowledgements
- Original Architecture & Paper: Aliasing-Free Neural Audio Synthesis (AFGen).
- Base Weights: Provided by the Spellbrush team.
## Citation
If you use MioVocoder in your research, please cite both the original paper and this model checkpoint:
Original Architecture (AFGen):
```bibtex
@article{afgen,
  title   = {Aliasing Free Neural Audio Synthesis},
  author  = {Yicheng Gu and Junan Zhang and Chaoren Wang and Jerry Li and Zhizheng Wu and Lauri Juvela},
  journal = {arXiv:2512.20211},
  year    = {2025},
}
```
MioVocoder Checkpoint:
```bibtex
@misc{miovocoder,
  author       = {Chihiro Arata},
  title        = {MioVocoder: High-Resolution Aliasing-Free Neural Vocoder for Japanese Speech},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Aratako/MioVocoder}},
}
```