# MioVocoder: High-Resolution Aliasing-Free Neural Vocoder for High-Fidelity Speech Generation
MioVocoder is a high-resolution, aliasing-free neural vocoder designed for high-fidelity speech generation. It is a fine-tuned version of the Pupu-Vocoder (Small) from the Aliasing-Free Neural Audio Synthesis (AFGen) project.
## Overview
MioVocoder is specifically optimized to serve as the backend for MioCodec-25Hz. While the original Pupu-Vocoder is a versatile model, MioVocoder has been fine-tuned with a primary focus on enhancing reconstruction quality for Japanese speech. By leveraging a large-scale Japanese corpus alongside multilingual data at 44.1 kHz, it achieves robust, natural reconstruction across a wide range of Japanese speakers and speaking styles.
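As a rough illustration of what the stated rates imply: a 25 Hz codec driving a 44.1 kHz vocoder means each codec frame corresponds to 1,764 output samples. Note that the 1,764-sample hop is arithmetic inferred from the two rates above, not a documented internal parameter of MioVocoder.

```python
# Frame-rate bookkeeping for MioCodec-25Hz driving a 44.1 kHz vocoder.
# The hop size below is derived from the stated rates only; the actual
# internal upsampling schedule of MioVocoder is not documented here.
sample_rate = 44_100   # MioVocoder output rate (Hz)
frame_rate = 25        # MioCodec token rate (frames/s)

samples_per_frame = sample_rate // frame_rate
print(samples_per_frame)  # 1764 samples of audio per codec frame

def expected_num_samples(num_frames: int) -> int:
    """Waveform length produced from a given number of codec frames."""
    return num_frames * samples_per_frame
```

For example, a 10-second utterance encoded at 25 Hz yields 250 codec frames, which decode back to 441,000 samples at 44.1 kHz.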
### Key Features
- Aliasing-Free: Inherits the architecture of AFGen, the first work to achieve efficient aliasing-free upsampling-based audio generation.
- High-Resolution: Native support for 44.1 kHz sampling rate.
- Lightweight: Based on the "Small" architecture with only 15.2M parameters, making it fast and efficient for inference.
- Multilingual Expertise: Fine-tuned on a corpus of roughly 28,000 hours (including Japanese, English, and European languages) to ensure natural prosody and timbre.
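To see why aliasing-free upsampling matters, the sketch below contrasts plain zero-stuffing (which mirrors spectral "images" above the original Nyquist frequency) with windowed-sinc interpolation that suppresses them. This is a generic textbook illustration in NumPy, not AFGen's actual learned upsampling filters.

```python
import numpy as np

def zero_stuff(x, factor=2):
    # Insert zeros between samples; this mirrors the spectrum and
    # creates aliasing "images" above the original Nyquist frequency.
    y = np.zeros(len(x) * factor)
    y[::factor] = x
    return y

def antialiased_upsample(x, factor=2, taps=63):
    # Zero-stuff, then lowpass with a windowed-sinc filter cut off at
    # the original Nyquist frequency to suppress the spectral images.
    y = zero_stuff(x, factor)
    n = np.arange(taps) - (taps - 1) / 2
    h = np.sinc(n / factor) * np.hamming(taps)
    h *= factor / h.sum()  # restore unity passband gain after zero-stuffing
    return np.convolve(y, h, mode="same")

# 1 kHz tone at 22.05 kHz, upsampled to 44.1 kHz: the tone survives,
# while the 21.05 kHz image left by plain zero-stuffing is suppressed.
sr = 22050
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1000 * t)
y = antialiased_upsample(x)

spec = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(y), d=1 / (2 * sr))
tone = spec[np.argmin(np.abs(freqs - 1000))]
image = spec[np.argmin(np.abs(freqs - 21050))]
```

In a neural vocoder, the analogous filtering must happen inside every upsampling layer; AFGen's contribution is doing this efficiently end to end.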
## Model Specifications
| Property | Value |
|---|---|
| Architecture | Pupu-Vocoder (Small) |
| Parameters | 15.2M |
| Sampling Rate | 44.1 kHz |
| Base Model | spellbrush/AliasingFreeNeuralAudioSynthesis |
## Training Data
The model was fine-tuned on a large-scale multilingual corpus, with significant emphasis on Japanese high-fidelity speech data.
| Language | Approx. Hours | Dataset |
|---|---|---|
| Japanese | ~15,000h | Various public HF datasets |
| English | ~7,500h | Libriheavy-HQ, MLS-Sidon |
| German | ~1,950h | MLS-Sidon |
| Dutch | ~1,550h | MLS-Sidon |
| French | ~1,050h | MLS-Sidon |
| Spanish | ~900h | MLS-Sidon |
| Italian | ~240h | MLS-Sidon |
| Portuguese | ~160h | MLS-Sidon |
| Polish | ~100h | MLS-Sidon |
## Limitations
As MioVocoder is highly optimized for specific use cases, please note the following:
- Language Performance: Since the primary goal was to improve Japanese accuracy, reconstruction quality for other languages may be slightly worse than that of the original Pupu-Vocoder.
- Speech-Centric: The fine-tuning process utilized speech-only datasets. Unlike the base model, which may handle general audio or music, MioVocoder's performance on non-speech audio (e.g., music, singing, environmental noise) may be degraded.
## Usage
Since MioVocoder maintains the original Pupu-Vocoder architecture, it can be used with the official codebase or via the miocodec helper library.
### Integration with MioCodec
```python
from miocodec import load_vocoder

vocoder = load_vocoder(
    backend="pupu",
    hf_repo="Aratako/MioVocoder",
    hf_config_path="config.json",
    hf_checkpoint_path="model.safetensors",
).cuda()
```
## Acknowledgements
- Original Architecture & Paper: Aliasing-Free Neural Audio Synthesis (AFGen).
- Base Weights: Provided by the Spellbrush team.
## Citation
If you use MioVocoder in your research, please cite both the original paper and this model checkpoint:
Original Architecture (AFGen):
```bibtex
@article{afgen,
  title   = {Aliasing Free Neural Audio Synthesis},
  author  = {Yicheng Gu and Junan Zhang and Chaoren Wang and Jerry Li and Zhizheng Wu and Lauri Juvela},
  journal = {arXiv:2512.20211},
  year    = {2025},
}
```
MioVocoder Checkpoint:
```bibtex
@misc{miovocoder,
  author       = {Chihiro Arata},
  title        = {MioVocoder: High-Resolution Aliasing-Free Neural Vocoder for Japanese Speech},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Aratako/MioVocoder}},
}
```