
MioVocoder: High-Resolution Aliasing-Free Neural Vocoder for High-Fidelity Speech Generation

MioVocoder is a high-resolution, aliasing-free neural vocoder designed for high-fidelity speech generation. It is a fine-tuned version of the Pupu-Vocoder (Small) from the Aliasing-Free Neural Audio Synthesis (AFGen) project.

🌟 Overview

MioVocoder is specifically optimized to serve as the backend for MioCodec-25Hz. While the original Pupu-Vocoder is a versatile model, MioVocoder has been fine-tuned with a primary focus on reconstruction quality for Japanese speech. By leveraging a large-scale Japanese corpus alongside multilingual data at 44.1 kHz, it achieves high robustness and naturalness across a wide range of Japanese speaker characteristics.

Key Features

  • Aliasing-Free: Inherits the architecture of AFGen, the first work to achieve efficient aliasing-free upsampling-based audio generation.
  • High-Resolution: Native support for 44.1 kHz sampling rate.
  • Lightweight: Based on the "Small" architecture with only 15.2M parameters, making it fast and efficient for inference.
  • Multilingual Expertise: Fine-tuned on roughly 28,000 hours of speech spanning Japanese, English, and several European languages (see the training data table below) to ensure natural prosody and timbre.
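The aliasing problem that motivates this design can be illustrated with a toy experiment (a sketch for intuition, not taken from the AFGen paper): naive zero-insertion upsampling mirrors a tone's spectrum into a high-frequency image, which an upsampling-based generator must suppress to stay artifact-free.

```python
import math
import cmath

def dft_mag(x):
    """Magnitude of the DFT of a real sequence (naive O(N^2))."""
    N = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N)
                    for n in range(N))) for k in range(N)]

# A pure tone: 2 cycles over a 16-sample window.
N, f = 16, 2
x = [math.cos(2 * math.pi * f * n / N) for n in range(N)]

# Naive 2x upsampling by zero insertion (no anti-imaging filter).
y = []
for s in x:
    y += [s, 0.0]

mags = dft_mag(y)
peaks = [k for k, m in enumerate(mags) if m > 1.0]
print(peaks)  # [2, 14, 18, 30]: bins 2/30 are the tone, 14/18 are spectral images
```

A properly filtered upsampler attenuates the image bins; aliasing-free architectures aim to avoid introducing such images in their learned upsampling layers.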

πŸ“Š Model Specifications

| Property | Value |
|---|---|
| Architecture | Pupu-Vocoder (Small) |
| Parameters | 15.2M |
| Sampling Rate | 44.1 kHz |
| Base Model | spellbrush/AliasingFreeNeuralAudioSynthesis |
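As a back-of-the-envelope sanity check (not an official figure), the on-disk size of an F32 checkpoint follows directly from the parameter count:

```python
params = 15_200_000          # 15.2M parameters (see table above)
bytes_per_param = 4          # float32
size_mb = params * bytes_per_param / 1e6
print(f"{size_mb:.1f} MB")   # ~60.8 MB expected for model.safetensors, plus small header overhead
```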

πŸ“š Training Data

The model was fine-tuned on a large-scale multilingual corpus, with significant emphasis on Japanese high-fidelity speech data.

| Language | Approx. Hours | Dataset |
|---|---|---|
| Japanese | ~15,000h | Various public HF datasets |
| English | ~7,500h | Libriheavy-HQ, MLS-Sidon |
| German | ~1,950h | MLS-Sidon |
| Dutch | ~1,550h | MLS-Sidon |
| French | ~1,050h | MLS-Sidon |
| Spanish | ~900h | MLS-Sidon |
| Italian | ~240h | MLS-Sidon |
| Portuguese | ~160h | MLS-Sidon |
| Polish | ~100h | MLS-Sidon |

⚠️ Limitations

As MioVocoder is highly optimized for specific use cases, please note the following:

  • Language Performance: Since the primary goal was to improve Japanese accuracy, reconstruction quality for other languages may be slightly inferior to that of the original Pupu-Vocoder.
  • Speech-Centric: The fine-tuning process utilized speech-only datasets. Unlike the base model, which may handle general audio or music, MioVocoder’s performance on non-speech audio (e.g., music, singing, environmental noise) may be degraded.

πŸš€ Usage

Since MioVocoder maintains the original Pupu-Vocoder architecture, it can be used with the official codebase or via the miocodec helper library.

Integration with MioCodec

```python
from miocodec import load_vocoder

# Load the MioVocoder checkpoint as the Pupu-Vocoder backend and move it to GPU.
vocoder = load_vocoder(
    backend="pupu",
    hf_repo="Aratako/MioVocoder",
    hf_config_path="config.json",
    hf_checkpoint_path="model.safetensors",
).cuda()
```
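When pairing with MioCodec-25Hz, each 25 Hz feature frame must expand to a fixed number of 44.1 kHz output samples. The frame rate and sample rate below come from this card; the internal upsampling factors of the vocoder are not specified here, so only the end-to-end bookkeeping is sketched:

```python
sample_rate = 44_100   # Hz, vocoder output (from the spec table)
frame_rate = 25        # Hz, MioCodec-25Hz feature rate

hop_samples = sample_rate // frame_rate
print(hop_samples)     # 1764 output samples generated per input frame

# e.g. a 3-second utterance:
frames = 3 * frame_rate             # 75 feature frames
samples = frames * hop_samples      # 132300 samples
print(samples / sample_rate)        # exactly 3.0 seconds at 44.1 kHz
```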

πŸ“œ Acknowledgements

πŸ–ŠοΈ Citation

If you use MioVocoder in your research, please cite both the original paper and this model checkpoint:

Original Architecture (AFGen):

@article{afgen,
  title        = {Aliasing Free Neural Audio Synthesis},
  author       = {Yicheng Gu and Junan Zhang and Chaoren Wang and Jerry Li and Zhizheng Wu and Lauri Juvela},
  year         = {2025},
  journal      = {arXiv:2512.20211},
}

MioVocoder Checkpoint:

@misc{miovocoder,
  author = {Chihiro Arata},
  title = {MioVocoder: High-Resolution Aliasing-Free Neural Vocoder for Japanese Speech},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/MioVocoder}}
}