Phonological Tokenizer: Prosody-Aware Phonetic Token via Multi-Objective Fine-Tuning with Differentiable K-Means
Paper: arXiv:2601.19781
Phonological Tokenizer is a single-codebook speech tokenizer that encodes both linguistic and prosodic information. Its tokens have properties intermediate between those of phonetic tokens and acoustic tokens.
The tokenizer is obtained by taking phonetic tokens derived from an SSL model (wavlm-large) and fine-tuning them with differentiable k-means under a multi-task objective that combines ASR and speech reconstruction. This repository releases the fine-tuned SSL model and the cluster centroids, along with simple inference code.
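As a rough illustration of how frame-level SSL features and a set of cluster centroids can yield a single-codebook token sequence, the sketch below assigns each frame to its nearest centroid. It also shows a softmax relaxation of that assignment, which is one common way to make k-means differentiable; whether the paper uses this exact formulation is an assumption. The function names `assign_tokens` and `soft_assign`, the `temperature` parameter, and the toy arrays are hypothetical and are not taken from this repository's `inference.py`.

```python
import numpy as np


def squared_distances(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Squared Euclidean distance between every frame and every centroid.

    features:  (num_frames, dim) array, e.g. SSL hidden states (assumed shape)
    centroids: (num_clusters, dim) array of k-means cluster centers
    """
    f_norm = (features ** 2).sum(axis=1, keepdims=True)  # (num_frames, 1)
    c_norm = (centroids ** 2).sum(axis=1)                # (num_clusters,)
    return f_norm - 2.0 * features @ centroids.T + c_norm


def assign_tokens(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Hard assignment: one token id (nearest centroid index) per frame."""
    return squared_distances(features, centroids).argmin(axis=1)


def soft_assign(features: np.ndarray, centroids: np.ndarray,
                temperature: float = 1.0) -> np.ndarray:
    """Soft assignment: softmax over negative distances (a relaxation, assumed)."""
    logits = -squared_distances(features, centroids) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


# Toy example: three 2-dimensional "frames" and two centroids.
features = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.2]])
centroids = np.array([[0.0, 0.0], [1.0, 1.0]])
print(assign_tokens(features, centroids))  # -> [0 1 1]
```

At inference time only the hard assignment is needed. A soft assignment of this kind lets gradients from downstream losses reach the encoder and the centroids during fine-tuning.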
For more details, please refer to our paper.
git clone https://huggingface.co/Sony/Phonological-Tokenizer
cd Phonological-Tokenizer
pip install -r requirements.txt
python inference.py [audio file path]
This model is licensed under CC BY-SA 3.0. See the LICENSE file for details.
@inproceedings{onda2026phonological,
title={Phonological Tokenizer: Prosody-Aware Phonetic Token via Multi-Objective Fine-Tuning with Differentiable K-Means},
author={Onda, Kentaro and Futami, Hayato and Kashiwagi, Yosuke and Tsunoo, Emiru and Watanabe, Shinji},
booktitle={ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={17817-17821},
year={2026},
organization={IEEE},
doi={10.1109/ICASSP55912.2026.11464405}
}
ondakentaro[at]gavo.t.u-tokyo.ac.jp; hayato.Futami[at]sony.com; Yosuke.Kashiwagi[at]sony.com