
Phonological Tokenizer: Prosody-Aware Phonetic Token via Multi-Objective Fine-Tuning with Differentiable K-Means

[paper] [demo]

[Figure: architecture overview]

Phonological Tokenizer is a single-codebook speech tokenizer that encodes both linguistic and prosodic information. Its tokens fall between phonetic tokens and acoustic tokens in their properties.

The tokenizer is obtained by fine-tuning phonetic tokens derived from an SSL model (WavLM-large) with differentiable k-means, trained jointly on two objectives: ASR and speech reconstruction. This repository releases the fine-tuned SSL model and the cluster centroids, along with simple inference code.
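As a rough illustration of the idea, the sketch below contrasts ordinary k-means assignment (nearest centroid, not differentiable) with a softmax relaxation over negative squared distances, which is one common way to make the assignment smooth enough for gradient-based fine-tuning. It is not the repository's implementation, and the paper's exact formulation may differ. The function names, the temperature parameter, and the toy 2-D values are all assumptions made for this example; real inputs would be frame-level features from the SSL model and the released centroids.

```python
import math


def squared_distances(frame, centroids):
    """Squared Euclidean distance from one feature frame to every centroid."""
    return [
        sum((x - c) ** 2 for x, c in zip(frame, centroid))
        for centroid in centroids
    ]


def hard_tokens(frames, centroids):
    """Ordinary k-means assignment: index of the nearest centroid per frame."""
    tokens = []
    for frame in frames:
        dists = squared_distances(frame, centroids)
        tokens.append(min(range(len(dists)), key=dists.__getitem__))
    return tokens


def soft_assignments(frames, centroids, temperature=1.0):
    """Softmax over negative squared distances, a smooth stand-in for argmin."""
    result = []
    for frame in frames:
        logits = [-d / temperature for d in squared_distances(frame, centroids)]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(logit - peak) for logit in logits]
        total = sum(weights)
        result.append([w / total for w in weights])
    return result


# Toy 2-D data (hypothetical values, not real WavLM features or centroids).
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [3.8, 4.1]]

tokens = hard_tokens(frames, centroids)  # one token ID per frame: [0, 1, 2]
probs = soft_assignments(frames, centroids, temperature=0.5)
```

Lowering the temperature pushes each row of `probs` toward a one-hot vector, so the soft assignment approaches the hard token IDs while remaining differentiable with respect to both the features and the centroids.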

For more details, please refer to our paper.

Usage

git clone https://huggingface.co/Sony/Phonological-Tokenizer
cd Phonological-Tokenizer

pip install -r requirements.txt

python inference.py [audio file path]
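
To tokenize several files, the command above can be wrapped in a short script. This is only a sketch under stated assumptions: it must be pointed at the cloned repository directory, it relies on inference.py accepting a single audio path as shown above, and it captures whatever the script prints without interpreting it, since the output format is not described here. The helper names `build_command` and `tokenize_files` are hypothetical and not part of the repository.

```python
import subprocess
import sys
from pathlib import Path


def build_command(audio_path):
    """Build the documented command line for one audio file."""
    # Use an absolute path so the file is found even when the
    # subprocess runs from a different working directory.
    return [sys.executable, "inference.py", str(Path(audio_path).resolve())]


def tokenize_files(audio_paths, repo_dir="."):
    """Run inference.py once per file and collect its raw stdout."""
    outputs = {}
    for path in audio_paths:
        completed = subprocess.run(
            build_command(path),
            cwd=repo_dir,          # directory containing inference.py
            capture_output=True,
            text=True,
            check=True,            # raise if the script exits with an error
        )
        outputs[str(path)] = completed.stdout
    return outputs


# Example call (file names are placeholders):
# results = tokenize_files(["a.wav", "b.wav"], repo_dir="Phonological-Tokenizer")
```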

License

This model is licensed under CC BY-SA 3.0. See the LICENSE file for details.

Citation

@inproceedings{onda2026phonological,
  title={Phonological Tokenizer: Prosody-Aware Phonetic Token via Multi-Objective Fine-Tuning with Differentiable K-Means},
  author={Onda, Kentaro and Futami, Hayato and Kashiwagi, Yosuke and Tsunoo, Emiru and Watanabe, Shinji},
  booktitle={ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={17817-17821},
  year={2026},
  organization={IEEE},
  doi={10.1109/ICASSP55912.2026.11464405}
}

Reference

  • Original SSL model: WavLM-large (CC BY-SA 3.0)
  • Training data:
    • VCTK (CC BY 4.0)
    • LibriSpeech (CC BY 4.0; used a 30h random subset of train-clean-100 for centroid initialization)

Contact

ondakentaro[at]gavo.t.u-tokyo.ac.jp; hayato.Futami[at]sony.com; Yosuke.Kashiwagi[at]sony.com
