ult2mel_2DCNN — Ultrasound-to-Mel (TaL80, 4 speakers)
Speaker-dependent 2D CNN that maps a single ultrasound (ULT) tongue-imaging frame to one 80-band log-mel frame. This repo bundles four per-speaker checkpoints trained on the TaL80 corpus.
Files
One checkpoint and scaler pair per speaker:
- 01fi/model.ckpt, 01fi/scaler.pkl
- 02fe/model.ckpt, 02fe/scaler.pkl
- 03mn/model.ckpt, 03mn/scaler.pkl
- 04me/model.ckpt, 04me/scaler.pkl
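Because every speaker has its own checkpoint and scaler, it helps to resolve both paths together and reject unknown speaker IDs early. The following is a minimal sketch, assuming the files have been downloaded into a local directory that mirrors the layout listed above. The helper name `speaker_files` and the `root` argument are hypothetical and are not part of the GitHub repository.

```python
from pathlib import Path

# Speaker IDs taken from the file list in this model card.
SPEAKERS = ("01fi", "02fe", "03mn", "04me")


def speaker_files(root, speaker):
    """Return (checkpoint_path, scaler_path) for one speaker.

    `root` is assumed to be a local copy of this repo's files;
    this helper only builds paths and does not load anything.
    """
    if speaker not in SPEAKERS:
        raise ValueError(
            f"unknown speaker {speaker!r}; expected one of {SPEAKERS}"
        )
    base = Path(root) / speaker
    return base / "model.ckpt", base / "scaler.pkl"
```

For example, `speaker_files(".", "01fi")` returns the pair of paths under `01fi/`. How the checkpoint and scaler are actually loaded is defined by the scripts in the GitHub repository.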
Usage
For training, inference, and a HuggingFace-tailored predict script, see the GitHub repository:
ibrahimkhaliloglu/ult-to-speech-pytorch
Intended use & limitations
Intended as a research baseline for ultrasound-to-speech conversion and silent speech interfaces. A downstream vocoder (e.g. HiFi-GAN) can synthesize audio from the predicted mel-spectrogram.
- Speaker-dependent: each checkpoint works only for the speaker it was trained on.
- Frame-wise: each mel frame is predicted from a single ultrasound frame, so no temporal context is modeled.
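To make the frame-wise limitation concrete, the sketch below applies a per-frame predictor to a sequence of frames one at a time, with no state carried between them. It is only an illustration: `predict_framewise` and `toy_model` are hypothetical names, and `toy_model` is a stand-in that returns 80 values per frame, not the CNN shipped in this repo.

```python
N_MELS = 80  # one 80-band log-mel frame per ultrasound frame


def predict_framewise(frames, frame_model):
    """Apply `frame_model` to each frame independently.

    No state is carried between frames, so output t depends
    only on input t (no temporal context).
    """
    mels = []
    for frame in frames:
        mel = list(frame_model(frame))
        if len(mel) != N_MELS:
            raise ValueError(f"expected {N_MELS} mel bands, got {len(mel)}")
        mels.append(mel)
    return mels


def toy_model(frame):
    """Hypothetical stand-in for the CNN: the mean pixel value, repeated 80 times."""
    flat = [px for row in frame for px in row]
    mean = sum(flat) / len(flat)
    return [mean] * N_MELS
```

Because each frame is processed in isolation, reordering the input frames simply reorders the output frames. A model with temporal context would not have this property.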
License
The TaL80 corpus is distributed under its own license, and users must comply with it.
Citation
@inproceedings{ibrahimov25_interspeech,
title = {{Conformer-based Ultrasound-to-Speech Conversion}},
author = {Ibrahim Ibrahimov and Csaba Zainkó and Gábor Gosztolya},
year = {2025},
booktitle = {{Interspeech 2025}},
pages = {5578--5582},
doi = {10.21437/Interspeech.2025-2147},
issn = {2958-1796},
}
Contact: Ibrahim Ibrahimov — ibrahimkhaliloglu@gmail.com