ult2mel_2DCNN — Ultrasound-to-Mel (TaL80, 4 speakers)

Speaker-dependent 2D CNN that maps a single ultrasound (ULT) tongue-imaging frame to one 80-band log-mel frame. This repo bundles four per-speaker checkpoints trained on the TaL80 corpus.

Files

One checkpoint + scaler pair per speaker.

01fi/model.ckpt   01fi/scaler.pkl
02fe/model.ckpt   02fe/scaler.pkl
03mn/model.ckpt   03mn/scaler.pkl
04me/model.ckpt   04me/scaler.pkl
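
The layout above can be resolved programmatically. The following is a minimal sketch that assumes the four speaker folders sit directly under a local copy of this repo. The helper name `speaker_files` and the `repo_root` argument are hypothetical and not part of the released code. Loading the checkpoint and scaler is left to the scripts in the GitHub repository.

```python
from pathlib import Path

# Speaker IDs exactly as listed in the file table of this model card.
SPEAKERS = ("01fi", "02fe", "03mn", "04me")


def speaker_files(repo_root, speaker):
    """Return (checkpoint_path, scaler_path) for one speaker.

    Hypothetical helper: it only builds paths from the documented layout
    and does not check that the files exist or try to load them.
    """
    if speaker not in SPEAKERS:
        raise ValueError(f"unknown speaker {speaker!r}; expected one of {SPEAKERS}")
    base = Path(repo_root) / speaker
    return base / "model.ckpt", base / "scaler.pkl"


ckpt_path, scaler_path = speaker_files("repo", "03mn")
print(ckpt_path.as_posix())    # repo/03mn/model.ckpt
print(scaler_path.as_posix())  # repo/03mn/scaler.pkl
```

Rejecting unknown speaker IDs up front keeps a checkpoint from being paired with another speaker's scaler by accident.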

Usage

For training, inference, and a HuggingFace-tailored predict script, see the GitHub repository:

ibrahimkhaliloglu/ult-to-speech-pytorch

Intended use & limitations

Research baseline for ultrasound-to-speech conversion and silent speech interfaces; downstream vocoders (e.g. HiFi-GAN) can synthesize audio from the predicted mel.

  • Speaker-dependent: each checkpoint works only for the speaker it was trained on.
  • Frame-wise: each mel frame is predicted from a single ultrasound frame, so no temporal context is modeled.
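
To make the frame-wise limitation concrete, here is a small sketch of what per-frame prediction implies for an utterance. `predict_frame` is a stand-in and NOT the released CNN, and the 64x128 frame size is a made-up value for illustration. Only the 80-band output size comes from this card.

```python
import numpy as np

N_MELS = 80  # from this model card: one 80-band log-mel frame per input frame


def predict_frame(ult_frame):
    """Stand-in for the per-speaker CNN (not the real model).

    Returns an 80-dim vector that depends only on the given frame.
    """
    return np.full(N_MELS, float(ult_frame.mean()))


def predict_utterance(ult_frames):
    """Apply the frame predictor independently to each frame -> (T, 80)."""
    return np.stack([predict_frame(f) for f in ult_frames])


rng = np.random.default_rng(0)
ult = rng.random((5, 64, 128))  # 5 frames; 64x128 is a hypothetical resolution
mel = predict_utterance(ult)
print(mel.shape)  # (5, 80)

# With no temporal context, reversing the input frames just reverses the output.
mel_reversed = predict_utterance(ult[::-1])
print(np.array_equal(mel_reversed, mel[::-1]))  # True
```

Because each output frame ignores its neighbours, any smoothing across time has to come from a later stage, such as the vocoder.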

License

The TaL80 corpus is distributed under its own license; users must comply with it.

Citation

@inproceedings{ibrahimov25_interspeech,
  title     = {{Conformer-based Ultrasound-to-Speech Conversion}},
  author    = {Ibrahim Ibrahimov and Csaba Zainkó and Gábor Gosztolya},
  year      = {2025},
  booktitle = {{Interspeech 2025}},
  pages     = {5578--5582},
  doi       = {10.21437/Interspeech.2025-2147},
  issn      = {2958-1796},
}

Contact: Ibrahim Ibrahimov — ibrahimkhaliloglu@gmail.com
