---
license: apache-2.0
pipeline_tag: feature-extraction
---

# Soprano: Instant, Ultra‑Realistic Text‑to‑Speech

<div align="center">

<img width="640" height="320" alt="soprano-github" src="https://github.com/user-attachments/assets/4d612eac-23b8-44e6-8c59-d7ac14ebafd1" />

[Code](https://github.com/ekwek1/soprano)
[Demo](https://huggingface.co/spaces/ekwek/Soprano-TTS)
</div>

### 📰 News
**2026.01.13 - [Soprano-Factory](https://github.com/ekwek1/soprano-factory) released! You can now train/fine-tune your own Soprano models.**
2025.12.22 - Soprano-80M released! [Code](https://github.com/ekwek1/soprano) | [Demo](https://huggingface.co/spaces/ekwek/Soprano-TTS)

---

This repository contains **Soprano-Encoder**, which converts raw audio into audio tokens that the LLM backbone can recognize.

## Overview

**Soprano** is an ultra‑lightweight, on-device text‑to‑speech (TTS) model designed for expressive, high‑fidelity speech synthesis at unprecedented speed. Soprano was designed with the following features:
- Up to **2000x** real-time generation on GPU and **20x** real-time on CPU
- **Lossless streaming** with **<15 ms** latency on GPU, **<250 ms** on CPU
- **<1 GB** memory usage with a compact 80M parameter architecture
- **Infinite generation length** with automatic text splitting
- Highly expressive, crystal clear audio generation at **32 kHz**
- Broad support for CUDA, CPU, and MPS devices on Windows, Linux, and Mac
- Supports a WebUI, a CLI, and an OpenAI-compatible endpoint for easy, production-ready inference

---
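The Overview lists "automatic text splitting" as the mechanism behind unbounded generation length, but this card does not describe how the splitting is done. The sketch below is only an illustration of the general idea, not Soprano's implementation: it assumes text is split at sentence-ending punctuation and that sentences are packed greedily into chunks under a character budget. The names `split_text` and `max_chars` are hypothetical and are not part of Soprano's API.

```python
import re


def split_text(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    Illustrative only (not Soprano's actual splitter). A single sentence
    longer than max_chars is kept whole rather than cut mid-sentence.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            # The sentence fits, or the chunk is empty and the sentence
            # alone exceeds the budget: accept it either way.
            current = candidate
        else:
            # Budget exceeded: close the current chunk and start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in split_text("One. Two! Three?", max_chars=9):
        print(chunk)
```

Each chunk would then be synthesized on its own and the resulting audio concatenated. That step depends on Soprano's inference interface and is deliberately left out here.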