Chatterbox TTS - GGUF (ggml-quantised)
GGUF / ggml conversion of ResembleAI/chatterbox for use with CrispStrobe/CrispASR.
Chatterbox is a full TTS pipeline: character tokenizer → T3 (30-layer Llama AR, 520M) → speech tokens → S3Gen (Conformer encoder + UNet1D CFM denoiser, 10 Euler steps) → HiFTGenerator vocoder (conv chains + Snake activations + iSTFT) → 24 kHz WAV. Distributed under MIT license.
Two GGUF files are needed: the T3 model (text → speech tokens) and the S3Gen model (speech tokens → audio).
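The "10 Euler steps" in the S3Gen stage refer to fixed-step integration of a velocity field, which in flow-matching models typically carries a noise sample toward a mel spectrogram. Below is a minimal sketch of such a loop. The function names, the uniform time grid, and the toy velocity field standing in for the UNet1D are illustrative assumptions, not code from CrispASR or Chatterbox.

```python
def euler_integrate(velocity, x0, n_steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt                      # uniform time grid (assumption)
        v = velocity(x, t)              # in the real model: one denoiser call
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical toy field: straight-line flow toward a fixed target vector.
TARGET = [1.0, -2.0, 0.5]


def toy_velocity(x, t):
    # For a straight path from the start to TARGET, the velocity at time t
    # is (TARGET - x) / (1 - t). t never reaches 1 inside the loop.
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(TARGET, x)]


if __name__ == "__main__":
    print(euler_integrate(toy_velocity, [0.0, 0.0, 0.0]))
```

Each step costs one evaluation of the velocity function, so the step count trades synthesis time against integration accuracy.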
Files
| File | Quant | Size | Notes |
|---|---|---|---|
| chatterbox-t3-f16.gguf | F16 | 1.1 GB | T3 AR model, reference quality |
| chatterbox-t3-q8_0.gguf | Q8_0 | 630 MB | T3 AR model, recommended |
| chatterbox-t3-q4_k.gguf | Q4_K | 374 MB | T3 AR model, smallest |
| chatterbox-s3gen-f16.gguf | F16 | 574 MB | S3Gen + vocoder, reference quality |
| chatterbox-s3gen-q8_0.gguf | Q8_0 | 358 MB | S3Gen + vocoder, recommended |
| chatterbox-s3gen-q4_k.gguf | Q4_K | 248 MB | S3Gen + vocoder, smallest |
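Because both a T3 file and an S3Gen file are needed, the download cost is the sum of the two. The sketch below adds up the sizes listed in the table and picks the largest quant level that fits a disk budget. The helper names are hypothetical, and 1.1 GB is rounded to 1100 MB.

```python
# (T3 size, S3Gen size) in MB, taken from the table; 1.1 GB treated as ~1100 MB.
SIZES_MB = {
    "f16": (1100, 574),
    "q8_0": (630, 358),
    "q4_k": (374, 248),
}


def total_size_mb(quant):
    """Combined size of the T3 + S3Gen pair at one quant level."""
    return sum(SIZES_MB[quant])


def largest_quant_within(budget_mb):
    """Return the biggest pair that fits the budget, or None if none fits."""
    fitting = [q for q in SIZES_MB if total_size_mb(q) <= budget_mb]
    return max(fitting, key=total_size_mb) if fitting else None


if __name__ == "__main__":
    for quant in SIZES_MB:
        print(quant, total_size_mb(quant), "MB")
```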
Note: vocoder weights (conv_pre, resblocks, conv_post, source fusion) are kept at F32 in all quant levels for audio quality. Quantization applies to the Conformer encoder, UNet decoder, and T3 Llama layers.
The T3 GGUF files include the BPE tokenizer as the tokenizer.ggml.tokens and tokenizer.ggml.merges arrays. Earlier uploads (before 2026-05-08) were missing the merges key, which caused the CrispASR loader to fall back to a char-level tokenizer that dropped uppercase letters and spaces. If a downstream check shows ASR roundtrip degradation with these files, re-pull them.
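One way to tell whether a local T3 file predates the fix is to list its metadata keys and look for the merges array. The sketch below uses GGUFReader from the gguf Python package. The helper names are hypothetical, and it only tests for the presence of the two keys named above, not their contents.

```python
REQUIRED_KEYS = ("tokenizer.ggml.tokens", "tokenizer.ggml.merges")


def missing_tokenizer_keys(keys):
    """Return the required tokenizer keys that are absent from `keys`."""
    present = set(keys)
    return [k for k in REQUIRED_KEYS if k not in present]


def check_gguf(path):
    """Read a GGUF file's metadata and report missing tokenizer keys."""
    from gguf import GGUFReader  # pip install gguf; imported lazily

    reader = GGUFReader(path)
    return missing_tokenizer_keys(reader.fields.keys())


if __name__ == "__main__":
    import sys

    missing = check_gguf(sys.argv[1])
    print("missing keys:", missing if missing else "none")
```

A non-empty result for tokenizer.ggml.merges would suggest the file is an older upload and should be downloaded again.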
Quick start
# 1. Build CrispASR
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF
cmake --build build -j --target chatterbox
# 2. Pull both model files
huggingface-cli download cstr/chatterbox-GGUF chatterbox-t3-q8_0.gguf --local-dir .
huggingface-cli download cstr/chatterbox-GGUF chatterbox-s3gen-q8_0.gguf --local-dir .
# 3. Synthesise (C API / test binary; CLI adapter in progress)
# See tests/test_voc_wav.cpp for vocoder-only usage
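Step 2 can also be done from Python using hf_hub_download from huggingface_hub instead of the CLI. This is a sketch under the assumption that both files should share one quant level. The function names are hypothetical; the file naming follows the table in the Files section.

```python
REPO_ID = "cstr/chatterbox-GGUF"
QUANTS = ("f16", "q8_0", "q4_k")


def files_for(quant="q8_0"):
    """Return the T3 and S3Gen file names for one quant level."""
    if quant not in QUANTS:
        raise ValueError(f"unknown quant {quant!r}, expected one of {QUANTS}")
    return [f"chatterbox-{part}-{quant}.gguf" for part in ("t3", "s3gen")]


def download_pair(quant="q8_0", local_dir="."):
    """Download both model files and return their local paths."""
    from huggingface_hub import hf_hub_download  # pip install huggingface_hub

    return [
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir)
        for name in files_for(quant)
    ]


if __name__ == "__main__":
    print(download_pair("q8_0"))
```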
Architecture
Text → Character tokenizer (704 tokens)
→ T3 Llama AR (30 layers, 1024D, 16 heads, RoPE, SwiGLU, CFG)
→ 25 Hz speech tokens (6561 codebook)
→ Conformer encoder (6 pre + 4 post upsample, 512D, 8 heads)
→ 80-channel mel spectrogram
→ UNet1D CFM denoiser (1 down + 12 mid + 1 up, 256 ch, 10 Euler steps)
→ HiFTGenerator vocoder (3× ConvTranspose1d + 9 ResBlocks + Snake + iSTFT)
→ 24 kHz mono WAV
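The two rates in this pipeline (25 Hz speech tokens in, 24 kHz audio out) fix how much audio each speech token accounts for. A small worked example follows. It assumes the output length is exactly proportional to the token count, which the real pipeline may not guarantee because of padding or trimming, and the helper names are illustrative.

```python
SAMPLE_RATE_HZ = 24_000   # output WAV sample rate
TOKEN_RATE_HZ = 25        # speech tokens per second

# Audio samples covered by one speech token: 24000 / 25 = 960.
SAMPLES_PER_TOKEN = SAMPLE_RATE_HZ // TOKEN_RATE_HZ


def tokens_to_seconds(n_tokens):
    """Approximate audio duration for a number of speech tokens."""
    return n_tokens / TOKEN_RATE_HZ


def tokens_to_samples(n_tokens):
    """Approximate number of output samples for a number of speech tokens."""
    return n_tokens * SAMPLES_PER_TOKEN


if __name__ == "__main__":
    print(SAMPLES_PER_TOKEN, tokens_to_seconds(250), tokens_to_samples(250))
```

Under this assumption, ten seconds of speech corresponds to 250 autoregressive T3 steps.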
Quality verification
ASR roundtrip on Python reference mel (no source fusion, deterministic):
| Metric | Value |
|---|---|
| ASR output (moonshine-base) | "Hello world" (correct) |
| Per-stage cosine vs Python ref | 1.000 (conv_pre through rb_2) |
| Waveform cosine vs torch.istft | 0.93 |
| STFT range | [-0.82, 2.0] (ref [-1.1, 1.7]) |
All quantization levels (F16/Q8_0/Q4_K) produce ASR-identical output on the reference mel.
The crispasr-diff chatterbox … harness reports [PASS] t3_cond_emb cos≈1.000, [PASS] t3_prefill_emb[0] cos≈1.000 against the F16 reference for all three quant levels.
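The cosine figures above compare an intermediate tensor against a reference. As a reminder of what that metric measures, here is a plain-Python sketch of cosine similarity over flattened tensors. It assumes the comparison is the standard dot-product form and is not taken from the crispasr-diff source.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length flat vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero input
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    reference = [0.10, -0.25, 0.40]
    candidate = [0.11, -0.24, 0.41]  # made-up values for illustration
    print(round(cosine_similarity(reference, candidate), 4))
```

Cosine similarity ignores overall scale, so a value near 1.000 shows that the two tensors point in the same direction but does not by itself show that their magnitudes match.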
Conversion
python models/convert-chatterbox-to-gguf.py \
--input ResembleAI/chatterbox \
--output-dir .
Requires pip install gguf safetensors torch huggingface_hub.
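Before running the conversion, it can help to confirm that the four packages are importable. The following sketch uses only the standard library; the helper name is hypothetical, and it assumes each package's import name matches its pip name as listed above.

```python
import importlib.util

REQUIRED = ("gguf", "safetensors", "torch", "huggingface_hub")


def missing_modules(names=REQUIRED):
    """Return the top-level module names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("missing:", " ".join(missing))
    else:
        print("all conversion dependencies found")
```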
Related models
- cstr/lahgtna-chatterbox-v1-GGUF: Arabic T3 variant (MIT, shares S3Gen)
- cstr/orpheus-3b-base-GGUF: Llama-3.2 + SNAC TTS
- cstr/qwen3-tts-0.6b-customvoice-GGUF: Qwen3-TTS with fixed speakers
License
MIT, same as the upstream ResembleAI/chatterbox.