Chatterbox TTS: GGUF (ggml-quantised)

GGUF / ggml conversion of ResembleAI/chatterbox for use with CrispStrobe/CrispASR.

Chatterbox is a full TTS pipeline: character tokenizer → T3 (30-layer Llama AR, 520M) → speech tokens → S3Gen (Conformer encoder + UNet1D CFM denoiser, 10 Euler steps) → HiFTGenerator vocoder (conv chains + Snake activations + iSTFT) → 24 kHz WAV. It is distributed under the MIT license.

Two GGUF files are needed: the T3 model (text → speech tokens) and the S3Gen model (speech tokens → audio).
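For a sense of scale between the two stages: the speech tokens run at 25 Hz and the output is 24 kHz (both figures appear under Architecture), so one token nominally accounts for 960 samples. A short sketch of that arithmetic follows; the helper name is illustrative, and the actual output length may differ slightly depending on padding or trimming in the vocoder.

```python
TOKEN_RATE_HZ = 25        # speech tokens per second
SAMPLE_RATE_HZ = 24_000   # output sample rate

SAMPLES_PER_TOKEN = SAMPLE_RATE_HZ // TOKEN_RATE_HZ  # 960


def nominal_duration(n_tokens: int) -> tuple:
    """Return (seconds, samples) nominally covered by n_tokens speech tokens."""
    return n_tokens / TOKEN_RATE_HZ, n_tokens * SAMPLES_PER_TOKEN


seconds, samples = nominal_duration(100)
print(seconds, samples)  # 4.0 96000
```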

Files

| File | Quant | Size | Notes |
|---|---|---|---|
| chatterbox-t3-f16.gguf | F16 | 1.1 GB | T3 AR model, reference quality |
| chatterbox-t3-q8_0.gguf | Q8_0 | 630 MB | T3 AR model, recommended |
| chatterbox-t3-q4_k.gguf | Q4_K | 374 MB | T3 AR model, smallest |
| chatterbox-s3gen-f16.gguf | F16 | 574 MB | S3Gen + vocoder, reference quality |
| chatterbox-s3gen-q8_0.gguf | Q8_0 | 358 MB | S3Gen + vocoder, recommended |
| chatterbox-s3gen-q4_k.gguf | Q4_K | 248 MB | S3Gen + vocoder, smallest |
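
Since one T3 file and one S3Gen file are needed together, the combined download per quant level follows directly from the sizes listed above. The sketch below only adds those numbers up; it treats 1.1 GB as 1100 MB, which is an approximation.

```python
# Sizes in MB, copied from the file list (1.1 GB approximated as 1100 MB).
SIZES_MB = {
    "f16": {"t3": 1100, "s3gen": 574},
    "q8_0": {"t3": 630, "s3gen": 358},
    "q4_k": {"t3": 374, "s3gen": 248},
}


def total_mb(quant: str) -> int:
    """Combined size of the T3 and S3Gen files for one quant level."""
    parts = SIZES_MB[quant]
    return parts["t3"] + parts["s3gen"]


for quant in SIZES_MB:
    print(f"{quant}: ~{total_mb(quant)} MB")
# q8_0 totals 988 MB, q4_k totals 622 MB, f16 roughly 1674 MB
```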

Note: vocoder weights (conv_pre, resblocks, conv_post, source fusion) are kept at F32 in all quant levels for audio quality. Quantization applies to the Conformer encoder, UNet decoder, and T3 Llama layers.

The T3 GGUF files include the BPE tokenizer data in the tokenizer.ggml.tokens and tokenizer.ggml.merges arrays. Uploads made before 2026-05-08 were missing the merges key, which caused the CrispASR loader to fall back to a char-level tokenizer that dropped uppercase letters and spaces. If a downstream check shows ASR roundtrip degradation with older copies of these files, re-pull them.
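
To check whether a local T3 file carries the merges array, one can list the metadata keys in its GGUF header. The sketch below is a standard-library-only reader written for illustration, assuming the little-endian GGUF v2/v3 header layout. `read_gguf_keys` and `has_merges` are hypothetical helpers, not part of CrispASR; the `gguf` Python package provides a more complete reader.

```python
import struct
from typing import BinaryIO, List

# Byte widths of fixed-size GGUF value types (type id -> size in bytes).
_FIXED = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}
_STRING, _ARRAY = 8, 9


def _u32(f: BinaryIO) -> int:
    return struct.unpack("<I", f.read(4))[0]


def _u64(f: BinaryIO) -> int:
    return struct.unpack("<Q", f.read(8))[0]


def _skip_value(f: BinaryIO, vtype: int) -> None:
    """Advance past one metadata value without decoding it."""
    if vtype in _FIXED:
        f.seek(_FIXED[vtype], 1)
    elif vtype == _STRING:
        f.seek(_u64(f), 1)          # length-prefixed UTF-8 string
    elif vtype == _ARRAY:
        elem_type, count = _u32(f), _u64(f)
        if elem_type in _FIXED:
            f.seek(_FIXED[elem_type] * count, 1)
        else:
            for _ in range(count):  # strings or nested arrays
                _skip_value(f, elem_type)
    else:
        raise ValueError(f"unknown GGUF value type {vtype}")


def read_gguf_keys(f: BinaryIO) -> List[str]:
    """Return the metadata key names from a GGUF header."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    _version, _n_tensors, n_kv = _u32(f), _u64(f), _u64(f)
    keys = []
    for _ in range(n_kv):
        key = f.read(_u64(f)).decode("utf-8")
        _skip_value(f, _u32(f))
        keys.append(key)
    return keys


def has_merges(path: str) -> bool:
    """True if the file's metadata contains tokenizer.ggml.merges."""
    with open(path, "rb") as f:
        return "tokenizer.ggml.merges" in read_gguf_keys(f)
```

Calling `has_merges("chatterbox-t3-q8_0.gguf")` on a freshly pulled file should return True if the file matches the description above.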

Quick start

# 1. Build CrispASR
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF
cmake --build build -j --target chatterbox

# 2. Pull both model files
huggingface-cli download cstr/chatterbox-GGUF chatterbox-t3-q8_0.gguf --local-dir .
huggingface-cli download cstr/chatterbox-GGUF chatterbox-s3gen-q8_0.gguf --local-dir .

# 3. Synthesise (C API / test binary; CLI adapter in progress)
# See tests/test_voc_wav.cpp for vocoder-only usage

Architecture

Text → Character tokenizer (704 tokens)
     → T3 Llama AR (30 layers, 1024D, 16 heads, RoPE, SwiGLU, CFG)
     → 25 Hz speech tokens (6561 codebook)
     → Conformer encoder (6 pre + 4 post upsample, 512D, 8 heads)
     → 80-channel mel spectrogram
     → UNet1D CFM denoiser (1 down + 12 mid + 1 up, 256 ch, 10 Euler steps)
     → HiFTGenerator vocoder (3× ConvTranspose1d + 9 ResBlocks + Snake + iSTFT)
     → 24 kHz mono WAV
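
The "10 Euler steps" in the CFM stage refer to fixed-step integration of an ODE that carries noise toward a mel spectrogram. The following is a minimal sketch of that integration scheme only, assuming a uniform time grid on [0, 1] and a toy straight-line velocity field standing in for the UNet1D. `euler_integrate`, `toy_velocity`, and `TARGET` are hypothetical names, and the real step schedule may differ.

```python
from typing import Callable, List

Vector = List[float]


def euler_integrate(velocity: Callable[[Vector, float], Vector],
                    x0: Vector, n_steps: int = 10) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt                      # step start time: 0.0, 0.1, ..., 0.9
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


TARGET = [0.5, -1.0, 2.0]  # stand-in for one mel frame (assumption)


def toy_velocity(x: Vector, t: float) -> Vector:
    # Straight-line flow toward TARGET; a real model predicts the
    # velocity with a network conditioned on the speech tokens.
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(TARGET, x)]


noise = [0.0, 0.0, 0.0]
result = euler_integrate(toy_velocity, noise, n_steps=10)
print(result)  # lands on TARGET up to floating-point error
```

With this particular toy field the Euler steps follow the straight line exactly, so the step count does not affect the endpoint; with a learned field, more steps generally trade compute for accuracy.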

Quality verification

ASR roundtrip on Python reference mel (no source fusion, deterministic):

| Metric | Value |
|---|---|
| ASR output (moonshine-base) | "Hello world" (correct) |
| Per-stage cosine vs Python ref | 1.000 (conv_pre through rb_2) |
| Waveform cosine vs torch.istft | 0.93 |
| STFT range | [-0.82, 2.0] (ref [-1.1, 1.7]) |

All quantization levels (F16/Q8_0/Q4_K) produce ASR-identical output on the reference mel.

The crispasr-diff chatterbox … harness reports [PASS] t3_cond_emb cos≈1.000 and [PASS] t3_prefill_emb[0] cos≈1.000 against the F16 reference for all three quant levels.
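
The cosine figures in this section are similarity scores between an output tensor and its reference, with 1.000 meaning the two point in the same direction. A minimal sketch of the metric over flattened values, assuming the standard definition (how the harness computes it internally is not shown here):

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(a, b) = <a, b> / (|a| * |b|) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


reference = [0.1, -0.4, 0.7]
scaled = [2 * v for v in reference]   # same direction, different gain
print(round(cosine_similarity(reference, scaled), 3))  # 1.0
```

Note that cosine ignores overall scale, so it does not by itself detect gain differences between an output and its reference.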

Conversion

python models/convert-chatterbox-to-gguf.py \
    --input ResembleAI/chatterbox \
    --output-dir .

Requires pip install gguf safetensors torch huggingface_hub.
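
Before running the converter, it can help to confirm that the listed packages are importable. A small sketch using only the standard library follows; `missing_packages` is a hypothetical helper, and it assumes each package's import name matches its pip name, which holds for these four.

```python
import importlib.util
from typing import Iterable, List

REQUIRED = ["gguf", "safetensors", "torch", "huggingface_hub"]


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the names that cannot be found on the current import path."""
    return [n for n in names if importlib.util.find_spec(n) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All converter dependencies found.")
```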

License

MIT, same as the upstream ResembleAI/chatterbox.
