Chatterbox TTS — GGUF (ggml-quantised)

GGUF / ggml conversion of ResembleAI/chatterbox for use with CrispStrobe/CrispASR.

Chatterbox is a full TTS pipeline: character tokenizer → T3 (30-layer Llama AR, 520M parameters) → speech tokens → S3Gen (Conformer encoder + UNet1D CFM denoiser, 10 Euler steps) → HiFTGenerator vocoder (conv chains + Snake activations + iSTFT) → 24 kHz WAV. It is distributed under the MIT license.

Two GGUF files are needed: the T3 model (text → speech tokens) and the S3Gen model (speech tokens → audio).

Files

File	Quant	Size	Notes
`chatterbox-t3-f16.gguf`	F16	1.1 GB	T3 AR model — reference quality
`chatterbox-t3-q8_0.gguf`	Q8_0	630 MB	T3 AR model — recommended
`chatterbox-t3-q4_k.gguf`	Q4_K	374 MB	T3 AR model — smallest
`chatterbox-s3gen-f16.gguf`	F16	574 MB	S3Gen + vocoder — reference quality
`chatterbox-s3gen-q8_0.gguf`	Q8_0	358 MB	S3Gen + vocoder — recommended
`chatterbox-s3gen-q4_k.gguf`	Q4_K	248 MB	S3Gen + vocoder — smallest

Note: vocoder weights (conv_pre, resblocks, conv_post, source fusion) are kept at F32 in all quant levels for audio quality. Quantization applies to the Conformer encoder, UNet decoder, and T3 Llama layers.
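
Because one T3 file and one S3Gen file are always needed together, the sizes in the file table can be summed per quant level to estimate the download footprint. The sketch below does only that arithmetic; it assumes "1.1 GB" means roughly 1100 MB, and the names `SIZES_MB` and `total_download_mb` are illustrative, not part of CrispASR.

```python
# Sizes in MB, copied from the file table above.
# Assumption: "1.1 GB" is treated as roughly 1100 MB.
SIZES_MB = {
    "f16": {"t3": 1100, "s3gen": 574},
    "q8_0": {"t3": 630, "s3gen": 358},
    "q4_k": {"t3": 374, "s3gen": 248},
}


def total_download_mb(quant: str) -> int:
    """Combined size of the T3 and S3Gen files for one quant level."""
    files = SIZES_MB[quant]
    return files["t3"] + files["s3gen"]


for quant in SIZES_MB:
    print(f"{quant}: ~{total_download_mb(quant)} MB")
```

Under these assumptions the recommended Q8_0 pair comes to just under 1 GB, and the Q4_K pair to a bit over 600 MB.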

The T3 GGUF files include the BPE tokenizer as the tokenizer.ggml.tokens and tokenizer.ggml.merges arrays. Uploads from before 2026-05-08 were missing the merges key, which caused the CrispASR loader to fall back to a char-level tokenizer that dropped uppercase letters and spaces. If a downstream check shows ASR roundtrip degradation with these files, re-pull them.
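
One way to tell whether a local T3 file predates the fix is to look for the two tokenizer keys in its metadata. The following is a minimal sketch, assuming the `gguf` Python package's `GGUFReader` exposes metadata through its `fields` mapping; the helper names `missing_tokenizer_keys` and `check_gguf` are hypothetical and not part of CrispASR.

```python
# Metadata keys named in the paragraph above.
REQUIRED_TOKENIZER_KEYS = ("tokenizer.ggml.tokens", "tokenizer.ggml.merges")


def missing_tokenizer_keys(present_keys):
    """Return the required tokenizer keys that are absent from present_keys."""
    present = set(present_keys)
    return [key for key in REQUIRED_TOKENIZER_KEYS if key not in present]


def check_gguf(path):
    """Read a GGUF file's metadata and list missing tokenizer keys.

    Assumption: GGUFReader.fields is a mapping keyed by metadata name.
    """
    from gguf import GGUFReader  # pip install gguf

    reader = GGUFReader(path)
    return missing_tokenizer_keys(reader.fields.keys())
```

Calling `check_gguf("chatterbox-t3-q8_0.gguf")` should return an empty list for a current upload; a result containing `tokenizer.ggml.merges` would suggest the file needs to be re-pulled.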

Quick start

# 1. Build CrispASR
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF
cmake --build build -j --target chatterbox

# 2. Pull both model files
huggingface-cli download cstr/chatterbox-GGUF chatterbox-t3-q8_0.gguf --local-dir .
huggingface-cli download cstr/chatterbox-GGUF chatterbox-s3gen-q8_0.gguf --local-dir .

# 3. Synthesise (C API / test binary — CLI adapter in progress)
# See tests/test_voc_wav.cpp for vocoder-only usage
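
Step 2 can also be done from Python instead of `huggingface-cli`. This is a sketch using `hf_hub_download` from the `huggingface_hub` package; the helper names `model_filenames` and `download_models` are made up for illustration, and the file-name pattern is inferred from the file table.

```python
REPO_ID = "cstr/chatterbox-GGUF"


def model_filenames(quant="q8_0"):
    """File names of the T3 and S3Gen GGUFs for one quant level."""
    return [f"chatterbox-t3-{quant}.gguf", f"chatterbox-s3gen-{quant}.gguf"]


def download_models(quant="q8_0", local_dir="."):
    """Download both model files and return their local paths."""
    from huggingface_hub import hf_hub_download  # pip install huggingface_hub

    return [
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir)
        for name in model_filenames(quant)
    ]
```

For example, `download_models("q4_k")` would fetch the smallest pair into the current directory.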

Architecture

Text → Character tokenizer (704 tokens)
     → T3 Llama AR (30 layers, 1024D, 16 heads, RoPE, SwiGLU, CFG)
     → 25 Hz speech tokens (6561 codebook)
     → Conformer encoder (6 pre + 4 post upsample, 512D, 8 heads)
     → 80-channel mel spectrogram
     → UNet1D CFM denoiser (1 down + 12 mid + 1 up, 256 ch, 10 Euler steps)
     → HiFTGenerator vocoder (3× ConvTranspose1d + 9 ResBlocks + Snake + iSTFT)
     → 24 kHz mono WAV
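
The two rates in the diagram (25 Hz speech tokens in, 24 kHz audio out) fix how much audio each speech token accounts for. The sketch below is plain arithmetic on those two numbers; it ignores any padding, trimming, or prompt tokens the real pipeline may add, and the function names are illustrative.

```python
TOKEN_RATE_HZ = 25       # speech tokens per second (from the diagram)
SAMPLE_RATE_HZ = 24_000  # output sample rate (from the diagram)


def samples_per_token() -> int:
    """Output samples covered by one speech token."""
    return SAMPLE_RATE_HZ // TOKEN_RATE_HZ


def audio_seconds(num_speech_tokens: int) -> float:
    """Nominal audio duration for a given number of speech tokens."""
    return num_speech_tokens / TOKEN_RATE_HZ


def audio_samples(num_speech_tokens: int) -> int:
    """Nominal number of output samples for a given number of speech tokens."""
    return num_speech_tokens * samples_per_token()
```

So each speech token corresponds to 960 output samples, and 250 tokens map to a nominal 10 seconds of audio.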

Quality verification

ASR roundtrip on Python reference mel (no source fusion, deterministic):

Metric	Value
ASR output (moonshine-base)	"Hello world" (correct)
Per-stage cosine vs Python ref	1.000 (conv_pre through rb_2)
Waveform cosine vs torch.istft	0.93
STFT range	[-0.82, 2.0] (ref [-1.1, 1.7])

All quantization levels (F16/Q8_0/Q4_K) produce ASR-identical output on the reference mel.

The crispasr-diff chatterbox … harness reports [PASS] t3_cond_emb cos≈1.000 and [PASS] t3_prefill_emb[0] cos≈1.000 against the F16 reference for all three quant levels.
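
The cosine figures in this section compare a tensor from the GGUF runtime with the matching reference tensor, both flattened to vectors. As a reminder of what the metric measures, here is a minimal pure-Python sketch; it is not the harness's own implementation, and `cosine_similarity` is an illustrative name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

A value near 1.000 means the two tensors point in the same direction regardless of overall scale, which is why the STFT range is listed separately from the cosine rows in the table.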

Conversion

python models/convert-chatterbox-to-gguf.py \
    --input ResembleAI/chatterbox \
    --output-dir .

Requires pip install gguf safetensors torch huggingface_hub.

Related models

cstr/lahgtna-chatterbox-v1-GGUF — Arabic T3 variant (MIT, shares S3Gen)
cstr/orpheus-3b-base-GGUF — Llama-3.2 + SNAC TTS
cstr/qwen3-tts-0.6b-customvoice-GGUF — Qwen3-TTS with fixed speakers

License

MIT — same as the upstream ResembleAI/chatterbox.
