| language: | |
| - multilingual | |
| - en | |
| - zh | |
| - fr | |
| - de | |
| - es | |
| - ja | |
| - ko | |
| - pt | |
| - it | |
| - ru | |
| - ar | |
| - hi | |
| - tr | |
| - pl | |
| - nl | |
| - sv | |
| - da | |
| - fi | |
| - no | |
| - cs | |
| - ro | |
| - hu | |
| tags: | |
| - text-to-speech | |
| - executorch | |
| - on-device | |
| - android | |
| - voice-cloning | |
| - chatterbox | |
| license: apache-2.0 | |
| # Chatterbox Multilingual TTS: ExecuTorch Models | |
| Pre-exported `.pte` model files for running [Resemble AI's Chatterbox Multilingual TTS](https://github.com/resemble-ai/chatterbox) fully on-device using [ExecuTorch](https://pytorch.org/executorch/). | |
| **📦 Code & export scripts:** [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch) on GitHub | |
| --- | |
| ## What's Here | |
| 9 ExecuTorch `.pte` files covering the complete TTS pipeline, from text input to 24 kHz waveform, with zero PyTorch runtime required: | |
| | File | Size | Backend | Precision | Stage | | |
| |------|------|---------|-----------|-------| | |
| | `voice_encoder.pte` | 7 MB | portable | FP32 | Speaker embedding | | |
| | `xvector_encoder.pte` | 27 MB | portable | FP32 | X-vector conditioning | | |
| | `t3_cond_speech_emb.pte` | 49 MB | portable | FP32 | Speech token embedding | | |
| | `t3_cond_enc.pte` | 18 MB | portable | FP32 | Text/conditioning encoder | | |
| | `t3_prefill.pte` | 1010 MB | XNNPACK | **FP16** | T3 Transformer prefill | | |
| | `t3_decode.pte` | 1002 MB | XNNPACK | **FP16** | T3 Transformer decode | | |
| | `s3gen_encoder.pte` | 178 MB | portable | FP32 | S3Gen Conformer encoder | | |
| | `cfm_step.pte` | 274 MB | XNNPACK | FP32 | CFM flow matching step | | |
| | `hifigan.pte` | 84 MB | XNNPACK | FP32 | HiFiGAN vocoder | | |
| | **Total** | **~2.6 GB** | | | | | |
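After downloading, it can help to confirm that all nine files from the table are present before running inference. The following is a minimal sketch, not part of the repository: the helper names `missing_models` and `total_size_mb` are hypothetical, the sizes are the approximate values listed above, and it assumes the `.pte` files sit at the top level of the download directory.

```python
from pathlib import Path

# File names and approximate sizes (MB) copied from the table above.
EXPECTED_FILES = {
    "voice_encoder.pte": 7,
    "xvector_encoder.pte": 27,
    "t3_cond_speech_emb.pte": 49,
    "t3_cond_enc.pte": 18,
    "t3_prefill.pte": 1010,
    "t3_decode.pte": 1002,
    "s3gen_encoder.pte": 178,
    "cfm_step.pte": 274,
    "hifigan.pte": 84,
}


def missing_models(model_dir):
    """Return the expected .pte names that are absent from model_dir, sorted.

    Assumption: the files live directly in model_dir, not in a subfolder.
    """
    root = Path(model_dir)
    return sorted(name for name in EXPECTED_FILES if not (root / name).is_file())


def total_size_mb():
    """Sum the listed per-file sizes in MB."""
    return sum(EXPECTED_FILES.values())


if __name__ == "__main__":
    absent = missing_models("et_models")
    print("missing:", absent if absent else "none")
    print("listed total: %d MB" % total_size_mb())
```

The listed sizes add up to 2649 MB, which matches the ~2.6 GB total in the table.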
| --- | |
| ## Quick Download | |
| ```python | |
| from huggingface_hub import snapshot_download | |
| snapshot_download( | |
| "acul3/chatterbox-executorch", | |
| local_dir="et_models", | |
| repo_type="model" | |
| ) | |
| ``` | |
| --- | |
| ## Pipeline Overview | |
| ``` | |
| Text → MTLTokenizer → text tokens | |
| Reference Audio → VoiceEncoder + CAMPPlus → speaker conditioning | |
| ↓ | |
| T3 Prefill (LlamaModel, conditioned) | |
| ↓ | |
| T3 Decode (autoregressive, ~100 tokens) | |
| ↓ | |
| S3Gen Encoder (Conformer) | |
| ↓ | |
| CFM Step × 2 (flow matching) | |
| ↓ | |
| HiFiGAN (vocoder, chunked) | |
| ↓ | |
| 24kHz PCM waveform 🎵 | |
| ``` | |
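The "chunked" vocoder stage can be illustrated with a small sketch. Because the exported HiFiGAN takes a fixed number of mel frames (`T_MEL=300`, per the technical notes), a longer sequence has to be cut into 300-frame pieces, with the last piece padded. The function below is a hypothetical helper, not the repository's implementation: it treats each frame as one list element, and zero padding with no overlap between chunks is an assumption.

```python
HIFIGAN_T_MEL = 300  # fixed vocoder input length stated in this model card


def chunk_frames(frames, chunk_len=HIFIGAN_T_MEL, pad_value=0.0):
    """Split a frame sequence into fixed-length chunks.

    Returns a list of (chunk, valid_len) pairs. Every chunk has exactly
    chunk_len entries; valid_len says how many of them are real frames
    rather than padding.
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    chunks = []
    for start in range(0, len(frames), chunk_len):
        piece = list(frames[start:start + chunk_len])
        valid_len = len(piece)
        # Pad the final, shorter piece up to the fixed length.
        piece.extend([pad_value] * (chunk_len - valid_len))
        chunks.append((piece, valid_len))
    return chunks


if __name__ == "__main__":
    for chunk, valid_len in chunk_frames([1.0] * 700):
        print(len(chunk), valid_len)
```

After vocoding, the audio produced from each chunk would be trimmed in proportion to its `valid_len` before the pieces are concatenated.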
| --- | |
| ## Key Technical Notes | |
| - **T3 Decode** uses a manually unrolled 30-layer Llama forward pass with a static KV cache (`torch.where` writes), bypassing HF `DynamicCache` for `torch.export` compatibility | |
| - **HiFiGAN** uses a manual real-valued DFT (cosine/sine matrix multiply) in place of `torch.stft`/`torch.istft`, which XNNPACK doesn't support | |
| - **T3 models** are FP16 (XNNPACK half-precision kernels): about half the size of FP32 with near-identical quality | |
| - **Fixed shapes:** CFM expects `T_MEL=2200`, HiFiGAN expects `T_MEL=300` (use chunked processing for longer audio) | |
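The real-valued DFT mentioned in the HiFiGAN note rests on the identity X[k] = Σ x[t]·cos(2πkt/N) − i·Σ x[t]·sin(2πkt/N): the real and imaginary parts are each a plain matrix-vector product, so no complex arithmetic is needed. The sketch below shows that identity in pure Python for a single frame. It is an illustration only, not the exported module, which presumably applies the same idea to windowed frames with tensor operations; the function names are hypothetical.

```python
import math


def dft_matrices(n):
    """Build the n x n cosine and sine matrices of the DFT."""
    cos_m = [[math.cos(2 * math.pi * k * t / n) for t in range(n)] for k in range(n)]
    sin_m = [[math.sin(2 * math.pi * k * t / n) for t in range(n)] for k in range(n)]
    return cos_m, sin_m


def real_dft(signal):
    """Return (real, imag) parts of the DFT of a real-valued signal.

    Uses two real matrix-vector products instead of complex exponentials.
    """
    n = len(signal)
    cos_m, sin_m = dft_matrices(n)
    real = [sum(c * x for c, x in zip(row, signal)) for row in cos_m]
    # The minus sign comes from e^{-i*theta} = cos(theta) - i*sin(theta).
    imag = [-sum(s * x for s, x in zip(row, signal)) for row in sin_m]
    return real, imag


if __name__ == "__main__":
    re_part, im_part = real_dft([1.0, 2.0, 0.0, -1.0])
    print([round(v, 6) for v in re_part])
    print([round(v, 6) for v in im_part])
```

Since only multiplications and additions are involved, this formulation maps onto operators that a delegate backend is more likely to support than a dedicated FFT op.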
| --- | |
| ## Usage | |
| See the GitHub repo for full inference code: [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch) | |
| ```bash | |
| # Clone code | |
| git clone https://github.com/acul3/chatterbox-executorch.git | |
| cd chatterbox-executorch | |
| # Download models (this repo) | |
| python -c " | |
| from huggingface_hub import snapshot_download | |
| snapshot_download('acul3/chatterbox-executorch', local_dir='et_models', repo_type='model') | |
| " | |
| # Run full PTE inference | |
| python test_true_full_pte.py | |
| ``` | |
| --- | |
| ## Android Integration | |
| These models are designed for Android deployment via the [ExecuTorch Android SDK](https://pytorch.org/executorch/stable/android-setup.html). Load with: | |
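| ```kotlin | |
| import org.pytorch.executorch.Module | |
| // Assumes t3_prefill.pte has already been copied into the app's files directory | |
| val module = Module.load(context.filesDir.path + "/t3_prefill.pte") | |
| ``` | |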
| With QNN/NPU delegation on a Snapdragon device, expect a **10–50× speedup** over the CPU timings below. | |
| ## Performance (Jetson AGX Orin, CPU only) | |
| | Stage | Time | | |
| |-------|------| | |
| | Voice encoding | ~1s | | |
| | T3 prefill | ~22s | | |
| | T3 decode (~100 tokens) | ~800s total (~8s/token) | | |
| | S3Gen encoder | ~2s | | |
| | CFM (2 steps) | ~40s | | |
| | HiFiGAN | ~10s/chunk | | |
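To see how these stage timings combine, the sketch below adds them up for a given number of decoded tokens and vocoder chunks. It is a rough back-of-the-envelope model, not a benchmark: it assumes decode time grows linearly with token count at the ~8s/token rate from the table, and that the other stages cost the same regardless of utterance length. The function name is hypothetical.

```python
# Approximate CPU timings in seconds, copied from the table above.
VOICE_ENCODING_S = 1
T3_PREFILL_S = 22
T3_DECODE_S_PER_TOKEN = 8
S3GEN_ENCODER_S = 2
CFM_TOTAL_S = 40  # both flow-matching steps together
HIFIGAN_S_PER_CHUNK = 10


def estimate_latency_s(num_tokens=100, num_chunks=1):
    """Estimate end-to-end synthesis time in seconds.

    Assumption: decode scales linearly with num_tokens and the vocoder
    scales linearly with num_chunks; all other stages are constant.
    """
    fixed = VOICE_ENCODING_S + T3_PREFILL_S + S3GEN_ENCODER_S + CFM_TOTAL_S
    return fixed + T3_DECODE_S_PER_TOKEN * num_tokens + HIFIGAN_S_PER_CHUNK * num_chunks


if __name__ == "__main__":
    total = estimate_latency_s(100, 1)
    decode = T3_DECODE_S_PER_TOKEN * 100
    print("total: %ds, decode share: %.0f%%" % (total, 100 * decode / total))
```

For 100 tokens and one chunk this gives about 875 s, of which autoregressive decoding accounts for roughly 91%, so the decode stage is where acceleration would matter most.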
| --- | |
| ## License | |
| Model weights are derived from [Resemble AI's Chatterbox](https://github.com/resemble-ai/chatterbox). The export pipeline code is MIT licensed. Please refer to the original [Chatterbox license](https://github.com/resemble-ai/chatterbox/blob/main/LICENSE) for model weights usage terms. | |