	---
	language:
	- multilingual
	- en
	- zh
	- fr
	- de
	- es
	- ja
	- ko
	- pt
	- it
	- ru
	- ar
	- hi
	- tr
	- pl
	- nl
	- sv
	- da
	- fi
	- no
	- cs
	- ro
	- hu
	tags:
	- text-to-speech
	- executorch
	- on-device
	- android
	- voice-cloning
	- chatterbox
	license: apache-2.0
	---

	# Chatterbox Multilingual TTS — ExecuTorch Models

	Pre-exported `.pte` model files for running [Resemble AI's Chatterbox Multilingual TTS](https://github.com/resemble-ai/chatterbox) fully on-device using [ExecuTorch](https://pytorch.org/executorch/).

	📦 Code & export scripts: [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch) on GitHub

	---

	## What's Here

	Nine ExecuTorch `.pte` files cover the complete TTS pipeline, from text input to a 24 kHz waveform, with no PyTorch runtime required:

	| File | Size | Backend | Precision | Stage |
	|------|------|---------|-----------|-------|
	| `voice_encoder.pte` | 7 MB | portable | FP32 | Speaker embedding |
	| `xvector_encoder.pte` | 27 MB | portable | FP32 | X-vector conditioning |
	| `t3_cond_speech_emb.pte` | 49 MB | portable | FP32 | Speech token embedding |
	| `t3_cond_enc.pte` | 18 MB | portable | FP32 | Text/conditioning encoder |
	| `t3_prefill.pte` | 1010 MB | XNNPACK | FP16 | T3 Transformer prefill |
	| `t3_decode.pte` | 1002 MB | XNNPACK | FP16 | T3 Transformer decode |
	| `s3gen_encoder.pte` | 178 MB | portable | FP32 | S3Gen Conformer encoder |
	| `cfm_step.pte` | 274 MB | XNNPACK | FP32 | CFM flow matching step |
	| `hifigan.pte` | 84 MB | XNNPACK | FP32 | HiFiGAN vocoder |
	| Total | ~2.6 GB | | | |

	---

	## Quick Download

	```python
	from huggingface_hub import snapshot_download

	snapshot_download(
	    "acul3/chatterbox-executorch",
	    local_dir="et_models",
	    repo_type="model",
	)
	```
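After downloading, it can be useful to confirm that all nine files arrived before starting a long inference run. The following is a minimal sketch: the helper name `missing_models` is hypothetical (it is not part of the GitHub repo), and the file names are copied from the table in "What's Here".

```python
from pathlib import Path

# File names taken from the table in "What's Here".
EXPECTED_FILES = [
    "voice_encoder.pte",
    "xvector_encoder.pte",
    "t3_cond_speech_emb.pte",
    "t3_cond_enc.pte",
    "t3_prefill.pte",
    "t3_decode.pte",
    "s3gen_encoder.pte",
    "cfm_step.pte",
    "hifigan.pte",
]


def missing_models(model_dir):
    """Return the expected .pte files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_models("et_models")
    print("all files present" if not missing else f"missing: {missing}")
```

The check only looks at file names, not sizes or checksums, so a truncated download would still pass.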

	---

	## Pipeline Overview

	```
	Text → MTLTokenizer → text tokens
	Reference Audio → VoiceEncoder + CAMPPlus → speaker conditioning
	↓
	T3 Prefill (LlamaModel, conditioned)
	↓
	T3 Decode (autoregressive, ~100 tokens)
	↓
	S3Gen Encoder (Conformer)
	↓
	CFM Step × 2 (flow matching)
	↓
	HiFiGAN (vocoder, chunked)
	↓
	24kHz PCM waveform 🎵
	```

	---

	## Key Technical Notes

	- T3 Decode uses a manually unrolled 30-layer Llama forward pass with static KV cache (`torch.where` writes) — bypasses HF `DynamicCache` for `torch.export` compatibility
	- HiFiGAN uses a manual real-valued DFT (cosine/sine matrix multiply) — replaces `torch.stft`/`torch.istft` which XNNPACK doesn't support
	- `t3_prefill.pte` and `t3_decode.pte` are FP16 (XNNPACK half-precision kernels): roughly half the size of FP32 with near-identical quality. The two smaller T3 conditioning models stay in FP32
	- Fixed shapes: CFM expects `T_MEL=2200`, HiFiGAN expects `T_MEL=300` (use chunked processing for longer audio)
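Two of the notes above can be illustrated in plain Python. The sketch below is not the repo's export code: `real_dft` shows the general idea of replacing a complex FFT with cosine and sine matrix products, and `chunk_frames` shows one simple way to feed a fixed-length model (such as `T_MEL=300`) with longer input. Both function names, the zero padding, and the absence of overlap between chunks are assumptions made for illustration.

```python
import math


def real_dft(frame):
    """Real-valued DFT: two matrix-vector products, no complex dtype.

    Returns (real, imag) lists for bins 0..N/2 of a real input frame.
    """
    n = len(frame)
    bins = n // 2 + 1
    # Basis matrices; in an exported model these would be constant buffers.
    cos_m = [[math.cos(2 * math.pi * k * t / n) for t in range(n)]
             for k in range(bins)]
    sin_m = [[-math.sin(2 * math.pi * k * t / n) for t in range(n)]
             for k in range(bins)]
    real = [sum(c * x for c, x in zip(row, frame)) for row in cos_m]
    imag = [sum(s * x for s, x in zip(row, frame)) for row in sin_m]
    return real, imag


def chunk_frames(frames, chunk_len=300, pad_frame=None):
    """Split a frame sequence into fixed-length chunks, padding the last one.

    Returns (chunk, valid_len) pairs so that output produced from the
    padded tail can be trimmed afterwards.
    """
    if pad_frame is None:
        # Assumption: pad with all-zero frames of the same width.
        pad_frame = [0.0] * len(frames[0]) if frames else []
    chunks = []
    for start in range(0, len(frames), chunk_len):
        piece = list(frames[start:start + chunk_len])
        valid = len(piece)
        piece.extend([pad_frame] * (chunk_len - valid))
        chunks.append((piece, valid))
    return chunks
```

For example, 650 mel frames would be split into three chunks of 300 frames with valid lengths 300, 300, and 50. A production pipeline might overlap or cross-fade chunks to avoid audible seams; that is outside the scope of this sketch.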

	---

	## Usage

	See the GitHub repo for full inference code: [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch)

	```bash
	# Clone code
	git clone https://github.com/acul3/chatterbox-executorch.git
	cd chatterbox-executorch

	# Download models (this repo)
	python -c "
	from huggingface_hub import snapshot_download
	snapshot_download('acul3/chatterbox-executorch', local_dir='et_models', repo_type='model')
	"

	# Run full PTE inference
	python test_true_full_pte.py
	```

	---

	## Android Integration

	These models are designed for Android deployment via the [ExecuTorch Android SDK](https://pytorch.org/executorch/stable/android-setup.html). Load with:

	```kotlin
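	import org.pytorch.executorch.Module

	// Assumes t3_prefill.pte has already been copied into the app's files directory.
	val module = Module.load(context.filesDir.path + "/t3_prefill.pte")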
	```

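	With QNN/NPU delegation on a Snapdragon device, expect a 10–50× speedup over the CPU timings below. No QNN measurements are included here, and the files in this repo target the portable and XNNPACK backends, so using QNN means re-exporting the models for that backend.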

	---

	## Performance (Jetson AGX Orin, CPU only)

	| Stage | Time |
	|-------|------|
	| Voice encoding | ~1s |
	| T3 prefill | ~22s |
	| T3 decode (~100 tokens) | ~800s total (~8s/token) |
	| S3Gen encoder | ~2s |
	| CFM (2 steps) | ~40s |
	| HiFiGAN | ~10s/chunk |
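To see where the time goes, the stage timings can be added up. This is a rough sketch using the approximate figures from the table; the function name and the default of a single HiFiGAN chunk are assumptions, since the table does not say how many chunks a ~100-token utterance needs.

```python
# Approximate per-stage timings in seconds, copied from the table above.
STAGE_SECONDS = {
    "voice_encoding": 1,
    "t3_prefill": 22,
    "t3_decode_per_token": 8,
    "s3gen_encoder": 2,
    "cfm_two_steps": 40,
    "hifigan_per_chunk": 10,
}


def estimate_total_seconds(n_tokens=100, n_chunks=1):
    """Sum the stage timings for n_tokens decode steps and n_chunks vocoder chunks."""
    s = STAGE_SECONDS
    return (
        s["voice_encoding"]
        + s["t3_prefill"]
        + s["t3_decode_per_token"] * n_tokens
        + s["s3gen_encoder"]
        + s["cfm_two_steps"]
        + s["hifigan_per_chunk"] * n_chunks
    )


if __name__ == "__main__":
    total = estimate_total_seconds()
    decode = STAGE_SECONDS["t3_decode_per_token"] * 100
    print(f"total: ~{total}s, decode share: {decode / total:.0%}")
```

With 100 tokens and one chunk the estimate is about 875 s, of which autoregressive decoding accounts for roughly 91%, so the T3 decode step is the main target for any acceleration.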

	---

	## License

	Model weights are derived from [Resemble AI's Chatterbox](https://github.com/resemble-ai/chatterbox). The export pipeline code is MIT licensed. Please refer to the original [Chatterbox license](https://github.com/resemble-ai/chatterbox/blob/main/LICENSE) for model weights usage terms.