FenomAI zhouyx1998 committed
Commit 2f1090b · 0 parent(s)

Duplicate from openbmb/VoxCPM2

Co-authored-by: Yixuan Zhou <zhouyx1998@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,226 @@
+ ---
+ language:
+ - zh
+ - en
+ - ar
+ - my
+ - da
+ - nl
+ - fi
+ - fr
+ - de
+ - el
+ - he
+ - hi
+ - id
+ - it
+ - ja
+ - km
+ - ko
+ - lo
+ - ms
+ - no
+ - pl
+ - pt
+ - ru
+ - es
+ - sw
+ - sv
+ - tl
+ - th
+ - tr
+ - vi
+ license: apache-2.0
+ library_name: voxcpm
+ tags:
+ - text-to-speech
+ - tts
+ - multilingual
+ - voice-cloning
+ - voice-design
+ - diffusion
+ - audio
+ pipeline_tag: text-to-speech
+ ---
+
+ # VoxCPM2
+
+ **VoxCPM2** is a tokenizer-free, diffusion autoregressive text-to-speech model — **2B parameters**, **30 languages**, **48kHz** audio output, trained on over **2 million hours** of multilingual speech data.
+
+ [![GitHub](https://img.shields.io/badge/GitHub-VoxCPM-blue?logo=github)](https://github.com/OpenBMB/VoxCPM)
+ [![Docs](https://img.shields.io/badge/Docs-ReadTheDocs-8CA1AF)](https://voxcpm.readthedocs.io/en/latest/)
+ [![Demo](https://img.shields.io/badge/Live%20Playground-Demo-orange)](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo)
+ [![Audio Samples](https://img.shields.io/badge/Audio%20Samples-Demo%20Page-green)](https://openbmb.github.io/voxcpm2-demopage)
+ [![Discord](https://img.shields.io/badge/Discord-VoxCPM-5865F2?logo=discord&logoColor=white)](https://discord.gg/KZUx7tVNwz)
+ [![Lark](https://img.shields.io/badge/Lark%20Group-VoxCPM-00D6B9?logo=lark&logoColor=white)](https://applink.feishu.cn/client/chat/chatter/add_by_link?link_token=acds0b9d-23d8-4d7e-b696-d200f3e22a7f)
+
+ ## Highlights
+
+ - 🌍 **30-Language Multilingual** — No language tag needed; input text in any supported language directly
+ - 🎨 **Voice Design** — Generate a novel voice from a natural-language description alone (gender, age, tone, emotion, pace…); no reference audio required
+ - 🎛️ **Controllable Cloning** — Clone any voice from a short clip, with optional style guidance to steer emotion, pace, and expression while preserving timbre
+ - 🎙️ **Ultimate Cloning** — Provide reference audio plus its transcript for audio-continuation cloning; every vocal nuance is faithfully reproduced
+ - 🔊 **48kHz Studio-Quality Output** — Accepts 16kHz reference audio; outputs 48kHz via AudioVAE V2's built-in super-resolution, no external upsampler needed
+ - 🧠 **Context-Aware Synthesis** — Automatically infers appropriate prosody and expressiveness from the text content
+ - ⚡ **Real-Time Streaming** — RTF as low as ~0.3 on an NVIDIA RTX 4090, and ~0.13 when accelerated with [Nano-vLLM](https://github.com/a710128/nanovllm-voxcpm)
+ - 📜 **Fully Open-Source & Commercial-Ready** — Apache-2.0 license, free for commercial use
+
+
+ <details>
+ <summary><b>Supported Languages (30)</b></summary>
+
+ Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese
+
+ Chinese dialects: Sichuanese (四川话), Cantonese (粤语), Wu (吴语), Northeastern Mandarin (东北话), Henan (河南话), Shaanxi (陕西话), Shandong (山东话), Tianjin (天津话), Hokkien (闽南话)
+ </details>
+
+ ## Quick Start
+
+ ### Installation
+
+ ```bash
+ pip install voxcpm
+ ```
+
+ **Requirements:** Python ≥ 3.10, PyTorch ≥ 2.5.0, CUDA ≥ 12.0 · [Full Quick Start →](https://voxcpm.readthedocs.io/en/latest/quickstart.html)
+
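A quick runtime sanity check for the requirements above. This is a minimal sketch of ours (the `meets_minimum` helper and its version parsing are not part of `voxcpm`); the torch probe is optional so it also runs where PyTorch is not installed:

```python
import re
import sys
from importlib import metadata, util

def meets_minimum(version: str, minimum: tuple) -> bool:
    """True if a dotted version string satisfies a minimum version tuple."""
    nums = [int(n) for n in re.findall(r"\d+", version)[:len(minimum)]]
    nums += [0] * (len(minimum) - len(nums))  # pad "2.5" -> (2, 5, 0)
    return tuple(nums) >= tuple(minimum)

print("Python >= 3.10:", sys.version_info >= (3, 10))
if util.find_spec("torch") is not None:  # torch is optional for this check
    print("PyTorch >= 2.5.0:", meets_minimum(metadata.version("torch"), (2, 5, 0)))
else:
    print("PyTorch not installed")
```

The digit-based parsing also handles local version suffixes such as `2.5.0+cu121`.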
+ ### Text-to-Speech
+
+ ```python
+ from voxcpm import VoxCPM
+ import soundfile as sf
+
+ model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)
+
+ wav = model.generate(
+     text="VoxCPM2 brings multilingual support, creative voice design, and controllable voice cloning.",
+     cfg_value=2.0,
+     inference_timesteps=10,
+ )
+ sf.write("output.wav", wav, model.tts_model.sample_rate)
+ ```
+
+ ### Voice Design
+
+ Put the voice description in parentheses at the start of `text`, followed by the content to synthesize:
+
+ ```python
+ wav = model.generate(
+     text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
+     cfg_value=2.0,
+     inference_timesteps=10,
+ )
+ sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
+ ```
+
+ ### Controllable Voice Cloning
+
+ ```python
+ # Basic cloning
+ wav = model.generate(
+     text="This is a cloned voice generated by VoxCPM2.",
+     reference_wav_path="speaker.wav",
+ )
+ sf.write("clone.wav", wav, model.tts_model.sample_rate)
+
+ # Cloning with style control
+ wav = model.generate(
+     text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
+     reference_wav_path="speaker.wav",
+     cfg_value=2.0,
+     inference_timesteps=10,
+ )
+ sf.write("controllable_clone.wav", wav, model.tts_model.sample_rate)
+ ```
+
+ ### Ultimate Cloning
+
+ Provide both the reference audio and its exact transcript for maximum fidelity. Pass the same clip to both `reference_wav_path` and `prompt_wav_path` for the highest similarity:
+
+ ```python
+ wav = model.generate(
+     text="This is an ultimate cloning demonstration using VoxCPM2.",
+     prompt_wav_path="speaker_reference.wav",
+     prompt_text="The transcript of the reference audio.",
+     reference_wav_path="speaker_reference.wav",
+ )
+ sf.write("hifi_clone.wav", wav, model.tts_model.sample_rate)
+ ```
+
+ ### Streaming
+
+ ```python
+ import numpy as np
+
+ chunks = []
+ for chunk in model.generate_streaming(text="Streaming is easy with VoxCPM!"):
+     chunks.append(chunk)
+ wav = np.concatenate(chunks)
+ sf.write("streaming.wav", wav, model.tts_model.sample_rate)
+ ```
+
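The RTF figures quoted in the Highlights are wall-clock synthesis time divided by the duration of the generated audio. It can be measured around any `generate()` call like this (the helper is a sketch of ours; the example numbers are illustrative, not a benchmark):

```python
import time

def real_time_factor(synthesis_seconds: float, num_samples: int, sample_rate: int) -> float:
    """RTF = synthesis wall time / generated audio duration (< 1.0 is faster than real time)."""
    return synthesis_seconds / (num_samples / sample_rate)

# Around a real call (hypothetical usage):
#   t0 = time.perf_counter()
#   wav = model.generate(text="...")
#   rtf = real_time_factor(time.perf_counter() - t0, len(wav), model.tts_model.sample_rate)

# 3 s of compute for 10 s of 48 kHz audio:
print(real_time_factor(3.0, 480_000, 48_000))  # 0.3
```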
+ ## Model Details
+
+ | Property | Value |
+ |---|---|
+ | Architecture | Tokenizer-free diffusion autoregressive (LocEnc → TSLM → RALM → LocDiT) |
+ | Backbone | Based on MiniCPM-4; ~2B parameters in total |
+ | Audio VAE | AudioVAE V2 (asymmetric encode/decode, 16kHz in → 48kHz out) |
+ | Training Data | 2M+ hours of multilingual speech |
+ | LM Token Rate | 6.25 Hz |
+ | Max Sequence Length | 8192 tokens |
+ | dtype | bfloat16 |
+ | VRAM | ~8 GB |
+ | RTF (RTX 4090) | ~0.30 (standard) / ~0.13 (Nano-vLLM) |
+
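The ~2B figure can be roughly sanity-checked from the backbone hyperparameters in `config.json` (hidden 2048, 28 layers, intermediate 6144, 2 KV heads × 128 channels, vocab 73448). This is a back-of-envelope count assuming a Llama-style decoder with grouped-query attention; norms are ignored and the non-backbone modules (LocEnc, RALM, LocDiT, AudioVAE) are not counted, which is why it lands below 2B:

```python
def gqa_decoder_params(hidden, layers, intermediate, kv_heads, kv_channels, vocab):
    """Rough weight count for a Llama-style decoder with grouped-query attention."""
    kv_dim = kv_heads * kv_channels                   # 2 * 128 = 256
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # Wq, Wo + Wk, Wv
    mlp = 3 * hidden * intermediate                   # gate, up, down projections
    return layers * (attn + mlp) + vocab * hidden     # plus token embedding

total = gqa_decoder_params(2048, 28, 6144, 2, 128, 73448)
print(f"{total / 1e9:.2f}B backbone weights")  # 1.47B; the other modules bring the model to ~2B
```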
+ ## Performance
+
+ VoxCPM2 achieves state-of-the-art or competitive results on major zero-shot and controllable TTS benchmarks.
+
+ See the [GitHub repo](https://github.com/OpenBMB/VoxCPM#-performance) for the full benchmark tables (Seed-TTS-eval, CV3-eval, InstructTTSEval, MiniMax Multilingual Test).
+
+ ## Fine-tuning
+
+ VoxCPM2 supports both full SFT and LoRA fine-tuning with as little as 5–10 minutes of audio:
+
+ ```bash
+ # LoRA fine-tuning (recommended)
+ python scripts/train_voxcpm_finetune.py \
+     --config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml
+
+ # Full fine-tuning
+ python scripts/train_voxcpm_finetune.py \
+     --config_path conf/voxcpm_v2/voxcpm_finetune_all.yaml
+ ```
+
+ See the [Fine-tuning Guide](https://voxcpm.readthedocs.io/en/latest/finetuning/finetune.html) for full instructions.
+
+ ## Limitations
+
+ - Voice Design and Style Control results may vary between runs; generating 1–3 candidates is recommended to obtain the desired output.
+ - Performance varies across languages depending on training data availability.
+ - Occasional instability may occur with very long or highly expressive inputs.
+ - Use for impersonation, fraud, or disinformation is **strictly forbidden**. AI-generated content should be clearly labeled.
+
+ ## Citation
+
+ ```bibtex
+ @article{voxcpm2_2026,
+   title   = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},
+   author  = {VoxCPM Team},
+   journal = {GitHub},
+   year    = {2026},
+ }
+
+ @article{voxcpm2025,
+   title   = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning},
+   author  = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and
+              Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and
+              Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
+   journal = {arXiv preprint arXiv:2509.24650},
+   year    = {2025},
+ }
+ ```
+
+ ## License
+
+ Released under the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license, free for commercial use. For production deployments, we recommend thorough testing and safety evaluation tailored to your use case.
audiovae.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:94b5d51e107e0507d4acc976cfdadb64edd6fd06d1f751dadbf2fd1594274bf1
+ size 376951122
config.json ADDED
@@ -0,0 +1,67 @@
+ {
+   "architecture": "voxcpm2",
+   "lm_config": {
+     "bos_token_id": 1,
+     "eos_token_id": 2,
+     "hidden_size": 2048,
+     "intermediate_size": 6144,
+     "max_position_embeddings": 32768,
+     "num_attention_heads": 16,
+     "num_hidden_layers": 28,
+     "num_key_value_heads": 2,
+     "rms_norm_eps": 1e-05,
+     "rope_theta": 10000,
+     "kv_channels": 128,
+     "rope_scaling": {
+       "type": "longrope",
+       "long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.615569542115128, 5.2684819496549835, 6.014438591970396, 6.858830049237097, 7.804668263503327, 8.851768731513417, 9.99600492938444, 11.228766118181639, 12.536757560834843, 13.902257701387796, 15.303885189125953, 16.717837610115794, 18.119465097853947, 19.484965238406907, 20.792956681060105, 22.02571786985731, 23.16995406772833, 24.217054535738416, 25.16289275000465, 26.007284207271347, 26.753240849586767, 27.40615325712662, 27.973003419175363, 28.461674954469114, 28.880393889607006, 29.237306864684626, 29.540186419591297, 29.79624387177199, 30.01202719065413, 30.193382037992453, 30.34545697551969, 30.47273746338473, 30.579096895249787, 30.66785612408345, 30.741845563814174, 30.80346599254902, 30.85474569563567, 30.897392663720595, 30.932841297560394, 30.962293553185553, 30.986754758742034, 31.007064503249293, 31.02392307921529],
+       "short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.615569542115128, 5.2684819496549835, 6.014438591970396, 6.858830049237097, 7.804668263503327, 8.851768731513417, 9.99600492938444, 11.228766118181639, 12.536757560834843, 13.902257701387796, 15.303885189125953, 16.717837610115794, 18.119465097853947, 19.484965238406907, 20.792956681060105, 22.02571786985731, 23.16995406772833, 24.217054535738416, 25.16289275000465, 26.007284207271347, 26.753240849586767, 27.40615325712662, 27.973003419175363, 28.461674954469114, 28.880393889607006, 29.237306864684626, 29.540186419591297, 29.79624387177199, 30.01202719065413, 30.193382037992453, 30.34545697551969, 30.47273746338473, 30.579096895249787, 30.66785612408345, 30.741845563814174, 30.80346599254902, 30.85474569563567, 30.897392663720595, 30.932841297560394, 30.962293553185553, 30.986754758742034, 31.007064503249293, 31.02392307921529],
+       "original_max_position_embeddings": 32768
+     },
+     "vocab_size": 73448,
+     "use_mup": false,
+     "scale_emb": 12,
+     "dim_model_base": 256,
+     "scale_depth": 1.4
+   },
+   "patch_size": 4,
+   "feat_dim": 64,
+   "scalar_quantization_latent_dim": 512,
+   "scalar_quantization_scale": 9,
+   "residual_lm_num_layers": 8,
+   "residual_lm_no_rope": true,
+   "encoder_config": {
+     "hidden_dim": 1024,
+     "ffn_dim": 4096,
+     "num_heads": 16,
+     "num_layers": 12,
+     "kv_channels": 128
+   },
+   "dit_config": {
+     "hidden_dim": 1024,
+     "ffn_dim": 4096,
+     "num_heads": 16,
+     "num_layers": 12,
+     "kv_channels": 128,
+     "mean_mode": false,
+     "cfm_config": {
+       "sigma_min": 1e-06,
+       "solver": "euler",
+       "t_scheduler": "log-norm",
+       "inference_cfg_rate": 2.0
+     }
+   },
+   "audio_vae_config": {
+     "encoder_dim": 128,
+     "encoder_rates": [2, 5, 8, 8],
+     "latent_dim": 64,
+     "decoder_dim": 2048,
+     "decoder_rates": [8, 6, 5, 2, 2, 2],
+     "sr_bin_boundaries": [20000, 30000, 40000],
+     "sample_rate": 16000,
+     "out_sample_rate": 48000
+   },
+   "max_length": 8192,
+   "device": "cuda",
+   "dtype": "bfloat16"
+ }
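The `audio_vae_config` above also explains the 6.25 Hz LM token rate listed in the README: the encoder downsamples 16 kHz audio by the product of `encoder_rates` (640×) to a 25 Hz latent sequence, and `patch_size = 4` groups latent frames into LM tokens:

```python
from math import prod

sample_rate = 16_000          # audio_vae_config.sample_rate
encoder_rates = [2, 5, 8, 8]  # audio_vae_config.encoder_rates
patch_size = 4                # top-level patch_size

latent_rate = sample_rate / prod(encoder_rates)  # latent frames per second
lm_token_rate = latent_rate / patch_size         # tokens per second seen by the LM
print(latent_rate, lm_token_rate)  # 25.0 6.25
```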
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f7f964cfa9da23653baec6e6f7750719977ad944ed9f95fe52fe3a620506891d
+ size 4580080592
special_tokens_map.json ADDED
@@ -0,0 +1,81 @@
+ {
+   "additional_special_tokens": [
+     {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false
+     },
+     {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false
+     },
+     {
+       "content": "<|tool_call|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false
+     },
+     {
+       "content": "<|execute_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false
+     },
+     {
+       "content": "<|execute_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false
+     },
+     {
+       "content": "<|fim_prefix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false
+     },
+     {
+       "content": "<|fim_middle|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false
+     },
+     {
+       "content": "<|fim_suffix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false
+     }
+   ],
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenization_voxcpm2.py ADDED
@@ -0,0 +1,72 @@
+ """Custom tokenizer for VoxCPM2 that splits multi-character Chinese tokens.
+
+ VoxCPM2 was trained with ``mask_multichar_chinese_tokens``, which splits
+ multi-character Chinese tokens (e.g. "你好" -> ["你", "好"]) into individual
+ character IDs before embedding. The base LlamaTokenizerFast produces
+ multi-character Chinese tokens that the model never saw during training,
+ yielding garbled Chinese audio output in downstream inference frameworks.
+
+ This module provides ``VoxCPM2Tokenizer``, which transparently applies the
+ character splitting inside ``encode()`` and ``__call__()``, so any downstream
+ consumer (vLLM, vLLM-Omni, Nano-vLLM, etc.) gets correct single-character
+ IDs without code changes.
+ """
+
+ from transformers import LlamaTokenizerFast
+
+
+ class VoxCPM2Tokenizer(LlamaTokenizerFast):
+
+     def __init__(self, *args, **kwargs):
+         super().__init__(*args, **kwargs)
+         self._split_map = self._build_split_map()
+
+     def _build_split_map(self) -> dict[int, list[int]]:
+         vocab = self.get_vocab()
+         split_map: dict[int, list[int]] = {}
+         for token, tid in vocab.items():
+             # "▁" (U+2581) marks a leading space in SentencePiece tokens.
+             clean = token.replace("\u2581", "")
+             if len(clean) >= 2 and all(self._is_cjk(c) for c in clean):
+                 char_ids = self.convert_tokens_to_ids(list(clean))
+                 # Only map tokens whose every character exists in the vocab.
+                 if all(c != self.unk_token_id for c in char_ids):
+                     split_map[tid] = char_ids
+         return split_map
+
+     @staticmethod
+     def _is_cjk(c: str) -> bool:
+         return (
+             "\u4e00" <= c <= "\u9fff"              # CJK Unified Ideographs
+             or "\u3400" <= c <= "\u4dbf"           # Extension A
+             or "\uf900" <= c <= "\ufaff"           # Compatibility Ideographs
+             or "\U00020000" <= c <= "\U0002a6df"   # Extension B
+         )
+
+     def _expand_ids(self, ids: list[int]) -> list[int]:
+         result: list[int] = []
+         for tid in ids:
+             expansion = self._split_map.get(tid)
+             if expansion is not None:
+                 result.extend(expansion)
+             else:
+                 result.append(tid)
+         return result
+
+     def encode(self, text, *args, **kwargs):
+         ids = super().encode(text, *args, **kwargs)
+         return self._expand_ids(ids)
+
+     def __call__(self, text, *args, **kwargs):
+         result = super().__call__(text, *args, **kwargs)
+         if hasattr(result, "input_ids"):
+             ids = result["input_ids"]
+             if isinstance(ids, list) and ids and isinstance(ids[0], list):
+                 result["input_ids"] = [self._expand_ids(x) for x in ids]
+                 if "attention_mask" in result:
+                     result["attention_mask"] = [
+                         [1] * len(x) for x in result["input_ids"]
+                     ]
+             elif isinstance(ids, list):
+                 result["input_ids"] = self._expand_ids(ids)
+                 if "attention_mask" in result:
+                     result["attention_mask"] = [1] * len(result["input_ids"])
+         return result
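The character-splitting behavior above can be exercised in isolation. This toy reproduction of the `_expand_ids` step uses made-up IDs and a hand-built split map (no transformers dependency), purely for illustration:

```python
# Toy split map: pretend ID 500 is a multi-character token such as "你好",
# which should expand to the single-character IDs 11 ("你") and 12 ("好").
split_map = {500: [11, 12]}

def expand_ids(ids, split_map):
    out = []
    for tid in ids:
        out.extend(split_map.get(tid, [tid]))  # expand if mapped, else keep as-is
    return out

print(expand_ids([1, 500, 7], split_map))  # [1, 11, 12, 7]
```

Unmapped IDs pass through untouched, so non-Chinese text is tokenized exactly as by the base tokenizer.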
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,218 @@
+ {
+   "add_bos_token": true,
+   "add_eos_token": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "<|audio_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "<|audio_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "<|audio_prompt_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "104": {
+       "content": "<|audio_prompt_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "105": {
+       "content": "<|background|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "106": {
+       "content": "<|/background|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "107": {
+       "content": "<|characters|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "108": {
+       "content": "<|/characters|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "109": {
+       "content": "<|speaker_id|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "110": {
+       "content": "<|/speaker_id|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "111": {
+       "content": "<|span|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "112": {
+       "content": "<|/span|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "73440": {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "73441": {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "73442": {
+       "content": "<|tool_call|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "73443": {
+       "content": "<|execute_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "73444": {
+       "content": "<|execute_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "73445": {
+       "content": "<|fim_prefix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "73446": {
+       "content": "<|fim_middle|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "73447": {
+       "content": "<|fim_suffix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [
+     "<|im_end|>",
+     "<|im_start|>",
+     "<|tool_call|>",
+     "<|execute_start|>",
+     "<|execute_end|>",
+     "<|fim_prefix|>",
+     "<|fim_middle|>",
+     "<|fim_suffix|>"
+   ],
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|im_end|>",
+   "legacy": true,
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": null,
+   "sp_model_kwargs": {},
+   "spaces_between_special_tokens": false,
+   "tokenizer_class": "VoxCPM2Tokenizer",
+   "unk_token": "<unk>",
+   "use_default_system_prompt": false,
+   "chat_template": "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+   "auto_map": {
+     "AutoTokenizer": [
+       "tokenization_voxcpm2.VoxCPM2Tokenizer",
+       null
+     ]
+   }
+ }