Add pipeline tag and link to paper

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +110 -1
README.md CHANGED
@@ -1,12 +1,120 @@
1
  ---
2
  license: other
3
  license_name: license-term-of-stabletoken
4
  language:
5
  - en
6
  - zh
 
 
7
  tags:
8
  - speech tokenizer
9
- ---
 
 
10
  # StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs (ICLR 2026)
11
 
12
  **StableToken** is a noise-robust semantic speech tokenizer that performs discrete speech representation learning, achieving state-of-the-art stability in noisy environments.
@@ -104,3 +212,4 @@ Measurements on LibriSpeech (LS) and SEED benchmarks.
104
  ## License
105
 
106
  This project is licensed under the [License Term of StableToken](LICENSE).
 
 
1
  ---
2
+ language:
3
+ - en
4
+ - zh
5
  license: other
6
  license_name: license-term-of-stabletoken
7
+ tags:
8
+ - speech tokenizer
9
+ pipeline_tag: audio-to-audio
10
+ ---
11
+
12
+ # StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs (ICLR 2026)
13
+
14
+ **StableToken** is a noise-robust semantic speech tokenizer that performs discrete speech representation learning, achieving state-of-the-art stability in noisy environments.
15
+
16
+ 📄 [Paper](https://huggingface.co/papers/2509.22220) | 💻 [GitHub](https://github.com/Tencent/StableToken)
17
+
18
+ For code and more detailed information, please refer to the corresponding [GitHub repository](https://github.com/Tencent/StableToken).
19
+
20
+ ## Model Details
21
+
22
+ | Attribute | Value |
23
+ |:----------|:------|
24
+ | Frame Rate | 25 Hz |
25
+ | Codebook Size | 8,192 |
26
+ | BPS (Bits Per Second) | 325 |
27
+
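The BPS figure in the table follows from the other two attributes. A minimal sketch of the arithmetic, assuming each token costs a whole number of bits (the base-2 logarithm of the codebook size, rounded up). The function name `bits_per_second` is illustrative and is not part of the StableToken repository:

```python
def bits_per_second(frame_rate_hz: float, codebook_size: int) -> float:
    """Bitrate of a single-codebook tokenizer, assuming whole bits per token."""
    # (n - 1).bit_length() equals ceil(log2(n)) for any integer n >= 1.
    bits_per_token = (codebook_size - 1).bit_length()
    return frame_rate_hz * bits_per_token


# StableToken: 25 tokens per second, 13 bits per token (2**13 == 8192).
stabletoken_bps = bits_per_second(25, 8192)
print(stabletoken_bps)  # 325
```

Under the same rounding assumption, the function also reproduces the BPS values listed for the other tokenizers in the Reconstruction Quality table (175, 300, and 325).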
28
+ ## Quick Start
29
+
30
+ To use StableToken, please clone the official repository and install dependencies.
31
+
32
+ ### Installation
33
+
34
+ ```bash
35
+ git clone --recursive https://github.com/Tencent/StableToken.git
36
+ cd StableToken && pip install -r requirements.txt
37
+ ```
38
+
39
+ ### Inference
40
+
41
+ ```python
42
+ import os
43
+ from huggingface_hub import snapshot_download
44
+ from transformers import WhisperFeatureExtractor
45
+ from src.model.modeling_whisper import WhisperLFQEncoder
46
+ from src.utils.flow_inference import AudioDecoder
47
+ from src.utils.utils import extract_speech_token, speech_token_to_wav
48
+
49
+ # 1. Download & Load Models
50
+ model_dir = snapshot_download("tencent/StableToken")
51
+
52
+ # Load Tokenizer
53
+ tokenizer = WhisperLFQEncoder.from_pretrained(os.path.join(model_dir, "tokenizer")).eval().cuda()
54
+ feature_extractor = WhisperFeatureExtractor.from_pretrained(os.path.join(model_dir, "tokenizer"))
55
+
56
+ # Load Decoder
57
+ decoder = AudioDecoder(
58
+ config_path=os.path.join(model_dir, "decoder", "config.yaml"),
59
+ flow_ckpt_path=os.path.join(model_dir, "decoder", "flow.pt"),
60
+ hift_ckpt_path=os.path.join(model_dir, "decoder", "hift.pt"),
61
+ device="cuda"
62
+ )
63
+
64
+ # 2. Tokenize
65
+ tokens = extract_speech_token(tokenizer, feature_extractor, ["/path/to/audio.wav"], device="cuda")[0]
66
+
67
+ # 3. Reconstruct
68
+ tts_speech, sampling_rate = speech_token_to_wav(decoder, tokens)
69
+ ```
70
+
71
+ ## Performance
72
+
73
+ StableToken achieves **60% lower UED** (Unit Edit Distance) than the best existing supervised semantic tokenizers.
74
+
75
+ ### Noise Robustness (UED ↓)
76
+
77
+ | Model | Frame Rate | Codebook Size | UED (%, ↓) |
78
+ |:---|:---:|:---:|:---:|
79
+ | [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 12.5Hz | 16,384 | 31.10 |
80
+ | [S3 Tokenizer](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 4,096 | 26.17 |
81
+ | [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 6,561 | 38.66 |
82
+ | **StableToken** | 25Hz | 8,192 | **10.17** 🏆 |
83
+
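The headline reduction can be checked against the UED column above. A short sketch, assuming the comparison baseline is the lowest-UED competing tokenizer in the table (S3 Tokenizer at 26.17%). The helper name `relative_reduction` is illustrative:

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional drop from `baseline` to `improved`."""
    return (baseline - improved) / baseline


# UED values (%) copied from the Noise Robustness table.
baseline_ued = {
    "GLM-4-Voice-Tokenizer": 31.10,
    "S3 Tokenizer": 26.17,
    "CosyVoice2": 38.66,
}
stabletoken_ued = 10.17

best = min(baseline_ued.values())  # strongest competing baseline
reduction = relative_reduction(best, stabletoken_ued)
print(f"{reduction:.1%}")  # 61.1%
```

Measured against the strongest baseline, the drop is about 61%, consistent with the rounded "60% lower UED" claim.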
84
+ ### Reconstruction Quality
85
+
86
+ Measurements on LibriSpeech (LS) and SEED benchmarks.
87
+
88
+ | Model | Frame<br>Rate | BPS | WER (↓)<br>LS-clean | WER (↓)<br>LS-other | WER (↓)<br>SEED-en | WER (↓)<br>SEED-zh | MOS (↑)<br>LS-clean | MOS (↑)<br>LS-other | MOS (↑)<br>SEED-en | MOS (↑)<br>SEED-zh |
89
+ |:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
90
+ | [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 12.5Hz | 175 | 4.04 | 9.33 | 3.54 | 3.23 | 4.07 | **3.99** | **4.16** | 4.10 |
91
+ | [S3 Tokenizer](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 300 | 5.78 | 13.38 | 5.91 | 4.26 | 3.40 | 3.31 | 3.40 | 3.31 |
92
+ | [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 325 | 4.25 | 9.68 | 4.34 | 2.75 | 3.36 | 3.25 | 3.31 | 3.58 |
93
+ | **StableToken** | 25Hz | 325 | **3.84** | **7.99** | **3.44** | **2.62** | **4.09** | 3.83 | 4.01 | **4.18** |
94
+
95
+ ## Citation
96
+
97
+ ```bibtex
98
+ @article{song2025stabletoken,
99
+ title={StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs},
100
+ author={Song, Yuhan and Zhang, Linhao and Wu, Chuhan bitwise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks. Our code and model are publicly available at this https URL .
101
+
102
+ # Current model card
103
+
104
+ The README of the model repository currently looks like this:
105
+
106
+ ## Metadata
107
+ ```yaml
108
  language:
109
  - en
110
  - zh
111
+ license: other
112
+ license_name: license-term-of-stabletoken
113
  tags:
114
  - speech tokenizer
115
+ ```
116
+
117
+ ## Content
118
  # StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs (ICLR 2026)
119
 
120
  **StableToken** is a noise-robust semantic speech tokenizer that performs discrete speech representation learning, achieving state-of-the-art stability in noisy environments.
 
212
  ## License
213
 
214
  This project is licensed under the [License Term of StableToken](LICENSE).
215
+ ```