tencent/StableToken

@@ -1,12 +1,120 @@
 ---
 license: other
 license_name: license-term-of-stabletoken
 language:
 - en
 - zh
 tags:
 - speech tokenizer
----
 # StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs (ICLR 2026)
 **StableToken** is a noise-robust semantic speech tokenizer that performs discrete speech representation learning, achieving state-of-the-art stability in noisy environments.
@@ -104,3 +212,4 @@ Measurements on LibriSpeech (LS) and SEED benchmarks.
 ## License
 This project is licensed under the [License Term of StableToken](LICENSE).

 ---
+language:
+- en
+- zh
 license: other
 license_name: license-term-of-stabletoken
+tags:
+- speech tokenizer
+pipeline_tag: audio-to-audio
+---
+# StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs (ICLR 2026)
+**StableToken** is a noise-robust semantic speech tokenizer that performs discrete speech representation learning, achieving state-of-the-art stability in noisy environments.
+📄 [Paper](https://huggingface.co/papers/2509.22220) | 💻 [GitHub](https://github.com/Tencent/StableToken)
+For code and more detailed information, please refer to the corresponding [GitHub repository](https://github.com/Tencent/StableToken).
+## Model Details
+| Attribute | Value |
+|:----------|:------|
+| Frame Rate | 25 Hz |
+| Codebook Size | 8,192 |
+| BPS (Bits Per Second) | 325 |
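The BPS figure follows from the other two attributes. As a minimal sketch, assuming each frame emits exactly one token and that the bits per token are rounded up to a whole number (the helper name `bits_per_second` is hypothetical and not part of the StableToken code base):

```python
import math


def bits_per_second(frame_rate_hz: float, codebook_size: int) -> float:
    """Bitrate of a single-codebook tokenizer that emits one token per frame."""
    # Assumption: the bits needed to index the codebook are rounded up to an integer.
    bits_per_token = math.ceil(math.log2(codebook_size))
    return frame_rate_hz * bits_per_token


# StableToken: 25 Hz frame rate, 8,192-entry codebook (13 bits per token).
print(bits_per_second(25, 8192))
```

With the values in the table above, 25 tokens per second at 13 bits each gives the listed 325 bits per second.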
+## Quick Start
+To use StableToken, please clone the official repository and install dependencies.
+### Installation
+```bash
+git clone --recursive https://github.com/Tencent/StableToken.git
+cd StableToken && pip install -r requirements.txt
+```
+### Inference
+```python
+import os
+from huggingface_hub import snapshot_download
+from transformers import WhisperFeatureExtractor
+from src.model.modeling_whisper import WhisperLFQEncoder
+from src.utils.flow_inference import AudioDecoder
+from src.utils.utils import extract_speech_token, speech_token_to_wav
+# 1. Download & Load Models
+model_dir = snapshot_download("tencent/StableToken")
+# Load Tokenizer
+tokenizer = WhisperLFQEncoder.from_pretrained(os.path.join(model_dir, "tokenizer")).eval().cuda()
+feature_extractor = WhisperFeatureExtractor.from_pretrained(os.path.join(model_dir, "tokenizer"))
+# Load Decoder
+decoder = AudioDecoder(
+    config_path=os.path.join(model_dir, "decoder", "config.yaml"),
+    flow_ckpt_path=os.path.join(model_dir, "decoder", "flow.pt"),
+    hift_ckpt_path=os.path.join(model_dir, "decoder", "hift.pt"),
+    device="cuda"
+)
+# 2. Tokenize
+tokens = extract_speech_token(tokenizer, feature_extractor, ["/path/to/audio.wav"], device="cuda")[0]
+# 3. Reconstruct
+tts_speech, sampling_rate = speech_token_to_wav(decoder, tokens)
+```
+## Performance
+StableToken achieves about **60% lower UED** (Unit Edit Distance) than the best existing supervised semantic tokenizer (10.17% versus 26.17% for the S3 Tokenizer).
+### Noise Robustness (UED ↓)
+| Model | Frame Rate | Codebook Size | UED (%, ↓) |
+|:---|:---:|:---:|:---:|
+| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 12.5Hz | 16,384 | 31.10 |
+| [S3 Tokenizer](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 4,096 | 26.17 |
+| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 6,561 | 38.66 |
+| **StableToken** | 25Hz | 8,192 | **10.17** 🏆 |
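For readers unfamiliar with the metric, UED compares the token sequence produced from clean audio with the one produced from a noisy version of the same audio. The sketch below is only an illustration under stated assumptions: it uses a plain Levenshtein distance and normalizes by the total number of clean tokens. The function names are hypothetical, and the exact normalization and noise conditions behind the table are not specified in this README.

```python
from typing import Sequence


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def unit_edit_distance(clean: Sequence[Sequence[int]],
                       noisy: Sequence[Sequence[int]]) -> float:
    """UED in percent. Assumption: total edits divided by total clean-token count."""
    total_edits = sum(edit_distance(c, n) for c, n in zip(clean, noisy))
    total_tokens = sum(len(c) for c in clean)
    return 100.0 * total_edits / total_tokens


# Toy example: one of four tokens changes under noise.
print(unit_edit_distance([[1, 2, 3, 4]], [[1, 2, 9, 4]]))
```

A lower value means the tokenizer assigns nearly the same tokens to clean and noisy versions of an utterance.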
+### Reconstruction Quality
+Measurements on LibriSpeech (LS) and SEED benchmarks.
+| Model | Frame<br>Rate | BPS | WER (↓)<br>LS-clean | WER (↓)<br>LS-other | WER (↓)<br>SEED-en | WER (↓)<br>SEED-zh | MOS (↑)<br>LS-clean | MOS (↑)<br>LS-other | MOS (↑)<br>SEED-en | MOS (↑)<br>SEED-zh |
+|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
+| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 12.5Hz | 175 | 4.04 | 9.33 | 3.54 | 3.23 | 4.07 | **3.99** | **4.16** | 4.10 |
+| [S3 Tokenizer](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 300 | 5.78 | 13.38 | 5.91 | 4.26 | 3.40 | 3.31 | 3.40 | 3.31 |
+| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 325 | 4.25 | 9.68 | 4.34 | 2.75 | 3.36 | 3.25 | 3.31 | 3.58 |
+| **StableToken** | 25Hz | 325 | **3.84** | **7.99** | **3.44** | **2.62** | **4.09** | 3.83 | 4.01 | **4.18** |
+## Citation
+```bibtex
+@article{song2025stabletoken,
+  title={StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs},
+  author={Song, Yuhan and Zhang, Linhao and Wu, Chuhan [...]
+```
+[...] bitwise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks. Our code and model are publicly available.
+# Current model card
+The README of the model repository currently looks like this:
+## Metadata
+```yaml
 language:
 - en
 - zh
+license: other
+license_name: license-term-of-stabletoken
 tags:
 - speech tokenizer
+```
+## Content
+```
 # StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs (ICLR 2026)
 **StableToken** is a noise-robust semantic speech tokenizer that performs discrete speech representation learning, achieving state-of-the-art stability in noisy environments.
 ## License
 This project is licensed under the [License Term of StableToken](LICENSE).
+```