---
language:
- en
- zh
license: apache-2.0
tags:
- audio
- automatic-speech-recognition
- asr
pipeline_tag: automatic-speech-recognition
---

<div align="center">
<h1>
    FireRedASR2S
    <br>
    A SOTA Industrial-Grade All-in-One ASR System
</h1>
</div>

[[Code]](https://github.com/FireRedTeam/FireRedASR2S)
[[Paper]](https://huggingface.co/papers/2603.10420)
[[Model]](https://huggingface.co/FireRedTeam)
[[Blog]](https://fireredteam.github.io/demos/firered_asr/)
[[Demo]](https://huggingface.co/spaces/FireRedTeam/FireRedASR)

FireRedASR2-LLM is the 8B+ parameter variant of the FireRedASR2 system, designed to achieve state-of-the-art performance and enable seamless end-to-end speech interaction. It adopts an Encoder-Adapter-LLM framework that leverages large language model capabilities.

The model was introduced in the paper [FireRedASR2S: A State-of-the-Art Industrial-Grade All-in-One Automatic Speech Recognition System](https://huggingface.co/papers/2603.10420).

**Authors**: Kaituo Xu, Yan Jia, Kai Huang, Junjie Chen, Wenpeng Li, Kun Liu, Feng-Long Xie, Xu Tang, Yao Hu.

## 🔥 News
- [2026.03.12] 🔥 We release the FireRedASR2S technical report. See [arXiv](https://arxiv.org/abs/2603.10420).
- [2026.03.05] 🚀 [vLLM](https://github.com/vllm-project/vllm/pull/35727) supports FireRedASR2-LLM.
- [2026.02.25] 🔥 We release **FireRedASR2-LLM model weights**. [🤗](https://huggingface.co/FireRedTeam/FireRedASR2-LLM) [🤖](https://www.modelscope.cn/models/xukaituo/FireRedASR2-LLM/)

## Sample Usage

To use this model, first follow the installation and setup instructions in the [official GitHub repository](https://github.com/FireRedTeam/FireRedASR2S).

```python
from fireredasr2s.fireredasr2 import FireRedAsr2, FireRedAsr2Config

batch_uttid = ["hello_zh", "hello_en"]
batch_wav_path = ["assets/hello_zh.wav", "assets/hello_en.wav"]

# FireRedASR2-LLM configuration
asr_config = FireRedAsr2Config(
    use_gpu=True,
    decode_min_len=0,
    repetition_penalty=1.0,
    llm_length_penalty=0.0,
    temperature=1.0
)

# Load the model
model = FireRedAsr2.from_pretrained("llm", "FireRedTeam/FireRedASR2-LLM", asr_config)

# Transcribe
results = model.transcribe(batch_uttid, batch_wav_path)
print(results)
# [{'uttid': 'hello_zh', 'text': '你好世界', 'rtf': '0.0681', 'wav': 'assets/hello_zh.wav'},
#  {'uttid': 'hello_en', 'text': 'hello speech', 'rtf': '0.0681', 'wav': 'assets/hello_en.wav'}]
```
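`transcribe` takes parallel lists of utterance IDs and wav paths. For a folder of wav files, those lists can be built with the standard library; this is a small sketch (the directory name and the commented model call are illustrative, not part of the FireRedASR2S API):

```python
from pathlib import Path

def build_batch(wav_dir: str) -> tuple[list[str], list[str]]:
    """Collect parallel (uttid, wav_path) lists, one entry per .wav file."""
    paths = sorted(Path(wav_dir).glob("*.wav"))
    return [p.stem for p in paths], [str(p) for p in paths]

# batch_uttid, batch_wav_path = build_batch("assets")
# results = model.transcribe(batch_uttid, batch_wav_path)
```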

## Evaluation

FireRedASR2-LLM achieves state-of-the-art accuracy on Mandarin and a wide range of Chinese dialects (CER = character error rate, lower is better):

| Metric | FireRedASR2-LLM | Doubao-ASR | Qwen3-ASR | Fun-ASR |
|:---:|:---:|:---:|:---:|:---:|
| **Avg CER (Mandarin, 4 sets)** | **2.89** | 3.69 | 3.76 | 4.16 |
| **Avg CER (Dialects, 19 sets)** | **11.55** | 15.39 | 11.85 | 12.76 |

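CER is the Levenshtein (edit) distance between hypothesis and reference characters, divided by the reference length. A minimal sketch of the metric (not the evaluation code used in the paper):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    # Standard dynamic-programming Levenshtein distance over characters.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (0 cost on match)
            ))
        prev = curr
    return prev[-1] / len(ref)

print(cer("你好世界", "你好地界"))  # one substitution over 4 chars -> 0.25
```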
## FAQ
**Q: What audio format is supported?**
16 kHz 16-bit mono PCM wav. You can convert other formats with ffmpeg:
`ffmpeg -i <input_audio_path> -ar 16000 -ac 1 -acodec pcm_s16le -f wav <output_wav_path>`
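The conversion above is easy to script. A sketch using Python's `subprocess` (assumes `ffmpeg` is on `PATH`; the function name and `run` flag are illustrative):

```python
import subprocess

def ffmpeg_to_16k_mono(src: str, dst: str, run: bool = True) -> list[str]:
    """Build (and by default run) the ffmpeg conversion from the FAQ above."""
    cmd = [
        "ffmpeg", "-i", src,
        "-ar", "16000",          # resample to 16 kHz
        "-ac", "1",              # downmix to mono
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
        "-f", "wav", dst,
    ]
    if run:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd
```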

**Q: What are the input length limitations?**
FireRedASR2-LLM supports audio input up to 40 seconds.

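Given the 40-second cap, it can help to check durations before transcription. A sketch using the standard-library `wave` module (assumes 16 kHz PCM wav input as described in the FAQ above; the 40 s constant comes from this model card):

```python
import wave

MAX_SECONDS = 40.0  # FireRedASR2-LLM input limit per this model card

def wav_duration(path: str) -> float:
    """Duration in seconds of a PCM wav file."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def fits_model_limit(path: str) -> bool:
    """True if the file is within the model's 40 s input limit."""
    return wav_duration(path) <= MAX_SECONDS
```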
## Citation
```bibtex
@article{xu2026fireredasr2s,
  title={FireRedASR2S: A State-of-the-Art Industrial-Grade All-in-One Automatic Speech Recognition System},
  author={Xu, Kaituo and Jia, Yan and Huang, Kai and Chen, Junjie and Li, Wenpeng and Liu, Kun and Xie, Feng-Long and Tang, Xu and Hu, Yao},
  journal={arXiv preprint arXiv:2603.10420},
  year={2026}
}
```