Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
Zhennan Lin1, Shuai Wang2, Zhaokai Sun1, Pengyuan Xie3, Chuan Xie3, Jie Liu3, Qiang Zhang3, Lei Xie1†
†Corresponding author
1Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
2School of Intelligence Science and Technology, Nanjing University
3Shanghai Lingguang Zhaxian Technology
Speaker-Reasoner is an end-to-end Speech LLM for timestamped speaker-attributed ASR featuring agentic multi-turn temporal reasoning. Instead of single-pass inference, the model iteratively analyzes global audio structure, autonomously predicts temporal boundaries, and performs fine-grained segment analysis, jointly modeling speaker identity, gender, timestamps, and transcription. A speaker-aware cache further extends processing to audio exceeding the training context window.
🌟 Highlights
- Agentic multi-turn reasoning: iterative global-to-local inference along the temporal axis (global speaker summary → boundary prediction → fine-grained segment decoding; see the sketch after this list)
- Speaker-aware context cache: extends processing to long-form audio beyond the training context window while preserving speaker consistency across chunks
- Three-stage progressive training: multi-task foundation → temporal interaction learning → cache-conditioned decoding
- State-of-the-art performance: outperforms strong baselines including closed-source Gemini-2.5-Pro on AliMeeting and AISHELL-4
- 🔥 Bilingual & Scaled up: extended training on 4,194 hours of multi-domain data, natively supporting English and Mandarin across complex multi-speaker scenarios
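To make the interaction pattern concrete, here is a minimal pseudocode-style sketch of the global-to-local loop and the speaker-aware cache. Every callable in it (ask_model, split_into_chunks, update_speaker_cache) is a hypothetical stand-in for illustration, not part of this repository's API.

```python
# Pseudocode-style sketch of the multi-turn, global-to-local inference loop.
# All callables passed in (ask_model, split_into_chunks, update_speaker_cache)
# are hypothetical stand-ins, not functions shipped with this repository.

def transcribe_long_audio(audio, ask_model, split_into_chunks,
                          update_speaker_cache, chunk_seconds=300):
    cache = {}     # speaker-aware context carried across chunks (Stage 3 idea)
    results = []
    for chunk in split_into_chunks(audio, chunk_seconds):
        # Turn 1: global analysis. Summarize who speaks in this chunk,
        # conditioned on speaker profiles cached from earlier chunks.
        summary = ask_model(chunk, "Summarize the speakers in this audio.", cache)

        # Turn 2: boundary prediction. The model proposes temporal boundaries.
        boundaries = ask_model(chunk, "Predict segment time boundaries.", cache)

        # Turns 3+: fine-grained decoding. Transcribe each predicted segment
        # with speaker identity, gender, and timestamps attached.
        for start, end in boundaries:
            results.append(
                ask_model(chunk, f"Transcribe [{start:.1f}s, {end:.1f}s].", cache)
            )

        # Refresh the cache so speaker labels stay consistent across chunks.
        cache = update_speaker_cache(cache, summary)
    return results
```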
📊 Results
Comprehensive Multi-Domain Evaluation
We further scaled up Speaker-Reasoner with 4,194 hours of bilingual (ZH/EN) training data. The model demonstrates superior performance across diverse scenarios, including challenging video domains and various public meeting datasets.
Each cell reports WER↓ / cpWER↓ / DER↓ / ∆cp↓ (lower is better for all four).

| Test set | Gemini-2.5-Pro | VibeVoice-ASR | Speaker-Reasoner Multi-turn |
|---|---|---|---|
| Video-Internal-Eval | 22.47 / 44.13 / 74.05 / 21.66 | 16.45 / 58.60 / 47.18 / 42.15 | 6.27 / 24.43 / 15.33 / 18.16 |
| Video-Internal-Eval-zh | 18.28 / 40.97 / 69.35 / 22.69 | 17.70 / 62.06 / 47.65 / 44.36 | 6.50 / 25.81 / 16.68 / 19.31 |
| Video-Internal-Eval-en | 55.40 / 68.82 / 100.95 / 13.42 | 7.11 / 32.65 / 44.62 / 25.54 | 4.42 / 16.31 / 7.58 / 11.89 |
| AISHELL4-Eval | 19.81 / 25.11 / 36.07 / 5.30 | 22.19 / 26.16 / 8.94 / 3.97 | 7.13 / 8.14 / 3.38 / 1.01 |
| Alimeeting-Far | 30.16 / 39.29 / 56.39 / 9.13 | 34.31 / 39.92 / 19.62 / 5.61 | 19.72 / 19.92 / 6.70 / 0.20 |
| AMI-SDM | 31.66 / 39.98 / 50.28 / 8.32 | 30.53 / 35.86 / 21.00 / 5.33 | 23.29 / 25.16 / 13.56 / 1.87 |
| MLC-SLM-Eval-1 | 36.87 / 41.88 / 42.33 / 5.01 | 10.30 / 13.45 / 6.27 / 3.15 | 9.17 / 11.74 / 4.76 / 2.57 |
| MLC-SLM-Eval-2 | 26.73 / 32.19 / 46.19 / 5.46 | 7.97 / 11.38 / 3.14 / 3.41 | 8.54 / 11.76 / 4.35 / 3.22 |
Segmented Evaluation (40–50s segments)
| Model | AISHELL4-Eval DER↓ | AISHELL4-Eval CER↓ | AISHELL4-Eval cpCER↓ | AISHELL4-Eval ∆cp↓ | Alimeeting-Far DER↓ | Alimeeting-Far CER↓ | Alimeeting-Far cpCER↓ | Alimeeting-Far ∆cp↓ |
|---|---|---|---|---|---|---|---|---|
| **Cascade Baselines** | | | | | | | | |
| Pyannote3.1 + Paraformer | 8.10 | 19.18 | 26.24 | 7.06 | 19.13 | 30.15 | 45.39 | 15.24 |
| **End-to-End Baselines** | | | | | | | | |
| Gemini-2.5-Pro† | 36.07 | 19.81 | 25.11 | 5.30 | 56.39 | 30.16 | 39.29 | 9.13 |
| Qwen3-Omni-30B-A3B-Instruct | 32.42 | 14.46 | 22.22 | 7.76 | 37.15 | 25.40 | 36.28 | 10.88 |
| Qwen2.5-Omni-7B | 85.68 | 33.37 | 60.45 | 27.08 | 91.77 | 38.13 | 73.38 | 35.25 |
| SpeakerLM (212.25h) | – | 17.75 | 26.14 | 8.39 | – | 18.63 | 32.22 | 13.59 |
| SpeakerLM (7638.95h) | – | 17.17 | 18.37 | 1.20 | – | 13.97 | 16.05 | 2.08 |
| VibeVoice-ASR | 10.88 | 22.30 | 26.30 | 4.00 | 20.70 | 34.67 | 40.54 | 5.87 |
| TagSpeech-Alimeeting | 37.51 | 35.70 | 53.44 | 17.74 | 52.46 | 47.11 | 68.74 | 21.63 |
| **Ours** | | | | | | | | |
| Qwen3-Omni + SOT sft (Stage 1) | – | 17.65 | 19.59 | 1.94 | – | 24.24 | 26.03 | 1.79 |
| Speaker-Reasoner Base (Stage 1) | 6.24 | 14.04 | 16.54 | 2.50 | 8.96 | 21.16 | 22.64 | 1.48 |
| Speaker-Reasoner Multi-turn (Stage 2) | 5.19 | 13.83 | 14.93 | 1.10 | 7.47 | 20.34 | 20.29 | −0.05 |
| Speaker-Reasoner Multi-turn w/ SAC (Stage 3) | 5.26 | 13.83 | 14.73 | 0.90 | 7.34 | 20.57 | 20.43 | −0.14 |
| Speaker-Reasoner Base 7B | 12.00 | 15.65 | 25.60 | 9.95 | 18.43 | 24.97 | 38.12 | 13.15 |
| Speaker-Reasoner Multi-turn 7B | 9.38 | 15.31 | 22.91 | 7.60 | 15.56 | 24.33 | 34.81 | 10.48 |
† Closed-source model. DER unavailable for SpeakerLM and SOT-based models due to incompatible output formats.
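A note on the metrics in the tables above: cpWER (and its character-level counterpart cpCER) is the concatenated minimum-permutation error rate, i.e. the error computed after choosing the mapping between hypothesis speakers and reference speakers that minimizes it, and ∆cp, as reported here, is the gap between the permutation-resolved and speaker-agnostic rates (cpWER − WER, or cpCER − CER), which roughly isolates the cost of speaker-attribution mistakes. The toy sketch below only illustrates the idea; it is not the evaluation code behind these numbers and ignores details such as text normalization and unequal speaker counts.

```python
# Simplified sketch of cpWER (concatenated minimum-permutation WER).
# Not the official evaluation code; it assumes equal speaker counts and
# skips text normalization, purely to illustrate what the metric measures.
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]


def cpwer(refs_by_speaker, hyps_by_speaker):
    """Best error rate over all mappings of hypothesis speakers to references."""
    refs = [r.split() for r in refs_by_speaker.values()]
    total_ref_words = sum(len(r) for r in refs)
    best_errors = float("inf")
    for perm in permutations(hyps_by_speaker.values()):
        errors = sum(edit_distance(r, h.split()) for r, h in zip(refs, perm))
        best_errors = min(best_errors, errors)
    return best_errors / total_ref_words


refs = {"S1": "good morning everyone", "S2": "let us begin the meeting"}
hyps = {"A": "let us begin the meeting", "B": "good morning everyone"}
print(f"cpWER = {cpwer(refs, hyps):.2f}")  # 0.00: the best speaker mapping is found
```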
Long-form Evaluation (without segmentation)
| Model | AISHELL4-Eval DER↓ | AISHELL4-Eval cpCER↓ |
|---|---|---|
| Gemini-2.5-Pro | 15.32 | 31.59 |
| Speaker-Reasoner Multi-turn w/ SAC | 21.60 | 36.20 |
Speaker Attribute Evaluation (AISHELL4-Eval)
| Model | Gender ACC↑ | Speaker Count ACC (SCA)↑ |
|---|---|---|
| Gemini-2.5-Pro | 94.80 | 67.03 |
| Qwen3-Omni-30B-A3B-Instruct | 97.12 | 60.49 |
| Speaker-Reasoner Multi-turn | 96.80 | 69.03 |
Installation
Environment Setup
git clone https://github.com/ASLP-lab/Speaker-Reasoner.git
cd Speaker-Reasoner
conda create -n speaker-reasoner python=3.10 -y
conda activate speaker-reasoner
Install MS-Swift and dependencies:
pip install ms-swift
Model Download
We provide pre-trained model weights on Hugging Face. Download the version that matches your requirements:
| Model Version | Description | Language | Download |
|---|---|---|---|
| Speaker-Reasoner | The standard multi-turn model evaluated in the main paper. | ZH | 🤗 Hugging Face |
| Speaker-Reasoner-4194h | Scaled-up version trained on 4,194 hours of multi-domain data. | ZH/EN | 🤗 Hugging Face |
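If you prefer to fetch the weights programmatically, the standard huggingface_hub download works. The repo_id below is a placeholder; substitute the actual Hub path behind the links in the table above.

```python
# Minimal sketch: download a checkpoint with huggingface_hub.
# NOTE: "ASLP-lab/Speaker-Reasoner" is a placeholder repo_id; use the actual
# Hub path linked in the table above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ASLP-lab/Speaker-Reasoner",          # placeholder
    local_dir="./checkpoints/Speaker-Reasoner",   # where to store the weights
)
print(f"Weights downloaded to: {local_dir}")
```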
Training
Coming soon.
Inference
vLLM
Speaker-Reasoner is built on top of Qwen3-Omni-30B-A3B-Instruct. To run it, you will need to install a custom branch of vLLM from source.
git clone -b qwen3_omni https://github.com/wangxiongts/vllm.git
cd vllm
pip install -r requirements/build.txt
pip install -r requirements/cuda.txt
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/a5dd03c1ebc5e4f56f3c9d3dc0436e9c582c978f/vllm-0.9.2-cp38-abi3-manylinux1_x86_64.whl
VLLM_USE_PRECOMPILED=1 pip install -e . -v --no-build-isolation
# If you encounter an "undefined symbol" error with VLLM_USE_PRECOMPILED=1, build fully from source with "pip install -e . -v".
# Install Transformers from source and the remaining runtime dependencies
pip install git+https://github.com/huggingface/transformers
pip install accelerate
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation
For more details on compiling vLLM from source, refer to the vLLM official documentation.
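Once the environment is set up, offline inference can follow the standard vLLM multimodal pattern. The sketch below is illustrative only: the checkpoint path, instruction text, and prompt format are assumptions (the real model expects its own chat template with audio placeholder tokens), so consult the repository's inference scripts for the exact usage.

```python
# Illustrative offline-inference sketch with vLLM (not the official script).
# Assumptions: the local checkpoint path and the plain-text prompt below are
# placeholders; the model's actual chat template and audio placeholder tokens
# are defined by the repository's inference code.
import librosa
from vllm import LLM, SamplingParams

MODEL_PATH = "./checkpoints/Speaker-Reasoner"  # placeholder path

llm = LLM(
    model=MODEL_PATH,
    trust_remote_code=True,
    limit_mm_per_prompt={"audio": 1},  # one audio clip per prompt
)

# Load a 16 kHz mono waveform for the segment to transcribe.
audio, sr = librosa.load("meeting_segment.wav", sr=16000)

# Hypothetical instruction; replace with the model's expected prompt template.
prompt = "Transcribe the audio with speaker labels, genders, and timestamps."

outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"audio": [(audio, sr)]},
    },
    SamplingParams(temperature=0.0, max_tokens=2048),
)
print(outputs[0].outputs[0].text)
```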
Citation
If you find this work useful, please cite:
@article{lin2026speakerreasoner,
title={Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR},
author={Zhennan Lin and Shuai Wang and Zhaokai Sun and Pengyuan Xie and Chuan Xie and Jie Liu and Qiang Zhang and Lei Xie},
year={2026},
eprint={2604.03074},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2604.03074},
}
License
The code in this repository is released under the Apache 2.0 License.
Contact
- Issues: Please open a GitHub Issue for bug reports or suggestions.
- Email: znlin@mail.nwpu.edu.cn, lxie@nwpu.edu.cn