Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
Zhennan Lin1, Shuai Wang2, Zhaokai Sun1, Pengyuan Xie3, Chuan Xie3, Jie Liu3, Qiang Zhang3, Lei Xie1†
†Corresponding author
1Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
2School of Intelligence Science and Technology, Nanjing University
3Shanghai Lingguang Zhaxian Technology
Speaker-Reasoner is an end-to-end Speech LLM for timestamped speaker-attributed ASR featuring agentic multi-turn temporal reasoning. Instead of single-pass inference, the model iteratively analyzes global audio structure, autonomously predicts temporal boundaries, and performs fine-grained segment analysis, jointly modeling speaker identity, gender, timestamps, and transcription. A speaker-aware cache further extends processing to audio exceeding the training context window.
🌟 Highlights
- Agentic multi-turn reasoning: iterative global-to-local inference along the temporal axis (global speaker summary → boundary prediction → fine-grained segment decoding; see the sketch after this list)
- Speaker-aware context cache: extends processing to long-form audio beyond the training context window while preserving speaker consistency across chunks
- Three-stage progressive training: multi-task foundation → temporal interaction learning → cache-conditioned decoding
- State-of-the-art performance: outperforms strong baselines including closed-source Gemini-2.5-Pro on AliMeeting and AISHELL-4
- 🔥 Bilingual & Scaled up: extended training on 4,194 hours of multi-domain data, natively supporting English and Mandarin across complex multi-speaker scenarios
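To make the interaction pattern concrete, here is a minimal pseudocode-style sketch of the global-to-local loop and the speaker-aware cache. Every callable in it (ask_model, split_into_chunks, update_speaker_cache) is a hypothetical stand-in for illustration, not part of this repository's API.

```python
# Pseudocode-style sketch of the multi-turn, global-to-local inference loop.
# All callables passed in (ask_model, split_into_chunks, update_speaker_cache)
# are hypothetical stand-ins, not functions shipped with this repository.

def transcribe_long_audio(audio, ask_model, split_into_chunks,
                          update_speaker_cache, chunk_seconds=300):
    cache = {}     # speaker-aware context carried across chunks (Stage 3 idea)
    results = []
    for chunk in split_into_chunks(audio, chunk_seconds):
        # Turn 1: global analysis. Summarize who speaks in this chunk,
        # conditioned on speaker profiles cached from earlier chunks.
        summary = ask_model(chunk, "Summarize the speakers in this audio.", cache)

        # Turn 2: boundary prediction. The model proposes temporal boundaries.
        boundaries = ask_model(chunk, "Predict segment time boundaries.", cache)

        # Turns 3+: fine-grained decoding. Transcribe each predicted segment
        # with speaker identity, gender, and timestamps attached.
        for start, end in boundaries:
            results.append(
                ask_model(chunk, f"Transcribe [{start:.1f}s, {end:.1f}s].", cache)
            )

        # Refresh the cache so speaker labels stay consistent across chunks.
        cache = update_speaker_cache(cache, summary)
    return results
```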
📊 Results
Comprehensive Multi-Domain Evaluation
We further scaled up Speaker-Reasoner with 4,194 hours of bilingual (ZH/EN) training data. The model demonstrates superior performance across diverse scenarios, including challenging video domains and various public meeting datasets.
Each cell reports WER↓ / cpWER↓ / DER↓ / ∆cp↓ (lower is better for all four).

| Test set | Gemini-2.5-Pro | VibeVoice-ASR | Speaker-Reasoner Multi-turn |
|---|---|---|---|
| Video-Internal-Eval | 22.47 / 44.13 / 74.05 / 21.66 | 16.45 / 58.60 / 47.18 / 42.15 | 6.27 / 24.43 / 15.33 / 18.16 |
| Video-Internal-Eval-zh | 18.28 / 40.97 / 69.35 / 22.69 | 17.70 / 62.06 / 47.65 / 44.36 | 6.50 / 25.81 / 16.68 / 19.31 |
| Video-Internal-Eval-en | 55.40 / 68.82 / 100.95 / 13.42 | 7.11 / 32.65 / 44.62 / 25.54 | 4.42 / 16.31 / 7.58 / 11.89 |
| AISHELL4-Eval | 19.81 / 25.11 / 36.07 / 5.30 | 22.19 / 26.16 / 8.94 / 3.97 | 7.13 / 8.14 / 3.38 / 1.01 |
| Alimeeting-Far | 30.16 / 39.29 / 56.39 / 9.13 | 34.31 / 39.92 / 19.62 / 5.61 | 19.72 / 19.92 / 6.70 / 0.20 |
| AMI-SDM | 31.66 / 39.98 / 50.28 / 8.32 | 30.53 / 35.86 / 21.00 / 5.33 | 23.29 / 25.16 / 13.56 / 1.87 |
| MLC-SLM-Eval-1 | 36.87 / 41.88 / 42.33 / 5.01 | 10.30 / 13.45 / 6.27 / 3.15 | 9.17 / 11.74 / 4.76 / 2.57 |
| MLC-SLM-Eval-2 | 26.73 / 32.19 / 46.19 / 5.46 | 7.97 / 11.38 / 3.14 / 3.41 | 8.54 / 11.76 / 4.35 / 3.22 |
Segmented Evaluation (40–50s segments)
| Model | AISHELL4-Eval DER↓ | AISHELL4-Eval CER↓ | AISHELL4-Eval cpCER↓ | AISHELL4-Eval ∆cp↓ | Alimeeting-Far DER↓ | Alimeeting-Far CER↓ | Alimeeting-Far cpCER↓ | Alimeeting-Far ∆cp↓ |
|---|---|---|---|---|---|---|---|---|
| **Cascade Baselines** | | | | | | | | |
| Pyannote3.1 + Paraformer | 8.10 | 19.18 | 26.24 | 7.06 | 19.13 | 30.15 | 45.39 | 15.24 |
| **End-to-End Baselines** | | | | | | | | |
| Gemini-2.5-Pro† | 36.07 | 19.81 | 25.11 | 5.30 | 56.39 | 30.16 | 39.29 | 9.13 |
| Qwen3-Omni-30B-A3B-Instruct | 32.42 | 14.46 | 22.22 | 7.76 | 37.15 | 25.40 | 36.28 | 10.88 |
| Qwen2.5-Omni-7B | 85.68 | 33.37 | 60.45 | 27.08 | 91.77 | 38.13 | 73.38 | 35.25 |
| SpeakerLM (212.25h) | – | 17.75 | 26.14 | 8.39 | – | 18.63 | 32.22 | 13.59 |
| SpeakerLM (7638.95h) | – | 17.17 | 18.37 | 1.20 | – | 13.97 | 16.05 | 2.08 |
| VibeVoice-ASR | 10.88 | 22.30 | 26.30 | 4.00 | 20.70 | 34.67 | 40.54 | 5.87 |
| TagSpeech-Alimeeting | 37.51 | 35.70 | 53.44 | 17.74 | 52.46 | 47.11 | 68.74 | 21.63 |
| **Ours** | | | | | | | | |
| Qwen3-Omni + SOT sft (Stage 1) | – | 17.65 | 19.59 | 1.94 | – | 24.24 | 26.03 | 1.79 |
| Speaker-Reasoner Base (Stage 1) | 6.24 | 14.04 | 16.54 | 2.50 | 8.96 | 21.16 | 22.64 | 1.48 |
| Speaker-Reasoner Multi-turn (Stage 2) | 5.19 | 13.83 | 14.93 | 1.10 | 7.47 | 20.34 | 20.29 | −0.05 |
| Speaker-Reasoner Multi-turn w/ SAC (Stage 3) | 5.26 | 13.83 | 14.73 | 0.90 | 7.34 | 20.57 | 20.43 | −0.14 |
| Speaker-Reasoner Base 7B | 12.00 | 15.65 | 25.60 | 9.95 | 18.43 | 24.97 | 38.12 | 13.15 |
| Speaker-Reasoner Multi-turn 7B | 9.38 | 15.31 | 22.91 | 7.60 | 15.56 | 24.33 | 34.81 | 10.48 |
† Closed-source model. DER unavailable for SpeakerLM and SOT-based models due to incompatible output formats.
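A note on the metrics in the tables above: cpWER (and its character-level counterpart cpCER) is the concatenated minimum-permutation error rate, i.e. the error computed after choosing the mapping between hypothesis speakers and reference speakers that minimizes it, and ∆cp, as reported here, is the gap between the permutation-resolved and speaker-agnostic rates (cpWER − WER, or cpCER − CER), which roughly isolates the cost of speaker-attribution mistakes. The toy sketch below only illustrates the idea; it is not the evaluation code behind these numbers and ignores details such as text normalization and unequal speaker counts.

```python
# Simplified sketch of cpWER (concatenated minimum-permutation WER).
# Not the official evaluation code; it assumes equal speaker counts and
# skips text normalization, purely to illustrate what the metric measures.
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]


def cpwer(refs_by_speaker, hyps_by_speaker):
    """Best error rate over all mappings of hypothesis speakers to references."""
    refs = [r.split() for r in refs_by_speaker.values()]
    total_ref_words = sum(len(r) for r in refs)
    best_errors = float("inf")
    for perm in permutations(hyps_by_speaker.values()):
        errors = sum(edit_distance(r, h.split()) for r, h in zip(refs, perm))
        best_errors = min(best_errors, errors)
    return best_errors / total_ref_words


refs = {"S1": "good morning everyone", "S2": "let us begin the meeting"}
hyps = {"A": "let us begin the meeting", "B": "good morning everyone"}
print(f"cpWER = {cpwer(refs, hyps):.2f}")  # 0.00: the best speaker mapping is found
```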
Long-form Evaluation (without segmentation)
| Model | AISHELL4-Eval DER↓ | AISHELL4-Eval cpCER↓ |
|---|---|---|
| Gemini-2.5-Pro | 15.32 | 31.59 |
| Speaker-Reasoner Multi-turn w/ SAC | 21.60 | 36.20 |
Speaker Attribute Evaluation (AISHELL4-Eval)
| Model | Gender ACC↑ | Speaker Count ACC (SCA)↑ |
|---|---|---|
| Gemini-2.5-Pro | 94.80 | 67.03 |
| Qwen3-Omni-30B-A3B-Instruct | 97.12 | 60.49 |
| Speaker-Reasoner Multi-turn | 96.80 | 69.03 |
Installation
Environment Setup
git clone https://github.com/ASLP-lab/Speaker-Reasoner.git
cd Speaker-Reasoner
conda create -n speaker-reasoner python=3.10 -y
conda activate speaker-reasoner
Install MS-Swift and dependencies:
pip install ms-swift
Model Download
We provide pre-trained model weights on Hugging Face. Download the version that matches your requirements:
| Model Version | Description | Language | Download |
|---|---|---|---|
| Speaker-Reasoner | The standard multi-turn model evaluated in the main paper. | ZH | 🤗 Hugging Face |
| Speaker-Reasoner-4194h | Scaled-up version trained on 4,194 hours of multi-domain data. | ZH/EN | 🤗 Hugging Face |
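If you prefer to fetch the weights programmatically, the standard huggingface_hub download works. The repo_id below is a placeholder; substitute the actual Hub path behind the links in the table above.

```python
# Minimal sketch: download a checkpoint with huggingface_hub.
# NOTE: "ASLP-lab/Speaker-Reasoner" is a placeholder repo_id; use the actual
# Hub path linked in the table above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ASLP-lab/Speaker-Reasoner",          # placeholder
    local_dir="./checkpoints/Speaker-Reasoner",   # where to store the weights
)
print(f"Weights downloaded to: {local_dir}")
```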
Training
Coming soon.
Inference
vLLM
Speaker-Reasoner is built on top of Qwen3-Omni-30B-A3B-Instruct. To run it, you will need to install a custom branch of vLLM from source.
git clone -b qwen3_omni https://github.com/wangxiongts/vllm.git
cd vllm
pip install -r requirements/build.txt
pip install -r requirements/cuda.txt
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/a5dd03c1ebc5e4f56f3c9d3dc0436e9c582c978f/vllm-0.9.2-cp38-abi3-manylinux1_x86_64.whl
VLLM_USE_PRECOMPILED=1 pip install -e . -v --no-build-isolation
# If you encounter an "undefined symbol" error with VLLM_USE_PRECOMPILED=1, build fully from source with "pip install -e . -v".
# Install Transformers from source and the remaining runtime dependencies
pip install git+https://github.com/huggingface/transformers
pip install accelerate
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation
For more details on compiling vLLM from source, refer to the vLLM official documentation.
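Once the environment is set up, offline inference can follow the standard vLLM multimodal pattern. The sketch below is illustrative only: the checkpoint path, instruction text, and prompt format are assumptions (the real model expects its own chat template with audio placeholder tokens), so consult the repository's inference scripts for the exact usage.

```python
# Illustrative offline-inference sketch with vLLM (not the official script).
# Assumptions: the local checkpoint path and the plain-text prompt below are
# placeholders; the model's actual chat template and audio placeholder tokens
# are defined by the repository's inference code.
import librosa
from vllm import LLM, SamplingParams

MODEL_PATH = "./checkpoints/Speaker-Reasoner"  # placeholder path

llm = LLM(
    model=MODEL_PATH,
    trust_remote_code=True,
    limit_mm_per_prompt={"audio": 1},  # one audio clip per prompt
)

# Load a 16 kHz mono waveform for the segment to transcribe.
audio, sr = librosa.load("meeting_segment.wav", sr=16000)

# Hypothetical instruction; replace with the model's expected prompt template.
prompt = "Transcribe the audio with speaker labels, genders, and timestamps."

outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"audio": [(audio, sr)]},
    },
    SamplingParams(temperature=0.0, max_tokens=2048),
)
print(outputs[0].outputs[0].text)
```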
Citation
If you find this work useful, please cite:
@article{lin2026speakerreasoner,
title={Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR},
author={Zhennan Lin and Shuai Wang and Zhaokai Sun and Pengyuan Xie and Chuan Xie and Jie Liu and Qiang Zhang and Lei Xie},
year={2026},
eprint={2604.03074},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2604.03074},
}
License
The code in this repository is released under the Apache 2.0 License.
Contact
- Issues: Please open a GitHub Issue for bug reports or suggestions.
- Email: znlin@mail.nwpu.edu.cn, lxie@nwpu.edu.cn