
Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR

Zhennan Lin¹, Shuai Wang², Zhaokai Sun¹, Pengyuan Xie³, Chuan Xie³, Jie Liu³, Qiang Zhang³, Lei Xie¹†

¹Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
²School of Intelligence Science and Technology, Nanjing University
³Shanghai Lingguang Zhaxian Technology

†Corresponding author


Speaker-Reasoner is an end-to-end Speech LLM for timestamped speaker-attributed ASR featuring agentic multi-turn temporal reasoning. Instead of single-pass inference, the model iteratively analyzes global audio structure, autonomously predicts temporal boundaries, and performs fine-grained segment analysis, jointly modeling speaker identity, gender, timestamps, and transcription. A speaker-aware cache further extends processing to audio exceeding the training context window.

🌟 Highlights

  • Agentic multi-turn reasoning: iterative global-to-local inference along the temporal axis — global speaker summary → boundary prediction → fine-grained segment decoding (see the sketch after this list)
  • Speaker-aware context cache: extends processing to long-form audio beyond the training context window while preserving speaker consistency across chunks
  • Three-stage progressive training: multi-task foundation → temporal interaction learning → cache-conditioned decoding
  • State-of-the-art performance: outperforms strong baselines, including the closed-source Gemini-2.5-Pro, on AliMeeting and AISHELL-4
  • 🔥 Bilingual & Scaled up: extended training on 4,194 hours of multi-domain data, natively supporting English and Mandarin across complex multi-speaker scenarios
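
As a rough illustration of this control flow, here is a minimal, hypothetical sketch of the multi-turn loop and of speaker-aware-cache decoding; `model.chat`, the prompt strings, and the cache object are placeholders used to show the structure, not the released API:

```python
# Hypothetical sketch of the agentic multi-turn loop and the speaker-aware cache
# (SAC). `model.chat`, the prompts, and the cache object are placeholders that
# illustrate the control flow; they are NOT the released interface.

def multiturn_decode(model, audio, max_turns=64):
    """Global speaker summary -> boundary prediction -> fine-grained decoding."""
    history = [model.chat(audio, "Summarize the speakers in this recording.")]
    segments, t = [], 0.0
    for _ in range(max_turns):
        # Turn k: predict where the next homogeneous segment ends.
        boundary = model.chat(audio, f"Predict the next boundary after {t:.1f}s.",
                              history=history)
        if boundary is None:  # no speech left to decode
            break
        # Turn k+1: decode [t, boundary] with speaker id, gender, timestamps, text.
        seg = model.chat(audio,
                         f"Transcribe [{t:.1f}s, {boundary:.1f}s] with speaker attributes.",
                         history=history)
        segments.append(seg)
        history.append(seg)
        t = boundary
    return segments


def longform_decode(model, audio, sr, chunk_s=300):
    """Chunked decoding of audio longer than the training context window."""
    cache, segments = None, []
    for start in range(0, len(audio), chunk_s * sr):
        chunk = audio[start:start + chunk_s * sr]
        # The speaker-aware cache conditions each chunk so labels stay consistent.
        result = model.chat(chunk, "Transcribe with speaker labels.", cache=cache)
        segments.extend(result.segments)
        cache = result.speaker_cache  # compact summary of speakers seen so far
    return segments
```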

📊 Results

Comprehensive Multi-Domain Evaluation

We further scaled up Speaker-Reasoner with 4,194 hours of bilingual (ZH/EN) training data. The model demonstrates superior performance across diverse scenarios, including challenging video domains and various public meeting datasets.

Each cell reports WER↓ / cpWER↓ / DER↓ / Δcp↓, where Δcp = cpWER − WER (the additional error introduced by speaker attribution).

| Model | Video-Internal-Eval | Video-Internal-Eval-zh | Video-Internal-Eval-en | AISHELL4-Eval | Alimeeting-Far | AMI-SDM | MLC-SLM-Eval-1 | MLC-SLM-Eval-2 |
|---|---|---|---|---|---|---|---|---|
| Gemini-2.5-Pro | 22.47 / 44.13 / 74.05 / 21.66 | 18.28 / 40.97 / 69.35 / 22.69 | 55.40 / 68.82 / 100.95 / 13.42 | 19.81 / 25.11 / 36.07 / 5.30 | 30.16 / 39.29 / 56.39 / 9.13 | 31.66 / 39.98 / 50.28 / 8.32 | 36.87 / 41.88 / 42.33 / 5.01 | 26.73 / 32.19 / 46.19 / 5.46 |
| VibeVoice-ASR | 16.45 / 58.60 / 47.18 / 42.15 | 17.70 / 62.06 / 47.65 / 44.36 | 7.11 / 32.65 / 44.62 / 25.54 | 22.19 / 26.16 / 8.94 / 3.97 | 34.31 / 39.92 / 19.62 / 5.61 | 30.53 / 35.86 / 21.00 / 5.33 | 10.30 / 13.45 / 6.27 / 3.15 | 7.97 / 11.38 / 3.14 / 3.41 |
| Speaker-Reasoner Multi-turn | 6.27 / 24.43 / 15.33 / 18.16 | 6.50 / 25.81 / 16.68 / 19.31 | 4.42 / 16.31 / 7.58 / 11.89 | 7.13 / 8.14 / 3.38 / 1.01 | 19.72 / 19.92 / 6.70 / 0.20 | 23.29 / 25.16 / 13.56 / 1.87 | 9.17 / 11.74 / 4.76 / 2.57 | 8.54 / 11.76 / 4.35 / 3.22 |

Segmented Evaluation (40–50s segments)

Each cell reports DER↓ / CER↓ / cpCER↓ / Δcp↓, where Δcp = cpCER − CER; '–' marks entries where DER is unavailable (see the footnote below).

| Model | AISHELL4-Eval | Alimeeting-Far |
|---|---|---|
| **Cascade Baselines** | | |
| Pyannote3.1 + Paraformer | 8.10 / 19.18 / 26.24 / 7.06 | 19.13 / 30.15 / 45.39 / 15.24 |
| **End-to-End Baselines** | | |
| Gemini-2.5-Pro† | 36.07 / 19.81 / 25.11 / 5.30 | 56.39 / 30.16 / 39.29 / 9.13 |
| Qwen3-Omni-30B-A3B-Instruct | 32.42 / 14.46 / 22.22 / 7.76 | 37.15 / 25.40 / 36.28 / 10.88 |
| Qwen2.5-Omni-7B | 85.68 / 33.37 / 60.45 / 27.08 | 91.77 / 38.13 / 73.38 / 35.25 |
| SpeakerLM (212.25 h) | – / 17.75 / 26.14 / 8.39 | – / 18.63 / 32.22 / 13.59 |
| SpeakerLM (7638.95 h) | – / 17.17 / 18.37 / 1.20 | – / 13.97 / 16.05 / 2.08 |
| VibeVoice-ASR | 10.88 / 22.30 / 26.30 / 4.00 | 20.70 / 34.67 / 40.54 / 5.87 |
| TagSpeech-Alimeeting | 37.51 / 35.70 / 53.44 / 17.74 | 52.46 / 47.11 / 68.74 / 21.63 |
| **Ours** | | |
| Qwen3-Omni + SOT SFT (Stage 1) | – / 17.65 / 19.59 / 1.94 | – / 24.24 / 26.03 / 1.79 |
| Speaker-Reasoner Base (Stage 1) | 6.24 / 14.04 / 16.54 / 2.50 | 8.96 / 21.16 / 22.64 / 1.48 |
| Speaker-Reasoner Multi-turn (Stage 2) | 5.19 / 13.83 / 14.93 / 1.10 | 7.47 / 20.34 / 20.29 / −0.05 |
| Speaker-Reasoner Multi-turn w/ SAC (Stage 3) | 5.26 / 13.83 / 14.73 / 0.90 | 7.34 / 20.57 / 20.43 / −0.14 |
| Speaker-Reasoner Base 7B | 12.00 / 15.65 / 25.60 / 9.95 | 18.43 / 24.97 / 38.12 / 13.15 |
| Speaker-Reasoner Multi-turn 7B | 9.38 / 15.31 / 22.91 / 7.60 | 15.56 / 24.33 / 34.81 / 10.48 |

† Closed-source model. DER unavailable for SpeakerLM and SOT-based models due to incompatible output formats.

Long-form Evaluation (without segmentation)

| Model | AISHELL4-Eval DER↓ | AISHELL4-Eval cpCER↓ |
|---|---|---|
| Gemini-2.5-Pro | 15.32 | 31.59 |
| Speaker-Reasoner Multi-turn w/ SAC | 21.60 | 36.20 |

Speaker Attribute Evaluation (AISHELL4-Eval)

| Model | Gender ACC↑ | Speaker Count ACC (SCA)↑ |
|---|---|---|
| Gemini-2.5-Pro | 94.80 | 67.03 |
| Qwen3-Omni-30B-A3B-Instruct | 97.12 | 60.49 |
| Speaker-Reasoner Multi-turn | 96.80 | 69.03 |

Installation

Environment Setup

```bash
git clone https://github.com/ASLP-lab/Speaker-Reasoner.git
cd Speaker-Reasoner

conda create -n speaker-reasoner python=3.10 -y
conda activate speaker-reasoner
```

Install MS-Swift and dependencies:

```bash
pip install ms-swift
```

Model Download

We provide pre-trained model weights on Hugging Face; download the version that matches your use case:

| Model Version | Description | Language | Download |
|---|---|---|---|
| Speaker-Reasoner | The standard multi-turn model evaluated in the main paper. | ZH | 🤗 Hugging Face |
| Speaker-Reasoner-4194h | Scaled-up version trained on 4,194 hours of multi-domain data. | ZH/EN | 🤗 Hugging Face |
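
For instance, the standard checkpoint (repo id `ASLP-lab/Speaker-Reasoner`, as listed on this page) can be fetched with `huggingface_hub`; the target directory below is arbitrary:

```python
# Fetch the standard checkpoint; local_dir is an arbitrary target directory.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="ASLP-lab/Speaker-Reasoner",
                  local_dir="./Speaker-Reasoner-ckpt")
```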

Training

Coming soon.

Inference

vLLM

Speaker-Reasoner is built on top of Qwen3-Omni-30B-A3B-Instruct; to run it with vLLM, install the custom `qwen3_omni` branch from source:

```bash
git clone -b qwen3_omni https://github.com/wangxiongts/vllm.git
cd vllm
pip install -r requirements/build.txt
pip install -r requirements/cuda.txt
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/a5dd03c1ebc5e4f56f3c9d3dc0436e9c582c978f/vllm-0.9.2-cp38-abi3-manylinux1_x86_64.whl
VLLM_USE_PRECOMPILED=1 pip install -e . -v --no-build-isolation
# If you hit an "undefined symbol" error with VLLM_USE_PRECOMPILED=1,
# build fully from source with "pip install -e . -v".

# Install Transformers from source and the remaining dependencies
pip install git+https://github.com/huggingface/transformers
pip install accelerate
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation
```

For more details on compiling vLLM from source, refer to the vLLM official documentation.
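
Once the environment is ready, inference follows the standard vLLM multimodal pattern. The snippet below is a minimal sketch that assumes the checkpoint follows Qwen3-Omni chat conventions; the audio file and instruction text are illustrative placeholders, and the exact prompt format should be taken from the model's released chat template.

```python
# Minimal vLLM inference sketch (assumes the custom qwen3_omni branch above).
# The instruction wording and file names are placeholders; consult the model's
# chat template for the exact prompt format.
import librosa
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

MODEL_PATH = "ASLP-lab/Speaker-Reasoner"  # or a local checkpoint directory

# Build a chat prompt with the model's own template.
processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)
messages = [{
    "role": "user",
    "content": [
        {"type": "audio", "audio": "meeting.wav"},
        {"type": "text",
         "text": "Transcribe the audio with speaker labels, gender and timestamps."},
    ],
}]
prompt = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False)

# Load audio at 16 kHz and run generation.
audio, sr = librosa.load("meeting.wav", sr=16000)
llm = LLM(model=MODEL_PATH, trust_remote_code=True)
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"audio": [(audio, sr)]}},
    SamplingParams(temperature=0.0, max_tokens=4096),
)
print(outputs[0].outputs[0].text)
```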

Citation

If you find this work useful, please cite:

```bibtex
@article{lin2026speakerreasoner,
  title={Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR},
  author={Zhennan Lin and Shuai Wang and Zhaokai Sun and Pengyuan Xie and Chuan Xie and Jie Liu and Qiang Zhang and Lei Xie},
  year={2026},
  eprint={2604.03074},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2604.03074},
}
```

License

The code in this repository is released under the Apache 2.0 License.

Contact
