Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models
Abstract
Full-duplex speech language models require high-quality multi-speaker conversational data, which is scarce, necessitating a robust open-source data processing pipeline to address challenges in natural dialogue dynamics and system accuracy.
As the paradigm of AI shifts from text-based LLMs to Speech Language Models (SLMs), there is a growing demand for full-duplex systems capable of real-time, natural human-computer interaction. However, the development of such models is constrained by the scarcity of high-quality, multi-speaker conversational data, as existing large-scale resources are predominantly single-speaker or limited in volume. Addressing the complex dynamics of natural dialogue, such as overlapping speech and back-channeling, remains a challenge, with standard processing pipelines suffering from diarization errors and ASR hallucinations. To bridge this gap, we present a robust and scalable open-source data processing pipeline designed for full-duplex models.
Community
A full-duplex system lets users interrupt the LLM at any time, and the LLM can also chime in naturally and respond to what users say. This is an area of active research in the speech domain, and we expect it to expand into other fields in the future.
We have proposed a pipeline for pre-processing full-duplex data based on real-world datasets.