The Speaker Diarization and Recognition (SDR) task aims to predict "who spoke when and what" within an audio clip, a capability crucial to real-world multi-speaker scenarios such as meeting transcription and dialogue systems. Existing SDR systems typically adopt a cascaded framework that combines multiple modules, such as speaker diarization (SD) and automatic speech recognition (ASR). These cascaded systems suffer from several limitations: error propagation, difficulty handling overlapping speech, and a lack of joint optimization that could exploit the synergy between the SD and ASR tasks. To address these limitations, we introduce SpeakerLM, a unified multimodal large language model for SDR that performs SD and ASR jointly in an end-to-end manner. Moreover, to accommodate diverse real-world scenarios, we incorporate a flexible speaker registration mechanism into SpeakerLM, enabling SDR under different speaker registration settings. SpeakerLM is developed progressively with a multi-stage training strategy on large-scale real data. Extensive experiments show that SpeakerLM exhibits strong data scaling capability and generalizability, outperforming state-of-the-art cascaded baselines on both in-domain and out-of-domain public SDR benchmarks. Furthermore, experimental results show that the proposed speaker registration mechanism ensures robust SDR performance across diverse registration conditions and varying numbers of registered speakers.