In multi-talker scenarios such as meetings and conversations, speech processing systems are usually required to segment the audio and then transcribe each segment. These two stages are addressed separately by speaker change detection (SCD) and automatic speech recognition (ASR). Most previous SCD systems rely solely on speaker information and ignore the importance of speech content. In this paper, we propose a novel SCD system that considers both speaker-difference and speech-content cues. These two cues are converted into token-level representations by the continuous integrate-and-fire (CIF) mechanism and then combined to detect speaker changes at token acoustic boundaries. We evaluate the performance of our approach on AISHELL-4, a public real-recorded meeting dataset. The experimental results show that our method outperforms a competitive frame-level baseline system by 2.45% in equal coverage-purity (ECP). In addition, we demonstrate the importance of speech content and speaker difference to the SCD task, and the advantages of conducting SCD at token acoustic boundaries rather than frame by frame.
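The CIF mechanism mentioned above can be sketched roughly as follows: per-frame weights are accumulated, and each time the accumulated weight crosses a threshold, the weighted frame features integrated so far are "fired" as one token-level representation, whose firing position marks a token acoustic boundary. This is a minimal NumPy illustration under common CIF conventions, not the paper's actual implementation; all names here are hypothetical.

```python
import numpy as np

def cif(hidden, alpha, threshold=1.0):
    """Continuous integrate-and-fire (sketch).

    hidden: (T, D) frame-level features.
    alpha:  (T,) non-negative per-frame weights.
    Returns (N, D) token-level representations, one per firing.
    """
    tokens = []
    acc = 0.0                              # accumulated weight
    state = np.zeros(hidden.shape[1])      # integrated feature
    for h, a in zip(hidden, alpha):
        if acc + a >= threshold:
            r = threshold - acc            # weight needed to complete this token
            tokens.append(state + r * h)   # fire: emit token representation
            acc = a - r                    # leftover weight starts the next token
            state = acc * h
        else:
            acc += a
            state += a * h
    if tokens:
        return np.stack(tokens)
    return np.empty((0, hidden.shape[1]))

# Four frames with weight 0.5 each fire exactly two tokens.
out = cif(np.ones((4, 2)), np.array([0.5, 0.5, 0.5, 0.5]))
```

In the proposed system, a speaker-change decision would then be made at each firing position rather than at every frame, which is what "detecting speaker changes at token acoustic boundaries" refers to.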