Self-supervised learning (SSL), which utilizes the input data itself for representation learning, has achieved state-of-the-art results on various downstream speech tasks. However, most previous studies focused on offline single-talker applications, with limited investigation of multi-talker cases, especially in streaming scenarios. In this paper, we investigate SSL for streaming multi-talker speech recognition, which generates transcriptions of overlapping speakers in a streaming fashion. We first observe that conventional SSL techniques do not work well on this task due to their poor representation of overlapping speech. We then propose a novel SSL training objective, referred to as bi-label masked speech prediction, which explicitly preserves representations of all speakers in overlapping speech. We investigate various aspects of the proposed system, including data configuration and quantizer selection. The proposed SSL setup achieves substantially better word error rates on the LibriSpeechMix dataset than conventional SSL pre-training.
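To make the objective concrete, below is a minimal PyTorch sketch of a bi-label masked prediction loss: each masked frame of the (possibly overlapped) input carries one quantized pseudo-label per speaker, and the loss scores both label streams. The class name, the use of two separate prediction heads, and the way labels are assigned to streams are illustrative assumptions, not necessarily the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLabelMaskedPredictionLoss(nn.Module):
    """Illustrative sketch of a bi-label masked speech prediction objective.

    Assumes the SSL encoder emits one hidden vector per frame and that each
    frame carries two quantized pseudo-labels, one per overlapping speaker
    (hypothetical layout; head names and label assignment are assumptions).
    """
    def __init__(self, hidden_dim: int, codebook_size: int):
        super().__init__()
        # One projection per label stream (first / second speaker).
        self.head1 = nn.Linear(hidden_dim, codebook_size)
        self.head2 = nn.Linear(hidden_dim, codebook_size)

    def forward(self, hidden, labels1, labels2, mask):
        # hidden:  (B, T, H) encoder outputs
        # labels1: (B, T) quantized targets for the first speaker
        # labels2: (B, T) quantized targets for the second speaker
        # mask:    (B, T) bool, True at masked frames (loss applied there only)
        h = hidden[mask]  # (N, H) gather masked frames
        loss1 = F.cross_entropy(self.head1(h), labels1[mask])
        loss2 = F.cross_entropy(self.head2(h), labels2[mask])
        return loss1 + loss2

# Usage with dummy shapes: batch of 2, 50 frames, 256-dim hidden, 320 codewords.
loss_fn = BiLabelMaskedPredictionLoss(hidden_dim=256, codebook_size=320)
hidden = torch.randn(2, 50, 256)
labels1 = torch.randint(0, 320, (2, 50))
labels2 = torch.randint(0, 320, (2, 50))
mask = torch.rand(2, 50) < 0.5
loss = loss_fn(hidden, labels1, labels2, mask)
```

On frames where only one talker is active, the two label streams could simply repeat the same pseudo-label, in which case the objective reduces to ordinary single-label masked speech prediction; this degradation behavior is an assumption of the sketch rather than a claim about the paper.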