Self-supervised learning (SSL) methods, which learn representations of data without explicit supervision, have gained popularity in speech-processing tasks, particularly for single-talker applications. However, these models often show degraded performance in multi-talker scenarios, possibly due to domain mismatch, which severely limits their use in such applications. In this paper, we investigate the adaptation of upstream SSL models to the multi-talker automatic speech recognition (ASR) task under two conditions. First, when segmented utterances are given, we show that adding a target speaker extraction (TSE) module based on enrollment embeddings is complementary to mixture-aware pre-training. Second, for unsegmented mixtures, we propose a novel joint speaker modeling (JSM) approach, which aggregates information from all speakers in the mixture through their embeddings. With controlled experiments on Libri2Mix, we show that using speaker embeddings provides relative WER improvements of 9.1% and 42.1% over strong baselines for the segmented and unsegmented cases, respectively. We also demonstrate the effectiveness of our models on real conversational mixtures through experiments on the AMI dataset.
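As a rough, hypothetical illustration only (the paper's actual JSM architecture is not specified here), the sketch below shows one way speaker embeddings from all talkers in a mixture could be aggregated and used to condition frame-level features from an upstream SSL encoder. All module and dimension names are assumptions for illustration.

```python
# Hypothetical sketch, NOT the paper's implementation: conditioning upstream SSL
# features on an aggregate of all speakers' embeddings, in the spirit of joint
# speaker modeling (JSM). Dimensions and module names are illustrative assumptions.
import torch
import torch.nn as nn


class JointSpeakerConditioning(nn.Module):
    """Fuse frame-level SSL features with embeddings of every speaker in the mixture."""

    def __init__(self, feat_dim: int, spk_dim: int):
        super().__init__()
        # Project the aggregated speaker embedding to a per-channel scale and bias.
        self.scale = nn.Linear(spk_dim, feat_dim)
        self.bias = nn.Linear(spk_dim, feat_dim)

    def forward(self, ssl_feats: torch.Tensor, spk_embs: torch.Tensor) -> torch.Tensor:
        # ssl_feats: (batch, frames, feat_dim) from the upstream SSL model
        # spk_embs:  (batch, n_speakers, spk_dim) enrollment / speaker embeddings
        agg = spk_embs.mean(dim=1)  # aggregate information from all speakers
        return ssl_feats * self.scale(agg).unsqueeze(1) + self.bias(agg).unsqueeze(1)


# Example: 2-speaker mixture, 768-dim SSL features, 192-dim speaker embeddings.
layer = JointSpeakerConditioning(feat_dim=768, spk_dim=192)
out = layer(torch.randn(4, 200, 768), torch.randn(4, 2, 192))
print(out.shape)  # torch.Size([4, 200, 768])
```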