In multi-talker scenarios such as meetings and conversations, speech processing systems are usually required to both transcribe the audio and identify the speakers for downstream applications. Since overlapped speech is common in such scenarios, conventional approaches usually address the problem in a cascaded fashion involving independently trained modules for speech separation, speech recognition, and speaker identification. In this paper, we propose the Streaming Unmixing, Recognition and Identification Transducer (SURIT) -- a new framework that deals with this problem in an end-to-end streaming fashion. SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification. We validate our idea on LibrispeechMix -- a multi-talker dataset derived from Librispeech -- and present encouraging results.
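To make the RNN-T backbone mentioned above concrete, the sketch below illustrates the core of a transducer: an acoustic encoder output and a label prediction-network output are combined by a joint network into a lattice of per-(frame, label-step) vocabulary distributions, including the blank symbol that lets the model advance in time and thus operate in a streaming fashion. This is a minimal illustration with toy dimensions and random weights, not the authors' SURIT implementation; all names and sizes here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative assumptions, not from the paper).
T, U, H, V = 4, 3, 8, 10  # audio frames, label steps, hidden size, vocab size (incl. blank)

# Stand-ins for the acoustic encoder and label prediction-network outputs.
enc = rng.standard_normal((T, H))   # encoder output: one vector per audio frame
pred = rng.standard_normal((U, H))  # prediction-network output: one vector per emitted label

W = rng.standard_normal((H, V)) / np.sqrt(H)  # joint-network projection weights

# Joint network: combine every (frame, label-step) pair, then project to the vocabulary.
joint = np.tanh(enc[:, None, :] + pred[None, :, :])  # shape (T, U, H)
logits = joint @ W                                    # shape (T, U, V)

# Softmax over the vocabulary gives emission probabilities at each lattice node.
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)

print(logits.shape)                      # (4, 3, 10)
print(np.allclose(probs.sum(-1), 1.0))   # True
```

In a full transducer, `enc` and `pred` would come from trained recurrent (or other streaming-compatible) networks, and training would marginalize over all blank/label paths through this T-by-U lattice with the RNN-T loss.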