Streaming end-to-end multi-talker speech recognition aims to transcribe overlapped speech from conversations or meetings with an all-neural model in a streaming fashion, which is fundamentally different from modular approaches that typically cascade independently trained speech separation and speech recognition models. Previously, we proposed the Streaming Unmixing and Recognition Transducer (SURT) model, based on the recurrent neural network transducer (RNN-T), for this problem and presented promising results. However, for real applications, the speech recognition system must also determine the timestamp at which a speaker finishes speaking so that the system can respond promptly. This problem, known as endpoint (EP) detection, has not previously been studied for multi-talker end-to-end models. In this work, we address EP detection in the SURT framework by introducing an end-of-sentence token as an output unit, following the practice of single-talker end-to-end models. Furthermore, we present a latency penalty approach that significantly reduces EP detection latency. Our experiments on the 2-speaker LibrispeechMix dataset show that the SURT model achieves promising EP detection without significant degradation of recognition accuracy.
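To make the two ideas in the abstract concrete, the following is a minimal sketch of how an end-of-sentence token could drive EP detection at inference time and how a latency penalty might be attached during training. The vocabulary layout, the 0.7 threshold, the function names, and the exact form of the penalty are illustrative assumptions for this sketch, not the paper's implementation.

```python
import torch

# Assumed vocabulary layout for this sketch: index 0 is the RNN-T blank,
# index 1 is the end-of-sentence token </s>; remaining indices are word pieces.
BLANK, EOS = 0, 1

def detect_endpoint(logits: torch.Tensor, threshold: float = 0.7):
    """Declare an endpoint at the first frame where the posterior of </s>
    exceeds `threshold`. `logits` has shape (T, vocab_size); returns the
    frame index, or None if no endpoint is detected."""
    probs = torch.softmax(logits, dim=-1)
    hits = (probs[:, EOS] > threshold).nonzero(as_tuple=True)[0]
    return int(hits[0]) if hits.numel() > 0 else None

def latency_penalty(logits: torch.Tensor, ref_end_frame: int,
                    weight: float = 0.01) -> torch.Tensor:
    """Illustrative latency penalty: push probability mass onto </s> for
    frames at or after the reference end-of-speech frame, so the model
    learns to emit </s> promptly instead of delaying it."""
    log_probs = torch.log_softmax(logits, dim=-1)
    if ref_end_frame >= log_probs.shape[0]:
        return logits.new_zeros(())  # no frames after the reference end
    return -weight * log_probs[ref_end_frame:, EOS].mean()

# Usage with random stand-in logits: 50 frames, 100-unit vocabulary.
logits = torch.randn(50, 100)
ep_frame = detect_endpoint(logits)
loss_term = latency_penalty(logits, ref_end_frame=40)
```

In a full SURT system such a penalty would be added to the transducer loss for each speaker branch, and the threshold test would run on the streaming decoder's per-step output distribution rather than on raw frame logits.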