End-to-end multi-talker speech recognition is an emerging research trend in the speech community due to its vast potential in applications such as conversation and meeting transcription. To the best of our knowledge, all existing work is confined to the offline scenario. In this work, we propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition. Our model employs the Recurrent Neural Network Transducer (RNN-T) as its backbone, which can meet various latency constraints. We study two model architectures, based on a speaker-differentiator encoder and a mask encoder respectively. To train this model, we investigate the widely used Permutation Invariant Training (PIT) approach and the recently introduced Heuristic Error Assignment Training (HEAT) approach. In experiments on the publicly available LibriSpeechMix dataset, we show that HEAT achieves better accuracy than PIT, and that the SURT model with a 120-millisecond algorithmic latency constraint compares favorably in accuracy with an offline sequence-to-sequence baseline.
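The key difference between the two training criteria is how reference transcripts are assigned to the model's output branches. The sketch below illustrates this with a toy scalar stand-in for the per-pair transducer loss (the `pair_loss` function and the scalar "scores" are hypothetical placeholders, not the paper's actual loss): PIT searches all permutations of the assignment, while HEAT fixes one assignment up front using a heuristic such as utterance start time.

```python
from itertools import permutations

def pair_loss(out, ref):
    # Toy surrogate for the RNN-T loss between one output branch and
    # one reference transcript (both reduced to hypothetical scalars).
    return abs(out - ref)

def pit_loss(outputs, refs):
    """PIT: evaluate all S! reference-to-output assignments and keep
    the cheapest one, so cost grows factorially with speaker count S."""
    return min(
        sum(pair_loss(o, r) for o, r in zip(outputs, perm))
        for perm in permutations(refs)
    )

def heat_loss(outputs, refs_with_start):
    """HEAT: fix the assignment heuristically -- here, the reference
    with the earliest start time goes to the first output branch --
    so only one assignment is ever scored."""
    ordered = [r for r, _ in sorted(refs_with_start, key=lambda x: x[1])]
    return sum(pair_loss(o, r) for o, r in zip(outputs, ordered))
```

With two overlapped speakers, `pit_loss` scores both possible pairings and picks the smaller total, whereas `heat_loss` commits to the start-time ordering; when the heuristic matches the true alignment the two losses coincide.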