End-to-end multi-talker speech recognition is an emerging research trend in the speech community due to its vast potential in applications such as conversation and meeting transcription. To the best of our knowledge, all existing research work is constrained to the offline scenario. In this work, we propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition. Our model employs the Recurrent Neural Network Transducer (RNN-T) as its backbone, which can meet various latency constraints. We study two different model architectures, based on a speaker-differentiator encoder and a mask encoder respectively. To train this model, we investigate the widely used Permutation Invariant Training (PIT) approach and the Heuristic Error Assignment Training (HEAT) approach. Based on experiments on the publicly available LibriSpeechMix dataset, we show that HEAT achieves better accuracy than PIT, and that the SURT model with a 150-millisecond algorithmic latency constraint compares favorably in accuracy with an offline sequence-to-sequence baseline model.
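The difference between the two training approaches mentioned above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: `channel_loss` is a toy stand-in for the per-channel RNN-T loss, PIT searches over all reference permutations, and HEAT instead uses one fixed heuristic assignment (e.g., references ordered by utterance start time paired with output channels in order).

```python
# Hedged sketch contrasting PIT and HEAT label assignment for a
# two-output multi-talker model. `channel_loss` is a toy proxy for the
# RNN-T loss between one output channel and one reference transcript
# (assumption: a character mismatch count, just to keep this runnable).
from itertools import permutations

def channel_loss(hyp, ref):
    # Toy stand-in for an RNN-T loss: character mismatches plus length gap.
    return sum(a != b for a, b in zip(hyp, ref)) + abs(len(hyp) - len(ref))

def pit_loss(hyps, refs):
    # PIT: evaluate every permutation of the references against the
    # output channels and keep the cheapest assignment.
    return min(
        sum(channel_loss(h, r) for h, r in zip(hyps, perm))
        for perm in permutations(refs)
    )

def heat_loss(hyps, refs_by_start_time):
    # HEAT: a single heuristic assignment (here, references pre-sorted
    # by utterance start time), so no permutation search is needed.
    return sum(channel_loss(h, r) for h, r in zip(hyps, refs_by_start_time))
```

Note that PIT's permutation search grows factorially with the number of speakers, whereas HEAT's fixed assignment keeps the training cost linear in the number of output channels.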