Automatic speech recognition (ASR) of single-channel far-field recordings with an unknown number of speakers is traditionally tackled by cascaded modules. Recent research shows that end-to-end (E2E) multi-speaker ASR models can achieve superior recognition accuracy compared to modular systems. However, these models do not ensure real-time applicability due to their dependency on full audio context. This work takes real-time applicability as the first priority in model design and addresses a few challenges in previous work on the multi-speaker recurrent neural network transducer (MS-RNN-T). First, we introduce on-the-fly overlapping speech simulation during training, yielding a 14% relative word error rate (WER) improvement on the LibriSpeechMix test set. Second, we propose a novel multi-turn RNN-T (MT-RNN-T) model with an overlap-based target arrangement strategy that generalizes to an arbitrary number of speakers without changes in the model architecture. We investigate the impact of the maximum number of speakers seen during training on MT-RNN-T performance on the LibriCSS test set, and report a 28% relative WER improvement over the two-speaker MS-RNN-T. Third, we experiment with a rich transcription strategy for joint recognition and segmentation of multi-party speech. Through an in-depth analysis, we discuss potential pitfalls of the proposed system as well as promising future research directions.
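The on-the-fly overlapping speech simulation mentioned above can be illustrated with a minimal sketch: two single-speaker utterances are mixed with a random start delay at training time, so every step sees a fresh overlapped mixture instead of a fixed pre-mixed dataset. The function name, gain handling, and delay distribution below are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def simulate_overlap(wav_a, wav_b, sample_rate=16000, max_delay_s=2.0, rng=None):
    """Mix two single-speaker utterances into one overlapped mixture.

    wav_a, wav_b: 1-D float32 arrays containing single-speaker speech.
    Returns the mixture and the delay (in samples) applied to speaker B,
    so the corresponding transcripts can be arranged by start time.
    A hedged sketch; the actual simulation policy may differ.
    """
    rng = rng or np.random.default_rng()
    # Random start offset for the second speaker, drawn anew every call.
    delay = int(rng.uniform(0.0, max_delay_s) * sample_rate)
    length = max(len(wav_a), delay + len(wav_b))
    mix = np.zeros(length, dtype=np.float32)
    mix[:len(wav_a)] += wav_a
    mix[delay:delay + len(wav_b)] += wav_b
    return mix, delay

if __name__ == "__main__":
    # Stand-ins for a 3 s and a 2 s utterance; in training these would be
    # real waveforms sampled from a single-speaker corpus.
    rng = np.random.default_rng(0)
    a = rng.standard_normal(16000 * 3).astype(np.float32)
    b = rng.standard_normal(16000 * 2).astype(np.float32)
    mixture, delay = simulate_overlap(a, b, rng=rng)
    print(mixture.shape, delay)
```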