Multi-turn RNN-T for streaming recognition of multi-party speech

Automatic speech recognition (ASR) of single-channel far-field recordings with an unknown number of speakers is traditionally tackled by cascaded modules. Recent research shows that end-to-end (E2E) multi-speaker ASR models can achieve higher recognition accuracy than modular systems. However, these models cannot guarantee real-time applicability because they depend on the full audio context. This work makes real-time applicability the first priority in model design and addresses several challenges in previous work on the multi-speaker recurrent neural network transducer (MS-RNN-T). First, we introduce on-the-fly overlapping speech simulation during training, yielding a 14% relative word error rate (WER) improvement on the LibriSpeechMix test set. Second, we propose a novel multi-turn RNN-T (MT-RNN-T) model with an overlap-based target arrangement strategy that generalizes to an arbitrary number of speakers without changes to the model architecture. We investigate how the maximum number of speakers seen during training affects MT-RNN-T performance on the LibriCSS test set, and report a 28% relative WER improvement over the two-speaker MS-RNN-T. Third, we experiment with a rich transcription strategy for joint recognition and segmentation of multi-party speech. Through an in-depth analysis, we discuss potential pitfalls of the proposed system as well as promising future research directions.
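The abstract does not say how the on-the-fly overlapping speech simulation is implemented. As a rough illustration of the general idea only, the sketch below mixes two single-speaker waveforms at a randomly drawn start offset each time a training example is requested. The function names `mix_with_delay` and `simulate_overlap`, the uniform delay distribution, and the plain sample-wise sum are all assumptions made for this example, not details taken from the paper.

```python
import random
from typing import List, Tuple


def mix_with_delay(utt_a: List[float], utt_b: List[float], delay: int) -> List[float]:
    """Overlay utt_b onto utt_a, with utt_b starting at sample index `delay`."""
    if delay < 0:
        raise ValueError("delay must be non-negative")
    # The mixture is long enough to hold both utterances in full.
    length = max(len(utt_a), delay + len(utt_b))
    mixed = [0.0] * length
    for i, sample in enumerate(utt_a):
        mixed[i] += sample
    for i, sample in enumerate(utt_b):
        mixed[delay + i] += sample
    return mixed


def simulate_overlap(
    utt_a: List[float], utt_b: List[float], rng: random.Random
) -> Tuple[List[float], int]:
    """Draw a random start offset for utt_b and return (mixture, delay).

    Assumption: the delay is uniform over [0, len(utt_a)], so the second
    speaker may start anywhere from fully overlapped to just after the
    first speaker ends.
    """
    delay = rng.randint(0, len(utt_a))
    return mix_with_delay(utt_a, utt_b, delay), delay


if __name__ == "__main__":
    rng = random.Random(0)
    first = [1.0, 1.0, 1.0]
    second = [0.5, 0.5]
    mixture, delay = simulate_overlap(first, second, rng)
    print(delay, mixture)
```

Because a new delay is drawn on every call, the same pair of utterances produces a different mixture each time it is sampled, which is the point of simulating overlap during training rather than preparing a fixed mixed dataset in advance.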

