The recurrent neural network transducer (RNN-T) is a promising end-to-end speech recognition framework that transduces input acoustic frames into a character sequence. The state-of-the-art encoder network for RNN-T is the Conformer, which can effectively model local and global context information via its convolution and self-attention layers. Although the Conformer RNN-T has shown outstanding performance (typically measured by word error rate (WER)), most studies have been verified in settings where the training and test data are drawn from the same domain. The domain mismatch problem for the Conformer RNN-T has not yet been intensively investigated, although it is an important issue for production-level speech recognition systems. In this study, we identified that the fully connected self-attention layers in the Conformer cause high deletion errors, specifically on long-form out-of-domain utterances. To address this problem, we introduce sparse self-attention layers for Conformer-based encoder networks, which can exploit local and generalized global information by pruning most of the in-domain-fitted global connections. Furthermore, we propose a state reset method that generalizes the prediction network to cope with long-form utterances. Applying the proposed methods to an out-of-domain test set, we obtained 24.6\% and 6.5\% relative character error rate (CER) reductions compared to the fully connected and local self-attention layer-based Conformers, respectively.
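The core idea of restricting self-attention to local context, as a baseline against the fully connected layers the abstract criticizes, can be illustrated with a banded attention mask. The sketch below is a minimal single-head illustration with hypothetical helper names, not the paper's actual sparse pattern (which additionally retains a pruned subset of generalized global connections):

```python
import numpy as np

def local_attention_mask(seq_len, window):
    # Band mask: position i may attend only to positions j with |i - j| <= window,
    # removing the fully connected (global) attention links.
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def local_self_attention(x, window):
    # x: (seq_len, d) frame representations; single-head scaled dot-product
    # attention restricted to the local band above.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    mask = local_attention_mask(x.shape[0], window)
    scores = np.where(mask, scores, -1e9)  # prune non-local connections
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x
```

With a fixed window, the attention cost and behavior no longer depend on total utterance length, which is one intuition for why local (and sparse) attention generalizes better to long-form out-of-domain speech than fully connected attention fitted to in-domain lengths.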