The recurrent neural network transducer (RNN-T) has recently become the mainstream end-to-end approach for streaming automatic speech recognition (ASR). To estimate the output distributions over subword units, RNN-T uses a fully connected layer as the joint network to fuse the acoustic representations extracted by the acoustic encoder with the text representations produced by the prediction network from the previously emitted subword units. In this paper, we propose to use gating, bilinear pooling, and their combination in the joint network to produce more expressive representations to feed into the output layer. A regularisation method is also proposed to enable better acoustic encoder training by reducing the gradients back-propagated into the prediction network at the beginning of RNN-T training. Experimental results on a multilingual ASR setting for voice search over nine languages show that the joint use of the proposed methods yields 4%--5% relative word error rate reductions with only a few million extra parameters.
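To make the fusion concrete, the following is a minimal numpy sketch of a joint network that combines gating and bilinear pooling over one acoustic vector and one prediction-network vector. All weight names, dimensions, and the exact way the two mechanisms are combined are illustrative assumptions, not the paper's parameterisation (in particular, a full third-order bilinear tensor would be factorised into low-rank form in practice).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8   # shared hidden size of encoder / prediction outputs (assumed)
V = 16  # subword vocabulary size (assumed)

# Hypothetical joint-network parameters; names are illustrative.
W_gate_f = rng.normal(scale=0.1, size=(D, D))   # gate weights, acoustic side
W_gate_g = rng.normal(scale=0.1, size=(D, D))   # gate weights, text side
W_bilinear = rng.normal(scale=0.1, size=(D, D, D))  # low-rank in practice
W_out = rng.normal(scale=0.1, size=(D, V))      # output (softmax) layer

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def joint(f, g):
    """Fuse acoustic vector f and text vector g.

    Gating: a sigmoid gate computed from both streams interpolates
    between them, so each stream can suppress the other.
    Bilinear pooling: a multiplicative second-order interaction f^T W g
    that captures pairwise feature products the gate alone cannot.
    """
    gate = sigmoid(f @ W_gate_f + g @ W_gate_g)           # (D,)
    gated = gate * f + (1.0 - gate) * g                   # gated fusion
    bilinear = np.einsum('i,ijk,j->k', f, W_bilinear, g)  # second-order term
    h = np.tanh(gated + bilinear)                         # fused representation
    return h @ W_out                                      # logits over subwords

f = rng.normal(size=D)   # one acoustic encoder frame
g = rng.normal(size=D)   # one prediction-network state
probs = softmax(joint(f, g))
```

The additive fully connected joint of a standard RNN-T corresponds to dropping the `gate` and `bilinear` terms; the sketch shows why the proposed variants add only a few extra weight matrices on top of that baseline.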