We present a training scheme for streaming automatic speech recognition (ASR) based on recurrent neural network transducers (RNN-T) that allows the encoder network to learn to exploit context audio from a stream, using segmented or partially labeled sequences of the stream during training. We show that using context audio during both training and inference can reduce word error rate by more than 6% in a realistic production setting for a voice assistant ASR system. We investigate the effect of the proposed training approach on acoustically challenging data containing background speech, and present data points indicating that the approach helps the network learn both speaker and environment adaptation. To gain further insight into the ability of a long short-term memory (LSTM) based ASR encoder to exploit long-term context, we also visualize RNN-T loss gradients with respect to the input.