Recent studies of streaming automatic speech recognition (ASR) systems based on the recurrent neural network transducer (RNN-T) have fed the encoder with past contextual information in order to improve word error rate (WER). In this paper, we first propose a contextual-utterance training technique that uses the previous and future contextual utterances to perform an implicit adaptation to the speaker, topic, and acoustic environment. We also propose a dual-mode contextual-utterance training technique for streaming ASR systems. This approach makes better use of the acoustic context available to streaming models by distilling "in-place" the knowledge of a teacher, which can see both past and future contextual utterances, into a student, which can only see the current and past contextual utterances. Experimental results show that a conformer-transducer system trained with the proposed techniques outperforms the same system trained with the classical RNN-T loss. Specifically, the proposed technique reduces the WER by more than 6% relative and the average last-token emission latency by more than 40 ms.
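The dual-mode "in-place" distillation idea can be summarized in a short training-step sketch. The snippet below is a minimal illustration only, assuming a PyTorch model whose forward pass accepts a hypothetical `full_context` flag to switch between the teacher (full-context) and student (streaming) modes of the same network; `asr_loss_fn` and the weight `alpha` are likewise placeholders, not the paper's exact implementation.

```python
# Minimal sketch of dual-mode "in-place" knowledge distillation for a
# streaming transducer. The model interface (`full_context` flag) and
# the loss weighting are assumptions for illustration, not the paper's
# actual conformer-transducer implementation.
import torch
import torch.nn.functional as F

def dual_mode_step(model, features, targets, asr_loss_fn, alpha=0.5):
    """One training step: a transducer loss on the streaming (student)
    mode plus a KL distillation term toward the full-context (teacher)
    mode of the *same* model ("in-place" distillation)."""
    # Teacher pass: the model sees both past and future context.
    # Gradients are blocked so the teacher only guides the student.
    with torch.no_grad():
        teacher_logits = model(features, full_context=True)

    # Student pass: streaming mode, current and past context only.
    student_logits = model(features, full_context=False)

    # Standard transducer loss on the student outputs (a real system
    # would use e.g. torchaudio.functional.rnnt_loss here).
    asr_loss = asr_loss_fn(student_logits, targets)

    # In-place distillation: pull the student's output distribution
    # toward the teacher's.
    distill_loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return asr_loss + alpha * distill_loss
```

Because teacher and student share one set of weights, no separate teacher model needs to be trained or stored; the only overhead is the extra full-context forward pass per step.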