Sequence transducers, such as the RNN-T and the Conformer-T, are among the most promising models for end-to-end speech recognition, especially in streaming scenarios where both latency and accuracy matter. Although various methods, such as alignment-restricted training and FastEmit, have been studied to reduce latency, latency reduction is often accompanied by a significant degradation in accuracy. We argue that this suboptimal performance may arise because none of the prior methods explicitly models and reduces the latency. In this paper, we propose a new training method that explicitly models and reduces the latency of sequence transducer models. First, we define the expected latency at each diagonal line on the lattice, and show that its gradient can be computed efficiently within the forward-backward algorithm. Then we augment the transducer loss with this expected latency, so that an optimal trade-off between latency and accuracy is achieved. Experimental results on the WSJ dataset show that the proposed minimum latency training reduces the latency of causal Conformer-T from 220 ms to 27 ms within a WER degradation of 0.7%, and outperforms conventional alignment-restricted training (110 ms) and FastEmit (67 ms) methods.
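To make the quantities in the abstract concrete, the sketch below computes the transducer forward-backward variables on the (frame, label) lattice and then an expected latency over the diagonals t + u = n, using the fact that every alignment path crosses each diagonal exactly once. This is a minimal illustration, not the paper's exact formulation: the function name, the NumPy setting, and the use of the raw frame index t as the latency proxy (with no offset against a reference alignment) are assumptions made here for clarity.

```python
import numpy as np

def expected_diagonal_latency(log_probs, labels, blank=0):
    """Sketch: transducer forward-backward plus per-diagonal expected latency.

    log_probs: (T, U+1, V) joint-network log-softmax outputs, where node
        (t, u) means t frames consumed and u labels emitted.
    labels: length-U sequence of target label ids.
    Returns (nll, latency): the transducer negative log-likelihood and the
        expected crossing frame summed over all diagonals t + u = n.
    """
    T, U1, _ = log_probs.shape
    U = U1 - 1

    # Forward pass: log_alpha[t, u] = log P(paths reaching node (t, u)).
    log_alpha = np.full((T, U1), -np.inf)
    log_alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U1):
            if t > 0:  # blank transition from (t-1, u)
                log_alpha[t, u] = np.logaddexp(
                    log_alpha[t, u],
                    log_alpha[t - 1, u] + log_probs[t - 1, u, blank])
            if u > 0:  # label transition from (t, u-1)
                log_alpha[t, u] = np.logaddexp(
                    log_alpha[t, u],
                    log_alpha[t, u - 1] + log_probs[t, u - 1, labels[u - 1]])

    # Backward pass: log_beta[t, u] = log P(completing a path from (t, u)),
    # including the final blank emitted at the terminal node (T-1, U).
    log_beta = np.full((T, U1), -np.inf)
    log_beta[T - 1, U] = log_probs[T - 1, U, blank]
    for t in range(T - 1, -1, -1):
        for u in range(U, -1, -1):
            if t == T - 1 and u == U:
                continue
            if t < T - 1:
                log_beta[t, u] = np.logaddexp(
                    log_beta[t, u],
                    log_probs[t, u, blank] + log_beta[t + 1, u])
            if u < U:
                log_beta[t, u] = np.logaddexp(
                    log_beta[t, u],
                    log_probs[t, u, labels[u]] + log_beta[t, u + 1])

    log_p = log_beta[0, 0]                        # total sequence log-prob
    gamma = np.exp(log_alpha + log_beta - log_p)  # lattice node posteriors

    # Each path visits exactly one node per diagonal t + u = n, so the
    # posteriors on a diagonal sum to one and
    #   l(n) = sum_{t+u=n} gamma[t, u] * t
    # is the expected frame at which decoding crosses that diagonal.
    latency = 0.0
    for n in range(T + U):
        for u in range(max(0, n - T + 1), min(n, U) + 1):
            latency += gamma[n - u, u] * (n - u)
    return -log_p, latency
```

Training along the lines of the abstract would then minimize nll + lam * latency for some trade-off weight lam (a hypothetical name here): pushing the expected crossing frame t earlier on every diagonal is equivalent to emitting labels sooner. In an autograd framework the gradient of the latency term flows through gamma automatically, while the paper shows it can be computed efficiently within the forward-backward recursion itself.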