We propose a novel method to accelerate the training and inference of the recurrent neural network transducer (RNN-T), based on guidance from a co-trained connectionist temporal classification (CTC) model. We make the key assumption that if an encoder embedding frame is classified as a blank frame by the CTC model, that frame is likely to align to blank in all partial alignments or hypotheses of the RNN-T, and can therefore be discarded from the decoder input. We also show that this frame-reduction operation can be applied in the middle of the encoder, which results in a significant speedup for both training and inference in RNN-T. We further show that the CTC alignment, a by-product of the CTC decoder, can be used to perform lattice reduction for RNN-T during training. Our method is evaluated on the LibriSpeech and SpeechStew tasks. We demonstrate that the proposed method accelerates RNN-T inference by 2.2 times with similar or slightly better word error rates (WER).
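To make the frame-reduction idea concrete, here is a minimal sketch in PyTorch (a framework assumption; the abstract does not prescribe one). It keeps an encoder frame only when the co-trained CTC head's frame-level argmax is a non-blank label; the function name, the `blank_id` default, and the argmax-based blank rule are illustrative choices, not necessarily the authors' exact criterion.

```python
import torch

def ctc_guided_frame_reduction(enc_out, ctc_logits, blank_id=0, lengths=None):
    """Drop encoder frames that a co-trained CTC head labels as blank.

    enc_out:    (B, T, D) encoder embedding frames
    ctc_logits: (B, T, V) CTC output logits sharing the same time axis
    lengths:    optional (B,) true frame counts per utterance
    Returns a list of per-utterance reduced frame tensors (variable length).
    """
    # Keep a frame only if the CTC argmax at that frame is a non-blank token
    # (illustrative rule; a posterior threshold would be a natural variant).
    keep = ctc_logits.argmax(dim=-1) != blank_id  # (B, T) boolean mask
    if lengths is not None:
        # Never keep padded frames beyond each utterance's true length.
        time = torch.arange(enc_out.size(1), device=enc_out.device)
        keep &= time.unsqueeze(0) < lengths.unsqueeze(1)
    # Gather surviving frames; sequence lengths now differ across the batch.
    return [enc_out[b, keep[b]] for b in range(enc_out.size(0))]
```

The surviving frames have variable length across the batch, so in practice they would be re-padded before the RNN-T joint network; applying the same mask in the middle of the encoder, as the abstract describes, also shortens the sequence seen by all subsequent encoder layers.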