Recently, Transformer-based end-to-end models have achieved great success in many areas, including speech recognition. However, compared to LSTM models, the heavy computational cost of the Transformer during inference is a key issue preventing its application. In this work, we explore the potential of Transformer Transducer (T-T) models for first-pass decoding with low latency and fast speed on a large-scale dataset. We combine the idea of Transformer-XL and chunk-wise streaming processing to design a streamable Transformer Transducer model. We demonstrate that T-T outperforms the hybrid model, the RNN Transducer (RNN-T), and the streamable Transformer attention-based encoder-decoder model in the streaming scenario. Furthermore, the runtime cost and latency can be optimized with a relatively small look-ahead.
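To make the chunk-wise streaming idea concrete, the following is a minimal NumPy sketch of the kind of self-attention mask such a design implies: each frame attends to its current chunk, a fixed number of cached history chunks (the Transformer-XL-style context reuse), and an optional small look-ahead. The function name `chunkwise_attention_mask` and its parameters are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def chunkwise_attention_mask(seq_len: int,
                             chunk_size: int,
                             num_left_chunks: int,
                             lookahead_frames: int = 0) -> np.ndarray:
    """Boolean self-attention mask for chunk-wise streaming.

    mask[t, s] is True when frame t may attend to frame s:
    frames in up to `num_left_chunks` history chunks, the
    current chunk, and `lookahead_frames` future frames.
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for t in range(seq_len):
        chunk_idx = t // chunk_size
        left = max(0, (chunk_idx - num_left_chunks) * chunk_size)
        right = min(seq_len, (chunk_idx + 1) * chunk_size + lookahead_frames)
        mask[t, left:right] = True
    return mask

# Tiny example: 8 frames, chunks of 2, one history chunk, 1-frame look-ahead.
print(chunkwise_attention_mask(8, 2, 1, 1).astype(int))
```

In this sketch, enlarging the chunk size or `lookahead_frames` exposes more future context at the cost of latency, which is consistent with the abstract's claim that runtime cost and latency can be optimized with a relatively small look-ahead.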