This work describes an encoder pre-training procedure that uses frame-wise labels to improve the training of a streaming recurrent neural network transducer (RNN-T) model. A streaming RNN-T trained from scratch usually performs worse than a non-streaming RNN-T. Although it is common to address this issue by pre-training components of the RNN-T with other criteria or with frame-wise alignment guidance, such alignments are not readily available in an end-to-end framework. In this work, the frame-wise alignment used to pre-train the streaming RNN-T's encoder is generated without an HMM-based system, yielding an all-neural framework with HMM-free encoder pre-training. The alignment is obtained by expanding the spikes of a CTC model to their neighboring left/right blank frames, and two expanding strategies are proposed. To the best of our knowledge, this is the first work to simulate HMM-based frame-wise labels with a CTC model for pre-training. Experiments on the LibriSpeech and MLS English tasks show that, compared with random initialization, the proposed pre-training procedure reduces the WER by a relative 5%~11% and the emission latency by 60 ms. Moreover, the method is lexicon-free and is therefore readily applied to new languages without a manually designed lexicon.
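The spike-expansion idea can be sketched in a few lines. The snippet below is a hypothetical illustration, not the paper's implementation: given a per-frame CTC label sequence where most frames are blank and each non-blank "spike" frame carries a label, the blank frames are reassigned to a neighboring spike's label. The `strategy` argument mimics the two expanding directions mentioned above: `"left"` lets each spike claim the blank frames before it, `"right"` the blank frames after it; the function name, blank id, and boundary handling are all assumptions.

```python
BLANK = 0  # assumed blank label id

def expand_spikes(frame_labels, strategy="right"):
    """Densify sparse CTC spike outputs into a frame-wise alignment.

    Hypothetical sketch: each non-blank spike frame's label is spread
    over the adjacent blank frames, either to the left or to the right,
    up to the previous/next spike. Leading/trailing blanks fall back to
    the nearest spike's label.
    """
    T = len(frame_labels)
    spikes = [t for t, lab in enumerate(frame_labels) if lab != BLANK]
    dense = list(frame_labels)
    if not spikes:
        return dense  # no spikes: nothing to expand
    if strategy == "right":
        # each spike fills the blank frames that follow it
        for i, t in enumerate(spikes):
            end = spikes[i + 1] if i + 1 < len(spikes) else T
            for u in range(t, end):
                dense[u] = frame_labels[t]
        # blanks before the first spike take the first spike's label
        for u in range(spikes[0]):
            dense[u] = frame_labels[spikes[0]]
    else:  # "left"
        # each spike fills the blank frames that precede it
        for i, t in enumerate(spikes):
            start = spikes[i - 1] + 1 if i > 0 else 0
            for u in range(start, t + 1):
                dense[u] = frame_labels[t]
        # blanks after the last spike take the last spike's label
        for u in range(spikes[-1] + 1, T):
            dense[u] = frame_labels[spikes[-1]]
    return dense
```

With spikes for labels 3 and 5 at frames 2 and 5 of a 7-frame utterance, `"right"` expansion yields `[3, 3, 3, 3, 3, 5, 5]` and `"left"` expansion yields `[3, 3, 3, 5, 5, 5, 5]`; either dense sequence can then serve as a frame-wise cross-entropy target for encoder pre-training.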