End-to-end approaches have drawn much attention recently because they significantly simplify the construction of an automatic speech recognition (ASR) system. The RNN transducer (RNN-T) is one of the popular end-to-end methods. Previous studies have shown that RNN-T is difficult to train and that a very complex training process is needed to reach reasonable performance. In this paper, we explore RNN-T for a Chinese large vocabulary continuous speech recognition (LVCSR) task and aim to simplify the training process while maintaining performance. First, a new learning rate decay strategy is proposed to accelerate model convergence. Second, we find that adding convolutional layers at the beginning of the network and using ordered training data allow us to discard the encoder pre-training process without loss of performance. In addition, we design experiments to find a balance among GPU memory usage, training cycle length, and model performance. Finally, we achieve a 16.9% character error rate (CER) on our test set, a 2% absolute improvement over a strong BLSTM-CE system whose language model is trained on the same text corpus.
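As a rough illustration of the convolutional front-end idea mentioned above, the PyTorch sketch below places two 2-D convolutional layers that downsample the input features in time and frequency before a bidirectional LSTM encoder. The abstract does not specify the actual configuration, so the layer counts, channel sizes, and strides here are illustrative assumptions, not the authors' settings.

    # Minimal sketch: conv layers in front of a BLSTM RNN-T encoder.
    # All hyperparameters below are assumptions for illustration only.
    import torch
    import torch.nn as nn

    class ConvFrontEndEncoder(nn.Module):
        """Two stride-2 conv layers (4x time reduction), then a BLSTM stack."""
        def __init__(self, feat_dim=80, hidden=512, lstm_layers=4):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
                nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
            )
            # Frequency dimension after two stride-2 convs with padding 1.
            conv_out_dim = 32 * ((feat_dim + 3) // 4)
            self.lstm = nn.LSTM(conv_out_dim, hidden, num_layers=lstm_layers,
                                batch_first=True, bidirectional=True)

        def forward(self, x):                 # x: (batch, time, feat_dim)
            x = self.conv(x.unsqueeze(1))     # -> (batch, 32, ~time/4, ~feat/4)
            b, c, t, f = x.shape
            x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
            out, _ = self.lstm(x)             # -> (batch, ~time/4, 2*hidden)
            return out

    enc = ConvFrontEndEncoder()
    feats = torch.randn(4, 200, 80)           # 4 utterances, 200 frames, 80-dim fbank
    print(enc(feats).shape)                   # torch.Size([4, 50, 1024])

Beyond any regularization effect, the stride-2 convolutions shorten the frame sequence fed to the LSTMs by a factor of four, which also reduces GPU memory use and training time, consistent with the trade-off the paper studies.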