We present a simple and efficient auxiliary loss function for automatic speech recognition (ASR) based on the connectionist temporal classification (CTC) objective. The proposed objective, an intermediate CTC loss, is attached to an intermediate layer of the CTC encoder network. This intermediate CTC loss regularizes CTC training and improves performance, requiring only a small code modification, a small overhead during training, and no overhead during inference. In addition, we propose to combine this intermediate CTC loss with stochastic depth training, and apply the combination to the recently proposed Conformer network. We evaluate the proposed method on various corpora, reaching a word error rate (WER) of 9.9% on the WSJ corpus and a character error rate (CER) of 5.2% on the AISHELL-1 corpus, using CTC greedy search without a language model. In particular, the AISHELL-1 result is comparable to other state-of-the-art ASR systems based on an auto-regressive decoder with beam search.
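As a rough sketch of the idea (not the paper's exact implementation), the training objective becomes a weighted sum of the usual final-layer CTC loss and a CTC loss computed from an intermediate encoder layer; the mixing weight `w` below is an illustrative assumption, not a value taken from the paper:

```python
def combined_ctc_loss(final_loss: float, inter_loss: float, w: float = 0.3) -> float:
    """Combine the final-layer CTC loss with an intermediate-layer CTC loss.

    final_loss: CTC loss from the encoder's last layer.
    inter_loss: CTC loss from an intermediate layer (same CTC targets).
    w: mixing weight for the intermediate loss (illustrative assumption).
    """
    return (1.0 - w) * final_loss + w * inter_loss


# Example: if the final CTC loss is 1.0 and the intermediate CTC loss is 2.0,
# an equal weighting (w=0.5) yields a combined loss of 1.5.
loss = combined_ctc_loss(1.0, 2.0, w=0.5)
```

At inference time only the final-layer output is used, which is why the method adds no decoding overhead.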