Connectionist temporal classification (CTC)-based models are attractive for automatic speech recognition (ASR) because of their fast inference. Language model (LM) integration approaches such as shallow fusion and rescoring can improve the recognition accuracy of CTC-based ASR by taking advantage of the knowledge in text corpora, but they significantly slow down inference. In this study, we propose to distill the knowledge of BERT for CTC-based ASR, extending our previous study on attention-based ASR. The CTC-based model learns the knowledge of BERT during training and does not use BERT during testing, so the fast inference of CTC is preserved. Unlike attention-based models, CTC-based models make frame-level predictions, which must be aligned with the token-level predictions of BERT for distillation. We propose to obtain these alignments by calculating the most plausible CTC paths. Experimental evaluations on the Corpus of Spontaneous Japanese (CSJ) and TED-LIUM2 show that our method improves the performance of CTC-based ASR at no cost in inference speed.
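To make the alignment step concrete, the following is a minimal sketch (not the authors' released code) of how the most plausible CTC path can be computed: a Viterbi pass over the CTC lattice of the reference token sequence assigns each frame to a blank or a token state, so frames emitting a token can then receive that token's soft labels (e.g., BERT's output distribution) as distillation targets. All names here (`BLANK`, `ctc_best_path_alignment`) are illustrative assumptions.

```python
# Sketch: frame-to-token alignment via the most plausible (Viterbi) CTC path.
import numpy as np

BLANK = 0  # assumed blank index in the CTC vocabulary


def ctc_best_path_alignment(log_probs: np.ndarray, tokens: list[int]) -> list[list[int]]:
    """For each token in `tokens`, return the frames aligned to it on the
    most plausible CTC path.

    log_probs: (T, V) frame-level log-probabilities from the CTC model.
    tokens:    reference token ids (length U, no blanks).
    """
    T = log_probs.shape[0]
    # Extended label sequence with blanks interleaved: length S = 2U + 1.
    ext = [BLANK]
    for tok in tokens:
        ext += [tok, BLANK]
    S = len(ext)

    alpha = np.full((T, S), -np.inf)    # Viterbi scores
    back = np.zeros((T, S), dtype=int)  # backpointers
    alpha[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            # Allowed predecessors: stay, advance by one state, or skip a
            # blank (only between two different non-blank labels).
            cands = [(alpha[t - 1, s], s)]
            if s >= 1:
                cands.append((alpha[t - 1, s - 1], s - 1))
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                cands.append((alpha[t - 1, s - 2], s - 2))
            best, arg = max(cands)
            alpha[t, s] = best + log_probs[t, ext[s]]
            back[t, s] = arg

    # The path must end in the final blank or the final token state.
    s = S - 1 if alpha[T - 1, S - 1] >= alpha[T - 1, S - 2] else S - 2
    states = [s]
    for t in range(T - 1, 0, -1):
        s = back[t, s]
        states.append(s)
    states.reverse()

    # Frames on non-blank states belong to token (s - 1) // 2.
    frames_per_token = [[] for _ in tokens]
    for t, s in enumerate(states):
        if ext[s] != BLANK:
            frames_per_token[(s - 1) // 2].append(t)
    return frames_per_token
```

Under this sketch, a distillation loss could then, for instance, take the KL divergence between the CTC posteriors at each token's aligned frames and BERT's soft label for that token, computed only during training so that test-time inference is unchanged.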