Recently, End-to-End (E2E) frameworks have achieved remarkable results on various Automatic Speech Recognition (ASR) tasks. However, Lattice-Free Maximum Mutual Information (LF-MMI), one of the discriminative training criteria that shows superior performance in hybrid ASR systems, is rarely adopted in E2E ASR frameworks. In this work, we propose a novel approach to integrate the LF-MMI criterion into E2E ASR frameworks in both the training and decoding stages. The proposed approach demonstrates its effectiveness on two of the most widely used E2E frameworks: Attention-Based Encoder-Decoders (AEDs) and Neural Transducers (NTs). Experiments suggest that introducing the LF-MMI criterion consistently leads to significant performance improvements across various datasets and different E2E ASR frameworks. Our best model achieves a competitive CER of 4.1\% / 4.4\% on the Aishell-1 dev/test sets; we also achieve significant error reductions on the Aishell-2 and LibriSpeech datasets over strong baselines.
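A common way to integrate an auxiliary discriminative criterion such as LF-MMI into E2E training is to interpolate it with the framework's native loss (attention cross-entropy for AEDs, the transducer loss for NTs). The sketch below illustrates only this generic interpolation idea; the weight `lfmmi_weight` and the linear combination form are illustrative assumptions, not necessarily the exact formulation used in this work.

```python
# Hedged sketch of multi-criterion training: linearly interpolate the
# E2E framework's native loss with an LF-MMI loss. The weighting scheme
# here is a generic assumption for illustration, not the paper's method.

def combined_loss(native_loss: float, lfmmi_loss: float,
                  lfmmi_weight: float = 0.3) -> float:
    """Interpolate the native E2E loss with the LF-MMI criterion."""
    return (1.0 - lfmmi_weight) * native_loss + lfmmi_weight * lfmmi_loss

# Example: attention CE loss of 2.0, LF-MMI loss of 1.0, weight 0.3
print(combined_loss(2.0, 1.0, 0.3))  # ≈ 1.7
```

In practice both terms would be computed per minibatch from the same encoder output, and the interpolation weight is typically tuned on a development set.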