This paper presents InterMPL, a semi-supervised learning method for end-to-end automatic speech recognition (ASR) that performs pseudo-labeling (PL) with intermediate supervision. Momentum PL (MPL) trains a connectionist temporal classification (CTC)-based model on unlabeled data by continuously generating pseudo-labels on the fly and improving their quality. In contrast to autoregressive formulations, such as the attention-based encoder-decoder and the transducer, CTC is well suited for MPL, and for PL-based semi-supervised ASR in general, owing to its simple and fast inference algorithm and its robustness against generating collapsed labels. However, CTC generally performs worse than autoregressive models due to its conditional independence assumption, which in turn limits the performance of MPL. We propose to enhance MPL by introducing intermediate losses, inspired by recent advances in CTC-based modeling. Specifically, we focus on self-conditioned and hierarchical conditional CTC, which apply auxiliary CTC losses to intermediate layers such that the conditional independence assumption is explicitly relaxed. We also explore how pseudo-labels should be generated and used as supervision for the intermediate losses. Experimental results in different semi-supervised settings demonstrate that the proposed approach outperforms MPL and improves an ASR model by up to 12.1% absolute. In addition, our detailed analysis validates the importance of the intermediate losses.
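For reference, a minimal sketch of how the two mechanisms named above are commonly formulated is given below. The interpolation weight $\lambda$, the set of intermediate layers $\mathcal{K}$, and the momentum coefficient $\alpha$ are illustrative symbols following the standard intermediate-CTC and MPL formulations, not values specified in this abstract:

\[
\mathcal{L} \;=\; (1 - \lambda)\,\mathcal{L}_{\mathrm{CTC}}\big(Y \mid X^{(L)}\big) \;+\; \frac{\lambda}{|\mathcal{K}|} \sum_{k \in \mathcal{K}} \mathcal{L}_{\mathrm{CTC}}\big(Y \mid X^{(k)}\big),
\qquad
\bar{\theta} \;\leftarrow\; \alpha\,\bar{\theta} + (1 - \alpha)\,\theta,
\]

where $X^{(k)}$ is the encoder representation after layer $k$ (with $L$ the final layer), $Y$ is the (pseudo-)label sequence, $\theta$ denotes the online model being trained, and $\bar{\theta}$ the offline model that generates pseudo-labels on the fly. In self-conditioned CTC, the intermediate posterior computed at each layer in $\mathcal{K}$ is additionally fed back into the subsequent layer, which is what explicitly relaxes the conditional independence assumption.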