This paper presents InterMPL, a semi-supervised learning method for end-to-end automatic speech recognition (ASR) that performs pseudo-labeling (PL) with intermediate supervision. Momentum PL (MPL) trains a connectionist temporal classification (CTC)-based model on unlabeled data by continuously generating pseudo-labels on the fly and improving their quality. In contrast to autoregressive formulations, such as the attention-based encoder-decoder and transducer, CTC is well suited for MPL, and for PL-based semi-supervised ASR in general, owing to its simple and fast inference algorithm and its robustness against generating collapsed labels. However, CTC generally performs worse than the autoregressive models due to the conditional independence assumption, which limits the performance of MPL. We propose to enhance MPL by introducing intermediate losses, inspired by recent advances in CTC-based modeling. Specifically, we focus on self-conditioned and hierarchical conditional CTC, which apply auxiliary CTC losses to intermediate layers such that the conditional independence assumption is explicitly relaxed. We also explore how pseudo-labels should be generated and used as supervision for the intermediate losses. Experimental results in different semi-supervised settings demonstrate that the proposed approach outperforms MPL and improves an ASR model by up to 12.1% absolute. In addition, our detailed analysis validates the importance of the intermediate losses.
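To make the idea of intermediate supervision concrete, the following is a minimal PyTorch sketch of attaching auxiliary CTC losses to intermediate encoder layers and combining them with the final-layer CTC loss. All layer choices, the interpolation weight `inter_weight`, and module names are illustrative assumptions, not the paper's exact configuration, and the self-conditioning feedback of intermediate predictions into later layers is omitted for brevity.

```python
# Sketch: intermediate CTC losses on selected encoder layers, combined with
# the final CTC loss. Hyperparameters here are hypothetical placeholders.
import torch
import torch.nn as nn


class InterCTCEncoder(nn.Module):
    def __init__(self, input_dim=80, hidden=256, vocab=32,
                 num_layers=6, inter_layers=(3,)):
        super().__init__()
        self.proj_in = nn.Linear(input_dim, hidden)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
            for _ in range(num_layers)
        )
        self.inter_layers = set(inter_layers)  # layers with auxiliary CTC heads
        self.ctc_head = nn.Linear(hidden, vocab)  # output head (shared here)

    def forward(self, x):
        h = self.proj_in(x)
        inter_logits = []
        for i, layer in enumerate(self.layers, start=1):
            h = layer(h)
            if i in self.inter_layers:
                inter_logits.append(self.ctc_head(h))  # auxiliary prediction
        return self.ctc_head(h), inter_logits


def interctc_loss(model, x, targets, in_lens, tgt_lens, inter_weight=0.5):
    """Interpolate the final CTC loss with the mean of intermediate CTC losses."""
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    final_logits, inter_logits = model(x)

    def loss_of(logits):
        # CTCLoss expects log-probs of shape (T, B, V)
        log_probs = logits.log_softmax(-1).transpose(0, 1)
        return ctc(log_probs, targets, in_lens, tgt_lens)

    final = loss_of(final_logits)
    inter = torch.stack([loss_of(l) for l in inter_logits]).mean()
    return (1 - inter_weight) * final + inter_weight * inter
```

In a semi-supervised setting, `targets` would be the (pseudo-)label sequences; the abstract's question of how pseudo-labels should supervise the intermediate heads corresponds to choosing what is passed as `targets` for each auxiliary loss.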