Consistency regularization has recently been applied to semi-supervised sequence-to-sequence (S2S) automatic speech recognition (ASR). This principle encourages an ASR model to output similar predictions for the same input speech under different perturbations. The existing paradigm of semi-supervised S2S ASR utilizes SpecAugment as data augmentation and requires a static teacher model to produce pseudo transcripts for untranscribed speech. However, this paradigm fails to take full advantage of consistency regularization. First, the masking operations of SpecAugment may damage the linguistic content of the speech, thus degrading the quality of pseudo labels. Second, S2S ASR requires both input speech and prefix tokens to make the next prediction. The static prefix tokens produced by the offline teacher model cannot match the dynamic pseudo labels during consistency training. In this work, we propose an improved consistency training paradigm for semi-supervised S2S ASR. We utilize speech chain reconstruction as the weak augmentation to generate high-quality pseudo labels. Moreover, we demonstrate that dynamic pseudo transcripts produced by the student ASR model benefit consistency training. Experiments on the LJSpeech and LibriSpeech corpora show that, compared to supervised baselines, our improved paradigm achieves a 12.2% CER improvement in the single-speaker setting and a 38.6% improvement in the multi-speaker setting.
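The sketch below illustrates the consistency-training step described above, not the authors' implementation: the student model decodes a dynamic pseudo transcript from a weakly augmented view of untranscribed speech and is then trained, with that same transcript as the decoder prefix, to reproduce it on a strongly augmented (SpecAugment-style) view. The `ToyS2SASR` model, the crude time-masking in `spec_augment`, and the noise-based `weak_augment` placeholder (standing in for speech chain reconstruction) are all illustrative assumptions.

```python
# Minimal sketch of consistency training with dynamic pseudo labels for S2S ASR.
# All components here are toy stand-ins, not the paper's actual models.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, BOS, PAD = 32, 1, 0

class ToyS2SASR(nn.Module):
    """Tiny GRU encoder-decoder standing in for a real S2S ASR model."""
    def __init__(self, feat_dim=80, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(VOCAB, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, VOCAB)

    def forward(self, speech, prefix_tokens):
        # speech: (B, T, feat_dim); prefix_tokens: (B, U)
        _, h = self.encoder(speech)            # summarize speech into the initial state
        dec_out, _ = self.decoder(self.embed(prefix_tokens), h)
        return self.out(dec_out)               # (B, U, VOCAB) next-token logits

def spec_augment(speech, num_masks=2, width=8):
    """Strong augmentation: crude time masking in the spirit of SpecAugment."""
    speech = speech.clone()
    for _ in range(num_masks):
        t0 = torch.randint(0, max(1, speech.size(1) - width), (1,)).item()
        speech[:, t0:t0 + width, :] = 0.0
    return speech

def weak_augment(speech):
    """Placeholder for speech chain reconstruction (ASR->TTS->speech in the paper);
    here we only add light noise so the sketch runs."""
    return speech + 0.01 * torch.randn_like(speech)

@torch.no_grad()
def greedy_decode(model, speech, max_len=20):
    """Dynamic pseudo transcript produced by the current student model."""
    tokens = torch.full((speech.size(0), 1), BOS, dtype=torch.long)
    for _ in range(max_len):
        logits = model(speech, tokens)
        next_tok = logits[:, -1].argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens

def consistency_loss(model, unlabeled_speech):
    """Train the strong view to match pseudo labels decoded from the weak view,
    using the same (dynamic) pseudo transcript as the decoder prefix."""
    pseudo = greedy_decode(model, weak_augment(unlabeled_speech))   # (B, U+1)
    logits = model(spec_augment(unlabeled_speech), pseudo[:, :-1])  # teacher forcing
    return F.cross_entropy(logits.reshape(-1, VOCAB),
                           pseudo[:, 1:].reshape(-1), ignore_index=PAD)

if __name__ == "__main__":
    model = ToyS2SASR()
    speech = torch.randn(4, 100, 80)  # a batch of untranscribed utterances
    loss = consistency_loss(model, speech)
    loss.backward()
    print(f"consistency loss: {loss.item():.3f}")
```

In a full semi-supervised setup, this consistency term would be combined with the usual supervised cross-entropy loss on transcribed speech; because the pseudo transcript is re-decoded by the student at every step, the prefix tokens stay consistent with the current pseudo labels.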