Recent studies have shown that the benefits provided by self-supervised pre-training and self-training (pseudo-labeling) are complementary. Semi-supervised fine-tuning strategies under the pre-training framework, however, remain insufficiently studied. Moreover, modern semi-supervised speech recognition algorithms either treat unlabeled data indiscriminately or filter out noisy samples with a confidence threshold; the differences among unlabeled samples are often ignored. In this paper, we propose Censer, a semi-supervised speech recognition algorithm based on self-supervised pre-training that maximizes the utilization of unlabeled data. The pre-training stage of Censer adopts wav2vec2.0, and the fine-tuning stage employs a semi-supervised learning algorithm improved from slimIPL, which leverages unlabeled data progressively according to the quality of their pseudo-labels. We also incorporate a temporal pseudo-label pool and an exponential moving average to control the pseudo-label update frequency and to avoid model divergence. Experimental results on the Libri-Light and LibriSpeech datasets demonstrate that our proposed method achieves better performance than existing approaches while also being more unified.
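The abstract mentions an exponential moving average for pseudo-label stability and a temporal pseudo-label pool. The minimal sketch below illustrates one plausible way such components could be organized, assuming a PyTorch-style model; the names `ema_update` and `PseudoLabelPool`, the pool capacity, and the replacement policy are illustrative assumptions, not the paper's exact implementation.

```python
import random
from collections import deque

import torch


def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, decay: float = 0.999) -> None:
    """Blend student weights into a slowly changing teacher (hypothetical sketch).

    A slowly moving teacher makes the pseudo-labels it generates change
    gradually, which is one common way to limit update frequency and
    reduce the risk of divergence.
    """
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)


class PseudoLabelPool:
    """Fixed-size pool of (utterance_id, pseudo_label) pairs (illustrative).

    Newly generated pseudo-labels are pushed into the pool and training
    samples are drawn from it, so each label is reused for several steps
    before being refreshed rather than regenerated every iteration.
    """

    def __init__(self, capacity: int = 1000):
        self.pool: deque = deque(maxlen=capacity)

    def push(self, utt_id: str, pseudo_label: str) -> None:
        self.pool.append((utt_id, pseudo_label))

    def sample(self, batch_size: int):
        k = min(batch_size, len(self.pool))
        return random.sample(list(self.pool), k)
```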