Self-supervised learning via masked prediction pre-training (MPPT) has shown impressive performance on a range of speech-processing tasks. This paper proposes a method to bias self-supervised learning towards a specific task. The core idea is to slightly fine-tune the model that is used to obtain the target sequence. This leads to better performance and a substantial increase in training speed. Furthermore, this paper proposes a variant of MPPT that allows low-footprint streaming models to be trained effectively by computing the MPPT loss on both masked and unmasked frames. These approaches are evaluated for automatic speech recognition on the LibriSpeech corpus, with 100 hours of data serving as the labelled data and 860 hours as the unlabelled data. Biased training outperforms unbiased training on test-other by 15.5% after 250k updates and by 23.8% after 100k updates. For the streaming models, the proposed pre-training approach yields a 44.1% reduction in word error rate.
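For concreteness, the sketch below illustrates the loss used by the streaming variant as described above: a frame-level cross-entropy against discrete teacher targets, computed on both masked and unmasked frames. This is a minimal PyTorch sketch, not code from the paper; the function name mppt_loss, the unmasked_weight term, and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mppt_loss(logits, targets, mask, unmasked_weight=1.0):
    """Cross-entropy MPPT loss over masked and unmasked frames (illustrative).

    logits:  (B, T, V) frame-level predictions over the discrete target codebook
    targets: (B, T)    target labels produced by the (slightly fine-tuned) teacher
    mask:    (B, T)    boolean, True where the input frame was masked
    """
    # Per-frame cross-entropy; cross_entropy expects the class dim second.
    ce = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")  # (B, T)
    # Average separately over masked and unmasked positions, then combine.
    return ce[mask].mean() + unmasked_weight * ce[~mask].mean()

# Toy usage: 2 utterances, 50 frames, 500 target clusters, ~50% of frames masked.
B, T, V = 2, 50, 500
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))
mask = torch.rand(B, T) < 0.5
print(mppt_loss(logits, targets, mask).item())
```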