State-of-the-art automatic speech recognition (ASR) systems are trained on tens of thousands of hours of labeled speech data. Human transcription is expensive and time-consuming, and factors such as transcription quality and consistency can greatly affect the performance of ASR models trained on these data. In this paper, we show that recent self-supervised and semi-supervised learning techniques can be used to train a strong teacher model that produces high-quality pseudo labels. Specifically, we use JUST (Joint Unsupervised/Supervised Training) and iterative noisy student-teacher training to train a 600-million-parameter bidirectional teacher model. This model achieves a 4.0% word error rate (WER) on a voice search task, an 11.1% relative improvement over the baseline. We further show that using this strong teacher model to generate high-quality pseudo labels for training yields a 13.6% relative WER reduction (5.9% to 5.1%) for a streaming model, compared to training with human labels.
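The pseudo-labeling loop described above can be sketched minimally as follows. This is an illustrative toy, not the paper's implementation: the real teacher is a 600M-parameter bidirectional ASR model and the student is a streaming model, whereas here both are stand-in functions, and all names (`transcribe`, `pseudo_label`, `train_student`) are hypothetical.

```python
# Minimal sketch of noisy-student pseudo-labeling, assuming toy stand-ins
# for the teacher and student models (the real system trains large neural
# ASR models; none of that machinery is shown here).

def transcribe(model, utterance):
    # Stand-in for model inference: a "model" here is just a dict
    # mapping utterance IDs to transcripts.
    return model.get(utterance, "")

def pseudo_label(teacher, unlabeled_utterances):
    """Run the teacher over unlabeled audio to produce pseudo labels."""
    return [(u, transcribe(teacher, u)) for u in unlabeled_utterances]

def train_student(labeled_pairs):
    """Stand-in for supervised training: this toy 'student' simply
    memorizes the (utterance, transcript) pairs it was trained on."""
    return dict(labeled_pairs)

# Toy data: the teacher already "knows" transcripts for these utterances.
teacher = {"utt1": "play some music", "utt2": "weather today"}
unlabeled = ["utt1", "utt2"]

pairs = pseudo_label(teacher, unlabeled)   # teacher generates pseudo labels
student = train_student(pairs)             # student trains on them

# In iterative noisy-student training, the trained student (with input
# noise/augmentation during training) becomes the next round's teacher.
```

The key design point the sketch preserves is the data flow: unlabeled audio goes through the teacher once to produce transcripts, and the student is then trained exactly as if those transcripts were human labels.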