Knowledge distillation is an effective machine learning technique for transferring knowledge from a teacher model to a smaller student model, especially with unlabeled data. In this paper, we focus on knowledge distillation for the RNN-T model, which is widely used in state-of-the-art (SoTA) automatic speech recognition (ASR). Specifically, we compare soft and hard target distillation for training large-scale RNN-T models on the LibriSpeech/LibriLight public dataset (60k hours) and our in-house data (600k hours). We find that hard targets are more effective when the teacher and student have different architectures, such as a large teacher and a small streaming student. On the other hand, soft target distillation works better in self-training scenarios such as iterative large teacher training. For a large model with 0.6B weights, we achieve a new SoTA word error rate (WER) on LibriSpeech (8% relative improvement on dev-other) using Noisy Student Training with soft target distillation. It also allows our production teacher to continuously adapt to new data domains.
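As a rough sketch of the distinction drawn above (not the exact objectives used in this work, whose details follow in later sections), soft target distillation typically minimizes a divergence between the teacher's and student's output distributions, whereas hard target distillation trains the student with the standard RNN-T loss on the teacher's decoded hypotheses used as pseudo-labels:

\[
\mathcal{L}_{\text{soft}} = \sum_{t,u} \mathrm{KL}\!\left( P_{\text{teacher}}(\cdot \mid t, u) \,\middle\|\, P_{\text{student}}(\cdot \mid t, u) \right),
\qquad
\mathcal{L}_{\text{hard}} = \mathcal{L}_{\text{RNN-T}}\!\left( \hat{y}_{\text{teacher}}, x \right),
\]

where \(P(\cdot \mid t, u)\) denotes an RNN-T output distribution at time frame \(t\) and label position \(u\), \(x\) is the (unlabeled) input utterance, and \(\hat{y}_{\text{teacher}}\) is the teacher's 1-best transcript for \(x\).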