Knowledge distillation (KD) has been proven effective for compressing large-scale pre-trained language models. However, existing methods conduct KD statically, e.g., the student model aligns its output distribution to that of a selected teacher model on a pre-defined training dataset. In this paper, we explore whether dynamic knowledge distillation, which empowers the student to adjust the learning procedure according to its competency, can benefit both student performance and learning efficiency. We explore dynamic adjustment along three aspects: teacher model adoption, data selection, and KD objective adaptation. Experimental results show that (1) proper selection of the teacher model can boost the performance of the student model; (2) conducting KD on the 10% most informative instances achieves comparable performance while greatly accelerating training; (3) student performance can be further improved by adjusting the supervision contributions of the different alignment objectives. We find dynamic knowledge distillation promising and discuss potential future directions towards more efficient KD methods. Our code is available at https://github.com/lancopku/DynamicKD.
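As a point of reference for the alignment objectives discussed above, the following is a minimal sketch of the standard static KD loss: a temperature-scaled KL divergence that aligns the student's output distribution with the teacher's, mixed with cross-entropy on the gold labels. The function and parameter names (`kd_loss`, `temperature`, `alpha`) are illustrative assumptions, not the repository's actual API; a dynamic variant would adjust the mixing weight or the selected instances during training rather than fixing them in advance.

```python
# Minimal sketch of the static KD objective (assumed names, not the repo's API).
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            labels: torch.Tensor,
            temperature: float = 2.0,
            alpha: float = 0.5) -> torch.Tensor:
    """Weighted sum of the soft-label (KL) and hard-label (CE) alignment terms."""
    # Soft-label term: KL divergence between temperature-scaled distributions.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    # Hard-label term: standard cross-entropy against the gold labels.
    ce = F.cross_entropy(student_logits, labels)
    # alpha balances the two supervision signals; adapting it per step or per
    # instance is one form of the KD objective adaptation explored in the paper.
    return alpha * kl + (1.0 - alpha) * ce

# Toy usage with random logits: batch of 8 examples, 3 classes.
student_logits = torch.randn(8, 3)
teacher_logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
loss = kd_loss(student_logits, teacher_logits, labels)
```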