Knowledge distillation (KD) is an effective framework for transferring knowledge from a large-scale teacher to a compact yet well-performing student. Previous KD practices for pre-trained language models mainly transfer knowledge by aligning instance-wise outputs between the teacher and student, while neglecting an important knowledge source, i.e., the gradient of the teacher. The gradient characterizes how the teacher responds to changes in its inputs, which we assume helps the student better approximate the teacher's underlying mapping function. We therefore propose Gradient Knowledge Distillation (GKD), which incorporates a gradient alignment objective into the distillation process. Experimental results show that GKD outperforms previous KD methods in terms of student performance. Further analysis shows that incorporating gradient knowledge makes the student behave more consistently with the teacher, greatly improving interpretability.
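As a rough illustration only (the abstract does not specify the exact formulation), a gradient alignment objective of the kind described here can be sketched as an auxiliary term added to the standard distillation loss; the weighting coefficient $\lambda$, the squared $\ell_2$ penalty, and differentiating with respect to the input $\mathbf{x}$ are assumptions for illustration, not details taken from this section:
$$
\mathcal{L}_{\text{GKD}} \;=\; \mathcal{L}_{\text{KD}} \;+\; \lambda \,\bigl\| \nabla_{\mathbf{x}} f_T(\mathbf{x}) - \nabla_{\mathbf{x}} f_S(\mathbf{x}) \bigr\|_2^2,
$$
where $f_T$ and $f_S$ denote the outputs of the teacher and student, respectively, and $\mathcal{L}_{\text{KD}}$ is the usual instance-wise output alignment loss.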