Knowledge distillation (KD) has gained much attention due to its effectiveness in compressing large-scale pre-trained models. In typical KD methods, the small student model is trained to match the soft targets generated by the large teacher model. However, the interaction between student and teacher is one-way: the teacher is usually fixed once trained, so the soft targets to be distilled remain static. This one-way interaction leaves the teacher unable to perceive the characteristics of the student or its training progress. To address this issue, we propose Interactive Knowledge Distillation (IKD), which also allows the teacher to learn to teach from the student's feedback. In particular, IKD trains the teacher model to generate specific soft targets at each training step for a specific student. Joint optimization of both teacher and student is achieved by two iterative steps: a course step that optimizes the student with the teacher's soft targets, and an exam step that optimizes the teacher with the student's feedback. IKD is a general framework that is orthogonal to most existing knowledge distillation methods. Experimental results show that IKD outperforms traditional KD methods on various NLP tasks.
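To make the alternating course/exam procedure concrete, below is a minimal PyTorch sketch of one possible instantiation. The toy linear models, the temperature `T`, the loss weighting, and in particular the form of the exam-step feedback loss are illustrative assumptions for this sketch, not the actual objectives used in IKD.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the large pre-trained teacher and the compact student.
teacher = nn.Linear(32, 10)
student = nn.Linear(32, 10)
opt_t = torch.optim.Adam(teacher.parameters(), lr=1e-4)
opt_s = torch.optim.Adam(student.parameters(), lr=1e-3)
T = 2.0  # distillation temperature (assumed value)

def kd_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """Standard KD objective: CE with labels + KL to the teacher's soft targets."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

for step in range(100):
    x = torch.randn(16, 32)          # dummy batch
    y = torch.randint(0, 10, (16,))  # dummy labels

    # Course step: freeze the teacher, optimize the student on its soft targets.
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)
    loss_s = kd_loss(s_logits, t_logits, y)
    opt_s.zero_grad()
    loss_s.backward()
    opt_s.step()

    # Exam step: freeze the student, optimize the teacher using the student's
    # feedback. Here the feedback is simply the distillation KL recomputed with
    # gradients flowing into the teacher (a hypothetical choice), plus the
    # teacher's own CE loss so it stays accurate while adapting its soft
    # targets to this particular student.
    with torch.no_grad():
        s_logits = student(x)
    t_logits = teacher(x)
    feedback = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    loss_t = feedback + F.cross_entropy(t_logits, y)
    opt_t.zero_grad()
    loss_t.backward()
    opt_t.step()
```

Because the two steps alternate on every batch, the teacher's soft targets evolve with the student's training progress rather than staying static, which is the interaction the abstract describes; the precise feedback signal used in the exam step is where IKD's actual design would differ from this placeholder.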