Knowledge distillation addresses the problem of transferring knowledge from a teacher model to a student model. In this process, multiple types of knowledge are typically extracted from the teacher model, and the problem is to make full use of them when training the student model. Our preliminary study shows that: (1) not all of the knowledge is necessary for learning a good student model, and (2) knowledge distillation benefits from different knowledge at different training steps. In response to these findings, we propose an actor-critic approach to selecting appropriate knowledge to transfer during the process of knowledge distillation. In addition, we offer a refinement of the training algorithm to ease the computational burden. Experimental results on the GLUE datasets show that our method significantly outperforms several strong knowledge distillation baselines.
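To make the idea concrete, below is a minimal, illustrative sketch (not the authors' implementation) of how an actor-critic policy could pick one type of teacher knowledge to distill at each training step. The class and function names (`KnowledgeSelector`, `distill_step`, `kd_losses`), the state representation, and the reward definition (decrease in the student's task loss) are all assumptions made for illustration, written against PyTorch.

```python
# Hypothetical sketch: actor-critic selection of knowledge types for distillation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class KnowledgeSelector(nn.Module):
    """Actor-critic over K knowledge types (e.g., logits, hidden states, attention)."""

    def __init__(self, state_dim: int, num_knowledge: int):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                                   nn.Linear(64, num_knowledge))
        self.critic = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                                    nn.Linear(64, 1))

    def forward(self, state):
        logits = self.actor(state)               # unnormalized selection scores
        value = self.critic(state).squeeze(-1)   # baseline value estimate
        return torch.distributions.Categorical(logits=logits), value


def distill_step(student, teacher_outputs, batch, kd_losses, selector,
                 opt_student, opt_selector, state):
    """One training step: the actor samples a knowledge type, the student is
    updated with the corresponding distillation loss, and the actor-critic is
    updated with a (hypothetical) reward = decrease in the student task loss."""
    dist, value = selector(state)
    action = dist.sample()                       # index of the chosen knowledge type

    loss_before = F.cross_entropy(student(batch["x"]), batch["y"]).item()
    kd_loss = kd_losses[action.item()](student, teacher_outputs, batch)
    opt_student.zero_grad()
    kd_loss.backward()
    opt_student.step()
    loss_after = F.cross_entropy(student(batch["x"]), batch["y"]).item()

    reward = loss_before - loss_after            # improvement of the student
    advantage = reward - value.item()            # advantage w.r.t. the critic's baseline
    actor_loss = -dist.log_prob(action) * advantage           # policy-gradient term
    critic_loss = F.mse_loss(value, torch.tensor(reward))     # value regression
    opt_selector.zero_grad()
    (actor_loss + critic_loss).backward()
    opt_selector.step()
    return action.item(), reward
```

In this sketch the selector is trained online alongside the student; the paper's refinement for reducing the computational burden of this loop is not reflected here.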