With the ever-growing scale of neural models, knowledge distillation (KD) has attracted increasing attention as a prominent tool for neural model compression. However, counter-intuitive observations in the literature reveal some challenging limitations of KD. A case in point is that the best-performing checkpoint of the teacher is not necessarily the best teacher for training the student in KD. Therefore, an important question is how to find the best checkpoint of the teacher for distillation. Searching through the checkpoints of the teacher would be a tedious and computationally expensive process, which we refer to as the \textit{checkpoint-search} problem. Moreover, another observation is that larger teachers are not necessarily better teachers in KD, which is referred to as the \textit{capacity-gap} problem. To address these challenging problems, in this work we introduce our progressive knowledge distillation (Pro-KD) technique, which defines a smoother training path for the student by following the training footprints of the teacher instead of solely relying on distillation from a single, mature, fully-trained teacher. We demonstrate that our technique is quite effective in mitigating both the capacity-gap problem and the checkpoint-search problem. We evaluate our technique using a comprehensive set of experiments on different tasks, including image classification (CIFAR-10 and CIFAR-100), natural language understanding tasks of the GLUE benchmark, and question answering (SQuAD 1.1 and 2.0) using BERT-based models, and we consistently obtain superior results over state-of-the-art techniques.
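To make the core idea concrete, the following is a minimal sketch (not the paper's exact procedure) of progressive distillation: rather than distilling only from the final teacher checkpoint, the student is trained against a sequence of intermediate teacher checkpoints along the teacher's own training trajectory. The model definitions, checkpoint paths, temperature, and loss weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Standard KD objective: cross-entropy on the labels plus KL divergence to the
    teacher's softened distribution, scaled by T^2 (Hinton-style distillation)."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1 - alpha) * kl

def progressive_distillation(student, teacher, checkpoint_paths, loader, device="cpu"):
    """Train the student in stages; each stage distills from the next teacher
    checkpoint along the teacher's training path (hypothetical configuration)."""
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
    student.to(device).train()
    teacher.to(device).eval()
    for ckpt_path in checkpoint_paths:  # e.g. ["teacher_ep1.pt", ..., "teacher_final.pt"]
        teacher.load_state_dict(torch.load(ckpt_path, map_location=device))
        for inputs, labels in loader:   # one pass over the data per checkpoint in this sketch
            inputs, labels = inputs.to(device), labels.to(device)
            with torch.no_grad():
                teacher_logits = teacher(inputs)
            student_logits = student(inputs)
            loss = kd_loss(student_logits, teacher_logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```

In this sketch the smoother training path comes from the checkpoint schedule itself: early in training the student matches a less mature teacher, and the target gradually hardens as later checkpoints are loaded, which is how the approach is intended to ease both the capacity-gap and checkpoint-search issues.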