Knowledge distillation aims to transfer useful information from a teacher network to a student network, with the primary goal of improving the student's performance on the task at hand. Over the years, there has been a deluge of novel knowledge distillation techniques and use cases. Yet, despite the various improvements, there seems to be a glaring gap in the community's fundamental understanding of the process. Specifically, what is the knowledge that gets distilled in knowledge distillation? In other words, in what ways does the student become similar to the teacher? Does it start to localize objects in the same way? Does it get fooled by the same adversarial samples? Do its data invariance properties become similar? Our work presents a comprehensive study to try to answer these questions and more. Our results, using image classification as a case study and three state-of-the-art knowledge distillation techniques, show that knowledge distillation methods can indeed indirectly distill other kinds of properties beyond improving task performance. And while we believe that understanding the distillation process is important in itself, we also demonstrate that our results can pave the way for important practical applications.
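For readers unfamiliar with the setup, the sketch below shows the classic response-based distillation objective of Hinton et al. (2015), in which the student matches the teacher's temperature-softened output distribution. This is only a minimal reference implementation of the canonical objective, not the three specific techniques studied in this work; the function name and the hyperparameters `T` and `alpha` are illustrative choices.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Classic response-based knowledge distillation loss (Hinton et al., 2015).

    Blends a KL-divergence term, which pulls the student's softened output
    distribution toward the teacher's, with the ordinary cross-entropy on
    the ground-truth labels. T and alpha are illustrative hyperparameters.
    """
    # KL term on temperature-softened distributions; the T**2 factor keeps
    # gradient magnitudes comparable across different temperatures.
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T ** 2)

    # Standard supervised term on the hard labels.
    ce = F.cross_entropy(student_logits, labels)

    return alpha * kd + (1.0 - alpha) * ce
```

The questions raised above ask whether minimizing such an objective transfers more than the task-specific output behavior it explicitly matches, e.g., object localization, adversarial vulnerability, or invariance properties.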