Knowledge distillation has become an important technique for model compression and acceleration. Conventional knowledge distillation approaches transfer knowledge from a teacher to a student network by minimizing the KL-divergence between their probabilistic outputs, which considers only the relationship between individual output representations of the teacher and the student. Recently, contrastive loss-based knowledge distillation has been proposed to enable the student to learn the teacher's instance-discriminative knowledge by mapping representations of the same image close together and representations of different images far apart in the representation space. However, all of these methods ignore that the teacher's knowledge is multi-level, e.g., at the individual, relational, and categorical levels, and these different levels of knowledge cannot be effectively captured by a single kind of supervisory signal. Here, we introduce Multi-level Knowledge Distillation (MLKD) to transfer richer representational knowledge from the teacher to the student network. MLKD employs three novel teacher-student similarities: individual similarity, relational similarity, and categorical similarity, which encourage the student network to learn the sample-wise, structure-wise, and category-wise knowledge in the teacher network. Experiments demonstrate that MLKD outperforms other state-of-the-art methods on both similar-architecture and cross-architecture tasks. We further show that MLKD improves the transferability of the representations learned by the student network.
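Since the abstract only names the three similarity terms without defining them, the following minimal PyTorch-style sketch is an assumption about how individual-, relational-, and categorical-level supervision could be instantiated: a temperature-scaled KL term on logits, pairwise feature-similarity matching, and class-centroid similarity matching. The function names, the loss weights `w_ind`, `w_rel`, `w_cat`, and the temperature are illustrative placeholders, not the paper's actual formulations.

```python
# Hypothetical sketch of a multi-level distillation loss; the exact MLKD
# definitions are not given in the abstract, so the three terms below are
# standard stand-ins for individual-, relational-, and categorical-level
# supervision. All names and hyperparameters are illustrative only.
import torch
import torch.nn.functional as F


def individual_loss(student_logits, teacher_logits, T=4.0):
    """Sample-wise KL divergence between softened output distributions."""
    p_t = F.softmax(teacher_logits / T, dim=1)
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)


def relational_loss(student_feats, teacher_feats):
    """Structure-wise loss: match pairwise cosine-similarity matrices."""
    s = F.normalize(student_feats, dim=1)
    t = F.normalize(teacher_feats, dim=1)
    return F.mse_loss(s @ s.t(), t @ t.t())


def categorical_loss(student_feats, teacher_feats, labels, num_classes):
    """Category-wise loss: match similarities between class centroids."""
    one_hot = F.one_hot(labels, num_classes).float()        # (B, C)
    counts = one_hot.sum(dim=0).clamp(min=1).unsqueeze(1)   # (C, 1)
    s_cent = F.normalize(one_hot.t() @ student_feats / counts, dim=1)
    t_cent = F.normalize(one_hot.t() @ teacher_feats / counts, dim=1)
    return F.mse_loss(s_cent @ s_cent.t(), t_cent @ t_cent.t())


def multi_level_kd_loss(student_logits, teacher_logits,
                        student_feats, teacher_feats,
                        labels, num_classes,
                        w_ind=1.0, w_rel=1.0, w_cat=1.0):
    """Weighted sum of the three illustrative distillation terms."""
    return (w_ind * individual_loss(student_logits, teacher_logits)
            + w_rel * relational_loss(student_feats, teacher_feats)
            + w_cat * categorical_loss(student_feats, teacher_feats,
                                       labels, num_classes))
```

In this reading, the total loss would be combined with the usual cross-entropy on ground-truth labels during student training; how MLKD actually balances the three levels is left to the paper itself.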