Unlike existing knowledge distillation methods that focus on baseline settings, where the teacher models and training strategies are not as strong and competitive as state-of-the-art approaches, this paper presents a method dubbed DIST to distill better from a stronger teacher. We empirically find that the discrepancy between the predictions of the student and a stronger teacher tends to be fairly severe. As a result, exactly matching the predictions with KL divergence would disturb the training and make existing methods perform poorly. In this paper, we show that simply preserving the relations between the predictions of the teacher and the student suffices, and we propose a correlation-based loss to explicitly capture the intrinsic inter-class relations from the teacher. Besides, considering that different instances have different semantic similarities to each class, we also extend this relational match to the intra-class level. Our method is simple yet practical, and extensive experiments demonstrate that it adapts well to various architectures, model sizes, and training strategies, and that it consistently achieves state-of-the-art performance on image classification, object detection, and semantic segmentation tasks. Code is available at: https://github.com/hunto/DIST_KD .
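For intuition, below is a minimal PyTorch sketch of a correlation-based relational loss of this kind: the inter-class term correlates each instance's predicted class distribution between teacher and student, and the intra-class term correlates each class's scores across the batch. The helper names, the temperature `tau`, and the equal weighting of the two terms are illustrative assumptions, not the exact loss or hyperparameters of the official DIST_KD implementation.

```python
import torch
import torch.nn.functional as F


def pearson_corr(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Row-wise Pearson correlation between two 2-D tensors of equal shape."""
    a = a - a.mean(dim=-1, keepdim=True)
    b = b - b.mean(dim=-1, keepdim=True)
    a = a / (a.norm(dim=-1, keepdim=True) + eps)
    b = b / (b.norm(dim=-1, keepdim=True) + eps)
    return (a * b).sum(dim=-1)


def relational_kd_loss(student_logits: torch.Tensor,
                       teacher_logits: torch.Tensor,
                       tau: float = 1.0) -> torch.Tensor:
    """Sketch of a correlation-based distillation loss (hypothetical weighting).

    student_logits, teacher_logits: (B, C) raw logits for a batch.
    """
    p_s = F.softmax(student_logits / tau, dim=-1)  # (B, C) student probabilities
    p_t = F.softmax(teacher_logits / tau, dim=-1)  # (B, C) teacher probabilities

    # Inter-class: preserve the relation among classes within each instance (rows).
    inter_loss = (1.0 - pearson_corr(p_s, p_t)).mean()
    # Intra-class: preserve the relation among instances within each class (columns).
    intra_loss = (1.0 - pearson_corr(p_s.t(), p_t.t())).mean()

    return inter_loss + intra_loss
```

Because Pearson correlation is invariant to per-row shift and scale, the student is only asked to preserve the teacher's relative ranking and spread of predictions rather than to reproduce them exactly, which is the relaxation the abstract argues for when the teacher is much stronger.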