Knowledge Distillation (KD) is a widely used technique for transferring knowledge from cumbersome teacher models to compact student models, thereby realizing model compression and acceleration. Compared with image classification, object detection is a more complex task, and designing KD methods specific to object detection is non-trivial. In this work, we carefully study the behaviour difference between teacher and student detection models, and obtain two intriguing observations: First, the teacher and student rank their detected candidate boxes quite differently, which results in their precision discrepancy. Second, there is a considerable gap between the feature response differences and the prediction differences of teacher and student, indicating that equally imitating all feature maps of the teacher is a sub-optimal choice for improving the student's accuracy. Based on these two observations, we propose Rank Mimicking (RM) and Prediction-guided Feature Imitation (PFI) for distilling one-stage detectors. RM takes the teacher's ranking of candidate boxes as a new form of knowledge to distill, which consistently outperforms traditional soft-label distillation. PFI correlates feature differences with prediction differences, making feature imitation directly help to improve the student's accuracy. On the MS COCO and PASCAL VOC benchmarks, extensive experiments are conducted on various detectors with different backbones to validate the effectiveness of our method. Specifically, RetinaNet with ResNet50 achieves 40.4% mAP on MS COCO, which is 3.5% higher than its baseline and also outperforms previous KD methods.
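To make the two ideas concrete, below is a minimal sketch of how an RM-style ranking loss and a PFI-style weighted feature-imitation loss could look. The tensor shapes, the temperature, the use of KL divergence over candidates, and the normalization of the prediction-difference weights are all assumptions for illustration; the paper's exact formulations may differ.

```python
# Hypothetical sketches of Rank Mimicking (RM) and Prediction-guided Feature
# Imitation (PFI) losses; shapes and normalization choices are assumptions.
import torch
import torch.nn.functional as F


def rank_mimicking_loss(student_scores, teacher_scores, temperature=1.0):
    """RM sketch: distill the teacher's ranking of candidate boxes by matching
    softmax distributions taken over the candidate dimension.

    student_scores, teacher_scores: (num_candidates,) classification scores of
    the candidate boxes associated with one ground-truth object.
    """
    # Softmax over candidates turns raw scores into a rank-like distribution.
    p_teacher = F.softmax(teacher_scores / temperature, dim=0)
    log_p_student = F.log_softmax(student_scores / temperature, dim=0)
    # KL divergence pushes the student to reproduce the teacher's ordering.
    return F.kl_div(log_p_student, p_teacher, reduction="sum")


def prediction_guided_feature_imitation(student_feat, teacher_feat,
                                        student_pred, teacher_pred):
    """PFI sketch: weight per-location feature imitation by how much the
    student's predictions deviate from the teacher's at that location.

    student_feat, teacher_feat: (C, H, W) feature maps from one FPN level.
    student_pred, teacher_pred: (A, H, W) per-location prediction maps
    (e.g. classification logits), used only to build the spatial weights.
    """
    with torch.no_grad():
        # Per-location prediction discrepancy, averaged over the class/anchor dim.
        diff = (teacher_pred.sigmoid() - student_pred.sigmoid()).abs().mean(dim=0)
        # Normalize so the spatial weights sum to one.
        weight = diff / diff.sum().clamp(min=1e-6)  # (H, W)
    # Weighted L2 imitation: locations where the predictions disagree most
    # contribute most to the feature-imitation loss.
    per_loc_l2 = ((student_feat - teacher_feat) ** 2).mean(dim=0)  # (H, W)
    return (weight * per_loc_l2).sum()
```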