Knowledge distillation has been successfully applied to various tasks. Current distillation algorithms usually improve students' performance by having them imitate the teacher's output. This paper shows that teachers can also improve students' representation power by guiding the students' feature recovery. From this point of view, we propose Masked Generative Distillation (MGD), which is simple to apply: we mask random pixels of the student's feature map and force it to generate the teacher's full feature through a simple block. MGD is a truly general feature-based distillation method that can be utilized on various tasks, including image classification, object detection, semantic segmentation and instance segmentation. We experiment on different models with extensive datasets, and the results show that all the students achieve substantial improvements. Notably, we boost ResNet-18 from 69.90% to 71.69% ImageNet top-1 accuracy, RetinaNet with a ResNet-50 backbone from 37.4 to 41.0 bounding-box mAP, SOLO based on ResNet-50 from 33.1 to 36.2 mask mAP, and DeepLabV3 based on ResNet-18 from 73.20 to 76.02 mIoU. Our code is available at https://github.com/yzd-v/MGD.
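The core idea above — masking random pixels of the student's feature map and regressing the teacher's full feature through a simple block — can be sketched as a PyTorch loss module. This is a minimal illustration, not the paper's exact implementation: the mask ratio, the loss weight `alpha`, and the two-convolution design of the generation block are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MGDLoss(nn.Module):
    """Sketch of Masked Generative Distillation (assumed details).

    Random pixels of the student feature map are zeroed out, and a small
    generation block must recover the teacher's full feature map.
    """

    def __init__(self, channels: int, mask_ratio: float = 0.5, alpha: float = 1.0):
        super().__init__()
        self.mask_ratio = mask_ratio  # fraction of pixels masked (assumption)
        self.alpha = alpha            # loss weight (assumption)
        # "Simple block": two 3x3 convs with a ReLU in between (assumed design).
        self.generation = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, feat_student: torch.Tensor, feat_teacher: torch.Tensor) -> torch.Tensor:
        n, c, h, w = feat_student.shape
        # Random per-pixel mask, shared across channels: keep a pixel with
        # probability (1 - mask_ratio), zero it otherwise.
        mask = (torch.rand(n, 1, h, w, device=feat_student.device)
                > self.mask_ratio).float()
        # Generate the teacher's full feature from the masked student feature.
        recovered = self.generation(feat_student * mask)
        return self.alpha * F.mse_loss(recovered, feat_teacher)
```

In training, this loss would be added to the task loss, with the student and teacher features taken from corresponding layers (e.g. backbone or neck outputs); since the block only sees masked inputs, the student is pushed to encode enough context to reconstruct the missing regions.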