Knowledge distillation is a popular technique for transferring knowledge from a large teacher model to a smaller student model by having the student mimic the teacher. However, distillation that directly aligns the feature maps of the teacher and the student may impose overly strict constraints on the student and thus degrade its performance. To alleviate this feature-misalignment issue, existing works mainly focus on spatially aligning the feature maps of the teacher and the student via pixel-wise transformations. In this paper, we find that aligning the feature maps of the teacher and the student along the channel dimension is also effective for addressing the feature-misalignment issue. Specifically, we propose a learnable nonlinear channel-wise transformation that aligns the features of the student with those of the teacher. Building on it, we further propose a simple and generic framework for feature distillation with only one hyper-parameter, which balances the distillation loss and the task-specific loss. Extensive experiments show that our method achieves significant performance improvements on various computer vision tasks, including image classification (+3.28% top-1 accuracy for MobileNetV1 on ImageNet-1K), object detection (+3.9% bbox mAP for ResNet50-based Faster R-CNN on MS COCO), instance segmentation (+2.8% mask mAP for ResNet50-based Mask R-CNN), and semantic segmentation (+4.66% mIoU for ResNet18-based PSPNet on Cityscapes), demonstrating the effectiveness and versatility of the proposed method. The code will be made publicly available.
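To make the idea concrete, below is a minimal PyTorch sketch of what a learnable nonlinear channel-wise transformation and the single-hyper-parameter loss could look like. This is an illustration under assumptions, not the paper's implementation: the module name `ChannelAlign`, the choice of 1x1 convolutions with BatchNorm and ReLU as the nonlinear channel mapping, and the hyper-parameter name `alpha` are all hypothetical; the abstract only specifies that the transformation is learnable, nonlinear, channel-wise, and that one hyper-parameter balances the two losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAlign(nn.Module):
    """Hypothetical learnable nonlinear channel-wise transformation.

    Maps student features of shape (B, C_s, H, W) to the teacher's
    channel dimension C_t using 1x1 convolutions, i.e. a per-location
    MLP that mixes channels only and applies no spatial transformation.
    """
    def __init__(self, c_student: int, c_teacher: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(c_student, c_teacher, kernel_size=1, bias=False),
            nn.BatchNorm2d(c_teacher),
            nn.ReLU(inplace=True),  # the nonlinearity along channels
            nn.Conv2d(c_teacher, c_teacher, kernel_size=1, bias=False),
        )

    def forward(self, f_student: torch.Tensor) -> torch.Tensor:
        return self.proj(f_student)


def distillation_loss(f_student: torch.Tensor,
                      f_teacher: torch.Tensor,
                      align: ChannelAlign) -> torch.Tensor:
    """MSE between channel-aligned student features and detached teacher features."""
    return F.mse_loss(align(f_student), f_teacher.detach())


def total_loss(task_loss: torch.Tensor,
               f_student: torch.Tensor,
               f_teacher: torch.Tensor,
               align: ChannelAlign,
               alpha: float = 1.0) -> torch.Tensor:
    """Single hyper-parameter `alpha` balances the task and distillation losses."""
    return task_loss + alpha * distillation_loss(f_student, f_teacher, align)
```

In a hypothetical training loop, one would compute `task_loss` as usual (e.g., cross-entropy for classification), extract intermediate feature maps from both networks, and optimize the student and `ChannelAlign` parameters jointly with `total_loss`; the teacher stays frozen via `detach()`.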