In this paper we revisit the efficacy of knowledge distillation as a function matching and metric learning problem. In doing so, we verify three important design decisions, namely the normalisation, the soft maximum function, and the projection layer, as key ingredients. We theoretically show that the projector implicitly encodes information on past examples, enabling relational gradients for the student. We then show that the normalisation of representations is tightly coupled with the training dynamics of this projector, which can have a large impact on the student's performance. Finally, we show that a simple soft maximum function can be used to address any significant capacity gap problems. Experimental results on various benchmark datasets demonstrate that using these insights can lead to superior or comparable performance to state-of-the-art knowledge distillation techniques, despite being much more computationally efficient. In particular, we obtain these results across image classification (CIFAR100 and ImageNet), object detection (COCO2017), and on more difficult distillation objectives, such as training data-efficient transformers, whereby we attain a 77.2% top-1 accuracy with DeiT-Ti on ImageNet.
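To make the three ingredients concrete, the sketch below illustrates one plausible way they could fit together in PyTorch: a linear projector on the student features, batch normalisation of both representations, and a soft maximum (LogSumExp) over the element-wise differences. This is a minimal illustration under our own assumptions about the exact loss form; the class name, dimensions, and the use of two separate `BatchNorm1d` layers are illustrative choices, not the paper's reference implementation.

```python
import torch
import torch.nn as nn


class SoftMaxFeatureDistillation(nn.Module):
    """Hypothetical sketch combining the three ingredients named in the abstract:
    a learned projector, normalisation of representations, and a soft maximum
    (LogSumExp) aggregation of the feature differences."""

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Linear projector mapping student features into the teacher's feature space.
        self.projector = nn.Linear(student_dim, teacher_dim)
        # Batch normalisation (no affine parameters) to whiten each representation.
        self.bn_student = nn.BatchNorm1d(teacher_dim, affine=False)
        self.bn_teacher = nn.BatchNorm1d(teacher_dim, affine=False)

    def forward(self, z_student: torch.Tensor, z_teacher: torch.Tensor) -> torch.Tensor:
        # z_student: (batch, student_dim); z_teacher: (batch, teacher_dim)
        zs = self.bn_student(self.projector(z_student))
        zt = self.bn_teacher(z_teacher.detach())  # no gradient through the teacher
        # LogSumExp acts as a smooth (soft) maximum over the per-dimension
        # absolute differences, emphasising the largest mismatches.
        diff = (zs - zt).abs()
        return torch.logsumexp(diff, dim=-1).mean()


# Example usage with illustrative feature dimensions:
# loss_fn = SoftMaxFeatureDistillation(student_dim=512, teacher_dim=2048)
# loss = loss_fn(student_backbone(x), teacher_backbone(x))
```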