Recent studies have pointed out that knowledge distillation (KD) suffers from two degradation problems, namely the teacher-student gap and the incompatibility with strong data augmentations, making it inapplicable to training state-of-the-art models, which rely on advanced augmentations. However, we observe that a key factor, i.e., the temperatures in the softmax functions that generate the probabilities of both the teacher and student models, was mostly overlooked in previous methods. With properly tuned temperatures, these degradation problems of KD can be largely mitigated. Instead of relying on a naive grid search, which transfers poorly, we propose Meta Knowledge Distillation (MKD) to meta-learn the distillation with learnable meta temperature parameters. The meta parameters are adaptively adjusted during training according to the gradients of the learning objective. We validate that MKD is robust to different dataset scales, different teacher/student architectures, and different types of data augmentation. With MKD, we achieve the best performance among compared methods that use only ImageNet-1K as training data, across popular ViT architectures ranging from tiny to large. With ViT-L, we achieve 86.5% with 600 epochs of training, 0.6% better than MAE, which trains for 1,650 epochs.
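To make the central idea concrete, below is a minimal sketch, assuming PyTorch, of a KD loss whose teacher and student temperatures are trainable parameters that receive gradients from the training objective. This is not the authors' code: the class name `LearnableTemperatureKD` and its arguments are hypothetical, and the full MKD procedure meta-learns the temperatures rather than simply optimizing them jointly with the student as done here.

```python
# Minimal sketch (not the authors' implementation) of a KD loss with
# learnable temperature parameters, assuming PyTorch. Names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableTemperatureKD(nn.Module):
    """KD loss whose teacher/student softmax temperatures are nn.Parameters,
    so they are adjusted by gradients of the distillation objective."""
    def __init__(self, init_teacher_t: float = 1.0, init_student_t: float = 1.0):
        super().__init__()
        # Learnable temperature parameters for teacher and student softmax.
        self.teacher_t = nn.Parameter(torch.tensor(init_teacher_t))
        self.student_t = nn.Parameter(torch.tensor(init_student_t))

    def forward(self, student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
        # Soften each distribution with its own temperature.
        t_prob = F.softmax(teacher_logits / self.teacher_t, dim=-1)
        s_logprob = F.log_softmax(student_logits / self.student_t, dim=-1)
        # Cross-entropy between the softened teacher and student distributions.
        return -(t_prob * s_logprob).sum(dim=-1).mean()

# Usage sketch: temperatures are optimized together with the student weights
# (in MKD proper they would instead be updated as meta parameters).
# kd = LearnableTemperatureKD()
# optimizer = torch.optim.AdamW(
#     list(student.parameters()) + list(kd.parameters()), lr=1e-3)
# loss = kd(student(x), teacher(x).detach())
```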