Mixup is a popular data augmentation technique that creates new samples by linear interpolation between two given data samples, improving both the generalization and robustness of the trained model. Knowledge distillation (KD), on the other hand, is widely used for model compression and transfer learning; it uses a larger network's implicit knowledge to guide the learning of a smaller network. At first glance, these two techniques seem very different; however, we find that ``smoothness'' is the connecting link between them and a crucial attribute for understanding KD's interplay with mixup. Although many mixup variants and distillation methods have been proposed, much remains to be understood about the role of mixup in knowledge distillation. In this paper, we present a detailed empirical study of several important dimensions of compatibility between mixup and knowledge distillation. We also scrutinize the behavior of networks trained with mixup in the light of knowledge distillation through extensive analysis, visualizations, and comprehensive experiments on image classification. Finally, based on our findings, we suggest improved strategies for guiding the student network to enhance its effectiveness. The findings of this study also offer practical suggestions to researchers and practitioners who commonly use KD techniques. Our code is available at https://github.com/hchoi71/MIX-KD.
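For concreteness, the sketch below illustrates the two standard building blocks referred to above: mixup augmentation (convex combination of two inputs drawn with a Beta-distributed coefficient) and Hinton-style distillation (soft teacher targets combined with the hard-label loss). It is a minimal PyTorch-style illustration, not the exact training pipeline of this paper; the hyperparameters `alpha`, `T`, and `kd_weight` are placeholder values chosen for the example.

```python
import numpy as np
import torch
import torch.nn.functional as F

def mixup_batch(x, y, alpha=0.2):
    """Standard mixup: interpolate each sample with a randomly paired sample.

    Returns the mixed inputs, the two label sets, and the mixing coefficient,
    so the loss can be computed as lam * loss(y_a) + (1 - lam) * loss(y_b).
    """
    lam = np.random.beta(alpha, alpha) if alpha > 0 else 1.0
    index = torch.randperm(x.size(0), device=x.device)
    x_mixed = lam * x + (1.0 - lam) * x[index]
    return x_mixed, y, y[index], lam

def kd_loss(student_logits, teacher_logits, labels, T=4.0, kd_weight=0.9):
    """Knowledge distillation loss: temperature-softened teacher targets plus hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients to account for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return kd_weight * soft + (1.0 - kd_weight) * hard
```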