Mixup is a popular data augmentation technique that creates new samples by linear interpolation between two given data samples, with the aim of improving both the generalization and robustness of the trained model. Knowledge distillation (KD), on the other hand, is widely used for model compression and transfer learning, and uses a larger network's implicit knowledge to guide the learning of a smaller network. At first glance, these two techniques seem very different. However, we found that "smoothness" is the connecting link between the two and is also a crucial attribute in understanding KD's interplay with mixup. Although many mixup variants and distillation methods have been proposed, much remains to be understood regarding the role of mixup in knowledge distillation. In this paper, we present a detailed empirical study on various important dimensions of compatibility between mixup and knowledge distillation. We also scrutinize the behavior of networks trained with mixup in the light of knowledge distillation through extensive analysis, visualizations, and comprehensive experiments on image classification. Finally, based on our findings, we suggest improved strategies to guide the student network to enhance its effectiveness. Additionally, the findings of this study provide insightful suggestions to researchers and practitioners who commonly use KD techniques. Our code is available at https://github.com/hchoi71/MIX-KD.
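As a rough illustration of the two ingredients named above, the sketch below shows mixup as a convex combination of two samples and their labels, and a commonly used distillation loss: the KL divergence between temperature-softened teacher and student outputs. It is not taken from the authors' repository. The function names, the temperature of 4.0, and the Beta(1, 1) draw for the mixing coefficient are assumptions made for illustration.

```python
import math
import random


def mixup(x_a, y_a, x_b, y_b, lam):
    """Linearly interpolate two samples and their (one-hot) labels."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; a higher temperature gives a smoother output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened outputs, scaled by temperature**2.

    The temperature value is an assumed default, not a setting from the paper.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


# Hypothetical toy inputs: two 2-dimensional samples with one-hot labels.
lam = random.betavariate(1.0, 1.0)  # assumed Beta(alpha, alpha) with alpha = 1
x_mix, y_mix = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam)
print("mixed sample:", x_mix, "mixed label:", y_mix)
print("distillation loss:", kd_loss([2.0, 0.5], [3.0, 0.1]))
```

Both pieces produce soft targets: mixup through interpolated labels, distillation through the temperature-scaled teacher distribution. This gives one concrete reading of why smoothness is a natural lens for studying how the two interact.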