Knowledge distillation (KD) is a general neural network training approach that uses a teacher to guide a student. Existing works mainly study KD from the network output side (e.g., trying to design a better KD loss function), while few have attempted to understand it from the input side. In particular, its interplay with data augmentation (DA) has not been well understood. In this paper, we ask: Why do some DA schemes (e.g., CutMix) inherently perform much better than others in KD? What makes a "good" DA in KD? Our investigation from a statistical perspective suggests that a good DA scheme should reduce the variance of the teacher's mean probability, which eventually leads to a lower generalization gap for the student. Beyond this theoretical understanding, we also introduce a new entropy-based data-mixing DA scheme to enhance CutMix. Extensive empirical studies support our claims and demonstrate how considerable performance gains can be harvested simply by using a better DA scheme in knowledge distillation.
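To make the statistical claim concrete, below is a minimal sketch of one plausible way to estimate the quantity the abstract refers to: the variance, across stochastic augmentation draws, of the teacher's probability on the ground-truth class. The names `teacher`, `augment`, and `n_views` are illustrative placeholders, and the exact estimator used in the paper may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def teacher_prob_variance(teacher, images, labels, augment, n_views=10):
    """Estimate the variance, over stochastic augmentations, of the
    teacher's probability assigned to the ground-truth class.

    Note: `augment` is assumed to be a callable applying one random draw
    of the DA scheme (e.g., CutMix, flipping) to a batch of images.
    """
    teacher.eval()
    per_view_probs = []
    for _ in range(n_views):
        views = augment(images)                    # one stochastic DA draw
        logits = teacher(views)
        p = F.softmax(logits, dim=1)               # (N, C) class probabilities
        p_true = p.gather(1, labels.unsqueeze(1))  # prob. on the true class
        per_view_probs.append(p_true.squeeze(1))
    per_view_probs = torch.stack(per_view_probs, dim=0)   # (n_views, N)
    # Variance over augmentation draws, averaged over the batch
    return per_view_probs.var(dim=0, unbiased=True).mean()
```

Under the abstract's claim, comparing two DA schemes with such an estimator (e.g., simple flipping vs. CutMix) and preferring the one with the lower variance would be expected to yield a smaller generalization gap for the distilled student.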