Knowledge distillation (KD) is a general neural-network training approach in which a teacher model guides a student model. Existing works mainly study KD from the network output side (e.g., trying to design a better KD loss function), while few have attempted to understand it from the input side. In particular, its interplay with data augmentation (DA) has not been well understood. In this paper, we ask: Why do some DA schemes (e.g., CutMix) inherently perform much better than others in KD? What makes a "good" DA in KD? Our investigation from a statistical perspective suggests that a good DA scheme should reduce the covariance of the teacher-student cross-entropy. A practical metric, the stddev of the teacher's mean probability (T. stddev), is further presented and well justified empirically. Beyond the theoretical understanding, we also introduce a new entropy-based data-mixing DA scheme, CutMixPick, to further enhance CutMix. Extensive empirical studies support our claims and demonstrate how we can harvest considerable performance gains simply by using a better DA scheme in knowledge distillation.
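To make the T. stddev metric concrete, below is a minimal PyTorch sketch of one plausible way to compute it, assuming it is the standard deviation, across repeated applications of a DA scheme, of the teacher's mean predicted probability on the ground-truth class. The function name, the use of the ground-truth class probability, and the repeat-based averaging are illustrative assumptions, not the paper's exact protocol.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def teacher_stddev(teacher, dataloader, augment, num_repeats=10, device="cpu"):
    """Sketch of T. stddev for a DA scheme `augment` (a callable on an image batch)."""
    teacher.eval()
    mean_probs = []  # one scalar per augmentation repeat
    for _ in range(num_repeats):
        probs = []
        for images, labels in dataloader:
            images, labels = images.to(device), labels.to(device)
            images = augment(images)                      # apply the DA scheme under study
            p = F.softmax(teacher(images), dim=1)         # teacher's predicted probabilities
            probs.append(p.gather(1, labels[:, None]).squeeze(1))  # prob. of the true class
        mean_probs.append(torch.cat(probs).mean())        # mean teacher probability this repeat
    return torch.stack(mean_probs).std()                  # stddev across repeats = "T. stddev"
```

Under this reading, a DA scheme with a lower T. stddev yields more stable teacher supervision across augmented views, which is the property the abstract associates with a "good" DA for KD.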