Knowledge distillation has been widely used to improve the performance of a "student" network by training it to mimic the soft probabilities of a "teacher" network. Yet, for self-distillation to work, the student must somehow deviate from the teacher (Stanton et al., 2021). But what is the nature of these deviations, and how do they relate to gains in generalization? We investigate these questions through a series of experiments across image and language classification datasets. First, we observe that distillation consistently deviates in a characteristic way: on points where the teacher has low confidence, the student achieves even lower confidence than the teacher. Second, we find that deviations in the initial dynamics of training are not crucial: simply switching to the distillation loss in the middle of training recovers much of distillation's gains. We then provide two parallel theoretical perspectives to understand the role of student-teacher deviations in our experiments, one casting distillation as a regularizer in eigenspace, and the other as a gradient denoiser. Our analysis bridges several gaps between existing theory and practice by (a) focusing on gradient-descent training, (b) avoiding label-noise assumptions, and (c) unifying several disjoint empirical and theoretical findings.
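For concreteness, below is a minimal sketch of the loss-switching setup the abstract alludes to: the student is first trained on the one-hot labels, then switched mid-training to matching the teacher's temperature-softened probabilities. The PyTorch API usage is standard, but the temperature, switch point, and function names are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only; hyperparameters (T, switch_step) are assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between teacher and student soft probabilities at temperature T."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # The T**2 factor keeps gradient magnitudes comparable to the one-hot loss.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

def training_step(student, teacher, x, y, step, switch_step=5000):
    """One-hot cross-entropy early in training; distillation loss after the switch."""
    student_logits = student(x)
    if step < switch_step:
        return F.cross_entropy(student_logits, y)
    with torch.no_grad():
        teacher_logits = teacher(x)
    return distillation_loss(student_logits, teacher_logits)
```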