A popular approach to model compression is to train an inexpensive student model to mimic the class probabilities of a highly accurate but cumbersome teacher model. Surprisingly, this two-step knowledge distillation process often leads to higher accuracy than training the student directly on labeled data. To explain and enhance this phenomenon, we cast knowledge distillation as a semiparametric inference problem with the optimal student model as the target, the unknown Bayes class probabilities as nuisance, and the teacher probabilities as a plug-in nuisance estimate. By adapting modern semiparametric tools, we derive new guarantees for the prediction error of standard distillation and develop two enhancements -- cross-fitting and loss correction -- to mitigate the impact of teacher overfitting and underfitting on student performance. We validate our findings empirically on both tabular and image data and observe consistent improvements from our knowledge distillation enhancements.
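To make the cross-fitting enhancement concrete, here is a minimal sketch of cross-fitted distillation, assuming scikit-learn stand-ins for the models: a RandomForestClassifier plays the "cumbersome" teacher and a LogisticRegression the "inexpensive" student, and soft labels are fit via a weighted-replication trick. This illustrates the idea described above (each example's teacher probabilities come from a teacher that never saw that example), not the authors' exact algorithm; the function name and fold scheme are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier   # stand-in "cumbersome" teacher
from sklearn.linear_model import LogisticRegression   # stand-in "inexpensive" student


def cross_fitted_distillation(X, y, n_splits=5, seed=0):
    """Hypothetical sketch of cross-fitted knowledge distillation.

    Assumes integer labels 0..K-1 and that every training fold contains
    all K classes, so teacher.predict_proba columns align across folds.
    """
    n_classes = len(np.unique(y))
    soft_labels = np.zeros((len(y), n_classes))

    # Cross-fitting: teacher probabilities for each held-out fold are produced
    # by a teacher trained only on the complementary folds, mitigating the
    # effect of teacher overfitting on the distilled student.
    for train_idx, held_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        teacher = RandomForestClassifier(n_estimators=200, random_state=seed)
        teacher.fit(X[train_idx], y[train_idx])
        soft_labels[held_idx] = teacher.predict_proba(X[held_idx])

    # Student mimics the out-of-fold teacher class probabilities. With a
    # cross-entropy student, fitting soft labels is equivalent to weighted
    # classification on replicated examples: one copy per class, weighted by
    # the corresponding teacher probability.
    rep_X = np.repeat(X, n_classes, axis=0)
    rep_y = np.tile(np.arange(n_classes), len(y))
    weights = soft_labels.reshape(-1)
    student = LogisticRegression(max_iter=1000)
    student.fit(rep_X, rep_y, sample_weight=weights)
    return student
```

Standard (non-cross-fitted) distillation corresponds to fitting a single teacher on all of the data and distilling from its in-sample probabilities; the cross-fitted variant differs only in where the soft labels come from.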