How to train an ideal teacher for knowledge distillation is still an open problem. It has been widely observed that a teacher minimizing the empirical risk does not necessarily yield the best-performing student, suggesting a fundamental discrepancy between the common practice in teacher network training and the distillation objective. To fill this gap, we propose SoTeacher, a novel student-oriented teacher network training framework, inspired by recent findings that student performance hinges on the teacher's capability to approximate the true label distribution of the training samples. We theoretically establish that (1) the empirical risk minimizer trained with a proper scoring rule as the loss function can provably approximate the true label distribution of the training data if the hypothesis function is locally Lipschitz continuous around the training samples; and (2) when data augmentation is employed during training, an additional constraint is required: the minimizer must produce consistent predictions across augmented views of the same training input. In light of our theory, SoTeacher renovates empirical risk minimization by incorporating Lipschitz regularization and consistency regularization. Notably, SoTeacher is applicable to almost all teacher-student architecture pairs, requires no prior knowledge of the student during the teacher's training, and incurs almost no computational overhead. Experiments on two benchmark datasets confirm that SoTeacher improves student performance significantly and consistently across various knowledge distillation algorithms and teacher-student pairs.