Although deep neural networks have enjoyed remarkable success across a wide variety of tasks, their ever-increasing size also imposes significant overhead on deployment. To compress these models, knowledge distillation was proposed to transfer knowledge from a cumbersome (teacher) network into a lightweight (student) network. However, guidance from a teacher does not always improve the generalization of students, especially when the size gap between student and teacher is large. Previous works attributed this to the teacher's high certainty about the data, which yields hard labels that are difficult for the student to fit. To soften these labels, we present a pruning method termed Prediction Uncertainty Enlargement (PrUE) to simplify the teacher. Specifically, our method aims to decrease the teacher's certainty about data, thereby generating soft predictions for students. We empirically investigate the effectiveness of the proposed method with experiments on CIFAR-10/100, Tiny-ImageNet, and ImageNet. Results indicate that student networks trained with sparse teachers achieve better performance. Moreover, our method allows researchers to distill knowledge from deeper networks to improve students further. Our code is publicly available at: \url{https://github.com/wangshaopu/prue}.
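To make the two notions in the abstract concrete, the sketch below shows (i) prediction uncertainty measured as the average entropy of a teacher's softmax outputs, and (ii) the standard Hinton-style distillation objective in which the student fits the teacher's softened predictions. This is a minimal illustration of the background setup, not the PrUE pruning criterion itself; the function names, temperature, and blending weight are illustrative assumptions.

\begin{verbatim}
import torch
import torch.nn.functional as F

def prediction_entropy(logits: torch.Tensor) -> torch.Tensor:
    # Average entropy of the softmax predictions over a batch.
    # Higher entropy = lower certainty, i.e. "softer" teacher labels.
    probs = F.softmax(logits, dim=1)
    log_probs = F.log_softmax(logits, dim=1)
    return -(probs * log_probs).sum(dim=1).mean()

def distillation_loss(student_logits, teacher_logits, targets,
                      T: float = 4.0, alpha: float = 0.9):
    # Standard knowledge-distillation loss: KL divergence between the
    # temperature-softened teacher and student distributions, blended
    # with the usual cross-entropy on the ground-truth labels.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard_loss = F.cross_entropy(student_logits, targets)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
\end{verbatim}

A teacher whose predictions have higher average entropy supplies a less peaked target distribution, which is the sense in which the abstract speaks of "soft predictions" for the student.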