Knowledge distillation, which aims to transfer knowledge from a heavy teacher network to a lightweight student network, has emerged as a promising technique for compressing neural networks. However, due to the capacity gap between the heavy teacher and the lightweight student, a significant performance gap remains between them. In this paper, we view knowledge distillation in a fresh light, using the knowledge gap, or the residual, between a teacher and a student as guidance to train a much more lightweight student, called a res-student. We then combine the student and the res-student into a new student, in which the res-student rectifies the errors of the former student. This residual-guided process can be repeated until the user strikes a satisfactory balance between accuracy and cost. At inference time, we propose a sample-adaptive strategy that decides, for each sample, which res-students are unnecessary, thereby saving computational cost. Experimental results show that we achieve competitive performance with 18.04$\%$, 23.14$\%$, 53.59$\%$, and 56.86$\%$ of the teachers' computational costs on the CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet datasets, respectively. Finally, we provide a thorough theoretical and empirical analysis of our method.
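To make the residual-guided idea concrete, the following is a minimal PyTorch-style sketch, not the paper's exact formulation: a lightweight res-student is trained to fit the teacher-student residual, and its output is added to the frozen student's logits to form the new student. The names `ResidualGuidedStudent` and `res_student_loss`, as well as the MSE residual objective, are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ResidualGuidedStudent(nn.Module):
    """Sketch of a student whose errors are rectified by a lightweight res-student.

    `student` and `res_student` can be any classifiers mapping an input batch
    to logits of the same shape; both names are placeholders.
    """

    def __init__(self, student: nn.Module, res_student: nn.Module):
        super().__init__()
        self.student = student
        self.res_student = res_student

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The combined prediction adds the res-student's correction
        # to the logits of the frozen, previously trained student.
        with torch.no_grad():
            base_logits = self.student(x)
        return base_logits + self.res_student(x)


def res_student_loss(teacher_logits: torch.Tensor,
                     student_logits: torch.Tensor,
                     res_logits: torch.Tensor) -> torch.Tensor:
    # The res-student is trained to approximate the residual (knowledge gap)
    # between the teacher's and the student's predictions.
    residual_target = teacher_logits.detach() - student_logits.detach()
    return nn.functional.mse_loss(res_logits, residual_target)
```

The same construction can be nested: the combined model above plays the role of the student in the next round, and another, even smaller res-student is fitted to the remaining gap, which is the repeated residual-guided process described in the abstract.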