Knowledge distillation aims to obtain a compact and effective model by learning the mapping function of a much larger one. Due to its limited capacity, the student tends to underfit the teacher; as a result, student performance unexpectedly drops when distilling from an oversized teacher, a phenomenon termed the capacity gap problem. We investigate this problem by studying the gap in confidence between teacher and student. We find that the magnitude of confidence is not necessary for knowledge distillation and can harm student performance if the student is forced to learn it. We propose Spherical Knowledge Distillation to eliminate this gap explicitly, which eases the underfitting problem. We find that this novel knowledge representation allows compact models to benefit from much larger teachers and is robust to the choice of temperature. We conduct experiments on both CIFAR100 and ImageNet and achieve significant improvements. Specifically, we train ResNet18 to 73.0% accuracy, a substantial improvement over the previous SOTA and on par with ResNet34, which is almost twice the student's size.
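The core idea can be illustrated with a minimal sketch (a hypothetical implementation for illustration, not the paper's official code): normalize the student and teacher logits to a common scale before applying the usual temperature-softened KL loss, so the student is no longer asked to reproduce the teacher's confidence magnitude. The choice of shared scale below (the teacher's mean logit norm) is an assumption made for this sketch.

```python
import torch
import torch.nn.functional as F

def spherical_kd_loss(student_logits, teacher_logits, T=4.0, eps=1e-8):
    """Sketch of a spherical-style distillation loss (assumed formulation):
    strip the per-sample magnitude from both logit vectors, rescale them to
    a shared norm, then apply the standard temperature-softened KL term."""
    # Remove confidence magnitude by L2-normalizing each logit vector.
    s_dir = student_logits / (student_logits.norm(dim=1, keepdim=True) + eps)
    t_dir = teacher_logits / (teacher_logits.norm(dim=1, keepdim=True) + eps)
    # Rescale both to a common magnitude (here: the teacher's average norm,
    # an assumption for this sketch) so the temperature behaves comparably.
    scale = teacher_logits.norm(dim=1, keepdim=True).mean().detach()
    s_scaled, t_scaled = s_dir * scale, t_dir * scale
    # Standard softened-KL distillation term.
    log_p_s = F.log_softmax(s_scaled / T, dim=1)
    p_t = F.softmax(t_scaled / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)
```

Because both models' logits are projected onto the same sphere, the student only has to match the direction of the teacher's output distribution, which is what removes the confidence gap described above.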