Many existing studies on knowledge distillation have focused on methods in which a student model closely mimics a teacher model. Simply imitating the teacher's knowledge, however, is not sufficient for the student to surpass the teacher. We explore a method that harnesses the knowledge of other students to complement the knowledge of the teacher. We propose deep collective knowledge distillation (DCKD) for model compression, a method that trains student models to acquire rich knowledge not only from their teacher model but also from other student models. The knowledge collected from several student models contains a wealth of information about the correlations between classes, and DCKD explicitly considers how to increase this inter-class correlation knowledge during training. Our novel method enables us to train better-performing student models by collecting such knowledge. This simple yet powerful method achieves state-of-the-art performance in many experiments. For example, on ImageNet, ResNet18 trained with DCKD achieves 72.27\% top-1 accuracy, outperforming the pretrained ResNet18 by 2.52\%. On CIFAR-100, the ShuffleNetV1 student trained with DCKD achieves 6.55\% higher top-1 accuracy than the pretrained ShuffleNetV1.
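As a rough illustration of this idea (a sketch only; the notation, aggregation scheme, and weighting terms here are our assumptions, not the exact formulation developed later in the paper), a student can be trained with a loss that combines standard cross-entropy, distillation from the teacher's softened outputs, and distillation from the collective softened outputs of the other students, so that the peers' inter-class correlation knowledge complements the teacher's:
\[
\mathcal{L}_{\mathrm{student}}
= \mathcal{L}_{\mathrm{CE}}\!\left(y,\, p_s\right)
+ \alpha\, T^2\, \mathrm{KL}\!\left(p_t^{(T)} \,\big\|\, p_s^{(T)}\right)
+ \beta\, T^2\, \mathrm{KL}\!\left(\bar{p}_{\mathrm{peer}}^{(T)} \,\big\|\, p_s^{(T)}\right),
\]
where $p_s^{(T)}$, $p_t^{(T)}$, and $\bar{p}_{\mathrm{peer}}^{(T)}$ denote the temperature-$T$ softened class distributions of the student, the teacher, and the aggregated (e.g., averaged) peer students, respectively, and $\alpha$, $\beta$, and $T$ are illustrative hyperparameters.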