Fixing the Teacher-Student Knowledge Discrepancy in Distillation

Training a small student network under the guidance of a larger teacher network is an effective way to improve the student's performance. Previous knowledge distillation methods differ in the type of knowledge they transfer, but each keeps that guiding knowledge unchanged across different teacher-student pairs. However, we find that teacher and student models with different architectures, or trained from different initializations, can have distinct feature representations across channels (e.g., in which channels are highly activated for different categories). We name this incongruous channel representation the teacher-student knowledge discrepancy in the distillation process. Ignoring this discrepancy makes it more difficult for the student to learn from the teacher. To solve this problem, we propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student and provides the most suitable knowledge to each student network for distillation. Extensive experiments on different datasets (CIFAR100, ImageNet, COCO) and tasks (image classification, object detection) reveal that the knowledge discrepancy problem between teachers and students is widespread and demonstrate the effectiveness of our proposed method. Our method is flexible and can be easily combined with other state-of-the-art approaches.
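
The abstract does not spell out how the discrepancy is measured or removed, so the following is only a toy sketch under stated assumptions, not the authors' implementation. It assumes that each model is summarized by a hypothetical channel-by-class table of mean activations, that teacher and student have the same number of channels, and that "more consistent" can be approximated by reordering teacher channels to best match the student. The names `top_channel_per_class`, `discrepancy`, and `align_teacher` and all numbers are made up for illustration.

```python
from itertools import permutations


def top_channel_per_class(profile):
    """profile[c][k] is the mean activation of channel c on class k.

    Returns, for each class k, the index of the most activated channel.
    """
    num_classes = len(profile[0])
    return [max(range(len(profile)), key=lambda c: profile[c][k])
            for k in range(num_classes)]


def discrepancy(teacher, student):
    """Fraction of classes whose most activated channel differs."""
    t = top_channel_per_class(teacher)
    s = top_channel_per_class(student)
    return sum(a != b for a, b in zip(t, s)) / len(t)


def align_teacher(teacher, student):
    """Reorder teacher channels to minimize squared error to the student.

    Assumption: brute force over all permutations, so this is only
    usable for a handful of channels.
    """
    def cost(perm):
        return sum((teacher[p][k] - student[c][k]) ** 2
                   for c, p in enumerate(perm)
                   for k in range(len(student[c])))

    best = min(permutations(range(len(teacher))), key=cost)
    return [teacher[p] for p in best]


# Toy profiles: 3 channels x 3 classes (made-up numbers).
teacher = [[0.9, 0.1, 0.2],
           [0.1, 0.8, 0.1],
           [0.2, 0.1, 0.7]]
student = [[0.1, 0.7, 0.2],
           [0.3, 0.1, 0.9],
           [0.8, 0.2, 0.1]]

before = discrepancy(teacher, student)
after = discrepancy(align_teacher(teacher, student), student)
print(before, after)
```

On these toy tables the two models encode the same class information in different channel positions, so every class disagrees before the reordering and none disagrees after it. A realistic setting would need a scalable assignment method and real feature statistics; both are outside what the abstract states.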
