Knowledge distillation (KD) has shown great promise in transferring learned representations from large models (teachers) to small models (students). However, as the capacity gap between students and teachers grows, existing KD methods fail to deliver better results. Our work shows that `prior knowledge' is vital to KD, especially when large teachers are applied. Specifically, we propose Dynamic Prior Knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation. This means our method treats the teacher's features not only as the `target' but also as part of the `input'. Moreover, we dynamically adjust the ratio of prior knowledge during training according to the feature gap, thus guiding the student at an appropriate level of difficulty. To evaluate the proposed method, we conduct extensive experiments on two image classification benchmarks (i.e., CIFAR100 and ImageNet) and an object detection benchmark (i.e., MS COCO). The results demonstrate the superiority of our method under various settings. Besides, our DPK makes the performance of the student model positively correlated with that of the teacher model, which means we can further boost the accuracy of students by applying larger teachers. More importantly, DPK provides a fast solution for teacher model selection for any given student model.
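To make the core idea concrete, below is a minimal PyTorch-style sketch of the prior-knowledge mechanism described above: a fraction of the teacher's feature map is pasted into the student's feature map as a prior (input), a light fusion module processes the mixed map, and the distillation loss is computed only at positions that still come from the student. The fusion module, the gap-based ratio schedule, and all hyperparameters here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DPKSketch(nn.Module):
    """Hypothetical sketch of the dynamic prior knowledge idea.

    Part of the teacher's feature map is used as `input' (prior knowledge),
    not just as the `target'; the ratio of prior knowledge grows with the
    student-teacher feature gap. Shapes are assumed to be (B, C, H, W) and
    already aligned between student and teacher.
    """

    def __init__(self, channels):
        super().__init__()
        # Illustrative fusion module; the paper's module may differ.
        self.fuse = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    @staticmethod
    def dynamic_ratio(s_feat, t_feat, base=0.3, scale=0.1, max_ratio=0.9):
        # Assumed schedule: a larger feature gap -> more teacher prior.
        gap = F.mse_loss(s_feat, t_feat.detach()).item()
        return min(max_ratio, base + scale * gap)

    def forward(self, s_feat, t_feat):
        b, c, h, w = s_feat.shape
        ratio = self.dynamic_ratio(s_feat, t_feat)
        # mask == 1 marks positions filled with the teacher's features (the prior).
        mask = (torch.rand(b, 1, h, w, device=s_feat.device) < ratio).float()
        mixed = mask * t_feat.detach() + (1.0 - mask) * s_feat
        pred = self.fuse(mixed)
        # Distill only where the student's own features remain.
        keep = 1.0 - mask
        loss = ((pred - t_feat.detach()) ** 2 * keep).sum()
        loss = loss / keep.sum().clamp(min=1.0) / c
        return loss
```

In a training step, this loss would be added to the usual task loss (e.g., cross-entropy), with the prior ratio recomputed each iteration so that harder stages of training receive more guidance from the teacher.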