In knowledge distillation, previous feature distillation methods mainly focus on the design of loss functions and the selection of the distilled layers, while the effect of the feature projector between the student and the teacher remains under-explored. In this paper, we first discuss a plausible mechanism of the projector with empirical evidence and then propose a new feature distillation method based on a projector ensemble for further performance improvement. We observe that the student network benefits from a projector even if the feature dimensions of the student and the teacher are the same. Training a student backbone without a projector can be viewed as a multi-task learning process: the backbone must simultaneously extract discriminative features for classification and match the teacher's features for distillation. We hypothesize and empirically verify that, without a projector, the student network tends to overfit the teacher's feature distributions despite having a different architecture and weight initialization. This degrades the quality of the student's deep features, which are ultimately used for classification. Adding a projector, on the other hand, disentangles the two learning tasks and helps the student network focus on the main feature extraction task while still using the teacher's features as guidance through the projector. Motivated by the positive effect of the projector in feature distillation, we propose an ensemble of projectors to further improve the quality of student features. Experimental results on different datasets with a series of teacher-student pairs demonstrate the effectiveness of the proposed method.
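To make the idea concrete, the following is a minimal sketch of how a projector ensemble could sit between student and teacher features. The abstract does not specify the projector architecture, how the ensemble is combined, or the distillation loss, so everything here is an illustrative assumption: each projector is a plain linear map, the ensemble output is the mean of the individual projections, and the loss is a mean squared error. All function names are hypothetical and are not taken from the paper's code.

```python
import random


def make_projector(in_dim, out_dim, rng):
    """Random linear projector stored as out_dim rows of in_dim weights (assumed form)."""
    scale = 1.0 / in_dim ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights, feature):
    """Apply one linear projector to a single feature vector."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]


def ensemble_project(projectors, feature):
    """Average the outputs of all projectors (averaging is an assumption)."""
    outputs = [project(weights, feature) for weights in projectors]
    return [sum(values) / len(outputs) for values in zip(*outputs)]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(projectors, student_feature, teacher_feature):
    """Match the projected student feature, not the raw one, to the teacher feature.

    The raw student feature stays free for the classification head, which is
    the decoupling of the two tasks that the abstract describes.
    """
    return mse(ensemble_project(projectors, student_feature), teacher_feature)


rng = random.Random(0)
student_dim, teacher_dim, num_projectors = 4, 6, 3  # hypothetical sizes
projectors = [make_projector(student_dim, teacher_dim, rng) for _ in range(num_projectors)]
student_feature = [rng.gauss(0.0, 1.0) for _ in range(student_dim)]
teacher_feature = [rng.gauss(0.0, 1.0) for _ in range(teacher_dim)]

loss = distillation_loss(projectors, student_feature, teacher_feature)
print(f"feature distillation loss: {loss:.4f}")
```

In a full training loop, this term would be added to the classification loss computed from the unprojected student feature, and the projectors would be trained jointly with the student and discarded at inference time. This sketch covers only the forward computation.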