Often we wish to transfer representational knowledge from one neural network to another. Examples include distilling a large network into a smaller one, transferring knowledge from one sensory modality to a second, or ensembling a collection of models into a single estimator. Knowledge distillation, the standard approach to these problems, minimizes the KL divergence between the probabilistic outputs of a teacher and student network. We demonstrate that this objective ignores important structural knowledge of the teacher network. This motivates an alternative objective by which we train a student to capture significantly more information in the teacher's representation of the data. We formulate this objective as contrastive learning. Experiments demonstrate that our resulting new objective outperforms knowledge distillation and other cutting-edge distillers on a variety of knowledge transfer tasks, including single model compression, ensemble distillation, and cross-modal transfer. Our method sets a new state-of-the-art in many transfer tasks, and sometimes even outperforms the teacher network when combined with knowledge distillation. Code: http://github.com/HobbitLong/RepDistiller.
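To make the two objectives concrete, here is a minimal sketch (assuming PyTorch) of the standard knowledge-distillation loss described above, alongside a generic InfoNCE-style contrastive loss between student and teacher representations. The contrastive function is an illustrative stand-in, not the authors' exact CRD implementation; see the linked repository for the official code. The function names, the temperature values, and the single-batch negative sampling are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Standard knowledge distillation: KL divergence between the
    temperature-softened teacher and student output distributions."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

def contrastive_repr_loss(student_feat, teacher_feat, temperature=0.1):
    """Illustrative InfoNCE-style objective: the student embedding of an input
    should match the teacher embedding of the same input (positive pair)
    rather than teacher embeddings of other inputs in the batch (negatives)."""
    s = F.normalize(student_feat, dim=1)                 # (N, d) student features
    t = F.normalize(teacher_feat, dim=1)                 # (N, d) teacher features
    logits = s @ t.t() / temperature                     # (N, N) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)
```

In practice the two losses can be combined with the ordinary cross-entropy on ground-truth labels, weighting each term; the paper's full method additionally draws many negatives from a memory buffer rather than only from the current batch.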