Knowledge distillation (KD) has become an important technique for model compression and knowledge transfer. In this work, we first perform a comprehensive analysis of the knowledge transferred by different KD methods. We show that traditional KD methods, which minimize the KL divergence between the softmax outputs of the teacher and student networks, transfer only per-sample knowledge, namely, knowledge alignment. In contrast, recent contrastive learning-based KD methods mainly transfer relational knowledge across different samples, namely, knowledge correlation. Since it is important to transfer the full knowledge from teacher to student, we introduce Multi-level Knowledge Distillation (MLKD), which jointly exploits both knowledge alignment and knowledge correlation. MLKD is task-agnostic and model-agnostic, and can readily transfer knowledge from either supervised or self-supervised pretrained teachers. We show that MLKD improves the reliability and transferability of the learned representations. Experiments demonstrate that MLKD outperforms other state-of-the-art methods across a wide range of experimental settings, including different (a) pretraining strategies, (b) network architectures, (c) datasets, and (d) tasks.
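For concreteness, the following is a minimal sketch of how the two levels of knowledge could be combined in a single objective; only the KL-divergence form of the alignment term is stated in the abstract, while the temperature $\tau$, the weights $\alpha$ and $\beta$, and the exact form of the correlation term are illustrative assumptions.

\[
\mathcal{L}_{\mathrm{align}} = \tau^{2}\,\mathrm{KL}\!\big(\sigma(z^{t}/\tau)\,\big\|\,\sigma(z^{s}/\tau)\big),
\qquad
\mathcal{L}_{\mathrm{MLKD}} = \mathcal{L}_{\mathrm{task}} + \alpha\,\mathcal{L}_{\mathrm{align}} + \beta\,\mathcal{L}_{\mathrm{corr}},
\]

where $z^{t}$ and $z^{s}$ denote the teacher and student logits for an individual sample, $\sigma$ is the softmax function, and $\mathcal{L}_{\mathrm{corr}}$ stands for a contrastive, relational term computed over pairs of different samples.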