Knowledge Distillation (KD) transfers knowledge from a high-capacity teacher network to strengthen a smaller student. Existing methods focus on mining knowledge hints and transferring the whole of this knowledge to the student. However, knowledge redundancy arises because the value of the knowledge to the student differs across learning stages. In this paper, we propose Knowledge Condensation Distillation (KCD). Specifically, the knowledge value of each sample is dynamically estimated, based on which an Expectation-Maximization (EM) framework is constructed to iteratively condense a compact knowledge set from the teacher to guide the student's learning. Our approach is easy to build on top of off-the-shelf KD methods, with no extra training parameters and negligible computational overhead. It thus offers a new perspective on KD, in which a student that actively identifies the teacher's knowledge in line with its own aptitude can learn more effectively and efficiently. Experiments on standard benchmarks demonstrate that the proposed KCD boosts the performance of the student model while achieving even higher distillation efficiency. Code is available at https://github.com/dzy3/KCD.
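The EM-style condensation loop described above can be illustrated with a minimal sketch. The abstract does not specify the value estimator, so this example assumes (hypothetically) per-sample teacher-student KL divergence as the knowledge-value proxy; the E-step scores each sample, and the M-step retains the most valuable fraction to form the condensed knowledge set. All function names and the `keep_ratio` parameter are illustrative, not from the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax over class logits.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def estimate_value(teacher_logits, student_logits):
    # E-step (assumed proxy): per-sample KL(teacher || student).
    # Samples where the student diverges most from the teacher are
    # treated as carrying the most useful knowledge at this stage.
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=1)

def condense(teacher_logits, student_logits, keep_ratio=0.5):
    # M-step: keep the top keep_ratio fraction of samples by value,
    # yielding the indices of the condensed knowledge set.
    value = estimate_value(teacher_logits, student_logits)
    k = max(1, int(len(value) * keep_ratio))
    return np.argsort(value)[::-1][:k]

# Toy run: 8 samples, 4 classes; in practice the student would be
# retrained on the condensed set and the loop repeated.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 4))
student = rng.normal(size=(8, 4))
kept = condense(teacher, student, keep_ratio=0.5)
print(len(kept))
```

In a full training loop the student would be updated on the condensed set after each M-step, and the value estimates refreshed, so the retained set shrinks toward the knowledge the student still needs.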