In this work, we propose Mutual Information Maximization Knowledge Distillation (MIMKD). Our method uses a contrastive objective to simultaneously estimate and maximize a lower bound on the mutual information between local and global feature representations of a teacher and a student network. We demonstrate through extensive experiments that this can be used to improve the performance of low-capacity models by transferring knowledge from more performant but computationally expensive models, producing better models that can run on devices with limited computational resources. Our method is flexible: we can distill knowledge from teachers with arbitrary network architectures to arbitrary student networks. Our empirical results show that MIMKD outperforms competing approaches across a wide range of student-teacher pairs with different capacities, with different architectures, and when student networks have extremely low capacity. We obtain 74.55% accuracy on CIFAR100 with a ShufflenetV2 student, up from a baseline accuracy of 69.8%, by distilling knowledge from ResNet-50. On Imagenet, we improve a ResNet-18 network from 68.88% to 70.32% accuracy (+1.44%) using a ResNet-34 teacher network.
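To make the contrastive objective concrete, the sketch below shows an InfoNCE-style lower bound on the mutual information between student and teacher features, which is one standard way such a bound is estimated and maximized. The projection heads (`proj_s`, `proj_t`) and the `temperature` hyperparameter are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a contrastive (InfoNCE-style) lower bound on the mutual
# information between student and teacher representations. Function and
# parameter names here are hypothetical, not taken from the paper.
import torch
import torch.nn.functional as F


def infonce_lower_bound(feat_s, feat_t, proj_s, proj_t, temperature=0.1):
    """feat_s: (B, Ds) student features; feat_t: (B, Dt) teacher features.

    Both are projected into a shared space; matching (student, teacher) pairs
    from the same image are positives, all other pairs in the batch are
    negatives. Maximizing the returned value tightens a lower bound on
    I(student; teacher).
    """
    z_s = F.normalize(proj_s(feat_s), dim=1)   # (B, D) student embeddings
    z_t = F.normalize(proj_t(feat_t), dim=1)   # (B, D) teacher embeddings
    logits = z_s @ z_t.t() / temperature       # (B, B) pairwise similarities
    labels = torch.arange(z_s.size(0), device=z_s.device)
    # Cross-entropy against the diagonal implements the InfoNCE objective;
    # log(B) minus this loss is the corresponding mutual-information bound.
    loss = F.cross_entropy(logits, labels)
    return -loss


# Usage sketch (hypothetical dimensions): the distillation term is subtracted
# from the task loss so that training maximizes the bound.
# proj_s = torch.nn.Linear(512, 128)   # student feature dim -> shared dim
# proj_t = torch.nn.Linear(2048, 128)  # teacher feature dim -> shared dim
# total_loss = task_loss - weight * infonce_lower_bound(fs, ft, proj_s, proj_t)
```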