Knowledge distillation typically transfers knowledge from a pre-trained, cumbersome teacher network to a compact student network, following the classical teacher-teaching-student paradigm. Based on this paradigm, previous methods mostly focus on how to efficiently train a better student network for deployment. Departing from existing practice, in this paper we propose a novel student-helping-teacher formulation, Teacher Evolution via Self-Knowledge Distillation (TESKD), where the target teacher (for deployment) is learned with the help of multiple hierarchical students that share its structural backbone. The diverse feedback from the multiple students allows the teacher to improve itself through the shared feature representations. The effectiveness of our proposed framework is demonstrated by extensive experiments with various network settings on two standard benchmarks, CIFAR-100 and ImageNet. Notably, when trained with our proposed method, ResNet-18 achieves 79.15% and 71.14% accuracy on CIFAR-100 and ImageNet, outperforming the baseline results by 4.74% and 1.43%, respectively. The code is available at: https://github.com/zhengli427/TESKD.
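The sketch below is a minimal, PyTorch-style illustration of the student-helping-teacher idea described above: auxiliary student heads are attached to intermediate stages of the shared backbone, and their losses propagate back into the backbone alongside the teacher's own loss. The module structure, loss weights, temperature, and the detaching of the teacher's soft targets are illustrative assumptions, not the paper's exact configuration; see the official repository for the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, T=4.0):
    """Soft-target distillation loss (temperature-scaled KL divergence)."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)


class SharedBackboneNet(nn.Module):
    """Teacher network whose intermediate features also feed hierarchical student heads."""

    def __init__(self, backbone_blocks, teacher_head, student_heads):
        super().__init__()
        self.blocks = nn.ModuleList(backbone_blocks)        # shared backbone stages
        self.teacher_head = teacher_head                     # classifier kept for deployment
        self.student_heads = nn.ModuleList(student_heads)    # one head per intermediate stage

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        teacher_logits = self.teacher_head(feats[-1])
        # Each hierarchical student classifies from an intermediate feature map.
        student_logits = [h(f) for h, f in zip(self.student_heads, feats[:-1])]
        return teacher_logits, student_logits


def training_step(model, x, y, alpha=0.5):
    """One joint update: teacher CE loss plus student CE and distillation losses.

    The student losses back-propagate through the shared backbone, which is how the
    students' feedback helps the teacher improve (weighting scheme is an assumption).
    """
    teacher_logits, student_logits = model(x)
    loss = F.cross_entropy(teacher_logits, y)
    for s in student_logits:
        loss = loss + F.cross_entropy(s, y) + alpha * kd_loss(s, teacher_logits.detach())
    return loss
```

At inference time only the backbone and the teacher head are kept, so the deployed model has the same cost as the original network (e.g., ResNet-18 in the experiments above).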