With the success of deep neural networks, knowledge distillation, which guides the learning of a small student network from a large teacher network, has been actively studied for model compression and transfer learning. However, few studies have addressed the poor learning of the student network that arises when the student and teacher model sizes differ significantly. In this paper, we propose densely guided knowledge distillation using multiple teacher assistants that gradually decrease in model size to efficiently bridge the large gap between the teacher and student networks. To stimulate more efficient learning of the student network, each teacher assistant iteratively guides every smaller teacher assistant. Specifically, when teaching a smaller teacher assistant at the next step, the existing larger teacher assistants from the previous steps are used together with the teacher network. Moreover, we design stochastic teaching in which, for each mini-batch, the teacher or some of the teacher assistants are randomly dropped. This acts as a regularizer that improves the efficiency of teaching the student network. Thus, the student can always learn salient distilled knowledge from multiple sources. We verified the effectiveness of the proposed method on classification tasks using CIFAR-10, CIFAR-100, and ImageNet. We also achieved significant performance improvements with various backbone architectures such as ResNet, WideResNet, and VGG.
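To make the densely guided training with stochastic teaching concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: the soft-target distillation loss of Hinton et al. is assumed for each guiding network, and the temperature `T`, weighting `alpha`, drop probability `drop_prob`, and function names are illustrative assumptions.

```python
import random
import torch
import torch.nn.functional as F

def soft_kd_loss(student_logits, guide_logits, T=4.0):
    """Standard soft-target distillation loss (Hinton et al.), assumed here."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(guide_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)

def dgkd_loss(student_logits, guide_logits_list, labels,
              T=4.0, alpha=0.5, drop_prob=0.5):
    """Densely guided KD loss for one mini-batch (illustrative sketch).

    guide_logits_list holds the logits of the teacher and of every larger
    teacher assistant trained at previous steps. Stochastic teaching keeps
    each guide with probability (1 - drop_prob); at least one guide is
    always retained so the student never loses all supervision.
    """
    kept = [g for g in guide_logits_list if random.random() > drop_prob]
    if not kept:
        kept = [random.choice(guide_logits_list)]

    ce = F.cross_entropy(student_logits, labels)
    kd = sum(soft_kd_loss(student_logits, g.detach(), T) for g in kept) / len(kept)
    return (1.0 - alpha) * ce + alpha * kd
```

In this sketch, the same loss would be applied at every step of the chain: first to train the largest teacher assistant from the teacher alone, then each smaller assistant from the teacher plus all larger assistants, and finally the student from all of them.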