The significant memory and computational requirements of large deep neural networks restrict their deployment on edge devices. Knowledge distillation (KD) is a prominent model compression technique in which the knowledge of a large trained teacher model is transferred to a smaller student model. The success of KD is mainly attributed to its training objective, which exploits soft-target information (also known as "dark knowledge") in addition to the regular hard labels in the training set. However, the literature shows that the larger the gap between the teacher and student networks, the harder it is to train the student with knowledge distillation. To address this shortcoming, we propose an improved knowledge distillation method, Annealing-KD, which feeds the rich information in the teacher's soft targets to the student incrementally and more efficiently. Annealing-KD is based on a gradual transition over annealed soft targets that the teacher generates at different temperatures in an iterative process, so the student is trained to follow the annealed teacher output step by step. This paper provides theoretical and empirical evidence, as well as practical experiments, to support the effectiveness of Annealing-KD. We conducted a comprehensive set of experiments on different tasks, including image classification (CIFAR-10 and CIFAR-100) and natural language inference with BERT-based models on the GLUE benchmark, and consistently obtained superior results.
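As a rough illustration of what "annealed soft targets at different temperatures" can look like, the following minimal sketch softens a teacher's logits with a temperature-scaled softmax and steps the temperature down to 1. The abstract does not specify the schedule or the loss, so the linear integer schedule, the function names `softened_softmax` and `annealed_targets`, and the example logits are all assumptions made for illustration, not the authors' formulation.

```python
import math


def softened_softmax(logits, temperature):
    """Softmax of logits / temperature; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def annealed_targets(teacher_logits, max_temperature):
    """Yield (temperature, soft_target) pairs from max_temperature down to 1.

    Assumption: a simple integer schedule; the abstract does not state one.
    """
    for temperature in range(max_temperature, 0, -1):
        yield temperature, softened_softmax(teacher_logits, temperature)


# Hypothetical teacher logits for a 3-class problem.
teacher_logits = [4.0, 1.0, 0.5]

for temperature, target in annealed_targets(teacher_logits, 4):
    # At each stage the student would be trained to match `target`
    # before moving on to the next, sharper target.
    print(temperature, [round(p, 3) for p in target])
```

In this sketch the targets start close to uniform and sharpen toward the teacher's actual output as the temperature falls, which is the sense in which the student could follow the teacher step by step.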