Knowledge distillation aims to enhance the performance of a lightweight student model by exploiting the knowledge of a pre-trained, cumbersome teacher model. However, in traditional knowledge distillation, teacher predictions only provide a supervisory signal for the last layer of the student model, which may leave the shallow student layers without accurate training guidance during layer-by-layer backpropagation and thus hinder effective knowledge transfer. To address this issue, we propose Deeply-Supervised Knowledge Distillation (DSKD), which fully utilizes the class predictions and feature maps of the teacher model to supervise the training of the shallow student layers. A loss-based weight allocation strategy is developed in DSKD to adaptively balance the learning process of each shallow layer, further improving student performance. Extensive experiments show that DSKD consistently outperforms state-of-the-art methods across various teacher-student pairs, confirming the effectiveness of the proposed method.
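To make the objective concrete, below is a minimal PyTorch sketch of a deeply-supervised distillation loss in the spirit described above: auxiliary class predictions and feature maps from shallow student layers are supervised by the teacher, and the per-layer losses are combined with a loss-based weighting. The auxiliary heads, the feature-projection step, and the softmax-over-losses weighting rule are illustrative assumptions, not the exact formulation used in the paper.

```python
# Illustrative sketch only: the auxiliary-head design, feature projection,
# and the specific loss-based weighting rule are assumptions for clarity.
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, T=4.0):
    """Standard KL-based distillation loss between class predictions."""
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)


def dskd_loss(student_logits, shallow_logits_list, shallow_feats_list,
              teacher_logits, teacher_feats_list, labels,
              alpha=0.5, beta=1.0):
    """Deeply-supervised distillation objective (illustrative).

    shallow_logits_list: class predictions from hypothetical auxiliary heads
        attached to shallow student layers.
    shallow_feats_list / teacher_feats_list: feature maps assumed to be
        already projected to matching shapes (projection layers omitted).
    """
    # Supervision on the final student layer: hard labels + teacher predictions.
    total = F.cross_entropy(student_logits, labels) \
        + alpha * kd_loss(student_logits, teacher_logits)

    # Per-layer losses for shallow layers: teacher predictions + feature maps.
    layer_losses = []
    for s_logits, s_feat, t_feat in zip(shallow_logits_list,
                                        shallow_feats_list,
                                        teacher_feats_list):
        layer_losses.append(kd_loss(s_logits, teacher_logits)
                            + beta * F.mse_loss(s_feat, t_feat))
    layer_losses = torch.stack(layer_losses)

    # Loss-based weight allocation (assumed form): shallow layers with larger
    # current loss receive larger weight, so they get stronger guidance.
    weights = torch.softmax(layer_losses.detach(), dim=0)
    return total + (weights * layer_losses).sum()
```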