Knowledge Distillation (KD) has developed extensively and boosted various tasks. The classical KD method adds the KD loss to the original cross-entropy (CE) loss. We decompose the KD loss to explore its relation to the CE loss. Surprisingly, we find it can be regarded as a combination of the CE loss and an extra loss that has the same form as the CE loss. However, we notice that the extra loss forces the student's relative probability to learn the teacher's absolute probability. Moreover, the sums of these two probabilities differ, making the loss hard to optimize. To address this issue, we revise the formulation and propose a distributed loss. In addition, we utilize the teacher's target output as the soft target, proposing the soft loss. Combining the soft loss and the distributed loss, we propose a new KD loss (NKD). Furthermore, we smooth the student's target output and treat it as the soft target for training without a teacher, proposing a teacher-free new KD loss (tf-NKD). Our method achieves state-of-the-art performance on CIFAR-100 and ImageNet. For example, with ResNet-34 as the teacher, we boost the ImageNet Top-1 accuracy of ResNet-18 from 69.90% to 71.96%. In training without teachers, MobileNet, ResNet-18, and SwinTransformer-Tiny achieve 70.04%, 70.76%, and 81.48%, which are 0.83%, 0.86%, and 0.30% higher than their baselines, respectively. The code is available at https://github.com/yzd-v/cls_KD.
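To make the decomposition concrete, the sketch below splits the soft cross-entropy between teacher and student at the target class into a target ("soft") term and a non-target term, and renormalizes both non-target distributions so their sums match before applying the CE-shaped loss. This is a minimal illustrative reconstruction, not the official implementation (see https://github.com/yzd-v/cls_KD for that); the function name `nkd_style_loss` and the hyper-parameters `non_target_weight` and `temperature` are assumptions.

```python
import torch
import torch.nn.functional as F


def nkd_style_loss(student_logits, teacher_logits, target,
                   non_target_weight=1.0, temperature=1.0):
    """Illustrative sketch: soft target term + normalized non-target term.

    `non_target_weight` and `temperature` are hypothetical knobs for this
    sketch, not values taken from the paper.
    """
    n, c = student_logits.shape
    s_prob = F.softmax(student_logits / temperature, dim=1)
    t_prob = F.softmax(teacher_logits / temperature, dim=1)

    one_hot = F.one_hot(target, num_classes=c).bool()

    # Soft loss: the teacher's target probability plays the role of the
    # hard label "1" in the ordinary CE loss.
    s_target = s_prob[one_hot]          # shape (n,)
    t_target = t_prob[one_hot]          # shape (n,)
    soft_loss = -(t_target * torch.log(s_target + 1e-7)).mean()

    # Distributed loss: drop the target class and renormalize both
    # non-target distributions so each sums to 1, then match them with a
    # CE-shaped loss. This removes the mismatch between the two sums
    # (1 - S_target vs. 1 - T_target) described in the abstract.
    s_non = s_prob.masked_fill(one_hot, 0.0)
    t_non = t_prob.masked_fill(one_hot, 0.0)
    s_non = s_non / s_non.sum(dim=1, keepdim=True)
    t_non = t_non / t_non.sum(dim=1, keepdim=True)
    dist_loss = -(t_non * torch.log(s_non + 1e-7)).sum(dim=1).mean()

    return soft_loss + non_target_weight * dist_loss
```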