Knowledge distillation has been used to transfer knowledge learned by a sophisticated model (teacher) to a simpler model (student). This technique is widely used to compress model complexity. However, in most applications the compressed student model suffers from an accuracy gap with its teacher. We propose extracurricular learning, a novel knowledge distillation method that bridges this gap by (1) modeling the student and teacher output distributions; (2) sampling examples from an approximation to the underlying data distribution; and (3) matching the student and teacher output distributions over this extended set, including uncertain samples. We conduct rigorous evaluations on regression and classification tasks and show that, compared to standard knowledge distillation, extracurricular learning reduces the gap by 46% to 68%. This leads to major accuracy improvements over empirical risk minimization-based training for various recent neural network architectures: a 16% regression error reduction on the MPIIGaze dataset, a +3.4% to +9.1% improvement in top-1 classification accuracy on the CIFAR100 dataset, and a +2.9% top-1 improvement on the ImageNet dataset.
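To make the three steps concrete, the following is a minimal sketch of one training step in this spirit, not the paper's exact formulation: it assumes a classification setting, a hypothetical `sampler` callable that draws inputs from an approximation to the data distribution (e.g., a fitted generative model), and standard KL-based matching of softened student and teacher outputs over the extended batch.

```python
import torch
import torch.nn.functional as F

def extracurricular_distillation_step(student, teacher, x_real, sampler,
                                      optimizer, n_extra=64, temperature=4.0):
    """Hedged sketch of a distillation step over an extended sample set.

    `sampler` is a hypothetical callable approximating the underlying data
    distribution; `n_extra` extra inputs are drawn from it and concatenated
    with the real batch before matching student and teacher outputs.
    """
    x_extra = sampler(n_extra)                   # samples from the approximate data distribution
    x = torch.cat([x_real, x_extra], dim=0)      # extended set: real + sampled inputs

    with torch.no_grad():
        t_logits = teacher(x)                    # teacher predictions, held fixed
    s_logits = student(x)

    # Match softened output distributions (standard KD-style KL objective).
    loss = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the sampled inputs play the role of the "extended set including uncertain samples": they expose the student to regions of input space beyond the training data, where the teacher's output distribution is still available as a supervisory signal.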