Knowledge distillation is a strategy for training a student network under the guidance of the soft outputs of a teacher network. It has been a successful method for model compression and knowledge transfer. However, knowledge distillation currently lacks a convincing theoretical understanding. On the other hand, recent findings on the neural tangent kernel allow a wide neural network to be approximated by a linear model over the network's random features. In this paper, we theoretically analyze the knowledge distillation of a wide neural network. First, we provide a transfer risk bound for the linearized model of the network. Then we propose a metric of the task's training difficulty, called data inefficiency. Based on this metric, we show that for a perfect teacher, a high ratio of the teacher's soft labels can be beneficial. Finally, for the case of an imperfect teacher, we find that hard labels can correct the teacher's wrong predictions, which explains the practice of mixing hard and soft labels.
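For reference, the sketch below illustrates the standard Hinton-style distillation objective that mixes hard labels with the teacher's temperature-scaled soft labels; the function name, weighting ratio, and temperature are illustrative assumptions, not the exact formulation analyzed in the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      soft_ratio=0.9, temperature=4.0):
    """Mix the teacher's soft labels with ground-truth hard labels.

    soft_ratio and temperature are illustrative hyperparameters; the
    paper's analysis concerns how the soft-label ratio should be set.
    """
    # Soft-label term: match the teacher's temperature-scaled distribution.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2  # rescale so gradients match the hard-label term

    # Hard-label term: ordinary cross-entropy on the ground-truth labels,
    # which can correct an imperfect teacher's wrong predictions.
    hard_loss = F.cross_entropy(student_logits, hard_labels)

    return soft_ratio * soft_loss + (1.0 - soft_ratio) * hard_loss
```

In this sketch, a larger `soft_ratio` relies more heavily on the teacher, matching the abstract's claim that a high soft-label ratio helps when the teacher is perfect, while the hard-label term is what corrects an imperfect teacher.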