Knowledge distillation is classically a procedure in which a neural network is trained on the outputs of another network along with the original targets in order to transfer knowledge between the architectures. The special case of self-distillation, where the network architectures are identical, has been observed to improve generalization accuracy. In this paper, we consider an iterative variant of self-distillation in a kernel regression setting, in which successive steps incorporate both model outputs and the ground-truth targets. This allows us to provide the first theoretical results on the importance of using weighted ground-truth targets in self-distillation. Our focus is on fitting nonlinear functions to training data with a weighted mean squared error objective suitable for distillation, subject to $\ell_2$ regularization of the model parameters. We show that any such function obtained with self-distillation can be calculated directly as a function of the initial fit, and that taking infinitely many distillation steps yields the same optimization problem as the original, but with amplified regularization. Furthermore, we provide a closed-form solution for the optimal choice of weighting parameter at each step, and show how to efficiently estimate this weighting parameter for deep learning, significantly reducing the computational requirements compared to a grid search.
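To make the iterative procedure concrete, the following is a minimal NumPy sketch rather than the paper's exact formulation: it assumes each distillation step refits kernel ridge regression to a convex combination $r\,y + (1-r)\,f_{t-1}$ of the ground-truth targets and the previous model's outputs on the training points, with an illustrative mixing parameter $r$; the function names, the mixing rule, and the toy RBF-kernel data are hypothetical.

```python
import numpy as np

def kernel_ridge_fit(K, targets, lam):
    # Dual solution of kernel ridge regression: (K + lam * I)^{-1} targets.
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), targets)

def iterative_self_distillation(K, y, lam, mix_weights):
    # Step 0: fit the ground-truth targets directly.
    coefs = kernel_ridge_fit(K, y, lam)
    # Each subsequent step refits a weighted mix of the ground truth and the
    # previous model's outputs on the training points (assumed mixing rule).
    for r in mix_weights:
        prev_outputs = K @ coefs
        targets = r * y + (1.0 - r) * prev_outputs
        coefs = kernel_ridge_fit(K, targets, lam)
    return coefs

# Toy usage with an RBF kernel on noisy 1-D data.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(50)
K = np.exp(-((X - X.T) ** 2) / 0.1)
distilled_coefs = iterative_self_distillation(K, y, lam=1e-2, mix_weights=[0.5, 0.5, 0.5])
```

In this sketch, setting every entry of `mix_weights` to 0 recovers pure self-distillation on model outputs, while setting them to 1 simply refits the original targets; intermediate values correspond to the weighted ground-truth targets discussed in the abstract.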