Knowledge distillation, i.e., one classifier being trained on the outputs of another classifier, is an empirically very successful technique for knowledge transfer between classifiers. It has even been observed that classifiers learn much faster and more reliably if trained with the outputs of another classifier as soft labels, instead of from ground truth data. So far, however, there is no satisfactory theoretical explanation of this phenomenon. In this work, we provide the first insights into the working mechanisms of distillation by studying the special case of linear and deep linear classifiers. Specifically, we prove a generalization bound that establishes fast convergence of the expected risk of a distillation-trained linear classifier. From the bound and its proof we extract three key factors that determine the success of distillation:

* data geometry: geometric properties of the data distribution, in particular class separation, have a direct influence on the convergence speed of the risk;
* optimization bias: gradient descent optimization finds a very favorable minimum of the distillation objective; and
* strong monotonicity: the expected risk of the student classifier always decreases when the size of the training set grows.
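To make the setting concrete, the following is a minimal sketch of the distillation setup the abstract describes: a linear student trained by gradient descent on a linear teacher's soft labels instead of the ground-truth labels. The synthetic data, the helper `train_linear`, and the names `teacher_w` / `student_w` are illustrative assumptions for this sketch, not the paper's experimental protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, linearly separable binary data (illustrative only).
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_linear(X, targets, lr=0.5, steps=2000):
    """Gradient descent on the cross-entropy between sigmoid(X @ w) and targets.

    With soft targets this plays the role of a distillation objective
    for a linear student; with hard labels it is ordinary logistic regression.
    """
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        grad = X.T @ (p - targets) / len(targets)
        w -= lr * grad
    return w


# Teacher: a linear classifier fit to the hard ground-truth labels.
teacher_w = train_linear(X, y)

# Student: trained only on the teacher's soft labels (distillation),
# here on a small transfer set to mimic the low-data regime.
m = 20
X_transfer = rng.normal(size=(m, d))
soft_labels = sigmoid(X_transfer @ teacher_w)
student_w = train_linear(X_transfer, soft_labels)

# Check how closely the student reproduces the teacher's decisions.
X_test = rng.normal(size=(1000, d))
agreement = ((X_test @ teacher_w > 0) == (X_test @ student_w > 0)).mean()
print(f"student agrees with teacher on {agreement:.1%} of test points")
```

Under these assumptions, the student sees only the teacher's soft outputs on a handful of transfer points, which is the regime in which the abstract's generalization bound and its three factors (data geometry, optimization bias, strong monotonicity) apply.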