We formally study how ensembles of deep learning models can improve test accuracy, and how the superior performance of an ensemble can be distilled into a single model using knowledge distillation. We consider the challenging case where the ensemble is simply an average of the outputs of a few independently trained neural networks with the SAME architecture, trained using the SAME algorithm on the SAME data set, differing only in the random seeds used for initialization. We show empirically that ensemble/knowledge distillation in deep learning works very differently from traditional learning theory, and in particular differently from ensembles of random feature mappings or neural-tangent-kernel feature mappings, and is potentially outside the scope of existing theorems. Thus, to properly understand ensemble and knowledge distillation in deep learning, we develop a theory showing that when the data has a structure we refer to as "multi-view", an ensemble of independently trained neural networks can provably improve test accuracy, and this superior test accuracy can also be provably distilled into a single model by training the single model to match the output of the ensemble instead of the true labels. Our result sheds light on how ensembles work in deep learning in a way that is completely different from traditional theorems, and on how the "dark knowledge" hidden in the outputs of the ensemble, which can be exploited by knowledge distillation, compares to the true data labels. Finally, we prove that self-distillation can also be viewed as implicitly combining ensemble and knowledge distillation to improve test accuracy.
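The setting described above can be sketched in a few lines: the ensemble is a plain average of the output (probability) vectors of independently trained models, and that averaged vector, rather than the one-hot true label, becomes the training target for the distilled single model. The snippet below is a minimal, hypothetical illustration; the three stub "models" stand in for networks of the same architecture trained from different random seeds, and are not the paper's actual experimental setup.

```python
# Minimal sketch (hypothetical): ensemble by output averaging, producing the
# soft target a single "student" model would be trained to match in
# knowledge distillation.

def ensemble_output(models, x):
    """Average the output probability vectors of independently trained models."""
    outs = [m(x) for m in models]
    k = len(outs)
    return [sum(o[i] for o in outs) / k for i in range(len(outs[0]))]

# Three toy "models" with the same architecture, differing only by random
# seed at initialization; here stubbed as fixed probability vectors over
# three classes for illustration.
m1 = lambda x: [0.7, 0.2, 0.1]
m2 = lambda x: [0.5, 0.4, 0.1]
m3 = lambda x: [0.6, 0.1, 0.3]

# Knowledge distillation trains the student against this averaged vector
# (the ensemble's "dark knowledge") instead of the one-hot true label.
soft_target = ensemble_output([m1, m2, m3], x=None)
```

Note that even when every individual model would predict class 0 here, the soft target retains the relative confidences over the wrong classes, which is exactly the extra information a one-hot label discards.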