Feature-based student-teacher learning, a training method that encourages the student's hidden features to mimic those of the teacher network, is empirically successful in transferring knowledge from a pre-trained teacher network to the student network. Furthermore, recent empirical results demonstrate that the teacher's features can boost the student network's generalization even when the student's input sample is corrupted by noise. However, there is a lack of theoretical insight into why and when this method of transferring knowledge can succeed between such heterogeneous tasks. We analyze this method theoretically using deep linear networks, and experimentally using nonlinear networks. We identify three factors vital to the success of the method: (1) whether the student is trained to zero training loss; (2) how knowledgeable the teacher is on the clean-input problem; (3) how the teacher decomposes its knowledge in its hidden features. Lack of proper control over any of the three factors leads to failure of the student-teacher learning method.
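To make the setup concrete, the following is a minimal sketch of a feature-matching objective for a two-layer linear student that sees noisy inputs while the teacher sees clean ones. The abstract does not state the loss, so everything here is an assumption for illustration: the squared-error form, the weighting `lam`, the function name `student_teacher_loss`, the layer shapes, and the choice to generate labels from the teacher.

```python
import numpy as np

def student_teacher_loss(W1_s, W2_s, W1_t, x_clean, x_noisy, y, lam=1.0):
    """Base loss plus a feature-matching penalty (illustrative form, not from the paper).

    Columns of x_clean / x_noisy / y are individual samples.
    """
    h_teacher = W1_t @ x_clean   # teacher hidden features, computed on clean inputs
    h_student = W1_s @ x_noisy   # student hidden features, computed on noisy inputs
    y_student = W2_s @ h_student # student output
    n = x_clean.shape[1]
    base = np.sum((y_student - y) ** 2) / n           # output (task) loss
    feature = np.sum((h_student - h_teacher) ** 2) / n  # hidden-feature mismatch
    return base + lam * feature, base, feature

# Hypothetical toy problem: sizes and noise level are arbitrary.
rng = np.random.default_rng(0)
d_in, d_hidden, d_out, n = 5, 3, 2, 8
W1_t = rng.normal(size=(d_hidden, d_in))
W2_t = rng.normal(size=(d_out, d_hidden))
x_clean = rng.normal(size=(d_in, n))
y = W2_t @ W1_t @ x_clean  # assumption: labels are produced by the teacher itself
x_noisy = x_clean + 0.1 * rng.normal(size=x_clean.shape)

# A student that copies the teacher has zero loss on clean inputs ...
total_same, base_same, feat_same = student_teacher_loss(
    W1_t, W2_t, W1_t, x_clean, x_clean, y)
# ... but a nonzero feature mismatch once its inputs are corrupted by noise.
total_noisy, base_noisy, feat_noisy = student_teacher_loss(
    W1_t, W2_t, W1_t, x_clean, x_noisy, y)
```

In this sketch the feature term is what carries the teacher's clean-input knowledge to the student; how strongly it should be weighted against the base loss is left open by the abstract.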