Knowledge transfer has proven to be a highly successful technique for training neural classifiers: alongside the ground-truth data, it uses "privileged information" (PI) obtained from a "teacher" network to train a "student" network. It has been observed that classifiers learn much faster and more reliably via knowledge transfer. However, there has been little to no theoretical analysis of this phenomenon. To bridge this gap, we approach knowledge transfer by regularizing the fit between the teacher and the student with the PI provided by the teacher. Using tools from dynamical systems theory, we show that when the student is an extremely wide two-layer network, it can be analyzed in the kernel regime, and we show that it is able to interpolate between the PI and the given data. This characterization sheds new light on the relation between the training error and the capacity of the student relative to the teacher. Another contribution of the paper is a quantitative statement on the convergence of the student network: we prove that the teacher reduces the number of iterations required for the student to learn, and consequently improves the student's generalization power. We present corresponding experiments that validate the theoretical results and yield additional insights.
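The training objective described above — fitting the ground-truth labels while regularizing toward the teacher's privileged outputs — can be sketched in a few lines. The following is a minimal, hedged illustration, not the paper's actual setup: the data, teacher outputs, loss weight `lam`, and the two-layer ReLU student are all toy assumptions, and only the second (output) layer is trained, mimicking the kernel-regime analysis where the wide first layer stays close to initialization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical): n samples, d features, scalar targets.
n, d, width = 32, 5, 256
X = rng.normal(size=(n, d))
y = rng.normal(size=n)            # ground-truth labels
teacher = rng.normal(size=n)      # stand-in for teacher PI outputs

# Wide two-layer student: f(x) = a . relu(W x) / sqrt(width)
W = rng.normal(size=(width, d))   # first layer, kept frozen (kernel regime)
a = rng.normal(size=width)        # trained output layer

def forward(X, a):
    h = np.maximum(X @ W.T, 0.0)  # hidden ReLU activations
    return h @ a / np.sqrt(width)

def loss(pred, lam):
    # Fit to ground truth, regularized toward the teacher's outputs.
    return np.mean((pred - y) ** 2) + lam * np.mean((pred - teacher) ** 2)

lr, lam = 0.05, 0.5
h = np.maximum(X @ W.T, 0.0)
pred = forward(X, a)
before = loss(pred, lam)

# One gradient-descent step on the output layer a for the combined loss.
grad_pred = 2.0 * (pred - y) / n + lam * 2.0 * (pred - teacher) / n
a = a - lr * (h.T @ grad_pred) / np.sqrt(width)

after = loss(forward(X, a), lam)
print(f"combined loss before: {before:.3f}, after: {after:.3f}")
```

Setting `lam = 0` recovers plain supervised training; large `lam` pulls the student entirely toward the teacher's outputs, so the student interpolates between the PI and the given data as the abstract describes.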