Knowledge distillation (KD) has proven effective at improving a student classifier given a suitable teacher. The growing abundance of diverse pre-trained models offers a rich pool of teacher resources for KD. However, these models are often trained on tasks different from the student's, which requires the student to precisely select the most helpful teacher and to perform KD across different label spaces. These requirements expose the limitations of standard KD and motivate us to study a new paradigm called faculty distillation. Given a group of teachers (the faculty), the student needs to select the most relevant teacher and perform generalized knowledge reuse. To this end, we propose to link the teacher's task and the student's task via optimal transport. Based on the semantic relationship between their label spaces, we bridge the support gap between their output distributions by minimizing Sinkhorn distances. The transportation cost also serves as a measure of a teacher's adaptability, so we can rank the teachers efficiently according to their relatedness. Experiments under various settings demonstrate the succinctness and versatility of our method.
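To make the abstract's core mechanism concrete, the following is a minimal sketch (not the paper's exact recipe) of an entropy-regularized Sinkhorn distance between a teacher's and a student's output distributions over different label spaces, and of ranking teachers by the resulting transport cost. The label-space sizes, the random semantic cost matrices, and the teacher names are illustrative assumptions.

```python
# Hedged sketch: Sinkhorn distance between teacher and student output distributions
# with different supports, plus a toy ranking of teachers by transport cost.
import numpy as np

def sinkhorn_distance(p, q, cost, reg=0.1, n_iters=200):
    """Entropy-regularized OT cost between histograms p (teacher) and q (student).

    p:    (m,) teacher output distribution over its m classes
    q:    (n,) student output distribution over its n classes
    cost: (m, n) semantic cost between teacher labels and student labels
    """
    K = np.exp(-cost / reg)          # Gibbs kernel from the cost matrix
    u = np.ones_like(p)
    for _ in range(n_iters):         # Sinkhorn fixed-point scaling iterations
        v = q / (K.T @ u)
        u = p / (K @ v)
    plan = u[:, None] * K * v[None, :]   # transport plan with marginals (p, q)
    return float(np.sum(plan * cost))    # transport cost under the plan

# Toy usage: rank two hypothetical teachers by their transport cost to the student task.
student_probs = np.array([0.7, 0.2, 0.1])                 # 3 student classes
teacher_probs = {
    "teacher_A": np.array([0.5, 0.3, 0.1, 0.1]),          # 4 teacher classes
    "teacher_B": np.array([0.25, 0.25, 0.25, 0.25]),
}
# Semantic cost between label spaces (e.g. 1 - similarity of label embeddings);
# sampled randomly here purely for illustration.
rng = np.random.default_rng(0)
costs = {name: rng.uniform(0.0, 1.0, size=(4, 3)) for name in teacher_probs}

ranking = sorted(
    teacher_probs,
    key=lambda name: sinkhorn_distance(teacher_probs[name], student_probs, costs[name]),
)
print("teachers ranked by adaptability (lowest transport cost first):", ranking)
```

In this reading, a lower Sinkhorn cost suggests a teacher whose label space relates more closely to the student's, which is how the transportation cost can double as a relatedness score for teacher selection.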