The outpouring of various pre-trained models empowers knowledge distillation~(KD) by providing abundant teacher resources. Meanwhile, exploring the massive model repository to select a suitable teacher and further extracting its knowledge become daunting challenges. Standard KD faces two obstacles when training a student given plentiful pre-trained teachers, i.e., the ``faculty''. First, we need to seek out the most contributive teacher in the faculty efficiently rather than enumerating all of them for a student. Second, since the teacher may be pre-trained on tasks different from the student's, we must distill its knowledge from a more general label space. This paper studies this ``faculty distillation'' setting, where a student performs teacher assessment and generalized knowledge reuse. We take advantage of optimal transport to construct a unifying objective for both problems, which bridges the semantic gap and measures the relatedness between a pair of models. This objective selects the most relevant teacher, and we subsequently minimize the same objective over the student's parameters to transfer knowledge from the selected teacher. Experiments in various settings demonstrate the succinctness and versatility of our proposed method.
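To make the shared objective concrete, the following is a minimal sketch in our own notation (the symbols $\mathbf{p}$, $\mathbf{q}_\theta$, $C$, $\varepsilon$, and $H$ are illustrative assumptions, not necessarily the paper's formulation): an entropic optimal-transport distance between the teacher's and the student's predictive distributions serves both as the teacher-assessment score and as the transfer loss,
\[
\mathrm{OT}_{\varepsilon}(\mathbf{p}, \mathbf{q}_\theta) \;=\; \min_{\Pi \in U(\mathbf{p}, \mathbf{q}_\theta)} \langle \Pi, C \rangle - \varepsilon H(\Pi),
\qquad
U(\mathbf{p}, \mathbf{q}_\theta) = \{\Pi \geq 0 : \Pi \mathbf{1} = \mathbf{p},\; \Pi^{\top} \mathbf{1} = \mathbf{q}_\theta\},
\]
where $\mathbf{p}$ and $\mathbf{q}_\theta$ are the teacher's and student's predictive distributions over their (possibly different) label spaces, $C$ is a ground cost bridging the two label spaces (e.g., distances between class embeddings), and $H$ is the entropy of the transport plan $\Pi$. Teacher assessment selects the teacher with the smallest $\mathrm{OT}_{\varepsilon}$, and generalized knowledge reuse then minimizes the same quantity over the student parameters $\theta$.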