The proliferation of pre-trained models empowers knowledge distillation by providing abundant teacher resources, but a well-developed mechanism for utilizing these teachers adequately is still lacking. With a massive model repository composed of teachers pre-trained on diverse tasks, two obstacles must be surmounted when using knowledge distillation to learn a new task. First, given a fixed computing budget, it is not affordable to try every teacher and train the student repeatedly, so the most contributive teacher must be identified precisely and efficiently. Second, semantic gaps exist between the teachers and the target student because they are trained on different tasks, so knowledge must be extracted from a general label space that may differ from the student's. Faced with these two challenges, we study a new setting named selective cross-task distillation, which includes teacher assessment and generalized knowledge reuse. We bridge the teacher's label space and the student's label space through optimal transport: the transportation cost from the teacher's prediction to the student's prediction measures the relatedness between the two tasks and acts as an objective for distillation. Our method reuses cross-task knowledge from a distinct label space and efficiently assesses teachers without enumerating the entire model repository. Experiments demonstrate the effectiveness of the proposed method.
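To make the optimal-transport idea concrete, the sketch below shows one way a transport cost between a teacher prediction (over the teacher's label space) and a student prediction (over a different label space) could be computed with entropic regularization. This is a minimal illustration, not the paper's released implementation: the Sinkhorn solver, the ground-cost matrix, and all names (`sinkhorn_ot_cost`, `ground_cost`, the hyperparameters `eps` and `n_iters`) are assumptions introduced here for clarity.

```python
# Minimal sketch of an entropic optimal-transport cost between a teacher's
# prediction and a student's prediction defined over different label spaces.
# The ground cost between classes (e.g. distances between class embeddings)
# is an assumption of this illustration.
import torch
import torch.nn.functional as F


def sinkhorn_ot_cost(p_teacher, p_student, cost, eps=0.05, n_iters=50):
    """Approximate OT cost via Sinkhorn iterations.

    p_teacher: (Kt,) probabilities over the teacher's label space
    p_student: (Ks,) probabilities over the student's label space
    cost:      (Kt, Ks) ground cost between teacher and student classes
    """
    K = torch.exp(-cost / eps)                  # Gibbs kernel
    u = torch.ones_like(p_teacher)
    for _ in range(n_iters):                    # Sinkhorn fixed-point updates
        v = p_student / (K.t() @ u + 1e-9)
        u = p_teacher / (K @ v + 1e-9)
    plan = torch.diag(u) @ K @ torch.diag(v)    # approximate transport plan
    return (plan * cost).sum()                  # transport cost <plan, cost>


# Toy usage: a 5-class teacher, a 3-class student, random ground cost.
torch.manual_seed(0)
teacher_logits = torch.randn(5)
student_logits = torch.randn(3)
ground_cost = torch.rand(5, 3)

loss = sinkhorn_ot_cost(F.softmax(teacher_logits, dim=0),
                        F.softmax(student_logits, dim=0),
                        ground_cost)
print(float(loss))  # lower cost = the two predictions are more easily aligned
```

Under this reading, a small transport cost indicates that the teacher's task is closely related to the student's, which is what allows the same quantity to serve both as a teacher-assessment score and as a distillation objective.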