Knowledge distillation has become one of the primary trends in neural network compression, improving the generalization performance of a smaller student model with guidance from a larger teacher model. This rise in applications of knowledge distillation has been accompanied by numerous algorithms for transferring knowledge, such as soft targets and hint layers. Despite these advances in distillation techniques, the aggregation of different distillation paths has not been studied comprehensively. This is of particular significance, not only because different paths have different importance, but also because some paths may negatively affect the generalization performance of the student model. Hence, we need to adaptively adjust the importance of each path to maximize the impact of distillation on the student model. In this paper, we explore different approaches for aggregating these paths and propose an adaptive approach based on multitask learning methods. We empirically demonstrate the effectiveness of the proposed approach over other baselines for knowledge distillation in classification, semantic segmentation, and object detection tasks.
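The abstract does not specify the exact weighting scheme, but one common multitask-learning approach to adaptively weighting several loss terms is homoscedastic-uncertainty weighting (Kendall et al.), where each distillation path's loss is scaled by a learnable factor and a regularizer keeps the weights from collapsing. The sketch below is a minimal, hypothetical illustration of that idea in plain Python; the function name and the per-path log-variance parameters `log_vars` are assumptions, not the paper's actual method.

```python
import math

def aggregate_distillation_losses(path_losses, log_vars):
    """Adaptively combine per-path distillation losses.

    Each path i contributes exp(-s_i) * L_i + s_i, where s_i is a
    learnable log-variance. A path whose loss is noisy or harmful to
    the student can be down-weighted by the optimizer raising s_i,
    while the additive s_i term penalizes ignoring a path entirely.
    This mirrors uncertainty-based multitask weighting; it is an
    illustrative stand-in, not the paper's specific algorithm.
    """
    return sum(math.exp(-s) * loss + s
               for loss, s in zip(path_losses, log_vars))

# Example: two paths (e.g., a soft-target loss and a hint-layer loss)
# with neutral weights (s_i = 0) reduce to a plain sum of the losses.
total = aggregate_distillation_losses([1.0, 2.0], [0.0, 0.0])
```

In practice the `log_vars` would be registered as trainable parameters and updated jointly with the student's weights, so the relative importance of each distillation path is learned rather than hand-tuned.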