Knowledge distillation (KD) has become a well-established paradigm for compressing deep neural networks. KD is typically conducted by training the student network under the supervision of the teacher network, harnessing the knowledge at one or multiple spots (i.e., layers) in the teacher network. Once specified, these distillation spots stay fixed for all training samples throughout the whole distillation process. In this work, we argue that distillation spots should be adaptive to training samples and distillation epochs. We thus propose a new distillation strategy, termed spot-adaptive KD (SAKD), which adaptively determines the distillation spots in the teacher network per sample, at every training iteration, during the whole distillation period. Because SAKD focuses on "where to distill" rather than "what to distill", the question investigated by most existing works, it can be seamlessly integrated into existing distillation methods to further improve their performance. Extensive experiments with 10 state-of-the-art distillers demonstrate the effectiveness of SAKD in improving their distillation performance under both homogeneous and heterogeneous distillation settings. Code is available at https://github.com/zju-vipa/spot-adaptive-pytorch
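To make the contrast between fixed and per-sample distillation spots concrete, the following is a minimal sketch in plain Python and is not the SAKD implementation from the repository above. It assumes the per-sample spot decisions are already available as a binary mask; how SAKD actually produces those decisions is not described in this abstract. The names `mse`, `spot_adaptive_loss`, and `spot_mask`, the squared-error feature loss, and the toy numbers are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def spot_adaptive_loss(
    student_feats: List[List[List[float]]],
    teacher_feats: List[List[List[float]]],
    spot_mask: List[List[int]],
) -> float:
    """Batch-averaged feature loss, summed over the active spots only.

    student_feats[i][k], teacher_feats[i][k]: features of sample i at spot k.
    spot_mask[i][k]: 1 if spot k is distilled for sample i at this
    iteration, 0 otherwise (assumed to be given; hypothetical input).
    """
    if not student_feats:
        raise ValueError("batch must not be empty")
    total = 0.0
    for s_spots, t_spots, mask in zip(student_feats, teacher_feats, spot_mask):
        for s, t, m in zip(s_spots, t_spots, mask):
            if m:  # skip spots that are switched off for this sample
                total += mse(s, t)
    return total / len(student_feats)


# Toy batch: 2 samples, 2 candidate spots, 2-dimensional features.
student = [[[0.0, 0.0], [1.0, 1.0]], [[2.0, 2.0], [0.0, 0.0]]]
teacher = [[[1.0, 1.0], [1.0, 3.0]], [[2.0, 2.0], [2.0, 0.0]]]

# Conventional KD: every spot is used for every sample.
fixed_mask = [[1, 1], [1, 1]]
# Spot-adaptive: each sample uses its own subset of spots.
adaptive_mask = [[1, 0], [0, 1]]

fixed_loss = spot_adaptive_loss(student, teacher, fixed_mask)
adaptive_loss = spot_adaptive_loss(student, teacher, adaptive_mask)
```

With an all-ones mask the function reduces to conventional fixed-spot distillation. Because only the mask changes, any feature-matching term could in principle replace `mse`, which is one way to read the claim that "where to distill" is independent of "what to distill".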