Ensembles of deep neural networks have demonstrated superior performance, but their heavy computational cost hinders deploying them in resource-limited environments. This motivates distilling the knowledge of the ensemble teacher into a smaller student network, and there are two important design choices for this ensemble distillation: 1) how to construct the student network, and 2) what data should be shown during training. In this paper, we propose a weight averaging technique in which a student with multiple subnetworks is trained to absorb the functional diversity of the ensemble teachers, and those subnetworks are then properly averaged for inference, yielding a single student network with no additional inference cost. We also propose a perturbation strategy that seeks inputs from which the diversity of the teachers can be better transferred to the student. Combining these two, our method significantly improves upon previous ensemble distillation methods on various image classification tasks.
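The inference-time step described above collapses the student's subnetworks into one network by averaging their parameters. Below is a minimal sketch of that idea, not the authors' implementation; the architecture `SmallNet`, the helper `average_subnetworks`, and the number of subnetworks are hypothetical stand-ins, and the sketch assumes all subnetworks share an identical architecture.

```python
import copy
import torch
import torch.nn as nn

class SmallNet(nn.Module):
    """Hypothetical student subnetwork architecture."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(32, 10)

    def forward(self, x):
        return self.fc(x)

def average_subnetworks(subnets):
    """Return a single network whose parameters are the elementwise mean
    of the given subnetworks' parameters."""
    averaged = copy.deepcopy(subnets[0])
    avg_state = averaged.state_dict()
    for name in avg_state:
        stacked = torch.stack([sn.state_dict()[name] for sn in subnets])
        avg_state[name] = stacked.mean(dim=0)
    averaged.load_state_dict(avg_state)
    return averaged

# Usage: after distillation training, collapse the subnetworks for deployment,
# so inference costs the same as a single forward pass.
subnets = [SmallNet() for _ in range(4)]   # stand-ins for trained subnetworks
student = average_subnetworks(subnets)
logits = student(torch.randn(8, 32))
```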