Knowledge distillation has become one of the most important model compression techniques, transferring knowledge from a larger teacher network to a smaller student network. Although prior distillation methods have achieved great success by carefully designing various types of knowledge to transfer, they overlook the functional properties of neural networks, which makes applying these techniques to new tasks unreliable and non-trivial. To alleviate this problem, in this paper we leverage Lipschitz continuity to better characterize the functional behavior of neural networks and to guide the knowledge distillation process. In particular, we propose a novel Lipschitz Continuity Guided Knowledge Distillation framework that faithfully distills knowledge by minimizing the distance between the two networks' Lipschitz constants, which enables the teacher network to better regularize the student network and improves the resulting performance. Since computing the exact Lipschitz constant of a neural network is NP-hard, we derive an explainable approximation algorithm with an explicit theoretical derivation. Experimental results show that our method outperforms other baselines on several knowledge distillation tasks (e.g., classification, segmentation, and object detection) on the CIFAR-100, ImageNet, and PASCAL VOC datasets.
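To make the idea concrete, the sketch below illustrates one way such a Lipschitz-guided distillation loss could look; it is not the paper's algorithm. As a stand-in for the authors' approximation, it uses the product of per-layer spectral norms, a standard upper bound on the Lipschitz constant of a feedforward network with 1-Lipschitz activations (and only a crude proxy for convolutional layers). The function names `lipschitz_upper_bound` and `lipschitz_kd_loss`, the weight `alpha`, and the temperature `T` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def lipschitz_upper_bound(model: nn.Module) -> torch.Tensor:
    """Product of per-layer spectral norms: an upper bound on the Lipschitz
    constant for feedforward nets with 1-Lipschitz activations. Conv kernels
    are flattened to matrices, which is a common heuristic, not an exact bound."""
    bound = torch.tensor(1.0)
    for m in model.modules():
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            W = m.weight.reshape(m.weight.size(0), -1)
            bound = bound * torch.linalg.matrix_norm(W, ord=2)  # largest singular value
    return bound


def lipschitz_kd_loss(student, teacher, x, y, alpha=0.1, T=4.0):
    """Standard KD loss (cross-entropy + temperature-scaled KL) plus a penalty
    on the gap between the two networks' approximate Lipschitz constants."""
    s_logits = student(x)
    with torch.no_grad():
        t_logits = teacher(x)
    ce = F.cross_entropy(s_logits, y)
    kd = F.kl_div(
        F.log_softmax(s_logits / T, dim=1),
        F.softmax(t_logits / T, dim=1),
        reduction="batchmean",
    ) * T * T
    # Penalize the distance between the student's and teacher's Lipschitz estimates.
    lip_gap = (lipschitz_upper_bound(student)
               - lipschitz_upper_bound(teacher).detach()).abs()
    return ce + kd + alpha * lip_gap
```

In practice, working with the sum of log spectral norms instead of their raw product may be numerically more stable for deep networks; the choice here is kept simple for readability.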