Preventing Catastrophic Forgetting and Distribution Mismatch in Knowledge Distillation via Synthetic Data

With the increasing popularity of deep learning on edge devices, compressing large neural networks to meet the hardware requirements of resource-constrained devices has become a significant research direction. Numerous compression methodologies are currently used to reduce the memory footprint and energy consumption of neural networks. Knowledge distillation (KD) is one such methodology: it uses data samples to transfer the knowledge captured by a large model (the teacher) to a smaller one (the student). However, for various reasons, the original training data might not be accessible at the compression stage. Data-free model compression is therefore an ongoing research problem that has been addressed by various works. In this paper, we point out that catastrophic forgetting is a problem that can potentially be observed in existing data-free distillation methods. Moreover, the sample generation strategies in some of these methods could result in a mismatch between the synthetic and real data distributions. To prevent such problems, we propose a data-free KD framework that maintains a dynamic collection of generated samples over time. Additionally, we add the constraint of matching the real data distribution to sample generation strategies that target maximum information gain. Our experiments demonstrate that our approach improves the accuracy of student models obtained via KD compared with state-of-the-art approaches on the SVHN, Fashion MNIST and CIFAR100 datasets.
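
To make the two ingredients named in the abstract concrete, the following is a minimal sketch of (a) a temperature-softened distillation loss between teacher and student outputs and (b) a fixed-capacity collection of generated samples that is updated over time. It is an illustration under stated assumptions, not the paper's method: the abstract does not specify the loss or how the collection is maintained, so the KL-divergence loss, the reservoir-sampling update rule, and the names `distillation_loss`, `SampleBank`, `add`, and `draw` are all hypothetical choices made here.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened outputs.

    Assumption: a standard KD objective; the abstract does not state the loss.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    # Clamp q to avoid log of zero if a student probability underflows.
    return sum(pi * math.log(pi / max(qi, 1e-12)) for pi, qi in zip(p, q) if pi > 0)


class SampleBank:
    """Fixed-capacity store of generated samples (hypothetical helper).

    Assumption: reservoir sampling is used so that samples generated early
    in training stay as likely to be retained as recent ones. Replaying them
    is one plausible way to counter forgetting; it is not taken from the paper.
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        """Offer one newly generated sample to the bank."""
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            # Replace a stored sample with probability capacity / seen.
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.samples[slot] = sample

    def draw(self, batch_size):
        """Return up to batch_size stored samples, chosen without replacement."""
        k = min(batch_size, len(self.samples))
        return self.rng.sample(self.samples, k)
```

In a training loop, each newly generated batch would be offered to the bank with `add`, and the student would be updated on a mix of fresh samples and samples returned by `draw`, with `distillation_loss` computed per sample from the teacher and student logits.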
