Data-Free Knowledge Distillation (KD) allows knowledge transfer from a trained neural network (teacher) to a more compact one (student) in the absence of the original training data. Existing works use a validation set to monitor the accuracy of the student on real data and report the highest performance reached during the entire process. However, validation data may not be available at distillation time either, making it infeasible to record the student snapshot that achieved the peak accuracy. Therefore, a practical data-free KD method should be robust and ideally provide monotonically increasing student accuracy during distillation. This is challenging because the student experiences knowledge degradation due to the distribution shift of the synthetic data. A straightforward approach to overcome this issue is to store the generated samples and rehearse them periodically, but this increases the memory footprint and creates privacy concerns. We instead propose to model the distribution of the previously observed synthetic samples with a generative network. In particular, we design a Variational Autoencoder (VAE) with a training objective that is customized to learn the synthetic data representations optimally. The student is rehearsed by the generative pseudo replay technique, with samples produced by the VAE. Hence, knowledge degradation can be prevented without storing any samples. Experiments on image classification benchmarks show that our method optimizes the expected value of the distilled model accuracy while eliminating the large memory overhead incurred by the sample-storing methods.
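The rehearsal idea can be illustrated with a minimal sketch. It is not the method's actual implementation: the customized VAE objective is not given here, so the generative model is replaced by a hypothetical `toy_decode` function applied to latent vectors drawn from a standard normal prior, and the teacher, student, loss weighting (`replay_weight`), and all function names are assumptions made for illustration. The sketch only shows how a distillation loss on freshly synthesized samples could be combined with a rehearsal loss on samples drawn from a generative model, so that no past samples need to be stored.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distill_loss(teacher_fn, student_fn, batch, temperature=1.0):
    """Mean KL(teacher || student) over a batch of inputs."""
    losses = [
        kl_divergence(
            softmax(teacher_fn(x), temperature),
            softmax(student_fn(x), temperature),
        )
        for x in batch
    ]
    return sum(losses) / len(losses)


def make_replay_sampler(decode, latent_dim, rng):
    """Return a function that draws z ~ N(0, I) and decodes it.

    `decode` stands in for a trained VAE decoder (an assumption here).
    """
    def sample():
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        return decode(z)
    return sample


def replay_step_loss(teacher_fn, student_fn, fresh_batch, sample_replay,
                     n_replay, replay_weight=1.0, temperature=1.0):
    """Distillation loss on fresh samples plus a weighted rehearsal loss
    on samples regenerated by the replay sampler (nothing is stored)."""
    replayed = [sample_replay() for _ in range(n_replay)]
    fresh = distill_loss(teacher_fn, student_fn, fresh_batch, temperature)
    rehearsal = distill_loss(teacher_fn, student_fn, replayed, temperature)
    return fresh + replay_weight * rehearsal


# Hypothetical stand-ins for the real networks, for illustration only.
def toy_teacher(x):
    return [2.0 * x[0] - x[1], x[0] + 0.5 * x[1]]


def toy_student(x):
    return [0.5 * x[0], -x[1]]


def toy_decode(z):
    return [z[0] + z[1], z[0] - z[1]]


if __name__ == "__main__":
    rng = random.Random(0)
    sampler = make_replay_sampler(toy_decode, latent_dim=2, rng=rng)
    fresh_batch = [[1.0, 2.0], [0.5, -1.0]]
    loss = replay_step_loss(toy_teacher, toy_student, fresh_batch,
                            sampler, n_replay=4)
    print(f"combined loss: {loss:.4f}")
```

In an actual training loop, the combined loss would be minimized with respect to the student's parameters, and the generative model would itself be updated as new synthetic samples are observed; both steps are omitted from this sketch.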