Knowledge distillation has demonstrated encouraging performance in deep model compression. Most existing approaches, however, require massive amounts of labeled data to accomplish the knowledge transfer, making model compression a cumbersome and costly process. In this paper, we investigate the practical few-shot knowledge distillation scenario, where we assume only a few samples without human annotations are available for each category. To this end, we introduce a principled dual-stage distillation scheme tailored for few-shot data. In the first stage, we graft the student blocks one by one onto the teacher and learn the parameters of the grafted block intertwined with those of the other teacher blocks. In the second stage, the trained student blocks are progressively connected and then grafted together onto the teacher network, allowing the learned student blocks to adapt to one another and eventually replace the teacher network. Experiments demonstrate that our approach, with only a few unlabeled samples, achieves gratifying results on CIFAR10, CIFAR100, and ILSVRC-2012. On CIFAR10 and CIFAR100, our performance is even on par with that of knowledge distillation schemes that utilize the full datasets. The source code is available at https://github.com/zju-vipa/NetGraft.
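To make the two stages concrete, below is a minimal PyTorch sketch of the block-grafting idea, not the released NetGraft code. The toy block definitions, widths, the plain MSE mimicry loss, and all helper names (make_teacher_blocks, train_grafted_block, graft_first_k) are illustrative assumptions; for simplicity, the student blocks here keep the teacher's block widths so no channel adapters are needed.

```python
# Illustrative sketch only: toy blocks, assumed names, and an MSE mimicry loss
# stand in for the paper's actual architectures and objectives.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_teacher_blocks(widths=(3, 32, 64, 128)):
    """Toy block-wise teacher: each block is a 3x3 conv stage."""
    return nn.ModuleList(
        nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                      nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
        for c_in, c_out in zip(widths[:-1], widths[1:]))

def make_student_blocks(widths=(3, 32, 64, 128)):
    """Lighter student blocks (1x1 convs) with matching in/out widths."""
    return nn.ModuleList(
        nn.Sequential(nn.Conv2d(c_in, c_out, 1),
                      nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
        for c_in, c_out in zip(widths[:-1], widths[1:]))

teacher_blocks = make_teacher_blocks()   # pretrained and frozen in practice
student_blocks = make_student_blocks()   # to be learned block by block

def run(blocks, x):
    for block in blocks:
        x = block(x)
    return x

def train_grafted_block(i, unlabeled_x, steps=20, lr=1e-3):
    """Stage 1: graft student block i into the teacher and train it alone so
    the hybrid network mimics the teacher's output on the few unlabeled samples."""
    hybrid = copy.deepcopy(teacher_blocks)
    hybrid[i] = student_blocks[i]          # shares parameters with student_blocks[i]
    hybrid.eval()                          # keep the frozen teacher blocks in eval mode
    hybrid[i].train()
    for p in hybrid.parameters():
        p.requires_grad_(False)
    for p in hybrid[i].parameters():
        p.requires_grad_(True)
    opt = torch.optim.Adam(hybrid[i].parameters(), lr=lr)
    teacher_blocks.eval()
    with torch.no_grad():
        target = run(teacher_blocks, unlabeled_x)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(run(hybrid, unlabeled_x), target)  # stand-in mimicry loss
        loss.backward()
        opt.step()

def graft_first_k(k):
    """Stage 2 (sketch): connect the first k trained student blocks and graft
    them together onto the remaining teacher blocks; at k == len(teacher_blocks)
    the student has fully replaced the teacher. The joint adaptation of the
    connected student blocks would mirror train_grafted_block and is omitted."""
    return nn.ModuleList(list(student_blocks[:k]) +
                         [copy.deepcopy(b) for b in teacher_blocks[k:]])

# Usage on a handful of unlabeled samples
x = torch.randn(8, 3, 32, 32)
for i in range(len(teacher_blocks)):     # Stage 1: one block at a time
    train_grafted_block(i, x)
student_net = graft_first_k(len(teacher_blocks))  # Stage 2 endpoint: teacher fully replaced
```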