Knowledge distillation (KD) is an effective training strategy for improving lightweight student models under the guidance of cumbersome teacher models. However, large architectural differences between teacher-student pairs limit the distillation gains. In contrast to previous adaptive distillation methods that reduce the teacher-student gap, we explore a novel training-free framework to search for the best student architecture for a given teacher. Our work first empirically shows that the optimal model under vanilla training is not the winner in distillation. Second, we find that the similarity of feature semantics and sample relations between randomly initialized teacher-student networks correlates well with final distillation performance. Thus, we efficiently measure similarity matrices conditioned on the semantic activation maps and select the optimal student via an evolutionary algorithm without any training. In this way, our student architecture search for Distillation WithOut Training (DisWOT) significantly improves the performance of the model in the distillation stage, with at least 180$\times$ training acceleration. Additionally, we extend the similarity metrics in DisWOT as new distillers and KD-based zero-proxies. Our experiments on CIFAR, ImageNet, and NAS-Bench-201 demonstrate that our technique achieves state-of-the-art results across different search spaces. Our project and code are available at https://lilujunai.github.io/DisWOT-CVPR2023/.
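As a rough illustration of the training-free proxy sketched above, the following PyTorch snippet scores a candidate student by comparing sample-relation matrices and channel-averaged activation maps between a randomly initialized teacher and student. The function names and the assumption that both models return their last feature maps are ours for illustration only; this is a minimal sketch, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sample_relation(feat: torch.Tensor) -> torch.Tensor:
    """(B, B) sample-relation matrix from a feature map of shape (B, C, H, W)."""
    b = feat.size(0)
    f = feat.reshape(b, -1)
    g = f @ f.t()                       # pairwise sample similarities
    return F.normalize(g, p=2, dim=1)   # row-normalize for scale invariance

def semantic_map(feat: torch.Tensor, size: int = 7) -> torch.Tensor:
    """Channel-averaged activation map, pooled to a common spatial size so that
    teacher and student maps stay comparable despite different widths/resolutions."""
    m = feat.mean(dim=1, keepdim=True)          # (B, 1, H, W)
    m = F.adaptive_avg_pool2d(m, size)          # (B, 1, size, size)
    return F.normalize(m.flatten(1), p=2, dim=1)

@torch.no_grad()
def diswot_score(teacher, student, images: torch.Tensor) -> float:
    """Training-free proxy: a randomly initialized student whose relation matrices
    and activation maps lie closer to the teacher's is predicted to distill better.
    Assumes both models return their last feature maps of shape (B, C, H, W)."""
    ft, fs = teacher(images), student(images)
    rel_gap = F.mse_loss(sample_relation(ft), sample_relation(fs))
    sem_gap = F.mse_loss(semantic_map(ft), semantic_map(fs))
    return -(rel_gap + sem_gap).item()           # higher score = smaller teacher-student gap
```

In an evolutionary search, a score of this kind would replace trained accuracy as the fitness of each candidate student, so ranking a candidate costs only a single forward pass on a small batch.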