This paper studies the potential of distilling knowledge from pre-trained models, especially Masked Autoencoders. Our approach is simple: in addition to optimizing the pixel reconstruction loss on masked inputs, we minimize the distance between the intermediate feature map of the teacher model and that of the student model. This design leads to a computationally efficient knowledge distillation framework, given that 1) only a small visible subset of patches is used, and 2) the (cumbersome) teacher model only needs to be partially executed, i.e., forward-propagating inputs through the first few layers, to obtain intermediate feature maps. Compared to directly distilling fine-tuned models, distilling pre-trained models substantially improves downstream performance. For example, by distilling the knowledge from an MAE pre-trained ViT-L into a ViT-B, our method achieves 84.0% ImageNet top-1 accuracy, outperforming the baseline of directly distilling a fine-tuned ViT-L by 1.2%. More intriguingly, our method can robustly distill knowledge from teacher models even with extremely high masking ratios: e.g., with a 95% masking ratio, where merely TEN patches are visible during distillation, our ViT-B competitively attains a top-1 ImageNet accuracy of 83.6%; surprisingly, it can still secure 82.4% top-1 ImageNet accuracy by aggressively training with just FOUR visible patches (98% masking ratio). The code and models are publicly available at https://github.com/UCSC-VLAA/DMAE.
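As a rough illustration of the objective described above, the sketch below combines an MAE-style pixel-reconstruction loss on masked patches with a distance term between student and teacher intermediate feature maps, where the teacher is only partially executed on the visible patches. The model interfaces (`student`, `teacher_early`, `proj`, `ids_keep`), the L2 distance, and the weight `alpha` are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student, teacher_early, proj, images, alpha=1.0, mask_ratio=0.95):
    # The student is an MAE-style model: it masks the input, reconstructs the
    # masked patches, and exposes an intermediate feature map of the visible
    # patches. The return signature (pixel_loss, feats, ids_keep) is hypothetical.
    pixel_loss, student_feats, ids_keep = student(images, mask_ratio=mask_ratio)

    # The frozen teacher is only partially executed: the same visible patches
    # are forward-propagated through its first few transformer blocks
    # (teacher_early is a hypothetical wrapper around those blocks).
    with torch.no_grad():
        teacher_feats = teacher_early(images, ids_keep=ids_keep)

    # Minimize the distance between the two intermediate feature maps; a simple
    # L2 distance after a linear projection to match feature dimensions is
    # assumed here.
    feat_loss = F.mse_loss(proj(student_feats), teacher_feats)

    # Total objective: pixel reconstruction on masked patches plus feature
    # distillation, balanced by the (hypothetical) weight alpha.
    return pixel_loss + alpha * feat_loss
```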