Masked autoencoders have become popular training paradigms for self-supervised visual representation learning. These models randomly mask a portion of the input and reconstruct the masked portion according to target representations. In this paper, we first show that a careful choice of the target representation is unnecessary for learning good representations, since different targets tend to yield similarly behaved models. Driven by this observation, we propose a multi-stage masked distillation pipeline that uses a randomly initialized model as the teacher, enabling us to effectively train high-capacity models without any effort to carefully design target representations. Interestingly, we further explore using teachers of larger capacity, obtaining distilled students with remarkable transfer ability. On tasks spanning classification, transfer learning, object detection, and semantic segmentation, the proposed method for masked knowledge distillation with bootstrapped teachers (dBOT) outperforms previous self-supervised methods by nontrivial margins. We hope our findings, as well as the proposed method, will motivate people to rethink the role of target representations in pre-training masked autoencoders.
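To make the pipeline concrete, the following minimal PyTorch sketch illustrates how masked distillation with a bootstrapped teacher could be organized: a frozen, randomly initialized teacher provides feature targets in the first stage, and each later stage reuses the previous stage's student as the new teacher. The tiny encoder, the 75% masking ratio, the smooth-L1 loss, and the per-stage re-initialization of the student are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of multi-stage masked distillation with bootstrapped teachers
# (in the spirit of dBOT). Backbone, loss, and hyper-parameters are assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Toy patch encoder standing in for a ViT backbone (illustrative only)."""

    def __init__(self, patch_dim=768, embed_dim=256, depth=4, heads=4):
        super().__init__()
        self.proj = nn.Linear(patch_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, patches):                    # patches: (B, N, patch_dim)
        return self.blocks(self.proj(patches))     # (B, N, embed_dim)


def distill_step(student, teacher, patches, mask_ratio=0.75):
    """One masked-distillation step: the student predicts the frozen teacher's
    features at randomly masked patch positions."""
    B, N, _ = patches.shape
    mask = torch.rand(B, N, device=patches.device) < mask_ratio  # True = masked

    with torch.no_grad():
        target = teacher(patches)                  # targets from the full view

    corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)  # zero masked patches
    pred = student(corrupted)

    # Distillation loss on masked positions only (assumed choice of loss).
    return F.smooth_l1_loss(pred[mask], target[mask])


def train_multi_stage(loader, num_stages=3, steps_per_stage=100, lr=1e-4):
    """Stage 0 distills from a randomly initialized, frozen teacher; each later
    stage bootstraps by promoting the distilled student to teacher."""
    teacher = TinyEncoder().eval()                 # random weights, never trained
    for p in teacher.parameters():
        p.requires_grad_(False)

    for stage in range(num_stages):
        student = TinyEncoder()                    # re-initialized each stage (assumption)
        opt = torch.optim.AdamW(student.parameters(), lr=lr)
        for _, patches in zip(range(steps_per_stage), loader):
            loss = distill_step(student, teacher, patches)
            opt.zero_grad()
            loss.backward()
            opt.step()
        # Bootstrap: the distilled student becomes the next stage's teacher.
        teacher = copy.deepcopy(student).eval()
        for p in teacher.parameters():
            p.requires_grad_(False)
    return teacher
```

Note that no pixel reconstruction target or hand-designed tokenizer appears anywhere in this sketch: the only supervision is the (initially random, then bootstrapped) teacher's features, which is the point the abstract makes about target representations.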