Masked image modeling (MIM) has become a popular strategy for self-supervised learning~(SSL) of visual representations with Vision Transformers. A representative MIM model, the masked auto-encoder (MAE), randomly masks a subset of image patches and reconstructs the masked patches given the unmasked patches. Concurrently, many recent works in self-supervised learning utilize the student/teacher paradigm, which provides the student with an additional target based on the output of a teacher composed of an exponential moving average (EMA) of previous students. Although common, relatively little is known about the dynamics of the interaction between the student and teacher. Through analysis of a simple linear model, we find that the teacher conditionally removes previous gradient directions based on feature similarities, effectively acting as a conditional momentum regularizer. Building on this analysis, we present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE), which adds an EMA teacher to MAE. We find that RC-MAE converges faster and requires less memory than state-of-the-art self-distillation methods during pre-training, which may help make the prohibitively expensive self-supervised learning of Vision Transformer models more practical. Additionally, we show that RC-MAE achieves better robustness and performance than MAE on downstream tasks such as ImageNet-1K classification, object detection, and instance segmentation.
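To make the idea concrete, below is a minimal sketch (not the authors' released code) of an EMA-teacher MAE training objective as described above: the student reconstructs the masked patches, an EMA copy of the student serves as the teacher, and the student additionally matches the teacher's reconstruction of the same masked view. The wrapper class, the `student(images, mask)` call signature, and the hyperparameters `ema_decay` and `lambda_consistency` are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn


class RCMAESketch(nn.Module):
    """Sketch of an MAE student with an EMA teacher and a consistency target."""

    def __init__(self, student: nn.Module, ema_decay: float = 0.999,
                 lambda_consistency: float = 1.0):
        super().__init__()
        self.student = student
        # Teacher is an EMA copy of the student; it receives no gradients.
        self.teacher = copy.deepcopy(student)
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.ema_decay = ema_decay
        self.lambda_consistency = lambda_consistency

    @torch.no_grad()
    def update_teacher(self):
        # EMA update: theta_teacher <- m * theta_teacher + (1 - m) * theta_student
        for p_t, p_s in zip(self.teacher.parameters(), self.student.parameters()):
            p_t.mul_(self.ema_decay).add_(p_s, alpha=1.0 - self.ema_decay)

    def forward(self, images: torch.Tensor, mask: torch.Tensor,
                target_pixels: torch.Tensor) -> torch.Tensor:
        # Student and teacher see the same masked view and predict the masked patches.
        pred_s = self.student(images, mask)
        with torch.no_grad():
            pred_t = self.teacher(images, mask)
        # Pixel reconstruction loss (as in MAE) plus a consistency loss to the teacher.
        loss_rec = nn.functional.mse_loss(pred_s, target_pixels)
        loss_con = nn.functional.mse_loss(pred_s, pred_t)
        return loss_rec + self.lambda_consistency * loss_con
```

In such a setup, `update_teacher()` would be called once per optimizer step after the student update, so the teacher stays a smoothed average of recent students and supplies the conditional-momentum-like signal discussed in the analysis.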