We present a novel masked image modeling (MIM) approach, context autoencoder (CAE), for self-supervised representation pretraining. The goal is to pretrain an encoder by solving the pretext task: estimating the masked patches from the visible patches in an image. Our approach first feeds the visible patches into the encoder, extracting their representations. Then, we make predictions from the visible patches to the masked patches in the encoded representation space. We introduce an alignment constraint, encouraging the representations for masked patches, predicted from the encoded representations of visible patches, to be aligned with the masked patch representations computed from the encoder. In other words, the predicted representations are expected to lie in the encoded representation space, which we empirically show benefits representation learning. Finally, the predicted masked patch representations are mapped to the targets of the pretext task through a decoder. In comparison to previous MIM methods (e.g., BEiT) that couple the encoding and pretext task completion roles, our approach benefits from separating the representation learning (encoding) role from the pretext task completion role, improving the representation learning capacity and accordingly helping more on downstream tasks. In addition, we explain why contrastive pretraining and supervised pretraining perform similarly and why MIM potentially performs better. We demonstrate the effectiveness of our CAE through superior transfer performance on downstream tasks: semantic segmentation, object detection, and instance segmentation.
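To make the pipeline concrete, below is a minimal PyTorch-style sketch of the three roles described above: an encoder over visible patches only, a latent regressor that predicts masked-patch representations (with the alignment constraint), and a decoder that completes the pretext task. All module names, depths, dimensions, and the choice of discrete-token targets are illustrative assumptions, not the authors' implementation; in particular the regressor is simplified here, and the alignment target is treated as a stop-gradient.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextAutoencoderSketch(nn.Module):
    """Minimal sketch of the CAE pretraining pipeline: encode visible patches,
    predict masked-patch representations in the encoded space (with an alignment
    constraint), then decode the predictions to the pretext-task targets."""

    def __init__(self, dim=768, heads=12, enc_depth=4, reg_depth=2, dec_depth=2,
                 target_vocab=8192):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        # Encoder: sees only the visible patches.
        self.encoder = nn.TransformerEncoder(make_layer(), enc_depth)
        # Latent regressor: predicts masked-patch representations from visible ones
        # (simplified here as self-attention over mask queries + visible tokens).
        self.regressor = nn.TransformerEncoder(make_layer(), reg_depth)
        # Decoder: maps predicted masked representations to pretext targets,
        # e.g. discrete visual tokens (an assumed target type in this sketch).
        self.decoder = nn.TransformerEncoder(make_layer(), dec_depth)
        self.to_target = nn.Linear(dim, target_vocab)
        self.mask_query = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, patches, visible_idx, masked_idx):
        # patches: (B, N, dim) patch embeddings (positions already added);
        # visible_idx / masked_idx: 1-D LongTensors of patch positions.
        B, M = patches.size(0), masked_idx.numel()
        z_vis = self.encoder(patches[:, visible_idx])            # encode visible patches
        queries = self.mask_query.expand(B, M, -1)
        # Predict representations for the masked patches from the visible ones.
        z_pred = self.regressor(torch.cat([queries, z_vis], dim=1))[:, :M]
        with torch.no_grad():
            # Alignment target: the encoder's own representations of the masked
            # patches; treated as a stop-gradient target in this sketch.
            z_tgt = self.encoder(patches[:, masked_idx])
        align_loss = F.mse_loss(z_pred, z_tgt)
        # Complete the pretext task from the predicted representations only.
        logits = self.to_target(self.decoder(z_pred))
        return logits, align_loss
```

As a hypothetical usage, with 196 patches of dimension 768, `ContextAutoencoderSketch()(patches, vis_idx, msk_idx)` takes `patches` of shape `(B, 196, 768)` and disjoint visible/masked index tensors; the pretraining loss would combine a prediction loss on `logits` with `align_loss`, keeping the pretext-task completion out of the encoder itself.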