We present a novel masked image modeling (MIM) approach, the context autoencoder (CAE), for self-supervised learning. We randomly partition the image into two sets: visible patches and masked patches. The CAE architecture consists of: (i) an encoder that takes the visible patches as input and outputs their latent representations, (ii) a latent context regressor that predicts the masked patch representations from the visible patch representations, which are not updated in this regressor, (iii) a decoder that takes the estimated masked patch representations as input and makes predictions for the masked patches, and (iv) an alignment module that aligns the estimated masked patch representations with the masked patch representations computed by the encoder. In comparison to previous MIM methods that couple the encoding and decoding roles, e.g., using a single module in BEiT, our approach attempts to~\emph{separate the encoding role (content understanding) from the decoding role (making predictions for masked patches)} using different modules, improving the content understanding capability. In addition, our approach makes predictions from the visible patches to the masked patches in \emph{the latent representation space}, which is expected to take on semantics. We also present explanations of why contrastive pretraining and supervised pretraining perform similarly and why MIM potentially performs better. We demonstrate the effectiveness of our CAE through superior transfer performance in downstream tasks: semantic segmentation, object detection, and instance segmentation.
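The data flow through the four modules can be sketched as follows. This is a minimal NumPy toy, not the paper's ViT implementation: the linear "encoder", mean-pooled "regressor", single-matrix "decoder", and all dimensions are illustrative assumptions standing in for the actual Transformer blocks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions for illustration only).
num_patches, patch_dim, latent_dim = 16, 48, 32
num_visible = 10

def encoder(patches, W_enc):
    # (i) content understanding: patches -> latent representations
    return patches @ W_enc

def regressor(z_visible, W_reg, num_masked):
    # (ii) predict masked-patch latents from visible latents;
    # the visible latents themselves are not updated here
    context = z_visible.mean(axis=0, keepdims=True)
    return np.repeat(context, num_masked, axis=0) @ W_reg

def decoder(z_masked_pred, W_dec):
    # (iii) make predictions for masked patches from estimated latents
    return z_masked_pred @ W_dec

# Randomly partition the image into visible and masked patches.
patches = rng.normal(size=(num_patches, patch_dim))
perm = rng.permutation(num_patches)
visible_idx, masked_idx = perm[:num_visible], perm[num_visible:]

W_enc = 0.1 * rng.normal(size=(patch_dim, latent_dim))
W_reg = 0.1 * rng.normal(size=(latent_dim, latent_dim))
W_dec = 0.1 * rng.normal(size=(latent_dim, patch_dim))

z_vis = encoder(patches[visible_idx], W_enc)
z_masked_pred = regressor(z_vis, W_reg, len(masked_idx))
recon = decoder(z_masked_pred, W_dec)

# (iv) alignment: the estimated masked-patch latents are aligned with
# the latents the encoder computes on the masked patches (a target
# that receives no gradient in practice).
z_masked_target = encoder(patches[masked_idx], W_enc)
align_loss = np.mean((z_masked_pred - z_masked_target) ** 2)
recon_loss = np.mean((recon - patches[masked_idx]) ** 2)
total_loss = recon_loss + align_loss
```

Note how the prediction from visible to masked patches happens entirely in the latent space: the decoder never sees the visible patches, which is the separation of encoding from decoding that the abstract emphasizes.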