Masked image modeling (MIM) has achieved promising results on various vision tasks. However, the limited discriminability of the learned representations indicates that there is still considerable room for building a stronger vision learner. Towards this goal, we propose Contrastive Masked Autoencoders (CMAE), a new self-supervised pre-training method for learning more comprehensive and capable vision representations. By elaborately unifying contrastive learning (CL) and MIM through novel designs, CMAE leverages their respective advantages and learns representations with both strong instance discriminability and local perceptibility. Specifically, CMAE consists of two branches: the online branch is an asymmetric encoder-decoder and the target branch is a momentum-updated encoder. During training, the online encoder reconstructs original images from latent representations of masked images to learn holistic features. The target encoder, fed with the full images, enhances the feature discriminability via contrastive learning with its online counterpart. To make CL compatible with MIM, CMAE introduces two new components, i.e., pixel shift for generating plausible positive views and a feature decoder for complementing the features of contrastive pairs. Thanks to these novel designs, CMAE effectively improves the representation quality and transfer performance over its MIM counterpart. CMAE achieves state-of-the-art performance on highly competitive benchmarks of image classification, semantic segmentation and object detection. Notably, CMAE-Base achieves $85.3\%$ top-1 accuracy on ImageNet and $52.5\%$ mIoU on ADE20k, surpassing previous best results by $0.7\%$ and $1.8\%$ respectively. Code will be made publicly available at \url{https://github.com/ZhichengHuang/CMAE}.
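To make the two-branch design described above more concrete, the following is a minimal, self-contained PyTorch sketch of one CMAE-style training step. All module sizes, the pixel-shift amount, the loss weight, and function names (`TinyEncoder`, `cmae_step`, etc.) are illustrative assumptions rather than the paper's actual configuration; in particular, the paper's ViT backbone, MAE-style decoding of masked patches with mask tokens, and the momentum (EMA) update of the target encoder are simplified or omitted here.

```python
# Minimal sketch of a CMAE-style training step (illustrative, not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in for the ViT encoder: per-patch linear embedding + one transformer layer."""
    def __init__(self, patch_dim=768, dim=256):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)

    def forward(self, patches):                      # patches: (B, N, patch_dim)
        return self.block(self.embed(patches))

def cmae_step(online_enc, target_enc, pixel_dec, feat_dec, proj_online, proj_target,
              images, patch_size=16, mask_ratio=0.75, shift=8, lam=0.1, tau=0.2):
    B, C, H, W = images.shape
    # Pixel shift: a small spatial offset yields a plausible positive view for contrastive learning.
    shifted = torch.roll(images, shifts=(shift, shift), dims=(2, 3))

    def patchify(x):
        p = patch_size
        x = x.unfold(2, p, p).unfold(3, p, p)        # (B, C, H/p, W/p, p, p)
        return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)

    patches, patches_tgt = patchify(images), patchify(shifted)
    N = patches.shape[1]
    keep = int(N * (1 - mask_ratio))
    idx = torch.rand(B, N).argsort(dim=1)[:, :keep]  # randomly chosen visible patches

    visible = torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, patches.shape[-1]))
    latent = online_enc(visible)                     # online branch sees only the masked view
    with torch.no_grad():                            # target branch (momentum/EMA update omitted)
        target_feat = target_enc(patches_tgt).mean(dim=1)

    # Reconstruction loss; for brevity this decodes only the visible patches,
    # whereas an MAE-style decoder would reconstruct the masked ones from mask tokens.
    loss_rec = F.mse_loss(pixel_dec(latent), visible)

    # Feature decoder completes the masked online features before contrastive matching.
    q = proj_online(feat_dec(latent).mean(dim=1))
    k = proj_target(target_feat)
    q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
    logits = q @ k.t() / tau                         # InfoNCE over in-batch negatives
    loss_cl = F.cross_entropy(logits, torch.arange(B, device=logits.device))

    return loss_rec + lam * loss_cl

# Example usage with illustrative dimensions:
if __name__ == "__main__":
    online_enc, target_enc = TinyEncoder(), TinyEncoder()
    pixel_dec = nn.Linear(256, 768)
    feat_dec = nn.TransformerEncoderLayer(256, nhead=4, batch_first=True)
    proj_online, proj_target = nn.Linear(256, 128), nn.Linear(256, 128)
    loss = cmae_step(online_enc, target_enc, pixel_dec, feat_dec, proj_online, proj_target,
                     torch.randn(4, 3, 224, 224))
    print(loss.item())
```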