Masked image modelling (e.g., Masked AutoEncoder) and contrastive learning (e.g., Momentum Contrast) have shown impressive performance on unsupervised visual representation learning. This work presents Masked Contrastive Representation Learning (MACRL) for self-supervised visual pre-training. In particular, MACRL leverages the effectiveness of both masked image modelling and contrastive learning. We adopt an asymmetric setting for the siamese network (i.e., an encoder-decoder structure in both branches), where one branch applies a higher mask ratio and stronger data augmentation, while the other adopts weaker data corruption. We optimize a contrastive learning objective based on the features learned by the encoders of both branches. Furthermore, we minimize an $L_1$ reconstruction loss on the decoders' outputs. In our experiments, MACRL achieves superior results on various vision benchmarks, including CIFAR-10, CIFAR-100, Tiny-ImageNet, and two other ImageNet subsets. Our framework provides unified insights for self-supervised visual pre-training and future research.
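To make the combined objective concrete, below is a minimal sketch of how the contrastive term on encoder features and the $L_1$ reconstruction term on both decoders' outputs could be summed into one loss. The symmetric InfoNCE form, the helper names, and the weighting factor `lam` are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a MACRL-style training objective (assumptions, not the
# authors' implementation): contrastive loss on encoder features from the
# strongly- and weakly-corrupted branches, plus L1 reconstruction losses.
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.2):
    """Symmetric InfoNCE over two (B, D) batches of encoder features."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def macrl_loss(z_strong, z_weak, rec_strong, rec_weak, images, lam=1.0):
    """Contrastive term on encoder features + L1 reconstruction on both
    decoders' outputs; `lam` is a hypothetical balancing weight."""
    loss_con = info_nce(z_strong, z_weak)
    loss_rec = F.l1_loss(rec_strong, images) + F.l1_loss(rec_weak, images)
    return loss_con + lam * loss_rec

# Toy usage with random tensors (batch of 8 images, 128-d features):
B = 8
imgs = torch.randn(B, 3, 32, 32)
z_s, z_w = torch.randn(B, 128), torch.randn(B, 128)
rec_s, rec_w = torch.randn(B, 3, 32, 32), torch.randn(B, 3, 32, 32)
print(macrl_loss(z_s, z_w, rec_s, rec_w, imgs).item())
```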