We introduce CAN, a simple, efficient, and scalable method for self-supervised learning of visual representations. Our framework is a minimal and conceptually clean synthesis of (C) contrastive learning, (A) masked autoencoders, and (N) the noise prediction approach used in diffusion models. The learning mechanisms are complementary: contrastive learning shapes the embedding space across a batch of image samples; masked autoencoders focus on reconstructing the low-frequency spatial correlations within a single image; and noise prediction encourages reconstruction of an image's high-frequency components. The combined approach results in a robust, scalable, and simple-to-implement algorithm. The training process is symmetric, with 50% of patches in both views masked at random, yielding a considerable efficiency improvement over prior contrastive learning methods. Extensive empirical studies demonstrate that CAN achieves strong downstream performance under both linear and finetuning evaluations on transfer learning and robustness tasks. CAN outperforms MAE and SimCLR when pre-training on ImageNet, and is especially useful for pre-training on larger uncurated datasets such as JFT-300M: for a linear probe on ImageNet, CAN achieves 75.4%, compared to 73.4% for SimCLR and 64.1% for MAE. The finetuned performance of our ViT-L model on ImageNet is 86.1%, compared to 85.5% for SimCLR and 85.4% for MAE. The overall FLOPs load of SimCLR is 70% higher than that of CAN for ViT-L models.
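The three objectives described above can be sketched in a toy form: both views have 50% of their patches masked, noise is added to the visible patches, and the total loss sums a contrastive term over pooled embeddings, a reconstruction term on masked patches, and a noise-prediction term. The linear "encoder" and "decoder" matrices below are hypothetical stand-ins for the paper's ViT backbone, and the unweighted sum is a simplification; this is a minimal NumPy illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(patches, ratio=0.5):
    """Randomly split patches into visible and masked sets (CAN masks 50%)."""
    n = patches.shape[0]
    perm = rng.permutation(n)
    k = int(n * (1 - ratio))
    return patches[perm[:k]], patches[perm[k:]]

def info_nce(z1, z2, temp=0.1):
    """(C) Contrastive InfoNCE loss between pooled embeddings of two views."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temp
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return float(-np.log(np.diag(p) + 1e-12).mean())

def can_loss(x1, x2, W_enc, W_dec, W_noise, sigma=0.1):
    """Toy CAN objective on two views x1, x2 of shape (B, N, D)."""
    pooled, recon, noise_l = [], 0.0, 0.0
    for view in (x1, x2):  # training is symmetric: both views are masked
        zs = []
        for patches in view:
            visible, masked = mask_patches(patches)
            noise = sigma * rng.standard_normal(visible.shape)
            h = (visible + noise) @ W_enc        # encode noisy visible patches
            zs.append(h.mean(axis=0))            # pooled embedding per image
            # (A) reconstruct masked patches from pooled context (toy decoder)
            pred = np.tile(zs[-1] @ W_dec, (len(masked), 1))
            recon += np.mean((pred - masked) ** 2)
            # (N) predict the injected noise on visible patches
            noise_l += np.mean((h @ W_noise - noise) ** 2)
        pooled.append(np.stack(zs))
    return info_nce(*pooled) + recon + noise_l

B, N, D = 4, 16, 8  # batch, patches per image, patch dim (toy sizes)
W = [0.1 * rng.standard_normal((D, D)) for _ in range(3)]
x = rng.standard_normal((B, N, D))
loss = can_loss(x + 0.05 * rng.standard_normal(x.shape), x, *W)
```

Because only 50% of patches in each view pass through the encoder, each forward pass processes half the tokens of a full-image contrastive method, which is the source of the efficiency gain the abstract cites.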