Masked Image Modeling (MIM) methods, like Masked Autoencoders (MAE), efficiently learn a rich representation of the input. However, adapting them to downstream tasks requires a sufficient amount of labeled data, since their rich features capture not only objects but also less relevant image background. In contrast, Instance Discrimination (ID) methods focus on objects. In this work, we study how to combine the efficiency and scalability of MIM with the ability of ID to perform downstream classification without large amounts of labeled data. To this end, we introduce Masked Autoencoder Contrastive Tuning (MAE-CT), a sequential approach that applies Nearest Neighbor Contrastive Learning (NNCLR) to a pre-trained MAE. MAE-CT tunes the rich features such that they form semantic clusters of objects without using any labels. Applied to large and huge Vision Transformer (ViT) models, MAE-CT matches or surpasses previous self-supervised methods trained on ImageNet in linear probing, k-NN, and low-shot classification accuracy, as well as in unsupervised clustering accuracy. Notably, similar results can be achieved without additional image augmentations. While ID methods generally rely on hand-crafted augmentations to avoid shortcut learning, we find that nearest neighbor lookup is sufficient and that this data-driven augmentation effect improves with model size. MAE-CT is also compute-efficient: starting from a MAE pre-trained ViT-L/16, MAE-CT increases the ImageNet 1% low-shot accuracy from 67.7% to 72.6%, linear probing accuracy from 76.0% to 80.2%, and k-NN accuracy from 60.6% to 79.1% in just five hours on eight A100 GPUs.
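To make the contrastive-tuning idea concrete, the sketch below illustrates the core NNCLR mechanism referenced above in PyTorch: embeddings of one view are replaced by their nearest neighbors in a FIFO support set before computing a contrastive loss against the second view. This is a minimal sketch under assumptions, not the paper's implementation; the class name `NNCLRHead`, the queue size, the temperature, and the simplified symmetric loss (omitting NNCLR's prediction MLP) are all illustrative choices. The MAE-pre-trained ViT backbone and projection head producing the normalized embeddings `z1` and `z2` are assumed to exist outside this snippet.

```python
# Minimal sketch of the nearest-neighbor contrastive-tuning step (assumptions:
# names, queue size, and temperature are illustrative, not the paper's values).
import torch
import torch.nn.functional as F


class NNCLRHead(torch.nn.Module):
    def __init__(self, dim=256, queue_size=65536):
        super().__init__()
        # FIFO support set of past embeddings used for nearest-neighbor lookup.
        self.register_buffer("queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.ptr = 0

    @torch.no_grad()
    def nn_lookup(self, z):
        # Replace each embedding by its nearest neighbor in the support set.
        sim = z @ self.queue.t()              # (B, Q) cosine similarities
        return self.queue[sim.argmax(dim=1)]  # (B, dim) nearest neighbors

    @torch.no_grad()
    def enqueue(self, z):
        # Insert the current batch into the FIFO queue (wrapping around).
        b = z.shape[0]
        idx = torch.arange(self.ptr, self.ptr + b, device=z.device) % self.queue.shape[0]
        self.queue[idx] = z
        self.ptr = (self.ptr + b) % self.queue.shape[0]


def nnclr_loss(head, z1, z2, temperature=0.1):
    # z1, z2: L2-normalized projections of two views of the same images.
    nn1 = head.nn_lookup(z1)                  # data-driven "augmentation"
    logits = nn1 @ z2.t() / temperature       # (B, B); diagonal = positives
    labels = torch.arange(z1.shape[0], device=z1.device)
    head.enqueue(z1)                          # update the support set
    return F.cross_entropy(logits, labels)
```

The nearest-neighbor lookup is what lets MAE-CT weaken or drop hand-crafted augmentations: positives are drawn from other, semantically similar images in the support set rather than being generated solely by heavy image transformations.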