Masked Autoencoders (MAEs) learn self-supervised representations by randomly masking input image patches and training the model to reconstruct them. In contrast, self-supervised contrastive learning methods encourage two views of the same input to have similar representations while pushing apart the representations of different inputs. We propose ViC-MAE, a general method that combines MAE and contrastive learning: the local feature representations learned under the MAE reconstruction objective are pooled into a global representation, which is then trained under a contrastive objective across video frames. We show that visual representations learned with ViC-MAE generalize well to both video classification and image classification tasks. Using a ViT-B/16 backbone pre-trained on the Moments in Time (MiT) dataset, we obtain state-of-the-art transfer learning from video to images on ImageNet-1k, improving absolute top-1 accuracy by 1.58% over recent previous work. Moreover, our method maintains competitive transfer-learning performance of 81.50% top-1 accuracy on the Kinetics-400 video classification benchmark. In addition, we show that despite its simplicity, ViC-MAE yields better results than combining MAE pre-training with previously proposed contrastive objectives such as VICReg and SimSiam.
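To make the combined objective concrete, the following is a minimal sketch of how an MAE reconstruction loss and a contrastive loss over pooled frame representations could be combined, as the abstract describes. It is not the authors' implementation: the module interfaces (mae_encoder, mae_decoder, projector), the mean-pooling choice, the InfoNCE form of the contrastive term, and the loss weight lam are all assumptions for illustration.

```python
# Hypothetical sketch of a ViC-MAE-style training step (not the official code).
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE-style contrastive loss: matching rows of z1/z2 are positives,
    all other pairs in the batch are negatives."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def vic_mae_step(frame_a, frame_b, mae_encoder, mae_decoder, projector,
                 mask_ratio=0.75, lam=1.0):
    """One training step combining MAE reconstruction of two video frames
    with a contrastive objective over their pooled (global) representations.
    The encoder/decoder/projector are assumed callables with the shown signatures."""
    # MAE branch: encode visible patches of each frame and reconstruct masked ones.
    tokens_a, mask_a, targets_a = mae_encoder(frame_a, mask_ratio)
    tokens_b, mask_b, targets_b = mae_encoder(frame_b, mask_ratio)
    recon_loss = (F.mse_loss(mae_decoder(tokens_a, mask_a), targets_a) +
                  F.mse_loss(mae_decoder(tokens_b, mask_b), targets_b))

    # Contrastive branch: pool local patch features into one global vector per
    # frame, project, and pull the two frames of the same clip together.
    g_a = projector(tokens_a.mean(dim=1))              # mean-pool over patch tokens
    g_b = projector(tokens_b.mean(dim=1))
    contrastive_loss = info_nce(g_a, g_b)

    return recon_loss + lam * contrastive_loss
```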