CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion

Philippe Weinzaepfel, Vincent Leroy, Thomas Lucas, Romain Brégier, Yohann Cabon, Vaibhav Arora, Leonid Antsfeld, Boris Chidlovskii, Gabriela Csurka, Jérôme Revaud

From arXiv, NeurIPS 2022

Masked Image Modeling (MIM) has recently been established as a potent pre-training paradigm. A pretext task is constructed by masking patches in an input image, and this masked content is then predicted by a neural network using visible patches as sole input. This pre-training leads to state-of-the-art performance when finetuned for high-level semantic tasks, e.g. image classification and object detection. In this paper we instead seek to learn representations that transfer well to a wide variety of 3D vision and lower-level geometric downstream tasks, such as depth prediction or optical flow estimation. Inspired by MIM, we propose an unsupervised representation learning task trained from pairs of images showing the same scene from different viewpoints. More precisely, we propose the pretext task of cross-view completion where the first input image is partially masked, and this masked content has to be reconstructed from the visible content and the second image. In single-view MIM, the masked content often cannot be inferred precisely from the visible portion only, so the model learns to act as a prior influenced by high-level semantics. In contrast, this ambiguity can be resolved with cross-view completion from the second unmasked image, on the condition that the model is able to understand the spatial relationship between the two images. Our experiments show that our pretext task leads to significantly improved performance for monocular 3D vision downstream tasks such as depth estimation. In addition, our model can be directly applied to binocular downstream tasks like optical flow or relative camera pose estimation, for which we obtain competitive results without bells and whistles, i.e., using a generic architecture without any task-specific design.
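To make the pretext task described in the abstract concrete, here is a minimal NumPy sketch of how a cross-view completion training example and its reconstruction loss could be set up. It is an illustration under stated assumptions, not the authors' implementation: the function names (`patchify`, `random_mask`, `cross_view_completion_loss`, `copy_from_second_view`), the patch size, the mask ratio of 0.75, and the mean-squared-error loss are all hypothetical choices made for this example. The `predict` callable stands in for the neural network, and the naive "copy the co-located patch from the second view" predictor is only a placeholder so that the sketch runs end to end.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # -> (grid_row, grid_col, patch_row, patch_col, channel)
    return x.reshape(gh * gw, patch * patch * c)


def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean vector where True marks a hidden (masked) patch."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask


def cross_view_completion_loss(predict, view1, view2, patch=16, mask_ratio=0.75, seed=0):
    """Mask patches of view1 and score how well `predict` reconstructs them.

    `predict` receives the visible patches of view1, their indices, ALL patches
    of the unmasked view2, and the indices of the patches it must reconstruct.
    The patch size, mask ratio, and MSE loss are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    p1 = patchify(view1, patch)
    p2 = patchify(view2, patch)
    mask = random_mask(p1.shape[0], mask_ratio, rng)
    visible_idx = np.flatnonzero(~mask)
    masked_idx = np.flatnonzero(mask)
    pred = predict(p1[visible_idx], visible_idx, p2, masked_idx)
    # The loss is computed only on the masked patches of the first view.
    return float(np.mean((pred - p1[masked_idx]) ** 2))


def copy_from_second_view(visible, visible_idx, second_view, masked_idx):
    """Placeholder predictor (not a learned model): copy the co-located patch from view2."""
    return second_view[masked_idx]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    view1 = rng.random((64, 64, 3))
    view2 = np.roll(view1, shift=8, axis=1)  # crude stand-in for a viewpoint change
    print(cross_view_completion_loss(copy_from_second_view, view1, view2))
```

The placeholder predictor only does well when the two views are pixel-aligned. A model that understands the spatial relationship between the views, as the abstract requires, would need to locate the corresponding content in the second image rather than copy from the same position.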
