Recent advances in 3D perception have shown impressive progress in understanding geometric structures of 3D shapes and even scenes. Inspired by these advances in geometric understanding, we aim to imbue image-based perception with representations learned under geometric constraints. We introduce an approach to learn view-invariant, geometry-aware representations for network pre-training, based on multi-view RGB-D data, that can then be effectively transferred to downstream 2D tasks. We propose to employ contrastive learning under both multi-view image constraints and image-geometry constraints to encode 3D priors into learned 2D representations. This results not only in improvement over 2D-only representation learning on the image-based tasks of semantic segmentation, instance segmentation, and object detection on real-world indoor datasets, but moreover, provides significant improvement in the low-data regime. We show a significant improvement of 6.0% on semantic segmentation with full data, as well as 11.9% with 20% of the data, against baselines on ScanNet.
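Below is a minimal sketch of the multi-view pixel-level contrastive objective the abstract describes, assuming PyTorch; the function name pixel_info_nce, the temperature value, and the tensor shapes are illustrative assumptions rather than the authors' exact implementation. Given dense feature maps from two RGB views and pixel correspondences (derivable from the RGB-D geometry), corresponding pixels are pulled together and non-corresponding pixels pushed apart with an InfoNCE loss.

```python
import torch
import torch.nn.functional as F


def pixel_info_nce(feat_a, feat_b, idx_a, idx_b, temperature=0.07):
    """InfoNCE over matched pixels of two views (illustrative sketch).

    feat_a, feat_b: (C, H, W) dense feature maps from views A and B.
    idx_a, idx_b:   (N,) flat pixel indices such that pixel idx_a[i] in
                    view A and pixel idx_b[i] in view B observe the same
                    3D point (obtained from the RGB-D correspondences).
    """
    c = feat_a.shape[0]
    # Gather and L2-normalize the matched pixel embeddings: (N, C).
    za = F.normalize(feat_a.reshape(c, -1).t()[idx_a], dim=1)
    zb = F.normalize(feat_b.reshape(c, -1).t()[idx_b], dim=1)
    # Similarity of every A-pixel to every B-pixel; the diagonal holds the
    # true correspondences (positives), off-diagonal entries are negatives.
    logits = za @ zb.t() / temperature
    target = torch.arange(za.shape[0], device=za.device)
    return F.cross_entropy(logits, target)
```

The image-geometry constraint can be sketched analogously, contrasting 2D pixel features against features of the corresponding 3D points from a geometric encoder instead of a second view.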