Pre-training has become a standard paradigm in many computer vision tasks. However, most methods are designed for the RGB image domain. Due to the discrepancy between the two-dimensional image plane and the three-dimensional space, such pre-trained models fail to perceive spatial information and serve as sub-optimal solutions for 3D-related tasks. To bridge this gap, we aim to learn a spatial-aware visual representation that can describe the three-dimensional space and is more suitable and effective for these tasks. To leverage point clouds, which are far superior to images in providing spatial information, we propose a simple yet effective 2D Image and 3D Point cloud Unsupervised pre-training strategy, called SimIPU. Specifically, we develop a multi-modal contrastive learning framework consisting of an intra-modal spatial perception module, which learns a spatial-aware representation from point clouds, and an inter-modal feature interaction module, which transfers the capability of perceiving spatial information from the point cloud encoder to the image encoder. Positive pairs for the contrastive losses are established by a matching algorithm and the projection matrix. The whole framework is trained in an unsupervised end-to-end fashion. To the best of our knowledge, this is the first study to explore contrastive learning pre-training strategies for outdoor multi-modal datasets containing paired camera images and LiDAR point clouds. Code and models are available at https://github.com/zhyever/SimIPU.
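To make the pairing mechanism concrete, the sketch below illustrates the two ingredients named above: projecting LiDAR points into the image plane with a camera projection matrix to establish matched (point, pixel) positive pairs, and an InfoNCE-style contrastive loss over the matched features. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation; the function names, the specific InfoNCE formulation, and the temperature value are illustrative.

```python
import numpy as np

def project_points(points, P):
    """Project N x 3 LiDAR points into the image plane.

    P is an assumed 3 x 4 camera projection matrix (intrinsics times
    extrinsics). Returns N x 2 pixel coordinates after the perspective divide.
    """
    homo = np.hstack([points, np.ones((points.shape[0], 1))])  # N x 4 homogeneous
    uvw = homo @ P.T                                           # N x 3
    return uvw[:, :2] / uvw[:, 2:3]                            # perspective divide

def info_nce(point_feats, image_feats, temperature=0.07):
    """InfoNCE loss where row i of each array is a matched positive pair.

    Features are L2-normalized; all other rows in the batch act as negatives.
    """
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    q = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    logits = p @ q.T / temperature                 # N x N cosine similarities
    # log-softmax over each row; positives sit on the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(p))
    return -log_probs[idx, idx].mean()
```

In practice the projected coordinates `uv` are used to sample image features at the locations where points land, and those samples are paired with the corresponding point-cloud features before computing the loss.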