Self-supervised pre-training for 3D vision has drawn increasing research interest in recent years. To learn informative representations, many previous works exploit invariances of 3D features, e.g., perspective invariance between views of the same scene, modality invariance between depth and RGB images, and format invariance between point clouds and voxels. Although they have achieved promising results, previous research lacks a systematic and fair comparison of these invariances. To address this issue, our work introduces, for the first time, a unified framework under which various pre-training methods can be investigated. We conduct extensive experiments and take a closer look at the contributions of different invariances in 3D pre-training. We also propose a simple but effective method that jointly pre-trains a 3D encoder and a depth map encoder using contrastive learning. Models pre-trained with our method gain a significant performance boost on downstream tasks. For instance, a pre-trained VoteNet outperforms previous methods on the SUN RGB-D and ScanNet object detection benchmarks by a clear margin.
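A minimal sketch of the kind of contrastive objective described above, assuming a symmetric InfoNCE loss between embeddings of a point cloud and its corresponding depth map; the function and variable names are hypothetical and the paper's actual loss and encoders may differ:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(point_feats, depth_feats, temperature=0.07):
    """Symmetric InfoNCE loss between point-cloud and depth-map embeddings.

    point_feats, depth_feats: (B, D) embeddings of corresponding samples,
    e.g. a point cloud and the depth map it was lifted from.
    """
    # L2-normalize so dot products become cosine similarities
    p = F.normalize(point_feats, dim=1)
    d = F.normalize(depth_feats, dim=1)

    # (B, B) similarity matrix; diagonal entries are the positive pairs
    logits = p @ d.t() / temperature
    targets = torch.arange(p.size(0), device=p.device)

    # Contrast in both directions: point -> depth and depth -> point
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage with random tensors standing in for encoder outputs
point_feats = torch.randn(32, 128)   # hypothetical 3D encoder output
depth_feats = torch.randn(32, 128)   # hypothetical depth encoder output
loss = contrastive_loss(point_feats, depth_feats)
```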