Current popular backbones in computer vision, such as Vision Transformers (ViT) and ResNets, are trained to perceive the world from 2D images. However, to more effectively understand 3D structural priors in 2D backbones, we propose Mask3D to leverage existing large-scale RGB-D data in a self-supervised pre-training that embeds these 3D priors into learned 2D feature representations. In contrast to traditional 3D contrastive learning paradigms that require 3D reconstructions or multi-view correspondences, our approach is simple: we formulate a pre-text reconstruction task by masking RGB and depth patches in individual RGB-D frames. We demonstrate that Mask3D is particularly effective at embedding 3D priors into the powerful 2D ViT backbone, enabling improved representation learning for various scene understanding tasks, such as semantic segmentation, instance segmentation, and object detection. Experiments show that Mask3D notably outperforms existing self-supervised 3D pre-training approaches on ScanNet, NYUv2, and Cityscapes image understanding tasks, with an improvement of +6.5% mIoU over the state-of-the-art Pri3D on ScanNet image semantic segmentation.
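To make the pre-text task concrete, below is a minimal sketch of masked RGB-D patch reconstruction on a single frame. It is not the paper's exact architecture: the patch size, mask ratio, and the linear encoder/decoder (standing in for the ViT backbone and a lightweight decoder) are illustrative assumptions.

```python
# Minimal sketch of a masked RGB-D patch-reconstruction pretext task.
# Assumptions (not from the paper): 16x16 patches, 75% mask ratio,
# linear encoder/decoder as placeholders for the ViT backbone.
import torch
import torch.nn as nn

PATCH = 16  # patch side length (assumed)

def patchify(x, p=PATCH):
    """(B, C, H, W) -> (B, N, C*p*p) non-overlapping patches."""
    B, C, H, W = x.shape
    x = x.unfold(2, p, p).unfold(3, p, p)          # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)

def random_mask(n_patches, mask_ratio, device):
    """Boolean mask over patches: True = masked (to be reconstructed)."""
    n_mask = int(n_patches * mask_ratio)
    ids = torch.rand(n_patches, device=device).argsort()
    mask = torch.zeros(n_patches, dtype=torch.bool, device=device)
    mask[ids[:n_mask]] = True
    return mask

class MaskedRGBDReconstruction(nn.Module):
    """Toy stand-in for the 2D backbone plus reconstruction heads."""
    def __init__(self, rgb_dim=3 * PATCH * PATCH, d_dim=PATCH * PATCH, width=256):
        super().__init__()
        self.encode = nn.Linear(rgb_dim, width)       # placeholder for the 2D ViT
        self.decode_rgb = nn.Linear(width, rgb_dim)   # reconstruct masked RGB patches
        self.decode_depth = nn.Linear(width, d_dim)   # reconstruct masked depth patches

    def forward(self, rgb, depth, mask_ratio=0.75):
        rgb_p, d_p = patchify(rgb), patchify(depth)
        mask = random_mask(rgb_p.shape[1], mask_ratio, rgb.device)
        visible = rgb_p.clone()
        visible[:, mask] = 0.0                        # hide the masked RGB patches
        feats = self.encode(visible)
        loss_rgb = (self.decode_rgb(feats)[:, mask] - rgb_p[:, mask]).pow(2).mean()
        loss_d = (self.decode_depth(feats)[:, mask] - d_p[:, mask]).pow(2).mean()
        return loss_rgb + loss_d

# Usage: a batch of individual RGB-D frames; no 3D reconstruction
# or multi-view correspondences are required.
model = MaskedRGBDReconstruction()
loss = model(torch.rand(2, 3, 224, 224), torch.rand(2, 1, 224, 224))
loss.backward()
```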