Some self-supervised cross-modal learning approaches have recently demonstrated the potential of image signals for enhancing point cloud representations. However, it remains an open question how to directly model cross-modal local and global correspondences in a self-supervised fashion. To address this, we propose PointCMC, a novel cross-modal method that models multi-scale correspondences across modalities for self-supervised point cloud representation learning. Specifically, PointCMC comprises: (1) a local-to-local (L2L) module that learns local correspondences through optimized cross-modal local geometric features, (2) a local-to-global (L2G) module that learns correspondences between local and global features across modalities via local-global discrimination, and (3) a global-to-global (G2G) module that leverages an auxiliary global contrastive loss between the point cloud and the image to learn high-level semantic correspondences. Extensive experimental results show that our approach outperforms existing state-of-the-art methods on various downstream tasks such as 3D object classification and segmentation. Code will be made publicly available upon acceptance.
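The auxiliary global contrastive loss used by the G2G module can be illustrated with a standard symmetric InfoNCE objective over paired global features, a common choice for cross-modal contrastive learning. This is a minimal sketch under that assumption, not the paper's exact formulation; the function name, feature format, and temperature value are illustrative:

```python
import math

def global_contrastive_loss(point_feats, img_feats, temperature=0.07):
    """Symmetric InfoNCE over paired global features (illustrative sketch).

    point_feats[i] and img_feats[i] are the global descriptors of the
    point cloud and the image of the same scene; other pairs in the
    batch serve as negatives.
    """
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / n for x in v]

    p = [normalize(v) for v in point_feats]
    g = [normalize(v) for v in img_feats]
    n = len(p)

    # Cosine-similarity logits scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(p[i], g[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy(row, target):
        # Numerically stable softmax cross-entropy for one row of logits.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        return log_z - row[target]

    # Point-to-image direction: match each point cloud to its image.
    loss_p2i = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Image-to-point direction: same logits, read column-wise.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_i2p = sum(cross_entropy(cols[j], j) for j in range(n)) / n
    return 0.5 * (loss_p2i + loss_i2p)
```

With well-aligned pairs the diagonal logits dominate and the loss approaches zero, which is the behavior the G2G module exploits to pull matching point-cloud and image features together in the shared embedding space.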