Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for both 2D and 3D computer vision. However, existing MAE-style methods can only learn from the data of a single modality, i.e., either images or point clouds, neglecting the implicit semantic and geometric correlation between 2D and 3D. In this paper, we explore how the 2D modality can benefit 3D masked autoencoding, and propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training. Joint-MAE randomly masks an input 3D point cloud and its projected 2D images, and then reconstructs the masked information of both modalities. For better cross-modal interaction, we construct Joint-MAE from two hierarchical 2D-3D embedding modules, a joint encoder, and a joint decoder composed of modal-shared and modal-specific decoders. On top of this, we further introduce two cross-modal strategies to boost 3D representation learning: a local-aligned attention mechanism that exploits 2D-3D semantic cues, and a cross-reconstruction loss that imposes 2D-3D geometric constraints. With our pre-training paradigm, Joint-MAE achieves superior performance on multiple downstream tasks, e.g., 92.4% accuracy with a linear SVM on ModelNet40 and 86.07% accuracy on the hardest split of ScanObjectNN.
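To make the joint masking-and-reconstruction idea concrete, the following is a minimal PyTorch sketch of the pipeline described above. It masks 2D and 3D token sequences, encodes the visible tokens of both modalities with one shared encoder, and decodes them through a modal-shared decoder followed by modality-specific prediction heads. All module names, dimensions, patch sizes, and masking ratios here are illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of 2D-3D joint masked autoencoding (assumed design, not the official code).
import torch
import torch.nn as nn


def random_mask(tokens, mask_ratio=0.6):
    """Randomly split tokens (B, N, C) into visible tokens and the masked token indices."""
    B, N, C = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)
    ids_shuffle = noise.argsort(dim=1)
    ids_keep, ids_mask = ids_shuffle[:, :n_keep], ids_shuffle[:, n_keep:]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, C))
    return visible, ids_mask


class JointMAESketch(nn.Module):
    def __init__(self, dim=384, depth=6, heads=6):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.joint_encoder = nn.TransformerEncoder(enc_layer, depth)        # shared over 2D and 3D tokens
        self.shared_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True), 2)
        self.head_3d = nn.Linear(dim, 3 * 32)    # assumed: 32 xyz points per masked 3D patch
        self.head_2d = nn.Linear(dim, 16 * 16)   # assumed: one 16x16 gray patch per masked 2D token
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, tokens_3d, tokens_2d, mask_ratio=0.6):
        # Mask both modalities, then encode the visible 2D and 3D tokens jointly.
        vis_3d, ids_3d = random_mask(tokens_3d, mask_ratio)
        vis_2d, ids_2d = random_mask(tokens_2d, mask_ratio)
        latent = self.joint_encoder(torch.cat([vis_3d, vis_2d], dim=1))
        # Append mask tokens, run the modal-shared decoder,
        # and project to modality-specific reconstruction targets.
        n_mask = ids_3d.size(1) + ids_2d.size(1)
        mask_tokens = self.mask_token.expand(latent.size(0), n_mask, -1)
        decoded = self.shared_decoder(torch.cat([latent, mask_tokens], dim=1))
        dec_3d = decoded[:, -n_mask:-ids_2d.size(1)]
        dec_2d = decoded[:, -ids_2d.size(1):]
        return self.head_3d(dec_3d), self.head_2d(dec_2d), ids_3d, ids_2d
```

In this sketch the training loss would be the usual per-modality reconstruction terms (e.g., Chamfer distance for the masked point patches and pixel regression for the masked image patches); the cross-reconstruction loss mentioned above would be an additional term that reconstructs one modality's masked target from the other modality's decoded features, and the hierarchical embedding and local-aligned attention modules are omitted here for brevity.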