Masked Autoencoders learn strong visual representations and achieve state-of-the-art results in several independent modalities, yet very few works have addressed their capabilities in multi-modality settings. In this work, we focus on point cloud and RGB image data, two modalities that are often presented together in the real world, and explore their meaningful interactions. To improve upon the cross-modal synergy in existing works, we propose PiMAE, a self-supervised pre-training framework that promotes 3D and 2D interaction through three aspects. Specifically, we first notice the importance of masking strategies between the two sources and utilize a projection module to complementarily align the masked and visible tokens of the two modalities. Then, we utilize a well-crafted two-branch MAE pipeline with a novel shared decoder to promote cross-modal interaction in the mask tokens. Finally, we design a unique cross-modal reconstruction module to enhance representation learning for both modalities. Through extensive experiments on large-scale RGB-D scene understanding benchmarks (SUN RGB-D and ScanNetV2), we show that interactively learning point-image features is nontrivial, and our approach greatly improves multiple 3D detectors, 2D detectors, and few-shot classifiers by 2.9%, 6.7%, and 2.4%, respectively. Code is available at https://github.com/BLVLab/PiMAE.
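The complementary masking alignment described above can be illustrated with a minimal sketch. This is not the authors' implementation; the projection mapping, token counts, and mask ratio below are hypothetical placeholders. The idea shown is that image patches hit by *visible* point tokens are masked in the image branch, so each modality must reconstruct regions the other branch observes.

```python
import numpy as np

# Toy sketch (not the PiMAE code): complementary cross-modal masking.
# Assume each point-cloud token has already been projected onto one of
# `num_patches` image patches; the `proj` mapping below is hypothetical.
rng = np.random.default_rng(0)
num_point_tokens, num_patches = 64, 196
mask_ratio = 0.6

# Hypothetical projection: point token i lands on image patch proj[i].
proj = rng.integers(0, num_patches, size=num_point_tokens)

# Randomly mask a fraction of the point tokens.
point_mask = np.zeros(num_point_tokens, dtype=bool)
masked_ids = rng.choice(num_point_tokens,
                        int(mask_ratio * num_point_tokens),
                        replace=False)
point_mask[masked_ids] = True

# Complementary alignment: patches covered by *visible* point tokens are
# masked in the image branch, so the two visible sets do not overlap.
image_mask = np.zeros(num_patches, dtype=bool)
image_mask[proj[~point_mask]] = True

# Top up the image mask with random patches to reach the target ratio.
target = int(mask_ratio * num_patches)
extra = target - int(image_mask.sum())
if extra > 0:
    candidates = np.flatnonzero(~image_mask)
    image_mask[rng.choice(candidates, extra, replace=False)] = True
```

After this, every visible point token projects onto a masked image patch, giving the cross-modal reconstruction targets the abstract refers to.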