Pre-training on large-scale image data has become the de-facto approach for learning robust 2D representations. In contrast, due to expensive data acquisition and annotation, the paucity of large-scale 3D datasets severely hinders the learning of high-quality 3D features. In this paper, we propose an alternative to obtain superior 3D representations from 2D pre-trained models via Image-to-Point Masked Autoencoders, named I2P-MAE. By self-supervised pre-training, we leverage the well-learned 2D knowledge to guide 3D masked autoencoding, which reconstructs the masked point tokens with an encoder-decoder architecture. Specifically, we first utilize off-the-shelf 2D models to extract multi-view visual features of the input point cloud, and then conduct two types of image-to-point learning schemes on top. For one, we introduce a 2D-guided masking strategy that keeps semantically important point tokens visible to the encoder. Compared to random masking, the network can better concentrate on significant 3D structures and recover the masked tokens from key spatial cues. For another, we enforce these visible tokens to reconstruct the corresponding multi-view 2D features after the decoder. This enables the network to effectively inherit high-level 2D semantics learned from rich image data for discriminative 3D modeling. Aided by our image-to-point pre-training, the frozen I2P-MAE, without any fine-tuning, achieves 93.4% accuracy with a linear SVM on ModelNet40, competitive with the fully trained results of existing methods. By further fine-tuning on ScanObjectNN's hardest split, I2P-MAE attains state-of-the-art 90.11% accuracy, +3.68% over the second-best, demonstrating superior transferable capacity. Code will be available at https://github.com/ZrrSkywalker/I2P-MAE.