The pretraining-finetuning paradigm has demonstrated great success in NLP and 2D vision because of the strong representation ability and transferability of pretrained models. However, pretraining such a strong model is difficult in the 3D point cloud field since the training data is limited and point cloud collection is expensive. This paper introduces Efficient Point Cloud Learning (EPCL), an effective and efficient point cloud learner for directly training high-quality point cloud models with a frozen CLIP model. Our EPCL connects the 2D and 3D modalities by semantically aligning the 2D image features and 3D point cloud features without requiring paired 2D-3D data. Specifically, the input point cloud is divided into a sequence of tokens and directly fed into the frozen CLIP model to learn point cloud representations. Furthermore, we design a task token to narrow the gap between 2D images and 3D point clouds. Comprehensive experiments on 3D detection, semantic segmentation, classification and few-shot learning demonstrate that the 2D CLIP model can be an efficient point cloud backbone and that our method achieves state-of-the-art accuracy on both real-world and synthetic downstream tasks. Code will be available.
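To make the pipeline concrete, below is a minimal PyTorch sketch of the idea described above: point-cloud tokens plus a learnable task token are fed through frozen CLIP transformer blocks. This is an illustrative assumption, not the authors' implementation; the chunk-based point tokenizer, the ViT-B/32 backbone choice, and the classification head are simplifications (the actual method may use farthest-point sampling and kNN grouping and different heads per task).

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git


class PointTokenizer(nn.Module):
    """Embed groups of points into CLIP-width tokens.

    Simplified stand-in for a point-patch embedding: groups are formed by
    plain chunking rather than FPS + kNN, and a shared MLP with per-group
    max-pooling produces one token per group.
    """

    def __init__(self, points_per_group: int = 32, width: int = 768):
        super().__init__()
        self.points_per_group = points_per_group
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.GELU(), nn.Linear(128, width))

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:  # xyz: (B, N, 3)
        B, N, _ = xyz.shape
        k = self.points_per_group
        G = N // k
        groups = xyz[:, : G * k].reshape(B, G, k, 3)   # (B, G, k, 3)
        feats = self.mlp(groups)                        # (B, G, k, width)
        return feats.max(dim=2).values                  # (B, G, width)


class FrozenCLIPPointEncoder(nn.Module):
    """Point tokens + learnable task token through frozen CLIP ViT blocks."""

    def __init__(self, num_classes: int = 40, width: int = 768):
        super().__init__()
        clip_model, _ = clip.load("ViT-B/32", device="cpu")  # fp32 weights on CPU
        self.blocks = clip_model.visual.transformer          # frozen transformer blocks
        for p in self.blocks.parameters():
            p.requires_grad = False

        self.tokenizer = PointTokenizer(width=width)
        self.task_token = nn.Parameter(torch.zeros(1, 1, width))  # learnable task token
        nn.init.trunc_normal_(self.task_token, std=0.02)
        self.head = nn.Linear(width, num_classes)                 # e.g. classification head

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        tokens = self.tokenizer(xyz)                              # (B, G, width)
        task = self.task_token.expand(tokens.size(0), -1, -1)     # (B, 1, width)
        x = torch.cat([task, tokens], dim=1)                      # prepend task token
        x = x.permute(1, 0, 2)                                    # NLD -> LND (CLIP convention)
        x = self.blocks(x)                                        # frozen CLIP transformer
        x = x.permute(1, 0, 2)                                    # LND -> NLD
        return self.head(x[:, 0])                                 # predict from the task token


if __name__ == "__main__":
    model = FrozenCLIPPointEncoder(num_classes=40)
    logits = model(torch.randn(2, 1024, 3))                       # -> (2, 40)
    print(logits.shape)
```

Only the tokenizer, task token, and task head are trainable here; the CLIP transformer stays frozen, which is what makes the approach efficient to train.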