Contrastive Language-Image Pre-training (CLIP) has shown promising open-world performance on 2D image tasks, while its transferred capacity on 3D point clouds, i.e., PointCLIP, is still far from satisfactory. In this work, we propose PointCLIP V2, a powerful 3D open-world learner, to fully unleash the potential of CLIP on 3D point cloud data. First, we introduce a realistic shape projection module to generate more realistic depth maps for CLIP's visual encoder, which is highly efficient and narrows the domain gap between projected point clouds and natural images. Second, we leverage large-scale language models to automatically design more descriptive 3D-semantic prompts for CLIP's textual encoder, replacing the previous hand-crafted ones. Without introducing any training in the 3D domain, our approach significantly surpasses PointCLIP by +42.90%, +40.44%, and +28.75% accuracy on three datasets for zero-shot 3D classification. Furthermore, PointCLIP V2 can be extended to few-shot classification, zero-shot part segmentation, and zero-shot 3D object detection in a simple manner, demonstrating its superior generalization ability for 3D open-world learning. Code will be available at https://github.com/yangyangyang127/PointCLIP_V2.
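To make the projection idea concrete, below is a minimal, hypothetical sketch of rasterizing a point cloud into a depth map by grid quantization. The resolution, normalization, and per-pixel visibility rule are illustrative assumptions, not the paper's exact realistic shape projection pipeline.

```python
# Illustrative sketch: project an (N, 3) point cloud to a 2D depth map.
# Grid size, normalization, and the max-z visibility rule are assumptions
# for illustration, not the paper's exact module.
import numpy as np

def point_cloud_to_depth_map(points: np.ndarray, resolution: int = 224) -> np.ndarray:
    """points: (N, 3) xyz coordinates. Returns (resolution, resolution) depths in [0, 1]."""
    # Center the cloud and scale it into the unit cube [-1, 1]^3.
    pts = points - points.mean(axis=0)
    pts = pts / (np.abs(pts).max() + 1e-8)

    # Quantize x/y to pixel indices; map z to [0, 1] as the depth value.
    xy = ((pts[:, :2] * 0.5 + 0.5) * (resolution - 1)).astype(np.int64)
    depth = (pts[:, 2] * 0.5 + 0.5).astype(np.float32)

    # Keep the largest z per pixel as a simple visibility rule.
    depth_map = np.zeros((resolution, resolution), dtype=np.float32)
    np.maximum.at(depth_map, (xy[:, 1], xy[:, 0]), depth)
    return depth_map
```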
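Likewise, a hedged sketch of the zero-shot classification step over such depth maps with CLIP's text encoder, using descriptive 3D-aware prompts. The prompt wording below is an illustrative stand-in for the LLM-generated text, and the snippet assumes OpenAI's `clip` package.

```python
# Illustrative sketch: zero-shot classification of projected depth maps with
# CLIP. The prompts are hypothetical stand-ins for LLM-generated 3D prompts.
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["airplane", "chair", "lamp"]
# Descriptive, depth-map-aware prompts (illustrative, not the paper's exact text).
prompts = [f"a silhouette depth map of a {c} with smooth gray surfaces" for c in class_names]

with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize(prompts).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

def classify(depth_image):
    """depth_image: a PIL RGB image rendered from the projected depth map."""
    with torch.no_grad():
        img_feat = model.encode_image(preprocess(depth_image).unsqueeze(0).to(device))
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        logits = (100.0 * img_feat @ text_feat.T).softmax(dim=-1)
    return class_names[logits.argmax().item()]
```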