CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training

Pre-training across 3D vision and language remains under development because of limited training data. Recent works attempt to transfer vision-language pre-training models to 3D vision. PointCLIP converts point cloud data to multi-view depth maps and adopts CLIP for shape classification. However, its performance is restricted by the domain gap between rendered depth maps and images, as well as by the diversity of depth distributions. To address this issue, we propose CLIP2Point, an image-depth pre-training method that uses contrastive learning to transfer CLIP to the 3D domain and adapts it to point cloud classification. We introduce a new depth rendering setting that produces a better visual effect, and then render 52,460 pairs of images and depth maps from ShapeNet for pre-training. The pre-training scheme of CLIP2Point combines cross-modality learning, which pushes the depth features to capture expressive visual and textual features, with intra-modality learning, which enhances the invariance of depth aggregation. Additionally, we propose a novel Dual-Path Adapter (DPA) module, i.e., a dual-path structure with simplified adapters for few-shot learning. The dual-path structure allows the joint use of CLIP and CLIP2Point, and the simplified adapter fits few-shot tasks well without post-search. Experimental results show that CLIP2Point is effective in transferring CLIP knowledge to 3D vision. CLIP2Point outperforms PointCLIP and other self-supervised 3D networks, achieving state-of-the-art results on zero-shot and few-shot classification.
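The combination of cross-modality and intra-modality contrastive learning described above can be illustrated with a small sketch. This is not the authors' implementation: the function names (`info_nce`, `symmetric_contrastive`, `pretraining_loss`), the temperature of 0.07, the averaging of two depth features before the cross-modality term, and the equal weighting of the two terms are all assumptions made for illustration. Features are plain Python lists standing in for encoder outputs.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to positives[i],
    using every other row of `positives` as a negative."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity to every candidate, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def symmetric_contrastive(x, y, temperature=0.07):
    """Average the x-to-y and y-to-x InfoNCE losses."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))


def pretraining_loss(depth_a, depth_b, image, temperature=0.07):
    """Hypothetical combined objective (assumed, not taken from the paper).

    intra: two depth renderings of the same object should agree.
    cross: the aggregated depth feature should match the image feature.
    """
    intra = symmetric_contrastive(depth_a, depth_b, temperature)
    # Assumed aggregation: element-wise mean of the two depth features.
    depth_mean = [[(u + v) / 2 for u, v in zip(da, db)]
                  for da, db in zip(depth_a, depth_b)]
    cross = symmetric_contrastive(depth_mean, image, temperature)
    return intra + cross  # equal weighting is an assumption


if __name__ == "__main__":
    image = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    depth_a = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
    depth_b = [[0.8, 0.0, 0.2], [0.0, 0.8, 0.2], [0.2, 0.0, 0.8]]
    print(pretraining_loss(depth_a, depth_b, image))
```

In this toy setup the loss is lower when each depth feature is paired with the image feature of the same object than when the image batch is shuffled, which is the behavior a contrastive objective is meant to reward.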
