The recent success of pre-trained 2D vision models is largely attributable to learning from large-scale datasets. However, compared with 2D image datasets, the pre-training data currently available for 3D point clouds is limited. To overcome this limitation, we propose a knowledge distillation method that enables 3D point cloud pre-trained models to acquire knowledge directly from a 2D representation learning model, particularly the image encoder of CLIP, through concept alignment. Specifically, we introduce a cross-attention mechanism to extract concept features from the 3D point cloud and compare them with the semantic information from 2D images. In this scheme, the point cloud pre-trained models learn directly from the rich information contained in 2D teacher models. Extensive experiments demonstrate that the proposed knowledge distillation scheme achieves higher accuracy than state-of-the-art 3D pre-training methods on synthetic and real-world datasets across downstream tasks, including object classification, object detection, semantic segmentation, and part segmentation.
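To make the concept-alignment idea concrete, the following is a minimal PyTorch sketch of one plausible instantiation: learnable concept queries cross-attend over point-cloud tokens from a 3D student encoder, and the resulting concept features are aligned with a frozen CLIP image embedding via a cosine-similarity distillation loss. All module names, dimensions, the mean-pooling over concept slots, and the loss form are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptAlignment(nn.Module):
    """Hypothetical sketch: concept queries extract features from 3D point
    tokens via cross-attention; the pooled result is distilled against the
    CLIP image-encoder embedding of a paired 2D view."""

    def __init__(self, num_concepts=16, point_dim=384, clip_dim=512, num_heads=8):
        super().__init__()
        # Learnable concept queries (assumption: one query per concept slot).
        self.concept_queries = nn.Parameter(torch.randn(num_concepts, point_dim))
        self.cross_attn = nn.MultiheadAttention(point_dim, num_heads, batch_first=True)
        # Projection into the CLIP embedding space for comparison.
        self.proj = nn.Linear(point_dim, clip_dim)

    def forward(self, point_feats, clip_feats):
        """point_feats: (B, N, point_dim) tokens from the 3D student encoder.
        clip_feats:  (B, clip_dim) embedding from the frozen CLIP teacher."""
        B = point_feats.size(0)
        queries = self.concept_queries.unsqueeze(0).expand(B, -1, -1)
        # Concept queries attend over the point tokens (cross-attention).
        concepts, _ = self.cross_attn(queries, point_feats, point_feats)
        # Pool the concept slots and project to the teacher's space.
        z = self.proj(concepts.mean(dim=1))
        # Cosine-similarity distillation loss against the CLIP embedding.
        sim = F.cosine_similarity(
            F.normalize(z, dim=-1), F.normalize(clip_feats, dim=-1), dim=-1
        )
        return (1.0 - sim).mean()

# Usage sketch with random stand-ins for encoder outputs.
align = ConceptAlignment()
point_feats = torch.randn(4, 1024, 384)  # student point tokens
clip_feats = torch.randn(4, 512)         # frozen CLIP image embeddings
loss = align(point_feats, clip_feats)
loss.backward()
```

Under this reading, the cross-attention layer lets each concept slot aggregate a different semantic region of the point cloud before the aggregate is matched to the teacher's image-level representation.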