Vision-language models like CLIP have been widely adopted for various tasks due to their impressive zero-shot capabilities. However, CLIP is not suitable for extracting 3D geometric features, as it was trained only on images and text with natural language supervision. We address this limitation and propose a new framework, termed CG3D (CLIP Goes 3D), in which a 3D encoder is learned to exhibit zero-shot capabilities. CG3D is trained with natural language supervision on triplets of point clouds, their corresponding rendered 2D images, and text. To align the features in a multimodal embedding space, we apply a contrastive loss to the 3D features obtained from the 3D encoder and the visual and text features extracted from CLIP. We note that there is a distribution shift between the natural images used to train CLIP and the rendered 2D images in CG3D. Attempting to train the visual and text encoders to account for this shift causes catastrophic forgetting and a notable drop in performance. To solve this, we employ prompt tuning, introducing trainable parameters in the input space that shift CLIP towards the 3D pre-training dataset used in CG3D. We extensively test our pre-trained CG3D framework and demonstrate its impressive capabilities in zero-shot recognition, open scene understanding, and retrieval tasks. Furthermore, it serves as a strong set of initial weights for fine-tuning on downstream 3D recognition tasks.
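To make the training objective concrete, below is a minimal PyTorch sketch of the idea: a trainable 3D encoder is aligned to frozen image and text branches with a symmetric contrastive (InfoNCE) loss, while a small set of prompt parameters in the input space stays trainable. Everything here is an illustrative assumption rather than the paper's actual code: the StubEncoder modules stand in for CLIP's vision/text transformers and the point-cloud backbone, the prompt is added to the pixels instead of being prepended as ViT tokens, and all dimensions are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512  # shared multimodal embedding dimension (illustrative)

class StubEncoder(nn.Module):
    """Stand-in for a real encoder: flattens the input and projects it
    to the shared embedding space. A real setup would use frozen CLIP
    branches and a point-cloud transformer instead."""
    def __init__(self, in_dim):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, 1024), nn.GELU(),
                                  nn.Linear(1024, EMBED_DIM))
    def forward(self, x):
        return self.proj(x.flatten(1))

def info_nce(a, b, scale):
    """Symmetric InfoNCE over a batch of matched (a, b) pairs,
    as in CLIP-style contrastive training."""
    logits = scale * a @ b.t()                     # (B, B) similarities
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# Trainable 3D encoder; frozen "CLIP" image and text encoders.
point_enc = StubEncoder(1024 * 3)          # 1024 xyz points per cloud
image_enc = StubEncoder(3 * 224 * 224)
text_enc  = StubEncoder(77)                # 77-dim stand-in for text tokens
for p in list(image_enc.parameters()) + list(text_enc.parameters()):
    p.requires_grad = False                # keep CLIP frozen

# Prompt tuning: trainable parameters in the input space. Here the
# prompt is simply added to the image pixels; a token-based variant
# would prepend learnable tokens to the ViT sequence instead.
visual_prompt = nn.Parameter(torch.zeros(3, 224, 224))
logit_scale = nn.Parameter(torch.tensor(2.659))   # ~= ln(1 / 0.07)

opt = torch.optim.AdamW(
    list(point_enc.parameters()) + [visual_prompt, logit_scale], lr=1e-4)

# One training step on a toy triplet batch (point cloud, render, caption).
points = torch.randn(8, 1024, 3)
renders = torch.randn(8, 3, 224, 224)
captions = torch.randn(8, 77)              # stand-in for token embeddings

z3d = F.normalize(point_enc(points), dim=-1)
zim = F.normalize(image_enc(renders + visual_prompt), dim=-1)
ztx = F.normalize(text_enc(captions), dim=-1)

# Pull each point cloud toward its rendered image and its caption.
loss = info_nce(z3d, zim, logit_scale.exp()) + \
       info_nce(z3d, ztx, logit_scale.exp())
loss.backward()
opt.step()
```

Freezing the CLIP branches while optimizing only the 3D encoder and the input-space prompt parameters is what sidesteps the catastrophic forgetting described above: CLIP's embedding space stays intact, and the prompt absorbs the shift from natural images to rendered ones.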