ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding

The recognition capabilities of current state-of-the-art 3D models are limited by datasets with a small amount of annotated data and a pre-defined set of categories. In the 2D domain, recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language. Inspired by this, leveraging multimodal information for the 3D modality could be a promising way to improve 3D understanding under the restricted data regime, but this line of research is not well studied. Therefore, we introduce ULIP to learn a unified representation of images, text, and 3D point clouds by pre-training with object triplets from the three modalities. To overcome the shortage of training triplets, ULIP leverages a pre-trained vision-language model that has already learned a common visual and textual space by training on massive image-text pairs. ULIP then learns a 3D representation space aligned with this common image-text space, using a small number of automatically synthesized triplets. ULIP is agnostic to the 3D backbone network and can easily be integrated into any 3D architecture. Experiments show that ULIP effectively improves the performance of multiple recent 3D backbones simply by pre-training them on ShapeNet55 with our framework, achieving state-of-the-art performance in both standard 3D classification and zero-shot 3D classification on ModelNet40 and ScanObjectNN. ULIP also improves the performance of PointMLP by around 3% in 3D classification on ScanObjectNN, and outperforms PointCLIP by 28.8% in top-1 accuracy for zero-shot 3D classification on ModelNet40. Our code and pre-trained models are released at https://github.com/salesforce/ULIP.
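The alignment idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a symmetric InfoNCE-style contrastive objective between point-cloud embeddings and the frozen image and text embeddings, which the abstract does not spell out. The function names (`contrastive_loss`, `alignment_loss`, `zero_shot_classify`), the temperature value, and the equal weighting of the two terms are all hypothetical choices made for illustration. Embeddings are plain lists of floats standing in for encoder outputs.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(a, b):
    """Cosine similarity between every row of `a` and every row of `b`."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]


def contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE-style loss: row i of `a` should match row i of `b`.

    The temperature is an assumed value, not taken from the source.
    """
    sims = similarity_matrix(a, b)

    def one_direction(matrix):
        total = 0.0
        for i, row in enumerate(matrix):
            logits = [s / temperature for s in row]
            peak = max(logits)  # subtract the max for numerical stability
            log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
            total += log_z - logits[i]  # cross-entropy with target index i
        return total / len(matrix)

    transposed = [list(col) for col in zip(*sims)]
    return 0.5 * (one_direction(sims) + one_direction(transposed))


def alignment_loss(point_emb, image_emb, text_emb):
    """Pull 3D embeddings toward the frozen image and text embeddings.

    In training, only the 3D encoder producing `point_emb` would be updated;
    `image_emb` and `text_emb` come from the pre-trained vision-language model.
    Equal weighting of the two terms is an assumption.
    """
    return contrastive_loss(point_emb, image_emb) + contrastive_loss(point_emb, text_emb)


def zero_shot_classify(point_embedding, class_text_embeddings):
    """Return the class whose text embedding is most similar to the 3D embedding."""
    names = list(class_text_embeddings)
    sims = similarity_matrix(
        [point_embedding], [class_text_embeddings[n] for n in names]
    )[0]
    best = max(range(len(names)), key=sims.__getitem__)
    return names[best]
```

Once the 3D space is aligned with the text space, zero-shot classification needs no 3D labels for the target categories: a call such as `zero_shot_classify(embedding, {"chair": chair_text_emb, "table": table_text_emb})` simply picks the nearest category prompt, where the two text embeddings are placeholders for outputs of the frozen text encoder.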
