The goal of open-vocabulary detection is to identify novel objects based on arbitrary textual descriptions. In this paper, we address open-vocabulary 3D point-cloud detection with a divide-and-conquer strategy, which involves: 1) developing a point-cloud detector that learns a general representation for localizing diverse objects, and 2) connecting textual and point-cloud representations so that the detector can classify novel object categories via text prompting. Specifically, we leverage rich image pre-trained models, under which the point-cloud detector learns to localize objects supervised by 2D bounding boxes predicted by pre-trained 2D detectors. Moreover, we propose a novel de-biased triplet cross-modal contrastive learning scheme to connect the image, point-cloud, and text modalities, thereby enabling the point-cloud detector to benefit from vision-language pre-trained models, i.e., CLIP. This novel use of image and vision-language pre-trained models allows for open-vocabulary 3D object detection without requiring any 3D annotations. Experiments demonstrate that the proposed method improves over a wide range of baselines by at least 3.03 points on ScanNet and 7.47 points on SUN RGB-D. Furthermore, we provide a comprehensive analysis to explain why our approach works.
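To make the cross-modal alignment idea concrete, below is a minimal PyTorch sketch of a triplet contrastive loss over batched point-cloud, image, and text embeddings, in the spirit of CLIP-style InfoNCE. Everything here is an illustrative assumption: the function name, the simple in-batch negative sampling, and the equal weighting of the three pairwise terms are not taken from the paper, and the paper's de-biasing mechanism is not reproduced.

```python
import torch
import torch.nn.functional as F

def triplet_cross_modal_loss(pc_emb, img_emb, txt_emb, tau=0.07):
    """Hypothetical sketch: pull matched (point-cloud, image, text)
    embeddings together and push apart mismatched in-batch pairs.
    The paper's de-biasing is more involved; here negatives are
    simply all non-diagonal entries of each similarity matrix."""
    # L2-normalize so dot products are cosine similarities (as in CLIP).
    pc_emb, img_emb, txt_emb = (
        F.normalize(e, dim=-1) for e in (pc_emb, img_emb, txt_emb)
    )

    def info_nce(a, b):
        logits = a @ b.t() / tau  # (B, B) similarity matrix
        targets = torch.arange(a.size(0), device=a.device)
        # Symmetric cross-entropy over rows and columns.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

    # Triplet of pairwise alignments; the frozen CLIP image/text
    # encoders anchor the shared embedding space, and the point-cloud
    # encoder is trained to join it.
    return (info_nce(pc_emb, img_emb)
            + info_nce(pc_emb, txt_emb)
            + info_nce(img_emb, txt_emb)) / 3.0
```

In a sketch like this, only the point-cloud encoder would receive gradients, so the CLIP image and text embeddings act as fixed anchors that transfer CLIP's open-vocabulary classification ability to the 3D detector.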