Existing 3D scene understanding methods achieve strong performance on closed-set benchmarks but fail to handle the novel categories encountered in real-world applications. To this end, we propose a Regional Point-Language Contrastive learning framework, RegionPLC, for open-world 3D scene understanding, which equips models trained on closed-set datasets with open-vocabulary recognition capabilities. We propose dense visual prompts that elicit region-level visual-language knowledge from 2D foundation models via captioning, which in turn allows us to build dense regional point-language associations. We then design a point-discriminative contrastive learning objective that enables point-independent learning from captions for dense scene understanding. We conduct extensive experiments on the ScanNet, ScanNet200, and nuScenes datasets. RegionPLC significantly outperforms previous base-annotated 3D open-world scene understanding approaches by an average of 11.6\% on semantic segmentation and 6.6\% on instance segmentation. It also achieves promising open-world results in the absence of any human annotation, with low training and inference costs. Code will be released.
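As a rough illustration of the point-discriminative objective described above, the sketch below pairs each 3D point feature with the embedding of the caption covering its region and applies an InfoNCE-style loss, so that every point is supervised independently. This is a minimal NumPy sketch under assumed shapes and names (`point_feats`, `text_feats`, `point2caption`, temperature `tau`), not the paper's actual implementation.

```python
import numpy as np

def point_language_contrastive_loss(point_feats, text_feats, point2caption, tau=0.07):
    """InfoNCE-style point-language contrastive loss (hypothetical sketch).

    point_feats:    (N, d) L2-normalized per-point features
    text_feats:     (M, d) L2-normalized caption embeddings
    point2caption:  (N,)   index of the caption associated with each point
    """
    # cosine similarity between every point and every caption, scaled by temperature
    logits = point_feats @ text_feats.T / tau          # (N, M)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # each point is pulled toward its own caption independently of other points
    return -log_prob[np.arange(len(point_feats)), point2caption].mean()

# toy usage with random unit-norm features
rng = np.random.default_rng(0)
pts = rng.normal(size=(6, 8)); pts /= np.linalg.norm(pts, axis=1, keepdims=True)
txt = rng.normal(size=(3, 8)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
loss = point_language_contrastive_loss(pts, txt, np.array([0, 0, 1, 1, 2, 2]))
```

Because the loss averages over individual point-caption pairs rather than pooled region features, dense per-point supervision falls out naturally, which is the property the abstract refers to as point-independent learning.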