Training a 3D scene understanding model requires complex human annotations, which are laborious to collect and restrict the model to encoding only closed-set object semantics. In contrast, vision-language pre-training models (e.g., CLIP) have shown remarkable open-world reasoning properties. To this end, we propose directly transferring CLIP's feature space to a 3D scene understanding model without any form of supervision. We first modify CLIP's input and forwarding process so that it can extract dense pixel features for 3D scene contents. We then project multi-view image features onto the point cloud and train a 3D scene understanding model with feature distillation. Without any annotations or additional training, our model achieves promising annotation-free semantic segmentation on open-vocabulary semantics and long-tailed concepts. Moreover, serving as a cross-modal pre-training framework, our method improves data efficiency during fine-tuning. Our model outperforms previous state-of-the-art methods on various zero-shot and data-efficient learning benchmarks. Most importantly, our model successfully inherits CLIP's rich structured knowledge, allowing 3D scene understanding models to recognize not only object concepts but also open-world semantics.
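As a rough illustration of the feature-distillation step described above, the sketch below trains a 3D network to match CLIP pixel features that have been projected onto the point cloud; it is a minimal assumption-laden example, not the authors' implementation, and names such as `distillation_loss`, `backbone_3d`, and the (N, C) pairing of points with pixels are hypothetical.

```python
# Minimal sketch (not the paper's code) of 2D-to-3D feature distillation:
# multi-view CLIP pixel features are assumed to be already projected onto the
# point cloud, and a 3D network is trained to match them point-by-point.
import torch
import torch.nn.functional as F

def distillation_loss(point_feats_3d: torch.Tensor,
                      pixel_feats_2d: torch.Tensor) -> torch.Tensor:
    """point_feats_3d: (N, C) features predicted by the 3D scene model.
    pixel_feats_2d: (N, C) frozen CLIP pixel features gathered at the pixels
    that the N points project to (assumed precomputed from camera poses)."""
    p3d = F.normalize(point_feats_3d, dim=-1)
    p2d = F.normalize(pixel_feats_2d, dim=-1)
    # Maximize cosine similarity between each paired 3D and 2D feature.
    return (1.0 - (p3d * p2d).sum(dim=-1)).mean()

# Hypothetical training step (placeholder names):
#   point_feats = backbone_3d(points)                   # (N, C)
#   loss = distillation_loss(point_feats, clip_feats)   # clip_feats frozen
#   loss.backward()
```

With this setup the 3D backbone's point features live in CLIP's feature space, so at test time they can be compared against CLIP text embeddings of arbitrary category names, which is what enables the annotation-free, open-vocabulary segmentation claimed in the abstract.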