Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision. We propose OpenScene, an alternative approach in which a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space. This zero-shot approach enables task-agnostic training and open-vocabulary queries. For example, to perform state-of-the-art zero-shot 3D semantic segmentation, it first infers CLIP features for every 3D point and then classifies them based on their similarity to the embeddings of arbitrary class labels. More interestingly, it enables a suite of open-vocabulary scene understanding applications that have never been done before. For example, it allows a user to enter an arbitrary text query and then see a heat map indicating which parts of a scene match. Our approach is effective at identifying objects, materials, affordances, activities, and room types in complex 3D scenes, all using a single model trained without any labeled 3D data.
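A minimal sketch of the two operations described above, under the assumption that per-point features co-embedded in CLIP space are already available. The `point_feats` tensor, the prompt template, and the ViT-B/32 text encoder are illustrative placeholders rather than the paper's exact configuration; the OpenAI `clip` package is used here only to obtain text embeddings.

```python
import torch
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

# Hypothetical per-point features (N, D) predicted by a 3D network,
# assumed to already live in the same embedding space as CLIP text features.
point_feats = torch.randn(100_000, 512)

# Text encoder (checkpoint choice is illustrative, not the paper's setting).
model, _ = clip.load("ViT-B/32", device="cpu")

labels = ["chair", "table", "sofa", "wall", "floor"]
with torch.no_grad():
    text_feats = model.encode_text(
        clip.tokenize([f"a {c} in a scene" for c in labels])
    )

# Cosine similarity between every 3D point and every label embedding.
point_feats = F.normalize(point_feats.float(), dim=1)
text_feats = F.normalize(text_feats.float(), dim=1)
sim = point_feats @ text_feats.T            # (N, num_labels)

# Zero-shot semantic segmentation: each point takes its most similar label.
pred = sim.argmax(dim=1)                    # (N,) integer label per point

# Open-vocabulary query: a per-point "heat map" for an arbitrary text prompt.
with torch.no_grad():
    q = F.normalize(model.encode_text(clip.tokenize(["somewhere to sit"])).float(), dim=1)
heat = (point_feats @ q.T).squeeze(1)       # higher value = stronger match
```

Because both segmentation and querying reduce to cosine similarity against text embeddings, the same per-point features serve any label set or free-form prompt without retraining.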