Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision. We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space. This zero-shot approach enables task-agnostic training and open-vocabulary queries. For example, to perform SOTA zero-shot 3D semantic segmentation, it first infers CLIP features for every 3D point and later classifies them based on similarities to embeddings of arbitrary class labels. More interestingly, it enables a suite of open-vocabulary scene understanding applications that have never been done before. For example, it allows a user to enter an arbitrary text query and then see a heat map indicating which parts of a scene match. Our approach is effective at identifying objects, materials, affordances, activities, and room types in complex 3D scenes, all using a single model trained without any labeled 3D data.
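The open-vocabulary query step described above can be illustrated with a minimal sketch: text labels are encoded with a CLIP text encoder, and each 3D point's feature is compared to them by cosine similarity. This is an assumption-laden illustration, not the released OpenScene code; `point_features`, the label list, and the CLIP variant are placeholders.

```python
# Sketch of open-vocabulary querying over per-point CLIP features.
# Assumes a model has already produced an (N, D) tensor of per-point features
# co-embedded with CLIP; here `point_features` is a random placeholder.
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-L/14", device=device)  # any CLIP text encoder with a matching feature dim

# Arbitrary, user-supplied vocabulary -- no fixed label set is required.
labels = ["sofa", "wooden floor", "a place to sit", "kitchen"]
tokens = clip.tokenize(labels).to(device)
with torch.no_grad():
    text_emb = model.encode_text(tokens).float()            # (C, D)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# point_features: (N, D) dense features predicted for the 3D scene points (assumed given).
point_features = torch.randn(100_000, text_emb.shape[1], device=device)  # placeholder
point_features = point_features / point_features.norm(dim=-1, keepdim=True)

similarity = point_features @ text_emb.T   # (N, C) cosine similarities
pred_class = similarity.argmax(dim=-1)     # zero-shot per-point semantic labels
heat_map = similarity[:, 0]                # per-point relevance to the query "sofa"
```

The same similarity matrix serves both use cases from the abstract: taking the argmax over labels gives zero-shot semantic segmentation, while a single column gives the heat map for an arbitrary text query.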