We present a scalable approach for learning open-world object-goal navigation (ObjectNav) -- the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment (e.g., "find a sink"). Our approach is entirely zero-shot -- i.e., it does not require ObjectNav rewards or demonstrations of any kind. Instead, we train on the image-goal navigation (ImageNav) task, in which agents find the location where a picture (i.e., goal image) was captured. Specifically, we encode goal images into a multimodal, semantic embedding space to enable training semantic-goal navigation (SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D). After training, SemanticNav agents can be instructed to find objects described in free-form natural language (e.g., "sink", "bathroom sink", etc.) by projecting language goals into the same multimodal, semantic embedding space. As a result, our approach enables open-world ObjectNav. We extensively evaluate our agents on three ObjectNav datasets (Gibson, HM3D, and MP3D) and observe absolute improvements in success of 4.2%-20.0% over existing zero-shot methods. For reference, these gains are similar to or better than the 5% improvement in success between the Habitat 2020 and 2021 ObjectNav challenge winners. In an open-world setting, we discover that our agents can generalize to compound instructions in which a room is explicitly mentioned (e.g., "Find a kitchen sink") or can be inferred (e.g., "Find a sink and a stove").
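To make the core mechanism concrete, here is a minimal sketch (not the authors' code) of the idea of projecting both goal modalities into one shared embedding space. It assumes a CLIP-style encoder via the Hugging Face `transformers` library; the model name, helper functions, and normalization choices are illustrative assumptions, not details from the paper.

```python
# Sketch: image goals (used for ImageNav-style training) and language goals
# (used zero-shot at test time) are encoded into the same multimodal space,
# so an agent trained on one can be conditioned on the other.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative model choice; any CLIP-style encoder with shared
# image/text embeddings would serve the same role.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image_goal(image: Image.Image) -> torch.Tensor:
    """Encode a goal image into the shared semantic embedding space."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize

def embed_language_goal(text: str) -> torch.Tensor:
    """Encode a free-form object description into the same space."""
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize

# At test time, the navigation policy trained on image-goal embeddings can
# simply be handed a language-goal embedding instead:
goal_embedding = embed_language_goal("a kitchen sink")  # shape: (1, 512)
```

Because both goal types occupy the same space, the policy's goal interface is unchanged between training (image goals) and zero-shot evaluation (language goals), which is what lets the approach train without any ObjectNav annotations.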