For robots to be generally useful, they must be able to find arbitrary objects described by people (i.e., be language-driven) even without expensive navigation training on in-domain data (i.e., perform zero-shot inference). We explore these capabilities in a unified setting: language-driven zero-shot object navigation (L-ZSON). Inspired by the recent success of open-vocabulary models for image classification, we investigate a straightforward framework, CLIP on Wheels (CoW), to adapt open-vocabulary models to this task without fine-tuning. To better evaluate L-ZSON, we introduce the Pasture benchmark, which considers finding uncommon objects, objects described by spatial and appearance attributes, and hidden objects described relative to visible objects. We conduct an in-depth empirical study by directly deploying 21 CoW baselines across Habitat, RoboTHOR, and Pasture. In total, we evaluate over 90k navigation episodes and find that (1) CoW baselines often struggle to leverage language descriptions, but are proficient at finding uncommon objects. (2) A simple CoW, with CLIP-based object localization and classical exploration -- and no additional training -- matches the navigation efficiency of a state-of-the-art ZSON method trained for 500M steps on Habitat MP3D data. This same CoW provides a 15.6 percentage point improvement in success over a state-of-the-art RoboTHOR ZSON model.
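To make the core idea concrete, the sketch below shows one way a CoW-style agent could use CLIP, with no fine-tuning, to decide whether the language-specified goal object is visible in the current egocentric frame. This is an illustrative assumption rather than the paper's exact localization pipeline: the model choice ("ViT-B/32"), the prompt template, the distractor labels, and the 0.5 threshold are all hypothetical, and the helper name goal_in_view is introduced here only for illustration.

```python
# Minimal sketch (assumed, not the authors' exact method): score the current
# egocentric observation against a free-form goal description with CLIP and
# report whether the goal appears to be in view.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # illustrative backbone choice

def goal_in_view(frame: Image.Image, goal: str, distractors: list[str], thresh: float = 0.5) -> bool:
    """Return True if CLIP ranks the goal description above the distractor
    labels and its softmax probability exceeds `thresh` (both assumed values)."""
    labels = [goal] + distractors
    image = preprocess(frame).unsqueeze(0).to(device)
    text = clip.tokenize([f"a photo of a {label}" for label in labels]).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(image, text)          # shape: (1, len(labels))
        probs = logits_per_image.softmax(dim=-1)[0]
    return probs.argmax().item() == 0 and probs[0].item() > thresh

# Hypothetical usage: if the goal is detected, the agent would stop exploring
# and declare the object found; otherwise it continues classical exploration.
# frame = Image.open("obs.png")
# if goal_in_view(frame, "green apple on the kitchen counter", ["sofa", "television", "plant"]):
#     ...  # issue the stop action
```

In this sketch, exploration and stopping are left abstract; the point is only that an off-the-shelf open-vocabulary model can supply the object-localization signal without any navigation-specific training.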