We present LGX, a novel algorithm for Object Goal Navigation in a "language-driven, zero-shot manner", where an embodied agent navigates to an arbitrarily described target object in a previously unexplored environment. Our approach leverages the capabilities of Large Language Models (LLMs) for making navigational decisions by mapping the LLM's implicit knowledge about the semantic context of the environment into sequential inputs for robot motion planning. Simultaneously, we conduct generalized target object detection using a pre-trained Vision-Language grounding model. We achieve state-of-the-art zero-shot object navigation results on RoboTHOR, with a success rate (SR) improvement of over 27% over the current baseline, OWL-ViT CLIP on Wheels (OWL CoW). Furthermore, we study the use of LLMs for robot navigation and present an analysis of the various semantic factors affecting model output. Finally, we showcase the benefits of our approach via real-world experiments that indicate the superior performance of LGX when navigating to and detecting visually unique objects.
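To make the described pipeline concrete, the following is a minimal sketch of the kind of LLM-driven decision step the abstract refers to: the agent asks an LLM which visible object its semantic knowledge suggests moving toward, and the answer becomes the next sub-goal for the motion planner. All names here (`query_llm`, `choose_subgoal`, the prompt wording) are hypothetical illustrations under assumed interfaces, not the authors' implementation.

```python
# Illustrative sketch only: the LLM call is stubbed so the example stays
# self-contained; the prompt format and helper names are assumptions.
from typing import List


def query_llm(prompt: str) -> str:
    """Placeholder for a call to a Large Language Model provider.

    Returns the model's free-form text answer.
    """
    raise NotImplementedError("connect an LLM backend here")


def choose_subgoal(target: str, visible_objects: List[str]) -> str:
    """Ask the LLM which currently visible object the agent should move
    toward in order to find `target`, drawing on commonsense semantic
    context about typical object co-occurrence."""
    prompt = (
        f"A robot is looking for a '{target}'. "
        f"It currently sees: {', '.join(visible_objects)}. "
        "Which one of these objects should it move toward next? "
        "Answer with exactly one object name from the list."
    )
    answer = query_llm(prompt).strip().lower()
    # Fall back to the first visible object if the answer is not in the list.
    for obj in visible_objects:
        if obj.lower() in answer:
            return obj
    return visible_objects[0]


# Usage (illustrative): the chosen label would then be grounded in the image
# by a pre-trained vision-language detector, and the resulting location
# handed to the motion planner as the next navigation sub-goal.
# subgoal = choose_subgoal("coffee mug", ["sofa", "kitchen counter", "fridge"])
```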