Recent progress in large vision-language models has driven improvements in language-based semantic navigation, where an embodied agent must reach a target object described in natural language. Yet we still lack a clear, language-focused evaluation framework to test how well agents ground the words in their instructions. We address this gap by proposing LangNav, an open-vocabulary multi-object navigation dataset with natural language goal descriptions (e.g., 'go to the short red candle on the table') and corresponding fine-grained linguistic annotations (e.g., attributes: color=red, size=short; relations: support=on). These annotations enable systematic evaluation of language understanding. To evaluate agents in this setting, we extend the multi-object navigation task to Language-guided Multi-Object Navigation (LaMoN), in which the agent must find a sequence of goals specified in natural language. Furthermore, we propose the Multi-Layered Feature Map (MLFM), a novel method that builds a queryable, multi-layered semantic map from pretrained vision-language features and is effective for reasoning over the fine-grained attributes and spatial relations in goal descriptions. Experiments on LangNav show that MLFM outperforms state-of-the-art zero-shot mapping-based navigation baselines.