Vision-and-language navigation (VLN) is a multimodal task in which an agent follows natural language instructions to navigate through visual environments. Multiple setups have been proposed, and researchers have applied new model architectures and training techniques to boost navigation performance. However, non-negligible gaps remain between machine performance and human benchmarks. Moreover, the agents' inner mechanisms for making navigation decisions remain unclear. To the best of our knowledge, how agents perceive the multimodal input is under-studied and warrants investigation. In this work, we conduct a series of diagnostic experiments to unveil what agents focus on during navigation. Results show that indoor navigation agents refer to both object and direction tokens when making decisions. In contrast, outdoor navigation agents rely heavily on direction tokens and understand object tokens poorly. Transformer-based agents acquire a better cross-modal understanding of objects and display stronger numerical reasoning ability than non-Transformer-based agents. Regarding vision-and-language alignment, many models claim that they can align object tokens with specific visual targets. We find unbalanced attention on the vision and text inputs and cast doubt on the reliability of such cross-modal alignments.