Vision-language navigation is the task of directing an embodied agent to navigate 3D scenes by following natural language instructions. For the agent, inferring the long-term navigation target from visual-linguistic clues is crucial for reliable path planning, yet this has rarely been studied in the literature. In this article, we propose a Target-Driven Structured Transformer Planner (TD-STP) for long-horizon, goal-guided, and room layout-aware navigation. Specifically, we devise an Imaginary Scene Tokenization mechanism for explicit estimation of the long-term target (even when it is located in unexplored environments). In addition, we design a Structured Transformer Planner which elegantly incorporates the explored room layout into a neural attention architecture for structured and global planning. Experimental results demonstrate that TD-STP substantially improves on the success rate of the previous best methods, by 2% and 5% on the test sets of the R2R and REVERIE benchmarks, respectively. Our code is available at https://github.com/YushengZhao/TD-STP.