Recently, numerous algorithms have been developed to tackle vision-language navigation (VLN), a task that requires an agent to navigate 3D environments by following linguistic instructions. However, current VLN agents simply store their past experiences and observations as latent states in recurrent networks, so they fail to capture environment layouts and to plan over the long term. To address these limitations, we propose a crucial architecture, called Structured Scene Memory (SSM). It is compartmentalized enough to accurately memorize the percepts gathered during navigation. It also serves as a structured scene representation that captures and disentangles the visual and geometric cues in the environment. SSM has a collect-read controller that adaptively collects information to support current decision making and mimics iterative algorithms for long-range reasoning. Because SSM provides a complete action space, i.e., all the navigable places on the map, we introduce a frontier-exploration-based navigation decision-making strategy that enables efficient, global planning. Experimental results on two VLN datasets, R2R and R4R, show that our method achieves state-of-the-art performance on several metrics.
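To make the idea of a complete action space concrete, the following is a minimal sketch, not the SSM implementation. It assumes a toy topological map in which every observed but unvisited place is a frontier, and it replaces the learned collect-read controller with a caller-supplied scoring function. The names `SceneMemory`, `observe`, `frontiers`, `path`, and `choose_frontier` are hypothetical and do not come from the paper.

```python
from collections import deque


class SceneMemory:
    """Toy graph memory: visited places plus observed-but-unvisited frontiers."""

    def __init__(self, start):
        self.edges = {start: set()}  # adjacency between all known places
        self.visited = {start}

    def observe(self, place, navigable_neighbors):
        """Record that the agent stands at `place` and sees these neighbors."""
        self.visited.add(place)
        self.edges.setdefault(place, set())
        for n in navigable_neighbors:
            self.edges[place].add(n)
            self.edges.setdefault(n, set()).add(place)

    def frontiers(self):
        """Global action space: every known place that is not yet visited."""
        return sorted(p for p in self.edges if p not in self.visited)

    def path(self, src, dst):
        """Shortest path by BFS that only passes through visited places."""
        queue, parent = deque([src]), {src: None}
        while queue:
            cur = queue.popleft()
            if cur == dst:
                out = []
                while cur is not None:
                    out.append(cur)
                    cur = parent[cur]
                return out[::-1]
            if cur != src and cur not in self.visited:
                continue  # never route through an unexplored place
            for n in sorted(self.edges[cur]):
                if n not in parent:
                    parent[n] = cur
                    queue.append(n)
        return None


def choose_frontier(memory, score):
    """Pick the best-scoring frontier; `score` is a placeholder for a learned model."""
    candidates = memory.frontiers()
    return max(candidates, key=score) if candidates else None
```

In this sketch the agent can select any frontier on the map, not only a neighbor of its current position, and then reach it by planning over the already explored part of the graph. That is the sense in which a map-level action space permits global rather than purely local decisions.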