Recent studies in Vision-and-Language Navigation (VLN) train reinforcement learning (RL) agents to execute natural-language navigation instructions in photorealistic environments, as a step towards robots that can follow human instructions. However, given the scarcity of human instruction data and the limited diversity of training environments, these agents still struggle with complex language grounding and spatial language understanding. Pretraining on large text and image-text datasets from the web has been extensively explored, but the improvements are limited. We investigate large-scale augmentation with synthetic instructions. We take 500+ indoor environments captured in densely-sampled 360-degree panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory using Marky, a high-quality multilingual navigation instruction generator. We also synthesize image observations from novel viewpoints using an image-to-image GAN. The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets, and contains a wider variety of environments and viewpoints. To efficiently leverage data at this scale, we train a simple transformer agent with imitation learning. On the challenging RxR dataset, our approach outperforms all existing RL agents, improving the state-of-the-art NDTW from 71.1 to 79.1 in seen environments, and from 64.6 to 66.8 in unseen test environments. Our work points to a new path to improving instruction-following agents, emphasizing large-scale imitation learning and the development of synthetic instruction generation capabilities.
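The abstract reports NDTW without defining it. As a rough illustration only, the sketch below computes normalized dynamic time warping in its commonly used form, exp(-DTW(R, Q) / (|R| * d_th)), where R is the reference path, Q is the agent's path, and d_th is a distance threshold. The function names, the use of Euclidean distance between 2D points, and the 3.0 default threshold are assumptions made for this example and are not taken from the paper; reported scores such as 79.1 would correspond to this value expressed as a percentage.

```python
import math


def dtw(reference, query):
    """Dynamic time warping cost between two paths of coordinate tuples."""
    n, m = len(reference), len(query)
    inf = float("inf")
    # cost[i][j] is the minimum cumulative cost of aligning the first i
    # reference points with the first j query points.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumption: Euclidean distance stands in for whatever
            # distance the evaluation environment actually defines.
            d = math.dist(reference[i - 1], query[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def ndtw(reference, query, threshold=3.0):
    """Normalized DTW in (0, 1]; 1.0 means the paths coincide.

    threshold is a hypothetical default chosen for illustration.
    """
    return math.exp(-dtw(reference, query) / (len(reference) * threshold))


if __name__ == "__main__":
    reference_path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    agent_path = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
    print(f"nDTW = {ndtw(reference_path, agent_path):.4f}")
```

Because the score decays exponentially with the alignment cost, an agent that drifts from the reference path is penalized along the whole trajectory rather than only at the goal, which is why the metric suits instruction-following evaluation.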