Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments, as a step towards robots that can follow human instructions. However, given the scarcity of human instruction data and the limited diversity of training environments, these agents still struggle with complex language grounding and spatial language understanding. Pretraining on large text and image-text datasets from the web has been extensively explored, but the improvements are limited. We investigate large-scale augmentation with synthetic instructions. We take 500+ indoor environments captured in densely sampled 360-degree panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory using Marky, a high-quality multilingual navigation instruction generator. We also synthesize image observations from novel viewpoints using an image-to-image GAN. The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets and contains a wider variety of environments and viewpoints. To efficiently leverage data at this scale, we train a simple transformer agent with imitation learning. On the challenging RxR dataset, our approach outperforms all existing RL agents, improving the state-of-the-art NDTW from 71.1 to 79.1 in seen environments, and from 64.6 to 66.8 in unseen test environments. Our work points to a new path to improving instruction-following agents, emphasizing large-scale imitation learning and the development of synthetic instruction-generation capabilities.
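For readers unfamiliar with the NDTW numbers reported above: normalized Dynamic Time Warping (Ilharco et al., 2019) scores how closely the agent's path follows the reference path, by exponentiating the negative DTW alignment cost normalized by the reference length and a success threshold. Below is a minimal sketch of the metric, assuming Euclidean distance between path points and the 3 m threshold conventional in Matterport-based benchmarks; it is an illustration of the standard formula, not code from this paper.

```python
import numpy as np

def ndtw(query, reference, threshold=3.0):
    """Normalized Dynamic Time Warping between an agent path (`query`)
    and a reference path. Paths are (N, 2) or (N, 3) arrays of
    positions; `threshold` is the success distance in meters
    (3.0 m is the usual choice in Matterport-style environments)."""
    q = np.asarray(query, dtype=float)
    r = np.asarray(reference, dtype=float)
    nq, nr = len(q), len(r)
    # dtw[i, j] = minimal accumulated cost aligning q[:i] with r[:j].
    dtw = np.full((nq + 1, nr + 1), np.inf)
    dtw[0, 0] = 0.0
    for i in range(1, nq + 1):
        for j in range(1, nr + 1):
            cost = np.linalg.norm(q[i - 1] - r[j - 1])
            dtw[i, j] = cost + min(dtw[i - 1, j],      # skip a query step
                                   dtw[i, j - 1],      # skip a reference step
                                   dtw[i - 1, j - 1])  # match both
    # Normalize by reference length and threshold, squash into (0, 1].
    return float(np.exp(-dtw[nq, nr] / (nr * threshold)))
```

A path identical to the reference yields an NDTW of 1.0 (reported as 100 in some papers; here scores are on the 0-100 scale), and the score decays smoothly as the agent's trajectory drifts from the reference.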
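The abstract also states that the agent is a simple transformer trained with imitation learning, i.e., supervised prediction of the demonstrated action at each step of a trajectory. The following is a minimal behavioral-cloning sketch under that reading; the model interface and batch field names (`instruction_tokens`, `observations`, `expert_actions`) are hypothetical placeholders, since the abstract does not specify the architecture or data layout.

```python
import torch.nn.functional as F

def imitation_step(agent, batch, optimizer):
    """One behavioral-cloning update: train the agent to reproduce the
    expert action at every step of a demonstration trajectory. All
    names here are illustrative, not taken from the paper."""
    # logits: (batch, time, num_actions), conditioned on the language
    # instruction and the sequence of panoramic image observations.
    logits = agent(batch["instruction_tokens"], batch["observations"])
    loss = F.cross_entropy(
        logits.flatten(0, 1),               # (batch*time, num_actions)
        batch["expert_actions"].flatten(),  # teacher actions along the path
        ignore_index=-1,                    # padded steps carry no loss
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Compared with RL, this objective needs no reward shaping or environment rollouts during the update, which is what makes training on 4.2M synthetic instruction-trajectory pairs tractable.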