Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments, as a step towards robots that can follow human instructions. However, given the scarcity of human instruction data and the limited diversity of training environments, these agents still struggle with complex language grounding and spatial language understanding. Pretraining on large text and image-text datasets from the web has been extensively explored, but the improvements are limited. We investigate large-scale augmentation with synthetic instructions. We take 500+ indoor environments captured in densely sampled 360-degree panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory using Marky, a high-quality multilingual navigation instruction generator. We also synthesize image observations from novel viewpoints using an image-to-image GAN. The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets and contains a wider variety of environments and viewpoints. To efficiently leverage data at this scale, we train a simple transformer agent with imitation learning. On the challenging RxR dataset, our approach outperforms all existing RL agents, improving the state-of-the-art NDTW from 71.1 to 79.1 in seen environments, and from 64.6 to 66.8 in unseen test environments. Our work points to a new path to improving instruction-following agents, emphasizing large-scale imitation learning and the development of synthetic instruction-generation capabilities.
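For readers unfamiliar with the NDTW numbers reported above: normalized Dynamic Time Warping (Ilharco et al., 2019) scores how closely the agent's path follows the reference path, by exponentiating the negative DTW alignment cost normalized by the reference length and a success threshold. Below is a minimal sketch of the metric, assuming Euclidean distance between path points and the 3 m threshold conventional in Matterport-based benchmarks; it is an illustration of the standard formula, not code from this paper.

```python
import numpy as np

def ndtw(query, reference, threshold=3.0):
    """Normalized Dynamic Time Warping between an agent path (`query`)
    and a reference path. Paths are (N, 2) or (N, 3) arrays of
    positions; `threshold` is the success distance in meters
    (3.0 m is the usual choice in Matterport-style environments)."""
    q = np.asarray(query, dtype=float)
    r = np.asarray(reference, dtype=float)
    nq, nr = len(q), len(r)
    # dtw[i, j] = minimal accumulated cost aligning q[:i] with r[:j].
    dtw = np.full((nq + 1, nr + 1), np.inf)
    dtw[0, 0] = 0.0
    for i in range(1, nq + 1):
        for j in range(1, nr + 1):
            cost = np.linalg.norm(q[i - 1] - r[j - 1])
            dtw[i, j] = cost + min(dtw[i - 1, j],      # skip a query step
                                   dtw[i, j - 1],      # skip a reference step
                                   dtw[i - 1, j - 1])  # match both
    # Normalize by reference length and threshold, squash into (0, 1].
    return float(np.exp(-dtw[nq, nr] / (nr * threshold)))
```

A path identical to the reference yields an NDTW of 1.0 (reported as 100 in some papers; here scores are on the 0-100 scale), and the score decays smoothly as the agent's trajectory drifts from the reference.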
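The abstract also states that the agent is a simple transformer trained with imitation learning, i.e., supervised prediction of the demonstrated action at each step of a trajectory. The following is a minimal behavioral-cloning sketch under that reading; the model interface and batch field names (`instruction_tokens`, `observations`, `expert_actions`) are hypothetical placeholders, since the abstract does not specify the architecture or data layout.

```python
import torch.nn.functional as F

def imitation_step(agent, batch, optimizer):
    """One behavioral-cloning update: train the agent to reproduce the
    expert action at every step of a demonstration trajectory. All
    names here are illustrative, not taken from the paper."""
    # logits: (batch, time, num_actions), conditioned on the language
    # instruction and the sequence of panoramic image observations.
    logits = agent(batch["instruction_tokens"], batch["observations"])
    loss = F.cross_entropy(
        logits.flatten(0, 1),               # (batch*time, num_actions)
        batch["expert_actions"].flatten(),  # teacher actions along the path
        ignore_index=-1,                    # padded steps carry no loss
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Compared with RL, this objective needs no reward shaping or environment rollouts during the update, which is what makes training on 4.2M synthetic instruction-trajectory pairs tractable.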