In vision-and-language navigation (VLN), an embodied agent is required to navigate in realistic 3D environments following natural language instructions. One major bottleneck for existing VLN approaches is the lack of sufficient training data, resulting in unsatisfactory generalization to unseen environments. While VLN data is typically collected manually, such an approach is expensive and prevents scalability. In this work, we address the data scarcity issue by automatically creating a large-scale VLN dataset from 900 unlabeled 3D buildings in HM3D. We generate a navigation graph for each building and transfer 2D object predictions into pseudo 3D object labels via cross-view consistency. We then fine-tune a pretrained language model, using the pseudo object labels as prompts, to alleviate the cross-modal gap in instruction generation. The resulting HM3D-AutoVLN dataset is an order of magnitude larger than existing VLN datasets in terms of both navigation environments and instructions. We experimentally demonstrate that HM3D-AutoVLN significantly increases the generalization ability of the resulting VLN models. On the SPL metric, our approach improves over the state of the art by 7.1% and 8.1% on the unseen validation splits of the REVERIE and SOON datasets, respectively.
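The cross-view consistency step described above can be illustrated with a minimal sketch. Assuming 2D detections have already been associated with 3D object instances (e.g., by projecting detected boxes onto the building mesh), a label is kept only when enough views agree on it. The function name, input format, and thresholds below are hypothetical illustrations, not the paper's implementation.

```python
from collections import Counter, defaultdict

def pseudo_3d_labels(detections, min_views=3, min_agreement=0.5):
    """Aggregate per-view 2D detections into pseudo 3D object labels.

    detections: iterable of (instance_id, label) pairs, where instance_id
    identifies the same 3D instance across views (hypothetical format,
    e.g., obtained by projecting 2D boxes onto the mesh).
    """
    # Collect all per-view label votes for each 3D instance.
    votes = defaultdict(list)
    for inst_id, label in detections:
        votes[inst_id].append(label)

    labels = {}
    for inst_id, obs in votes.items():
        if len(obs) < min_views:
            continue  # too few views to trust this instance
        label, n = Counter(obs).most_common(1)[0]
        if n / len(obs) >= min_agreement:
            labels[inst_id] = label  # label is consistent across views
    return labels
```

Requiring a minimum number of views and a minimum agreement ratio filters out spurious single-view detections, which is the intuition behind using cross-view consistency to denoise 2D predictions.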
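The instruction-generation step can likewise be sketched, assuming a GPT-2-style model from Hugging Face Transformers that has been fine-tuned on existing labeled VLN data; the prompt format, model choice, and decoding parameters below are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch: prompt a pretrained LM with pseudo object labels to
# generate a navigation instruction. Prompt format is a hypothetical example.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # assume fine-tuned offline

# Pseudo 3D object labels observed along the sampled path form the prompt.
prompt = "objects: sofa, lamp, coffee table. instruction:"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Conditioning the language model only on object labels, rather than raw visual features, is one way to sidestep the cross-modal gap: the prompt stays in the text domain the pretrained LM already handles well.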