Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions. Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging. Recent methods explore pretraining to improve generalization; however, the use of generic image-caption datasets or existing small-scale VLN environments is suboptimal and results in limited improvements. In this work, we introduce BnB, a large-scale and diverse in-domain VLN dataset. We first collect image-caption (IC) pairs from hundreds of thousands of listings on online rental marketplaces. Using these IC pairs, we next propose automatic strategies to generate millions of VLN path-instruction (PI) pairs. We further propose a shuffling loss that improves the learning of temporal order inside PI pairs. We use BnB to pretrain our Airbert model, which can be adapted to discriminative and generative settings, and show that it outperforms the state of the art on the Room-to-Room (R2R) navigation and Remote Referring Expression (REVERIE) benchmarks. Moreover, our in-domain pretraining significantly increases performance on a challenging few-shot VLN evaluation, where we train the model only on VLN instructions from a few houses.
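To make the shuffling-loss idea concrete, below is a minimal illustrative sketch, not the paper's actual implementation: the model is trained to score the correctly ordered path-instruction pair above variants whose path steps have been randomly permuted. The names `model`, `path_feats`, `instr_tokens`, and `num_shuffles` are hypothetical, and `model` is assumed to return a scalar alignment score per pair; the exact formulation in Airbert may differ.

```python
import torch
import torch.nn.functional as F

def shuffling_loss(model, path_feats, instr_tokens, num_shuffles=4):
    """Hedged sketch of a shuffling loss over one PI pair.

    path_feats:   (T, D) tensor of per-step path features (hypothetical layout)
    instr_tokens: instruction input in whatever form `model` expects
    The correctly ordered path is candidate 0; the rest are shuffled copies.
    """
    candidates = [path_feats]  # index 0 keeps the true temporal order
    for _ in range(num_shuffles):
        perm = torch.randperm(path_feats.size(0))
        candidates.append(path_feats[perm])  # break temporal order

    # One alignment score per candidate ordering (assumed model interface).
    scores = torch.stack([model(c, instr_tokens) for c in candidates])

    # Classify which ordering is correct: class 0 is the unshuffled path.
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(scores.unsqueeze(0), target)
```

Treating the unshuffled pair as the positive class of a small classification problem is one simple way to force the model to attend to temporal order rather than bag-of-frames similarity.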