In the vision-and-language navigation (VLN) task, an agent follows natural language instructions and navigates in visual environments. Compared with indoor navigation, which has been studied extensively, navigation in real-life outdoor environments remains a significant challenge: the visual inputs are far more complicated, and the instructions that describe the intricate urban scenes are insufficient. In this paper, we introduce a Multimodal Text Style Transfer (MTST) learning approach that mitigates the data scarcity problem in outdoor navigation tasks by effectively leveraging external multimodal resources. We first enrich the navigation data by transferring the style of the instructions generated by the Google Maps API, and then pre-train the navigator with the augmented external outdoor navigation dataset. Experimental results show that our MTST learning approach is model-agnostic and significantly outperforms the baseline models on the outdoor VLN task, improving the task completion rate by a relative 22\% on the test set and achieving new state-of-the-art performance.
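To make the two-stage procedure concrete, the following is a minimal Python sketch of the pipeline the abstract describes. Every name in it (`style_transfer`, `navigator.pretrain`, `navigator.finetune`, the tuple layout of `external_routes`) is a hypothetical placeholder for illustration, not the paper's actual code or API.

```python
# Minimal sketch of the two-stage MTST pipeline summarized above.
# All identifiers here are hypothetical placeholders, not the paper's API.

def mtst_pipeline(style_transfer, navigator, external_routes, vln_train_data):
    # Stage 1: rewrite the template-like instructions generated by the
    # Google Maps API in the style of human-written outdoor VLN
    # instructions, pairing each rewritten instruction with its panoramas.
    augmented = [
        (panoramas, style_transfer(machine_instruction))
        for panoramas, machine_instruction in external_routes
    ]
    # Stage 2: pre-train the navigator on the augmented external data,
    # then fine-tune it on the scarce in-domain outdoor VLN data.
    navigator.pretrain(augmented)
    navigator.finetune(vln_train_data)
    return navigator
```

The key design choice is that the external routes supply cheap supervision at scale, while the style transfer step closes the gap between machine-generated and human-written instruction styles before pre-training, which is why the approach is model-agnostic with respect to the navigator.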