One of the most challenging topics in Natural Language Processing (NLP) is visually-grounded language understanding and reasoning. Outdoor vision-and-language navigation (VLN) is such a task, in which an agent follows natural language instructions to navigate a real-life urban environment. Due to the lack of human-annotated instructions that illustrate intricate urban scenes, outdoor VLN remains a challenging task to solve. This paper introduces a Multimodal Text Style Transfer (MTST) learning approach that leverages external multimodal resources to mitigate data scarcity in outdoor navigation tasks. We first enrich the navigation data by transferring the style of instructions generated by the Google Maps API, then pre-train the navigator with the augmented external outdoor navigation dataset. Experimental results show that our MTST learning approach is model-agnostic and significantly outperforms the baseline models on the outdoor VLN task, relatively improving the task completion rate on the test set by 8.7%.