The Vision-and-Language Navigation (VLN) task requires understanding a textual instruction in order to navigate a natural indoor environment using only visual information. While this is a trivial task for most humans, it remains an open problem for AI models. In this work, we hypothesize that poor use of the available visual information is at the core of the low performance of current models. To support this hypothesis, we provide experimental evidence showing that state-of-the-art models are not severely affected when they receive only limited or even no visual data, indicating a strong overfitting to the textual instructions. To encourage a more suitable use of the visual information, we propose a new data augmentation method that fosters the inclusion of more explicit visual information in the generation of textual navigational instructions. Our main intuition is that current VLN datasets include textual instructions intended to inform an expert navigator, such as a human, rather than a beginner visual navigational agent, such as a randomly initialized DL model. Specifically, to bridge the visual semantic gap of current VLN datasets, we take advantage of metadata available for the Matterport3D dataset that, among other things, includes the labels of objects present in the scenes. Training a state-of-the-art model with the new set of instructions increases its performance by 8% in terms of success rate on unseen environments, demonstrating the advantages of the proposed data augmentation method.
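The augmentation idea can be pictured with a minimal sketch: instructions are enriched with object labels drawn from scene metadata along the ground-truth path. All names and the toy metadata below are illustrative stand-ins for the Matterport3D object annotations, not the actual implementation.

```python
from typing import Dict, List

# Toy metadata: viewpoint id -> object labels visible from that viewpoint
# (a hypothetical stand-in for Matterport3D per-region object annotations).
SCENE_OBJECTS: Dict[str, List[str]] = {
    "vp_001": ["sofa", "coffee table"],
    "vp_002": ["staircase", "painting"],
    "vp_003": ["bed", "nightstand"],
}

def augment_instruction(instruction: str, path: List[str]) -> str:
    """Append explicit mentions of objects seen along the path to the instruction."""
    mentions: List[str] = []
    for viewpoint in path:
        for obj in SCENE_OBJECTS.get(viewpoint, []):
            if obj not in mentions:
                mentions.append(obj)
    if not mentions:
        return instruction
    return f"{instruction} Along the way you will pass the {', the '.join(mentions)}."

if __name__ == "__main__":
    base = "Walk down the hallway and stop at the second door on the left."
    print(augment_instruction(base, ["vp_001", "vp_002"]))
```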