Vision-language navigation (VLN), in which an agent follows language instructions in a visual environment, has been studied under the premise that the input command is fully feasible in the environment. Yet in practice, a request may be infeasible due to language ambiguity or environment changes. To study VLN with unknown command feasibility, we introduce a new dataset, Mobile app Tasks with Iterative Feedback (MoTIF), where the goal is to complete a natural language command in a mobile app. Mobile apps provide a scalable domain to study real downstream uses of VLN methods. Moreover, mobile app commands provide instruction for interactive navigation, as they result in action sequences with state changes via clicking, typing, or swiping. MoTIF is the first VLN dataset to include feasibility annotations, containing both binary feasibility labels and fine-grained labels for why tasks are unsatisfiable. We further collect follow-up questions for ambiguous queries to enable research on task uncertainty resolution. Equipped with our dataset, we propose the new problem of feasibility prediction, in which a natural language instruction and a multimodal app environment are used to predict command feasibility. MoTIF also provides a more realistic app dataset, as it contains more diverse environments, higher-level goals, and longer action sequences than prior work. We evaluate interactive VLN methods using MoTIF, quantify the generalization ability of current approaches to new app environments, and measure the effect of task feasibility on navigation performance.