We introduce Mobile app Tasks with Iterative Feedback (MoTIF), a new dataset where the goal is to complete a natural language query in a mobile app. Current datasets for related tasks in interactive question answering, visual common sense reasoning, and question-answer plausibility prediction do not support research on resolving ambiguous natural language requests or on operating in diverse digital domains. As a result, they fail to capture the complexities of real question answering or interactive tasks. In contrast, MoTIF contains natural language requests that are not satisfiable, making it the first work to investigate this issue for interactive vision-language tasks. MoTIF also contains follow-up questions for ambiguous queries to enable research on task uncertainty resolution. We introduce task feasibility prediction and propose an initial model, which obtains an F1 score of 61.1. We next benchmark task automation on our dataset and find that adaptations of prior work perform poorly on our realistic language requests, obtaining an accuracy of only 20.2% when mapping commands to grounded actions. We analyze performance to gain insight for future work that may bridge the gap between current model ability and what is needed for successful use in real applications.