We present MUG, a novel interactive task for multimodal grounding where a user and an agent work collaboratively on an interface screen. Prior work modeled multimodal UI grounding as a single round of interaction: the user gives a command and the agent responds to it. Yet, in realistic scenarios, a user command can be ambiguous when the target action is inherently difficult to articulate in natural language. MUG allows multiple rounds of interaction so that, upon seeing the agent's response, the user can give further commands for the agent to refine or even correct its actions. Such interaction is critical for improving grounding performance in real-world use cases. To investigate the problem, we create a new dataset of 77,820 sequences of human user-agent interaction on mobile interfaces, of which 20% involve multiple rounds of interaction. To establish our benchmark, we experiment with a range of modeling variants and evaluation strategies, including both offline and online evaluation; the online strategy comprises both human evaluation and automatic evaluation with simulators. Our experiments show that allowing iterative interaction significantly improves the absolute task completion rate, by 18% over the entire test set and by 31% over the challenging subset. Our results lay the foundation for further investigation of the problem.
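To make the multi-round interaction protocol concrete, the following is a minimal sketch of one grounding episode, assuming hypothetical `agent`, `user`, and `screen` objects and a bounded number of rounds; none of these names or signatures come from the paper or its released code.

```python
# A minimal sketch of the multi-round grounding loop described above.
# All names (Turn, Episode, run_episode, agent, user, screen) are hypothetical
# illustrations, not the authors' actual interfaces or the MUG dataset API.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    """One round of interaction: a user command and the agent's grounded action."""
    command: str           # natural-language instruction from the user
    predicted_target: int  # index of the UI element the agent selects
    is_correct: bool       # whether the prediction matches the intended target


@dataclass
class Episode:
    """A full interaction sequence; it ends on success or after max_rounds."""
    turns: List[Turn] = field(default_factory=list)

    @property
    def completed(self) -> bool:
        return bool(self.turns) and self.turns[-1].is_correct


def run_episode(agent, user, screen, max_rounds: int = 5) -> Episode:
    """Iterate user guidance and agent grounding until success or the round limit."""
    episode = Episode()
    command = user.initial_command(screen)
    for _ in range(max_rounds):
        target = agent.ground(screen, command, history=episode.turns)
        correct = user.accept(screen, target)
        episode.turns.append(Turn(command, target, correct))
        if correct:
            break  # task completed; no further guidance needed
        # Otherwise the user issues a follow-up command to refine or correct the action.
        command = user.follow_up(screen, target, history=episode.turns)
    return episode
```

In this sketch the `user` role could be played either by a human rater or by an automatic simulator, mirroring the two online evaluation strategies mentioned above.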