Task-oriented dialogue (TOD) systems are widely used by mobile intelligent assistants to accomplish tasks such as calendar scheduling or hotel reservation. Current TOD systems usually focus on multi-turn text/speech interaction and then call back-end APIs designed for TOD to perform the task. However, this API-based architecture greatly limits the information-searching capability of intelligent assistants and may even lead to task failure if TOD-specific APIs are not available or the task is too complicated to be executed by the provided APIs. In this paper, we propose a new TOD architecture: the GUI-based task-oriented dialogue system (GUI-TOD). A GUI-TOD system can directly perform GUI operations on real apps and execute tasks without invoking TOD-specific back-end APIs. Furthermore, we release META-GUI, a dataset for training a Multi-modal convErsaTional Agent on mobile GUI. We also propose a multi-modal action prediction and response model, which shows promising results on META-GUI. The dataset, code, and leaderboard are publicly available.