Task-oriented dialogue (TOD) systems have been widely used by mobile phone intelligent assistants to accomplish tasks such as calendar scheduling or hotel booking. Current TOD systems usually focus on multi-turn text/speech interaction and rely on calling back-end APIs to search database information or execute tasks on the mobile phone. However, this architecture greatly limits the information-searching capability of intelligent assistants and may even lead to task failure if APIs are not available or the task is too complicated to be executed through the provided APIs. In this paper, we propose a new TOD architecture: the GUI-based task-oriented dialogue system (GUI-TOD). A GUI-TOD system can directly perform GUI operations on real apps and execute tasks without invoking back-end APIs. Furthermore, we release META-GUI, a dataset for training a multi-modal conversational agent on mobile GUIs. We also propose a multi-modal action prediction and response model. It shows promising results on META-GUI, but there is still room for further improvement. The dataset and models will be publicly available.