Research interest in task-oriented dialogs has increased as systems such as Google Assistant, Alexa, and Siri have become ubiquitous in everyday life. However, the impact of academic research in this area has been limited by the lack of datasets that realistically capture the wide array of user pain points. To enable research on some of the more challenging aspects of parsing realistic conversations, we introduce PRESTO, a public dataset of over 550K contextual multilingual conversations between humans and virtual assistants. PRESTO contains a diverse array of challenges that occur in real-world NLU tasks, such as disfluencies, code-switching, and revisions. It is the only large-scale, human-generated conversational parsing dataset that provides structured context, such as a user's contacts and lists, for each example. Our mT5-based baselines demonstrate that the conversational phenomena present in PRESTO are challenging to model, and that the difficulty is more pronounced in a low-resource setup.