Research interest in task-oriented dialogs has increased as systems such as Google Assistant, Alexa, and Siri have become ubiquitous in everyday life. However, the impact of academic research in this area has been limited by the lack of datasets that realistically capture the wide array of user pain points. To enable research on some of the more challenging aspects of parsing realistic conversations, we introduce PRESTO, a public dataset of over 550K contextual multilingual conversations between humans and virtual assistants. PRESTO contains a diverse array of challenges that occur in real-world NLU tasks, such as disfluencies, code-switching, and revisions. It is the only large-scale, human-generated conversational parsing dataset that provides structured context, such as a user's contacts and lists, for each example. Our mT5-based baselines demonstrate that the conversational phenomena present in PRESTO are challenging to model, and that the difficulty is even more pronounced in a low-resource setup.