Smooth and effective communication requires the ability to perform latent or explicit commonsense inference. Prior commonsense reasoning benchmarks (such as SocialIQA and CommonsenseQA) mainly focus on the discriminative task of choosing the right answer from a set of candidates, and do not involve interactive language generation as in dialogue. Moreover, existing dialogue datasets do not explicitly focus on exhibiting commonsense as a facet. In this paper, we present an empirical study of commonsense in dialogue response generation. We first auto-extract commonsensical dialogues from existing dialogue datasets by leveraging ConceptNet, a commonsense knowledge graph. Furthermore, building on social contexts/situations in SocialIQA, we collect a new dialogue dataset with 25K dialogues aimed at exhibiting social commonsense in an interactive setting. We evaluate response generation models trained on these datasets and find that models trained on both the extracted and our collected data produce responses that consistently exhibit more commonsense than baselines. Finally, we propose an approach for automatic evaluation of commonsense that relies on features derived from ConceptNet and pre-trained language and dialogue models, and show that it correlates reasonably with human evaluation of responses' commonsense quality. We are releasing a subset of our collected data, Commonsense-Dialogues, containing about 11K dialogues.
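To make the extraction step concrete, the following is a minimal Python sketch of ConceptNet-based dialogue filtering: a dialogue is flagged as commonsensical when a concept in one turn is directly connected in ConceptNet to a concept in the preceding turn. The dump path and file format, the stopword list, the helper names, and the one-hop criterion are illustrative assumptions, not the paper's exact procedure.

```python
from itertools import product

# Tiny illustrative stopword list; a real filter would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "to", "and", "you", "i", "it", "so"}


def load_concept_pairs(path):
    """Hypothetical loader: reads a tab-separated dump of English
    ConceptNet triples (head, relation, tail) and keeps the
    (head, tail) concept pairs."""
    pairs = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            head, _relation, tail = line.rstrip("\n").split("\t")[:3]
            pairs.add((head, tail))
    return pairs


def concepts(turn):
    """Crude concept extraction: lowercased content words of a turn."""
    return {w for w in turn.lower().split() if w.isalpha() and w not in STOPWORDS}


def is_commonsensical(dialogue, pairs):
    """Flag a dialogue when some concept in a turn is one ConceptNet
    edge away (in either direction) from a concept in the previous turn."""
    for prev, curr in zip(dialogue, dialogue[1:]):
        for a, b in product(concepts(prev), concepts(curr)):
            if (a, b) in pairs or (b, a) in pairs:
                return True
    return False


# Example: if ConceptNet links "tired" and "sleep", this exchange qualifies.
dialogue = ["I am tired today.", "You should get some sleep early."]
```

A one-hop adjacency check like this is a simplification; a fuller pipeline could restrict the relation types considered or allow multi-hop paths between context and response concepts.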