Rational humans can generate sentences that cover a certain set of concepts while describing natural and common scenes. For example, given {apple(noun), tree(noun), pick(verb)}, humans can easily come up with scenes like "a boy is picking an apple from a tree" via their generative commonsense reasoning ability. However, we find that this capacity has not been well learned by machines. Most prior work in machine commonsense focuses on discriminative reasoning tasks in a multiple-choice question answering setting. Herein, we present CommonGen: a challenging dataset for testing generative commonsense reasoning with a constrained text generation task. We collect 37k concept-sets as inputs and 90k human-written sentences as associated outputs. Additionally, we provide high-quality rationales from the human annotators that explain the reasoning process behind the development and test sets. We demonstrate the difficulty of the task by examining a wide range of sequence generation methods with both automatic metrics and human evaluation. The state-of-the-art pre-trained generation model, UniLM, still falls far short of human performance on this task. Our data and code are publicly available at http://inklab.usc.edu/CommonGen/ .