Most benchmark datasets targeting commonsense reasoning focus on everyday scenarios: physical knowledge like knowing that you could fill a cup under a waterfall [Talmor et al., 2019], social knowledge like knowing that bumping into someone is awkward [Sap et al., 2019], and other generic situations. However, there is a rich space of commonsense inferences anchored to knowledge about specific entities: for example, deciding the truthfulness of the claim "Harry Potter can teach classes on how to fly on a broomstick." Can models learn to combine entity knowledge with commonsense reasoning in this fashion? We introduce CREAK, a testbed for commonsense reasoning about entity knowledge, bridging fact-checking about entities (Harry Potter is a wizard and is skilled at riding a broomstick) with commonsense inferences (if you are good at a skill, you can teach others how to do it). Our dataset consists of 13k human-authored English claims about entities that are either true or false, in addition to a small contrast set. Crowdworkers can easily come up with these statements, and human performance on the dataset is high (high 90s); we argue that models should be able to blend entity knowledge and commonsense reasoning to do well here. In our experiments, we focus on the closed-book setting and observe that a baseline model fine-tuned on an existing fact verification benchmark struggles on CREAK. Training a model on CREAK improves accuracy by a substantial margin, but it still falls short of human performance. Our benchmark provides a unique probe into natural language understanding models, testing both their ability to retrieve facts (e.g., who teaches at the University of Chicago?) and their ability to apply unstated commonsense knowledge (e.g., butlers do not yell at guests).