Entities and events are crucial to natural language reasoning and common in procedural texts. Existing work has focused exclusively either on entity state tracking (e.g., whether a pan is hot) or on event reasoning (e.g., whether one would burn themselves by touching the pan), even though these two tasks are often causally related. We propose CREPE, the first benchmark on causal reasoning of event plausibility and entity states. We show that most language models, including GPT-3, perform close to chance at .35 F1, lagging far behind humans at .87 F1. We boost model performance to .59 F1 by creatively representing events in a programming-language format and prompting language models pretrained on code. By injecting the causal relations between entities and events as intermediate reasoning steps in our representation, we further boost the performance to .67 F1. Our findings indicate not only the challenge that CREPE poses for language models, but also the efficacy of code-like prompting combined with chain-of-thought prompting for multihop event reasoning.
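To make the idea of code-like prompting concrete, the following is a minimal sketch of how a procedure, its entity states, and a target event might be rendered as Python-like text for a code-pretrained model. The function names (`to_identifier`, `build_code_prompt`), the `event_plausible` variable, and the overall layout are assumptions made for illustration; the abstract does not specify the actual prompt format.

```python
import re


def to_identifier(text):
    """Turn free text into a snake_case Python identifier."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return "_".join(words) or "unnamed"


def build_code_prompt(goal, steps, event, entity_states=None):
    """Render a procedure as Python-like text (hypothetical layout).

    entity_states maps a step index to (entity, state) pairs. These
    assignments stand in for the intermediate reasoning steps that link
    entity states to the event; they are an assumption of this sketch.
    """
    entity_states = entity_states or {}
    lines = [f"# Goal: {goal}", f"# Event to judge: {event}", ""]
    for index, step in enumerate(steps):
        lines.append(f"def step{index}_{to_identifier(step)}():")
        states = entity_states.get(index, [])
        for entity, state in states:
            lines.append(f"    {to_identifier(entity)} = {state!r}")
        if not states:
            # Keep the rendered function body syntactically valid.
            lines.append("    pass")
        lines.append("")
    # Left open on purpose: the language model completes this line.
    lines.append("event_plausible =")
    return "\n".join(lines)


example_prompt = build_code_prompt(
    goal="heat a pan",
    steps=["Put the pan on the stove", "Turn on the burner"],
    event="touching the pan burns your hand",
    entity_states={1: [("pan", "hot")]},
)
print(example_prompt)
```

The sketch only builds the prompt string; sending it to a model and parsing the completion are left out. Writing entity states as assignments inside each step is one possible way to expose the causal link between the pan being hot and the burn event.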