Natural language inference (NLI) is a fundamental NLP task that investigates the entailment relationship between two texts. Popular NLI datasets present the task at the sentence level. While adequate for testing semantic representations, they fall short of testing contextual reasoning over long texts, which is a natural part of the human inference process. We introduce ConTRoL, a new dataset for ConTextual Reasoning over Long texts. Consisting of 8,325 expert-designed "context-hypothesis" pairs with gold labels, ConTRoL is a passage-level NLI dataset with a focus on complex contextual reasoning types such as logical reasoning. It is derived from competitive selection and recruitment tests (verbal reasoning tests) used in police recruitment, and is of expert-level quality. Compared with previous NLI benchmarks, the materials in ConTRoL are considerably more challenging, involving a range of reasoning types. Empirical results show that state-of-the-art language models perform far worse than educated humans. Our dataset can also serve as a test set for downstream tasks such as checking the factual correctness of summaries.