As large language models (LLMs) grow larger and more sophisticated, assessing their "reasoning" capabilities in natural language grows more challenging. Recent question answering (QA) benchmarks that attempt to assess reasoning are often limited by a narrow scope of covered situations and subject matters. We introduce WikiWhy, a QA dataset built around a novel auxiliary task: explaining why an answer is true in natural language. WikiWhy contains over 9,000 "why" question-answer-rationale triples, grounded on Wikipedia facts across a diverse set of topics. Each rationale is a set of supporting statements connecting the question to the answer. WikiWhy serves as a benchmark for the reasoning capabilities of LLMs because it demands rigorous explicit rationales for each answer to demonstrate the acquisition of implicit commonsense knowledge, which is unlikely to be easily memorized. GPT-3 baselines achieve only 38.7% human-evaluated correctness in the end-to-end answer & explain condition, leaving significant room for future improvements.
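To make the question-answer-rationale triple structure concrete, here is a minimal sketch of what a single WikiWhy entry might look like. The field names and the example content are illustrative assumptions for exposition, not the dataset's actual schema.

    # Hypothetical illustration of one WikiWhy triple; field names are assumptions.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class WikiWhyEntry:
        question: str                   # a "why" question grounded in a Wikipedia fact
        answer: str                     # the answer to the "why" question
        rationale: List[str] = field(default_factory=list)  # supporting statements
                                        # connecting the question to the answer

    example = WikiWhyEntry(
        question="Why do deciduous trees shed their leaves in autumn?",
        answer="Shedding leaves reduces water loss during winter.",
        rationale=[
            "In winter, ground water is often frozen and unavailable to roots.",
            "Leaves continuously lose water through transpiration.",
            "Dropping leaves conserves water the tree cannot replace.",
        ],
    )

Under this reading, the benchmark's two conditions correspond to generating the rationale given the question and answer, or generating both the answer and the rationale end-to-end from the question alone.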