Multi-step reasoning ability is fundamental to many natural language tasks, yet it is unclear what constitutes a good reasoning chain and how to evaluate one. Most existing methods focus solely on whether the reasoning chain leads to the correct conclusion, but this answer-oriented view may conflate the quality of the reasoning with other spurious shortcuts that predict the answer. To bridge this gap, we evaluate reasoning chains by viewing them as informal proofs that derive the final answer. Specifically, we propose ReCEval (Reasoning Chain Evaluation), a framework that evaluates reasoning chains via two key properties: (1) correctness, i.e., each step makes a valid inference based on the information contained within the step, the preceding steps, and the input context, and (2) informativeness, i.e., each step provides new information that is helpful towards deriving the generated answer. We implement ReCEval using natural language inference models and information-theoretic measures. On multiple datasets, ReCEval is highly effective at identifying different types of errors and yields notable improvements over prior methods. We demonstrate that our informativeness metric captures the expected flow of information in high-quality reasoning chains, and we also analyze the impact of preceding steps on evaluating correctness and informativeness. Finally, we show that scoring reasoning chains with ReCEval improves downstream performance on reasoning tasks. Our code is publicly available at: https://github.com/archiki/ReCEval
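To make the two properties concrete, the following is a minimal sketch of how one might approximate ReCEval-style scores with off-the-shelf models: correctness as the NLI entailment probability of each step given the input context and preceding steps, and informativeness as the gain in the answer's log-probability under a language model when the step is added. The specific model choices (roberta-large-mnli, gpt2), the log-probability gain as a stand-in for the paper's information-theoretic measure, and the min-over-steps aggregation are all illustrative assumptions, not the authors' exact implementation; see the linked repository for that.

```python
# Minimal ReCEval-style chain scoring sketch. Assumptions: an off-the-shelf
# NLI model (roberta-large-mnli) for step correctness and a small causal LM
# (gpt2) whose answer log-probability gain serves as an informativeness proxy.
import torch
from transformers import (AutoModelForSequenceClassification,
                          AutoModelForCausalLM, AutoTokenizer)

nli_tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")
lm_tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def entailment_prob(premise: str, hypothesis: str) -> float:
    """P(entailment) of the hypothesis given the premise, from the NLI model."""
    inputs = nli_tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli(**inputs).logits
    # roberta-large-mnli label order: contradiction, neutral, entailment
    return torch.softmax(logits, dim=-1)[0, 2].item()

def answer_logprob(context: str, answer: str) -> float:
    """Total log-probability of the answer tokens conditioned on the context."""
    ctx_ids = lm_tok(context, return_tensors="pt").input_ids
    ans_ids = lm_tok(" " + answer, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, ans_ids], dim=1)
    with torch.no_grad():
        logits = lm(ids).logits
    # Position i predicts token i+1; sum log-probs over the answer span.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    ans_positions = range(ctx_ids.shape[1] - 1, ids.shape[1] - 1)
    return sum(logprobs[i, ids[0, i + 1]].item() for i in ans_positions)

def receval_style_scores(context: str, steps: list[str], answer: str) -> dict:
    """Per-step correctness (entailment given context + preceding steps) and
    informativeness (answer log-prob gain from adding the step), aggregated
    by taking the minimum over steps (an illustrative choice)."""
    correctness, informativeness = [], []
    for i, step in enumerate(steps):
        premise = " ".join([context] + steps[:i])
        correctness.append(entailment_prob(premise, step))
        gain = (answer_logprob(" ".join([context] + steps[:i + 1]), answer)
                - answer_logprob(premise, answer))
        informativeness.append(gain)
    return {"correctness": min(correctness),
            "informativeness": min(informativeness)}
```

Under this sketch, a chain is only as good as its weakest step: a single unsupported inference drags the correctness score down, and a redundant step that adds no probability mass to the answer yields near-zero informativeness gain.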