Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers. These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness (independent of the final answer) is difficult without reliable methods for automatic evaluation. We simply do not know how often the stated reasoning steps actually support the final end-task predictions. In this work, we present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics. To evaluate ROSCOE against baseline metrics, we design a typology of reasoning errors and collect synthetic and human evaluation scores on commonly used reasoning datasets. In contrast with existing metrics, ROSCOE can measure semantic consistency, logicality, informativeness, fluency, and factuality, among other traits, by leveraging properties of step-by-step rationales. We empirically verify the strength of our metrics on five human-annotated and six programmatically perturbed diagnostic datasets, covering a diverse set of tasks that require reasoning skills, and show that ROSCOE can consistently outperform baseline metrics.
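To make the idea of an unsupervised, embedding-based step-level score concrete, below is a minimal illustrative sketch of one such metric: the average, over reasoning steps, of each step's best cosine similarity to a source-context sentence. This is not the exact ROSCOE formulation; the embedding model name, the `step_alignment_score` helper, and the aggregation choice are assumptions introduced purely for illustration.

```python
# Illustrative sketch of an unsupervised step-to-source alignment score.
# NOT the exact ROSCOE metric; model choice and aggregation are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

def step_alignment_score(source_sentences, reasoning_steps,
                         model_name="all-MiniLM-L6-v2"):
    """Mean over reasoning steps of each step's best cosine similarity
    to any source sentence (higher = steps better grounded in the source)."""
    model = SentenceTransformer(model_name)
    src = model.encode(source_sentences, normalize_embeddings=True)
    steps = model.encode(reasoning_steps, normalize_embeddings=True)
    # With normalized embeddings, cosine similarity is a dot product.
    sims = steps @ src.T                      # (num_steps, num_source_sentences)
    return float(np.mean(sims.max(axis=1)))   # aggregate per-step best matches

if __name__ == "__main__":
    context = ["Tom has 3 apples.", "He buys 2 more apples."]
    chain = ["Tom starts with 3 apples.",
             "He buys 2 more, so he has 3 + 2 = 5 apples."]
    print(f"alignment: {step_alignment_score(context, chain):.3f}")
```

A score of this kind requires no reference rationale or supervision, which is what makes it applicable to arbitrary step-by-step generations; the full suite in the paper combines several such perspectives (consistency, informativeness, fluency, factuality) rather than a single alignment number.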