Language models (LMs) have recently shown remarkable performance on reasoning tasks by explicitly generating intermediate inferences, e.g., chain-of-thought prompting. However, these intermediate inference steps may be inappropriate deductions from the initial context and lead to incorrect final predictions. Here we introduce REFINER, a framework for finetuning LMs to explicitly generate intermediate reasoning steps while interacting with a critic model that provides automated feedback on the reasoning. Specifically, the critic provides structured feedback that the reasoning LM uses to iteratively improve its intermediate arguments. Empirical evaluations of REFINER on three diverse reasoning tasks show significant improvements over baseline LMs of comparable scale. Furthermore, when using GPT-3.5 as the reasoner, the trained critic significantly improves reasoning without finetuning the reasoner. Finally, our critic model is trained without expensive human-in-the-loop data and can be swapped for a human critic at inference time.
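The abstract describes an iterative generator–critic interaction: the reasoner proposes intermediate steps, the critic returns structured feedback, and the reasoner revises. Below is a minimal sketch of such a loop; the `reasoner`/`critic` interfaces, the `no_error_token` stopping signal, and the stub models are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of a REFINER-style refine loop with hypothetical model interfaces.
from typing import Callable, Optional


def refine_loop(
    context: str,
    reasoner: Callable[[str, Optional[str]], str],  # (context, feedback) -> intermediate steps
    critic: Callable[[str, str], str],              # (context, steps) -> structured feedback
    max_iterations: int = 3,
    no_error_token: str = "No errors.",
) -> str:
    """Generate intermediate reasoning steps, then iteratively revise them
    using the critic's structured feedback until the critic finds no error."""
    feedback: Optional[str] = None
    steps = reasoner(context, feedback)
    for _ in range(max_iterations):
        feedback = critic(context, steps)
        if feedback.strip() == no_error_token:
            break  # critic reports no error: stop refining
        steps = reasoner(context, feedback)  # revise conditioned on feedback
    return steps


# Toy usage with stub functions standing in for the fine-tuned LMs.
if __name__ == "__main__":
    def toy_reasoner(context: str, feedback: Optional[str]) -> str:
        return "x = 4 + 3 = 7" if feedback else "x = 4 + 3 = 8"

    def toy_critic(context: str, steps: str) -> str:
        return "No errors." if "7" in steps else "The sum in step 1 is incorrect."

    print(refine_loop("Add 4 and 3.", toy_reasoner, toy_critic))
```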