Does prompting a large language model like GPT-3 with explanations improve in-context learning? We focus specifically on two NLP tasks that involve reasoning over text, namely question answering and natural language inference. Including explanations in the prompt and having the model generate them does not consistently improve performance in the settings we study, contrary to recent results on symbolic reasoning tasks (Nye et al., 2021; Wei et al., 2022). Despite careful prompting, explanations generated by GPT-3 may not even be factually grounded in the input, even on simple tasks with straightforward extractive explanations. However, these flawed explanations can still be useful as a way to verify GPT-3's predictions post-hoc. Through analysis in three settings, we show that explanations judged as good by humans (those that are logically consistent with the input and the prediction) usually indicate more accurate predictions. Following these observations, we present a framework for calibrating model predictions based on the reliability of their explanations. The framework trains calibrators on automatically extracted scores that approximately assess the reliability of explanations, improving performance across three different datasets.
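To make the calibration idea concrete, below is a minimal, hypothetical sketch. The `reliability_score` function is an illustrative lexical-overlap heuristic standing in for the automatically extracted reliability scores, and a scikit-learn `LogisticRegression` serves as the calibrator that combines the model's confidence with the explanation's reliability to estimate how likely a prediction is to be correct. The function names and feature choices are assumptions for illustration, not the exact method described here.

```python
# Hypothetical sketch of explanation-based calibration (not the paper's exact implementation).
import numpy as np
from sklearn.linear_model import LogisticRegression


def reliability_score(explanation: str, context: str) -> float:
    """Fraction of explanation tokens that also appear in the input context
    (a crude proxy for whether the explanation is factually grounded)."""
    exp_tokens = set(explanation.lower().split())
    ctx_tokens = set(context.lower().split())
    if not exp_tokens:
        return 0.0
    return len(exp_tokens & ctx_tokens) / len(exp_tokens)


def train_calibrator(confidences, explanations, contexts, is_correct):
    """Fit a logistic-regression calibrator on [model confidence, explanation reliability] features."""
    features = np.array(
        [[c, reliability_score(e, ctx)]
         for c, e, ctx in zip(confidences, explanations, contexts)]
    )
    calibrator = LogisticRegression()
    calibrator.fit(features, is_correct)  # is_correct: 1 if the prediction was right, else 0
    return calibrator


def calibrated_confidence(calibrator, confidence, explanation, context) -> float:
    """Return the calibrated probability that the prediction is correct."""
    feats = np.array([[confidence, reliability_score(explanation, context)]])
    return float(calibrator.predict_proba(feats)[0, 1])
```

In this sketch, predictions whose generated explanations are poorly grounded in the input receive lower calibrated confidence, which is the intuition behind using explanation quality as a post-hoc verification signal.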