Does prompting a large language model (LLM) like GPT-3 with explanations improve in-context learning? We study this question on two NLP tasks that involve reasoning over text, namely question answering and natural language inference. We test the performance of four LLMs on three textual reasoning datasets using prompts that include explanations in multiple different styles. For these tasks, we find that including explanations in the prompts for OPT, GPT-3 (davinci), and InstructGPT (text-davinci-001) only yields small to moderate accuracy improvements over standard few-shot learning. However, text-davinci-002 is able to benefit more substantially. We further show that explanations generated by the LLMs may not entail the models' predictions or be factually grounded in the input, even on simple tasks with extractive explanations. However, these flawed explanations can still be useful as a way to verify LLMs' predictions post-hoc. Through analysis in our three settings, we show that explanations judged by humans to be good--logically consistent with the input and the prediction--are more likely to co-occur with accurate predictions. Following these observations, we train calibrators using automatically extracted scores that assess the reliability of explanations, allowing us to improve performance post-hoc across all of our datasets.
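To make the setup concrete, the sketch below illustrates the two ingredients the abstract describes: an explain-then-predict few-shot prompt, and a post-hoc calibrator trained on automatically extracted scores of explanation reliability. This is a minimal illustration, not the paper's exact pipeline; the prompt text, the lexical-overlap scoring heuristic, and the feature set are assumptions chosen for clarity.

```python
# Minimal sketch (assumed, illustrative): few-shot prompting with explanations,
# plus a calibrator that scores explanation reliability to verify predictions post-hoc.
from sklearn.linear_model import LogisticRegression

# Hypothetical prompt template: each demonstration pairs an answer with an explanation.
FEW_SHOT_PROMPT = """\
Passage: The match was postponed because of heavy rain.
Question: Why was the match postponed?
Explanation: The passage states the match was postponed because of heavy rain.
Answer: heavy rain

Passage: {passage}
Question: {question}
Explanation:"""


def explanation_scores(explanation: str, passage: str) -> list:
    """Cheap proxies for whether an explanation is grounded in the input:
    token overlap with the passage and explanation length (illustrative only)."""
    exp_tokens = set(explanation.lower().split())
    ctx_tokens = set(passage.lower().split())
    overlap = len(exp_tokens & ctx_tokens) / max(len(exp_tokens), 1)
    return [overlap, float(len(exp_tokens))]


def train_calibrator(examples):
    """examples: iterable of (explanation, passage, was_prediction_correct).
    Fits a logistic-regression calibrator on the explanation-quality scores."""
    X = [explanation_scores(expl, passage) for expl, passage, _ in examples]
    y = [int(correct) for _, _, correct in examples]
    return LogisticRegression().fit(X, y)


# Usage: calibrator.predict_proba([explanation_scores(expl, passage)])[0, 1]
# gives a confidence estimate used to accept, rerank, or abstain on the LLM's answer.
```

In practice, the paper's calibrator uses scores derived from how well the explanation matches the input; the overlap feature above stands in for such scores only as a placeholder.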