Large language models can perform new tasks by adapting to a few in-context examples. For humans, rapid learning from examples can benefit from explanations that connect those examples to task principles. We therefore investigate whether explanations of few-shot examples can allow language models to adapt more effectively. We annotate 40 challenging tasks from BIG-Bench with explanations of answers to a small subset of questions, as well as a variety of matched control explanations. We evaluate how zero-shot and few-shot prompts containing different types of explanations, instructions, and controls affect the performance of a range of large language models. We analyze these results using statistical multilevel modeling techniques that account for the nested dependencies among conditions, tasks, prompts, and models. We find that explanations of examples can improve performance. Adding untuned explanations to a few-shot prompt yields a modest improvement: about one-third the effect size of adding few-shot examples, but twice the effect size of task instructions. We then show that explanations tuned for performance on a small validation set offer substantially larger benefits; building a prompt by selecting examples and explanations together substantially improves performance over selecting examples alone. Thus, hand-tuning explanations can substantially improve performance on challenging tasks. Furthermore, even untuned explanations outperform carefully matched controls, suggesting that the benefits stem from the link between an example and its explanation, rather than from lower-level features of the language used. However, only large models benefit from explanations. In summary, explanations can support the in-context learning abilities of large language models on challenging tasks.
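To make the prompt structure concrete, the following is a minimal sketch of how few-shot examples might be paired with explanations in a prompt. The task, questions, answers, and explanation texts here are hypothetical illustrations, not the paper's actual BIG-Bench annotations, and the exact formatting of the paper's prompts may differ.

```python
# Hypothetical few-shot examples with answer explanations.
few_shot_examples = [
    {
        "question": "Is the following sentence plausible? 'The fish climbed the tree.'",
        "answer": "No",
        "explanation": "Fish cannot climb; they lack limbs suited to trees.",
    },
    {
        "question": "Is the following sentence plausible? 'The cat slept on the sofa.'",
        "answer": "Yes",
        "explanation": "Cats commonly sleep on furniture, so this is plausible.",
    },
]

def build_prompt(examples, target_question, with_explanations=True):
    """Concatenate examples (optionally followed by their explanations)
    ahead of the target question, as in a standard few-shot prompt."""
    parts = []
    for ex in examples:
        block = f"Q: {ex['question']}\nA: {ex['answer']}"
        if with_explanations:
            # The explanation follows the answer, linking it to the task principle.
            block += f"\nExplanation: {ex['explanation']}"
        parts.append(block)
    parts.append(f"Q: {target_question}\nA:")
    return "\n\n".join(parts)

print(build_prompt(few_shot_examples,
                   "Is the following sentence plausible? 'The dog read a book.'"))
```

Toggling `with_explanations` off recovers a standard few-shot prompt, which is the kind of matched comparison the evaluation relies on.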