Large language models are able to perform a task by conditioning on a few input-output demonstrations, a paradigm known as in-context learning. We show that language models can explicitly infer an underlying task from a few demonstrations by prompting them to generate a natural language instruction that fits the examples. To explore this ability, we introduce the instruction induction challenge, compile a dataset consisting of 24 tasks, and define a novel evaluation metric based on executing the generated instruction. We discover that, to a large extent, the ability to generate instructions does indeed emerge when using a model that is both large enough and aligned to follow instructions; InstructGPT achieves 65.7% of human performance under our execution-based metric, while the original GPT-3 model reaches only 9.8% of human performance. This surprising result suggests that instruction induction might be a viable learning paradigm in and of itself, where instead of fitting a set of latent continuous parameters to the data, one searches for the best description in the natural language hypothesis space.
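The two-step pipeline described above (induce an instruction from demonstrations, then score it by execution on held-out examples) can be sketched as follows. This is a minimal illustration, not the paper's exact prompt template or scoring code; the prompt wording, the example task, and the `predict` stand-in are all assumptions made for the sketch.

```python
def build_induction_prompt(demonstrations):
    """Format input-output demonstrations into a prompt asking the model
    to verbalize the underlying task as a natural language instruction.
    The framing sentence below is an illustrative paraphrase."""
    lines = ["I gave a friend an instruction. Based on the examples below, "
             "what was the instruction?"]
    for inp, out in demonstrations:
        lines.append(f"Input: {inp}\nOutput: {out}")
    lines.append("The instruction was:")
    return "\n\n".join(lines)

def execution_accuracy(predict, test_pairs):
    """Execution-based metric: apply the induced instruction (here a
    callable `predict` standing in for a model that follows it) to
    held-out inputs, and score exact-match accuracy."""
    correct = sum(predict(inp) == out for inp, out in test_pairs)
    return correct / len(test_pairs)

# A toy uppercasing task: two demonstrations induce the instruction,
# and held-out pairs evaluate it.
demos = [("cat", "CAT"), ("dog", "DOG")]
prompt = build_induction_prompt(demos)
# Simulate a model that correctly follows the induced instruction
# ("write the word in uppercase") with str.upper:
acc = execution_accuracy(str.upper, [("bird", "BIRD"), ("fish", "FISH")])
```

In the actual setup, `predict` would be a second language model conditioned on the generated instruction, so the metric rewards instructions that are faithful enough to reproduce the task's outputs.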