This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning -- finetuning language models on a collection of datasets described via instructions -- substantially improves zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP datasets verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 datasets that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
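To make the verbalization step concrete, the sketch below shows how a single dataset example might be rendered under several natural language instruction templates, so that one instance yields multiple instruction-phrased training prompts. The templates, the `verbalize` helper, and the example are hypothetical illustrations of the idea, not the paper's actual templates or data pipeline.

```python
# Minimal sketch of instruction-template verbalization (illustrative only).
# The templates and example below are hypothetical, not from the paper.

# Hypothetical instruction templates for a natural language inference task.
NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? OPTIONS: yes, no",
    "{premise}\nBased on the paragraph above, can we conclude that "
    '"{hypothesis}"? OPTIONS: yes, no',
]

def verbalize(example: dict) -> list[str]:
    """Render one dataset example under every template, producing several
    instruction-phrased prompts for finetuning."""
    return [t.format(**example) for t in NLI_TEMPLATES]

example = {"premise": "A dog is running in the park.",
           "hypothesis": "An animal is outdoors."}
for prompt in verbalize(example):
    print(prompt, end="\n\n")
```

In this style of setup, mixing many such verbalized datasets into one finetuning corpus is what exposes the model to instructions as a format, which is what the abstract credits for the zero-shot gains on unseen task types.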