Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought (CoT) data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B, instruction-finetuned on 1.8K tasks, outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
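To illustrate what "publicly released checkpoints" means in practice, here is a minimal zero-shot inference sketch. It assumes the Hugging Face `transformers` library and the `google/flan-t5-base` checkpoint identifier, neither of which is specified in the abstract itself; because the model is instruction-finetuned, a plain natural-language instruction suffices as the prompt, with no few-shot exemplars.

```python
# Minimal zero-shot usage sketch for a released Flan-T5 checkpoint.
# Assumes the Hugging Face `transformers` library; "google/flan-t5-base"
# is the public checkpoint name (larger variants: -large, -xl, -xxl).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# An instruction-finetuned model takes the task description directly.
prompt = "Answer the following question. What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```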