Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero- and few-shot generalization to unseen tasks. However, there is a limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training using specialized datasets for reasoning and dialogue, and finally, the fine-tuning objectives themselves. In this paper, we characterize the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes. To this end, we create OPT-IML Bench: a large benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks, and prepare an evaluation framework to measure three types of model generalization: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks. Through the lens of this framework, we first present insights about instruction-tuning decisions as applied to OPT-30B, and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT. OPT-IML demonstrates all three generalization abilities at both scales on four different evaluation benchmarks with diverse tasks and input formats: PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG. Not only does it significantly outperform OPT on all benchmarks, but it is also highly competitive with existing models fine-tuned on each specific benchmark. We release OPT-IML at both scales, together with the OPT-IML Bench evaluation framework.
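To make the three evaluation settings concrete, below is a minimal Python sketch of how a task collection could be partitioned along these lines. The task names, categories, and split ratios are illustrative placeholders only, not the actual OPT-IML Bench tasks or splits.

```python
import random
from collections import defaultdict

# Hypothetical (task_name, category) records standing in for the ~2000
# benchmark tasks; these names are made up for illustration.
tasks = [
    ("sst2_sentiment", "sentiment"),
    ("imdb_sentiment", "sentiment"),
    ("squad_qa", "extractive_qa"),
    ("triviaqa_qa", "extractive_qa"),
    ("cb_nli", "nli"),
    ("rte_nli", "nli"),
]

random.seed(0)
by_category = defaultdict(list)
for name, cat in tasks:
    by_category[cat].append(name)

categories = sorted(by_category)
random.shuffle(categories)

# 1) Fully held-out categories: every task in these categories is unseen
#    during instruction-tuning (ratio of 1/3 is an arbitrary choice here).
held_out_categories = set(categories[: len(categories) // 3])

train_tasks, heldout_task_split, heldout_category_split = [], [], []
for cat in categories:
    names = by_category[cat]
    if cat in held_out_categories:
        heldout_category_split.extend(names)
        continue
    # 2) Held-out tasks from seen categories: keep one task per seen
    #    category out of training.
    heldout_task_split.append(names[0])
    train_tasks.extend(names[1:])

# 3) Held-out instances from seen tasks: split each training task's
#    examples, e.g. 90% for tuning and 10% for instance-level evaluation.
def split_instances(examples, eval_frac=0.1):
    cut = int(len(examples) * (1 - eval_frac))
    return examples[:cut], examples[cut:]

train_ex, eval_ex = split_instances(list(range(20)))
print("train tasks:", train_tasks)
print("held-out tasks (seen categories):", heldout_task_split)
print("held-out categories:", sorted(heldout_category_split))
print("instance split sizes:", len(train_ex), len(eval_ex))
```

The three splits increase in difficulty from held-out instances (same task distribution) to held-out tasks (same category, new task) to held-out categories (entirely new skill), which is what allows the framework to separate memorization of task formats from genuine cross-task generalization.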