Language models trained on massive prompted multitask datasets like T0 (Sanh et al., 2021) or FLAN (Wei et al., 2021a) can generalize to tasks unseen during training. We show that training on a carefully chosen subset of instances can outperform training on all available data on a variety of datasets. We assume access to a small number (250--1000) of unlabeled target task instances, select their nearest neighbors from a pool of multitask data, and use the retrieved data to train target task-specific models. Our method is more data-efficient than training a single multitask model, while still outperforming it by large margins. We evaluate across a diverse set of tasks not in the multitask pool we retrieve from, including those used to evaluate T0 as well as additional complex tasks such as legal and scientific document QA. We retrieve small subsets of P3 (the collection of prompted datasets from which T0's training data was sampled) and finetune T5 models that outperform the 3-billion parameter variant of T0 (T0-3B) by 3--30% on 12 out of 14 evaluation datasets while using at most 2% of the data used to train T0-3B. These models also provide a better initialization than T0-3B for few-shot finetuning on target-task data, as shown by a 2--23% relative improvement over few-shot finetuned T0-3B models on 8 datasets. Our code is available at https://github.com/allenai/data-efficient-finetuning.
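To make the retrieval step concrete, the following is a minimal sketch of cross-task nearest-neighbor selection, not the paper's exact pipeline: it assumes an off-the-shelf sentence encoder and a FAISS inner-product index stand in for whatever representation the method actually uses, and names such as `retrieve_training_subset`, `target_texts`, and `pool_texts` are illustrative.

```python
# Sketch: retrieve a small training subset from a multitask pool (e.g., P3)
# using nearest neighbors of a handful of unlabeled target-task instances.
# The encoder choice and k are assumptions, not the paper's exact settings.
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in encoder (assumption)

def retrieve_training_subset(target_texts, pool_texts, k=500):
    """Return indices of pool instances nearest to the unlabeled target instances."""
    # Embed and L2-normalize so inner product equals cosine similarity.
    q = encoder.encode(target_texts, normalize_embeddings=True)
    p = encoder.encode(pool_texts, normalize_embeddings=True)
    index = faiss.IndexFlatIP(p.shape[1])
    index.add(np.asarray(p, dtype="float32"))
    _, nbrs = index.search(np.asarray(q, dtype="float32"), k)
    # The union of neighbors across all target instances forms the finetuning subset.
    return sorted(set(nbrs.ravel().tolist()))

# Usage (hypothetical): a few hundred unlabeled target instances select a small
# slice of the pool, on which a T5 model is then finetuned.
# subset_ids = retrieve_training_subset(unlabeled_target, p3_pool, k=500)
# train_data = [p3_pool[i] for i in subset_ids]
```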