Recently, Language Models (LMs) instruction-tuned on multiple tasks, a setup also known as multitask-prompted fine-tuning (MT), have shown the capability to generalize to unseen tasks. Previous work has shown that scaling the number of training tasks is the key component in making stronger MT LMs. In this work, we report an unexpected finding: an expert LM fine-tuned on just a single task can outperform an MT LM trained on 300+ different tasks, by mean accuracies of 3.20% on 11 different unseen datasets and 1.29% on 13 datasets of the BIG-bench benchmark, respectively. This finding casts doubt on the previously held belief that simply scaling the number of tasks makes stronger MT LMs. Leveraging this finding, we further show that this distributed approach of training a separate expert LM per training task, instead of a single MT LM for zero-shot inference, has many benefits, including (1) avoiding the negative task transfer that often occurs during instruction tuning, (2) being able to continually learn new tasks without re-training on previous tasks to avoid catastrophic forgetting, and (3) showing compositional capabilities when merging individual experts together. The code is available at https://github.com/joeljang/ELM.
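One way to picture the compositional merging of individual experts mentioned above is simple element-wise parameter averaging. The toy sketch below uses plain dicts of floats in place of real parameter tensors and is an illustrative assumption, not necessarily the exact merging procedure used in the paper.

```python
# Toy sketch: merging expert LMs by averaging their parameters.
# Each "expert" is represented as a dict mapping parameter names to
# float values, standing in for the parameter tensors of a real model.

def merge_experts(experts):
    """Element-wise average of the parameters of several experts."""
    merged = {}
    for name in experts[0]:
        merged[name] = sum(e[name] for e in experts) / len(experts)
    return merged

# Two hypothetical experts, each fine-tuned on a single task.
expert_a = {"w": 1.0, "b": 0.0}
expert_b = {"w": 3.0, "b": 2.0}

merged = merge_experts([expert_a, expert_b])
print(merged)  # {'w': 2.0, 'b': 1.0}
```

In practice the same averaging would be applied to every tensor in the experts' state dicts; weighted variants (e.g. weighting experts by relevance to the target task) are a natural extension.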