Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models. Oftentimes fine-tuned models are readily available but their training data is not, due to data privacy or intellectual property concerns. This creates a barrier to fusing knowledge across individual models to yield a better single model. In this paper, we study the problem of merging individual models built on different training data sets to obtain a single model that both performs well across all data set domains and can generalize to out-of-domain data. We propose a dataless knowledge fusion method that merges models in their parameter space, guided by weights that minimize prediction differences between the merged model and the individual models. Over a battery of evaluation settings, we show that the proposed method significantly outperforms baselines such as Fisher-weighted averaging or model ensembling. Further, we find that our method is a promising alternative to multi-task learning, preserving or sometimes improving over the individual models without access to the training data. Finally, model merging is more efficient than training a multi-task model, making it applicable to a wider set of scenarios.
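As a rough illustration of the kind of parameter-space merge the abstract describes, the sketch below merges the weight matrices of a linear layer so that the merged weights minimize the squared prediction difference to each individual model on that model's own inputs. This is a minimal sketch under assumed conventions, not the paper's exact algorithm: the function name `merge_linear_weights`, the use of per-model Gram matrices `G_i = X_i^T X_i` as the merging statistics, and the toy shapes are all illustrative choices.

```python
# Minimal sketch: merge per-model linear-layer weights W_i (d_in x d_out) so that
# the merged W minimizes sum_i ||X_i W - X_i W_i||_F^2, i.e. prediction differences
# to each individual model on its own inputs. The closed-form minimizer is
#   W* = (sum_i G_i)^{-1} sum_i G_i W_i,   with G_i = X_i^T X_i.
import torch

def merge_linear_weights(weights, grams, eps=1e-6):
    """weights: list of (d_in, d_out) tensors; grams: list of (d_in, d_in) Gram matrices."""
    d_in = weights[0].shape[0]
    gram_sum = torch.zeros(d_in, d_in)
    weighted_sum = torch.zeros_like(weights[0])
    for W, G in zip(weights, grams):
        gram_sum += G
        weighted_sum += G @ W
    # Small ridge term keeps the linear system well conditioned.
    gram_sum += eps * torch.eye(d_in)
    return torch.linalg.solve(gram_sum, weighted_sum)

# Toy usage with hypothetical shapes: two models, 8-dim inputs, 4-dim outputs.
torch.manual_seed(0)
W1, W2 = torch.randn(8, 4), torch.randn(8, 4)
X1, X2 = torch.randn(32, 8), torch.randn(32, 8)   # each model's own (unshared) inputs
merged = merge_linear_weights([W1, W2], [X1.T @ X1, X2.T @ X2])
print(merged.shape)  # torch.Size([8, 4])
```

Note that the merge itself touches only weights and the per-model summary statistics, not raw training examples, which is what makes this style of fusion "dataless" in spirit; setting every `G_i` to the identity recovers plain parameter averaging.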