Averaging the parameters of models that have the same architecture and initialization can provide a means of combining their respective capabilities. In this paper, we take the perspective that this "merging" operation can be seen as choosing parameters that approximately maximize the joint likelihood of the posteriors of the models' parameters. Computing a simple average of the models' parameters therefore corresponds to making an isotropic Gaussian approximation to their posteriors. We develop an alternative merging procedure based on the Laplace approximation where we approximate each model's posterior as a Gaussian distribution whose precision matrix corresponds to its Fisher information. We first show that our "Fisher merging" technique provides a performance boost in settings where simple parameter averaging is currently used -- specifically, robust fine-tuning and model ensembling. Then, we compare merging to standard gradient-based transfer learning and demonstrate that merging enables a fundamentally different method for transferring capabilities across models. Specifically, we show that Fisher merging is competitive with gradient-based transfer learning approaches (while being significantly cheaper) in intermediate-task training and domain-adaptive pre-training. We also show that our merging procedure makes it possible to combine models in previously unexplored ways. We release our code to facilitate future research into methods for merging models.
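To make the merging rule concrete, here is a minimal PyTorch sketch of Fisher-weighted averaging, assuming diagonal Fisher approximations; the function names, the `eps` stabilizer, and the data-loading interface are illustrative assumptions, not the paper's released implementation:

```python
import torch
import torch.nn.functional as F

def estimate_diagonal_fisher(model, data_loader, n_batches=100):
    """Estimate a diagonal Fisher: the expected squared gradient of the
    model's log-likelihood, with labels sampled from the model's own
    predictive distribution (a sketch, not the released code)."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for i, (x, _) in enumerate(data_loader):
        if i >= n_batches:
            break
        model.zero_grad()
        log_probs = torch.log_softmax(model(x), dim=-1)
        # Sample labels from the model's predictive distribution rather
        # than using the dataset labels (true vs. empirical Fisher).
        y = torch.distributions.Categorical(logits=log_probs).sample()
        F.nll_loss(log_probs, y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / n_batches for n, f in fisher.items()}

def fisher_merge(params_list, fishers_list, coeffs=None, eps=1e-8):
    """Fisher-weighted parameter averaging: each model's posterior is
    approximated as a Gaussian whose precision is its diagonal Fisher, so
    the merged parameters are the elementwise weighted average
        theta* = sum_i(lam_i * F_i * theta_i) / sum_i(lam_i * F_i).
    `params_list` and `fishers_list` are per-model dicts of parameter
    tensors sharing the same keys and shapes."""
    if coeffs is None:
        coeffs = [1.0 / len(params_list)] * len(params_list)
    merged = {}
    for name in params_list[0]:
        num = sum(lam * f[name] * p[name]
                  for lam, p, f in zip(coeffs, params_list, fishers_list))
        den = sum(lam * f[name] for lam, f in zip(coeffs, fishers_list))
        merged[name] = num / (den + eps)
    return merged
```

Setting every Fisher entry to 1 makes `fisher_merge` reduce to simple parameter averaging, matching the isotropic-Gaussian view of merging described above.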