Model merging has emerged as a lightweight alternative to joint multi-task learning (MTL), yet the generalization properties of merged models remain largely unexplored. Establishing such theoretical guarantees is non-trivial, as the merging process typically precludes access to the original training data and combines fine-tuned models trained on fundamentally heterogeneous data distributions. Without a principled understanding of these dynamics, current methods often rely on heuristics to approximate the optimal combination of parameters. This reliance is most acute in coefficient scaling, the weighting factors that modulate the magnitude of each fine-tuned model's contribution to the shared parameters. Lacking a principled objective to guide their selection, these methods yield brittle performance that is highly sensitive to the scaling initialization. We address this gap by (i) establishing a novel flatness-aware PAC-Bayes generalization bound for the model merging setting. This analysis introduces a "cross-task heterogeneity" term that formally captures the mismatch between the diverse fine-tuned model priors and the target multi-task distributions. Guided by this theoretical insight, (ii) we frame model merging as multi-teacher knowledge distillation on scarce, unlabeled data, and we formally show that minimizing the student-teacher Kullback-Leibler (KL) divergence directly tightens the upper bound on the merged model's excess risk. To exploit the flatness term in the derived bound, (iii) we operationalize this objective via SAMerging, a method that employs Sharpness-Aware Minimization (SAM) to find flat minima. Empirically, SAMerging establishes a new state of the art across vision and NLP benchmarks. The code is available at https://github.com/arshandalili/SAMerging.
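To make the objective in (ii) and (iii) concrete, below is a minimal, self-contained PyTorch sketch: a merged model is formed from a pretrained base plus coefficient-scaled task vectors, distilled against the fine-tuned teachers via a KL loss on unlabeled inputs, and the scaling coefficients are updated with a two-step SAM rule. Everything here is illustrative and based only on the abstract, not the authors' implementation: the toy linear models stand in for real backbones, the choice to optimize only the merging coefficients follows the abstract's emphasis on coefficient scaling, and all names (`kd_loss`, `sam_step`, `rho`, the 0.3 initialization) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
D, C, NUM_TASKS = 16, 4, 3

base = nn.Linear(D, C)                                  # stand-in pretrained model
teachers = [nn.Linear(D, C) for _ in range(NUM_TASKS)]  # stand-in fine-tuned models

base_state = {k: v.detach() for k, v in base.state_dict().items()}
# Task vector i: tau_i = theta_i - theta_0.
task_vectors = [{k: t.state_dict()[k].detach() - base_state[k] for k in base_state}
                for t in teachers]

def kd_loss(coeffs, x, T=2.0):
    """KL(teacher || student), averaged over teachers, for the merged model
    theta_0 + sum_i lambda_i * tau_i evaluated on an unlabeled batch x."""
    merged = {k: base_state[k] + sum(c * tv[k] for c, tv in zip(coeffs, task_vectors))
              for k in base_state}
    student = torch.func.functional_call(base, merged, (x,))
    loss = 0.0
    for t in teachers:
        with torch.no_grad():
            t_logits = t(x)
        loss = loss + T * T * F.kl_div(F.log_softmax(student / T, dim=-1),
                                       F.softmax(t_logits / T, dim=-1),
                                       reduction="batchmean")
    return loss / len(teachers)

def sam_step(coeffs, loss_fn, lr=5e-2, rho=0.05):
    """Two-step SAM update on the scaling coefficients: perturb toward the
    worst case within an L2 ball of radius rho, then descend using the
    gradient evaluated at the perturbed point."""
    grad, = torch.autograd.grad(loss_fn(coeffs), coeffs)
    eps = rho * grad / (grad.norm() + 1e-12)   # ascent to the sharpest direction
    grad_adv, = torch.autograd.grad(loss_fn(coeffs + eps), coeffs)
    with torch.no_grad():
        coeffs -= lr * grad_adv                # sharpness-aware descent
    return coeffs

coeffs = torch.full((NUM_TASKS,), 0.3, requires_grad=True)  # hypothetical init
for step in range(100):
    x = torch.randn(32, D)                     # stand-in for scarce unlabeled data
    sam_step(coeffs, lambda c: kd_loss(c, x))
print(coeffs.detach())
```

The two gradient evaluations per step are what make the update sharpness-aware: the descent direction is taken at the adversarially perturbed coefficients rather than the current ones, biasing the search toward flat minima of the distillation loss, which is precisely what the flatness term in the bound rewards.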