Existing expert merging strategies for Sparse Mixture of Experts (SMoE) typically rely on input-dependent or input-independent averaging of expert parameters, but often lack a principled weighting mechanism. In this work, we reinterpret expert merging through the lens of game theory, revealing cooperative and competitive dynamics among experts. Based on this perspective, we introduce Nash Merging of Experts (NAMEx), a novel framework that incorporates Nash Bargaining into the merging process, enabling more balanced and efficient collaboration among experts. Additionally, we equip NAMEx with complex momentum to accelerate expert propagation and provide theoretical convergence guarantees. Extensive experiments across language modelling, text classification, image classification, and zero-shot robustness under data corruption show that NAMEx consistently outperforms competing methods while integrating seamlessly with popular MoE architectures. Finally, we demonstrate NAMEx's scalability by applying it to large-scale systems, including Qwen1.5-MoE (14B) and DeepSeek-MoE (16B), where it proves effective in both zero-shot and fine-tuning settings.
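To make the two ingredients concrete, the following is a minimal toy sketch, not NAMEx itself: it assumes each expert comes with a scalar utility (e.g., a router-derived score, which is our assumption here) and merges parameters using the closed-form weights of a weighted Nash bargaining problem over the probability simplex; the complex-momentum step likewise uses an illustrative complex decay factor `beta`. NAMEx's actual utilities, bargaining objective, and momentum schedule may differ.

```python
import numpy as np

def nash_merge(expert_params, utilities, eps=1e-8):
    """Merge expert parameter tensors with Nash-bargaining-style weights.

    Maximizing the weighted Nash product  prod_i w_i ** u_i  subject to
    sum_i w_i = 1, w_i >= 0 has the closed form  w_i = u_i / sum_j u_j.
    `utilities` is a hypothetical per-expert utility signal (not from the paper).
    """
    u = np.asarray(utilities, dtype=np.float64)
    w = u / (u.sum() + eps)                       # bargaining weights on the simplex
    return sum(wi * p for wi, p in zip(w, expert_params))

def complex_momentum_step(param, grad, state, lr=1e-2,
                          beta=0.9 * np.exp(1j * np.pi / 8)):
    """One update with complex-valued momentum; only its real part is applied."""
    state = beta * state + grad                   # complex momentum buffer
    param = param - lr * np.real(state)           # real parameters stay real
    return param, state

# Toy usage: merge three 2x2 "expert" weight matrices, then take one momentum step.
experts = [np.eye(2) * k for k in (1.0, 2.0, 3.0)]
merged = nash_merge(experts, utilities=[0.2, 0.3, 0.5])   # = 2.3 * identity
momentum = np.zeros_like(merged, dtype=np.complex128)
merged, momentum = complex_momentum_step(merged, grad=0.1 * merged, state=momentum)
```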