Mixture-of-Experts (MoE) models achieve a favorable trade-off between performance and inference efficiency by activating only a subset of experts. However, the memory overhead of storing all experts remains a major limitation, especially in large-scale MoE models such as DeepSeek-R1 (671B). In this study, we investigate domain specialization and expert redundancy in large-scale MoE models and uncover a consistent behavior we term few-shot expert localization: with only a few demonstrations, the model consistently activates a sparse and stable subset of experts. Building on this observation, we propose a simple yet effective pruning framework, EASY-EP, that leverages a few domain-specific demonstrations to identify and retain only the most relevant experts. EASY-EP comprises two key components: output-aware expert importance assessment and expert-level token contribution estimation. The former evaluates the importance of each expert for the current token by considering both the gating scores and the output magnitudes of the activated experts, while the latter assesses the contribution of each token based on the similarity of its representations before and after the routed experts. Experiments show that, with only half the experts, our method achieves performance comparable to the full DeepSeek-R1 and $2.99\times$ the throughput under the same memory budget. Our code is available at https://github.com/RUCAIBox/EASYEP.
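Since the abstract only outlines the two scoring components, the following minimal sketch shows one plausible way to combine them when selecting experts from a few demonstrations. It assumes per-token access to gating scores, routed-expert outputs, and the layer's hidden states before and after the routed experts; the tensor names and the choice of cosine similarity as the before/after representation measure are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch (not the authors' released code) of few-shot expert selection.
# Assumed inputs per MoE layer, collected while running the demonstrations:
#   gate_scores:    (num_tokens, num_experts) routing weights (zero for inactive experts)
#   expert_outputs: (num_tokens, num_experts, hidden) per-expert outputs (zero for inactive experts)
#   h_pre, h_post:  (num_tokens, hidden) hidden states before / after the routed experts
import torch

def select_experts(gate_scores, expert_outputs, h_pre, h_post, keep_ratio=0.5):
    # Output-aware expert importance: gating score times output magnitude,
    # so experts that are both routed to and produce large outputs score higher.
    importance = gate_scores * expert_outputs.norm(dim=-1)          # (tokens, experts)

    # Expert-level token contribution: weight each token by how its representation
    # relates before and after the routed experts (cosine similarity assumed here).
    token_weight = torch.cosine_similarity(h_pre, h_post, dim=-1)   # (tokens,)

    # Aggregate token-weighted importance over the demonstration tokens.
    expert_score = (importance * token_weight.unsqueeze(-1)).sum(dim=0)

    # Retain the top-scoring fraction of experts (e.g., half of them).
    num_keep = int(keep_ratio * gate_scores.shape[-1])
    return expert_score.topk(num_keep).indices
```

In use, such scores would be accumulated per layer over the domain-specific demonstrations, after which only the retained experts need to be kept in memory at inference time.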