Sigma-MoE-Tiny Technical Report

Qingguo Hu, Zhenghao Lin, Ziyue Yang, Yucheng Ding, Xiao Liu, Yuting Jiang, Ruizhe Wang, Tianyu Chen, Zhongxin Guo, Yifan Xiong, Rui Gao, Lei Qu, Jinsong Su, Peng Cheng, Yeyun Gong

Mixture-of-Experts (MoE) has emerged as a promising paradigm for foundation models due to its efficient and powerful scalability. In this work, we present Sigma-MoE-Tiny, an MoE language model that achieves the highest sparsity among existing open-source models. Sigma-MoE-Tiny employs fine-grained expert segmentation with up to 96 experts per layer while activating only one expert for each token, resulting in 20B total parameters with just 0.5B activated. The major challenge introduced by such extreme sparsity lies in expert load balancing. We find that the widely used load balancing loss tends to become ineffective in the lower layers under this setting. To address this issue, we propose a progressive sparsification schedule that aims to balance expert utilization and training stability. Sigma-MoE-Tiny is pre-trained on a diverse and high-quality corpus, followed by post-training to further unlock its capabilities. The entire training process remains remarkably stable, with no irrecoverable loss spikes. Comprehensive evaluations reveal that, despite activating only 0.5B parameters, Sigma-MoE-Tiny achieves top-tier performance among counterparts of comparable or significantly larger scale. In addition, we provide an in-depth discussion of load balancing in highly sparse MoE models, offering insights for advancing sparsity in future MoE architectures.

Project page: https://qghuxmu.github.io/Sigma-MoE-Tiny
Code: https://github.com/microsoft/ltp-megatron-lm
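To make the routing setup in the abstract concrete, the following is a minimal sketch of top-1 expert routing together with a load balancing loss. The abstract does not give the loss formula, so the form used here (number of experts times the sum over experts of token fraction times mean gate probability, as in Switch Transformer-style auxiliary losses) is an assumption, and the function names `softmax`, `route_top1`, and `load_balancing_loss` are hypothetical. The progressive sparsification schedule is not sketched because the abstract does not describe it.

```python
import math

NUM_EXPERTS = 96            # experts per layer (upper bound stated in the abstract)
TOTAL_PARAMS = 20e9         # 20B total parameters (from the abstract)
ACTIVATED_PARAMS = 0.5e9    # 0.5B activated parameters (from the abstract)


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(router_logits):
    """Pick the single highest-probability expert for each token.

    router_logits: list of per-token lists, one logit per expert.
    Returns (choices, probs): chosen expert index and gate probabilities per token.
    """
    choices, probs = [], []
    for logits in router_logits:
        p = softmax(logits)
        choices.append(max(range(len(p)), key=p.__getitem__))
        probs.append(p)
    return choices, probs


def load_balancing_loss(choices, probs, num_experts):
    """Assumed auxiliary loss: num_experts * sum_i f_i * P_i.

    f_i is the fraction of tokens routed to expert i and P_i is the mean
    gate probability assigned to expert i. This form is an assumption; the
    abstract only refers to "the widely used load balancing loss".
    """
    num_tokens = len(choices)
    frac = [0.0] * num_experts
    mean_prob = [0.0] * num_experts
    for c, p in zip(choices, probs):
        frac[c] += 1.0 / num_tokens
        for i in range(num_experts):
            mean_prob[i] += p[i] / num_tokens
    return num_experts * sum(f * m for f, m in zip(frac, mean_prob))


# Share of parameters active per token, using the figures in the abstract.
activation_ratio = ACTIVATED_PARAMS / TOTAL_PARAMS
```

Under this assumed form, routing that spreads tokens evenly with uniform gate probabilities gives a loss of 1, while routing that collapses onto a single expert with full confidence gives a loss equal to the number of experts. With the figures from the abstract, `activation_ratio` works out to 2.5% of the parameters being active per token.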
