Blocking communication is a major hurdle to running mixture-of-experts (MoE) models efficiently in distributed settings. To address this, we present FarSkip-Collective, which modifies the architecture of modern models so that their computation can overlap with communication. Our approach modifies the skip connections in the model, and it is unclear a priori whether the modified architecture can remain as capable, especially for large state-of-the-art models and when all of the model layers are modified. We answer this question in the affirmative and fully convert a series of state-of-the-art models ranging from 16B to 109B parameters to enable overlapping of their communication while achieving accuracy on par with their original open-source releases. For example, we convert Llama 4 Scout (109B) via self-distillation and achieve accuracy within 1% of its instruction-tuned release, averaged across a wide range of downstream evaluations. In addition to demonstrating the retained accuracy of the large modified models, we realize the benefits of FarSkip-Collective through optimized implementations that explicitly overlap communication with computation, accelerating both training and inference in existing frameworks.
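To make the core idea concrete, below is a minimal PyTorch sketch of the overlap pattern described above; it is not the authors' implementation. An asynchronous collective over the expert output is launched while the next sub-block computes directly from the residual stream, which the modified skip connection is assumed to permit. The function and module names (`overlapped_block`, `attn`, `moe`, `next_attn`) and the choice of `all_reduce` as the collective are illustrative assumptions.

```python
# Illustrative sketch only: launch a collective asynchronously and keep
# computing on the residual stream while the communication is in flight.
# `attn`, `moe`, `next_attn`, and `group` are hypothetical placeholders.
import torch
import torch.distributed as dist

def overlapped_block(x, attn, moe, next_attn, group):
    h = x + attn(x)                 # attention sub-block with residual
    expert_out = moe(h)             # local expert computation

    # Launch the collective over the expert output without blocking.
    work = dist.all_reduce(expert_out, group=group, async_op=True)

    # Because the modified skip connection lets the next sub-block read the
    # residual stream `h` directly, its computation can proceed while the
    # collective is still running.
    next_h = next_attn(h)

    work.wait()                     # synchronize before using expert_out
    return h + expert_out, next_h
```

In a standard block, the next layer would have to wait for `expert_out` before doing any work; relaxing that dependency through the skip connection is what exposes the overlap window.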