RailS: Load Balancing for All-to-All Communication in Distributed Mixture-of-Experts Training

Training Mixture-of-Experts (MoE) models introduces sparse and highly imbalanced all-to-all communication that dominates iteration time. Conventional load-balancing methods fail to exploit the deterministic topology of Rail architectures, leaving multi-NIC bandwidth underutilized. We present RailS, a distributed load-balancing framework that minimizes all-to-all completion time in MoE training. RailS leverages the Rail topology's symmetry to prove that uniform sending ensures uniform receiving, transforming global coordination into local scheduling. Each node independently executes a Longest Processing Time First (LPT) spraying scheduler to proactively balance traffic using local information. RailS activates N parallel rails for fine-grained, topology-aware multipath transmission. Across synthetic and real-world MoE workloads, RailS improves bus bandwidth by 20%--78% and reduces completion time by 17%--78%. For Mixtral workloads, it shortens iteration time by 18%--40% and achieves near-optimal load balance, fully exploiting architectural parallelism in distributed training.
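
The abstract names Longest Processing Time First (LPT) as the per-node scheduling rule but does not spell out its details. The following is a minimal sketch of the classic LPT greedy heuristic, not the authors' implementation: it assumes each outgoing message is a flow with a known byte size, that all rails have identical bandwidth, and that a flow is placed whole on one rail. The names `lpt_spray`, `flow_sizes`, and `num_rails` are hypothetical and chosen only for illustration.

```python
import heapq


def lpt_spray(flow_sizes, num_rails):
    """Greedy LPT: place the largest remaining flow on the least-loaded rail.

    flow_sizes: list of non-negative flow sizes (e.g. bytes), an assumed input.
    num_rails:  number of parallel rails (assumed identical in bandwidth).
    Returns (assignment, loads), where assignment[r] lists the flow indices
    sent on rail r and loads[r] is the total size placed on rail r.
    """
    if num_rails <= 0:
        raise ValueError("num_rails must be positive")

    # Min-heap of (current load, rail index): the root is the least-loaded rail.
    heap = [(0, rail) for rail in range(num_rails)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_rails)]

    # Visit flows from largest to smallest (the "longest processing time first" order).
    order = sorted(range(len(flow_sizes)), key=lambda i: flow_sizes[i], reverse=True)
    for i in order:
        load, rail = heapq.heappop(heap)
        assignment[rail].append(i)
        heapq.heappush(heap, (load + flow_sizes[i], rail))

    # Read the final per-rail loads back out of the heap.
    loads = [0] * num_rails
    for load, rail in heap:
        loads[rail] = load
    return assignment, loads


if __name__ == "__main__":
    assignment, loads = lpt_spray([7, 5, 4, 3, 1], num_rails=2)
    print(assignment, loads)
```

Because the rule only needs the sizes of a node's own outgoing flows, it can run on each node without global coordination, which matches the local-scheduling property the abstract describes. LPT is a heuristic, so the resulting loads are not guaranteed to be optimal in general.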
