Collective communications are an indispensable part of distributed training. Running a topology-aware collective algorithm is crucial for optimizing communication performance by minimizing congestion. Today, such algorithms exist only for a small set of simple topologies, which limits the topologies that can be employed in training clusters and prevents handling of the irregular topologies that arise from network failures. In this paper, we propose TACOS, an automated topology-aware collective synthesizer for arbitrary input network topologies. TACOS synthesized All-Reduce algorithms up to 3.73x faster than the baselines, and synthesized a collective algorithm for a 512-NPU system in just 6.1 minutes.
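To make the terminology concrete, below is a minimal sketch of ring All-Reduce, the classic topology-aware collective for a ring of N NPUs. This is illustrative background only, not the TACOS synthesizer; the function and variable names are our own. Each NPU's buffer is split into N chunks that rotate around the ring, so every link carries traffic at every step and no link is congested.

```python
# Illustrative sketch (not TACOS): ring All-Reduce simulated in plain
# Python. Real implementations overlap these transfers across NPUs;
# here each "send" is modeled as a slice update on a shared list.

def ring_all_reduce(buffers):
    """In-place All-Reduce over `buffers`, a list of equal-length lists
    (one per NPU). Runs reduce-scatter, then all-gather: 2*(n-1) steps
    total, with one chunk crossing each ring link per step."""
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "for simplicity, buffer length must divide evenly"
    w = size // n  # chunk width

    def move(dst, src, c, accumulate):
        # Model NPU `src` sending chunk `c` to its ring neighbor `dst`.
        lo, hi = c * w, (c + 1) * w
        for k in range(lo, hi):
            buffers[dst][k] = (buffers[dst][k] + buffers[src][k]
                               if accumulate else buffers[src][k])

    # Phase 1: reduce-scatter. After n-1 steps, NPU i holds the fully
    # reduced values of chunk (i + 1) % n.
    for s in range(n - 1):
        for i in range(n):
            move((i + 1) % n, i, (i - s) % n, accumulate=True)

    # Phase 2: all-gather. Each fully reduced chunk circulates until
    # every NPU holds the complete result.
    for s in range(n - 1):
        for i in range(n):
            move((i + 1) % n, i, (i + 1 - s) % n, accumulate=False)

if __name__ == "__main__":
    data = [[float(i + 1)] * 8 for i in range(4)]  # 4 NPUs, 8 elements each
    ring_all_reduce(data)
    assert all(buf == [10.0] * 8 for buf in data)  # 1+2+3+4 = 10 everywhere
    print("All-Reduce result on NPU 0:", data[0])
```

The chunk schedule above is what makes the algorithm topology-aware for a ring specifically; on a mesh, torus, or an irregular topology left by a link failure, a different schedule is needed, which is the synthesis problem TACOS automates.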