Collective communication algorithms are an important component of distributed computation. In deep learning in particular, collective communication is the Amdahl's-law bottleneck of data-parallel training. This paper introduces SCCL (Synthesized Collective Communication Library), a systematic approach to synthesizing collective communication algorithms that are explicitly tailored to a particular hardware topology. SCCL synthesizes algorithms along the Pareto frontier spanning from latency-optimal to bandwidth-optimal implementations of a collective. We show how to encode SCCL's synthesis problem as a quantifier-free SMT formula that can be discharged to a theorem prover, and how to scale the synthesis by exploiting symmetries in topologies and collectives. We synthesize novel latency-optimal and bandwidth-optimal algorithms, not previously seen in the literature, for two popular hardware topologies. We also show how SCCL efficiently lowers algorithms to implementations on two hardware architectures (NVIDIA and AMD), and we demonstrate performance competitive with hand-optimized collective communication libraries.
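To make the notion of a Pareto frontier between latency-optimal and bandwidth-optimal algorithms concrete, the following is a minimal sketch, not SCCL's actual procedure. It assumes each candidate algorithm is summarized by two costs: a step count (a latency proxy) and a normalized bandwidth cost. The `Candidate` type, the function names, and all numbers are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical summary of one candidate algorithm (not an SCCL type)."""
    name: str
    steps: int             # latency proxy: number of communication steps
    bandwidth_cost: float  # bandwidth proxy: normalized data sent per link


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both costs and strictly better on one."""
    no_worse = a.steps <= b.steps and a.bandwidth_cost <= b.bandwidth_cost
    better = a.steps < b.steps or a.bandwidth_cost < b.bandwidth_cost
    return no_worse and better


def pareto_frontier(cands: List[Candidate]) -> List[Candidate]:
    """Keep candidates that no other candidate dominates, ordered by steps."""
    frontier = [c for c in cands if not any(dominates(o, c) for o in cands)]
    return sorted(frontier, key=lambda c: c.steps)


# Illustrative numbers only; they are not results from the paper.
candidates = [
    Candidate("A", 2, 2.0),    # fewest steps: the latency-optimal end
    Candidate("B", 3, 1.5),
    Candidate("C", 3, 1.75),   # dominated by B (same steps, more bandwidth)
    Candidate("D", 7, 0.875),  # lowest bandwidth cost: the bandwidth-optimal end
    Candidate("E", 8, 1.0),    # dominated by D (more steps, more bandwidth)
]

frontier = pareto_frontier(candidates)
print([c.name for c in frontier])
```

Under these assumptions, the frontier keeps only the candidates that trade steps for bandwidth without being beaten on both costs at once. Which point on the frontier is preferable depends on the input size: small inputs favor fewer steps, large inputs favor lower bandwidth cost.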