Machine learning models are increasingly being trained across multiple GPUs and multiple machines. In this setting, data is transferred between GPUs using communication collectives such as AlltoAll and AllReduce, which can become a significant bottleneck in large models. It is important to use efficient algorithms for collective communication. We introduce TACCL, a tool that allows algorithm designers to guide a synthesizer into automatically generating algorithms for a given hardware configuration and communication collective. TACCL uses the novel communication sketch abstraction to obtain crucial information from the designer that is used to significantly reduce the state space and guide the synthesizer towards better algorithms. TACCL also uses a novel encoding of the problem that allows it to scale beyond single-node topologies. We use TACCL to synthesize algorithms for three collectives and two hardware topologies: DGX-2 and NDv2. We demonstrate that the algorithms synthesized by TACCL outperform the NVIDIA Collective Communication Library (NCCL) by up to 6.7$\times$. We also show that TACCL can speed up end-to-end training of Transformer-XL and BERT models by 11%--2.3$\times$ for different batch sizes.
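For context, the sketch below (not part of TACCL) illustrates how the two collectives named in the abstract, AllReduce and AlltoAll, are typically invoked during multi-GPU training through the NCCL backend of `torch.distributed`; these are the calls whose underlying algorithms TACCL synthesizes and NCCL provides by default. Tensor names and sizes are illustrative assumptions.

```python
# Minimal, illustrative sketch (not TACCL's API): invoking AllReduce and
# AlltoAll via torch.distributed with the NCCL backend, one process per GPU.
# Assumes launch with torchrun, which sets RANK and WORLD_SIZE.
import os
import torch
import torch.distributed as dist

def main():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    torch.cuda.set_device(rank % torch.cuda.device_count())
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    # AllReduce: sum a gradient buffer across all GPUs (the dominant collective
    # in data-parallel training of models such as BERT and Transformer-XL).
    grad = torch.randn(1024, device="cuda")
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    # AlltoAll: each GPU exchanges equally sized shards with every other GPU
    # (used, for example, by expert-parallel or sharded-embedding layers).
    send = torch.randn(world_size * 256, device="cuda")
    recv = torch.empty_like(send)
    dist.all_to_all_single(recv, send)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

At scale, the latency and bandwidth cost of these calls depends on the algorithm executing them (ring, tree, or a topology-specific schedule), which is exactly the choice TACCL's synthesizer optimizes for a given hardware configuration.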