Machine learning models are increasingly being trained across multiple GPUs and servers. In this setting, data is transferred between GPUs using communication collectives such as AllToAll and AllReduce, which can become a significant bottleneck in training large models. Thus, it is important to use efficient algorithms for collective communication. We develop TACCL, a tool that enables algorithm designers to guide a synthesizer into automatically generating algorithms for a given hardware configuration and communication collective. TACCL uses a novel communication sketch abstraction to get crucial information from the designer, significantly reducing the search space and guiding the synthesizer towards better algorithms. TACCL also uses a novel encoding of the problem that allows it to scale beyond single-node topologies. We use TACCL to synthesize algorithms for three collectives and two hardware topologies: DGX-2 and NDv2. We demonstrate that the algorithms synthesized by TACCL outperform the NVIDIA Collective Communications Library (NCCL) by up to 6.7x. We also show that TACCL can speed up end-to-end training of Transformer-XL and BERT models by 11% to 2.3x for different batch sizes.
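To make the notion of a communication collective concrete, the following is a minimal sketch (not from the paper) of issuing an AllReduce over NCCL through PyTorch's torch.distributed API. The launch method (torchrun), buffer shape, and reduction op are illustrative assumptions; TACCL itself synthesizes the algorithm that a backend like NCCL would run underneath such a call.

```python
# Illustrative sketch: an AllReduce collective over NCCL via torch.distributed.
# Assumes the script is launched with torchrun, which sets RANK, WORLD_SIZE,
# and LOCAL_RANK in the environment.
import os
import torch
import torch.distributed as dist

def main():
    # Join the process group using the NCCL backend (one process per GPU).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each GPU contributes a gradient-like buffer; AllReduce sums it across
    # all ranks so every rank ends up with the identical reduced result.
    grad = torch.full((1024,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```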