Machine learning models are increasingly being trained across multiple GPUs and servers. In this setting, data is transferred between GPUs using communication collectives such as AllToAll and AllReduce, which can become a significant bottleneck in training large models. Thus, it is important to use efficient algorithms for collective communication. We develop TACCL, a tool that enables algorithm designers to guide a synthesizer into automatically generating algorithms for a given hardware configuration and communication collective. TACCL uses a novel communication sketch abstraction to get crucial information from the designer, significantly reducing the search space and guiding the synthesizer towards better algorithms. TACCL also uses a novel encoding of the problem that allows it to scale beyond single-node topologies. We use TACCL to synthesize algorithms for three collectives and two hardware topologies: DGX-2 and NDv2. We demonstrate that the algorithms synthesized by TACCL outperform the NVIDIA Collective Communications Library (NCCL) by up to 6.7x. We also show that TACCL can speed up end-to-end training of Transformer-XL and BERT models by 11% to 2.3x for different batch sizes.
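To make the notion of a communication collective concrete, the following is a minimal sketch (not from the paper) of issuing an AllReduce over NCCL through PyTorch's torch.distributed API. The launch method (torchrun), buffer shape, and reduction op are illustrative assumptions; TACCL itself synthesizes the algorithm that a backend like NCCL would run underneath such a call.

```python
# Illustrative sketch: an AllReduce collective over NCCL via torch.distributed.
# Assumes the script is launched with torchrun, which sets RANK, WORLD_SIZE,
# and LOCAL_RANK in the environment.
import os
import torch
import torch.distributed as dist

def main():
    # Join the process group using the NCCL backend (one process per GPU).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each GPU contributes a gradient-like buffer; AllReduce sums it across
    # all ranks so every rank ends up with the identical reduced result.
    grad = torch.full((1024,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```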