Distributed data-parallel training is widely used for natural language processing (NLP) neural network models. However, the embedding tables in NLP models, which hold a large portion of the parameters and introduce highly sparse communication, make it challenging to scale distributed training efficiently. Current distributed training frameworks mainly target dense models and neglect the sparsity of NLP models, resulting in significant communication overhead and relatively poor scalability. In this paper, we propose EmbRace, an efficient communication framework designed to accelerate the sparse communication of distributed NLP model training. EmbRace introduces Sparsity-aware Hybrid Communication, which combines AlltoAll and AllReduce to optimize the communication of sparse and dense data in NLP models. EmbRace further introduces a 2D Communication Scheduling approach that thoroughly overlaps communication with computation by optimizing the model computation procedure, relaxing the dependencies on embeddings, and scheduling communication with a priority queue. We implement EmbRace on top of PyTorch and Horovod, and conduct comprehensive evaluations with four representative NLP models on two high-performance GPU clusters. Experimental results show that EmbRace achieves up to a 30.66x speedup on 16-GPU clusters over four popular distributed training baselines.
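To make the hybrid communication idea concrete, the following is a minimal sketch (not the authors' implementation) of how dense gradients could be synchronized with AllReduce while embedding gradients are exchanged row-wise with AlltoAll, so that each worker only reduces the shard of the embedding table it owns. The function name `hybrid_sync`, the row-based partitioning, and the equal vocabulary split are illustrative assumptions.

```python
# Sketch of sparsity-aware hybrid gradient synchronization with PyTorch.
# Assumes torch.distributed is already initialized (e.g., with the NCCL backend).
import torch
import torch.distributed as dist

def hybrid_sync(dense_grads, embedding_grad, vocab_size):
    """Synchronize one step of gradients across data-parallel workers.

    dense_grads:    list of dense gradient tensors (e.g., attention/FFN layers).
    embedding_grad: 2-D gradient of the embedding table, shape [vocab, dim].
    vocab_size:     number of embedding rows, assumed divisible by world size.
    """
    world_size = dist.get_world_size()

    # Dense parameters: standard AllReduce (sum, then average).
    for g in dense_grads:
        dist.all_reduce(g, op=dist.ReduceOp.SUM)
        g /= world_size

    # Embedding table: partition rows across workers and use AlltoAll so each
    # worker receives only the gradient rows of the shard it owns.
    rows_per_rank = vocab_size // world_size
    send_chunks = list(embedding_grad.split(rows_per_rank, dim=0))
    recv_chunks = [torch.empty_like(c) for c in send_chunks]
    dist.all_to_all(recv_chunks, send_chunks)

    # Locally reduce the owned shard; a real system would then apply this
    # reduced shard to update the corresponding embedding rows.
    owned_shard = torch.stack(recv_chunks).sum(dim=0) / world_size
    return owned_shard
```

In this sketch the AlltoAll acts as a reduce-scatter over embedding rows, which avoids broadcasting the full (mostly zero) embedding gradient to every worker as a plain AllReduce would.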