Distributed data-parallel training has been widely adopted for deep neural network (DNN) models. Although current deep learning (DL) frameworks scale well on dense models such as image classification models, we find that they scale relatively poorly on sparse models such as natural language processing (NLP) models with highly sparse embedding tables. Most existing works overlook the sparsity of model parameters and therefore suffer from significant yet unnecessary communication overhead. In this paper, we propose EmbRace, an efficient communication framework that accelerates the communication of distributed training for sparse models. EmbRace introduces Sparsity-aware Hybrid Communication, which integrates AlltoAll and model parallelism into data-parallel training to reduce the communication overhead of highly sparse parameters. To effectively overlap sparse communication with both backward and forward computation, EmbRace further designs a 2D Communication Scheduling approach that optimizes the model computation procedure, relaxes the dependencies on embeddings, and schedules the sparse communication of each embedding row with a priority queue. We have implemented a prototype of EmbRace on top of PyTorch and Horovod, and conducted comprehensive evaluations with four representative NLP models. Experimental results show that EmbRace achieves up to a 2.41X speedup over state-of-the-art distributed training baselines.
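The core mechanism named above, Sparsity-aware Hybrid Communication, exchanges sparse embedding rows with AlltoAll under a model-parallel partition of the embedding tables. The sketch below is a minimal illustration of that general idea in PyTorch, not EmbRace's actual implementation: it assumes an initialized torch.distributed process group whose backend supports AlltoAll (e.g., NCCL), a contiguously row-sharded embedding table with `rows_per_rank` rows per worker, and hypothetical helper names chosen only for this example.

```python
# Minimal sketch (illustrative assumptions, not EmbRace's code): each rank owns a
# contiguous slice of the embedding table and uses AlltoAll to exchange only the
# rows its minibatch actually references, rather than allreducing dense gradients.
import torch
import torch.distributed as dist


def sharded_embedding_lookup(local_shard, indices, rows_per_rank, world_size):
    """local_shard: (rows_per_rank, dim) rows owned by this rank.
    indices: 1-D LongTensor of global row ids requested by this rank's minibatch.
    Tensors must live on the device required by the backend (e.g., CUDA for NCCL)."""
    owner = indices // rows_per_rank                      # rank that owns each id
    send_ids = [indices[owner == r].contiguous() for r in range(world_size)]

    # Exchange per-peer request counts so receive buffers can be sized correctly.
    send_counts = torch.tensor([t.numel() for t in send_ids], dtype=torch.int64)
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # AlltoAll #1: ship the requested row ids to their owning ranks.
    recv_ids = [torch.empty(int(n), dtype=indices.dtype) for n in recv_counts]
    dist.all_to_all(recv_ids, send_ids)

    # AlltoAll #2: answer each request with the corresponding embedding rows.
    send_rows = [local_shard[ids % rows_per_rank] for ids in recv_ids]
    recv_rows = [torch.empty(int(n), local_shard.shape[1], dtype=local_shard.dtype)
                 for n in send_counts]
    dist.all_to_all(recv_rows, send_rows)

    # The caller reassembles recv_rows into minibatch order using send_ids.
    return send_ids, recv_rows
```

A symmetric pair of AlltoAll calls in the backward pass would return the gradients of only these requested rows to their owners, which is where the savings over a dense AllReduce of the full embedding tables would come from.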