Although distributed machine learning has opened up many new and exciting research frontiers, fragmentation of models and data across machines, nodes, and sites still incurs considerable communication overhead, impeding reliable training in real-world settings. The focus on gradients as the primary shared statistic during training has spawned a number of intuitive algorithms for distributed deep learning; however, gradient-centric training of large deep neural networks (DNNs) tends to be communication-heavy, often requiring additional adaptations such as sparsity constraints, compression, and quantization to curtail bandwidth. We introduce a communication-efficient approach for training distributed DNNs that capitalizes on the outer-product structure of the gradient as revealed by the mechanics of auto-differentiation. This exposed structure motivates a new class of distributed learning algorithms that are naturally more communication-efficient than full gradient sharing. Our approach, called distributed auto-differentiation (dAD), combines rank-based compression with the innate outer-product structure of the gradient. We demonstrate that dAD trains more efficiently than other state-of-the-art distributed methods on modern architectures, such as transformers, when applied to large-scale text and imaging datasets. The future of distributed learning, we conclude, need not be dominated by gradient-centric algorithms.
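As an illustrative sketch of the outer-product structure referred to above (standard backpropagation, with notation of our own choosing rather than the paper's): for a fully connected layer with weight matrix $W \in \mathbb{R}^{m \times n}$, auto-differentiation expresses the gradient over a batch of $B$ samples as a sum of rank-one outer products,
\[
  \nabla_W \mathcal{L} \;=\; \sum_{i=1}^{B} \delta_i\, a_i^{\top},
\]
where $a_i \in \mathbb{R}^{n}$ denotes the layer's input activations and $\delta_i \in \mathbb{R}^{m}$ the backpropagated errors for sample $i$. Sharing the factors $\{\delta_i, a_i\}$ costs on the order of $B(m+n)$ values per layer, versus $mn$ for the dense gradient, so the factored form is cheaper whenever $B(m+n) < mn$; this back-of-the-envelope comparison is our illustration of the kind of saving that rank-based, outer-product-aware schemes can exploit, not a result taken from the paper.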