We propose a new integrated method that exploits model, batch, and domain parallelism for training deep neural networks (DNNs) on large distributed-memory computers using minibatch stochastic gradient descent (SGD). Our goal is to find an efficient parallelization strategy for a fixed batch size using $P$ processes. Our method is inspired by communication-avoiding algorithms in numerical linear algebra. We view the $P$ processes as logically divided into a $P_r \times P_c$ grid, where the $P_r$ dimension is implicitly responsible for model/domain parallelism and the $P_c$ dimension is implicitly responsible for batch parallelism. In practice, the integrated matrix-based parallel algorithm encapsulates these types of parallelism automatically. We analyze the communication complexity and demonstrate analytically that the lowest communication costs are often achieved with neither pure model nor pure data parallelism. We also show how the domain-parallel approach can help extend the theoretical scaling limit of the typical batch-parallel method.
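To make the process grid concrete, the following minimal Python sketch enumerates the possible $P_r \times P_c$ factorizations of $P$ and maps a process rank to its grid coordinates. It is an illustration under stated assumptions, not the algorithm analyzed in this work: the row-major rank layout, the even split of the minibatch across the $P_c$ dimension, and the function names `grid_shapes`, `grid_coords`, and `batch_slice` are all hypothetical choices made here for exposition.

```python
from typing import List, Tuple


def grid_shapes(p: int) -> List[Tuple[int, int]]:
    """Return every (P_r, P_c) pair with P_r * P_c == p.

    (1, p) corresponds to pure batch parallelism and (p, 1) to pure
    model/domain parallelism; the pairs in between mix the two.
    """
    return [(pr, p // pr) for pr in range(1, p + 1) if p % pr == 0]


def grid_coords(rank: int, pr: int, pc: int) -> Tuple[int, int]:
    """Map a rank to (row, col) on a P_r x P_c grid.

    Assumption: ranks are laid out in row-major order. The row index
    (0..P_r-1) selects the model/domain shard and the column index
    (0..P_c-1) selects the batch shard.
    """
    if not 0 <= rank < pr * pc:
        raise ValueError("rank is outside the process grid")
    return divmod(rank, pc)


def batch_slice(col: int, pc: int, batch_size: int) -> range:
    """Sample indices of the minibatch handled by grid column `col`.

    Assumption: the fixed batch is split evenly across the P_c dimension.
    """
    if batch_size % pc != 0:
        raise ValueError("batch_size must be divisible by P_c")
    per_col = batch_size // pc
    return range(col * per_col, (col + 1) * per_col)


if __name__ == "__main__":
    for pr, pc in grid_shapes(8):
        print(f"P_r={pr}, P_c={pc}")
    print(grid_coords(5, 2, 4), batch_slice(1, 4, 64))
```

Under these assumptions, choosing a parallelization strategy for fixed $P$ and fixed batch size amounts to choosing one of the pairs returned by `grid_shapes`; comparing their communication costs requires the cost model developed in the analysis, which this sketch does not attempt to reproduce.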