The size of deep neural networks (DNNs) grows rapidly as the complexity of machine learning algorithms increases. To satisfy the computation and memory requirements of DNN training, distributed deep learning based on model parallelism has been widely adopted. We propose BaPipe, a new pipeline-parallel training framework that automatically explores pipeline parallelism training methods and balanced partition strategies for distributed DNN training. In BaPipe, each accelerator computes the forward and backward propagation of a different part of the network, implementing an intra-batch pipeline parallelism strategy. BaPipe uses a new automatic load-balancing exploration strategy that considers the parameters of the DNN model as well as the computation, memory, and communication resources of the accelerator cluster. We have trained DNNs such as VGG-16, ResNet-50, and GNMT on GPU clusters and simulated the performance of different FPGA clusters. Compared with state-of-the-art data parallelism and pipeline parallelism frameworks, BaPipe provides up to 3.2x speedup and 4x memory reduction across various platforms.
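To make the intra-batch pipeline parallelism idea concrete, the following is a minimal sketch (not BaPipe's actual implementation) of a GPipe-style two-stage pipeline in PyTorch: the network is split across two accelerators, a mini-batch is divided into micro-batches, and each stage runs forward and backward on its own partition. The stage boundaries, device names, and micro-batch count are illustrative assumptions.

```python
# Minimal sketch of intra-batch pipeline parallelism; assumes two CUDA devices.
import torch
import torch.nn as nn

# Two pipeline stages of a toy network, each placed on its own accelerator.
stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(4096, 10)).to("cuda:1")

opt = torch.optim.SGD(
    list(stage0.parameters()) + list(stage1.parameters()), lr=0.01
)
loss_fn = nn.CrossEntropyLoss()

def train_step(x, y, num_micro_batches=4):
    """Split one mini-batch into micro-batches and push them through the stages.

    A real pipeline overlaps the micro-batches of different stages to keep all
    accelerators busy; here they run sequentially for clarity.
    """
    opt.zero_grad()
    for xm, ym in zip(x.chunk(num_micro_batches), y.chunk(num_micro_batches)):
        # Forward: stage 0 on cuda:0, activations transferred to cuda:1 for stage 1.
        a0 = stage0(xm.to("cuda:0"))
        out = stage1(a0.to("cuda:1"))
        # Backward: autograd propagates gradients back across the device boundary.
        loss = loss_fn(out, ym.to("cuda:1")) / num_micro_batches
        loss.backward()
    opt.step()
```

In this sketch the partition point is fixed by hand; BaPipe's contribution is to search for balanced partitions automatically, taking the model's per-layer cost and the cluster's computation, memory, and communication resources into account.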