Several efficient distributed algorithms have been developed for matrix-matrix multiplication: the 3D algorithm, the 2D SUMMA algorithm, and the 2.5D algorithm. These algorithms were conceived independently, and each strikes a different trade-off between the memory needed per node and the inter-node data communication volume. The convolutional neural network (CNN) computation may be viewed as a generalization of matrix multiplication combined with neighborhood stencil computations. We develop communication-efficient distributed-memory algorithms for CNNs that are analogous to the 2D/2.5D/3D algorithms for matrix-matrix multiplication.
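
The view of a CNN layer as matrix multiplication plus a stencil can be illustrated with a minimal single-node sketch. The names (`conv2d`, `matmul`), the index letters (K output channels, C input channels, R x S kernel), and the simplifications (unit stride, no padding, no batch dimension) are assumptions made for illustration and are not taken from the text above. The loop over `c` is the matmul-like contraction, and the loops over `r` and `s` are the neighborhood stencil.

```python
def conv2d(inp, ker):
    """Direct CNN layer on nested lists.

    Out[k][h][w] = sum over c, r, s of Ker[k][c][r][s] * In[c][h+r][w+s]
    inp: C x H x W, ker: K x C x R x S (unit stride, no padding: assumptions).
    """
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    K, R, S = len(ker), len(ker[0][0]), len(ker[0][0][0])
    out_h, out_w = H - R + 1, W - S + 1
    out = [[[0.0] * out_w for _ in range(out_h)] for _ in range(K)]
    for k in range(K):
        for h in range(out_h):
            for w in range(out_w):
                acc = 0.0
                for c in range(C):          # matmul-like contraction over channels
                    for r in range(R):      # stencil over the spatial neighborhood
                        for s in range(S):
                            acc += ker[k][c][r][s] * inp[c][h + r][w + s]
                out[k][h][w] = acc
    return out


def matmul(a, b):
    """Plain matrix product of nested lists: a is M x N, b is N x P."""
    return [[sum(a[i][n] * b[n][j] for n in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# With a 1x1 kernel the stencil disappears and the layer is a matrix product
# of the K x C weights with the C x (H*W) flattened input.
inp = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]            # C=2, H=W=2
ker = [[[[1.0]], [[2.0]]], [[[0.0]], [[1.0]]], [[[3.0]], [[-1.0]]]]   # K=3, C=2, R=S=1
weights = [[ker[k][c][0][0] for c in range(2)] for k in range(3)]
pixels = [[inp[c][h][w] for h in range(2) for w in range(2)] for c in range(2)]
flat_conv = [[v for row in plane for v in row] for plane in conv2d(inp, ker)]
assert flat_conv == matmul(weights, pixels)
```

The sketch shows only the sequential computation. How the loop dimensions are partitioned across nodes, which is where the 2D/2.5D/3D analogy applies, is not shown here.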