Straggler Mitigation through Unequal Error Protection for Distributed Approximate Matrix Multiplication

Large-scale machine learning and data mining methods routinely distribute computations across multiple agents to parallelize processing. The time each agent needs to complete its computation depends on the availability of local resources, which gives rise to the "straggler problem". As a remedy, the matrix sub-blocks can be linearly coded: the Parameter Server (PS) uses a channel code to encode the matrix sub-blocks and distributes the encoded matrices to the workers for multiplication. In this paper, we employ Unequal Error Protection (UEP) codes to obtain an approximation of the matrix product in the distributed computation setting in the presence of stragglers. The resiliency level of each sub-block is chosen according to its norm, since blocks with larger norms have a larger effect on the result of the matrix multiplication. In particular, we consider two approaches to distributing the matrix computation: (i) a row-times-column paradigm, and (ii) a column-times-row paradigm. For both paradigms, we characterize the performance of the proposed approach from a theoretical perspective by bounding the expected reconstruction error for matrices with uncorrelated entries. We also apply the proposed coding strategy to the gradient computation in the back-propagation step of training a Deep Neural Network (DNN) for an image classification task. Our numerical experiments show that approximating matrix products with UEP codes can significantly reduce the overall time required for DNN training to converge.
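
To make the two partitioning paradigms and the norm-based ranking concrete, the following is a minimal NumPy sketch. It is not the paper's UEP coding scheme: no channel code is applied, and stragglers are only emulated by keeping the highest-ranked partial products. All function names are hypothetical, and the use of the Frobenius norm as the importance measure is an assumption, since the abstract refers only to "its norm".

```python
import numpy as np


def split_column_times_row(A, B, num_blocks):
    """Column-times-row: split A into column blocks and B into row blocks.

    A @ B then equals the sum over m of A_m @ B_m.
    """
    A_cols = np.array_split(A, num_blocks, axis=1)
    B_rows = np.array_split(B, num_blocks, axis=0)
    return list(zip(A_cols, B_rows))


def split_row_times_column(A, B, num_row_blocks, num_col_blocks):
    """Row-times-column: split A into row blocks and B into column blocks.

    A_i @ B_j is then block (i, j) of A @ B.
    """
    A_rows = np.array_split(A, num_row_blocks, axis=0)
    B_cols = np.array_split(B, num_col_blocks, axis=1)
    return [[(Ai, Bj) for Bj in B_cols] for Ai in A_rows]


def importance(pair):
    """Assumed importance measure: product of the Frobenius norms."""
    Ai, Bj = pair
    return np.linalg.norm(Ai) * np.linalg.norm(Bj)


def approximate_column_times_row(A, B, num_blocks, num_received):
    """Sum only the num_received most important partial products.

    This emulates stragglers: the least important products never arrive.
    """
    pairs = split_column_times_row(A, B, num_blocks)
    ranked = sorted(pairs, key=importance, reverse=True)
    approx = np.zeros((A.shape[0], B.shape[1]))
    for Am, Bm in ranked[:num_received]:
        approx += Am @ Bm
    return approx


# Toy data: the column blocks of A have very different norms.
A = np.ones((4, 6)) * np.array([10, 10, 1, 1, 0.1, 0.1])
B = np.ones((6, 5))
exact = A @ B

# Keep only the single most important of three partial products.
approx = approximate_column_times_row(A, B, num_blocks=3, num_received=1)
rel_error = np.linalg.norm(exact - approx) / np.linalg.norm(exact)

# Row-times-column: the sub-products tile the exact result.
grid = split_row_times_column(A, B, num_row_blocks=2, num_col_blocks=2)
tiled = np.block([[Ai @ Bj for (Ai, Bj) in row] for row in grid])
```

In this toy case a single high-norm partial product already recovers most of the result, which is the intuition behind assigning stronger protection to sub-blocks with larger norms.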
