Large-scale machine learning and data mining methods routinely distribute computations across multiple agents to parallelize processing. The time required for the computations at the agents is affected by the availability of local resources, giving rise to the "straggler problem", in which the computation results are held back by unresponsive agents. For this problem, linear coding of the matrix sub-blocks can be used to introduce resilience against straggling. The Parameter Server (PS) applies a channel code and distributes the matrices to the workers for multiplication. It then produces an approximation of the desired matrix multiplication from the results of the computations received by a given deadline. In this paper, we propose to employ Unequal Error Protection (UEP) codes to alleviate the straggler problem. The level of protection assigned to each sub-block is chosen according to its norm, since blocks with larger norms have a greater effect on the result of the matrix multiplication. We validate the effectiveness of our scheme both theoretically and through numerical evaluations. We derive a theoretical characterization of the performance of UEP with random linear codes and compare it to the case of Equal Error Protection (EEP). We also apply the proposed coding strategy to the computation of the back-propagation step in the training of a Deep Neural Network (DNN), for which we investigate the fundamental trade-off between the precision of the computation and the time required to complete it.
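As a rough illustration of the setting described above, the following Python sketch distributes a matrix-vector product over randomly coded sub-blocks, assigns more redundancy to the sub-blocks with larger norms, and lets the PS recover a least-squares approximation from whichever coded results arrive before the deadline. The two-class importance split, the per-class redundancy levels, the biased generator rows, and the straggler rate are illustrative assumptions, not the exact UEP construction analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Problem setup: approximate y = A @ x, with A split into row sub-blocks.
m, n, n_blocks = 120, 40, 6
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)
blocks = np.split(A, n_blocks, axis=0)           # sub-blocks A_1, ..., A_K

# Rank sub-blocks by norm: larger-norm blocks receive more protection.
norms = np.array([np.linalg.norm(B) for B in blocks])
important = set(np.argsort(-norms)[: n_blocks // 2])   # assumed two-class split
redundancy = {True: 3, False: 1}                        # assumed coded copies per class

# Random linear encoding: each worker receives one random combination of the
# sub-blocks; important blocks appear in more coded combinations (UEP).
workers = []
for i in range(n_blocks):
    for _ in range(redundancy[i in important]):
        g = rng.standard_normal(n_blocks)        # random generator row
        g[i] += n_blocks                          # bias the row toward block i
        coded = sum(gk * Bk for gk, Bk in zip(g, blocks))
        workers.append((g, coded))

# Each worker computes its coded block times x; stragglers miss the deadline.
results = [(g, C @ x) for g, C in workers]
received = [r for r in results if rng.random() > 0.4]   # ~40% stragglers (assumed)

# PS decoding: least-squares estimates of the per-block products A_i @ x from
# the coded results that arrived, stacked into an approximation of A @ x.
G = np.stack([g for g, _ in received])                   # (n_received, n_blocks)
Y = np.stack([y for _, y in received])                   # (n_received, m // n_blocks)
block_products = np.linalg.lstsq(G, Y, rcond=None)[0]    # row i estimates A_i @ x
y_hat = block_products.reshape(-1)

print("relative error:", np.linalg.norm(y_hat - A @ x) / np.linalg.norm(A @ x))
```

If enough coded results arrive for the generator matrix to have full column rank, the decoding is exact; otherwise the least-squares step yields an approximation whose error is smaller for the heavily protected, large-norm blocks, which is the qualitative behavior the UEP scheme is designed to exploit.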