Large-scale machine learning and data mining methods routinely distribute computations across multiple agents to parallelize processing. The time required to complete the computations at the agents is affected by the availability of local resources and/or poor channel conditions, giving rise to the so-called "straggler problem". As a remedy, we employ Unequal Error Protection (UEP) codes to obtain an approximation of the matrix product in the distributed computation setting, providing higher protection to the blocks that have a larger influence on the final result. We characterize the performance of the proposed approach theoretically by bounding the expected reconstruction error for matrices with uncorrelated entries. We also apply the proposed coding strategy to the evaluation of the gradients in the back-propagation step when training a Deep Neural Network (DNN) for an image classification task. Our numerical experiments show that significant improvements in the overall time required to reach DNN training convergence can be obtained by approximating matrix products with UEP codes in the presence of stragglers.
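To make the idea concrete, the following is a minimal sketch of unequal protection for an approximate matrix product under stragglers. It is not the paper's UEP construction: repetition coding stands in for a proper UEP code, block importance is proxied by the Frobenius norm of the row blocks of A, and the function name, straggler model, and parameters (n_blocks, n_workers, p_straggle) are illustrative assumptions.

```python
# Sketch: approximate A @ B when some workers straggle, by giving the row
# blocks of A deemed most important more redundancy. Repetition coding is a
# stand-in for a full UEP code; the priority rule and straggler model below
# are illustrative assumptions, not the paper's exact scheme.
import numpy as np

rng = np.random.default_rng(0)

def approx_matmul_uep(A, B, n_blocks=8, n_workers=16, p_straggle=0.3):
    """Split A into row blocks, replicate high-importance blocks on more
    workers, and rebuild A @ B from whatever tasks finish in time."""
    blocks = np.array_split(np.arange(A.shape[0]), n_blocks)
    # Importance proxy: blocks with a larger norm influence the product more.
    importance = np.array([np.linalg.norm(A[idx]) for idx in blocks])
    # Unequal protection: share the worker budget proportionally to importance.
    copies = np.maximum(
        1, np.round(n_workers * importance / importance.sum()).astype(int)
    )

    C = np.zeros((A.shape[0], B.shape[1]))
    for idx, c in zip(blocks, copies):
        # A block is recovered if at least one of its c replicas finishes
        # before the deadline (each replica straggles with prob. p_straggle).
        if (rng.random(c) >= p_straggle).any():
            C[idx] = A[idx] @ B      # exact partial result
        # otherwise the block stays zero -> contributes to the approximation error
    return C

A = rng.standard_normal((256, 128))
B = rng.standard_normal((128, 64))
C_hat = approx_matmul_uep(A, B)
err = np.linalg.norm(C_hat - A @ B) / np.linalg.norm(A @ B)
print(f"relative reconstruction error: {err:.3f}")
```

Under this toy model, increasing the redundancy of the most important blocks trades a small loss in coverage of low-norm blocks for a lower expected reconstruction error, which is the effect the UEP scheme exploits.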