The overall execution time of distributed matrix computations is often dominated by slow worker nodes (stragglers) within the cluster. Recently, various coding techniques have been used to mitigate the effect of stragglers, whereby worker nodes are assigned the job of processing encoded submatrices of the original matrices. In many machine learning and optimization problems, the relevant matrices are often sparse. Several prior coded computation methods operate with dense linear combinations of the original submatrices; this can significantly increase the worker node computation times and, consequently, the overall job execution time. Moreover, several existing techniques treat the stragglers as failures (erasures) and discard their computations. In this work, we present a coding approach that operates with limited encoding of the original submatrices and utilizes the partial computations performed by the slower workers. While our scheme retains the optimal recovery threshold of prior work, it also allows us to trade off straggler resilience against worker computation speed for sparse input matrices. Extensive numerical experiments conducted on an AWS (Amazon Web Services) cluster confirm that the proposed approach significantly speeds up the worker computations (and thus the overall job).
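To make the erasure-based baseline concrete, the following is a minimal sketch of classical MDS-coded distributed matrix-vector multiplication, the setting the abstract contrasts against: the master splits the matrix into blocks, each worker receives a dense linear combination of the blocks, and the result is decoded from any subset of fast workers while stragglers are discarded. All names, block counts, and the Vandermonde encoding are illustrative assumptions; this does not reproduce the paper's limited-encoding scheme or its use of partial straggler computations.

```python
import numpy as np

# Illustrative parameters (assumptions, not from the paper):
# n rows of A, k data blocks, p workers; any k of the p results suffice.
rng = np.random.default_rng(0)
n, k, p = 12, 3, 5
A = rng.standard_normal((n, 4))
x = rng.standard_normal(4)

# Split A row-wise into k submatrices.
blocks = np.split(A, k)

# MDS-style dense encoding: worker i stores a linear combination of all
# k blocks, with coefficients from a Vandermonde matrix (any k rows of
# which are invertible since the evaluation points are distinct). Note
# that even if A is sparse, these dense combinations are generally not.
G = np.vander(np.arange(1, p + 1), k, increasing=True).astype(float)  # p x k
encoded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(p)]

# Each worker computes its encoded product.
results = [Ai @ x for Ai in encoded]

# Decode from any k results; here workers 1 and 3 are treated as
# stragglers (erasures) and their work is discarded entirely.
idx = [0, 2, 4]
Y = np.stack([results[i] for i in idx])   # k x (n/k)
decoded = np.linalg.solve(G[idx], Y)      # recovers blocks[j] @ x
y = decoded.reshape(-1)

assert np.allclose(y, A @ x)
```

Because each encoded submatrix mixes all k data blocks, a worker's input loses the sparsity of A; the abstract's "limited encoding" idea amounts to combining only a few blocks per worker to keep per-worker computation cheap, at some cost in straggler resilience.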