Straggler nodes are a well-known bottleneck in distributed matrix computations, reducing overall computation and communication speed. A common mitigation strategy is to incorporate Reed-Solomon based MDS (maximum distance separable) codes into the framework; this can achieve resilience against an optimal number of stragglers. However, these codes assign dense linear combinations of submatrices to the worker nodes. When the input matrices are sparse, such approaches increase the number of non-zero entries in the encoded matrices, which in turn adversely affects the worker computation time. In this work, we develop a distributed matrix computation approach in which the assigned encoded submatrices are random linear combinations of a small number of submatrices. In addition to being well suited for sparse input matrices, our approach continues to have the optimal straggler resilience in a certain range of problem parameters. Moreover, compared to recent sparse matrix computation approaches, the search for a "good" set of random coefficients to promote numerical stability in our method is much more computationally efficient. We show that our approach can efficiently utilize partial computations done by slower worker nodes in a heterogeneous system, which can enhance the overall computation speed. Numerical experiments conducted through Amazon Web Services (AWS) demonstrate up to 30% reduction in per worker node computation time and 100x faster encoding compared to the available methods.
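To illustrate why combining only a few submatrices helps with sparse inputs, the following is a minimal sketch, not the scheme proposed in this work. It splits a sparse matrix into block-columns and compares the number of non-zero entries when each worker receives a random combination of a few blocks versus all of them. The cyclic choice of blocks, the Gaussian coefficients, the matrix sizes, and all function names are assumptions made for illustration; decoding and the search for numerically stable coefficients are not shown.

```python
import random


def random_sparse_matrix(rows, cols, density, rng):
    """Build a rows x cols matrix (list of lists) with about `density` non-zeros."""
    return [[rng.uniform(1.0, 2.0) if rng.random() < density else 0.0
             for _ in range(cols)] for _ in range(rows)]


def split_block_columns(matrix, k):
    """Split a matrix into k block-columns of equal width."""
    width = len(matrix[0]) // k
    return [[row[j * width:(j + 1) * width] for row in matrix] for j in range(k)]


def encode(blocks, n_workers, weight, rng):
    """Give each worker a random linear combination of `weight` blocks.

    Worker i combines blocks i, i+1, ..., i+weight-1 (indices mod k).
    This cyclic support is a hypothetical choice for illustration only.
    """
    k = len(blocks)
    rows, cols = len(blocks[0]), len(blocks[0][0])
    encoded = []
    for i in range(n_workers):
        support = [(i + j) % k for j in range(weight)]
        coeffs = [rng.gauss(0.0, 1.0) for _ in support]
        combo = [[sum(c * blocks[s][r][q] for c, s in zip(coeffs, support))
                  for q in range(cols)] for r in range(rows)]
        encoded.append(combo)
    return encoded


def nnz(matrix):
    """Count the non-zero entries of a matrix."""
    return sum(1 for row in matrix for x in row if x != 0.0)


rng = random.Random(0)
A = random_sparse_matrix(60, 40, 0.05, rng)
blocks = split_block_columns(A, 4)

# Low-weight encoding: each worker combines 2 of the 4 block-columns.
sparse_encoded = encode(blocks, 6, 2, rng)
# Dense encoding: each worker combines all 4 block-columns.
dense_encoded = encode(blocks, 6, 4, rng)

avg_sparse = sum(nnz(m) for m in sparse_encoded) / len(sparse_encoded)
avg_dense = sum(nnz(m) for m in dense_encoded) / len(dense_encoded)
print("average non-zeros per worker, weight 2:", avg_sparse)
print("average non-zeros per worker, weight 4:", avg_dense)
```

In this sketch the dense encoding touches every block, so each encoded submatrix inherits the non-zero positions of all blocks, whereas the low-weight encoding inherits only those of the few blocks it combines. Since a worker's multiplication cost grows with the number of non-zero entries, the low-weight encoding leaves each worker with less work.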