Polynomial-based methods have recently been used in several works to mitigate the effect of stragglers (slow or failed nodes) in distributed matrix computations. For a system with $n$ worker nodes, of which $s$ can be stragglers, these approaches achieve an optimal recovery threshold: the intended result can be decoded as long as any $(n-s)$ worker nodes complete their tasks. However, they suffer from serious numerical issues owing to the condition number of the corresponding real Vandermonde-structured recovery matrices, which grows exponentially in $n$. We present a novel approach that leverages the properties of circulant permutation matrices and rotation matrices for coded matrix computation. In addition to having an optimal recovery threshold, we demonstrate an upper bound on the worst-case condition number of our recovery matrices that grows as $\approx O(n^{s+5.5})$; in the practical scenario where $s$ is a constant, this bound grows polynomially in $n$. Our schemes leverage the well-behaved conditioning of complex Vandermonde matrices with parameters on the complex unit circle, while still working with computation over the reals. Exhaustive experimental results demonstrate that our proposed method has condition numbers that are orders of magnitude lower than those of prior work.
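The conditioning contrast described above can be seen in a small numerical sketch. This is only a toy illustration, not the proposed coding scheme: it compares a square real Vandermonde matrix with one whose parameters are the $n$-th roots of unity. The choice of equispaced real evaluation points in $[-1, 1]$, the use of full $n \times n$ matrices rather than $(n-s)$-row recovery submatrices, and the function names are all assumptions made for illustration.

```python
import numpy as np


def real_vandermonde(n):
    """n x n Vandermonde matrix on real points (assumed equispaced in [-1, 1])."""
    x = np.linspace(-1.0, 1.0, n)
    return np.vander(x, N=n, increasing=True)


def unit_circle_vandermonde(n):
    """n x n Vandermonde matrix whose parameters are the n-th roots of unity."""
    z = np.exp(2j * np.pi * np.arange(n) / n)
    return np.vander(z, N=n, increasing=True)


def condition_numbers(sizes):
    """Return (n, cond_real, cond_unit_circle) for each n, using the 2-norm."""
    return [
        (n,
         np.linalg.cond(real_vandermonde(n)),
         np.linalg.cond(unit_circle_vandermonde(n)))
        for n in sizes
    ]


if __name__ == "__main__":
    for n, k_real, k_circle in condition_numbers([4, 8, 12, 16, 20]):
        print(f"n={n:2d}  real: {k_real:.3e}  unit circle: {k_circle:.3e}")
```

In this toy setting the unit-circle matrix is a scaled unitary (DFT) matrix, so its condition number stays at 1 up to rounding, while the real one grows rapidly with $n$. Recovery matrices built from arbitrary subsets of workers are less favorable than this full-matrix case, which is why a worst-case bound such as the one stated above is of interest.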