Sequential Gradient Coding For Straggler Mitigation

In distributed computing, slower nodes (stragglers) usually become a bottleneck. Gradient Coding (GC), introduced by Tandon et al., is an efficient technique that uses principles of error-correcting codes to distribute gradient computation in the presence of stragglers. In this paper, we consider the distributed computation of a sequence of gradients $\{g(1),g(2),\ldots,g(J)\}$, where processing of each gradient $g(t)$ starts in round-$t$ and finishes by round-$(t+T)$. Here $T\geq 0$ denotes a delay parameter. In the GC scheme, coding is performed only across computing nodes, which results in a solution where $T=0$. On the other hand, having $T>0$ allows for designing schemes that exploit the temporal dimension as well. In this work, we propose two schemes that demonstrate improved performance compared to GC. Our first scheme combines GC with selective repetition of previously unfinished tasks and achieves improved straggler mitigation. In our second scheme, which constitutes our main contribution, we apply GC to a subset of the tasks and repetition to the remainder of the tasks. We then multiplex these two classes of tasks across workers and rounds in an adaptive manner, based on past straggler patterns. Using theoretical analysis, we demonstrate that our second scheme achieves a significant reduction in the computational load. In our experiments, we study a practical setting of concurrently training multiple neural networks over an AWS Lambda cluster involving 256 worker nodes, where our framework naturally applies. We demonstrate that the latter scheme can yield a 16\% improvement in runtime over the baseline GC scheme, in the presence of naturally occurring, non-simulated stragglers.
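
To make the $T=0$ baseline concrete, the following is a minimal sketch of a fractional-repetition-style gradient code in the spirit of Tandon et al. It covers only the coding-across-workers idea, not the two sequential schemes proposed in the paper. The function names, the use of one data partition per worker, and the toy integer "gradients" are all assumptions made for illustration. With $n$ workers and a tolerance of $s$ stragglers (where $s+1$ divides $n$), each worker sums the partial gradients of one block of $s+1$ partitions, and every block is replicated on $s+1$ workers, so any $s$ stragglers leave at least one copy of each block.

```python
from itertools import combinations


def assign_blocks(n_workers, s):
    """Map each worker to a block id; every block is held by s + 1 workers."""
    if s < 0 or n_workers % (s + 1) != 0:
        raise ValueError("n_workers must be divisible by s + 1")
    n_blocks = n_workers // (s + 1)
    return [w % n_blocks for w in range(n_workers)]


def vec_sum(vectors):
    """Element-wise sum of equal-length vectors given as lists."""
    return [sum(col) for col in zip(*vectors)]


def worker_message(worker, assignment, partial_grads, s):
    """A worker returns the sum of the s + 1 partial gradients in its block."""
    b = assignment[worker]
    return vec_sum(partial_grads[b * (s + 1):(b + 1) * (s + 1)])


def decode(messages, assignment):
    """Recover the full gradient from the non-straggling workers' messages.

    `messages` maps worker index -> message. One message per block suffices.
    """
    n_blocks = max(assignment) + 1
    per_block = {}
    for worker, msg in messages.items():
        per_block.setdefault(assignment[worker], msg)
    if len(per_block) < n_blocks:
        raise RuntimeError("too many stragglers: some block has no survivor")
    return vec_sum(per_block.values())


def tolerates_all_patterns(n_workers, s, partial_grads):
    """Check that every pattern of s stragglers still yields the full gradient."""
    assignment = assign_blocks(n_workers, s)
    full = vec_sum(partial_grads)
    for stragglers in combinations(range(n_workers), s):
        messages = {
            w: worker_message(w, assignment, partial_grads, s)
            for w in range(n_workers)
            if w not in stragglers
        }
        if decode(messages, assignment) != full:
            return False
    return True
```

In this sketch each worker processes $s+1$ of the $n$ partitions, which is the per-worker computational load that schemes with $T>0$ aim to reduce by also spreading work across rounds.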
