We consider distributed learning in the presence of slow and unresponsive worker nodes, referred to as stragglers. To mitigate the effect of stragglers, gradient coding redundantly assigns partial computations to the workers such that the overall result can be recovered from the non-straggling workers alone. Gradient codes are designed to tolerate a fixed number of stragglers. Since the number of stragglers in practice is random and unknown a priori, tolerating a fixed number of stragglers can yield a sub-optimal computation load and result in higher latency. We propose a gradient coding scheme that tolerates a flexible number of stragglers by carefully concatenating gradient codes for different straggler tolerances. Through proper task scheduling and a small amount of additional signaling, our scheme adapts the computation load of the workers to the actual number of stragglers. We analyze the latency of the proposed scheme and show that it is significantly lower than that of gradient codes designed for a fixed number of stragglers.
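To make the baseline concrete, the following is a minimal sketch of one standard gradient code, the fractional-repetition scheme of Tandon et al. (2017), which tolerates a fixed number s of stragglers: workers are split into groups of s+1, every worker in a group computes the same block of partial gradients, and the master sums one response per group. All names here (`assign_partitions`, `encode`, `decode`, `partial_grads`) are illustrative and not from the paper; the flexible concatenated construction proposed in the abstract is not shown.

```python
import numpy as np

def assign_partitions(n_workers: int, s: int) -> dict:
    """Assign each worker a block of s+1 data partitions. Workers are split
    into n/(s+1) groups; every worker in a group gets the same block, so any
    s stragglers still leave at least one live copy of each block."""
    assert n_workers % (s + 1) == 0, "need (s+1) to divide n"
    return {w: list(range((w // (s + 1)) * (s + 1),
                          (w // (s + 1) + 1) * (s + 1)))
            for w in range(n_workers)}

def encode(worker: int, assignment: dict, partial_grads: np.ndarray) -> np.ndarray:
    """Each worker returns the sum of the partial gradients of its block."""
    return partial_grads[assignment[worker]].sum(axis=0)

def decode(responses: dict, n_workers: int, s: int) -> np.ndarray:
    """Recover the full gradient from any >= n-s responses: pick one
    responding worker per group and add up their coded messages."""
    n_groups = n_workers // (s + 1)
    picked = {}
    for w, msg in responses.items():
        picked.setdefault(w // (s + 1), msg)   # keep one response per group
    assert len(picked) == n_groups, "more than s stragglers: cannot decode"
    return sum(picked.values())

# Toy run: n = 6 workers, tolerance s = 2, gradient dimension 4.
n, s, d = 6, 2, 4
rng = np.random.default_rng(0)
partial_grads = rng.normal(size=(n, d))        # one partial gradient per partition
assignment = assign_partitions(n, s)
responses = {w: encode(w, assignment, partial_grads)
             for w in range(n) if w not in (1, 5)}  # workers 1 and 5 straggle
assert np.allclose(decode(responses, n, s), partial_grads.sum(axis=0))
```

Note the trade-off the abstract targets: each worker here always computes s+1 partitions, so if fewer than s workers actually straggle, the redundant computation load is wasted; the proposed concatenated scheme adapts this load to the realized number of stragglers.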