Existing gradient coding schemes introduce identical redundancy across the coordinates of gradients and hence cannot fully utilize the computation results from partial stragglers. This motivates the introduction of diverse redundancies across the coordinates of gradients. This paper considers a distributed computation system consisting of one master and $N$ workers, characterized by a general partial straggler model, and focuses on solving a general large-scale machine learning problem with $L$ model parameters. We show that it is sufficient to provide at most $N$ levels of redundancy, tolerating $0, 1, \ldots, N-1$ stragglers, respectively. Consequently, we propose an optimal block coordinate gradient coding scheme based on a stochastic optimization problem that optimizes the partition of the $L$ coordinates into $N$ blocks, each with identical redundancy, so as to minimize the expected overall runtime for collaboratively computing the gradient. We obtain an optimal solution of the stochastic optimization problem using a stochastic projected subgradient method and propose two low-complexity approximate solutions with closed-form expressions. We also show that, under a shifted-exponential distribution, for any $L$, the expected overall runtimes of the two approximate solutions and the minimum overall runtime have sub-linear multiplicative gaps in $N$. To the best of our knowledge, this is the first work to optimize the redundancies of gradient coding introduced across the coordinates of gradients.
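For concreteness, the following is a minimal sketch of how a stochastic projected subgradient method could tune the partition of the $L$ coordinates into $N$ blocks, where block $n$ is encoded to tolerate $n-1$ stragglers. It is not the paper's exact formulation: the integer block sizes are relaxed to nonnegative reals on the scaled simplex, the completion-time model `runtime`, the shifted-exponential parameters `mu` and `shift`, and the helper `project_simplex` are all illustrative assumptions, and the subgradient is estimated by finite differences on sampled worker times.

```python
# A minimal sketch (assumed model, not the paper's exact scheme) of a
# stochastic projected subgradient method for choosing block sizes
# l_1, ..., l_N, where l_n coordinates tolerate n-1 stragglers.
import numpy as np

rng = np.random.default_rng(0)
L, N = 1000, 8            # number of coordinates / workers (assumed)
mu, shift = 1.0, 0.5      # shifted-exponential rate and shift (assumed)

def project_simplex(x, total):
    """Euclidean projection of x onto {y >= 0, sum(y) = total}."""
    u = np.sort(x)[::-1]
    css = np.cumsum(u) - total
    rho = np.nonzero(u > css / (np.arange(len(x)) + 1))[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(x - theta, 0.0)

def runtime(l, t):
    """Illustrative completion-time model: block n (tolerating n-1
    stragglers) finishes once the (N-n+1)-th fastest worker is done,
    with per-worker load n * l[n-1] / N for an n-fold redundant block."""
    t_sorted = np.sort(t)
    loads = (np.arange(1, N + 1) * l) / N
    # element n-1 pairs block n's load with the (N-n+1)-th order statistic
    return np.max(loads * t_sorted[::-1])

l = np.full(N, L / N)      # start from the uniform partition
for k in range(1, 501):
    t = shift + rng.exponential(1.0 / mu, size=N)  # sampled worker times
    # one-sided finite-difference estimate of a stochastic subgradient
    g = np.array([(runtime(l + 1e-3 * e, t) - runtime(l, t)) / 1e-3
                  for e in np.eye(N)])
    l = project_simplex(l - (10.0 / np.sqrt(k)) * g, L)

print("block sizes:", np.round(l, 1))
```

The decreasing step size of order $1/\sqrt{k}$ is the standard choice for projected subgradient methods; the resulting fractional block sizes would still need rounding to an integer partition of the $L$ coordinates.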