Optimization-based Block Coordinate Gradient Coding for Mitigating Partial Stragglers in Distributed Learning

Gradient coding schemes effectively mitigate full stragglers in distributed learning by introducing identical redundancy in the coded local partial derivatives corresponding to all model parameters. However, they are no longer effective for partial stragglers, because they cannot use the incomplete computation results that partial stragglers produce. This paper aims to design a new gradient coding scheme for mitigating partial stragglers in distributed learning. Specifically, we consider a distributed system consisting of one master and N workers, characterized by a general partial straggler model, and focus on solving a general large-scale machine learning problem with L model parameters using gradient coding. First, we propose a coordinate gradient coding scheme with L coding parameters representing L possibly different diversities for the L coordinates, which generalizes most gradient coding schemes. Then, we consider the minimization of the expected overall runtime and the maximization of the completion probability with respect to the L coding parameters for coordinates, both of which are challenging discrete optimization problems. To reduce computational complexity, we first transform each problem into an equivalent but much simpler discrete problem with N ≪ L variables representing the partition of the L coordinates into N blocks, each with identical redundancy. This yields an equivalent but more easily implemented block coordinate gradient coding scheme with N coding parameters for blocks. Then, we adopt continuous relaxation to further reduce computational complexity. For the resulting minimization of the expected overall runtime, we develop an iterative algorithm of computational complexity O(N^2) to obtain an optimal solution and derive two closed-form approximate solutions, both with computational complexity O(N). For the resulting maximization of the completion probability, we develop an iterative algorithm of...
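To make the block partition idea concrete, the following is a minimal sketch rather than the paper's algorithm. It assumes a simplified runtime model that the abstract does not state: a coordinate coded to tolerate s stragglers costs each worker s + 1 units of work, every worker processes blocks in order of increasing s, and a block with tolerance s is recovered once the N - s fastest workers have finished it. The function name `overall_runtime` and the per-unit worker times are hypothetical and chosen only for illustration.

```python
import numpy as np


def overall_runtime(block_sizes, worker_times):
    """Runtime of a block coordinate coding scheme under an ASSUMED cost model.

    block_sizes[s]  : number of coordinates coded to tolerate s stragglers
                      (s = 0, ..., N-1), so sum(block_sizes) = L.
    worker_times[n] : time worker n needs per unit of work (hypothetical input).
    """
    x = np.asarray(block_sizes, dtype=float)
    t = np.sort(np.asarray(worker_times, dtype=float))  # fastest worker first
    n_workers = len(t)
    if len(x) != n_workers:
        raise ValueError("need one block size per tolerance level 0..N-1")

    # Assumption: tolerance s costs (s + 1) units per coordinate per worker,
    # and blocks are processed in order of increasing s.
    work_until_block_done = np.cumsum((np.arange(n_workers) + 1) * x)

    # Assumption: block s is recovered when the (N - s)-th fastest worker
    # finishes it, i.e. at per-unit time t[N - 1 - s].
    slowest_needed = t[::-1]

    # The master is done when the last block is recovered.
    return float(np.max(work_until_block_done * slowest_needed))


if __name__ == "__main__":
    times = [1.0, 2.0, 10.0]  # worker 3 is a partial straggler (made-up numbers)
    print(overall_runtime([6, 0, 0], times))  # no redundancy        -> 60.0
    print(overall_runtime([0, 6, 0], times))  # identical s = 1      -> 24.0
    print(overall_runtime([0, 0, 6], times))  # identical s = 2      -> 18.0
    print(overall_runtime([1, 2, 3], times))  # mixed block sizes    -> 14.0
```

In this toy instance with N = 3 and L = 6, the mixed partition finishes earlier than any scheme that applies identical redundancy to all coordinates, which is the effect the block coordinate scheme is designed to exploit. With random worker times, averaging this quantity over samples would give a Monte Carlo estimate of the expected overall runtime under the same assumed model.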
