Emerging large-scale, data-hungry algorithms require computations to be delegated from a central server to several worker nodes. One major challenge in distributed computation is handling the delays and failures caused by stragglers. To address this challenge, introducing an efficient amount of redundant computation via distributed coded computation has received significant attention. Recent approaches in this area have mainly focused on introducing the minimum computational redundancy needed to tolerate a certain number of stragglers. To the best of our knowledge, the current literature lacks a unified end-to-end design for a heterogeneous setting in which the workers can vary in their computation and communication capabilities. The contribution of this paper is a novel framework for joint scheduling and coding, in a setting where both the workers and the arrival of streaming computational jobs follow stochastic models. In our initial joint scheme, we propose a systematic framework that shows how to select a set of workers and how to split the computational load among the selected workers based on their differences, in order to minimize the average in-order job execution delay. Through simulations, we demonstrate that our framework performs dramatically better than the naive method that splits the computational load uniformly among the workers, and that its performance is close to the ideal.
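To make the contrast between uniform and heterogeneity-aware load splitting concrete, the following is a minimal sketch and not the scheme proposed in the paper. It assumes each worker has a fixed, known processing rate (a deterministic simplification of the stochastic worker models described above), ignores coding, worker selection, and job arrivals, and uses hypothetical function names chosen purely for illustration.

```python
from typing import List


def uniform_split(total_load: float, rates: List[float]) -> List[float]:
    """Naive baseline: every worker receives the same share of the load."""
    n = len(rates)
    return [total_load / n] * n


def proportional_split(total_load: float, rates: List[float]) -> List[float]:
    """Illustrative alternative: shares proportional to each worker's rate.

    This is an assumed heuristic for illustration, not the paper's method.
    """
    total_rate = sum(rates)
    return [total_load * r / total_rate for r in rates]


def completion_time(loads: List[float], rates: List[float]) -> float:
    """Job finishes when the slowest worker finishes (no coding assumed)."""
    return max(load / rate for load, rate in zip(loads, rates))


# Hypothetical rates (work units per time unit) for three unequal workers.
rates = [1.0, 2.0, 5.0]
total_load = 80.0

t_uniform = completion_time(uniform_split(total_load, rates), rates)
t_proportional = completion_time(proportional_split(total_load, rates), rates)
print(f"uniform: {t_uniform:.2f}, proportional: {t_proportional:.2f}")
```

With these assumed rates, the uniform split is held back by the slowest worker (about 26.7 time units), while the rate-proportional split lets all workers finish together (10 time units). The framework described in the abstract additionally accounts for stochastic worker behavior, coded redundancy, and in-order execution of streaming jobs, none of which this sketch models.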