To improve the utility of learning applications and make machine learning solutions feasible for complex applications, a substantial amount of heavy computation is needed. It is therefore essential to distribute the computations among several workers, which raises the major challenge of coping with delays and failures caused by the system's heterogeneity and uncertainties. In particular, minimizing the end-to-end in-order job execution delay, from arrival to delivery, is of great importance for real-world delay-sensitive applications. In this paper, for the computation of each job iteration in a stochastic heterogeneous distributed system where the workers vary in their computing and communication power, we present a novel joint scheduling-coding framework that optimally splits the coded computational load among the workers. This closes the gap between the workers' response times and is critical to maximizing resource utilization. To further reduce the in-order execution delay, we also incorporate redundant computations in each iteration of a distributed computational job. Our simulation results demonstrate that the delay obtained using the proposed solution is dramatically lower than that of the uniform split, which is oblivious to the system's heterogeneity, and is in fact very close to an ideal lower bound with only a small percentage of redundant computations.
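The intuition behind closing the gap between the workers' response times can be illustrated with a minimal deterministic sketch. It is not the framework proposed in the paper: it ignores coding, redundancy, and randomness, and it assumes a simple hypothetical model in which worker i has a fixed communication delay `comm_delays[i]` and a constant processing rate `rates[i]` (load units per second). The function names and the numbers are illustrative assumptions. Under that model, the load is split so that every worker finishes at the same time, and the result is compared with a uniform split.

```python
def balanced_split(total_load, rates, comm_delays):
    """Split total_load so all workers finish at the same time.

    Assumed model (illustrative only): worker i finishes at
    comm_delays[i] + load_i / rates[i].
    """
    # Solve comm_delays[i] + load_i / rates[i] = finish for all i,
    # subject to sum(load_i) = total_load.
    weighted_delay = sum(m * c for m, c in zip(rates, comm_delays))
    finish = (total_load + weighted_delay) / sum(rates)
    loads = [m * (finish - c) for m, c in zip(rates, comm_delays)]
    if min(loads) < 0:
        # A worker whose communication delay exceeds the common finish
        # time would get a negative load; this sketch does not handle it.
        raise ValueError("a worker is too slow to take part in the split")
    return loads, finish


def uniform_finish(total_load, rates, comm_delays):
    """Finish time of the slowest worker under an equal split."""
    share = total_load / len(rates)
    return max(c + share / m for m, c in zip(rates, comm_delays))


# Hypothetical system: three workers with very different speeds.
rates = [10.0, 5.0, 1.0]
comm_delays = [0.1, 0.2, 0.5]
loads, t_balanced = balanced_split(160.0, rates, comm_delays)
t_uniform = uniform_finish(160.0, rates, comm_delays)
print("balanced loads:", loads)
print("balanced finish time:", t_balanced)
print("uniform finish time:", t_uniform)
```

In this toy setting the uniform split is dominated by the slowest worker, while the balanced split keeps every worker busy until a common finish time. The stochastic setting of the paper additionally needs redundancy to absorb random delays, which this sketch does not model.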