As machine learning and other algorithms grow in complexity and data requirements, distributed computing becomes necessary to satisfy the growing computational and storage demands, because it enables parallel execution of the smaller tasks that make up a large computing job. However, random fluctuations in task service times lead to straggling tasks with long execution times. Redundancy, in the form of task replication and erasure coding, provides diversity that allows a job to complete once only a subset of the redundant tasks has been executed, thus removing the dependency on straggling tasks. When resources are constrained (here, a fixed number of parallel servers), increasing redundancy reduces the resources available for parallelism. In this paper, we characterize the diversity vs. parallelism trade-off and identify the optimal strategy, among replication, coding, and splitting, that minimizes the expected job completion time. We consider three common service time distributions and establish three models describing how these distributions scale with task size. We find that different distributions under different scaling models operate optimally at different levels of redundancy, and thus may require very different code rates.
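The trade-off can be illustrated with a small Monte Carlo sketch. Below, a unit job on n servers is split into k equal tasks and MDS-coded into n coded tasks (code rate k/n), so the job finishes when the k fastest tasks are done. The service-time model here is an illustrative assumption, not the paper's exact models: a shifted exponential whose deterministic shift scales linearly with the task size 1/k, while the random part does not scale. The function name `job_time` and all parameter values are hypothetical.

```python
import random
import statistics

def job_time(n_servers, k, n_trials=20000, shift=1.0, rate=1.0):
    """Estimate the expected completion time of a unit job split into k
    tasks, MDS-coded into n_servers coded tasks (one per server).
    The job completes when the k fastest tasks finish, i.e. at the
    k-th order statistic of the task service times.

    Assumed service-time model (illustrative only): shifted exponential
    with shift proportional to task size 1/k and a non-scaling
    exponential component with the given rate.
    """
    times = []
    for _ in range(n_trials):
        # Draw one service time per server, sorted ascending.
        tasks = sorted(shift / k + random.expovariate(rate)
                       for _ in range(n_servers))
        times.append(tasks[k - 1])  # job done when the k-th fastest finishes
    return statistics.fmean(times)

# Sweep the code rate k/n for n = 10 servers:
#   k = 1  -> pure replication (maximum diversity, no parallelism)
#   k = 10 -> pure splitting   (maximum parallelism, no diversity)
n = 10
for k in (1, 2, 5, 10):
    print(f"k={k:2d} (rate {k}/{n}): E[T] ~ {job_time(n, k):.3f}")
```

Sweeping k exposes the trade-off the abstract describes: small k buys diversity against stragglers at the cost of larger per-task work, while large k shrinks each task but leaves the job exposed to the slowest servers; which extreme (or interior point) wins depends on the assumed distribution and its scaling with task size.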