In distributed computing systems slow working nodes, known as stragglers, can greatly extend finishing times. Coded computing is a technique that enables straggler-resistant computation. Most coded computing techniques presented to date provide robustness by ensuring that the time to finish depends only on a set of the fastest nodes. However, while stragglers do compute less work than non-stragglers, in real-world commercial cloud computing systems (e.g., Amazon's Elastic Compute Cloud (EC2)) the distinction is often a soft one. In this paper, we develop hierarchical coded computing that exploits the work completed by all nodes, both fast and slow, automatically integrating the potential contribution of each. We first present a conceptual framework to represent the division of work amongst nodes in coded matrix multiplication as a cuboid partitioning problem. This framework allows us to unify existing methods and motivates new techniques. We then develop three methods of hierarchical coded computing that we term bit-interleaved coded computation (BICC), multilevel coded computation (MLCC), and hybrid hierarchical coded computation (HHCC). In this paradigm, each worker is tasked with completing a sequence (a hierarchy) of ordered subtasks. The sequence of subtasks, and the complexity of each, is designed so that partial work completed by stragglers can be used, rather than ignored. We note that our methods can be used in conjunction with any coded computing method. We illustrate this by showing how we can use our methods to accelerate all previously developed coded computing techniques by enabling them to exploit stragglers. Under a widely studied statistical model of completion time, our approach realizes a $66\%$ improvement in the expected finishing time. On Amazon EC2, the gain was $27\%$ when stragglers are simulated.
翻译:在分布式计算系统中,被称为累进式计算器的慢速工作节点可以大大延长完成时间。 编码计算是一种能够使累进式抗累进计算的技术。 迄今提出的大多数编码计算技术通过确保完成时间只取决于一组最快节点来提供稳健性。 但是, 在分布式计算器比非累进式计算器计算少工作, 在现实世界商业云计算系统中( 例如亚马逊的“ 亚马逊的 高价计算器” ), 区别往往是一个软的。 在本文中, 我们开发的等级编码计算方法能够利用所有节点完成的工作, 快速和缓慢地, 自动整合每种节点的潜在贡献。 我们首先提出了一个概念框架, 代表编码矩阵中各节点之间的分工, 以幼崽隔式的分隔器配置问题。 这个框架允许我们统一现有方法, 并激励新技术。 然后我们开发了三种等级编码计算方法, 我们用任何比分解的计算方法来加速计算( BICC ), 多级编码计算( MLCC), 多级编码计算( MLCC) 利用所有节点计算,, 自动计算, 自动计算, 将每个节分解的计算方法都用来利用每个节点算法 。