Matrix multiplication is a fundamental operation in machine learning and is commonly distributed into multiple parallel tasks for large datasets. Stragglers and other failures can severely impact the overall completion time. Recent works in coded computing provide a novel strategy to mitigate stragglers with coded tasks, with an objective of minimizing the number of tasks needed to recover the overall result, known as the recovery threshold. However, we demonstrate that this combinatorial definition does not directly optimize the probability of failure. In this paper, we introduce a novel analytical metric, which focuses on the most likely event and measures the optimality of a coding scheme by its probability of decoding. Our general framework encompasses many other computational schemes and metrics as a special case. Far from being a purely theoretical construction, these definitions lead us to a practical construction of random codes for matrix multiplication, i.e., locally random alloy codes, which are optimal with respect to the measures. We present experimental results on Amazon EC2 which empirically demonstrate the improvement in terms of running time and numerical stability relative to well-established benchmarks.
翻译:矩阵乘法是机器学习中的一项基本操作,通常分布为大型数据集的多重平行任务。 Stragglers 和其他失败会严重影响整个完成时间。 最近在编码计算中开展的工作提供了一种新的战略,以减轻执行编码任务的分解者,目的是最大限度地减少恢复总体结果所需的任务数量,称为回收阈值。 然而,我们证明,这一组合定义并不直接优化失败概率。 在本文中,我们引入了一个新的分析指标,它侧重于最可能的事件,并衡量编码计划的最佳性,因为它有可能解码。我们的总框架包括许多其他计算计划和衡量标准,作为一个特例。这些定义远不是纯粹的理论构建,而是导致我们实际地构建矩阵乘法的随机代码,即本地随机合金代码,这些代码与措施相比是最佳的。我们在亚马逊EC2上提出了实验结果,从经验上证明了运行时间和数字稳定性相对于既定基准的改进。