Tensors, i.e., multi-linear functions, are a fundamental building block of machine learning algorithms. In order to train on large data-sets, it is common practice to distribute the computation amongst workers. However, stragglers and other faults can severely impact the performance and overall training time. A novel strategy to mitigate these failures is the use of coded computation. We introduce a new metric for analysis, the typical recovery threshold, which focuses on the most likely event, and we provide a novel construction of distributed coded tensor operations that is optimal with respect to this metric. We show that our general framework encompasses many other computational schemes and metrics as special cases. In particular, we prove that the recovery threshold and the tensor rank can be recovered as special cases of the typical recovery threshold when the probability of noise, i.e., a fault, is equal to zero, thereby obtaining a generalization of noiseless computation to the noisy setting as a serendipitous result. Far from being purely theoretical constructions, these definitions lead us to practical random code constructions, namely locally random p-adic alloy codes, which are optimal with respect to these metrics. We analyze experiments conducted on Amazon EC2 and establish that these codes are faster and more numerically stable than many other benchmark computation schemes in practice, as predicted by the theory.
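As a minimal, illustrative sketch of the recovery-threshold idea the abstract builds on (not the paper's p-adic alloy construction): in a polynomial-style MDS code for a distributed matrix-vector product, the data matrix is split into `k` blocks, encoded across `n` workers, and the exact result can be decoded from *any* `k` responses, so up to `n - k` stragglers are tolerated. All names and parameters below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 3, 6                      # k data blocks, n workers; tolerates n - k stragglers
A = rng.standard_normal((6, 4))  # 6 rows, split into k = 3 blocks of 2 rows each
x = rng.standard_normal(4)

blocks = np.split(A, k)                    # A_0, A_1, A_2
evals = np.arange(1, n + 1, dtype=float)   # distinct evaluation points, one per worker

# Worker i computes p(z_i) @ x, where p(z) = sum_j A_j z^j is the encoded matrix.
worker_results = [sum(B * z**j for j, B in enumerate(blocks)) @ x for z in evals]

# Recovery threshold: any k = 3 responses suffice to interpolate the
# degree-(k-1) polynomial and recover every block of A @ x exactly.
survivors = [1, 4, 5]            # pretend workers 0, 2, and 3 straggled
V = np.vander(evals[survivors], k, increasing=True)   # k x k Vandermonde system
coeffs = np.linalg.solve(V, np.array([worker_results[i] for i in survivors]))

recovered = np.concatenate(coeffs)         # stacked A_j @ x blocks
assert np.allclose(recovered, A @ x)       # exact despite three stragglers
```

The typical recovery threshold refines this worst-case count by focusing on the most likely straggler patterns rather than all of them.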