We consider the problem of evaluating arbitrary multivariate polynomials over a massive dataset containing multiple inputs, on a distributed computing system with a master node and multiple worker nodes. Generalized Lagrange Coded Computing (GLCC) codes are proposed to simultaneously provide resiliency against stragglers who do not return computation results in time, security against adversarial workers who deliberately modify results for their benefit, and information-theoretic privacy of the dataset against possible collusion among workers. GLCC codes are constructed by first partitioning the dataset into multiple groups, then encoding the dataset using carefully designed interpolation polynomials, and sharing multiple encoded data points with each worker, such that the interference from computation results across groups can be eliminated at the master. In particular, GLCC codes include the state-of-the-art Lagrange Coded Computing (LCC) codes as a special case, and exhibit a more flexible tradeoff between communication and computation overheads in optimizing system efficiency. Furthermore, we apply GLCC codes to distributed training of machine learning models, and demonstrate that they achieve speedups of $2.5\text{--}3.9\times$ over LCC codes in training time, across experiments training image classifiers on different datasets, model architectures, and straggler patterns.
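To make the LCC special case concrete, the following is a minimal Python sketch of the encode/compute/decode pipeline for a single group, covering only straggler resiliency (no privacy noise or adversary detection). All identifiers, parameter choices, and the toy polynomial $f(x)=x^2$ are illustrative assumptions, not notation from the paper.

```python
# Sketch of Lagrange Coded Computing (LCC), the single-group special case of
# GLCC, over the prime field GF(p). Inputs are scalars and f(x) = x^2 is the
# degree-2 polynomial to evaluate. All names here are illustrative.

p = 2_147_483_647  # a Mersenne prime defining GF(p)

def inv(a):
    """Multiplicative inverse in GF(p) via Fermat's little theorem."""
    return pow(a, p - 2, p)

def lagrange_eval(xs, ys, z):
    """Evaluate the unique degree-(len(xs)-1) interpolant through
    (xs[i], ys[i]) at point z, with all arithmetic mod p."""
    total = 0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for j, xj in enumerate(xs):
            if j != i:
                num = num * ((z - xj) % p) % p
                den = den * ((xi - xj) % p) % p
        total = (total + yi * num % p * inv(den)) % p
    return total

# Dataset: K = 3 inputs; f has degree 2, so any 2*(K-1)+1 = 5 results suffice.
K, deg_f = 3, 2
X = [17, 42, 99]                        # the dataset X_1..X_K
betas = list(range(1, K + 1))           # interpolation points beta_k
N = deg_f * (K - 1) + 1 + 1             # 6 workers: tolerates 1 straggler
alphas = list(range(K + 1, K + 1 + N))  # evaluation points alpha_i (disjoint)

# Encoding: worker i receives u(alpha_i), where u(beta_k) = X_k.
shares = [lagrange_eval(betas, X, a) for a in alphas]

# Each worker applies f to its encoded share; worker 0 straggles.
f = lambda x: x * x % p
results = {i: f(s) for i, s in enumerate(shares) if i != 0}

# Decoding: f(u(z)) has degree deg_f*(K-1) = 4, so interpolate it from any
# 5 returned results, then read off f(X_k) = f(u(beta_k)).
pts = list(results.items())[: deg_f * (K - 1) + 1]
xs = [alphas[i] for i, _ in pts]
ys = [y for _, y in pts]
decoded = [lagrange_eval(xs, ys, b) for b in betas]
assert decoded == [f(x) for x in X]
print(decoded)  # [289, 1764, 9801]
```

GLCC generalizes this picture: the $K$ inputs are split into groups, each worker receives several encoded points (one per group), and the encoding polynomials are designed so that the cross-group interference terms cancel when the master interpolates.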