ByzShield: An Efficient and Robust System for Distributed Training

Training large-scale models on distributed clusters is a critical component of the machine learning pipeline. However, this training can easily fail if some workers behave in an adversarial (Byzantine) fashion and return arbitrary results to the parameter server (PS). Many existing papers consider a variety of attack models and propose robust aggregation and/or computational redundancy to alleviate the effects of these attacks. In this work we consider an omniscient attack model in which the adversary has full knowledge of the gradient computation assignments of the workers and can choose to attack any q (or fewer) of the n worker nodes to induce maximal damage. Our redundancy-based method, ByzShield, leverages the properties of bipartite expander graphs for the assignment of tasks to workers; this helps to effectively mitigate the effect of the Byzantine behavior. Specifically, we demonstrate an upper bound on the worst-case fraction of corrupted gradients based on the eigenvalues of our constructions, which are built from mutually orthogonal Latin squares and Ramanujan graphs. Our numerical experiments indicate a reduction of over 36% on average in the fraction of corrupted gradients compared to the state of the art. Likewise, our experiments on training followed by image classification on the CIFAR-10 dataset show that ByzShield has on average a 20% advantage in accuracy under the most sophisticated attacks. ByzShield also tolerates a much larger fraction of adversarial nodes compared to prior work.
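
To make the redundancy idea concrete, the following is a minimal sketch of a task assignment built from mutually orthogonal Latin squares, together with a brute-force search for the worst-case number of corrupted gradients under an omniscient adversary. It is an illustration under stated assumptions, not the authors' implementation: the function names, the specific Latin square family `(k*i + j) mod l` for prime `l`, the tiny parameters, and the use of a majority vote over the replicas of each gradient are all assumptions made here. The Ramanujan graph construction and the eigenvalue bound are not shown.

```python
from itertools import combinations


def mols_assignment(l, r):
    """Assign l*l files to r*l workers using r mutually orthogonal Latin squares.

    Assumption: l is prime and 1 <= r <= l - 1. Square k (k = 1..r) holds the
    symbol (k*i + j) mod l in cell (i, j). Worker (k, s) stores every cell of
    square k that holds symbol s, so each file is replicated on r workers.
    """
    workers = []
    for k in range(1, r + 1):
        for s in range(l):
            files = frozenset(
                i * l + j
                for i in range(l)
                for j in range(l)
                if (k * i + j) % l == s
            )
            workers.append(files)
    return workers


def worst_case_corrupted(workers, num_files, r, q):
    """Brute-force the worst choice of q Byzantine workers.

    Assumption: a file's gradient counts as corrupted when a strict majority
    of its r replicas sit on Byzantine workers. The search is exponential in
    q, so it is only practical for small clusters.
    """
    majority = r // 2 + 1
    worst = 0
    for bad in combinations(range(len(workers)), q):
        counts = [0] * num_files
        for w in bad:
            for f in workers[w]:
                counts[f] += 1
        worst = max(worst, sum(c >= majority for c in counts))
    return worst


if __name__ == "__main__":
    l, r = 5, 3  # hypothetical small cluster: 15 workers, 25 files
    workers = mols_assignment(l, r)
    for q in (2, 3):
        c = worst_case_corrupted(workers, l * l, r, q)
        print(f"q={q}: worst-case corrupted fraction = {c}/{l * l}")
```

In this sketch, two workers drawn from different squares share exactly one file and two workers from the same square share none, so the adversary cannot concentrate many replicas of the same files on a small set of nodes. That limited overlap is the intuition behind using expander-like bipartite graphs for the assignment.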
