Training of large-scale models on distributed clusters is a critical component of the machine learning pipeline. However, such training can easily fail if some workers behave in an adversarial (Byzantine) fashion, returning arbitrary results to the parameter server (PS). A plethora of existing papers consider a variety of attack models and propose robust aggregation and/or computational redundancy to alleviate the effects of these attacks. In this work we consider an omniscient attack model in which the adversary has full knowledge of the gradient computation assignments of the workers and can choose to attack (up to) any q out of K worker nodes to induce maximal damage. Our redundancy-based method ByzShield leverages the properties of bipartite expander graphs for the assignment of tasks to workers; this helps to effectively mitigate the effect of the Byzantine behavior. Specifically, we derive an upper bound on the worst-case fraction of corrupted gradients in terms of the eigenvalues of our constructions, which are built from mutually orthogonal Latin squares and Ramanujan graphs. Our numerical experiments indicate over a 36% reduction on average in the fraction of corrupted gradients compared to the state of the art. Likewise, our experiments on training followed by image classification on the CIFAR-10 dataset show that ByzShield has on average a 20% advantage in accuracy under the most sophisticated attacks. ByzShield also tolerates a much larger fraction of adversarial nodes compared to prior work.
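As background for the construction named above, the following is a minimal sketch of the classical method for generating mutually orthogonal Latin squares (MOLS) over a prime p, via L_a[i][j] = (a·i + j) mod p for a = 1, …, p−1. How ByzShield maps these squares to concrete worker/task assignments is detailed in the paper itself; the function names here are illustrative only.

```python
def latin_square(a: int, p: int) -> list[list[int]]:
    """Latin square L_a over Z_p with entries (a*i + j) mod p.

    For prime p and a = 1..p-1, the resulting p-1 squares are
    pairwise (mutually) orthogonal -- a classical construction.
    """
    return [[(a * i + j) % p for j in range(p)] for i in range(p)]


def are_orthogonal(L1: list[list[int]], L2: list[list[int]]) -> bool:
    """Two Latin squares of order n are orthogonal iff superimposing
    them yields all n^2 distinct ordered pairs of symbols."""
    n = len(L1)
    pairs = {(L1[i][j], L2[i][j]) for i in range(n) for j in range(n)}
    return len(pairs) == n * n
```

For example, with p = 5, `latin_square(1, 5)` and `latin_square(2, 5)` are orthogonal: the map (i, j) ↦ (i + j, 2i + j) mod 5 is invertible, so all 25 symbol pairs appear exactly once.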