State-of-the-art machine learning models are routinely trained on large-scale distributed clusters. Crucially, such systems can be compromised when some of the computing devices exhibit abnormal (Byzantine) behavior and return arbitrary results to the parameter server (PS). This behavior may be attributed to a plethora of reasons, including system failures and orchestrated attacks. Existing work suggests robust aggregation and/or computational redundancy to alleviate the effect of distorted gradients. However, most of these schemes are ineffective when an adversary knows the task assignment and can choose the attacked workers judiciously to induce maximal damage. Our proposed method Aspis assigns gradient computations to worker nodes using a subset-based assignment which allows for multiple consistency checks on the behavior of a worker node. Examination of the calculated gradients and post-processing (clique-finding in an appropriately constructed graph) by the central node allows for efficient detection and subsequent exclusion of adversaries from the training process. We prove the Byzantine resilience and detection guarantees of Aspis under weak and strong attacks and extensively evaluate the system on various large-scale training scenarios. The principal metric for our experiments is the test accuracy, for which we demonstrate a significant improvement of about 30% compared to many state-of-the-art approaches on the CIFAR-10 dataset. The corresponding reduction of the fraction of corrupted gradients ranges from 16% to 99%.
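The abstract's core mechanism (redundant subset-based task assignment, pairwise consistency checks, and clique-finding over an agreement graph) can be illustrated with a toy sketch. This is not the Aspis implementation; all function names, the group size, and the brute-force clique search are illustrative assumptions.

```python
from itertools import combinations

def assign_tasks(num_workers, group_size):
    """Replicate each task on one subset (group) of workers, so that every
    pair of workers shares tasks and can be cross-checked (toy version of
    the subset-based assignment; real schemes choose parameters carefully)."""
    return list(combinations(range(num_workers), group_size))

def agreement_graph(groups, results):
    """Build the agreement graph: edge (i, j) iff workers i and j returned
    identical results on every task they share.
    results[g][w] is the value worker w returned for group g."""
    workers = sorted({w for g in groups for w in g})
    edges = set()
    for i, j in combinations(workers, 2):
        shared = [g for g in groups if i in g and j in g]
        if shared and all(results[g][i] == results[g][j] for g in shared):
            edges.add((i, j))
    return set(workers), edges

def max_clique(nodes, edges):
    """Brute-force maximum clique (fine at toy scale). Honest workers always
    agree with one another, so they form a clique; Byzantine workers that
    return distorted gradients lose edges and fall outside it."""
    for k in range(len(nodes), 0, -1):
        for cand in combinations(sorted(nodes), k):
            if all((a, b) in edges for a, b in combinations(cand, 2)):
                return set(cand)
    return set()
```

With 5 workers, groups of size 3, and worker 4 acting Byzantine (returning a distorted value on every assigned task), the honest workers {0, 1, 2, 3} emerge as the maximum clique, and worker 4 can be excluded from aggregation.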