A plethora of modern machine learning tasks require the utilization of large-scale distributed clusters as a critical component of the training pipeline. However, abnormal Byzantine behavior of the worker nodes can derail the training and compromise the quality of the inference. Such behavior can be attributed to unintentional system malfunctions or orchestrated attacks; as a result, some nodes may return arbitrary results to the parameter server (PS) that coordinates the training. Recent work considers a wide range of attack models and has explored robust aggregation and/or computational redundancy to correct the distorted gradients. In this work, we consider attack models ranging from strong ones ($q$ omniscient adversaries with full knowledge of the defense protocol, whose identities can change from iteration to iteration) to weak ones ($q$ randomly chosen adversaries with limited collusion abilities, whose identities change only every few iterations). Our algorithms rely on redundant task assignments coupled with detection of adversarial behavior. For strong attacks, we demonstrate a reduction in the fraction of distorted gradients ranging from 16% to 99% as compared to the prior state of the art. Our top-1 classification accuracy results on the CIFAR-10 data set demonstrate a 25% advantage in accuracy (averaged over strong and weak scenarios) under the most sophisticated attacks compared to state-of-the-art methods.
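As a concrete illustration of the general idea only (not the paper's actual assignment scheme or detection rule), the following minimal Python sketch shows a redundant task assignment in which each gradient task is replicated across several workers, with the PS flagging copies that disagree with the majority of their replica group. The cyclic assignment, parameter values, and adversary placement are all illustrative assumptions.

```python
import numpy as np

# Minimal sketch of redundancy-based Byzantine detection (illustrative only).
rng = np.random.default_rng(0)

NUM_TASKS = 4        # gradient tasks (e.g., data batches) per iteration
REDUNDANCY = 3       # each task is computed by this many workers
NUM_WORKERS = NUM_TASKS * REDUNDANCY
DIM = 5              # gradient dimension

# Redundant task assignment (hypothetical cyclic scheme): worker w is
# assigned task w % NUM_TASKS, so each task has REDUNDANCY replicas.
assignment = {w: w % NUM_TASKS for w in range(NUM_WORKERS)}

true_grads = rng.normal(size=(NUM_TASKS, DIM))
# Two Byzantine workers, placed in different task groups so that every
# group keeps an honest majority (the regime in which voting succeeds).
byzantine = {0, 5}

# Each worker returns its copy of the task gradient; adversaries distort theirs.
returned = {
    w: true_grads[assignment[w]]
       + (100 * rng.normal(size=DIM) if w in byzantine else 0)
    for w in range(NUM_WORKERS)
}

# PS-side detection: replicas of the same task should agree, so any copy
# that disagrees with the majority of its group is flagged as distorted.
aggregate = np.zeros(DIM)
flagged = set()
for t in range(NUM_TASKS):
    group = [(w, returned[w]) for w in range(NUM_WORKERS) if assignment[w] == t]
    for w_i, g_i in group:
        votes = sum(np.allclose(g_i, g_j) for _, g_j in group)  # includes itself
        if votes <= len(group) // 2:
            flagged.add(w_i)
    aggregate += np.mean([g for w, g in group if w not in flagged], axis=0)

print("flagged:", sorted(flagged), "| true adversaries:", sorted(byzantine))
```

The sketch relies on the standard premise of vote-based detection: as long as each task group retains an honest majority, distorted replicas stand out against the agreeing honest copies; an omniscient adversary, by contrast, would try to coordinate distortions within a group, which is what motivates the stronger assignment and detection machinery developed in the paper.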