This paper considers the Byzantine fault-tolerance problem in the distributed stochastic gradient descent (D-SGD) method, a popular algorithm for distributed multi-agent machine learning. In this problem, each agent samples data points independently from a certain data-generating distribution. In the fault-free case, the D-SGD method allows all the agents to learn a mathematical model that best fits the data collectively sampled by all agents. We consider the case when a fraction of agents may be Byzantine faulty. Such faulty agents may not follow a prescribed algorithm correctly, and may render the traditional D-SGD method ineffective by sharing arbitrary incorrect stochastic gradients. We propose a norm-based gradient-filter, named comparative gradient elimination (CGE), that robustifies the D-SGD method against Byzantine agents. We show that the CGE gradient-filter guarantees fault-tolerance against a bounded fraction of Byzantine agents under standard stochastic assumptions, and is computationally simpler than many existing gradient-filters such as multi-KRUM, geometric median-of-means, and the spectral filters. We empirically show, by simulating distributed learning on neural networks, that the fault-tolerance of CGE is comparable to that of existing gradient-filters. We also empirically show that exponential averaging of stochastic gradients improves the fault-tolerance of a generic gradient-filter.
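Below is a minimal illustrative sketch of a norm-based gradient-filter in the spirit of CGE, not the paper's reference implementation. It assumes the server collects one stochastic gradient per agent, discards the f gradients with the largest Euclidean norms (f being the assumed bound on the number of Byzantine agents), and averages the rest before the D-SGD update; the function name `cge_filter` and the usage lines are hypothetical.

```python
import numpy as np

def cge_filter(gradients, f):
    """Norm-based gradient elimination sketch (CGE-style).

    Assumption: keep the n - f gradients with the smallest Euclidean
    norms and average them. `gradients` is a list of 1-D numpy arrays,
    one stochastic gradient per agent; `f` bounds the number of
    Byzantine agents.
    """
    n = len(gradients)
    # Order agents by the Euclidean norm of their reported gradient.
    order = np.argsort([np.linalg.norm(g) for g in gradients])
    # Eliminate the f gradients with the largest norms.
    kept = [gradients[i] for i in order[: n - f]]
    # Aggregate the surviving gradients by averaging.
    return np.mean(kept, axis=0)

# Hypothetical server-side update step:
# grads = [agent_gradient(model) for agent_gradient in agent_gradients]
# model = model - step_size * cge_filter(grads, f)
```

Because the filter only requires computing and sorting gradient norms, its per-iteration cost is roughly O(nd + n log n) for n agents and d model parameters, which is the computational simplicity the abstract contrasts with filters such as multi-KRUM or spectral methods.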