Garfield: System Support for Byzantine Machine Learning

Today's Byzantine-resilient machine learning (ML) systems remain vulnerable because they require trusted machines and/or a synchronous network. We present Garfield, a system that provably achieves Byzantine resilience in ML applications without assuming any trusted component or any bound on communication or computation delays. Garfield leverages ML specificities to make progress even though consensus is impossible in such an asynchronous, Byzantine environment. Following the classical server/worker architecture, Garfield replicates the parameter server and relies on the statistical properties of stochastic gradient descent to keep the models on the correct servers close to each other. On the worker side, Garfield uses statistically robust gradient aggregation rules (GARs) to achieve resilience against Byzantine workers. We integrate Garfield with two widely used ML frameworks, TensorFlow and PyTorch, while achieving transparency: applications developed with either framework do not need to change their interfaces to be made Byzantine resilient. Our implementation supports full-stack computations on both CPUs and GPUs. We report on our evaluation of Garfield with different (a) baselines, (b) ML models (e.g., ResNet-50 and VGG), and (c) hardware infrastructures (CPUs and GPUs). Our evaluation highlights several facts about the cost of Byzantine resilience. In particular, (a) Byzantine resilience, unlike crash resilience, induces an accuracy loss, and (b) the throughput overhead comes much more from communication (70%) than from aggregation.
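
To make the idea of a statistically robust GAR concrete, here is a minimal Python sketch that contrasts plain averaging with the coordinate-wise median. The abstract does not say which GARs Garfield implements, so the coordinate-wise median is used here only as a well-known example of a robust rule. The function names and the toy gradients are hypothetical and are not taken from Garfield's code.

```python
from statistics import median
from typing import List, Sequence

Gradient = Sequence[float]


def average(gradients: Sequence[Gradient]) -> List[float]:
    """Non-robust baseline: coordinate-wise mean of the received gradients."""
    if not gradients:
        raise ValueError("need at least one gradient")
    n = len(gradients)
    return [sum(coords) / n for coords in zip(*gradients)]


def coordinate_wise_median(gradients: Sequence[Gradient]) -> List[float]:
    """Illustrative robust GAR (an assumption, not necessarily Garfield's rule):
    take the median of each coordinate across workers."""
    if not gradients:
        raise ValueError("need at least one gradient")
    return [median(coords) for coords in zip(*gradients)]


# Four honest workers report similar gradients; one Byzantine worker
# reports an arbitrarily large vector (toy values, for illustration only).
honest = [[1.0, -2.0], [1.2, -1.8], [0.8, -2.2], [1.1, -2.1]]
byzantine = [[1e6, 1e6]]
received = honest + byzantine

mean_update = average(received)                    # dominated by the outlier
robust_update = coordinate_wise_median(received)   # stays within the honest range
```

With a single malicious vector, the mean is dragged far away from every honest gradient, whereas each coordinate of the median stays between the smallest and largest honest value as long as honest workers form a majority.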
