Machine learning has begun to play a central role in many applications. Many of these applications involve datasets that are distributed across multiple computing devices or machines, due either to design constraints (e.g., multiagent systems) or to computational and privacy considerations (e.g., learning from smartphone data). Such applications often require the learning tasks to be carried out in a decentralized fashion, in which there is no central server directly connected to all nodes. In real-world decentralized settings, nodes are prone to undetected failures caused by malfunctioning equipment, cyberattacks, and the like, and such failures are likely to crash non-robust learning algorithms. The focus of this paper is the robustification of decentralized learning in the presence of nodes that have undergone Byzantine failures. The Byzantine failure model allows faulty nodes to deviate arbitrarily from their intended behaviors, so algorithms designed to withstand it are robust in the strongest sense. The study of Byzantine resilience in decentralized learning, however, is still in its infancy, in contrast to the distributed (server-based) setting. In particular, existing Byzantine-resilient decentralized learning methods either do not scale well to large machine learning models or lack the statistical convergence guarantees needed to characterize their generalization errors. This paper introduces a scalable, Byzantine-resilient decentralized machine learning framework termed Byzantine-resilient decentralized gradient descent (BRIDGE). Algorithmic and statistical convergence guarantees are provided for one variant of BRIDGE, both for strongly convex problems and for a class of nonconvex problems. In addition, large-scale decentralized learning experiments establish that the BRIDGE framework is scalable and delivers competitive results for Byzantine-resilient convex and nonconvex learning.
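The abstract does not spell out how faulty neighbors are screened; purely as an illustration, the sketch below shows what one node's update could look like under coordinate-wise trimmed-mean screening, a rule commonly used for Byzantine resilience in this literature. All names and parameters here (trimmed_mean, bridge_t_step, the trimming parameter b, the step size lr) are hypothetical and not taken from the paper.

```python
import numpy as np

def trimmed_mean(vectors, b):
    # Sort values independently at each coordinate across the received vectors,
    # then drop the b smallest and b largest values per coordinate and average.
    stacked = np.sort(np.stack(vectors), axis=0)
    assert stacked.shape[0] > 2 * b, "need more than 2b vectors to trim"
    return stacked[b:stacked.shape[0] - b].mean(axis=0)

def bridge_t_step(w_local, neighbor_iterates, local_grad, lr, b):
    # Screen the neighborhood (including the node's own iterate), then take
    # a local gradient step from the screened point.
    screened = trimmed_mean(neighbor_iterates + [w_local], b)
    return screened - lr * local_grad(w_local)

# Toy usage: one node minimizing a local quadratic, with one Byzantine neighbor
# that sends an arbitrary (here, huge) vector.
rng = np.random.default_rng(0)
w = rng.standard_normal(5)
honest = [w + 0.01 * rng.standard_normal(5) for _ in range(4)]
byzantine = [1e6 * np.ones(5)]
grad = lambda x: x  # gradient of 0.5 * ||x||^2
w = bridge_t_step(w, honest + byzantine, grad, lr=0.1, b=1)
```

The appeal of this kind of screen is that, as long as at most b of a node's neighbors are Byzantine, every coordinate of the screened vector lies within the range of values reported by honest neighbors, so no single faulty node can drag the update arbitrarily far.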