Some of the hardest problems in deep learning can be solved by pooling together the computational resources of many independent parties, as is the case for scientific collaborations and volunteer computing. Unfortunately, any single participant in such systems can jeopardize the entire training run by sending incorrect updates, whether deliberately or by mistake. Training in the presence of such peers requires specialized distributed training algorithms with Byzantine tolerance. These algorithms often sacrifice efficiency by introducing redundant communication or by passing all updates through a trusted server. As a result, it can be infeasible to apply such algorithms to large-scale distributed deep learning, where models can have billions of parameters. In this work, we propose a novel protocol for secure (Byzantine-tolerant) decentralized training that emphasizes communication efficiency. We rigorously analyze this protocol: in particular, we provide theoretical bounds for its resistance against Byzantine and Sybil attacks and show that it incurs only marginal communication overhead. To demonstrate its practical effectiveness, we conduct large-scale experiments on image classification and language modeling in the presence of Byzantine attackers.