Some of the hardest problems in deep learning can be solved through the combined effort of many independent parties, as in volunteer computing and federated learning. These setups rely on large numbers of peers to provide computational resources or to train on decentralized datasets. Unfortunately, participants in such systems are not always reliable: any single peer can jeopardize the entire training run by sending incorrect updates, whether deliberately or by mistake. Training in the presence of such peers requires specialized distributed training algorithms with Byzantine tolerance. These algorithms often sacrifice efficiency by introducing redundant communication or by passing all updates through a trusted server. As a result, they can be infeasible for large-scale distributed deep learning, where models can have billions of parameters. In this work, we propose a novel protocol for secure (Byzantine-tolerant) decentralized training that emphasizes communication efficiency. We rigorously analyze this protocol: in particular, we provide theoretical bounds on its resistance to Byzantine and Sybil attacks and show that it incurs only marginal communication overhead. To demonstrate its practical effectiveness, we conduct large-scale experiments on image classification and language modeling in the presence of Byzantine attackers.
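To make the threat model concrete, the sketch below contrasts naive averaging with coordinate-wise median, a classic Byzantine-robust aggregation rule (not the protocol proposed in this work): a single malicious peer can arbitrarily corrupt the mean of the gradient updates, while the median remains close to the honest values. All names and values here are illustrative assumptions.

```python
import numpy as np

def mean_aggregate(updates):
    # naive rule: a single outlier shifts the result arbitrarily far
    return np.mean(updates, axis=0)

def median_aggregate(updates):
    # coordinate-wise median: a classic Byzantine-robust aggregation rule
    return np.median(updates, axis=0)

# three honest peers send updates near 1.0; one Byzantine peer sends garbage
honest = [np.ones(4), np.ones(4) * 1.1, np.ones(4) * 0.9]
byzantine = [np.full(4, 1e6)]
updates = np.stack(honest + byzantine)

print(mean_aggregate(updates))    # dominated by the malicious update
print(median_aggregate(updates))  # stays close to the honest updates
```

Robust rules like this (or Krum, trimmed mean, etc.) are what "Byzantine tolerance" typically refers to; the abstract's point is that achieving this without redundant communication or a trusted server is the hard part at billion-parameter scale.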