Many areas of deep learning benefit from using increasingly larger neural networks trained on public data, as is the case for pre-trained models for NLP and computer vision. Training such models requires a lot of computational resources (e.g., HPC clusters) that are not available to small research groups and independent researchers. One way to address this is for several smaller groups to pool their computational resources together and train a model that benefits all participants. Unfortunately, in this case, any participant can jeopardize the entire training run by sending incorrect updates, whether deliberately or by mistake. Training in the presence of such peers requires specialized distributed training algorithms with Byzantine tolerance. These algorithms often sacrifice efficiency by introducing redundant communication or by passing all updates through a trusted server, making them infeasible for large-scale deep learning, where models can have billions of parameters. In this work, we propose a novel protocol for secure (Byzantine-tolerant) decentralized training that emphasizes communication efficiency.
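To make the threat model concrete, here is a minimal sketch (not the paper's protocol) of why naive gradient averaging breaks under a Byzantine peer, and how a classic robust aggregator such as the coordinate-wise median tolerates it at the cost of collecting every peer's full update. All function names and parameters below are illustrative assumptions, not the authors' API.

```python
# Illustrative sketch only: naive averaging vs. a classic robust
# aggregator (coordinate-wise median). This is NOT the protocol
# proposed in the paper; names here are hypothetical.
import numpy as np

def aggregate_mean(updates):
    """Naive averaging: a single malicious update can shift the
    result arbitrarily far, jeopardizing the whole training run."""
    return np.mean(updates, axis=0)

def aggregate_median(updates):
    """Coordinate-wise median: tolerates a minority of Byzantine
    peers, but requires gathering every peer's full update, an
    overhead that grows with billion-parameter models."""
    return np.median(updates, axis=0)

# Nine honest peers send similar gradients; one peer sends garbage.
rng = np.random.default_rng(0)
honest = rng.normal(loc=1.0, scale=0.1, size=(9, 4))
byzantine = np.full((1, 4), 1e6)  # deliberately incorrect update
updates = np.concatenate([honest, byzantine])

print(aggregate_mean(updates))    # dominated by the malicious update
print(aggregate_median(updates))  # stays close to the honest gradients
```

The gap between these two aggregators illustrates the trade-off the abstract describes: robustness mechanisms like this add communication or trusted-server requirements that do not scale to very large models.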