Distributed Deep Learning in Open Collaborations

Michael Diskin, Alexey Bukhtiyarov, Max Ryabinin, Lucile Saulnier, Quentin Lhoest, Anton Sinitsin, Dmitry Popov, Dmitry Pyrkin, Maxim Kashirin, Alexander Borzunov, Albert Villanova del Moral, Denis Mazur, Ilia Kobelev, Yacine Jernite, Thomas Wolf, Gennady Pekhimenko

From arXiv. Accepted to the Conference on Neural Information Processing Systems (NeurIPS) 2021. 32 pages, 10 figures. Code: https://github.com/yandex-research/DeDLOC

Modern deep learning applications require increasingly more compute to train state-of-the-art models. To address this demand, large corporations and institutions use dedicated High-Performance Computing clusters, whose construction and maintenance are both environmentally costly and well beyond the budget of most organizations. As a result, some research directions become the exclusive domain of a few large industrial and even fewer academic actors. To alleviate this disparity, smaller groups may pool their computational resources and run collaborative experiments that benefit all participants. This paradigm, known as grid- or volunteer computing, has seen successful applications in numerous scientific areas. However, using this approach for machine learning is difficult due to high latency, asymmetric bandwidth, and several challenges unique to volunteer computing. In this work, we carefully analyze these constraints and propose a novel algorithmic framework designed specifically for collaborative training. We demonstrate the effectiveness of our approach for SwAV and ALBERT pretraining in realistic conditions and achieve performance comparable to traditional setups at a fraction of the cost. Finally, we provide a detailed report of successful collaborative language model pretraining with 40 participants.
