Recent years have witnessed a growing list of systems for distributed data-parallel training. Existing systems largely fit into two paradigms: parameter server and MPI-style collective operations. On the algorithmic side, researchers have proposed a wide range of techniques that lower communication cost via system relaxations: quantization, decentralization, and communication delay. However, most, if not all, existing systems rely only on standard synchronous and asynchronous stochastic gradient (SG) based optimization and therefore cannot take advantage of all the optimizations that the machine learning community has developed recently. Given this emerging gap between the current landscapes of systems and theory, we build BAGUA, a communication framework whose design goal is to provide a system abstraction that is both flexible and modular enough to support state-of-the-art system relaxation techniques for distributed training. Powered by this new system design, BAGUA can implement and extend a variety of state-of-the-art distributed learning algorithms. In a production cluster with up to 16 machines (128 GPUs), BAGUA outperforms PyTorch-DDP, Horovod, and BytePS in end-to-end training time by a significant margin (up to 1.95×) across a diverse range of tasks. Moreover, we conduct a rigorous tradeoff exploration showing that different algorithms and system relaxations achieve the best performance under different network conditions.
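To make one of the system relaxations named above concrete, the following is a minimal sketch of gradient quantization, assuming a simple 8-bit min-max scheme. It is not BAGUA's API or the specific quantizer evaluated in this work; the function names `quantize_8bit` and `dequantize_8bit` are hypothetical and chosen only for illustration.

```python
import random


def quantize_8bit(grad):
    """Map each float in `grad` to an integer code in [0, 255] (min-max scaling).

    Hypothetical helper for illustration; not part of any real library.
    """
    lo, hi = min(grad), max(grad)
    # Fall back to a scale of 1.0 when all entries are equal (hi == lo).
    scale = (hi - lo) / 255 or 1.0
    codes = [round((g - lo) / scale) for g in grad]
    return codes, lo, scale


def dequantize_8bit(codes, lo, scale):
    """Reconstruct approximate floats from the 8-bit codes."""
    return [lo + c * scale for c in codes]


# A stand-in "gradient": 1000 Gaussian samples with a fixed seed.
random.seed(0)
grad = [random.gauss(0.0, 1.0) for _ in range(1000)]

# A worker would send `codes` plus the two scalars instead of raw floats.
codes, lo, scale = quantize_8bit(grad)
restored = dequantize_8bit(codes, lo, scale)

# Rounding to the nearest code bounds the per-entry error by half a step.
max_err = max(abs(a - b) for a, b in zip(grad, restored))
```

Under this assumed scheme each entry is sent as one byte rather than a 32-bit float, so the payload shrinks by roughly a factor of four (ignoring the two scalars), at the price of a per-entry error of at most `scale / 2`. Whether that tradeoff pays off depends on the network conditions, which is the kind of question the tradeoff exploration above addresses.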