Task allocation for decentralized training in heterogeneous environment

The demand for large-scale deep learning is increasing, and distributed training is the current mainstream solution. Ring AllReduce is widely used as a decentralized data-parallel algorithm. In a heterogeneous environment, however, every worker processes the same amount of data, so faster workers spend a large amount of time waiting for slower ones. The algorithm therefore adapts poorly to heterogeneous clusters, and resources are underused. In this paper, we design and implement a static allocation algorithm. The dataset is manually allocated to each worker, and samples are drawn proportionally for training, which speeds up network training in a heterogeneous environment. We verify the convergence of the network model under this algorithm, and its effect on training speed, on a single machine with multiple cards and on multiple machines with multiple cards. Having established feasibility, we propose a self-adaptive allocation algorithm that lets each machine find the amount of data that suits it in the current environment. The self-adaptive allocation algorithm reduces training time by nearly one-third to one-half compared with allocating the same proportion to every worker. To better show the applicability of the algorithm in heterogeneous clusters, we replace a poorly performing worker with a well-performing worker, or add a poorly performing worker to the heterogeneous cluster. Experimental results show that training time decreases as overall performance improves, which indicates that resources are fully used. Furthermore, the algorithm is suitable not only for straggler problems but also for most heterogeneous situations. It can be used as a plug-in for AllReduce and its variants.
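The abstract does not give the exact allocation rule, so the following is only a minimal sketch of the general idea, assuming each worker's share of a global batch is made proportional to its measured throughput. The names `proportional_split` and `rebalance` are hypothetical and are not taken from the paper.

```python
from typing import List


def proportional_split(total: int, weights: List[float]) -> List[int]:
    """Split `total` samples across workers in proportion to `weights`.

    Largest-remainder rounding keeps the shares summing exactly to `total`.
    """
    if total < 0 or not weights or any(w <= 0 for w in weights):
        raise ValueError("need total >= 0 and strictly positive weights")
    scale = total / sum(weights)
    exact = [w * scale for w in weights]
    shares = [int(x) for x in exact]  # floor of each exact share
    # Give the leftover samples to the workers with the largest fractional parts.
    leftover = total - sum(shares)
    order = sorted(range(len(weights)),
                   key=lambda i: exact[i] - shares[i],
                   reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


def rebalance(shares: List[int], step_times: List[float]) -> List[int]:
    """One adaptive step (assumed rule): re-split by measured throughput.

    Throughput is estimated as samples processed per second in the last step.
    A worker holding zero samples is treated as holding one so that its
    weight stays strictly positive.
    """
    throughput = [max(s, 1) / t for s, t in zip(shares, step_times)]
    return proportional_split(sum(shares), throughput)


if __name__ == "__main__":
    # Static allocation: the third worker is assumed to be twice as fast.
    print(proportional_split(256, [1.0, 1.0, 2.0]))
    # Adaptive step: an equal split where the second worker took twice as long.
    print(rebalance([100, 100], [1.0, 2.0]))
```

In this sketch, an equal split of 200 samples in which the second worker needs twice as long per step is rebalanced to `[133, 67]`, after which both workers would take roughly the same time per step. Repeating `rebalance` with fresh timings is one simple way to track a changing environment; the paper's own self-adaptive procedure may differ.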
