Training large neural networks is time-consuming. To speed up the process, distributed training is often used. One of the largest bottlenecks in distributed training is communicating gradients between nodes. Various gradient compression techniques have been proposed to alleviate this communication bottleneck, including topK gradient sparsification, which truncates the gradient to its K largest components before sending it to other nodes. While some authors have investigated topK gradient sparsification in the parameter-server framework by applying topK compression in both the worker-to-server (uplink) and server-to-worker (downlink) directions, the currently accepted belief is that adding extra compression degrades the convergence of the model. We demonstrate, on the contrary, that adding downlink compression can potentially improve the performance of topK sparsification: not only does it reduce the amount of communication per step, but, counter-intuitively, it can also improve the upper bound in the convergence analysis. To show this, we revisit the non-convex convergence analysis of topK stochastic gradient descent (SGD) and extend it from the unidirectional to the bidirectional setting. We also remove a restriction of the previous analysis that requires unrealistically large values of K. We experimentally evaluate bidirectional topK SGD against unidirectional topK SGD and show that models trained with bidirectional topK SGD perform as well as models trained with unidirectional topK SGD while yielding significant communication benefits for large numbers of workers.
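To make the uplink and downlink compression steps concrete, the following is a minimal NumPy sketch of one communication round, not the authors' implementation. It assumes that "largest" means largest in absolute value, that the server simply averages the compressed worker gradients, and that the same K is used in both directions. It omits any error-feedback or memory mechanism. The names `top_k` and `bidirectional_round` are hypothetical.

```python
import numpy as np


def top_k(vec, k):
    """Keep the k largest-magnitude entries of vec and zero out the rest."""
    out = np.zeros_like(vec)
    if k <= 0:
        return out
    k = min(k, vec.size)
    # argpartition places the indices of the k largest magnitudes last.
    idx = np.argpartition(np.abs(vec), -k)[-k:]
    out[idx] = vec[idx]
    return out


def bidirectional_round(worker_grads, k):
    """One round: uplink topK per worker, server average, downlink topK.

    Assumption: plain averaging on the server and the same k in both
    directions; neither detail is taken from the abstract.
    """
    uplink = [top_k(g, k) for g in worker_grads]   # worker-to-server
    averaged = np.mean(uplink, axis=0)             # server aggregation
    return top_k(averaged, k)                      # server-to-worker


# Two workers, K = 2, gradient dimension 4.
grads = [np.array([0.1, -3.0, 0.5, 2.0]), np.array([1.0, 0.2, -4.0, 0.3])]
uplink_only = np.mean([top_k(g, 2) for g in grads], axis=0)
both = bidirectional_round(grads, 2)
print(np.count_nonzero(uplink_only))  # 4: supports of the two workers differ
print(np.count_nonzero(both))         # 2: downlink topK restores sparsity
```

In this toy case the uplink-only average has four nonzero entries because the two workers select different coordinates. In general, the average of n K-sparse vectors can have up to nK nonzeros, which suggests why downlink compression matters most when the number of workers is large.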