In distributed training of deep neural networks, people usually run Stochastic Gradient Descent (SGD) or its variants on each machine and communicate with other machines periodically. However, SGD might converge slowly when training some deep neural networks (e.g., RNNs, LSTMs) because of the exploding gradient issue. Gradient clipping is usually employed to address this issue in the single-machine setting, but exploring this technique in the distributed setting is still in its infancy: it remains unclear whether the gradient clipping scheme can take advantage of multiple machines to enjoy parallel speedup. The main technical difficulty lies in dealing with nonconvex loss functions, non-Lipschitz continuous gradients, and skipped communication rounds simultaneously. In this paper, we explore a relaxed smoothness assumption on the loss landscape, which LSTMs were shown to satisfy in previous works, and design a communication-efficient gradient clipping algorithm. This algorithm can be run on multiple machines, where each machine employs a gradient clipping scheme and communicates with the other machines after multiple steps of gradient-based updates. We prove that our algorithm achieves $O\left(\frac{1}{N\epsilon^4}\right)$ iteration complexity and $O\left(\frac{1}{\epsilon^3}\right)$ communication complexity for finding an $\epsilon$-stationary point in the homogeneous data setting, where $N$ is the number of machines. This indicates that our algorithm enjoys linear speedup and reduced communication rounds. Our proof relies on novel analysis techniques for estimating truncated random variables, which we believe are of independent interest. Our experiments on several benchmark datasets and various scenarios demonstrate that our algorithm indeed exhibits fast convergence in practice and thus validate our theory.
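To make the algorithmic structure described above concrete, the following is a minimal sketch, not the paper's exact method, of local clipped SGD with periodic model averaging: each of $N$ workers takes several clipped gradient steps on its own data before a single communication (averaging) round. The names `grad_fn`, `clip_threshold`, and `local_steps` are illustrative placeholders, not identifiers from the paper.

```python
import numpy as np

def clip(g, gamma):
    """Rescale g so that its norm is at most gamma (standard gradient clipping)."""
    norm = np.linalg.norm(g)
    return g if norm <= gamma else g * (gamma / norm)

def distributed_clipped_sgd(grad_fn, x0, n_workers=4, rounds=100,
                            local_steps=8, lr=0.1, clip_threshold=1.0,
                            rng=None):
    """Sketch of clipped local SGD with periodic averaging (homogeneous data).

    grad_fn(x, rng) is assumed to return a stochastic gradient at x.
    """
    rng = np.random.default_rng() if rng is None else rng
    xs = [x0.copy() for _ in range(n_workers)]   # all workers start from x0
    for _ in range(rounds):
        for i in range(n_workers):
            for _ in range(local_steps):          # communication is skipped here
                g = grad_fn(xs[i], rng)           # stochastic gradient on worker i
                xs[i] -= lr * clip(g, clip_threshold)
        avg = np.mean(xs, axis=0)                 # one communication round: average models
        xs = [avg.copy() for _ in range(n_workers)]
    return np.mean(xs, axis=0)
```

In this sketch, communication happens only once every `local_steps` gradient updates, which is what yields the reduced communication complexity relative to synchronizing after every step.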