In this paper we prove that Local (S)GD (or FedAvg) can optimize deep neural networks with the Rectified Linear Unit (ReLU) activation function in polynomial time. Despite the established convergence theory of Local SGD for optimizing general smooth functions in communication-efficient distributed optimization, its convergence on non-smooth ReLU networks still eludes full theoretical understanding. The key property used in many analyses of Local SGD on smooth functions is gradient Lipschitzness, which ensures that the gradients at the local models do not drift far from the gradient at the averaged model. However, this desirable property does not hold for networks with the non-smooth ReLU activation function. We show that, even though ReLU networks do not admit the gradient Lipschitzness property, the difference between the gradients at the local models and at the averaged model does not grow too large under the dynamics of Local SGD. We validate our theoretical results via extensive experiments. This work is the first to show the convergence of Local SGD on non-smooth functions, and sheds light on the optimization theory of federated training of deep neural networks.
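To make the algorithmic setting concrete, the following is a minimal sketch of Local SGD (FedAvg) on a one-hidden-layer ReLU network. It is an illustration only, not the paper's exact construction: the number of workers K, the local step count H, the synthetic regression data, the squared loss, the fixed second layer, and the full-batch local updates are all assumptions made for the sketch.

```python
# Minimal sketch of Local SGD (FedAvg) on a one-hidden-layer ReLU network.
# Hypothetical setup: K workers, H local steps per round, synthetic data,
# squared loss, fixed second-layer weights, full-batch local gradient steps.
import numpy as np

rng = np.random.default_rng(0)

K, H, T, lr = 4, 5, 20, 0.05          # workers, local steps, rounds, step size
d, m, n = 10, 32, 64                  # input dim, hidden width, samples per worker
data = [(rng.normal(size=(n, d)), rng.normal(size=n)) for _ in range(K)]

def forward(W, a, X):
    """One-hidden-layer ReLU network: f(x) = a^T relu(W x)."""
    return np.maximum(X @ W.T, 0.0) @ a

def grad_W(W, a, X, y):
    """Gradient of the average squared loss with respect to the hidden weights W."""
    pre = X @ W.T                      # pre-activations, shape (n, m)
    err = np.maximum(pre, 0.0) @ a - y # residuals, shape (n,)
    # dL/dW = mean_i err_i * (a * 1[pre_i > 0]) x_i^T
    return ((err[:, None] * (pre > 0)) * a).T @ X / len(y)

# Shared global initialization; second layer fixed, as in many ReLU-network analyses.
W_avg = rng.normal(scale=1.0 / np.sqrt(m), size=(m, d))
a = rng.choice([-1.0, 1.0], size=m)

for t in range(T):                     # communication rounds
    local_Ws = []
    for k in range(K):                 # each worker starts from the averaged model
        W = W_avg.copy()
        X, y = data[k]
        for _ in range(H):             # H local gradient steps on the worker's data
            W -= lr * grad_W(W, a, X, y)
        local_Ws.append(W)
    W_avg = np.mean(local_Ws, axis=0)  # server averages the local models

    loss = np.mean([np.mean((forward(W_avg, a, X) - y) ** 2) / 2 for X, y in data])
    print(f"round {t:2d}  average loss {loss:.4f}")
```

The quantity analyzed in the paper corresponds, in this sketch, to how far grad_W evaluated at each local W drifts from grad_W evaluated at W_avg between averaging steps; gradient Lipschitzness would bound this drift directly, whereas the ReLU network requires the argument developed in the paper.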