Training large neural networks requires distributing learning across multiple workers, where the cost of communicating gradients can be a significant bottleneck. signSGD alleviates this problem by transmitting just the sign of each minibatch stochastic gradient. We prove that it can get the best of both worlds: compressed gradients and SGD-level convergence rate. signSGD can exploit mismatches between L1 and L2 geometry: when noise and curvature are much sparser than the gradients, signSGD is expected to converge at the same rate as, or faster than, full-precision SGD. Measurements of the L1 versus L2 geometry of real networks support our theoretical claims, and we find that the momentum counterpart of signSGD is able to match the accuracy and convergence speed of Adam on deep ImageNet models. We extend our theory to the distributed setting, where the parameter server uses majority vote to aggregate gradient signs from each worker, enabling 1-bit compression of worker-server communication in both directions. Using a theorem by Gauss, we prove that the non-convex convergence rate of majority vote matches that of distributed SGD. Thus, there is great promise for sign-based optimisation schemes to achieve both communication efficiency and high accuracy.
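To make the scheme concrete, below is a minimal NumPy sketch of the sign-and-majority-vote step described above: each worker compresses its stochastic gradient to signs, the server aggregates by elementwise majority vote, and a 1-bit result is broadcast back. The learning rate, number of workers, and toy quadratic objective are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def sign_sgd_majority_vote_step(params, grads, lr=0.01):
    """One distributed step of signSGD with majority vote (illustrative sketch).

    Workers send sign(gradient) to the server (1 bit per coordinate);
    the server returns the sign of the summed signs (the majority vote);
    all workers then descend along that voted sign direction.
    """
    # Worker -> server: compress each stochastic gradient to its sign.
    worker_signs = [np.sign(g) for g in grads]
    # Server: majority vote is the sign of the elementwise sum of worker signs.
    vote = np.sign(np.sum(worker_signs, axis=0))
    # Server -> workers: broadcast the 1-bit vote; workers apply the update.
    return params - lr * vote

# Toy usage: 5 workers with noisy gradients of a quadratic centred at the origin.
rng = np.random.default_rng(0)
params = np.array([1.0, -2.0, 0.5])
for _ in range(200):
    grads = [params + 0.1 * rng.standard_normal(params.shape) for _ in range(5)]
    params = sign_sgd_majority_vote_step(params, grads)
print(params)  # parameters should approach the origin
```

In this sketch the vote cancels independent sign flips caused by gradient noise across workers, which is the intuition behind matching the variance reduction of full-precision distributed SGD.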