Distributed adaptive stochastic gradient methods have been widely used for large-scale nonconvex optimization, such as training deep learning models. However, their communication complexity for finding $\varepsilon$-stationary points has rarely been analyzed in the nonconvex setting. In this work, we present a novel communication-efficient distributed Adam in the parameter-server model for stochastic nonconvex optimization, dubbed {\em Efficient-Adam}. Specifically, we incorporate a two-way quantization scheme into Efficient-Adam to reduce the communication cost between the workers and the server. Simultaneously, we adopt a two-way error-feedback strategy to compensate for the biases introduced by the two-way quantization on the server and the workers, respectively. In addition, we establish the iteration complexity of the proposed Efficient-Adam for a class of quantization operators, and further characterize its communication complexity between the server and the workers when an $\varepsilon$-stationary point is reached. Finally, we apply Efficient-Adam to solve a toy stochastic convex optimization problem and to train deep learning models on real-world vision and language tasks. Extensive experiments, together with the theoretical guarantees, justify the merits of Efficient-Adam.
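To make the two-way quantization and error-feedback mechanism described above concrete, below is a minimal NumPy sketch of one communication round. It assumes a simple uniform quantizer, omits Adam's bias correction, and uses illustrative names ({\tt quantize}, {\tt Worker}, {\tt Server}); these choices are assumptions for exposition, not the paper's exact algorithm or interface.

\begin{verbatim}
import numpy as np

def quantize(v, levels=16):
    # Stand-in uniform quantizer; the paper allows a general class of
    # quantization operators, so this exact choice is an assumption.
    scale = np.max(np.abs(v)) + 1e-12
    return np.round(v / scale * levels) / levels * scale

class Worker:
    def __init__(self, dim):
        self.err = np.zeros(dim)        # worker-side error-feedback buffer

    def upload(self, grad):
        corrected = grad + self.err     # add residual left from earlier rounds
        msg = quantize(corrected)       # compress before sending to the server
        self.err = corrected - msg      # store the new quantization residual
        return msg

class Server:
    def __init__(self, dim, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.m, self.v = np.zeros(dim), np.zeros(dim)
        self.err = np.zeros(dim)        # server-side error-feedback buffer
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps

    def broadcast(self, msgs):
        g = np.mean(msgs, axis=0)       # aggregate compressed worker gradients
        self.m = self.beta1 * self.m + (1 - self.beta1) * g
        self.v = self.beta2 * self.v + (1 - self.beta2) * g * g
        step = self.lr * self.m / (np.sqrt(self.v) + self.eps)  # Adam-style step
        corrected = step + self.err     # error feedback on the downlink as well
        msg = quantize(corrected)       # compress the update before broadcasting
        self.err = corrected - msg
        return msg                      # each worker then sets x <- x - msg

# Toy usage on f(x) = 0.5 * ||x||^2 with noisy gradients (grad = x + noise).
dim, n_workers = 10, 4
x = np.ones(dim)
workers, server = [Worker(dim) for _ in range(n_workers)], Server(dim)
for _ in range(200):
    msgs = [w.upload(x + 0.01 * np.random.randn(dim)) for w in workers]
    x -= server.broadcast(msgs)
\end{verbatim}

The key design point illustrated here is that each side keeps its own residual buffer: the quantization error is not discarded but carried into the next round, which is what controls the bias introduced by compressing both the uplink gradients and the downlink updates.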