We introduce ProxSkip -- a surprisingly simple and provably efficient method for minimizing the sum of a smooth ($f$) and an expensive nonsmooth proximable ($\psi$) function. The canonical approach to solving such problems is via the proximal gradient descent (ProxGD) algorithm, which is based on the evaluation of the gradient of $f$ and the prox operator of $\psi$ in each iteration. In this work we are specifically interested in the regime in which the evaluation of prox is costly relative to the evaluation of the gradient, which is the case in many applications. ProxSkip allows the expensive prox operator to be skipped in most iterations: while its iteration complexity is $\mathcal{O}\left(\kappa \log \frac{1}{\varepsilon}\right)$, where $\kappa$ is the condition number of $f$, it requires only $\mathcal{O}\left(\sqrt{\kappa} \log \frac{1}{\varepsilon}\right)$ prox evaluations. Our main motivation comes from federated learning, where evaluation of the gradient operator corresponds to taking a local GD step independently on all devices, and evaluation of prox corresponds to (expensive) communication in the form of gradient averaging. In this context, ProxSkip offers an effective acceleration of communication complexity. Unlike other local gradient-type methods, such as FedAvg, SCAFFOLD, S-Local-GD and FedLin, whose theoretical communication complexity is worse than, or at best matching, that of vanilla GD in the heterogeneous data regime, we obtain a provable and large improvement without any heterogeneity-bounding assumptions.
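To make the baseline concrete, below is a minimal sketch of the canonical ProxGD iteration described above, on a toy lasso instance: $f(x) = \frac{1}{2}\|Ax - b\|^2$ (smooth) and $\psi(x) = \lambda \|x\|_1$ (nonsmooth but proximable, with a closed-form soft-thresholding prox). The problem data and parameter names are illustrative assumptions, not from the paper; note that ProxGD pays the prox cost on every one of its $\mathcal{O}(\kappa \log \frac{1}{\varepsilon})$ iterations, which is exactly what ProxSkip avoids by invoking the prox only occasionally.

```python
def prox_l1(x, step, lam):
    """Prox of step * lam * ||.||_1: componentwise soft-thresholding."""
    t = step * lam
    return [max(abs(v) - t, 0.0) * (1.0 if v >= 0 else -1.0) for v in x]

def grad_f(x, A, b):
    """Gradient of 0.5 * ||Ax - b||^2, i.e. A^T (Ax - b)."""
    m, n = len(b), len(x)
    r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
    return [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]

def prox_gd(A, b, lam, step, iters):
    """Canonical ProxGD: gradient step on f, then prox of psi, every iteration."""
    x = [0.0] * len(A[0])
    for _ in range(iters):
        g = grad_f(x, A, b)
        y = [xi - step * gi for xi, gi in zip(x, g)]
        x = prox_l1(y, step, lam)  # prox is evaluated in EVERY iteration;
        # ProxSkip instead calls the prox only with small probability per
        # iteration (~1/sqrt(kappa)), matching the stated prox complexity.
    return x
```

For instance, with $A = I$, $b = (1, 0.05)$, $\lambda = 0.1$, the minimizer is the soft-thresholded vector $(0.9, 0)$, which `prox_gd` recovers after a modest number of iterations.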