The celebrated FedAvg algorithm of McMahan et al. (2017) is based on three components: client sampling (CS), data sampling (DS) and local training (LT). While the first two are reasonably well understood, the third component, whose role is to reduce the number of communication rounds needed to train the model, has resisted all attempts at a satisfactory theoretical explanation. Malinovsky et al. (2022) identified four distinct generations of LT methods based on the quality of the provided theoretical communication complexity guarantees. Despite much progress in this area, none of the existing works were able to show that it is theoretically better to employ multiple local gradient-type steps (i.e., to engage in LT) than a single local gradient-type step in the important heterogeneous data regime. In a recent breakthrough embodied in their ProxSkip method and its theoretical analysis, Mishchenko et al. (2022) showed that LT indeed leads to provable communication acceleration for arbitrarily heterogeneous data, thus jump-starting the $5^{\rm th}$ generation of LT methods. However, while these latest generation LT methods are compatible with DS, none of them support CS. We resolve this open problem in the affirmative. In order to do so, we had to base our algorithmic and theoretical development on new foundations.
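To make the three components concrete, the following is a minimal sketch (not taken from the paper) of FedAvg on a toy least-squares problem, marking where client sampling (CS), data sampling (DS), and local training (LT) enter. All names, the objective, and parameter values are illustrative assumptions.

```python
# Minimal illustrative sketch of FedAvg: CS, DS, and LT on a toy problem.
import numpy as np

rng = np.random.default_rng(0)
d, n_clients, n_data = 10, 20, 50

# Heterogeneous client data: client i holds (A_i, b_i) defining the local
# loss f_i(x) = (1/2m) * ||A_i x - b_i||^2 (the shift makes clients differ).
A = [rng.normal(size=(n_data, d)) + 0.1 * i for i in range(n_clients)]
b = [rng.normal(size=n_data) for _ in range(n_clients)]

def stochastic_grad(i, x, batch):
    """DS: minibatch gradient of client i's local loss."""
    Ai, bi = A[i][batch], b[i][batch]
    return Ai.T @ (Ai @ x - bi) / len(batch)

def fedavg(rounds=100, cohort=5, local_steps=10, lr=0.01, batch_size=10):
    x = np.zeros(d)                                        # server model
    for _ in range(rounds):                                # one communication round
        cohort_ids = rng.choice(n_clients, cohort, replace=False)      # CS
        updates = []
        for i in cohort_ids:
            y = x.copy()
            for _ in range(local_steps):                   # LT: many local steps
                batch = rng.choice(n_data, batch_size, replace=False)  # DS
                y -= lr * stochastic_grad(i, y, batch)
            updates.append(y)
        x = np.mean(updates, axis=0)                       # server averages updates
    return x

print("final parameter norm:", np.linalg.norm(fedavg()))
```

The point of LT in this sketch is that `local_steps > 1` lets each sampled client make more progress per communication round; the paper's question is whether such multiple local steps can be justified theoretically together with client sampling.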