We study distributed optimization methods based on the {\em local training (LT)} paradigm: achieving communication efficiency by performing richer local gradient-based training on the clients before parameter averaging. Looking back at the progress of the field, we {\em identify 5 generations of LT methods}: 1) heuristic, 2) homogeneous, 3) sublinear, 4) linear, and 5) accelerated. The 5${}^{\rm th}$ generation, initiated by the ProxSkip method of Mishchenko, Malinovsky, Stich and Richt\'{a}rik (2022) and its analysis, is characterized by the first theoretical confirmation that LT is a communication acceleration mechanism. Inspired by this recent progress, we contribute to the 5${}^{\rm th}$ generation of LT methods by showing that it is possible to enhance them further using {\em variance reduction}. While all previous theoretical results for LT methods ignore the cost of local work altogether and are framed purely in terms of the number of communication rounds, we show that our methods can be substantially faster than the state-of-the-art method ProxSkip in terms of the {\em total training cost}, both in theory and in practice, in the regime in which local computation is sufficiently expensive. We characterize this threshold theoretically, and confirm our theoretical predictions with empirical results.
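As an informal reading of the total-cost claim (the notation below is illustrative and is not taken from the paper's formal statements), one may account for both communication and local work via
\[
\mathcal{C}_{\rm total} \;\approx\; R \left( c_{\rm comm} + \tau\, c_{\rm comp} \right),
\]
where $R$ is the number of communication rounds, $c_{\rm comm}$ is the cost of one communication round, $\tau$ is the number of local steps performed per round, and $c_{\rm comp}$ is the cost of one local gradient step. In this accounting, a method that makes the local steps cheaper (e.g., via variance reduction) can win over a method optimized for $R$ alone once the ratio $c_{\rm comp}/c_{\rm comm}$ exceeds a problem-dependent threshold, which is the regime referred to above.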