Hierarchical SGD (H-SGD) has emerged as a new distributed SGD algorithm for multi-level communication networks. In H-SGD, before each global aggregation, workers send their updated local models to local servers for aggregation. Despite recent research efforts, the effect of local aggregation on global convergence still lacks theoretical understanding. In this work, we first introduce new notions of "upward" and "downward" divergences. We then use them in a novel analysis to obtain a worst-case convergence upper bound for two-level H-SGD with non-IID data, non-convex objective functions, and stochastic gradients. By extending this result to the case with random grouping, we observe that the convergence upper bound of H-SGD lies between the upper bounds of two single-level local SGD settings, whose numbers of local iterations equal the local and global update periods of H-SGD, respectively. We refer to this as the "sandwich behavior". Furthermore, we extend our analytical approach based on "upward" and "downward" divergences to study convergence in the general case of H-SGD with more than two levels, where the "sandwich behavior" still holds. Our theoretical results provide key insights into why local aggregation can be beneficial in improving the convergence of H-SGD.
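To make the two-level update schedule concrete, the following is a minimal sketch of H-SGD on a toy problem, not the setup analyzed in the paper. Everything named here is an illustrative assumption: the function `h_sgd`, its parameters, the quadratic per-worker objectives, and the choice to average all workers directly at a global aggregation (which matches averaging the local servers when groups have equal size).

```python
import numpy as np


def h_sgd(targets, groups, local_period, global_period, steps,
          lr=0.1, noise=0.0, seed=0):
    """Toy two-level H-SGD (illustrative sketch, hypothetical names).

    Worker i minimizes f_i(x) = 0.5 * ||x - targets[i]||^2, so distinct
    targets stand in for non-IID data. `groups` lists the worker indices
    attached to each local server (assumed to be of equal size).
    """
    rng = np.random.default_rng(seed)
    n_workers, dim = targets.shape
    models = np.zeros((n_workers, dim))

    for t in range(1, steps + 1):
        # One local SGD step per worker with an optionally noisy gradient.
        grads = (models - targets) + noise * rng.standard_normal((n_workers, dim))
        models = models - lr * grads

        if t % global_period == 0:
            # Global aggregation: every worker receives the global average.
            models[:] = models.mean(axis=0)
        elif t % local_period == 0:
            # Local aggregation: each group averages only its own workers.
            for g in groups:
                models[g] = models[g].mean(axis=0)

    # Report the average model across all workers.
    return models.mean(axis=0)


if __name__ == "__main__":
    toy_targets = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [2.0, 4.0]])
    result = h_sgd(toy_targets, groups=[[0, 1], [2, 3]],
                   local_period=2, global_period=8, steps=200, noise=0.05)
    print(result)
```

In this sketch, setting `local_period` equal to `global_period` gives single-level local SGD with that many local iterations, since every scheduled aggregation is then global. Setting `local_period` to 1 averages each group after every step. These settings offer one informal way to experiment with the regimes that the "sandwich behavior" compares.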