Wall-clock convergence time and communication load are key performance metrics for distributed implementations of stochastic gradient descent (SGD) in parameter server settings. Communication-adaptive distributed Adam (CADA) was recently proposed as a way to reduce communication load via the adaptive selection of workers. However, CADA suffers degraded wall-clock convergence time in the presence of stragglers. This paper proposes a novel scheme named grouping-based CADA (G-CADA) that retains the advantages of CADA in reducing the communication load while increasing robustness to stragglers, at the cost of additional storage at the workers. G-CADA partitions the workers into groups, and all workers in a group are assigned the same data shards. Groups are scheduled adaptively at each iteration, and the server waits only for the fastest worker in each selected group. We provide analysis and experimental results that demonstrate the significant gains of G-CADA over benchmark schemes in wall-clock time, communication load, and computation load.
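The waiting rule described above can be illustrated with a minimal sketch. It assumes that the set of selected groups is given as an input (the adaptive selection rule itself is not reproduced here), and the function names and delay values are hypothetical, chosen only for illustration. The sketch compares the per-iteration waiting time when the server waits for the fastest worker in each selected group against a baseline in which it waits for every worker in those groups.

```python
def fastest_per_group_time(delays_by_group, selected):
    """Iteration time when the server waits only for the fastest worker
    in each selected group (the rule described in the abstract).
    delays_by_group[g] lists the response delays of the workers in group g."""
    return max((min(delays_by_group[g]) for g in selected), default=0.0)


def wait_for_all_time(delays_by_group, selected):
    """Baseline for comparison (an assumption, not from the paper):
    the server waits for every worker in each selected group."""
    return max((max(delays_by_group[g]) for g in selected), default=0.0)


# Hypothetical delays (seconds) for three groups of two workers each.
# The 5.0 and 9.0 entries play the role of stragglers.
delays = [[1.0, 5.0], [2.0, 9.0], [4.0, 3.0]]
selected_groups = [0, 2]  # assumed output of some adaptive scheduler

print(fastest_per_group_time(delays, selected_groups))  # 3.0
print(wait_for_all_time(delays, selected_groups))       # 5.0
```

In this toy case the straggler in group 0 does not delay the iteration, because another worker holding the same data shards responds first. The price is that each shard is stored at every worker of its group.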