Wall-clock convergence time and the number of communication rounds are critical performance metrics in distributed learning under the parameter-server (PS) setting. Synchronous methods converge fast but are not robust to stragglers, whereas asynchronous methods reduce the wall-clock time per round but suffer from a degraded convergence rate due to gradient staleness; it is therefore natural to combine the two to strike a balance. In this work, we develop a novel asynchronous strategy, named adaptive bounded staleness (ABS), that leverages the advantages of both synchronous and asynchronous methods. ABS has two key enablers. First, the number of workers that the PS waits for in each round of gradient aggregation is adaptively selected to balance straggling against staleness. Second, workers with relatively high staleness are required to start a new round of computation to alleviate the negative effect of staleness. Simulation results are provided to demonstrate the superiority of ABS over state-of-the-art schemes in terms of wall-clock time and communication rounds.
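The two mechanisms described above can be illustrated with a toy timing simulation. This is only a sketch under stated assumptions, not the ABS algorithm itself: the abstract does not give the adaptation rule, so the heuristic that adjusts `wait_for` is a placeholder. The exponential compute-time model, the function name `simulate_abs_sketch`, and the parameters `max_staleness` and `target_staleness` are likewise hypothetical. No gradients are computed; the code tracks only wall-clock time and the staleness of each aggregated gradient.

```python
import random


def simulate_abs_sketch(num_workers=8, rounds=200, max_staleness=3,
                        target_staleness=1.0, seed=0):
    """Toy simulation of a PS that waits for a varying number of workers
    per round and restarts workers whose staleness grows too large.

    Returns (wall_clock_time, applied_staleness, wait_for_history).
    """
    rng = random.Random(seed)

    def draw():
        # Assumption: i.i.d. exponential compute times (models stragglers).
        return rng.expovariate(1.0)

    start_version = [0] * num_workers  # model version each worker started from
    finish_time = [draw() for _ in range(num_workers)]
    now = 0.0
    wait_for = max(1, num_workers // 2)
    applied, k_history = [], []

    for version in range(rounds):
        k_history.append(wait_for)
        # The PS waits for the `wait_for` earliest-finishing workers.
        order = sorted(range(num_workers), key=lambda w: finish_time[w])
        arrived = order[:wait_for]
        now = max(now, finish_time[arrived[-1]])
        # Staleness = number of model updates since the worker started.
        round_staleness = [version - start_version[w] for w in arrived]
        applied.extend(round_staleness)

        new_version = version + 1
        for w in range(num_workers):
            too_stale = new_version - start_version[w] > max_staleness
            # Workers that delivered a gradient, and workers that are still
            # computing on a too-old model, restart from the new model.
            if w in arrived or too_stale:
                start_version[w] = new_version
                finish_time[w] = now + draw()

        # Placeholder adaptation (NOT the paper's rule): wait for more
        # workers when aggregated gradients were stale, fewer otherwise.
        mean_staleness = sum(round_staleness) / len(round_staleness)
        wait_for += 1 if mean_staleness > target_staleness else -1
        wait_for = min(num_workers, max(1, wait_for))

    return now, applied, k_history
```

Calling `simulate_abs_sketch()` returns the simulated wall-clock time, the staleness of every aggregated gradient, and the per-round history of `wait_for`. Because too-stale workers are restarted after every update, no aggregated gradient in this sketch has staleness above `max_staleness`.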