In scalable machine learning systems, model training is often parallelized over multiple nodes that run without tight synchronization. Most analysis results for the related asynchronous algorithms use an upper bound on the information delays in the system to determine learning rates. Not only are such bounds hard to obtain in advance, but they also result in unnecessarily slow convergence. In this paper, we show that it is possible to use learning rates that depend on the actual time-varying delays in the system. We develop general convergence results for delay-adaptive asynchronous iterations and specialize these to proximal incremental gradient descent and block-coordinate descent algorithms. For each of these methods, we demonstrate how delays can be measured on-line, present delay-adaptive step-size policies, and illustrate their theoretical and practical advantages over the state-of-the-art.
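To make the idea concrete, the following is a minimal sketch, not the paper's exact policy: asynchronous incremental gradient descent on a toy least-squares problem where each update's step size shrinks with the measured delay of the gradient it applies, rather than with a worst-case delay bound. The 1/(delay+1) scaling, the simulated delay model, and names such as max_sim_delay are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
L = np.linalg.norm(A, 2) ** 2 / n      # smoothness constant of the average loss

x = np.zeros(d)
history = [x.copy()]                   # past iterates, so stale gradients can be replayed
max_sim_delay = 5                      # used only to *simulate* staleness, not to set steps

for k in range(2000):
    # Simulate an asynchronous worker: the gradient arriving now was computed
    # at an older iterate; the delay is measured on-line as k - k_read.
    k_read = max(0, k - rng.integers(0, max_sim_delay + 1))
    delay = k - k_read
    i = rng.integers(n)                # incremental (single-component) gradient
    g = A[i] * (A[i] @ history[k_read] - b[i])

    # Delay-adaptive step size: a longer measured delay gives a smaller step.
    gamma = 1.0 / (L * (delay + 1))
    x = x - gamma * g
    history.append(x.copy())

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print("distance to least-squares solution:", np.linalg.norm(x - x_star))
```

In this sketch the step size adapts to the delay actually observed for each gradient, so updates computed from fresh information take larger steps than a fixed worst-case-bound step size would allow.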