The most popular framework for distributed training of machine learning models is the (synchronous) parameter server (PS). This paradigm consists of $n$ workers, which iteratively compute updates of the model parameters, and a stateful PS, which waits for and aggregates all updates to generate a new estimate of the model parameters and sends it back to the workers for a new iteration. Transient computation slowdowns or transmission delays can intolerably lengthen the time of each iteration. An efficient way to mitigate this problem is to let the PS wait only for the fastest $n-b$ updates before generating the new parameters. The slowest $b$ workers are called backup workers. The optimal number $b$ of backup workers depends on the cluster configuration and workload, but also (as we show in this paper) on the hyper-parameters of the learning algorithm and the current stage of the training. We propose DBW, an algorithm that dynamically decides the number of backup workers during the training process to maximize the convergence speed at each iteration. Our experiments show that DBW 1) removes the need to tune $b$ through preliminary time-consuming experiments, and 2) makes the training up to a factor of $3$ faster than the optimal static configuration.
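To make the backup-workers rule concrete, the following is a minimal toy sketch (not from the paper) of one synchronous PS round that aggregates only the fastest $n-b$ updates; the worker delays and gradients are simulated, and all names (`ps_round`, the exponential delay model) are illustrative assumptions.

```python
import numpy as np

def ps_round(n, b, grads, arrival_times):
    """One synchronous PS round with b backup workers:
    aggregate only the n - b updates that arrive first."""
    fastest = np.argsort(arrival_times)[: n - b]      # indices of the n - b quickest workers
    round_time = arrival_times[fastest].max()         # PS waits only for the slowest of these
    avg_grad = np.mean(grads[fastest], axis=0)        # average the selected updates
    return avg_grad, round_time

# Toy usage: 10 workers, 2 backup workers, 4-dimensional "gradients".
rng = np.random.default_rng(0)
n, b = 10, 2
grads = rng.normal(size=(n, 4))                      # one gradient per worker
times = rng.exponential(scale=1.0, size=n)           # hypothetical per-worker delays
update, t = ps_round(n, b, grads, times)
print(f"aggregated {n - b} of {n} updates, round time {t:.2f}")
```

The trade-off DBW navigates is visible even in this sketch: a larger $b$ shortens `round_time` but averages fewer gradients per iteration, so the best choice shifts with the delay distribution and the training stage.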