We consider straggler-resilient learning. In many previous works, e.g., in the coded computing literature, straggling is modeled as random delays that are independent and identically distributed across workers. However, in many practical scenarios, a given worker may straggle over an extended period of time. We propose a latency model that captures this behavior and is substantiated by traces collected on Microsoft Azure, Amazon Web Services (AWS), and a small local cluster. Building on this model, we propose DSAG, a mixed synchronous-asynchronous iterative optimization method, based on the stochastic average gradient (SAG) method, that combines timely and stale results. We also propose a dynamic load-balancing strategy to further reduce the impact of straggling workers. We evaluate DSAG for principal component analysis of a large genomics dataset, cast as a finite-sum optimization problem, and for logistic regression, on a cluster of 100 AWS workers, and find that DSAG is up to about 50% faster than SAG and more than twice as fast as coded computing methods for the particular scenario we consider.
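To make the "timely and stale results" idea concrete, below is a minimal sketch of the SAG-style update that DSAG builds on: a table stores the most recently received gradient for each data block, so a slow worker's stale contribution keeps participating in the average until a fresh result replaces it. This is only the sequential SAG core, not the paper's distributed DSAG method; the least-squares problem, block partition, step size, and all names are illustrative assumptions.

```python
import numpy as np

def sag(grad_fn, x0, n_blocks, lr, n_iters, rng):
    """Minimal SAG sketch: keep the last-seen gradient of each block
    (possibly stale) and step with the average of the stored table.
    DSAG extends this idea to a distributed setting where straggling
    workers return stale gradients asynchronously."""
    x = x0.copy()
    table = np.zeros((n_blocks, x.size))  # last-seen gradient per block
    for _ in range(n_iters):
        i = rng.integers(n_blocks)        # block refreshed this round
        table[i] = grad_fn(x, i)          # fresh result replaces stale one
        x -= lr * table.mean(axis=0)      # average of fresh + stale grads
    return x

# Toy finite-sum least-squares problem (illustrative only).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((200, 10)), rng.standard_normal(200)
blocks = np.array_split(np.arange(200), 10)  # one block per "worker"

def grad_fn(x, i):
    Ai, bi = A[blocks[i]], b[blocks[i]]
    return Ai.T @ (Ai @ x - bi) / len(bi)

x = sag(grad_fn, np.zeros(10), n_blocks=10, lr=0.02, n_iters=3000, rng=rng)
print(np.linalg.norm(A.T @ (A @ x - b)))  # gradient norm; small at optimum
```

In the distributed variant described by the abstract, each table row would be owned by a worker, and the server would apply updates using whichever rows are current, tolerating rows that remain stale while their worker straggles.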