In distributed machine learning (DML), the training data is distributed across multiple worker nodes so that the underlying training can be performed in parallel. A major factor affecting the performance of DML algorithms is the presence of stragglers: nodes that are extremely slow in completing their tasks, which leads to under-utilization of the training data stored on them. Gradient coding mitigates the impact of stragglers by adding sufficient redundancy to the data. Gradient coding and other straggler mitigation schemes assume that the straggler behavior of the worker nodes is identical. Our experiments on an Amazon AWS cluster, however, suggest otherwise: we observe a correlation in straggler behavior across iterations. To model this, we introduce a heterogeneous straggler model in which nodes are categorized into two classes, slow and active. To better utilize the training data stored on slow nodes, we modify existing gradient coding schemes by shuffling the training data among workers. Our results (both simulations and cloud experiments) show a remarkable improvement with shuffling over existing schemes. We also provide a theoretical analysis of the proposed schemes that justifies their utility.
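To make the gradient coding idea concrete, the following is a minimal sketch of the standard fractional-repetition construction (in the style of Tandon et al.), not the modified scheme proposed here; it assumes n workers, tolerance of up to s stragglers, (s+1) dividing n, and scalar per-partition gradients for illustration.

```python
import numpy as np

# Illustrative sketch of fractional-repetition gradient coding (an assumption
# of the standard construction, not the authors' shuffling-based scheme).

def assign_partitions(n, s):
    """Group workers into n/(s+1) groups; each group shares the same s+1 partitions."""
    assignment = {}
    for g in range(n // (s + 1)):
        parts = list(range(g * (s + 1), (g + 1) * (s + 1)))
        for w in parts:
            assignment[w] = parts
    return assignment

def worker_message(w, assignment, partial_grads):
    """Each worker sends the sum of the gradients of its assigned partitions."""
    return sum(partial_grads[p] for p in assignment[w])

def master_decode(responses, n, s):
    """Recover the full gradient from any n-s responses by taking one per group."""
    groups_done, total = set(), 0.0
    for w, msg in responses.items():
        g = w // (s + 1)
        if g not in groups_done:
            groups_done.add(g)
            total += msg
    assert len(groups_done) == n // (s + 1), "too many stragglers in one group"
    return total

# Toy run: n=6 workers, s=1 straggler tolerated, worker 3 straggles.
n, s = 6, 1
partial_grads = np.arange(1.0, n + 1.0)      # hypothetical per-partition gradients
assignment = assign_partitions(n, s)
responses = {w: worker_message(w, assignment, partial_grads)
             for w in range(n) if w != 3}
print(master_decode(responses, n, s), partial_grads.sum())  # both print 21.0
```

Under the heterogeneous straggler model studied in this work, the data placement above would additionally be shuffled across iterations so that partitions held by persistently slow nodes still contribute to training.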