This paper focuses on mitigating the impact of stragglers in distributed learning systems. Unlike existing schemes, which are designed for a fixed number of stragglers, we develop a new scheme called Adaptive Gradient Coding (AGC) that flexibly tolerates a varying number of stragglers. Our scheme achieves an optimal tradeoff among computation load, straggler tolerance, and communication cost. In particular, it allows the communication cost to be minimized according to the real-time number of stragglers in practical environments. Implementations on Amazon EC2 clusters using Python with the mpi4py package verify its flexibility in several situations.