Many big data algorithms executed on MapReduce-like systems have a shuffle phase that often dominates the overall job execution time. Recent work has demonstrated schemes in which the communication load of the shuffle phase can be traded off against the computation load of the map phase. In this work, we focus on a class of distributed algorithms, broadly used in deep learning, in which intermediate computations of the same task can be combined. Even though prior techniques reduce the communication load significantly, they require a number of jobs that grows exponentially in the system parameters. This limitation is critical and may diminish the load gains as the algorithm scales. We propose a new scheme that achieves the same communication load as the state of the art while ensuring that both the number of jobs and the number of subfiles into which the data set must be split remain small.