Distributed stochastic gradient descent (SGD) is widely used in large-scale deep learning, and the gradient aggregation method is vital to the training scalability of a distributed deep learning system. Collective communication such as AllReduce has been widely adopted in distributed SGD to reduce communication time. However, AllReduce consumes substantial bandwidth even though gradients are often sparse: many gradient values are zero and should be compressed efficiently to save bandwidth. To reduce the communication overhead of sparse gradients, we propose Sparse-Sketch Reducer (S2 Reducer), a novel sketch-based sparse gradient aggregation method with convergence guarantees. S2 Reducer reduces the communication cost by compressing only the non-zero gradients with a count-sketch and a bitmap, and it enables efficient AllReduce operations for parallel SGD training. We perform an extensive evaluation against four state-of-the-art methods on five training models. Our results show that S2 Reducer converges to the same accuracy, reduces sparse communication overhead by 81\%, and achieves a 1.8$\times$ speedup over state-of-the-art approaches.
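For illustration only, the minimal sketch below (not the authors' implementation) shows one way a sparse gradient could be compressed into a count-sketch of its non-zero values plus a bitmap of their positions; both structures are linear, so element-wise summation across workers (e.g., via AllReduce) merges them directly. The class and function names, sketch dimensions, and the NumPy-based encoding are assumptions made for this example.

\begin{verbatim}
# Illustrative sketch, assuming a count-sketch + bitmap encoding of
# sparse gradients; not the paper's actual S2 Reducer code.
import numpy as np

class CountSketch:
    """Count-sketch over a fixed coordinate space of size `dim`."""
    def __init__(self, rows, cols, dim, seed=0):
        rng = np.random.default_rng(seed)   # same seed on every worker -> identical hashes
        self.rows, self.cols = rows, cols
        self.bucket = rng.integers(0, cols, size=(rows, dim))  # coordinate -> bucket, per row
        self.sign = rng.choice([-1.0, 1.0], size=(rows, dim))  # random +/-1 sign, per row
        self.table = np.zeros((rows, cols))

    def insert(self, idx, vals):
        # Accumulate each (index, value) pair into every row of the table.
        for r in range(self.rows):
            np.add.at(self.table[r], self.bucket[r, idx], self.sign[r, idx] * vals)

    def query(self, idx):
        # Median-of-rows estimate of the values at the given coordinates.
        est = [self.sign[r, idx] * self.table[r, self.bucket[r, idx]]
               for r in range(self.rows)]
        return np.median(np.stack(est), axis=0)

def compress(grad, rows=5, cols=256, seed=0):
    bitmap = grad != 0                      # bitmap of non-zero positions
    idx = np.flatnonzero(bitmap)
    sketch = CountSketch(rows, cols, grad.size, seed)
    sketch.insert(idx, grad[idx])           # only non-zero values enter the sketch
    return bitmap, sketch                   # both merge by element-wise addition (AllReduce-friendly)

def decompress(bitmap, sketch):
    grad = np.zeros(bitmap.size)
    idx = np.flatnonzero(bitmap)
    grad[idx] = sketch.query(idx)           # approximate recovery of the non-zero values
    return grad
\end{verbatim}

In this illustrative setup, the shared random seed makes all workers build identical hash functions, which is what allows their sketch tables and bitmaps to be summed element-wise; the merged bitmap records which coordinates to recover, and the merged count-sketch answers approximate value queries for those coordinates.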