Gradient compression (GC) is a promising approach to addressing the communication bottleneck in distributed deep learning (DDL). However, it is challenging to find the optimal compression strategy for applying GC to DDL because of the intricate interactions among tensors. To fully unleash the benefits of GC, two questions must be addressed: 1) How to express all compression strategies and the corresponding interactions among tensors of any DDL training job? 2) How to quickly select a near-optimal compression strategy? In this paper, we propose Espresso to answer these questions. It first designs a decision tree abstraction to express all the compression strategies and develops empirical models that timeline tensor computation, communication, and compression, enabling Espresso to derive the intricate interactions among tensors. It then designs a compression decision algorithm that analyzes tensor interactions to eliminate and prioritize strategies and to optimally offload compression to CPUs. Experimental evaluations show that Espresso improves training throughput over the state-of-the-art compression-enabled system by up to 77% for representative DDL training jobs. Moreover, the time needed to select the compression strategy is measured in milliseconds, and the selected strategy is only a few percent from optimal.
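To make the decision tree abstraction concrete, the sketch below illustrates the idea in Python under simplified assumptions: each tensor independently branches on whether to skip compression, compress on the GPU, or offload compression to the CPU, so the strategy space is a tree with one leaf per full assignment. The `Tensor` class, `estimate_iteration_time` cost model, and all constants are hypothetical stand-ins, not Espresso's actual models or API; the exhaustive search over leaves is exactly what Espresso's elimination and prioritization avoid.

```python
# Illustrative sketch of a decision-tree strategy space; names and
# cost constants are invented for this example.
from dataclasses import dataclass
from itertools import product

# Per-tensor choices: leave uncompressed, compress on GPU, or offload to CPU.
CHOICES = ("none", "gpu", "cpu")

@dataclass
class Tensor:
    name: str
    size_mb: float  # gradient size in megabytes

def estimate_iteration_time(strategy, tensors, bandwidth_mb_per_ms=1.0):
    """Toy stand-in for empirical timeline models: full communication cost
    for uncompressed tensors; reduced traffic plus a GPU-compression
    overhead for 'gpu'; reduced traffic plus a smaller (overlappable)
    CPU/PCIe overhead for 'cpu'. All constants are made up."""
    total = 0.0
    for choice, t in zip(strategy, tensors):
        if choice == "none":
            total += t.size_mb / bandwidth_mb_per_ms
        elif choice == "gpu":
            total += 0.1 * t.size_mb / bandwidth_mb_per_ms + 0.05 * t.size_mb
        else:  # "cpu": cheaper communication, pays PCIe copy + CPU compute
            total += 0.1 * t.size_mb / bandwidth_mb_per_ms + 0.02 * t.size_mb
    return total

def best_strategy(tensors):
    """Exhaustively walk all 3^n leaves of the decision tree and return the
    fastest strategy; a real system must prune this exponential search."""
    return min(product(CHOICES, repeat=len(tensors)),
               key=lambda s: estimate_iteration_time(s, tensors))

tensors = [Tensor("fc.weight", 400.0), Tensor("conv1.weight", 2.0)]
print(best_strategy(tensors))  # e.g. ('cpu', 'none'): compress only the large tensor
```

Even this toy version shows why a cost model matters: small tensors may not repay their compression overhead, so the best leaf mixes compressed and uncompressed tensors rather than compressing everything.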