The ability to scale out training workloads has been one of the key performance enablers of deep learning. The main scaling approach is data-parallel GPU-based training, which has been boosted by hardware and software support for highly efficient inter-GPU communication, in particular via bandwidth overprovisioning. This support comes at a price: there is an order of magnitude cost difference between "cloud-grade" servers with such support, relative to their "consumer-grade" counterparts, although server-grade and consumer-grade GPUs can have similar computational envelopes. In this paper, we investigate whether the expensive hardware overprovisioning approach can be supplanted via algorithmic and system design, and propose a framework called CGX, which provides efficient software support for communication compression. We show that this framework is able to remove communication bottlenecks from consumer-grade multi-GPU systems, in the absence of hardware support: when training modern models and tasks to full accuracy, CGX provides self-speedups of 2-3X for an 8-GPU commodity node, enabling it to surpass the throughput of a much more expensive NVIDIA DGX-1 server. In the multi-node setting, CGX enables significant additional speedups by identifying and solving the novel adaptive compression problem, in which we can automatically set compression levels in a layer-wise fashion, balancing speedup and accuracy recovery.
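To make the idea of layer-wise communication compression concrete, the sketch below shows a minimal, hypothetical per-layer gradient quantizer in PyTorch. It is not the CGX API: the function names `quantize` and `compress_gradients`, and the `bits_per_layer` configuration mapping, are assumptions for illustration only. The sketch applies uniform stochastic quantization to each parameter's gradient before it would be exchanged by all-reduce, with a configurable bit-width per layer, which is the kind of knob the adaptive compression problem tunes.

```python
# Hypothetical illustration of layer-wise gradient quantization
# (not the CGX implementation or API).
import torch


def quantize(grad: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform stochastic quantization of a gradient tensor to 2^bits levels."""
    if bits >= 32:                      # treat 32 bits as "no compression"
        return grad
    levels = (1 << bits) - 1
    scale = grad.abs().max()
    if scale == 0:
        return grad
    normalized = grad / scale                               # values in [-1, 1]
    scaled = (normalized + 1) / 2 * levels                   # values in [0, levels]
    rounded = torch.floor(scaled + torch.rand_like(scaled))  # stochastic rounding
    dequantized = rounded / levels * 2 - 1                   # back to [-1, 1]
    return dequantized * scale


def compress_gradients(model: torch.nn.Module, bits_per_layer: dict) -> None:
    """Replace each parameter's gradient with its quantized version in place.

    `bits_per_layer` maps parameter names to bit-widths; layers not listed
    fall back to an assumed 8-bit default.
    """
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        bits = bits_per_layer.get(name, 8)
        param.grad.copy_(quantize(param.grad, bits))
```

In an adaptive scheme of this kind, sensitive layers (e.g., small normalization or embedding layers) could be assigned higher bit-widths to preserve accuracy, while large dense layers are compressed more aggressively to maximize communication savings.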