The ability to scale out training workloads has been one of the key performance enablers of deep learning. The main scaling approach is data-parallel GPU-based training, which has been boosted by hardware and software support for highly efficient point-to-point communication, and in particular via hardware bandwidth overprovisioning. Overprovisioning comes at a cost: there is an order-of-magnitude price difference between "cloud-grade" servers with such support and their popular "consumer-grade" counterparts, even though individual server-grade and consumer-grade GPUs can have similar computational envelopes. In this paper, we show that the costly hardware overprovisioning approach can be supplanted via algorithmic and system design, and propose a framework called CGX, which provides efficient software support for compressed communication in ML applications, both for multi-GPU single-node training and for larger-scale multi-node training. CGX is based on two technical advances: \emph{At the system level}, it relies on a re-developed communication stack for ML frameworks, which provides flexible, highly efficient support for compressed communication. \emph{At the application level}, it provides \emph{seamless, parameter-free} integration with popular frameworks, so that end-users do not have to modify training recipes, nor modify significant amounts of training code. This is complemented by a \emph{layer-wise adaptive compression} technique which dynamically balances compression gains with accuracy preservation. CGX integrates with popular ML frameworks, providing up to 3X speedups for multi-GPU nodes based on commodity hardware, and order-of-magnitude improvements in the multi-node setting, with negligible impact on accuracy.
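To make the idea of layer-wise adaptive compression concrete, the sketch below quantizes each layer's gradient with a per-layer bit-width chosen from a simple sensitivity proxy. This is an illustration only, not the CGX algorithm: the `adaptive_bits` heuristic (gradient variance as a sensitivity proxy) and all function names are hypothetical, and real systems select bit-widths from accuracy-driven criteria.

```python
import numpy as np

def quantize(grad, bits):
    """Uniformly quantize a gradient tensor to 2**bits - 1 levels and dequantize.

    This mimics the lossy step of gradient compression: fewer bits mean a
    smaller message on the wire at the cost of larger quantization error.
    """
    if bits >= 32:
        return grad.copy()
    levels = 2 ** bits - 1
    lo, hi = float(grad.min()), float(grad.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((grad - lo) / scale)
    return q * scale + lo

def adaptive_bits(layer_grads, low=2, high=8):
    """Assign a per-layer bit-width between `low` and `high`.

    Hypothetical heuristic: layers whose gradients have higher variance are
    treated as more sensitive and receive more bits; low-variance layers are
    compressed more aggressively.
    """
    variances = {name: float(g.var()) for name, g in layer_grads.items()}
    vmin, vmax = min(variances.values()), max(variances.values())
    bits = {}
    for name, v in variances.items():
        frac = 0.0 if vmax == vmin else (v - vmin) / (vmax - vmin)
        bits[name] = int(round(low + frac * (high - low)))
    return bits

def compress_layerwise(layer_grads):
    """Quantize each layer's gradient with its adaptively chosen bit-width."""
    bits = adaptive_bits(layer_grads)
    compressed = {name: quantize(g, bits[name]) for name, g in layer_grads.items()}
    return compressed, bits
```

In a data-parallel setting, each worker would apply `compress_layerwise` to its local gradients before the all-reduce step, trading a small, layer-dependent quantization error for much lower communication volume.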