Deep neural networks (DNNs) have grown exponentially in complexity and size over the past decade, leaving only those who have access to massive datacenter-based resources with the ability to develop and train such models. One of the main challenges for the long tail of researchers who might have access to only limited resources (e.g., a single multi-GPU server) is limited GPU memory capacity compared to model size. The problem is so acute that the memory requirement of training large DNN models can often exceed the aggregate capacity of all available GPUs on commodity servers; this problem only gets worse with the trend of ever-growing model sizes. Current solutions that rely on virtualizing GPU memory (by swapping to/from CPU memory) incur excessive swapping overhead. In this paper, we present a new training framework, Harmony, and advocate rethinking how DNN frameworks schedule computation and move data to push the boundaries of training large models efficiently on modest multi-GPU deployments. Across many large DNN models, Harmony is able to reduce swap load by up to two orders of magnitude and obtain a training throughput speedup of up to 7.6x over highly optimized baselines with virtualized memory.
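To make the swapping mechanism concrete, below is a minimal PyTorch sketch (not Harmony's implementation) of the offload/prefetch pattern that virtualized-GPU-memory systems build on: a large tensor is copied to pinned CPU memory when GPU memory is scarce and copied back before it is needed again. The helper names `swap_out`/`swap_in` and the tensor size are illustrative assumptions; real systems overlap these transfers with computation rather than synchronizing.

```python
# Minimal sketch of GPU-memory swapping, assuming a CUDA device is available.
import torch

def swap_out(t: torch.Tensor) -> torch.Tensor:
    """Copy a GPU tensor into pinned CPU memory so the GPU copy can be freed."""
    cpu_copy = torch.empty(t.shape, dtype=t.dtype, device="cpu", pin_memory=True)
    cpu_copy.copy_(t, non_blocking=True)
    return cpu_copy

def swap_in(t: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Copy a CPU-resident tensor back onto the GPU before it is reused."""
    return t.to(device, non_blocking=True)

if torch.cuda.is_available():
    dev = torch.device("cuda:0")
    act = torch.randn(4096, 4096, device=dev)  # stand-in for a large activation
    act_cpu = swap_out(act)                    # offload to CPU to relieve GPU memory
    torch.cuda.synchronize()                   # real systems overlap this with compute
    del act
    torch.cuda.empty_cache()                   # GPU memory is now free for other tensors
    act = swap_in(act_cpu, dev)                # prefetch back before reuse
```

Each such round trip crosses the PCIe bus twice, which is why naive swapping incurs the overhead the abstract refers to.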