Deep neural networks (DNNs) have grown exponentially in size over the past decade, leaving only those who have massive datacenter-based resources with the ability to develop and train such models. One of the main challenges for the long tail of researchers who might have only limited resources (e.g., a single multi-GPU server) is limited GPU memory capacity compared to model size. The problem is so acute that the memory requirement of training massive DNN models can often exceed the aggregate capacity of all available GPUs on a single server; this problem only gets worse with the trend of ever-growing model sizes. Current solutions that rely on virtualizing GPU memory (by swapping to/from CPU memory) incur excessive swapping overhead. In this paper, we present a new training framework, Harmony, and advocate rethinking how DNN frameworks schedule computation and move data to push the boundaries of training massive models efficiently on a single commodity server. Across various massive DNN models, Harmony is able to reduce swap load by up to two orders of magnitude and obtain a training throughput speedup of up to 7.6x over highly optimized baselines with virtualized memory.
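To make concrete what "swapping to/from CPU memory" means in practice, the following is a minimal sketch (not Harmony's implementation) of manually offloading one large activation tensor to pinned host memory and prefetching it back later, written against standard PyTorch APIs; the tensor name, shape, and stream usage are illustrative assumptions. Every such round trip crosses the PCIe bus, which is the swapping overhead the abstract refers to.

```python
# Illustrative sketch of swap-based GPU memory virtualization, assuming PyTorch.
# Names and sizes are hypothetical; this is not Harmony's scheduling logic.
import torch

device = torch.device("cuda")
swap_stream = torch.cuda.Stream()  # side stream so copies can overlap compute

# A large intermediate tensor that does not fit alongside the rest of training state.
activation = torch.randn(256, 1024, 1024, device=device)

# --- Offload: copy to pinned CPU memory, then free the GPU copy ---
cpu_buffer = torch.empty(activation.shape, dtype=activation.dtype,
                         device="cpu", pin_memory=True)
with torch.cuda.stream(swap_stream):
    cpu_buffer.copy_(activation, non_blocking=True)
swap_stream.synchronize()   # the copy must finish before the GPU tensor is dropped
del activation              # GPU memory returned to the caching allocator

# ... run other layers that need the freed GPU memory ...

# --- Prefetch: bring the tensor back before it is needed again (e.g., backward) ---
with torch.cuda.stream(swap_stream):
    activation = cpu_buffer.to(device, non_blocking=True)
torch.cuda.current_stream().wait_stream(swap_stream)
```

Systems that virtualize GPU memory issue transfers like these automatically for many tensors per iteration; when the model far exceeds GPU capacity, the aggregate transfer time can dominate compute, which is the swap load Harmony aims to reduce.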