As emerging deep neural network (DNN) models continue to grow in size, using large GPU clusters to train DNNs is becoming an essential requirement for achieving acceptable training times. In this paper, we consider the case where future increases in cluster size will cause the global batch size that can be used to train models to reach a fundamental limit: beyond a certain point, larger global batch sizes cause sample efficiency to degrade, increasing overall time to accuracy. As a result, to achieve further improvements in training performance, we must instead consider "strong scaling" strategies that hold the global batch size constant and allocate smaller batches to each GPU. Unfortunately, this makes it significantly more difficult to use cluster resources efficiently. We present DeepPool, a system that addresses this efficiency challenge through two key ideas. First, burst parallelism allocates large numbers of GPUs to foreground jobs in bursts to exploit the unevenness in parallelism across layers. Second, GPU multiplexing prioritizes throughput for foreground training jobs, while packing in background training jobs to reclaim underutilized GPU resources, thereby improving cluster-wide utilization. Together, these two ideas enable DeepPool to deliver a 1.2-2.3x improvement in total cluster throughput over standard data parallelism with a single task when the cluster scale is large.
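The distinction between weak and strong scaling that motivates this work can be sketched as follows. This is an illustrative example only, with hypothetical function names; it is not taken from the DeepPool system itself.

```python
def weak_scaling_global_batch(per_gpu_batch, num_gpus):
    # Weak scaling: each GPU keeps a fixed batch, so the global
    # batch size grows with cluster size -- eventually degrading
    # sample efficiency, as the abstract notes.
    return per_gpu_batch * num_gpus

def strong_scaling_per_gpu_batch(global_batch, num_gpus):
    # Strong scaling: the global batch size is held constant, so
    # each GPU receives a smaller slice as the cluster grows,
    # making efficient GPU utilization harder.
    return global_batch // num_gpus

if __name__ == "__main__":
    for gpus in (8, 64, 512):
        print(f"{gpus} GPUs: weak-scaled global batch = "
              f"{weak_scaling_global_batch(32, gpus)}, "
              f"strong-scaled per-GPU batch = "
              f"{strong_scaling_per_gpu_batch(4096, gpus)}")
```

As the cluster grows from 8 to 512 GPUs, the weak-scaled global batch balloons from 256 to 16384 samples, while under strong scaling the per-GPU batch shrinks from 512 to just 8 samples, which is the utilization challenge DeepPool targets.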