In the last few years, the memory required to train state-of-the-art neural networks has far exceeded the DRAM capacity of modern hardware accelerators. This has necessitated the development of efficient algorithms to train these neural networks in parallel on large-scale GPU-based clusters. Since computation is relatively inexpensive on modern GPUs, designing and implementing extremely efficient communication in these parallel training algorithms is critical for extracting maximum performance. This paper presents Myelin, a parallel deep learning framework that exploits asynchrony and message-driven execution to schedule neural network operations on each GPU, thereby reducing GPU idle time and maximizing hardware efficiency. By using CPU memory as scratch space for periodically offloading data during training, Myelin reduces GPU memory consumption fourfold. This allows us to increase the number of parameters per GPU fourfold, thus reducing the amount of communication and improving performance by over 13%. When tested on large transformer models with 12-100 billion parameters on 48-384 NVIDIA Tesla V100 GPUs, Myelin achieves a per-GPU throughput of 49.4-54.78% of theoretical peak and reduces training time by 22-37 days (a 15-25% speedup) compared to the state of the art.