In the last few years, the memory required to train state-of-the-art neural networks has far exceeded the DRAM capacities of modern hardware accelerators. This has necessitated the development of efficient algorithms to train these neural networks in parallel on large-scale GPU-based clusters. Since computation is relatively inexpensive on modern GPUs, designing and implementing extremely efficient communication in these parallel training algorithms is critical for extracting the maximum performance. This paper presents Myelin, a parallel deep learning framework that exploits asynchrony and message-driven execution to schedule neural network operations on each GPU, thereby reducing GPU idle time and maximizing hardware efficiency. By using CPU memory as a scratch space for periodically offloading data during training, Myelin reduces GPU memory consumption fourfold. This allows us to increase the number of parameters per GPU by four times, thus reducing the amount of communication and increasing performance by over 13%. When tested against large transformer models with 12--100 billion parameters on 48--384 NVIDIA Tesla V100 GPUs, Myelin achieves a per-GPU throughput of 49.4--54.78% of theoretical peak and reduces the training time by 22--37 days (15--25% speedup) compared to the state-of-the-art.