OCCL: a Deadlock-free Library for GPU Collective Communication

Distributed deep neural network (DNN) training techniques make increasingly complicated use of collective communication on GPUs. Because GPU collectives are deadlock-prone, researchers must guarantee that collectives are enqueued in a consistent order on every GPU. In complex distributed DNN training scenarios, manual hardcoding is the only practical way to prevent deadlocks, which poses significant challenges to the development of artificial intelligence. This paper presents OCCL, which is, to the best of our knowledge, the first deadlock-free collective communication library for GPUs that supports dynamic decentralized preemption and gang-scheduling for collectives. Leveraging the preemption opportunity of collectives on GPUs, OCCL dynamically preempts collectives in a decentralized way via its deadlock-free collective execution framework and allows dynamic decentralized gang-scheduling via its stickiness adjustment scheme. With OCCL, researchers no longer have to struggle to get all GPUs to launch collectives in a consistent order to prevent deadlocks. We implement OCCL with several optimizations and integrate it with the distributed deep learning framework OneFlow. Experimental results demonstrate that OCCL achieves comparable or better latency and bandwidth for collectives compared to NCCL, the state of the art. When used in distributed DNN training, OCCL can improve the peak training throughput by up to 78% compared to statically sequenced NCCL, while introducing overheads of less than 6.5% across various distributed DNN training approaches.
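
The ordering hazard described in the abstract can be illustrated with a toy model. The sketch below is not OCCL's implementation and uses no GPU or NCCL calls. It assumes every rank participates in every collective, and all names (`run_in_order`, `run_with_preemption`, `allreduce_a`, `allreduce_b`) are hypothetical. The first function models non-preemptive collectives, where each rank blocks on the head of its own queue. The second models the general idea of preemption plus gang-scheduling, where a blocked rank may switch to any collective that all ranks have already enqueued.

```python
from collections import deque


def run_in_order(queues):
    """Non-preemptive toy model: each rank blocks on the head of its queue.

    A collective finishes only when it is at the head of every rank's queue.
    Returns (completed_collectives, deadlocked).
    """
    pending = [deque(q) for q in queues]
    completed = []
    while any(pending):
        heads = {q[0] for q in pending if q}
        # Progress requires every rank to be waiting on the same collective.
        if len(heads) == 1 and all(pending):
            coll = heads.pop()
            for q in pending:
                q.popleft()
            completed.append(coll)
        else:
            return completed, True  # ranks wait on each other forever
    return completed, False


def run_with_preemption(queues):
    """Preemptive toy model: a blocked rank may switch to another collective.

    A collective can run once every rank has enqueued it, at any position.
    Returns (completed_collectives, deadlocked).
    """
    pending = [list(q) for q in queues]
    completed = []
    while any(pending):
        # Collectives that all ranks have enqueued, regardless of order.
        ready = set(pending[0]).intersection(*pending[1:])
        if not ready:
            return completed, True
        # Deterministic choice: the earliest ready collective on rank 0.
        coll = next(c for c in pending[0] if c in ready)
        for q in pending:
            q.remove(coll)
        completed.append(coll)
    return completed, False


consistent = [["allreduce_a", "allreduce_b"], ["allreduce_a", "allreduce_b"]]
inconsistent = [["allreduce_a", "allreduce_b"], ["allreduce_b", "allreduce_a"]]

print(run_in_order(consistent))
print(run_in_order(inconsistent))
print(run_with_preemption(inconsistent))
```

In this model, two ranks that enqueue the same two collectives in opposite orders deadlock under `run_in_order`, while `run_with_preemption` completes both. A real library must make the same decision without a global view of all queues, which is the decentralized part of the problem the abstract refers to.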
