Impact of RoCE Congestion Control Policies on Distributed Training of DNNs

RDMA over Converged Ethernet (RoCE) has gained significant traction in datacenter networks due to its compatibility with conventional Ethernet-based fabrics. However, the RDMA protocol is efficient only on (nearly) lossless networks, which makes congestion control vital on RoCE networks. Unfortunately, the native RoCE congestion control scheme, based on Priority Flow Control (PFC), suffers from many drawbacks such as unfairness, head-of-line blocking, and deadlock. In recent years, many schemes have therefore been proposed to provide additional congestion control for RoCE networks and minimize the drawbacks of PFC. However, these schemes were designed for general datacenter environments. In contrast to general datacenters, which are built from commodity hardware and run general-purpose workloads, high-performance distributed training platforms deploy high-end accelerators and network components and exclusively run training workloads that communicate through collective communication libraries (e.g., All-Reduce, All-To-All). Furthermore, these platforms usually have a private network that separates their communication traffic from the rest of the datacenter traffic. Scalable topology-aware collective algorithms are inherently designed to avoid incast patterns and balance traffic optimally. These distinct features necessitate revisiting congestion control schemes previously proposed for general-purpose datacenter environments. In this paper, we thoroughly analyze several state-of-the-art RoCE congestion control schemes against PFC when running on distributed training platforms. Our results indicate that previously proposed RoCE congestion control schemes have little impact on the end-to-end performance of training workloads, motivating the design of an optimized, yet low-overhead, congestion control scheme based on the characteristics of distributed training platforms and workloads.
