We propose TopoOpt, a novel direct-connect fabric for deep neural network (DNN) training workloads. TopoOpt co-optimizes the distributed training process across three dimensions: computation, communication, and network topology. We demonstrate the mutability of AllReduce traffic, and leverage this property to construct efficient network topologies for DNN training jobs. TopoOpt then uses an alternating optimization technique and a group theory-inspired algorithm called TotientPerms to find the best network topology and routing plan, together with a parallelization strategy. We build a fully functional 12-node direct-connect prototype with remote direct memory access (RDMA) forwarding at 100 Gbps. Large-scale simulations on real distributed training models show that, compared to similar-cost Fat-Tree interconnects, TopoOpt reduces DNN training time by up to 3.4x.
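To make the group-theoretic idea behind TotientPerms concrete, below is a minimal illustrative sketch (not the paper's actual implementation; the function name is hypothetical). It relies on a standard number-theory fact: hopping through n nodes with a stride d coprime to n visits every node exactly once, so each such stride defines a distinct Hamiltonian ring that can carry the same AllReduce, and the number of these rings is Euler's totient phi(n).

```python
from math import gcd

def totient_perm_rings(n):
    """Enumerate one ring ordering of n nodes per stride d with gcd(d, n) == 1.

    Because d is coprime to n, the sequence 0, d, 2d, ... (mod n) is a
    permutation of all n nodes, so each stride yields a valid Hamiltonian
    ring over which an AllReduce can run. There are phi(n) such strides.
    """
    rings = []
    for d in range(1, n):
        if gcd(d, n) == 1:
            rings.append([(i * d) % n for i in range(n)])
    return rings

# Example: with 12 nodes, the coprime strides are 1, 5, 7, and 11
# (phi(12) = 4), each producing a different ring ordering.
for ring in totient_perm_rings(12):
    print(ring)
```

This sketch only shows why AllReduce traffic is mutable, i.e., why many distinct ring orderings compute the same result; TopoOpt's TotientPerms additionally searches over such permutations jointly with the topology and routing plan.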