We explore a novel approach for building DNN training clusters using commodity optical devices. Our proposal, called TopoOpt, co-optimizes the distributed training process across three dimensions: computation, communication, and network topology. TopoOpt uses an alternating optimization technique and a group theory-inspired algorithm to find the best network topology and routing plan, together with the parallelization strategy, for distributed DNN training. To motivate our proposal, we measure the communication patterns of distributed DNN workloads at a large online service provider. Experiments with a 12-node prototype demonstrate the feasibility of TopoOpt. Simulations on real distributed training models show that, compared to similar-cost FatTree interconnects, TopoOpt reduces DNN training time by up to 3×.