This paper presents TAG, an automatic system that derives an optimized DNN training graph and its deployment onto an arbitrary device topology, for expedited training in device- and topology-heterogeneous ML clusters. We take a novel approach that combines both the DNN computation graph and the device topology graph as input to a graph neural network (GNN), and couples the GNN with a search-based method to quickly identify optimized distributed training strategies. To reduce communication in a heterogeneous cluster, we further explore a lossless gradient compression technique and solve a combinatorial optimization problem to automatically apply the technique for training time minimization. We evaluate TAG with various representative DNN models and device topologies, showing that it achieves up to 4.56x training speed-up compared to existing schemes. TAG can produce efficient deployment strategies for both unseen DNN models and unseen device topologies, without heavy fine-tuning.