Deep Neural Networks (DNNs) have gained significant traction due to their wide applicability across different domains. DNN sizes and training sample counts are constantly growing, making training of such workloads more challenging. Distributed training is a solution to reduce the training time. High-performance distributed training platforms should leverage multi-dimensional hierarchical networks, which interconnect accelerators through different levels of the network, to dramatically reduce the number of expensive NICs required for the scale-out network. However, this comes at the expense of communication overhead between distributed accelerators to exchange gradients or input/output activations. To allow further scaling of the workloads, this communication overhead needs to be minimized. In this paper, we motivate the case that, in training platforms, adding more intermediate network dimensions is beneficial for efficiently mitigating the excessive use of expensive NIC resources. Further, we address different challenges of DNN training on hierarchical networks. We discuss how, when designing the interconnect, to distribute network bandwidth resources across the different dimensions in order to (i) maximize the bandwidth (BW) utilization of all dimensions, and (ii) minimize the overall training time for the target workload. We then implement a framework that, for a given workload, determines the best network configuration for maximizing performance, or performance-per-cost.
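To make the bandwidth-distribution idea concrete, the following is a minimal illustrative sketch (not the paper's actual framework): it enumerates coarse splits of a fixed total bandwidth budget across the network dimensions and picks the split that minimizes an assumed per-iteration cost model. All function names, the cost model, and the example numbers are hypothetical placeholders.

```python
# Hypothetical sketch: search over per-dimension bandwidth splits of a fixed
# total budget, minimizing an assumed per-iteration time model for a workload.
from itertools import product


def comm_time(traffic_gb, bw_gbps):
    """Time (s) to move `traffic_gb` GB over a dimension offering `bw_gbps` GB/s."""
    return traffic_gb / bw_gbps


def estimate_iter_time(traffic_per_dim, bw_split, compute_time_s):
    """Assumed cost model: iteration time is compute overlapped with the slowest
    dimension's collective (a simplification of real overlap behavior)."""
    slowest_comm = max(comm_time(t, bw) for t, bw in zip(traffic_per_dim, bw_split))
    return max(compute_time_s, slowest_comm)


def best_bw_split(total_bw_gbps, traffic_per_dim, compute_time_s, step=25):
    """Enumerate coarse-grained splits of the total bandwidth across dimensions
    and return the split with the lowest estimated per-iteration time."""
    n_dims = len(traffic_per_dim)
    candidates = [c for c in product(range(step, total_bw_gbps, step), repeat=n_dims)
                  if sum(c) == total_bw_gbps]
    return min(candidates,
               key=lambda c: estimate_iter_time(traffic_per_dim, c, compute_time_s))


# Example: a 3-dimensional network with a 400 GB/s total budget; per-dimension
# traffic (GB per iteration) and compute time are illustrative placeholders.
print(best_bw_split(400, traffic_per_dim=[2.0, 1.0, 0.5], compute_time_s=0.01))
```

A real design-space exploration would replace this brute-force loop and toy cost model with per-collective communication models and cost-aware objectives (e.g., performance-per-cost), but the structure of the search is the same: candidate bandwidth allocations in, estimated training time out, best allocation selected.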