Distributed deep learning (DDL) systems depend strongly on network performance. Current electronic packet switched (EPS) network architectures and technologies suffer from variable-diameter topologies, low bisection bandwidth, and over-subscription, all of which degrade the completion time of communication and collective operations. We introduce RAMP, a near-exascale, full-bisection bandwidth, all-to-all, single-hop, all-optical network architecture with nanosecond reconfiguration, which supports large-scale distributed and parallel computing systems (12.8~Tbps per node for up to 65,536 nodes). For the first time, we propose a custom RAMP-x MPI strategy and a network transcoder that run MPI collective operations across the optical circuit switched (OCS) network in a schedule-less and contention-less manner. RAMP achieves a 7.6--171$\times$ speed-up in completion time across all MPI operations compared to realistic EPS and OCS counterparts. It also delivers a 1.3--16$\times$ and 7.8--58$\times$ reduction in Megatron and DLRM training time respectively, while offering 42--53$\times$ and 3.3--12.4$\times$ improvements in energy consumption and cost respectively.