We study a novel and important communication pattern in large-scale model-parallel deep learning (DL), which we call cross-mesh resharding. This pattern emerges when the two paradigms of model parallelism, intra-operator and inter-operator parallelism, are combined to support large models on large clusters. In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh, on which the tensor may be distributed with the same or a different layout. We formalize this as a many-to-many multicast communication problem, and show that existing approaches are either sub-optimal or do not generalize to the different network topologies and tensor layouts that arise from different model architectures and parallelism strategies. We then propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system and an "overlapping-friendly" pipeline schedule. On microbenchmarks, our overall system outperforms existing ones by up to 10x across various tensor and mesh layouts. On end-to-end training of two large models, GPT-3 and U-Transformer, we improve throughput by 10% and 50%, respectively.
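To make the problem concrete, below is a toy, hedged sketch (not the paper's system) of what cross-mesh resharding entails: a tensor sharded by rows on a two-device source mesh must be re-laid-out as column shards on a four-device destination mesh, so every destination device needs a tile from every source device. All device names and the helper functions `shard_ranges` and `cross_mesh_resharding_plan` are hypothetical and only illustrate the many-to-many multicast pattern described above.

```python
# Toy illustration of cross-mesh resharding as a many-to-many send plan.
# Assumed setup: source mesh shards the tensor along rows, destination
# mesh shards it along columns; both layouts are hypothetical examples.

def shard_ranges(length, num_shards):
    """Split [0, length) into `num_shards` contiguous, equal ranges."""
    step = length // num_shards
    return [(i * step, (i + 1) * step) for i in range(num_shards)]

def cross_mesh_resharding_plan(rows, cols, src_devices, dst_devices):
    # Source layout: the tensor is sharded along rows across the source mesh.
    src_row_shards = shard_ranges(rows, len(src_devices))
    # Destination layout: the same tensor is sharded along columns.
    dst_col_shards = shard_ranges(cols, len(dst_devices))

    plan = []  # entries: (src_device, dst_device, row_range, col_range)
    for d, (c0, c1) in zip(dst_devices, dst_col_shards):
        # Each destination column shard spans all rows, so it requires a
        # piece from every source device: a many-to-many communication.
        for s, (r0, r1) in zip(src_devices, src_row_shards):
            plan.append((s, d, (r0, r1), (c0, c1)))
    return plan

if __name__ == "__main__":
    for send in cross_mesh_resharding_plan(
        rows=8, cols=8,
        src_devices=["src:gpu0", "src:gpu1"],
        dst_devices=[f"dst:gpu{i}" for i in range(4)],
    ):
        print(send)
```

Even in this small example the plan contains 2 x 4 = 8 point-to-point sends, and the same source tile is needed by several destination devices, which is why a broadcast-based communication scheme can outperform naive send/recv.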