Modern large-scale deep learning workloads highlight the need for parallel execution across many devices in order to fit model data into hardware accelerator memories. In these settings, array redistribution may be required during a computation, but can also become a bottleneck if not done efficiently. In this paper we address the problem of redistributing multi-dimensional array data in SPMD computations, the most prevalent form of parallelism in deep learning. We present a type-directed approach to synthesizing array redistributions as sequences of MPI-style collective operations. We prove formally that our synthesized redistributions are memory-efficient and perform no excessive data transfers. Array redistribution for SPMD computations using collective operations has also been implemented in the context of the XLA SPMD partitioner, a production-grade tool for partitioning programs across accelerator systems. We evaluate our approach against the XLA implementation and find that our approach delivers a geometric mean speedup of $1.22\times$, with maximum speedups as high as $5.7\times$, while offering provable memory guarantees, making our system particularly appealing for large-scale models.
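To make the redistribution problem concrete, the following is a minimal, hypothetical sketch (not the system described in this paper): in JAX, changing the sharding of an array between two SPMD layouts is exactly the kind of redistribution discussed above, and XLA realizes the layout change with collective communication. The mesh name "d" and the array sizes are illustrative assumptions.

```python
import numpy as np
import jax
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a 1-D device mesh from whatever devices are available.
devices = np.array(jax.devices())
mesh = Mesh(devices, axis_names=("d",))
n = devices.size

# A matrix whose dimensions are divisible by the number of devices.
x_host = np.arange((8 * n) * (8 * n), dtype=np.float32).reshape(8 * n, 8 * n)

# Lay the array out sharded along rows across the mesh axis "d"...
x = jax.device_put(x_host, NamedSharding(mesh, P("d", None)))

# ...then redistribute it to be sharded along columns instead. XLA's SPMD
# machinery lowers this resharding to collective operations (all-to-all style),
# which is the class of redistributions the paper synthesizes and analyzes.
y = jax.device_put(x, NamedSharding(mesh, P(None, "d")))
```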