MPI Derived Datatypes are an abstraction that simplifies handling of non-contiguous data in MPI applications. These datatypes are recursively constructed at runtime from primitive Named Types defined in the MPI standard. More recently, the development and deployment of CUDA-aware MPI implementations has encouraged the transition of distributed high-performance MPI codes to use GPUs. Such implementations allow MPI functions to operate directly on GPU buffers, easing the integration of GPU compute into MPI codes. Despite substantial attention to CUDA-aware MPI implementations, they continue to offer poor performance when manipulating derived datatypes on GPUs. This work presents a new MPI library, TEMPI, to address this issue. TEMPI first introduces a common datatype to represent equivalent MPI derived datatypes. TEMPI can be integrated into existing MPI deployments as an interposed library, usable regardless of MPI deployment and without modifying application code. Furthermore, this work presents a performance model of GPU derived datatype handling, demonstrating that the previously preferred "one-shot" methods are not always fastest. Ultimately, the interposed-library model of this work demonstrates MPI_Pack speedup of up to 720,400x and MPI_Send speedup of up to 59,000x compared to the MPI implementation deployed on a leadership-class supercomputer. This yields a speedup of more than 1000x in a 3D halo exchange at 192 ranks.
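To make the abstract's terminology concrete, the following minimal sketch (not taken from the paper, and not TEMPI's API) shows a derived datatype built from the named type MPI_DOUBLE and then packed directly from a GPU buffer, as a CUDA-aware MPI permits; the vector geometry and buffer sizes are illustrative assumptions only.

```c
/* Sketch: a derived datatype (strided face, as in a halo exchange) built from
 * the named type MPI_DOUBLE, then MPI_Pack applied to a device buffer.
 * Requires a CUDA-aware MPI; geometry below is purely illustrative. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Derived datatype: 16 blocks of 8 doubles, stride of 64 doubles. */
    MPI_Datatype face;
    MPI_Type_vector(16, 8, 64, MPI_DOUBLE, &face);
    MPI_Type_commit(&face);

    /* Non-contiguous source data resides on the GPU. */
    double *d_src;
    cudaMalloc((void **)&d_src, 16 * 64 * sizeof(double));

    /* Pack the non-contiguous elements into a contiguous device buffer. */
    int pack_bytes = 0;
    MPI_Pack_size(1, face, MPI_COMM_WORLD, &pack_bytes);

    char *d_packed;
    cudaMalloc((void **)&d_packed, pack_bytes);

    int position = 0;
    MPI_Pack(d_src, 1, face, d_packed, pack_bytes, &position, MPI_COMM_WORLD);

    cudaFree(d_packed);
    cudaFree(d_src);
    MPI_Type_free(&face);
    MPI_Finalize();
    return 0;
}
```

Handling exactly this kind of call efficiently on GPU buffers is the scenario the abstract's MPI_Pack and MPI_Send speedup figures refer to.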