MPI derived datatypes are an abstraction that simplifies handling of non-contiguous data in MPI applications. These datatypes are constructed recursively at runtime from the primitive named types defined in the MPI standard. More recently, the development and deployment of CUDA-aware MPI implementations have encouraged distributed high-performance MPI codes to adopt GPUs. Such implementations allow MPI functions to operate directly on GPU buffers, easing the integration of GPU compute into MPI codes. Despite substantial attention to CUDA-aware MPI implementations, they continue to offer cripplingly poor performance when manipulating derived datatypes that describe GPU-resident data. This work presents a new MPI library, TEMPI, to address this issue. TEMPI first introduces a common internal representation for equivalent MPI derived datatypes. TEMPI can be used as an interposed library on existing MPI deployments without system or application changes. Furthermore, this work presents a performance model of derived datatype handling on GPUs, demonstrating that the previously preferred "one-shot" methods are not always fastest. Ultimately, the interposed-library model of this work demonstrates MPI_Pack speedup of up to 242,000x and MPI_Send speedup of up to 59,000x compared to the MPI implementation deployed on a leadership-class supercomputer. This yields speedup of more than 1,000x in a 3D halo exchange at 192 ranks.
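
To make the setting concrete, the following minimal sketch shows the pattern the abstract describes: a strided derived datatype built from the primitive named type MPI_FLOAT, then passed directly to MPI_Send/MPI_Recv on a device buffer. It assumes a CUDA-aware MPI implementation and at least two ranks; the block dimensions and tag are illustrative only, not taken from the paper.

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One face of a 16x16 block: 16 elements of 1 float each, stride 16.
     * MPI_Type_vector builds this recursively from the named type MPI_FLOAT. */
    MPI_Datatype face;
    MPI_Type_vector(16 /*count*/, 1 /*blocklength*/, 16 /*stride*/,
                    MPI_FLOAT, &face);
    MPI_Type_commit(&face);

    /* Device pointer: a CUDA-aware MPI operates on it directly,
     * with no explicit staging through host memory. */
    float *buf;
    cudaMalloc((void **)&buf, 16 * 16 * sizeof(float));

    if (rank == 0) {        /* run with e.g. mpirun -np 2 */
        MPI_Send(buf, 1, face, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 1, face, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(buf);
    MPI_Type_free(&face);
    MPI_Finalize();
    return 0;
}
```

It is exactly this path, the MPI library internally packing the non-contiguous GPU elements described by `face`, that the abstract identifies as a performance weak point in deployed CUDA-aware implementations.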
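
The interposition model can likewise be sketched with the standard PMPI profiling interface: a shared library defines the MPI entry points it wants to intercept and forwards everything else to the underlying implementation. This is a generic sketch of the mechanism, not TEMPI's actual code; the GPU fast path is a placeholder comment.

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Returns nonzero if p is a GPU (device) pointer. */
static int is_device_pointer(const void *p) {
    struct cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, p) != cudaSuccess)
        return 0;
    return attr.type == cudaMemoryTypeDevice;
}

/* Intercepted MPI_Pack: the application calls this symbol instead of
 * the deployed MPI's, with no source changes required. */
int MPI_Pack(const void *inbuf, int incount, MPI_Datatype datatype,
             void *outbuf, int outsize, int *position, MPI_Comm comm) {
    if (is_device_pointer(inbuf)) {
        /* ...a specialized GPU packing kernel would run here... */
    }
    /* Forward to the underlying MPI implementation via PMPI. */
    return PMPI_Pack(inbuf, incount, datatype, outbuf, outsize,
                     position, comm);
}
```

Built as a shared object and placed ahead of the MPI library at link time (or loaded via LD_PRELOAD on Linux), such a library interposes on an existing MPI deployment without system or application changes, which is the usage model the abstract claims for TEMPI.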