MPI derived datatypes are an abstraction that simplifies handling of non-contiguous data in MPI applications. These datatypes are recursively constructed at runtime from primitive named types defined in the MPI standard. More recently, the development and deployment of CUDA-aware MPI implementations has encouraged the transition of distributed high-performance MPI codes to use GPUs. Such implementations allow MPI functions to operate directly on GPU buffers, easing the integration of GPU compute into MPI codes. This work first presents a novel datatype handling strategy for nested strided datatypes, which finds a middle ground between the specialized and generic handling of prior work. This work also shows that the performance characteristics of non-contiguous data handling can be modeled from empirical system measurements and used to transparently improve MPI_Send/Recv latency. Finally, despite substantial attention to non-contiguous GPU data and CUDA-aware MPI implementations, good performance cannot be taken for granted. This work demonstrates its contributions through an MPI interposer library, TEMPI, which can be used with existing MPI deployments without system or application changes. Ultimately, the interposed-library model of this work demonstrates MPI_Pack speedup of up to 242,000x and MPI_Send speedup of up to 59,000x compared to the MPI implementation deployed on a leadership-class supercomputer. This yields a speedup of more than 917x in a 3D halo exchange with 3072 processes.
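To illustrate the kind of nested strided datatype and CUDA-aware communication the abstract refers to, the following is a minimal sketch. It assumes a CUDA-aware MPI implementation and at least two ranks; the volume and block dimensions, variable names, and message tag are illustrative and not taken from this work.

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Illustrative 3D volume and sub-block (e.g., a halo region). */
    const int nx = 256, ny = 256, nz = 256; /* full volume extents   */
    const int sx = 16,  sy = 16,  sz = 16;  /* sub-block extents      */

    /* Nested strided datatype: an inner vector of rows wrapped in an
       outer vector of planes, built recursively from MPI_DOUBLE.     */
    MPI_Datatype rows, block;
    /* sy rows of sx doubles, each row separated by nx doubles */
    MPI_Type_vector(sy, sx, nx, MPI_DOUBLE, &rows);
    /* sz planes of the above, each plane separated by nx*ny doubles (byte stride) */
    MPI_Type_create_hvector(sz, 1, (MPI_Aint)nx * ny * sizeof(double), rows, &block);
    MPI_Type_commit(&block);

    /* GPU buffer: a CUDA-aware MPI can send/receive from it directly. */
    double *d_buf;
    cudaMalloc((void **)&d_buf, (size_t)nx * ny * nz * sizeof(double));

    if (rank == 0)
        MPI_Send(d_buf, 1, block, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, 1, block, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&block);
    MPI_Type_free(&rows);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

Because such calls go through the standard MPI interface, an interposer library like TEMPI can intercept MPI_Send/MPI_Recv/MPI_Pack on these datatypes and substitute its own GPU handling without any change to application code.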