MPI's derived datatypes (DDTs) promise easier, copy-free communication of non-contiguous data, yet their practical performance remains debated and is often reported for only a single MPI stack. We present a cross-implementation assessment using three 2D applications: a Jacobi CFD solver, Conway's Game of Life, and a lattice-based image reconstruction code. Each application is written in two ways: (i) a BASIC version that manually packs and unpacks non-contiguous regions, and (ii) a DDT version that uses MPI_Type_vector and MPI_Type_create_subarray, with extents adjusted via MPI_Type_create_resized. For API parity, both variants are benchmarked under identical communication semantics: non-blocking point-to-point (Irecv/Isend + Waitall), neighborhood collectives (MPI_Neighbor_alltoallw), and MPI-4 persistent operations (*_init). We run strong and weak scaling on 1-4 ranks, validate bitwise-identical halos, and evaluate four widely used MPI implementations, MPICH, Open MPI, Intel MPI, and MVAPICH2, on a single ARCHER2 node. The results are mixed: DDTs can be the fastest option, for example for the image reconstruction code on Intel MPI and MPICH, yet among the slowest for the same code on Open MPI and MVAPICH2. For the CFD solver, the BASIC variants generally outperform the DDT variants across all semantics, whereas for Game of Life the ranking flips depending on the MPI library. We also observe stack-specific anomalies, for example MPICH slowdowns with DDT-based neighborhood and persistent modes. Overall, no single strategy dominates across programs, semantics, and MPI stacks, so performance portability for DDTs is not guaranteed. We therefore recommend profiling both DDT-based and manual-packing designs under the intended MPI implementation and communication mode. Our study is limited to a single node and does not analyze memory overhead; multi-node and GPU-aware paths are left for future work.
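To make the DDT design point concrete, the following is a minimal C sketch, not taken from the benchmarked codes, of how a halo column of a row-major ny x nx grid of doubles can be described with MPI_Type_vector and its extent shrunk with MPI_Type_create_resized, together with an MPI_Type_create_subarray row strip; the grid layout and helper names are assumptions for illustration.

#include <mpi.h>

/* Hypothetical helper: a datatype describing one column of a row-major
 * ny x nx grid of doubles. MPI_Type_vector picks every nx-th element;
 * MPI_Type_create_resized shrinks the extent to one double so adjacent
 * columns can be addressed by simple element offsets or counts. */
MPI_Datatype make_column_type(int ny, int nx)
{
    MPI_Datatype col, col_resized;
    MPI_Type_vector(ny, 1, nx, MPI_DOUBLE, &col);
    MPI_Type_create_resized(col, 0, (MPI_Aint)sizeof(double), &col_resized);
    MPI_Type_commit(&col_resized);
    MPI_Type_free(&col);
    return col_resized;
}

/* Hypothetical helper: a subarray type selecting one full row of the same
 * grid, e.g. a top or bottom halo strip. */
MPI_Datatype make_row_type(int ny, int nx, int row)
{
    MPI_Datatype strip;
    int sizes[2]    = { ny, nx };
    int subsizes[2] = { 1,  nx };
    int starts[2]   = { row, 0 };
    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &strip);
    MPI_Type_commit(&strip);
    return strip;
}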
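The benchmarked semantics differ only in how the exchange is posted. As a further sketch under the same assumptions (a column type like the one above, assumed left/right neighbor ranks, halo columns 0 and nx-1), the non-blocking and persistent point-to-point forms look roughly as follows; the neighborhood-collective forms instead use MPI_Neighbor_alltoallw, or its MPI-4 persistent counterpart MPI_Neighbor_alltoallw_init, on a Cartesian communicator.

#include <mpi.h>

/* Non-blocking halo exchange: receive into halo columns 0 and nx-1,
 * send the adjacent interior columns 1 and nx-2. */
void exchange_nonblocking(double *grid, int nx,
                          int left, int right,
                          MPI_Datatype col, MPI_Comm comm)
{
    MPI_Request req[4];
    MPI_Irecv(&grid[0],      1, col, left,  0, comm, &req[0]);
    MPI_Irecv(&grid[nx - 1], 1, col, right, 1, comm, &req[1]);
    MPI_Isend(&grid[1],      1, col, left,  1, comm, &req[2]);
    MPI_Isend(&grid[nx - 2], 1, col, right, 0, comm, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}

/* Persistent variant: set up the same four transfers once, then call
 * MPI_Startall(4, req) and MPI_Waitall(4, req, MPI_STATUSES_IGNORE)
 * in every iteration, and MPI_Request_free at the end. */
void setup_persistent(double *grid, int nx,
                      int left, int right,
                      MPI_Datatype col, MPI_Comm comm,
                      MPI_Request req[4])
{
    MPI_Recv_init(&grid[0],      1, col, left,  0, comm, &req[0]);
    MPI_Recv_init(&grid[nx - 1], 1, col, right, 1, comm, &req[1]);
    MPI_Send_init(&grid[1],      1, col, left,  1, comm, &req[2]);
    MPI_Send_init(&grid[nx - 2], 1, col, right, 0, comm, &req[3]);
}

In the BASIC variants the same transfers are expressed as contiguous MPI_DOUBLE buffers that are packed and unpacked by hand around each exchange.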