The near-field (P2P) operator in the Multilevel Fast Multipole Algorithm (MLFMA) is a performance bottleneck on GPUs due to poor memory locality. This work introduces data redundancy to improve spatial locality by reducing memory access dispersion. To validate the results, we propose an analytical model based on a Locality metric that combines data volume and access dispersion to predict speedup trends without hardware-specific profiling. The approach is validated on two MLFMA-based applications: an electromagnetic solver (DBIM-MLFMA) with a regular structure, and a stellar dynamics code (PhotoNs-2.0) with an irregular particle distribution. Results show up to 7× kernel speedup due to improved cache behavior. However, the increased data volume raises data-restructuring overheads, limiting end-to-end application speedup to 1.04×. While the model cannot precisely predict absolute speedups, it reliably captures performance trends across different problem sizes and densities. The technique can be integrated into existing implementations with minimal code changes. This work demonstrates that data redundancy can enhance GPU performance for the P2P operator, provided locality gains outweigh data movement costs.
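The trade-off between locality and data volume can be illustrated with a small host-side sketch. This is not the authors' implementation or data layout: the 1D box geometry, the toy 1/r kernel, and all function names are assumptions chosen only to show the idea of replacing scattered neighbor reads with one redundant, contiguous buffer per target box.

```python
# Illustrative sketch only: 1D boxes, a toy 1/r kernel, hypothetical names.

def neighbors(b, n_boxes):
    """Indices of box b and its adjacent boxes (the near-field list)."""
    return [j for j in (b - 1, b, b + 1) if 0 <= j < n_boxes]

def p2p_indirect(boxes, charges):
    """Baseline: each target box reads sources from neighbor boxes in place."""
    out = []
    for b, targets in enumerate(boxes):
        nbrs = neighbors(b, len(boxes))
        for x in targets:
            acc = 0.0
            for j in nbrs:  # one separate memory region per neighbor box
                for y, q in zip(boxes[j], charges[j]):
                    if x != y:  # skip self-interaction
                        acc += q / abs(x - y)
            out.append(acc)
    return out

def build_redundant(boxes, charges):
    """Copy every neighbor's sources into one contiguous list per target box."""
    packed = []
    for b in range(len(boxes)):
        buf = []
        for j in neighbors(b, len(boxes)):
            buf.extend(zip(boxes[j], charges[j]))  # duplicated source data
        packed.append(buf)
    return packed

def p2p_redundant(boxes, packed):
    """Same sums, but each target box streams through a single buffer."""
    out = []
    for targets, buf in zip(boxes, packed):
        for x in targets:
            out.append(sum(q / abs(x - y) for y, q in buf if x != y))
    return out

boxes = [[0.1, 0.4], [1.2, 1.7], [2.3, 2.5], [3.6, 3.9]]
charges = [[1.0, 2.0], [0.5, 1.5], [2.0, 1.0], [1.0, 3.0]]

packed = build_redundant(boxes, charges)
baseline = p2p_indirect(boxes, charges)
redundant = p2p_redundant(boxes, packed)

n_original = sum(len(b) for b in boxes)
n_packed = sum(len(buf) for buf in packed)
inflation = n_packed / n_original  # data-volume growth caused by redundancy
```

In this toy setup both variants produce the same interaction sums, while the packed layout stores 20 source entries instead of 8. That growth in volume is the cost side of the trade-off the abstract describes: building and moving the larger buffers is the restructuring overhead that can offset the kernel-level gain. The ratio computed here is only a volume measure, not the Locality metric proposed in the paper.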