Scientific applications that run on leadership computing facilities often face the challenge of being unable to fit leading science cases onto accelerator devices due to memory constraints (memory-bound applications). In this work, the authors studied one such US Department of Energy mission-critical condensed matter physics application, the Dynamical Cluster Approximation (DCA++), and this paper discusses how the device-memory bottleneck was successfully alleviated by an effective "all-to-all" communication method -- a ring communication algorithm. The implementation takes advantage of GPU acceleration and remote direct memory access (RDMA) for fast data exchange between GPUs. Additionally, the ring algorithm was optimized with sub-ring communicators and multi-threaded support to further reduce communication overhead and expose more concurrency, respectively. The computation and communication were also analyzed with the Autonomic Performance Environment for Exascale (APEX) profiling tool, and this paper further discusses the performance trade-offs of the ring algorithm implementation. The memory analysis of the ring algorithm shows that the allocation size of the authors' most memory-intensive data structure per GPU is reduced to 1/p of its original size, where p is the number of GPUs in the ring communicator. The communication analysis suggests that the distributed Quantum Monte Carlo execution time grows linearly as the sub-ring size increases, and that the cost of messages passing through the network interface connector could be a limiting factor.
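The memory reduction described above follows from the structure of a ring exchange: each participant holds only its own 1/p-sized chunk and, over p steps, forwards its resident chunk to one neighbor while receiving from the other, so every participant eventually processes all of the data without any of them holding more than one chunk at a time. The following is a minimal single-process sketch of that pattern; it is illustrative only and is not the authors' DCA++ implementation, which runs across GPUs using RDMA (the function and variable names here are hypothetical).

```python
def ring_all_to_all(chunks):
    """Simulate p ring participants, each starting with one chunk
    (1/p of the total data). Over p steps the chunks circulate
    around the ring, so every participant sees every chunk while
    only ever holding a single chunk at a time."""
    p = len(chunks)
    seen = [[] for _ in range(p)]   # chunks each participant has processed
    held = list(chunks)             # chunk currently resident on each participant
    for _ in range(p):
        for rank in range(p):
            seen[rank].append(held[rank])      # "process" the resident chunk
        # shift: rank r sends its chunk to (r + 1) % p, receives from (r - 1) % p
        held = [held[(rank - 1) % p] for rank in range(p)]
    return seen

workers = ring_all_to_all(["A", "B", "C", "D"])
# After p steps, every participant has processed all four chunks,
# yet peak memory per participant stayed at one chunk (1/p of the total).
assert all(sorted(s) == ["A", "B", "C", "D"] for s in workers)
```

In a distributed setting, the per-step shift would be a simultaneous send/receive between neighboring GPUs (e.g., via CUDA-aware MPI), which is where the RDMA fast path mentioned in the abstract applies; the sub-ring optimization partitions the participants into smaller rings so each chunk traverses fewer hops.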