The scientific computing ecosystem in Python is largely confined to single-node parallelism, creating a gap between high-level prototyping in NumPy and high-performance execution on modern supercomputers. The increasing prevalence of hardware accelerators and the need for energy efficiency have made resource adaptivity a critical requirement, yet traditional HPC abstractions remain rigid. To address these challenges, we present an adaptive, distributed abstraction for stencil computations on multi-node GPU systems. This abstraction is built using CharmTyles, a framework based on the adaptive Charm++ runtime, and features a familiar NumPy-like syntax to minimize the porting effort from prototype to production code. We showcase the resource elasticity of our abstraction by dynamically rescaling a running application across different node counts and present a performance analysis of the associated overheads. Furthermore, we demonstrate that our abstraction achieves significant performance improvements over both a specialized, high-performance stencil DSL and a generalized NumPy replacement.
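To make the target programming style concrete, the sketch below shows the kind of stencil kernel the abstraction is aimed at, written in plain NumPy. This is illustrative only: it is a standard 5-point Jacobi relaxation, not the CharmTyles API itself, whose actual interface is not shown in this abstract.

```python
import numpy as np

def jacobi_step(u):
    """One Jacobi iteration of the 2D Laplace equation on interior points.

    A 5-point stencil: each interior cell becomes the average of its
    four neighbors, expressed with NumPy array slicing.
    """
    v = u.copy()
    v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])
    return v

# Fixed boundary condition on the top edge; interior relaxes toward it.
grid = np.zeros((64, 64))
grid[0, :] = 1.0
for _ in range(100):
    grid = jacobi_step(grid)
```

Code in this slicing-based style is what a NumPy-like distributed abstraction would transparently partition and execute across nodes.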