Iterative memory-bound solvers are common in HPC codes. Typical GPU implementations run a loop on the host that launches the GPU kernel once per time/algorithm step. The termination of each kernel implicitly acts as the barrier required after advancing the solution by one time step. We propose an execution model for running memory-bound iterative GPU kernels: PERsistent KernelS (PERKS). In this model, the time loop is moved inside a persistent kernel, and device-wide barriers are used for synchronization. We then reduce the traffic to device memory by caching a subset of the output of each time step in the otherwise unused registers and shared memory. PERKS can be generalized to any iterative solver: it is largely independent of the solver's implementation. We explain the design principles of PERKS and demonstrate its effectiveness on a wide range of iterative 2D/3D stencil benchmarks (geomean speedup of $2.12$x for 2D stencils and $1.24$x for 3D stencils over state-of-the-art libraries) and on a Krylov subspace conjugate gradient solver (geomean speedup of $4.86$x on smaller SpMV datasets from SuiteSparse and $1.43$x on larger SpMV datasets over a state-of-the-art library). All PERKS-based implementations are available at: https://github.com/neozhang307/PERKS.
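To make the execution model concrete, the following is a minimal CUDA sketch of a persistent kernel for a 1D 3-point Jacobi stencil. It is an illustration under stated assumptions, not the PERKS implementation from the repository. The kernel name `jacobi_persistent`, the `halo` buffer layout, the clamped boundary condition, and the restriction that the whole domain fits in one co-resident grid (one point per thread) are all hypothetical choices made for brevity. The sketch assumes a device that supports cooperative launch and a reasonably recent CUDA toolkit.

```cuda
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

constexpr int TILE = 256;  // threads per block; one grid point per thread (assumption)

// Persistent kernel: the time loop runs on the device instead of the host.
// Assumes n == gridDim.x * TILE. halo holds 2 * n floats (double-buffered);
// only the two boundary points of each block are ever written to it.
__global__ void jacobi_persistent(const float* in, float* out, float* halo,
                                  int n, int steps) {
    cg::grid_group grid = cg::this_grid();
    __shared__ float tile[2][TILE];  // block-local cache, double-buffered
    const int tid = threadIdx.x;
    const int gid = blockIdx.x * TILE + tid;
    float v = in[gid];  // this thread's point stays in a register across steps

    for (int t = 0; t < steps; ++t) {
        const int b = t & 1;
        tile[b][tid] = v;  // publish to neighbours in the same block
        if (tid == 0 || tid == TILE - 1)
            halo[b * n + gid] = v;  // only block edges go to device memory
        grid.sync();  // device-wide barrier replaces kernel termination

        // Neighbours come from shared memory, or from halo at block edges;
        // the domain boundary is clamped (the point reuses its own value).
        float l = (gid == 0) ? v
                : (tid == 0) ? halo[b * n + gid - 1] : tile[b][tid - 1];
        float r = (gid == n - 1) ? v
                : (tid == TILE - 1) ? halo[b * n + gid + 1] : tile[b][tid + 1];
        v = (l + v + r) / 3.0f;
    }
    out[gid] = v;  // the full result is written to device memory once
}

int main() {
    int dev = 0, coop = 0, sms = 0, perSm = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&coop, cudaDevAttrCooperativeLaunch, dev);
    if (!coop) { std::printf("cooperative launch not supported\n"); return 0; }
    cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, dev);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&perSm, jacobi_persistent, TILE, 0);

    // All blocks must be resident at once for grid.sync() to be valid.
    int blocks = std::min(sms * perSm, 64);
    int n = blocks * TILE, steps = 100;

    // Host input and a CPU reference with the same clamped boundary.
    std::vector<float> h(n), ref(n), tmp(n);
    for (int i = 0; i < n; ++i) h[i] = ref[i] = (i % 17) * 0.5f;
    for (int t = 0; t < steps; ++t) {
        for (int i = 0; i < n; ++i) {
            float l = (i == 0) ? ref[i] : ref[i - 1];
            float r = (i == n - 1) ? ref[i] : ref[i + 1];
            tmp[i] = (l + ref[i] + r) / 3.0f;
        }
        ref.swap(tmp);
    }

    float *dIn, *dOut, *dHalo;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMalloc(&dHalo, 2 * n * sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // A single launch covers all time steps.
    void* args[] = {&dIn, &dOut, &dHalo, &n, &steps};
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void*)jacobi_persistent, dim3(blocks), dim3(TILE), args, 0, 0);
    cudaMemcpy(h.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    float maxErr = 0.0f;
    for (int i = 0; i < n; ++i) maxErr = std::max(maxErr, std::fabs(h[i] - ref[i]));
    std::printf("launch: %s, max |gpu - cpu| = %g\n", cudaGetErrorString(err), maxErr);

    cudaFree(dIn); cudaFree(dOut); cudaFree(dHalo);
    return 0;
}
```

In this sketch, each block writes two boundary values per step to device memory instead of its full tile of `TILE` values, which is the kind of traffic reduction the caching step aims for. Double-buffering both `tile` and `halo` lets one `grid.sync()` per step suffice, because a buffer is only overwritten two steps after it was last read. A real solver would typically process domains larger than the co-resident grid, so only part of the domain could be cached this way.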