Iterative memory-bound solvers are common in HPC codes. Typical GPU implementations use a host-side loop that launches the GPU kernel once per time step (or algorithm step). The termination of each kernel implicitly acts as the barrier required after advancing the solution by one time step. We propose a scheme for running memory-bound iterative GPU kernels: PERsistent KernelS (PERKS). In this scheme, the time loop is moved inside a persistent kernel, and device-wide barriers are used for synchronization. We then reduce traffic to device memory by caching a subset of each time step's output in registers and shared memory, to be used as input for the following time step. PERKS can be generalized to any iterative solver: they are largely independent of the solver's implementation. We explain the design principle of PERKS and demonstrate its effectiveness for a wide range of iterative 2D/3D stencil benchmarks (geometric mean speedup of $2.29$x in small domains and $1.53$x in large domains) and for a Krylov subspace solver, conjugate gradient (geometric mean speedup of $4.67$x on smaller SpMV datasets from SuiteSparse and $1.39$x on larger SpMV datasets).
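The traffic argument above can be illustrated with a small counting model. The following is a minimal sketch in plain Python, not GPU code and not the authors' implementation: it runs a 1D 3-point Jacobi stencil and only tallies how many cells would cross the device-memory boundary under each scheme. The names `jacobi_step`, `host_loop`, `persistent_loop`, and `cached_cells` are hypothetical, and the model deliberately ignores halo exchange between thread blocks and the cost of the device-wide barrier.

```python
def jacobi_step(u):
    """One 3-point Jacobi sweep; the two boundary cells stay fixed."""
    new = u[:]  # copy, so the input list is not modified
    for i in range(1, len(u) - 1):
        new[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return new


def host_loop(u, steps):
    """Baseline model: one kernel launch per time step.

    Every launch reads the whole domain from device memory and writes it
    back, so each step costs N loads + N stores.
    """
    traffic = 0
    for _ in range(steps):
        u = jacobi_step(u)      # kernel termination is the implicit barrier
        traffic += 2 * len(u)   # N loads + N stores
    return u, traffic


def persistent_loop(u, steps, cached_cells):
    """Persistent-kernel model: the time loop lives inside one kernel.

    `cached_cells` cells are assumed to stay in registers/shared memory
    between steps; only the remaining cells round-trip through device memory.
    """
    n = len(u)
    if not 0 <= cached_cells <= n:
        raise ValueError("cached_cells must be between 0 and len(u)")
    traffic = cached_cells      # load the cached cells once, up front
    for _ in range(steps):
        u = jacobi_step(u)      # a device-wide barrier would follow each step
        traffic += 2 * (n - cached_cells)  # uncached cells: load + store
    traffic += cached_cells     # write the cached cells back once, at the end
    return u, traffic
```

Under these assumptions, a 16-cell domain advanced for 10 steps moves 320 cells in the host-loop model and 104 cells in the persistent model with 12 cells cached, while both produce the same solution. With a single step the two counts coincide, which reflects that the saving comes only from reusing on-chip data across time steps.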