Recent research has focused on accelerating stencil computations by exploiting emerging hardware such as Tensor Cores. To leverage these accelerators, the stencil operation must be transformed into matrix multiplications. However, this transformation introduces undesired sparsity into the kernel matrix, leading to significant redundant computation. In this paper, we present SPIDER, the first system to turn this unresolved sparsity into an optimization opportunity by exploiting Sparse Tensor Cores (SpTCs) for stencil acceleration. Specifically, SPIDER introduces an efficient and elegant transformation method that integrates two cooperative techniques: an ahead-of-time strided-swapping transformation for kernel matrices and an on-the-fly row-swapping mechanism for inputs. This rule-based approach effectively transforms stencil computation into operations compatible with SpTCs, introducing only slight compile-time overhead and zero runtime overhead. Additionally, SPIDER incorporates multiple optimizations to maximize computational efficiency. Experimental evaluations demonstrate that SPIDER outperforms the vendor library cuDNN by 6.20$\times$ and state-of-the-art Tensor Core-based approaches (ConvStencil, FlashFFTStencil, etc.) by 2.00$\times$ on average.
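To make the sparsity problem and the two cooperative transformations concrete, the following is a minimal NumPy sketch for a toy 1D 3-point stencil, not SPIDER's implementation. It assumes the SpTC 2:4 structured-sparsity constraint (at most two nonzeros in every contiguous group of four elements along the reduction dimension); the stride-2 column interleave is an illustrative stand-in for the paper's general strided-swapping rule, and the helper names (`is_2to4`, `pi`, etc.) are hypothetical.

```python
import numpy as np

# Toy 1D 3-point stencil lowered to a matvec y = K @ x. K is banded, so most
# entries are zero -- the "undesired sparsity" introduced by the lowering.
n = 8
w = (1.0, -2.0, 1.0)                      # example 3-point Laplacian weights
K = np.zeros((n, n))
for i in range(n):
    for off, wk in zip((-1, 0, 1), w):
        if 0 <= i + off < n:
            K[i, i + off] = wk

def is_2to4(M):
    """Check 2:4 structured sparsity: at most 2 nonzeros per contiguous group
    of 4 entries along the reduction (column) dimension of every row."""
    return all(np.count_nonzero(M[i, g:g + 4]) <= 2
               for i in range(M.shape[0]) for g in range(0, M.shape[1], 4))

print(is_2to4(K))    # False: a row's 3 adjacent nonzeros can share one 4-group

# Illustrative stride-2 column interleave: even columns move to the first
# half, odd columns to the second half, so no 4-column group ever holds more
# than 2 of a row's adjacent nonzeros.
pi = np.array([(j % 2) * (n // 2) + j // 2 for j in range(n)])
K_sw = np.empty_like(K); K_sw[:, pi] = K  # ahead-of-time kernel transform
x = np.random.rand(n)
x_sw = np.empty_like(x); x_sw[pi] = x     # matching on-the-fly input row swap

print(is_2to4(K_sw))                      # True: now 2:4-compliant
print(np.allclose(K_sw @ x_sw, K @ x))    # True: result is unchanged
```

Because the column permutation of the kernel and the row permutation of the input cancel, the product is mathematically unchanged; the kernel side can be done once at compile time, and the input side amounts to a permuted load order, consistent with the abstract's claim of zero runtime overhead.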