Recent research has focused on accelerating stencil computations by exploiting emerging hardware such as Tensor Cores. To leverage these accelerators, the stencil operation must be transformed into matrix multiplications. However, this transformation introduces undesired sparsity into the kernel matrix, leading to significant redundant computation. In this paper, we present SPIDER, the first system to turn this unresolved sparsity into an optimization opportunity by exploiting Sparse Tensor Cores (SpTCs) for stencil acceleration. Specifically, SPIDER introduces an efficient and elegant transformation method that integrates two cooperative techniques: an ahead-of-time strided-swapping transformation for kernel matrices and an on-the-fly row-swapping mechanism for inputs. This rule-based approach effectively transforms stencil computation into operations compatible with SpTCs, introducing only slight compile-time overhead and zero runtime overhead. Additionally, SPIDER incorporates multiple optimizations to maximize computational efficiency. Experimental evaluations demonstrate that SPIDER outperforms the vendor library cuDNN by 6.20$\times$ and state-of-the-art Tensor Core-based approaches (ConvStencil, FlashFFTStencil, etc.) by 2.00$\times$ on average.
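To make the sparsity problem and the two cooperative transformations concrete, the following is a minimal NumPy sketch for a toy 1D 3-point stencil, not SPIDER's implementation. It assumes the SpTC 2:4 structured-sparsity constraint (at most two nonzeros in every contiguous group of four elements along the reduction dimension); the stride-2 column interleave is an illustrative stand-in for the paper's general strided-swapping rule, and the helper names (`is_2to4`, `pi`, etc.) are hypothetical.

```python
import numpy as np

# Toy 1D 3-point stencil lowered to a matvec y = K @ x. K is banded, so most
# entries are zero -- the "undesired sparsity" introduced by the lowering.
n = 8
w = (1.0, -2.0, 1.0)                      # example 3-point Laplacian weights
K = np.zeros((n, n))
for i in range(n):
    for off, wk in zip((-1, 0, 1), w):
        if 0 <= i + off < n:
            K[i, i + off] = wk

def is_2to4(M):
    """Check 2:4 structured sparsity: at most 2 nonzeros per contiguous group
    of 4 entries along the reduction (column) dimension of every row."""
    return all(np.count_nonzero(M[i, g:g + 4]) <= 2
               for i in range(M.shape[0]) for g in range(0, M.shape[1], 4))

print(is_2to4(K))    # False: a row's 3 adjacent nonzeros can share one 4-group

# Illustrative stride-2 column interleave: even columns move to the first
# half, odd columns to the second half, so no 4-column group ever holds more
# than 2 of a row's adjacent nonzeros.
pi = np.array([(j % 2) * (n // 2) + j // 2 for j in range(n)])
K_sw = np.empty_like(K); K_sw[:, pi] = K  # ahead-of-time kernel transform
x = np.random.rand(n)
x_sw = np.empty_like(x); x_sw[pi] = x     # matching on-the-fly input row swap

print(is_2to4(K_sw))                      # True: now 2:4-compliant
print(np.allclose(K_sw @ x_sw, K @ x))    # True: result is unchanged
```

Because the column permutation of the kernel and the row permutation of the input cancel, the product is mathematically unchanged; the kernel side can be done once at compile time, and the input side amounts to a permuted load order, consistent with the abstract's claim of zero runtime overhead.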